Natural Language Processing

In [61]:
## STEP 1: Sentence Tokenization

import nltk
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.

It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.



In [62]:
## STEP 2: Word Tokenization

for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']

['It', 'is', 'a', 'two', 'player', 'game', 'where', 'each', 'player', 'has', 'fifteen', 'checkers', 'which', 'move', 'between', 'twenty-four', 'points', 'according', 'to', 'the', 'roll', 'of', 'two', 'dice', '.']



In [63]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word, and pos (part of speech).
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word="seen", pos=wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word="drove", pos=wordnet.VERB)

Stemmer: seen
Lemmatizer: see

Stemmer: drove
Lemmatizer: drive
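
The output above shows the stemmer leaving "seen" and "drove" unchanged while the lemmatizer maps them to "see" and "drive". A rough way to see why: a stemmer applies suffix-stripping rules, whereas a lemmatizer consults a vocabulary. The sketch below is a toy illustration only. `SUFFIXES`, `IRREGULAR_VERBS`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented here, and the logic is far simpler than the Porter algorithm or WordNet.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

# Hypothetical irregular-verb table standing in for a real dictionary.
IRREGULAR_VERBS = {"seen": "see", "saw": "see", "drove": "drive", "driven": "drive"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary first, then fall back to the rules."""
    return IRREGULAR_VERBS.get(word, toy_stem(word))


for word in ["seen", "drove", "walked", "playing"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

Irregular forms carry no suffix for a rule to strip, so only the dictionary lookup can recover their base form.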



In [64]:
# If you are using the stop word list for the first time, download it first:
# nltk.download("stopwords")

from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [65]:

stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
print(words)
# Filter out the stop words with a list comprehension.
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']


In [66]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."
words = nltk.word_tokenize(sentence)
print(words)

without_stop_words = []
for word in words:
    if word not in stop_words:
        without_stop_words.append(word)

print(without_stop_words)

['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
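
One thing to watch for: the entries in the stop word list are all lowercase, so a capitalized stop word such as "The" at the start of a sentence passes through a plain membership test. A minimal sketch of a case-insensitive filter follows. The `stop_words` set here is a small hard-coded subset and the token list is made up, both assumptions so that the cell runs without NLTK data.

```python
# Hypothetical subset of the English stop word list, hard-coded so the
# sketch runs without downloading the NLTK corpus.
stop_words = {"the", "is", "of", "a", "it", "in"}

# Made-up, pre-tokenized input; in the notebook this would come from
# nltk.word_tokenize.
words = ["The", "game", "is", "played", "in", "the", "Middle", "East", "."]

# Plain membership test: "The" survives because only "the" is listed.
case_sensitive = [w for w in words if w not in stop_words]

# Lowercase each token for the comparison only; the kept tokens are unchanged.
case_insensitive = [w for w in words if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```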


### Regular expressions

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Here are some basics:

1. . - match any character except a newline
2. \w - match a word character (letter, digit, or underscore)
3. \d - match a digit
4. \s - match a whitespace character
5. \W - match a non-word character
6. \D - match a non-digit character
7. \S - match a non-whitespace character
8. [abc] - match any one of a, b, or c
9. [^abc] - match any character except a, b, or c
10. [a-g] - match any character from a through g
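
To make the list concrete, here is a small sketch that tries several of these patterns with `re.findall`. The `sample` string is made up for illustration.

```python
import re

sample = "Room 42, bag_7!"  # made-up example text

digits = re.findall(r"\d", sample)            # every single digit
word_runs = re.findall(r"\w+", sample)        # runs of word characters
spaces = re.findall(r"\s", sample)            # every whitespace character
abc_chars = re.findall(r"[abc]", sample)      # any of a, b, or c
punctuation = re.findall(r"[^\w\s]", sample)  # neither word character nor whitespace
a_to_g = re.findall(r"[a-g]", sample)         # lowercase letters a through g

for name, matches in [
    ("digits", digits),
    ("word_runs", word_runs),
    ("spaces", spaces),
    ("abc_chars", abc_chars),
    ("punctuation", punctuation),
    ("a_to_g", a_to_g),
]:
    print(name, matches)
```

Note that the underscore counts as a word character, so "bag_7" is matched as one run.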



In [67]:
import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))

The development of snowboarding was inspired by skateboarding  sledding  surfing and skiing 
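
The result above contains double spaces and a trailing space, because every non-word character, including the space that follows each comma, is replaced one at a time. One possible variation, sketched here, is to match runs of non-word characters with `\W+` (equivalent to `[^\w]+`) and strip the ends. The variable name `cleaned` is just for illustration.

```python
import re

sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."

# Replace each run of non-word characters with one space, then trim the ends.
cleaned = re.sub(r"\W+", " ", sentence).strip()
print(cleaned)
```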


### Summary 

You have learned the basics of NLP for text. More specifically, you have learned the following concepts:

1. NLP is used to apply machine learning algorithms to text and speech.
2. NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
3. Sentence tokenization is the problem of dividing a string of written language into its component sentences.
4. Word tokenization is the problem of dividing a string of written language into its component words.
5. The goal of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.
6. Stop words are words that are filtered out before or after processing of text. They are usually the most common words in a language.
7. A regular expression is a sequence of characters that defines a search pattern.
8. The bag-of-words model is a popular and simple feature extraction technique for text. It describes the occurrence of each word within a document.
9. TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.
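
The last two items can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the `documents` list is made up, `tf_idf` is a hypothetical helper, and the formula used (term frequency times the natural log of the inverse document frequency) is only one of several common TF-IDF variants.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
documents = [
    ["backgammon", "is", "a", "board", "game"],
    ["chess", "is", "a", "board", "game"],
    ["snowboarding", "is", "a", "sport"],
]

# Bag of words: one word-count mapping per document.
bags = [Counter(doc) for doc in documents]

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in documents for word in set(doc))


def tf_idf(word, bag):
    """Term frequency times log inverse document frequency (one common variant).

    Only defined for words that occur somewhere in `documents`.
    """
    tf = bag[word] / sum(bag.values())
    idf = math.log(len(documents) / doc_freq[word])
    return tf * idf


print(bags[0])
for word in ["backgammon", "board", "is"]:
    print(word, round(tf_idf(word, bags[0]), 3))
```

A word that appears in every document, such as "is", scores zero, while a word unique to one document scores highest.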

