# **Natural Language Processing**

## **NLP** applications
- Text translation
- Speech recognition
- Natural language understanding
- Sentiment analysis
- Topic modelling
- Summarization
- Grammatical correction

**NLTK**: the Natural Language Toolkit, a Python library for working with text

In [5]:
from nltk.tokenize import word_tokenize

## **Tokenization**: the process of splitting a text into tokens (words and punctuation marks)

In [8]:
document = "Ain't no sunshine when she's gone. And she's always gone too long. Anytime she goes away."

In [10]:
tokens = word_tokenize(document)
print(tokens)

['Ai', "n't", 'no', 'sunshine', 'when', 'she', "'s", 'gone', '.', 'And', 'she', "'s", 'always', 'gone', 'too', 'long', '.', 'Anytime', 'she', 'goes', 'away', '.']
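
To see why a dedicated tokenizer is useful, here is a minimal sketch that compares two naive alternatives written with the standard library only. The regular expression is an illustrative assumption, not what `word_tokenize` does internally; note that it cuts contractions differently from the output above.

```python
import re

text = "Ain't no sunshine when she's gone."

# Naive approach: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = text.split()

# Slightly better: runs of word characters, or single punctuation marks.
# (Illustrative pattern only; it breaks "Ain't" into "Ain", "'", "t".)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting leaves `gone.` as one token, while the regex isolates the period but mangles contractions, which is the kind of case `word_tokenize` handles with its own rules.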


### Remove punctuation

In [11]:
tokens = [word for word in tokens if word.isalpha()]
print(tokens)

['Ai', 'no', 'sunshine', 'when', 'she', 'gone', 'And', 'she', 'always', 'gone', 'too', 'long', 'Anytime', 'she', 'goes', 'away']
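
Note that `str.isalpha()` removes more than punctuation: any token containing a non-letter character is dropped, which is why `n't` and `'s` disappeared above. The sketch below, on a hand-picked token list (`B2B` is an added example, not from the text), contrasts it with a looser filter that keeps any token containing at least one letter.

```python
tokens = ["Ai", "n't", "no", "sunshine", ".", "'s", "B2B"]

# Strict filter: every character must be a letter.
strict = [t for t in tokens if t.isalpha()]

# Looser filter (an alternative, not used in this notebook):
# keep a token if at least one of its characters is a letter.
loose = [t for t in tokens if any(c.isalpha() for c in t)]

print(strict)
print(loose)
```

Which filter is appropriate depends on whether contraction pieces and alphanumeric tokens matter for the task.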


## Remove stop words

Stop words are very common words (such as "no", "when" or "she") that carry little meaning on their own and are present mainly for grammatical reasons.

In [13]:
from nltk.corpus import stopwords

eng_stopwords = stopwords.words("english")

In [14]:
tokens = [word for word in tokens if word not in eng_stopwords]
print(tokens)

['Ai', 'sunshine', 'gone', 'And', 'always', 'gone', 'long', 'Anytime', 'goes', 'away']


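## Python is case sensitive

`'And' == 'and'` evaluates to `False`, so the capitalized stop word `And` was not removed above. Lowercase each word before comparing it with the stop-word list.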

In [16]:
tokens = [word for word in tokens if word.lower() not in eng_stopwords]
print(tokens)

['Ai', 'sunshine', 'gone', 'always', 'gone', 'long', 'Anytime', 'goes', 'away']
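
`stopwords.words("english")` returns a list, so every `in` test scans it from the start. A common refinement is to store the stop words in a `set`. The sketch below uses a small hypothetical stop-word set (an assumption standing in for the NLTK list) so that it runs without any download.

```python
# Hypothetical subset of English stop words, stored as a set for fast lookups.
STOP_WORDS = {"no", "when", "she", "and", "too"}

tokens = ["Ai", "no", "sunshine", "when", "she", "gone", "And", "she",
          "always", "gone", "too", "long", "Anytime", "she", "goes", "away"]

# Case-sensitive comparison: "And" slips through.
case_sensitive = [w for w in tokens if w not in STOP_WORDS]

# Lowercase only for the comparison; the original casing is kept in the result.
case_insensitive = [w for w in tokens if w.lower() not in STOP_WORDS]

print(case_sensitive)
print(case_insensitive)
```

With the real list, the same idea would be `eng_stopwords = set(stopwords.words("english"))`.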


## **Stemming** and **Lemmatization**

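Stemming and lemmatization both reduce a word to a base form, so that variants of the same word are counted together.
- Stemming: strips suffixes with heuristic rules; it is fast, but the result is not always a real word and unrelated words can end up with the same stem
- Lemmatization: uses a vocabulary and the word's part of speech to return its dictionary form (the lemma)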
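
The difference between the two approaches can be sketched with toy versions. Both functions below are deliberately simplified illustrations: `toy_stem` is not the Porter algorithm, and the `LEMMAS` table is a hypothetical stand-in for a real lexicon such as WordNet.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ("ness", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary keyed by (word, part of speech).
LEMMAS = {("is", "v"): "be", ("was", "v"): "be", ("mice", "n"): "mouse"}

def toy_lemmatize(word, pos):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_stem("meaning"), toy_stem("meanness"), toy_stem("is"))
print(toy_lemmatize("meaning", "n"), toy_lemmatize("is", "v"))
```

The stemmer only looks at the characters of the word, while the lemmatizer needs outside knowledge (a dictionary and a part of speech) to map `is` to `be`.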

In [18]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
print(stemmer.stem("connection"), stemmer.stem("connected"), stemmer.stem("connective"))

connect connect connect


In [None]:
# Stemming can collapse words with different meanings into the same stem
print(stemmer.stem("meaning"), stemmer.stem("meanness"), stemmer.stem("is"))
# The lemmatizer keeps them distinct when it is given the part of speech ("n" = noun, "v" = verb)
print(lemmatizer.lemmatize("meaning", "n"), lemmatizer.lemmatize("meanness", "n"), lemmatizer.lemmatize("is", "v"))


mean mean is
meaning meanness be


**POS** tagging: part-of-speech tagging labels each token with its grammatical category (noun, verb, adjective, ...)
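
POS tags matter here because the lemmatizer calls above take a part of speech (`"n"`, `"v"`). NLTK's `nltk.pos_tag` returns Penn Treebank tags (such as `VBZ` or `NN`), which have to be converted before being passed to `WordNetLemmatizer`. The helper below is a commonly used sketch of that conversion; `to_wordnet_pos` is a hypothetical name, and the tagged tokens are written by hand rather than produced by the tagger, so the example runs without downloading a model.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to the one-letter POS WordNet expects."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, like the lemmatizer itself

# Hand-written (token, tag) pairs in the format nltk.pos_tag returns.
tagged = [("she", "PRP"), ("is", "VBZ"), ("always", "RB"), ("gone", "VBN")]

wordnet_tags = [(token, to_wordnet_pos(tag)) for token, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed on as `lemmatizer.lemmatize(token, pos)`.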

## **Ngrams**

In [24]:
import nltk 

In [30]:
# Bigrams: sequences of 2 consecutive tokens
tokens = word_tokenize(document)
list(nltk.bigrams(tokens))

[('Ai', "n't"),
 ("n't", 'no'),
 ('no', 'sunshine'),
 ('sunshine', 'when'),
 ('when', 'she'),
 ('she', "'s"),
 ("'s", 'gone'),
 ('gone', '.'),
 ('.', 'And'),
 ('And', 'she'),
 ('she', "'s"),
 ("'s", 'always'),
 ('always', 'gone'),
 ('gone', 'too'),
 ('too', 'long'),
 ('long', '.'),
 ('.', 'Anytime'),
 ('Anytime', 'she'),
 ('she', 'goes'),
 ('goes', 'away'),
 ('away', '.')]

In [31]:
# Trigrams: sequences of 3 consecutive tokens
list(nltk.trigrams(tokens))

[('Ai', "n't", 'no'),
 ("n't", 'no', 'sunshine'),
 ('no', 'sunshine', 'when'),
 ('sunshine', 'when', 'she'),
 ('when', 'she', "'s"),
 ('she', "'s", 'gone'),
 ("'s", 'gone', '.'),
 ('gone', '.', 'And'),
 ('.', 'And', 'she'),
 ('And', 'she', "'s"),
 ('she', "'s", 'always'),
 ("'s", 'always', 'gone'),
 ('always', 'gone', 'too'),
 ('gone', 'too', 'long'),
 ('too', 'long', '.'),
 ('long', '.', 'Anytime'),
 ('.', 'Anytime', 'she'),
 ('Anytime', 'she', 'goes'),
 ('she', 'goes', 'away'),
 ('goes', 'away', '.')]

In [32]:
# n-grams: sequences of n consecutive tokens (here n = 5)
list(nltk.ngrams(tokens, 5))

[('Ai', "n't", 'no', 'sunshine', 'when'),
 ("n't", 'no', 'sunshine', 'when', 'she'),
 ('no', 'sunshine', 'when', 'she', "'s"),
 ('sunshine', 'when', 'she', "'s", 'gone'),
 ('when', 'she', "'s", 'gone', '.'),
 ('she', "'s", 'gone', '.', 'And'),
 ("'s", 'gone', '.', 'And', 'she'),
 ('gone', '.', 'And', 'she', "'s"),
 ('.', 'And', 'she', "'s", 'always'),
 ('And', 'she', "'s", 'always', 'gone'),
 ('she', "'s", 'always', 'gone', 'too'),
 ("'s", 'always', 'gone', 'too', 'long'),
 ('always', 'gone', 'too', 'long', '.'),
 ('gone', 'too', 'long', '.', 'Anytime'),
 ('too', 'long', '.', 'Anytime', 'she'),
 ('long', '.', 'Anytime', 'she', 'goes'),
 ('.', 'Anytime', 'she', 'goes', 'away'),
 ('Anytime', 'she', 'goes', 'away', '.')]
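
An n-gram is a sliding window of n consecutive tokens, so a list of k tokens yields k - n + 1 n-grams (22 tokens give the 21 bigrams, 20 trigrams and 18 5-grams listed above). As a rough sketch of the idea, the same result can be built in plain Python; `simple_ngrams` is a hypothetical helper, not an NLTK function.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zip the list with copies of itself shifted by 1, 2, ..., n - 1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["no", "sunshine", "when", "she"]

print(simple_ngrams(tokens, 2))
print(simple_ngrams(tokens, 3))
```

`zip` stops at the shortest shifted copy, so asking for an n larger than the number of tokens simply returns an empty list.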

## **BOW** : Bag of Words

A bag of words (BOW) is a vector that records how many times each word occurs in a text.

⚠️ It does not keep any information about the grammar or the order of the words in a sentence.

### Example

For the text "Nicolas loves to watch Disney movies but everybody loves Disney movies. Pierre loves football, unlike Nicolas.", the BOW would be:

BOW = {Nicolas: 2, loves: 3, to: 1, watch: 1, Disney: 2, movies: 2,
       but: 1, everybody: 1, Pierre: 1, football: 1, unlike: 1}
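
These counts can be reproduced with the standard library. The sketch below assumes that a word is any run of ASCII letters, which is a simplification chosen for this example.

```python
import re
from collections import Counter

text = ("Nicolas loves to watch Disney movies but everybody loves "
        "Disney movies. Pierre loves football, unlike Nicolas.")

# Keep alphabetic runs only, so punctuation is ignored (simplifying assumption).
words = re.findall(r"[A-Za-z]+", text)

# Counter maps each distinct word to its number of occurrences.
bow = Counter(words)
print(dict(bow))
```

The counter stores 11 distinct words for 16 tokens, and nothing in it records the order in which the words appeared.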

In [36]:
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(max_features=1000, stop_words="english")

In [43]:
document = ["Nicolas loves to watch Disney movies but everybody loves Disney movies.", "Helene loves football, unlike Nicolas."]
# Build the BOW; CountVectorizer also lowercases the text and drops punctuation and stop words (stop_words="english")
bow = countVectorizer.fit_transform(document).toarray()
bow

array([[2, 1, 0, 0, 2, 2, 1, 0, 1],
       [0, 0, 1, 1, 1, 0, 1, 1, 0]])
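
To make the matrix above less opaque, here is a hand-rolled sketch of the same steps: lowercase, tokenize, drop stop words, sort the vocabulary, then count. It is an illustration rather than scikit-learn's implementation; the regular expression mirrors the default `token_pattern` (words of two or more characters), and `STOP_WORDS` is a hypothetical two-word stand-in for the built-in English list.

```python
import re

documents = [
    "Nicolas loves to watch Disney movies but everybody loves Disney movies.",
    "Helene loves football, unlike Nicolas.",
]

STOP_WORDS = {"to", "but"}  # hypothetical stand-in for the English stop-word list

def tokenize(text):
    """Lowercase, keep words of 2+ characters, drop stop words."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return [w for w in words if w not in STOP_WORDS]

tokenized = [tokenize(doc) for doc in documents]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one count per vocabulary word.
matrix = [[doc.count(word) for word in vocabulary] for doc in tokenized]

print(vocabulary)
print(matrix)
```

Each row is the bag of words of one document, expressed over the vocabulary shared by all documents.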

In [44]:
import pandas as pd

In [None]:
# Get the word associated with each column of the BOW matrix
tokens = countVectorizer.get_feature_names_out()
tokens

array(['disney', 'everybody', 'football', 'helene', 'loves', 'movies',
       'nicolas', 'unlike', 'watch'], dtype=object)

In [None]:
pd.DataFrame(data=bow, columns=tokens)

|   | disney | everybody | football | helene | loves | movies | nicolas | unlike | watch |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | 2 | 2 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
