Based on the article from https://www.kdnuggets.com/2020/01/intro-guide-nlp-data-scientists.html

# 1. Tokenization

Tokenization is the process of splitting text into sentences or words. This can be done with nltk, but first let's download what we need...

In [1]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mikael\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
import nltk

sentence = "My name is Mike and I love NLP, I live in New Orleans. New York is a beautiful place!"
tokens = nltk.word_tokenize(sentence)
print(tokens)

['My', 'name', 'is', 'Mike', 'and', 'I', 'love', 'NLP', ',', 'I', 'live', 'in', 'New', 'Orleans', '.', 'New', 'York', 'is', 'a', 'beautiful', 'place', '!']


This splits the text into words and punctuation characters.
Note: "New York", the city, was tokenized as two separate words; the tokenizer has no notion of multi-word names...
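
One way to keep "New York" together is to merge known multi-word expressions after tokenizing. The sketch below uses nltk's `MWETokenizer`. The list of expressions is hand-written for this example (an assumption: you have to supply it yourself), and the tokens are copied from the output above so the snippet does not depend on the punkt download.

```python
from nltk.tokenize import MWETokenizer

# Tokens as produced by nltk.word_tokenize on the English sentence
tokens = ['My', 'name', 'is', 'Mike', 'and', 'I', 'love', 'NLP', ',', 'I',
          'live', 'in', 'New', 'Orleans', '.', 'New', 'York', 'is', 'a',
          'beautiful', 'place', '!']

# Hand-written list of multi-word expressions to keep together (illustrative)
mwe_tokenizer = MWETokenizer([('New', 'York'), ('New', 'Orleans')], separator=' ')

# Adjacent tokens matching a listed expression are joined with the separator
merged_tokens = mwe_tokenizer.tokenize(tokens)
print(merged_tokens)
```

This only works for expressions listed in advance; finding such names automatically is a different problem (named entity recognition).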

It gets interesting if we do the same in French...

In [9]:
sentence = "Mon nom est Mike et j'aime le NLP, j'habite a Paris mais j'aime aussi beaucoup New-York!"
tokens = nltk.word_tokenize(sentence)
print(tokens)

['Mon', 'nom', 'est', 'Mike', 'et', "j'aime", 'le', 'NLP', ',', "j'habite", 'a', 'Paris', 'mais', "j'aime", 'aussi', 'beaucoup', 'New-York', '!']


Interestingly, this time "New-York" was written with a hyphen and was not split into two words. On the other hand, "j'aime" should have been split into two tokens ("j'" and "aime") but was kept as a single one!

__TODO: check how to process non-English texts...__
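
As a possible starting point for that TODO, here is a minimal regex-based sketch that separates French elisions such as "j'" from the following word while keeping hyphenated words like "New-York" whole. The pattern and the function name `simple_french_tokenize` are made up for this example; this is not what nltk does internally.

```python
import re

# Order matters: try "word + apostrophe" first so elided forms are cut off,
# then hyphenated or plain words, then any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+'|\w+(?:-\w+)*|[^\w\s]")

def simple_french_tokenize(text):
    """Split text into tokens, separating elisions such as j' from the next word."""
    return TOKEN_PATTERN.findall(text)

sentence = "Mon nom est Mike et j'aime le NLP, j'habite a Paris mais j'aime aussi beaucoup New-York!"
french_tokens = simple_french_tokenize(sentence)
print(french_tokens)
```

The rule is deliberately naive: it also cuts words that legitimately contain an apostrophe (for example "aujourd'hui"), so a real pipeline would need an exception list or a tokenizer built for French.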

# 2. Stop word removal
This step removes noise from the text by dropping very common words such as "the", "a", "and"... In the output below, the punctuation is still present, but "is", "and", "in" and "a" were removed because they are defined as stop words. "My" and "I" are still there because the comparison is case-sensitive and the nltk stop word list is all lowercase.


In [6]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

sentence = "My name is Mike and I love NLP, I live in New Orleans. New York is a beautiful place!"
tokens = nltk.word_tokenize(sentence)

stop_words = stopwords.words('english')
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)

['My', 'name', 'Mike', 'I', 'love', 'NLP', ',', 'I', 'live', 'New', 'Orleans', '.', 'New', 'York', 'beautiful', 'place', '!']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mikael\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
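
The filter above leaves "My", "I" and the punctuation in place. A minimal sketch of a stricter filter is shown below: it compares tokens in lowercase and optionally drops punctuation-only tokens. The function name `remove_stop_words` and the short `demo_stop_words` list are invented for this example so that it runs without any download; in the notebook you would pass `stopwords.words('english')` instead.

```python
import string

def remove_stop_words(tokens, stop_words, drop_punctuation=True):
    """Filter tokens case-insensitively; optionally drop punctuation-only tokens."""
    stop_set = {w.lower() for w in stop_words}  # set lookup is faster than a list
    kept = []
    for token in tokens:
        if token.lower() in stop_set:
            continue
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

# Small hand-picked stop word list, for illustration only
demo_stop_words = ["i", "my", "is", "and", "in", "a"]

tokens = ['My', 'name', 'is', 'Mike', 'and', 'I', 'love', 'NLP', ',', 'I',
          'live', 'in', 'New', 'Orleans', '.', 'New', 'York', 'is', 'a',
          'beautiful', 'place', '!']

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Whether lowercasing is appropriate depends on the task: it also removes capitalized words that happen to match a stop word.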


Let's try with the French sentence and the French stop word list...

In [8]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

sentence = "Mon nom est Mike et j'aime le NLP, j'habite a Paris mais j'aime aussi beaucoup New-York!"
tokens = nltk.word_tokenize(sentence)

stop_words = stopwords.words('french')
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)

['Mon', 'nom', 'Mike', "j'aime", 'NLP', ',', "j'habite", 'a', 'Paris', "j'aime", 'aussi', 'beaucoup', 'New-York', '!']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mikael\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


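So "est", "et", "le" and "mais" were removed, but it was also expected that the "a" indicating location would be removed and that "j'" would stand alone... Note that the location preposition is normally spelled "à"; the sentence uses the unaccented "a", which is a different string and may be why it was not matched.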


# 3. Stemming
Stemming is the process of reducing words to their root. For example, eat, eats, eating and ate all share the root eat. Stemming simplifies the analysis of the text, since it reduces the number of word variations.


In [15]:
import nltk

snowball_stemmer = nltk.stem.SnowballStemmer('english')

words = ("eat", "eats", "eating", "eaten", "ate", "cook", "cooked", "cooking")

# Print each word next to its Snowball stem
for w in words:
    print(f"{w} --> {snowball_stemmer.stem(w)}")

eat --> eat
eats --> eat
eating --> eat
eaten --> eaten
ate --> ate
cook --> cook
cooked --> cook
cooking --> cook


Now the question: why were 'eaten' and 'ate' not recognized? The problem seems to be that __stemming__ relies on an algorithm (rule-based suffix stripping), so irregular forms of a word are not captured correctly.
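
To make this concrete, here is a toy suffix-stripping stemmer. It is NOT the Snowball algorithm, only a three-rule sketch written for this note, but it shows the same limitation: no suffix rule can turn "ate" or "eaten" into "eat". The `IRREGULAR_FORMS` table is a hypothetical hand-made lookup, roughly the idea behind dictionary-based approaches such as lemmatization.

```python
# Toy suffix-stripping stemmer (NOT the Snowball algorithm), for illustration
SUFFIXES = ("ing", "ed", "s")

# Hypothetical lookup table for irregular forms that no suffix rule can reach
IRREGULAR_FORMS = {"ate": "eat", "eaten": "eat"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(word):
    """Check the irregular-form table first, then fall back to suffix rules."""
    return IRREGULAR_FORMS.get(word, toy_stem(word))

for w in ("eat", "eats", "eating", "eaten", "ate", "cook", "cooked", "cooking"):
    print(f"{w} --> stem: {toy_stem(w)}, with lookup: {normalize(w)}")
```

The suffix rules alone leave "eaten" and "ate" unchanged, as Snowball did above; only the lookup table maps them to "eat".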

# 4. Word Embedding

Word embedding converts words into a representation that can be processed numerically (e.g. a vector of numerical values). But numerical values alone are not enough (one-hot encoding, for example, is not really useful here): we need the representation to capture the similarity between words. Two words are considered similar if their numerical representations are close to each other.
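
A small sketch of what "close to each other" can mean, using cosine similarity as the measure (one common choice, assumed here). The dense vectors are made up by hand for illustration and do not come from a trained model. The point is only the contrast with one-hot vectors, where every pair of distinct words gets the same similarity of zero.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot encoding: every pair of distinct words is equally dissimilar
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "car": [0, 0, 1]}

# Made-up dense vectors, for illustration only (not from a trained model)
toy_embedding = {"cat": [0.9, 0.8, 0.1], "dog": [0.8, 0.9, 0.2], "car": [0.1, 0.2, 0.9]}

print("one-hot cat/dog:", cosine_similarity(one_hot["cat"], one_hot["dog"]))
print("one-hot cat/car:", cosine_similarity(one_hot["cat"], one_hot["car"]))
print("dense   cat/dog:", cosine_similarity(toy_embedding["cat"], toy_embedding["dog"]))
print("dense   cat/car:", cosine_similarity(toy_embedding["cat"], toy_embedding["car"]))
```

With the dense vectors, "cat" scores much closer to "dog" than to "car", which is the property an embedding is supposed to provide.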
