# NLP - Tokenization       

By [Leonardo Tozo](https://www.linkedin.com/in/leotozo/)

****************************
Hello,
<br>This notebook is part of my personal portfolio. My intention with this series of notebooks is to keep practicing and improving my AI and machine learning skills.
 
*Leonardo Tozo Bisinoto*
<br>*MBA in Artificial Intelligence & Machine Learning*
<br>*LinkedIn: https://www.linkedin.com/in/leotozo/*
<br>*Github: https://github.com/leotozo*
**************************** 

This analysis uses the IMDB reviews dataset. I will apply basic NLP techniques: text cleaning, tokenization, stemming, and lemmatization.

In [3]:
import pandas as pd
import numpy as np

# Reading the IMDB dataset.

In [5]:
df = pd.read_csv(
    './imdb-dataset.csv',encoding='utf-8'
).sample(1000)
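
Because `.sample(1000)` draws rows at random, the reviews displayed in this notebook change on every run. A minimal sketch of one way to make the draw reproducible, assuming a fixed seed is acceptable; the tiny `demo` frame and the seed value `42` are illustrative stand-ins, not part of the original notebook:

```python
import pandas as pd

# Hypothetical stand-in for the IMDB frame (the real one is read from CSV).
demo = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# Passing random_state fixes the random draw, so repeated calls pick the same rows.
sample_a = demo.sample(3, random_state=42)
sample_b = demo.sample(3, random_state=42)

print(sample_a.equals(sample_b))
```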

In [6]:
df.head()

| | review | sentiment |
|---|---|---|
| 49338 | `What's to like about this movie???<br /><br />...` | negative |
| 43748 | `Having watched the show for about four weeks, ...` | positive |
| 42998 | `Not a bad word to say about this film really. ...` | positive |
| 31133 | `I don't know why, but for some sick reason, I ...` | negative |
| 13154 | `this movie is the worst EVER!!! sorry but this...` | negative |


In [9]:
text = df["review"].iloc[0]

In [17]:
# Replace every run of non-alphanumeric characters with a single space.
df["review"] = df["review"].replace(r"[^a-zA-Z0-9]+", " ", regex=True)

In [18]:
df["review"].iloc[0]

'What s to like about this movie br br It is in colour br br It has some impressive underwater photography br br It has a rhythmic musical score in the background that works well at times br br So 3 out of 10 br br Sometimes the music is speeded up Especially when the shark or the baddies are about to move in br br Sometimes it is slowed As if to convey to the audience it s about to be time for sympathy br br As another one bites the dust As if in a spagetti Western this has much similarity to br br It s not that the Italians can t produce quality productions There was a series of TV movies with a heading like Octopus numbered about 1 to 7 screened on SBS TV in Australia in the 1990s about mafia type conflicts And they were excellent But alas you won t find it here br br I assumed it was made about 1960s Sadly it was 20 years out of date as evidenced by a funeral scene near the end br br Then there was the razor sharp bite of the speedy shark that makes for a red dust repeatedly emergi
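
The cleaned review still contains stray `br` tokens: the regex removes the angle brackets and slash of each `<br />` tag but keeps the letters. A minimal sketch of stripping tags before the punctuation pass, using only the standard `re` module. The helper name `clean_review`, the tag pattern, and the sample string (modeled on the first review) are assumptions for illustration, and a simple pattern like this is not a full HTML parser:

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]+")   # same character class as the notebook


def clean_review(text: str) -> str:
    """Hypothetical helper: drop HTML tags, then replace punctuation runs with a space."""
    without_tags = TAG_PATTERN.sub(" ", text)
    # strip() also trims the leading/trailing space the substitution can leave behind.
    return NON_ALNUM_PATTERN.sub(" ", without_tags).strip()


raw = "What's to like about this movie???<br /><br />It is in colour."
print(clean_review(raw))  # What s to like about this movie It is in colour
```

On the DataFrame, the same helper could be applied with `df["review"].map(clean_review)`.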

In [19]:
import nltk

In [20]:
# word_tokenize relies on NLTK's Punkt data, which must be downloaded once with nltk.download().
wt = nltk.word_tokenize(df["review"].iloc[0])

np.array(wt)

array(['What', 's', 'to', 'like', 'about', 'this', 'movie', 'br', 'br',
       'It', 'is', 'in', 'colour', 'br', 'br', 'It', 'has', 'some',
       'impressive', 'underwater', 'photography', 'br', 'br', 'It', 'has',
       'a', 'rhythmic', 'musical', 'score', 'in', 'the', 'background',
       'that', 'works', 'well', 'at', 'times', 'br', 'br', 'So', '3',
       'out', 'of', '10', 'br', 'br', 'Sometimes', 'the', 'music', 'is',
       'speeded', 'up', 'Especially', 'when', 'the', 'shark', 'or', 'the',
       'baddies', 'are', 'about', 'to', 'move', 'in', 'br', 'br',
       'Sometimes', 'it', 'is', 'slowed', 'As', 'if', 'to', 'convey',
       'to', 'the', 'audience', 'it', 's', 'about', 'to', 'be', 'time',
       'for', 'sympathy', 'br', 'br', 'As', 'another', 'one', 'bites',
       'the', 'dust', 'As', 'if', 'in', 'a', 'spagetti', 'Western',
       'this', 'has', 'much', 'similarity', 'to', 'br', 'br', 'It', 's',
       'not', 'that', 'the', 'Italians', 'can', 't', 'produce', 'quality',
 

# Tokenizing on whitespace

In [22]:
ws = nltk.tokenize.WhitespaceTokenizer().tokenize(df["review"].iloc[0])

np.array(ws)

array(['What', 's', 'to', 'like', 'about', 'this', 'movie', 'br', 'br',
       'It', 'is', 'in', 'colour', 'br', 'br', 'It', 'has', 'some',
       'impressive', 'underwater', 'photography', 'br', 'br', 'It', 'has',
       'a', 'rhythmic', 'musical', 'score', 'in', 'the', 'background',
       'that', 'works', 'well', 'at', 'times', 'br', 'br', 'So', '3',
       'out', 'of', '10', 'br', 'br', 'Sometimes', 'the', 'music', 'is',
       'speeded', 'up', 'Especially', 'when', 'the', 'shark', 'or', 'the',
       'baddies', 'are', 'about', 'to', 'move', 'in', 'br', 'br',
       'Sometimes', 'it', 'is', 'slowed', 'As', 'if', 'to', 'convey',
       'to', 'the', 'audience', 'it', 's', 'about', 'to', 'be', 'time',
       'for', 'sympathy', 'br', 'br', 'As', 'another', 'one', 'bites',
       'the', 'dust', 'As', 'if', 'in', 'a', 'spagetti', 'Western',
       'this', 'has', 'much', 'similarity', 'to', 'br', 'br', 'It', 's',
       'not', 'that', 'the', 'Italians', 'can', 't', 'produce', 'quality',
 

# Tokenizing on punctuation

In [24]:
punct = nltk.tokenize.WordPunctTokenizer().tokenize(df["review"].iloc[0])

np.array(punct)


array(['What', 's', 'to', 'like', 'about', 'this', 'movie', 'br', 'br',
       'It', 'is', 'in', 'colour', 'br', 'br', 'It', 'has', 'some',
       'impressive', 'underwater', 'photography', 'br', 'br', 'It', 'has',
       'a', 'rhythmic', 'musical', 'score', 'in', 'the', 'background',
       'that', 'works', 'well', 'at', 'times', 'br', 'br', 'So', '3',
       'out', 'of', '10', 'br', 'br', 'Sometimes', 'the', 'music', 'is',
       'speeded', 'up', 'Especially', 'when', 'the', 'shark', 'or', 'the',
       'baddies', 'are', 'about', 'to', 'move', 'in', 'br', 'br',
       'Sometimes', 'it', 'is', 'slowed', 'As', 'if', 'to', 'convey',
       'to', 'the', 'audience', 'it', 's', 'about', 'to', 'be', 'time',
       'for', 'sympathy', 'br', 'br', 'As', 'another', 'one', 'bites',
       'the', 'dust', 'As', 'if', 'in', 'a', 'spagetti', 'Western',
       'this', 'has', 'much', 'similarity', 'to', 'br', 'br', 'It', 's',
       'not', 'that', 'the', 'Italians', 'can', 't', 'produce', 'quality',
 

# Tokenizing using grammar rules

In [25]:
gr = nltk.tokenize.TreebankWordTokenizer().tokenize(df["review"].iloc[0])

np.array(gr)

array(['What', 's', 'to', 'like', 'about', 'this', 'movie', 'br', 'br',
       'It', 'is', 'in', 'colour', 'br', 'br', 'It', 'has', 'some',
       'impressive', 'underwater', 'photography', 'br', 'br', 'It', 'has',
       'a', 'rhythmic', 'musical', 'score', 'in', 'the', 'background',
       'that', 'works', 'well', 'at', 'times', 'br', 'br', 'So', '3',
       'out', 'of', '10', 'br', 'br', 'Sometimes', 'the', 'music', 'is',
       'speeded', 'up', 'Especially', 'when', 'the', 'shark', 'or', 'the',
       'baddies', 'are', 'about', 'to', 'move', 'in', 'br', 'br',
       'Sometimes', 'it', 'is', 'slowed', 'As', 'if', 'to', 'convey',
       'to', 'the', 'audience', 'it', 's', 'about', 'to', 'be', 'time',
       'for', 'sympathy', 'br', 'br', 'As', 'another', 'one', 'bites',
       'the', 'dust', 'As', 'if', 'in', 'a', 'spagetti', 'Western',
       'this', 'has', 'much', 'similarity', 'to', 'br', 'br', 'It', 's',
       'not', 'that', 'the', 'Italians', 'can', 't', 'produce', 'quality',
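
The truncated outputs shown for the whitespace, punctuation, and Treebank tokenizers look identical, most likely because the review was already stripped of punctuation before tokenizing. A small sketch of how the three tokenizers differ on raw text; the sample sentence is made up for illustration, and none of these three classes needs downloaded NLTK data:

```python
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer, WordPunctTokenizer

sentence = "The Italians can't produce it."  # illustrative sentence, not from the dataset

# Splits on whitespace only, so punctuation stays attached to words.
ws_tokens = WhitespaceTokenizer().tokenize(sentence)

# Splits into runs of word characters and runs of punctuation.
punct_tokens = WordPunctTokenizer().tokenize(sentence)

# Applies Penn Treebank conventions, e.g. splitting contractions such as "can't".
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(ws_tokens)        # ['The', 'Italians', "can't", 'produce', 'it.']
print(punct_tokens)     # ['The', 'Italians', 'can', "'", 't', 'produce', 'it', '.']
print(treebank_tokens)  # ['The', 'Italians', 'ca', "n't", 'produce', 'it', '.']
```

Which behavior is preferable depends on whether contractions and punctuation carry signal for the downstream task.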
 

# STEMMING

In [27]:
# Original words
words = nltk.tokenize.WhitespaceTokenizer().tokenize(df["review"].iloc[0])
stemm = pd.DataFrame()
stemm['OriginalWords'] = pd.Series(words)

# Porter stemmer (create the stemmer once and reuse it for every word)
porter = nltk.stem.PorterStemmer()
porterStemmedWords = [porter.stem(word) for word in words]
stemm['PorterStemmedWords'] = pd.Series(porterStemmedWords)

# Snowball stemmer for English
snowball = nltk.stem.SnowballStemmer("english")
snowballStemmedWords = [snowball.stem(word) for word in words]
stemm['SnowballStemmedWords'] = pd.Series(snowballStemmedWords)
stemm

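| | OriginalWords | PorterStemmedWords | SnowballStemmedWords |
|---|---|---|---|
| 0 | What | what | what |
| 1 | s | s | s |
| 2 | to | to | to |
| 3 | like | like | like |
| 4 | about | about | about |
| ... | ... | ... | ... |
| 410 | reality | realiti | realiti |
| 411 | when | when | when |
| 412 | this | thi | this |
| 413 | is | is | is |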
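
The table shows the two stemmers disagreeing on `this` (`thi` versus `this`), and both lowercasing `What`. A minimal sketch that keeps only the words on which Porter and Snowball differ, creating each stemmer once; the word list is a small sample taken from the rows shown in the table, and the variable names are illustrative:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["What", "reality", "when", "this", "is"]  # sample rows from the table

# Keep only the words where the two stemmers disagree.
differences = {
    word: (porter.stem(word), snowball.stem(word))
    for word in words
    if porter.stem(word) != snowball.stem(word)
}

print(differences)  # {'this': ('thi', 'this')}
```

Running the same comparison over the full `words` list from the review would show how often the two algorithms actually diverge on this text.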


# LEMMATIZATION

In [28]:
words = nltk.tokenize.WhitespaceTokenizer().tokenize(df["review"].iloc[0])
lemm = pd.DataFrame()
lemm['OriginalWords'] = pd.Series(words)
# WordNet lemmatization (needs the WordNet corpus; lemmatize() treats each word as a noun by default)
lemmatizer = nltk.stem.WordNetLemmatizer()
wordNetLemmatizedWords = [lemmatizer.lemmatize(word) for word in words]
lemm['WordNetLemmatizer'] = pd.Series(wordNetLemmatizedWords)
lemm

| | OriginalWords | WordNetLemmatizer |
|---|---|---|
| 0 | What | What |
| 1 | s | s |
| 2 | to | to |
| 3 | like | like |
| 4 | about | about |
| ... | ... | ... |
| 410 | reality | reality |
| 411 | when | when |
| 412 | this | this |
| 413 | is | is |
