# NLP - Data Cleaning

By [Leonardo Tozo](https://www.linkedin.com/in/leotozo/)

****************************
Hello,
<br>This notebook is part of my personal portfolio. With this series of notebooks, I intend to keep practicing and improving my AI and machine learning skills.
 
*Leonardo Tozo Bisinoto*
<br>*MBA in Artificial Intelligence & Machine Learning*
<br>*LinkedIn: https://www.linkedin.com/in/leotozo/*
<br>*Github: https://github.com/leotozo*
**************************** 

This analysis uses the IMDB reviews dataset. I will perform basic data cleaning using NLP techniques.

In [76]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Reading the IMDB dataset and drawing a random sample of 10,000 rows.

In [77]:
df = pd.read_csv(
    './imdb-dataset.csv', encoding='utf-8'
).sample(10000)

# Creating a copy of the dataset.

In [78]:
df_clean = df.copy()

# Selecting only the 'review' column.

In [79]:
df_clean = df_clean.loc[:, ['review']]

# Displaying the first 5 rows of the dataset.

In [80]:
df_clean.head()

Unnamed: 0,review
36431,Jane Eyre has always been my favorite novel! W...
21308,"Skenbart takes place in the 1940s, right after..."
46795,I saw this move several years ago at the Centr...
25041,Make no bones about it. There are a lot of thi...
2926,Oh it really really is. I've seen films that I...


# Displaying the last 5 rows of the dataset.

In [81]:
df_clean.tail()

Unnamed: 0,review
16191,This movie was talked about in Fangoria where ...
39737,I couldn't disagree more with those who says t...
46990,"""The Bone Snatcher"" starts out extremely promi..."
42086,This film is my favorite comedy of all time an...
22680,Yes there are great performances here. Unfortu...


# Counting the dataset rows.

In [82]:
df_clean.count()

review    10000
dtype: int64

# Displaying basic info about the dataset.

In [83]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 36431 to 22680
Data columns (total 1 columns):
review    10000 non-null object
dtypes: object(1)
memory usage: 156.2+ KB


# Displaying the dataset shape (# of rows, # of columns)
 

In [84]:
df_clean.shape

(10000, 1)

# Removing stopwords [Cleaning]
 

In [85]:
df_clean.head()

Unnamed: 0,review
36431,Jane Eyre has always been my favorite novel! W...
21308,"Skenbart takes place in the 1940s, right after..."
46795,I saw this move several years ago at the Centr...
25041,Make no bones about it. There are a lot of thi...
2926,Oh it really really is. I've seen films that I...


In [86]:
len(df_clean)

10000

# Extracting the stopwords from the NLTK library and displaying them


In [87]:
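import nltk
from nltk.corpus import stopwords

# The stopword corpus must be available locally; if it is not, run nltk.download('stopwords') once.
sw = stopwords.words('english')
np.array(sw)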


array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
       "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
       'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
       'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
       'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
       'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
       'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
       'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
       'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
       'by', 'for', 'with', 'about', 'against', 'between', 'into',
       'through', 'during', 'before', 'after', 'above', 'below', 'to',
       'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
       'again', 'further', 'then', 'once', 'here', 'there', 'when',
       'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'm

In [88]:
print("Number of stopwords: ", len(sw))

Number of stopwords:  179


In [89]:
print(nltk.corpus.stopwords.words('english')[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


# A function for removing the stopwords

In [90]:
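def remove_stopwords(text):
    '''Lowercase the text and drop every whitespace-separated token found in sw.'''
    # Named remove_stopwords so it does not shadow the `stopwords` module imported from nltk.corpus.
    # Tokens are split on whitespace only, so tokens such as "it." or "about?" are not matched.
    words = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(words)

In [91]:
df_clean['review'] = df_clean['review'].apply(remove_stopwords)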
df_clean['review'].head(10)

36431    jane eyre always favorite novel! stumbled upon...
21308    skenbart takes place 1940s, right second world...
46795    saw move several years ago central florida fil...
25041    make bones it. lot things wrong movie. clichéd...
2926     oh really really is. i've seen films disliked ...
38519    heck about? kelly (jennifer) seems drop moral ...
4833     kids hiking mountains, one goes large tunnel d...
42799    watched part first part movie, tiny little bit...
22679    new approach comedy. funny.<br /><br />the jok...
17989    oh, man, sure knew make back then. hollywood f...
Name: review, dtype: object
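
The output above still contains stopwords that are attached to punctuation, such as `it.` and `about?`, because the text is split on whitespace only. One possible way around this is to tokenize with a regular expression before filtering. The sketch below is a minimal illustration, not the notebook's method: `DEMO_STOPWORDS`, `TOKEN_RE`, `remove_stopwords_regex`, and the sample sentence are all hypothetical names and data, and the tiny stopword set merely stands in for NLTK's list.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stopword list.
DEMO_STOPWORDS = {"the", "was", "but", "what", "it", "about"}

# A token is a run of letters/digits, optionally joined by internal apostrophes
# (so "i've" stays one token while "about?" becomes "about").
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")

def remove_stopwords_regex(text, stop_set=DEMO_STOPWORDS):
    """Lowercase, tokenize with TOKEN_RE, and drop tokens found in stop_set."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(tok for tok in tokens if tok not in stop_set)

# Made-up sentence in the style of a review.
sample = "The plot was great, but what was it about?"

# Whitespace splitting keeps "about?" because it does not equal "about".
naive = " ".join(w for w in sample.lower().split() if w not in DEMO_STOPWORDS)

print(naive)
print(remove_stopwords_regex(sample))
```

Note that this approach also discards the punctuation itself, which may or may not be desirable depending on the downstream task. Using a `set` for the stopwords keeps each membership test constant-time.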

# Comparing the vocabulary size with and without stopwords

In [92]:
from sklearn.feature_extraction.text import CountVectorizer

df_withstopwords = df.copy()
vect = CountVectorizer(ngram_range=(1,1))
vect.fit(df_withstopwords.review)
text_vect = vect.transform(df_withstopwords.review)

print('UNIGRAMS with STOPWORDS', text_vect.shape[1])

UNIGRAMS with STOPWORDS 52370


In [93]:
from sklearn.feature_extraction.text import CountVectorizer

df_withoutstopwords = df_clean.copy()
vect = CountVectorizer(ngram_range=(1,1), stop_words=sw)
vect.fit(df_withoutstopwords.review)
text_vect = vect.transform(df_withoutstopwords.review)

print('UNIGRAMS without STOPWORDS', text_vect.shape[1])

UNIGRAMS without STOPWORDS 52226
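
The two counts differ by only 144 terms (52370 versus 52226). This is expected: the vocabulary size counts distinct tokens rather than occurrences, so removing a stopword list can shrink it by at most the number of distinct stopwords. The following sketch reproduces the idea in plain Python on two made-up documents. It assumes the default `CountVectorizer` behaviour of lowercasing and using the token pattern `(?u)\b\w\w+\b`; the `vocabulary` helper and the sample data are hypothetical.

```python
import re

# Same pattern as CountVectorizer's default token_pattern:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def vocabulary(docs, stop_words=()):
    """Return the set of distinct lowercase tokens, excluding stop_words."""
    stop = set(stop_words)
    vocab = set()
    for doc in docs:
        for tok in TOKEN_PATTERN.findall(doc.lower()):
            if tok not in stop:
                vocab.add(tok)
    return vocab

# Made-up documents and stopword list for illustration only.
docs = ["The movie was good", "The film was not good at all"]
stop = ["the", "was", "not", "at", "all"]

full = vocabulary(docs)
reduced = vocabulary(docs, stop)
print(len(full), len(reduced))
```

Here the vocabulary shrinks by exactly the five stopwords, no matter how often each one occurs in the documents.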


In [94]:
df_clean.head(10)

Unnamed: 0,review
36431,jane eyre always favorite novel! stumbled upon...
21308,"skenbart takes place 1940s, right second world..."
46795,saw move several years ago central florida fil...
25041,make bones it. lot things wrong movie. clichéd...
2926,oh really really is. i've seen films disliked ...
38519,heck about? kelly (jennifer) seems drop moral ...
4833,"kids hiking mountains, one goes large tunnel d..."
42799,"watched part first part movie, tiny little bit..."
22679,new approach comedy. funny.<br /><br />the jok...
17989,"oh, man, sure knew make back then. hollywood f..."


# Showing the 10 most frequent words

In [95]:
from collections import Counter
c = Counter()

In [96]:
df_clean.review.str.lower().str.split(" ").apply(c.update)
c.most_common(10)

[('/><br', 19571),
 ('movie', 12202),
 ('film', 10903),
 ('one', 8882),
 ('like', 7496),
 ('would', 4781),
 ('even', 4730),
 ('good', 4604),
 ('really', 4306),
 ('see', 4209)]
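
The most frequent "word" above, `/><br`, is not a word at all: it is what remains of the `<br /><br />` line-break tags in the reviews once they are split on spaces. A minimal sketch of one way to remove such markup before counting is shown below. The `strip_html` helper and the sample reviews are hypothetical, and a simple regular expression is assumed to be sufficient for tags as plain as `<br />`.

```python
import html
import re
from collections import Counter

# Matches anything that looks like an HTML tag, e.g. "<br />".
TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace tags with a space and decode entities such as &amp;."""
    no_tags = TAG_RE.sub(" ", text)
    return html.unescape(no_tags)

# Made-up reviews for illustration only.
reviews = [
    "funny film<br /><br />really funny",
    "good film<br />good cast",
]

counts = Counter()
for review in reviews:
    counts.update(strip_html(review).lower().split())

print(counts.most_common(3))
```

Replacing each tag with a space rather than an empty string keeps the words on either side of the tag from being glued together.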

# Stemming

In [97]:
from nltk.stem.porter import PorterStemmer

# Work on an explicit copy so that adding columns does not trigger a SettingWithCopyWarning.
short_data = df_clean.head().copy()

ps = PorterStemmer()

short_data['Stemming'] = short_data['review'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split() ]))
print('\n--------Before Stemming--------\n')
print(short_data['review'])
print('\n--------Stemming--------\n')
short_data['Stemming']


--------Before Stemming--------

36431    jane eyre always favorite novel! stumbled upon...
21308    skenbart takes place 1940s, right second world...
46795    saw move several years ago central florida fil...
25041    make bones it. lot things wrong movie. clichéd...
2926     oh really really is. i've seen films disliked ...
Name: review, dtype: object

--------Stemming--------



36431    jane eyr alway favorit novel! stumbl upon movi...
21308    skenbart take place 1940s, right second world ...
46795    saw move sever year ago central florida film f...
25041    make bone it. lot thing wrong movie. clichéd w...
2926     oh realli realli is. i'v seen film dislik more...
Name: Stemming, dtype: object

# Lemmatization

In [98]:
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()

print('\n--------Before Lemmatization--------\n')
print(short_data['review'])
print('\n--------Lemmatization--------\n')
# pos='v' treats every token as a verb, so nouns such as "years" and "things" are left unchanged.
short_data['Lemmatization'] = short_data['review'].apply(lambda x: ' '.join([lmtzr.lemmatize(word, 'v') for word in x.split()]))
short_data['Lemmatization']


--------Before Lemmatization--------

36431    jane eyre always favorite novel! stumbled upon...
21308    skenbart takes place 1940s, right second world...
46795    saw move several years ago central florida fil...
25041    make bones it. lot things wrong movie. clichéd...
2926     oh really really is. i've seen films disliked ...
Name: review, dtype: object

--------Lemmatization--------



36431    jane eyre always favorite novel! stumble upon ...
21308    skenbart take place 1940s, right second world ...
46795    saw move several years ago central florida fil...
25041    make bone it. lot things wrong movie. clichéd ...
2926     oh really really is. i've see film dislike mor...
Name: Lemmatization, dtype: object

# Part of Speech Tagging

In [99]:
print('\n--------Before Part of Speech Tagging--------\n')
print(short_data['review'])
print('\n--------Part of Speech Tagging--------\n')
short_data['SpeechTagging'] = short_data['review'].apply(lambda x: nltk.pos_tag(nltk.word_tokenize(x)))
print(short_data['SpeechTagging'])


--------Before Part of Speech Tagging--------

36431    jane eyre always favorite novel! stumbled upon...
21308    skenbart takes place 1940s, right second world...
46795    saw move several years ago central florida fil...
25041    make bones it. lot things wrong movie. clichéd...
2926     oh really really is. i've seen films disliked ...
Name: review, dtype: object

--------Part of Speech Tagging--------

36431    [(jane, NN), (eyre, NN), (always, RB), (favori...
21308    [(skenbart, NN), (takes, VBZ), (place, NN), (1...
46795    [(saw, JJ), (move, IN), (several, JJ), (years,...
25041    [(make, VB), (bones, NNS), (it, PRP), (., .), ...
2926     [(oh, UH), (really, RB), (really, RB), (is, VB...
Name: SpeechTagging, dtype: object
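
The tags produced above follow the Penn Treebank convention (`NN`, `VBZ`, `RB`, `JJ`, ...), whereas the WordNet lemmatizer expects a single-letter part of speech (`'n'`, `'v'`, `'a'`, `'r'`). The sketch below shows one common way to bridge the two, so that each word could be lemmatized with its own part of speech instead of always being treated as a verb. The `penn_to_wordnet` helper is a hypothetical name, and falling back to noun for every other tag is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# (word, tag) pairs taken from the tagging output above.
tagged = [
    ("skenbart", "NN"),
    ("takes", "VBZ"),
    ("always", "RB"),
    ("several", "JJ"),
    ("bones", "NNS"),
]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each resulting pair could then be passed to the lemmatizer as `lmtzr.lemmatize(word, pos)`.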


# Capitalization

In [100]:
print('\n--------Before Capitalization--------\n')
print(short_data['review'])
print('\n--------Capitalization--------\n')
short_data['Capitalization'] = short_data['review'].apply(lambda x: ' '.join([word.upper() for word in x.split()]))
print(short_data['Capitalization'])


--------Before Capitalization--------

36431    jane eyre always favorite novel! stumbled upon...
21308    skenbart takes place 1940s, right second world...
46795    saw move several years ago central florida fil...
25041    make bones it. lot things wrong movie. clichéd...
2926     oh really really is. i've seen films disliked ...
Name: review, dtype: object

--------Capitalization--------

36431    JANE EYRE ALWAYS FAVORITE NOVEL! STUMBLED UPON...
21308    SKENBART TAKES PLACE 1940S, RIGHT SECOND WORLD...
46795    SAW MOVE SEVERAL YEARS AGO CENTRAL FLORIDA FIL...
25041    MAKE BONES IT. LOT THINGS WRONG MOVIE. CLICHÉD...
2926     OH REALLY REALLY IS. I'VE SEEN FILMS DISLIKED ...
Name: Capitalization, dtype: object
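
The cell above converts every character to uppercase. Python strings offer other case conversions as well, and the short sketch below contrasts three of them on the visible start of the first review. The `variants` dictionary is only an illustrative name.

```python
text = "jane eyre always favorite novel!"

variants = {
    "upper": text.upper(),            # every letter uppercased
    "capitalize": text.capitalize(),  # only the first character uppercased
    "title": text.title(),            # first letter of every word uppercased
}

for name, value in variants.items():
    print(f"{name:>10}: {value}")

# str.title() treats an apostrophe as a word boundary.
print("i've".title())
```

Because `str.title()` restarts capitalization after an apostrophe, contractions such as `i've` come out as `I'Ve`, so it is usually not a good fit for review text.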
