
# NLP Basics: Implementing a pipeline to clean text

### Pre-processing text data

Cleaning text data highlights the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) typically consists of the following steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. Lemmatize/Stem

The first three steps are covered in this chapter because nearly every text-cleaning pipeline implements them. Lemmatizing and stemming are helpful but not critical, so they are covered in the next chapter.
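
As a rough preview, the three steps covered here can be chained into a single function. This is a minimal sketch, not the notebook's own code: `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins (the cells below use NLTK's English stopword list instead).

```python
import re
import string

# Hypothetical, deliberately tiny stopword set used only for illustration
STOPWORDS = {"i", "a", "the", "to", "is"}

def clean_text(text):
    """Remove punctuation, tokenize, and drop stopwords."""
    # Step 1: remove ASCII punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: lowercase, split on runs of non-word characters, skip empty strings
    tokens = [tok for tok in re.split(r"\W+", no_punct.lower()) if tok]
    # Step 3: drop stopwords
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_text("I like NLP, and the pipeline is simple!"))
```

The sections below build the same steps one at a time on a DataFrame column, which makes each intermediate result easy to inspect.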

In [1]:
!git clone https://github.com/mohanrajmit/Sentiment-Analsysis.git

Cloning into 'Sentiment-Analsysis'...
remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (9/9), done.


In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 300)

data = pd.read_csv("/content/Sentiment-Analsysis/train.csv")
#data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!


### Remove punctuation

In [5]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
"I like NLP." == "I like NLP"

False

In [8]:
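def remove_punct(text):
    # Keep only the characters that are not ASCII punctuation
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

# Pass the function directly; wrapping it in a lambda is unnecessary
data['tweet_clean'] = data['tweet'].apply(remove_punct)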

data.head()

Unnamed: 0,id,label,tweet,tweet_clean
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone,fingerprint Pregnancy Test httpsgooglh1MfQV android apps beautiful cute health igers iphoneonly iphonesia iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/,Finally a transparant silicon case Thanks to my uncle yay Sony Xperia S sonyexperias… httpinstagramcompYGEt5JC6JM
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu,We love this Would you go talk makememories unplug relax iphone smartphone wifi connect httpfbme6N3LsUpCu
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/,Im wired I know Im George I was made that way iphone cute daventry home httpinstagrampLi5ujS4k
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!,What amazing service Apple wont even talk to me about a question I have unless I pay them 1995 for their stupid support
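
In the output above, URLs collapse into strings such as `httpsgooglh1MfQV`, and the Unicode ellipsis `…` survives because `string.punctuation` only covers ASCII characters. One possible refinement, sketched here under the assumption that URLs carry no useful signal for this task, is to strip URLs first and then delete punctuation with `str.translate`. The name `remove_urls_and_punct` and the URL pattern are illustrative, not part of the original notebook.

```python
import re
import string

# Simple illustrative pattern: 'http://' or 'https://' followed by non-space characters
URL_PATTERN = re.compile(r"https?://\S+")

# Translation table that deletes ASCII punctuation plus the Unicode ellipsis
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "…")

def remove_urls_and_punct(text):
    # Drop URLs before punctuation removal so they do not turn into junk tokens
    without_urls = URL_PATTERN.sub("", text)
    return without_urls.translate(PUNCT_TABLE)

print(remove_urls_and_punct("We love this! #iphone #connect... http://fb.me/6N3LsUpCu"))
```

Whether to drop URLs, keep them, or replace them with a placeholder token depends on the downstream model, so treat this as one option rather than a rule.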


### Tokenization

In [9]:
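import re

def tokenize(text):
    # Split on runs of non-word characters; the raw string avoids an invalid-escape warning for \W
    tokens = re.split(r'\W+', text)
    return tokens

# Lowercase first so that tokens such as 'NLP' and 'nlp' are treated as the same word
data['tweet_tokenized'] = data['tweet_clean'].apply(lambda x: tokenize(x.lower()))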

data.head()

Unnamed: 0,id,label,tweet,tweet_clean,tweet_tokenized
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone,fingerprint Pregnancy Test httpsgooglh1MfQV android apps beautiful cute health igers iphoneonly iphonesia iphone,"[fingerprint, pregnancy, test, httpsgooglh1mfqv, android, apps, beautiful, cute, health, igers, iphoneonly, iphonesia, iphone]"
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/,Finally a transparant silicon case Thanks to my uncle yay Sony Xperia S sonyexperias… httpinstagramcompYGEt5JC6JM,"[finally, a, transparant, silicon, case, thanks, to, my, uncle, yay, sony, xperia, s, sonyexperias, httpinstagramcompyget5jc6jm]"
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu,We love this Would you go talk makememories unplug relax iphone smartphone wifi connect httpfbme6N3LsUpCu,"[we, love, this, would, you, go, talk, makememories, unplug, relax, iphone, smartphone, wifi, connect, httpfbme6n3lsupcu]"
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/,Im wired I know Im George I was made that way iphone cute daventry home httpinstagrampLi5ujS4k,"[im, wired, i, know, im, george, i, was, made, that, way, iphone, cute, daventry, home, httpinstagrampli5ujs4k]"
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!,What amazing service Apple wont even talk to me about a question I have unless I pay them 1995 for their stupid support,"[what, amazing, service, apple, wont, even, talk, to, me, about, a, question, i, have, unless, i, pay, them, 1995, for, their, stupid, support]"


In [10]:
'NLP' == 'nlp'

False
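
One detail of `re.split` is worth knowing: when the text starts or ends with a non-word character, the result contains empty strings. The cleaned tweets above mostly avoid this, but it can show up on other input. Below is a small sketch of the behaviour, with `re.findall` as one possible alternative (`tokenize_split` and `tokenize_findall` are hypothetical names used only here).

```python
import re

def tokenize_split(text):
    # Same approach as the tokenize function above
    return re.split(r"\W+", text)

def tokenize_findall(text):
    # Collect runs of word characters instead of splitting on the gaps
    return re.findall(r"\w+", text)

sample = "we love this! would you go?"
print(tokenize_split(sample))    # ends with '' because the text ends with '?'
print(tokenize_findall(sample))  # no empty tokens
```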

### Remove stopwords

In [11]:
import nltk
nltk.download('stopwords')

stopword = nltk.corpus.stopwords.words('english')
print(len(stopword))
print(stopword)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both',

In [12]:
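def remove_stopwords(tokenized_list):
    # Keep only the tokens that are not in the NLTK stopword list
    text = [word for word in tokenized_list if word not in stopword]
    return text

# Pass the function directly instead of wrapping it in a lambda
data['tweet_nostop'] = data['tweet_tokenized'].apply(remove_stopwords)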

data.head()

Unnamed: 0,id,label,tweet,tweet_clean,tweet_tokenized,tweet_nostop
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone,fingerprint Pregnancy Test httpsgooglh1MfQV android apps beautiful cute health igers iphoneonly iphonesia iphone,"[fingerprint, pregnancy, test, httpsgooglh1mfqv, android, apps, beautiful, cute, health, igers, iphoneonly, iphonesia, iphone]","[fingerprint, pregnancy, test, httpsgooglh1mfqv, android, apps, beautiful, cute, health, igers, iphoneonly, iphonesia, iphone]"
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/,Finally a transparant silicon case Thanks to my uncle yay Sony Xperia S sonyexperias… httpinstagramcompYGEt5JC6JM,"[finally, a, transparant, silicon, case, thanks, to, my, uncle, yay, sony, xperia, s, sonyexperias, httpinstagramcompyget5jc6jm]","[finally, transparant, silicon, case, thanks, uncle, yay, sony, xperia, sonyexperias, httpinstagramcompyget5jc6jm]"
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu,We love this Would you go talk makememories unplug relax iphone smartphone wifi connect httpfbme6N3LsUpCu,"[we, love, this, would, you, go, talk, makememories, unplug, relax, iphone, smartphone, wifi, connect, httpfbme6n3lsupcu]","[love, would, go, talk, makememories, unplug, relax, iphone, smartphone, wifi, connect, httpfbme6n3lsupcu]"
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/,Im wired I know Im George I was made that way iphone cute daventry home httpinstagrampLi5ujS4k,"[im, wired, i, know, im, george, i, was, made, that, way, iphone, cute, daventry, home, httpinstagrampli5ujs4k]","[im, wired, know, im, george, made, way, iphone, cute, daventry, home, httpinstagrampli5ujs4k]"
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!,What amazing service Apple wont even talk to me about a question I have unless I pay them 1995 for their stupid support,"[what, amazing, service, apple, wont, even, talk, to, me, about, a, question, i, have, unless, i, pay, them, 1995, for, their, stupid, support]","[amazing, service, apple, wont, even, talk, question, unless, pay, 1995, stupid, support]"
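
Two observations on the result above, sketched with a hypothetical mini stopword list rather than the full NLTK one. First, membership tests against a `set` are faster than against a list, which matters on large corpora. Second, tokens such as `wont` survive, plausibly because punctuation was stripped before matching, so they no longer equal apostrophe forms like `you're` in the list. Applying the same punctuation removal to the stopwords is one way to realign them. The names `strip_punct` and `remove_stopwords_fast` are illustrative, and the presence of `won't` in the list is an assumption.

```python
import string

# Hypothetical mini list; the notebook uses nltk.corpus.stopwords.words('english')
stopword_list = ["i", "you", "you're", "won't", "to"]

def strip_punct(word):
    # Same character filter as remove_punct, applied to a single word
    return "".join(ch for ch in word if ch not in string.punctuation)

# A set gives fast membership tests; also add punctuation-free variants of each stopword
stopword_set = set(stopword_list) | {strip_punct(w) for w in stopword_list}

def remove_stopwords_fast(tokenized_list):
    return [word for word in tokenized_list if word not in stopword_set]

print(remove_stopwords_fast(["apple", "wont", "talk", "to", "youre"]))
```

An alternative is to remove stopwords before stripping punctuation, which would require a tokenizer that keeps apostrophes inside words.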
