### NLP: Cleaning & Preprocessing Text Data
* NLP: Natural Language Processing. As you probably know, computers are not as good at understanding words as they are at understanding numbers. This is changing, though, as advances in NLP happen every day. Devices like Apple’s Siri and Amazon’s Alexa can (usually) comprehend us when we ask about the weather, ask for directions, or ask to play a certain genre of music; these are all examples of NLP. The spam filter in your email and the spellcheck you’ve used since you learned to type in elementary school are other basic examples of your computer understanding language.
* As data scientists, we may use NLP for sentiment analysis (classifying text as having a positive or negative connotation) or to make predictions in classification models, among other things. Typically, whether we’re given the data or have to scrape it, the text will be in its natural human format of sentences, paragraphs, tweets, etc. Before we can dig into the analysis, we have to do some cleaning to break the text down into a format the computer can easily work with.
#### Tokenization + Remove punctuation
#### Remove stop words
#### Lemmatization or Stemming


In [58]:
import pandas as pd
import nltk
import string
# import the stop word list
from nltk.corpus import stopwords
# tokenizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [59]:
df= pd.read_csv(r"twitter_train.csv")
df.head(2)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...


In [60]:
df1=df.drop("id",axis=1)
df1.head()

Unnamed: 0,label,tweet
0,0,@user when a father is dysfunctional and is s...
1,0,@user @user thanks for #lyft credit i can't us...
2,0,bihday your majesty
3,0,#model i love u take with u all the time in ...
4,0,factsguide: society now #motivation


In [61]:
df1.shape

(31962, 2)

In [62]:
df1["tweet"][1:3]

1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
Name: tweet, dtype: object

In [63]:
# tweet = df1["tweet"].str.split("@",n=1, expand =True)
# tweet.head()

In [64]:
tweet = df1["tweet"].str.split("#",n=1, expand =True)
tweet.head()

Unnamed: 0,0,1
0,@user when a father is dysfunctional and is s...,run
1,@user @user thanks for,lyft credit i can't use cause they don't offer...
2,bihday your majesty,
3,,model i love u take with u all the time in u...
4,factsguide: society now,motivation


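## Remove punctuation:
* This step needs import string.
* string.punctuation is a string containing the ASCII punctuation characters.
* "".join glues the remaining characters back together into one string. Existing spaces are kept, because a space is not a punctuation character.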
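As a minimal sketch of what the filter does, the snippet below prints `string.punctuation` and strips those characters from one string. The sample tweet is made up for illustration and is not taken from the dataset.

```python
import string

# string.punctuation holds the ASCII punctuation characters.
print(string.punctuation)

sample = "@user thanks for #lyft credit, i can't use it!"  # hypothetical tweet

# Keep every character that is not punctuation, then glue them back together.
kept = [c for c in sample if c not in string.punctuation]
no_punct = "".join(kept)
print(no_punct)
```

Joining on an empty string leaves the original spacing alone, so words stay separated while `@`, `#`, commas, and apostrophes disappear.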

In [65]:
def remove_punctuation(text):
    no_punct = "".join([c for c in text if c not in string.punctuation])
    return no_punct

In [66]:
df1["tweet"] = df1["tweet"].apply(lambda x : remove_punctuation(x))
df1["tweet"].head(10)

0     user when a father is dysfunctional and is so...
1    user user thanks for lyft credit i cant use ca...
2                                  bihday your majesty
3    model   i love u take with u all the time in u...
4                 factsguide society now    motivation
5    22 huge fan fare and big talking before they l...
6     user camping tomorrow user user user user use...
7    the next school year is the year for examsð¯...
8    we won love the land allin cavs champions clev...
9           user user welcome here   im   its so gr8  
Name: tweet, dtype: object

In [67]:
# function to remove non-ASCII
# def remove_non_ascii(text):
#     return ''.join(i for i in text if ord(i)<128)

In [68]:
# df1['tweet'] = df1['tweet'].apply(lambda x: remove_non_ascii(x))
# df1["tweet"].head(10)

In [69]:
df1.loc[7:7]

Unnamed: 0,label,tweet
7,0,the next school year is the year for examsð¯...


## Tokenize:
* This breaks up the strings into a list of words or pieces based on a specified pattern using Regular Expressions, aka RegEx. The pattern I chose to use this time (r'\w+') matches runs of letters, digits, and underscores, so it also drops any remaining punctuation and is a better option for this data in particular. We can also add .lower() in the lambda function to make everything lowercase.
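To see why the `+` in the pattern matters, here is a small sketch that uses Python's built-in `re` module as a stand-in for `RegexpTokenizer` (that substitution and the sample text are assumptions for illustration). `\w` alone matches one character at a time, while `\w+` matches whole runs of word characters.

```python
import re

sample = "Thanks for #lyft credit, I can't use it"  # hypothetical tweet
text = sample.lower()

# \w matches a single word character, so every letter becomes its own token.
single_chars = re.findall(r"\w", text)

# \w+ matches runs of word characters, so punctuation acts as a separator.
tokens = re.findall(r"\w+", text)

print(single_chars[:6])
print(tokens)
```

With this pattern an apostrophe splits "can't" into two tokens. That does not happen in this notebook, because punctuation was already removed before tokenizing.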

In [70]:
# use a raw string so the backslash in \w reaches the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+')

In [71]:
df1['tweet'] = df1["tweet"].apply(lambda x: tokenizer.tokenize(x.lower()))
df1["tweet"].head()

0    [user, when, a, father, is, dysfunctional, and...
1    [user, user, thanks, for, lyft, credit, i, can...
2                              [bihday, your, majesty]
3    [model, i, love, u, take, with, u, all, the, t...
4               [factsguide, society, now, motivation]
Name: tweet, dtype: object

In [72]:
lemmatizer = WordNetLemmatizer()
def word_lemma(text):
    lemm_text = [lemmatizer.lemmatize(i) for i in text]
    return lemm_text

In [73]:
df1["tweet"].apply(lambda x:word_lemma(x))

0        [user, when, a, father, is, dysfunctional, and...
1        [user, user, thanks, for, lyft, credit, i, can...
2                                  [bihday, your, majesty]
3        [model, i, love, u, take, with, u, all, the, t...
4                   [factsguide, society, now, motivation]
                               ...                        
31957    [ate, user, isz, that, youuuð, ð, ð, ð, ð, ð, ...
31958    [to, see, nina, turner, on, the, airwave, tryi...
31959    [listening, to, sad, song, on, a, monday, morn...
31960    [user, sikh, temple, vandalised, in, in, calga...
31961                 [thank, you, user, for, you, follow]
Name: tweet, Length: 31962, dtype: object

In [74]:
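# Keep only ASCII letters in each token and drop tokens that end up empty.
# This removes digits and the mis-encoded emoji bytes left in the tweets.
letters = set(string.ascii_letters)

f = []
for tokens in df1['tweet'].values.tolist():
    cleaned = []
    for token in tokens:
        word = ''.join(ch for ch in token if ch in letters)
        if word:
            cleaned.append(word)
    f.append(cleaned)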

In [75]:
# df1["tweet"] = df1["tweet"].apply(lambda x: remove_sw(x))
# df1["tweet"].head()

In [76]:
df1['clean_tweet'] = f

In [77]:
# df1.loc[7:8]

In [78]:
df2=df1.drop("tweet",axis=1)
# df2

## Remove stop words:
* We imported a list of the most frequently used words from the NL Toolkit at the beginning with from nltk.corpus import stopwords. You can run stopwords.words(insert language) to get the full list for a given language.
* There are 197 English words included in stopwords.
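The filtering step itself is just a membership test. A minimal sketch, using a tiny hand-written stop list instead of NLTK's (both the list and the tokens below are made up for illustration):

```python
# Hypothetical mini stop list; NLTK's English list is much longer.
stop_words = {"a", "and", "is", "the", "when", "for", "i"}

tokens = ["user", "when", "a", "father", "is", "dysfunctional", "and", "selfish"]

# A set gives fast lookups, which matters when filtering many tweets.
filtered = [w for w in tokens if w not in stop_words]
print(filtered)
```

Only the tokens that are not in the stop list survive, in their original order.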

In [79]:
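# create a function to remove the stop words.
# Build the set once; calling stopwords.words() for every token is very slow.
stop_words = set(stopwords.words("english"))
def remove_sw(text):
    words = [w for w in text if w not in stop_words]
    return words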

In [80]:
df2["clean_tweet"]=df2["clean_tweet"].apply(lambda x: remove_sw(x))
# df2["clean_tweet"].head(10)

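## Stemming and lemmatizing:
* Both tools reduce words to a root form. Stemming is a little more aggressive: it cuts off common prefixes and/or endings of words by rule (the Porter stemmer used here strips endings), so the result is not always a real word. Lemmatizing maps each word to its dictionary form when it can find one, and otherwise leaves the word unchanged.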
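The contrast can be sketched with two toy functions. These are deliberately simplified stand-ins, not the Porter algorithm or WordNet: the suffix list, the lookup dictionary, and the function names are all hypothetical.

```python
# Toy illustration only: this is NOT the Porter algorithm or WordNet.
SUFFIXES = ("ing", "ies", "es", "s", "y")           # hypothetical rule list
LEMMA_DICT = {"studies": "study", "songs": "song"}  # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, whether or not a real word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "songs", "majesty"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function happily produces fragments that are not words, while the lookup-based one only ever returns entries from its dictionary or the input itself.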

In [81]:
# instantiate lemmatizer
lemmatizer = WordNetLemmatizer()
def word_lemmatizer(text):
    lemm_text=[lemmatizer.lemmatize(i) for i in text]
    return lemm_text

In [82]:
df2["clean_tweet"].apply(lambda x: word_lemmatizer(x))
# df2.head()

0        [user, father, dysfunctional, selfish, drag, k...
1        [user, user, thanks, lyft, credit, cant, use, ...
2                                        [bihday, majesty]
3                      [model, love, u, take, u, time, ur]
4                        [factsguide, society, motivation]
                               ...                        
31957                              [ate, user, isz, youuu]
31958    [see, nina, turner, airwave, trying, wrap, man...
31959    [listening, sad, song, monday, morning, otw, w...
31960    [user, sikh, temple, vandalised, calgary, wso,...
31961                                [thank, user, follow]
Name: clean_tweet, Length: 31962, dtype: object

In [83]:
stemmer = PorterStemmer()
def word_stemmer(text):
    stem_text = " ".join([stemmer.stem(i) for i in text])
    return stem_text
df2["clean_tweet"] = df2["clean_tweet"].apply(lambda x: word_stemmer(x))
df2["clean_tweet"]

0        user father dysfunct selfish drag kid dysfunct...
1        user user thank lyft credit cant use caus dont...
2                                           bihday majesti
3                              model love u take u time ur
4                                  factsguid societi motiv
                               ...                        
31957                                   ate user isz youuu
31958    see nina turner airwav tri wrap mantl genuin h...
31959             listen sad song monday morn otw work sad
31960     user sikh templ vandalis calgari wso condemn act
31961                                    thank user follow
Name: clean_tweet, Length: 31962, dtype: object

In [84]:
df2.head()

Unnamed: 0,label,clean_tweet
0,0,user father dysfunct selfish drag kid dysfunct...
1,0,user user thank lyft credit cant use caus dont...
2,0,bihday majesti
3,0,model love u take u time ur
4,0,factsguid societi motiv


In [192]:
df2.to_csv("clean_train_tweet.csv",index=False)
df2

Unnamed: 0,label,clean_tweet
0,0,user father dysfunct selfish drag kid dysfunct...
1,0,user user thank lyft credit cant use caus dont...
2,0,bihday majesti
3,0,model love u take u time ur
4,0,factsguid societi motiv
...,...,...
31957,0,ate user isz youuu
31958,0,see nina turner airwav tri wrap mantl genuin h...
31959,0,listen sad song monday morn otw work sad
31960,1,user sikh templ vandalis calgari wso condemn act


###### Now that I have clean data ready to be analyzed, I could go on to use it for sentiment analysis, or use columns such as ratings or manufacturer as a target variable based on word correlations. I could also build a recommender system based on user purchases or item reviews, or do customer segmentation with clustering. The possibilities are endless!

In [85]:
# map the numeric labels to strings: 0 -> "Negative", 1 -> "Positive".
df3=df2.replace({"label":{0: "Negative"}})
final_df=df3.replace({"label":{1: "Positive"}})

In [111]:
final_df.head()
# final_df.loc[500:550]

Unnamed: 0,label,clean_tweet
0,Negative,user father dysfunct selfish drag kid dysfunct...
1,Negative,user user thank lyft credit cant use caus dont...
2,Negative,bihday majesti
3,Negative,model love u take u time ur
4,Negative,factsguid societi motiv


In [112]:
final_df.describe()

Unnamed: 0,label,clean_tweet
count,31962,31962
unique,2,29217
top,Negative,model love u take u time ur
freq,29720,319


In [113]:
x=final_df.clean_tweet
y=final_df.label

In [151]:
from sklearn.model_selection import train_test_split
# random_state expects an integer seed; 0 is the value False was acting as
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2,random_state=0)

In [174]:
from sklearn.feature_extraction.text import CountVectorizer
# create the CountVectorizer object, dropping English stop words
cv=CountVectorizer(stop_words="english")

In [175]:
x_traincv=cv.fit_transform(x_train)
x_testcv =cv.transform(x_test)

In [176]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
model=MultinomialNB()
model.fit(x_traincv, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [177]:
yp=model.predict(x_testcv)

In [178]:
yp

array(['Negative', 'Negative', 'Negative', ..., 'Negative', 'Negative',
       'Negative'], dtype='<U8')

In [179]:
model.score(x_traincv,y_train)

0.9691032109194728

In [180]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [181]:
accuracy_score(y_test,yp)

0.9619896762083529

In [183]:
cm= confusion_matrix(y_test,yp)

In [184]:
cm

array([[5946,   39],
       [ 204,  204]], dtype=int64)
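Reading the matrix by hand, as a sketch: assuming scikit-learn's default of sorting the labels alphabetically, rows are the true labels and columns are the predictions in the order Negative, Positive. The variable names below are my own.

```python
# Confusion matrix from the cell above: rows are true labels, columns are
# predictions, in the order ['Negative', 'Positive'].
tn, fp = 5946, 39
fn, tp = 204, 204

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of predicted Positive that are correct
recall = tp / (tp + fn)      # share of truly Positive tweets that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy matches the accuracy_score printed earlier, but recall for the Positive class is only 204 out of 408. Because most tweets are Negative, a high accuracy can hide the fact that half of the Positive tweets are missed.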