## Lab NLP


## Challenge 1 - Installations

In [1]:
import pandas as pd
import re

import nltk
from nltk.stem import SnowballStemmer                          # Stemming: reduces words to their root form
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.probability import FreqDist

from sklearn.model_selection import train_test_split

In [2]:
#nltk.download()

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paola\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\paola\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paola\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\paola\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Challenge 2 - Preparing Text Data For Analysis

In [7]:
# Creating functions first (before loading data)

# Function #1 to clean the text: removes mentions, special characters, and URLs

clean_up = lambda tw: ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tw.lower()).split())

In [8]:
#test
clean_up('@switchfoot http://twitpic.com/2y1zl - Awww, #tetet')

'awww tetet'
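To make the regular expression easier to read, the sketch below splits it into its three alternatives and rebuilds the same cleaning step. The names `MENTION`, `NON_ALNUM`, `URL`, and `clean_up_verbose` are introduced here for illustration only; the behavior is intended to match `clean_up`.

```python
import re

# The three alternatives from clean_up, named separately for illustration.
MENTION = r"@[A-Za-z0-9]+"      # an @ followed by letters or digits
NON_ALNUM = r"[^0-9A-Za-z \t]"  # any character that is not a letter, digit, space, or tab
URL = r"\w+://\S+"              # scheme://rest-of-url up to the next whitespace

PATTERN = re.compile(f"({MENTION})|({NON_ALNUM})|({URL})")


def clean_up_verbose(tweet: str) -> str:
    """Lowercase, blank out mentions, symbols, and URLs, then collapse whitespace."""
    blanked = PATTERN.sub(" ", tweet.lower())
    return " ".join(blanked.split())


print(clean_up_verbose("@switchfoot http://twitpic.com/2y1zl - Awww, #tetet"))  # awww tetet
print(clean_up_verbose("I can't wait!"))                                        # i can t wait
print(clean_up_verbose("@_TheSpecialOne_ hi"))                                  # thespecialone hi
```

Two side effects are worth noticing. Apostrophes are replaced by spaces, which is why contractions show up later as fragments such as `can t`. Also, the mention pattern does not allow underscores, so a handle that starts with `_` is not removed as a mention; only the `@` and `_` characters are stripped.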

In [9]:
# Function #2 to tokenize the text: divides a string into substrings

tokenize = lambda w: nltk.word_tokenize(w)

In [10]:
# Function #3 for stemming and lemmatization, which normalize words to a common base form

english_stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def stem_and_lemmatize(tw):
    return list(map(lambda l: lemmatizer.lemmatize(english_stemmer.stem(l)), tw))

In [11]:
# Function #4 that removes stopwords

english_stopwords = set(stopwords.words('english'))  # build the set once instead of once per word
pass_stopwords = lambda x: [w for w in x if w not in english_stopwords]

In [12]:
# Reading data

tweets = pd.read_csv('Sentiment140.csv/Sentiment140.csv', nrows=10000)
tweets.head()

# Exploring data: already checked.
#tweets.isna().sum()
#tweets.info()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [13]:
# Applying the cleaning function

tweets['text'] = tweets['text'].apply(clean_up)
tweets.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,awww that s a bummer you shoulda got david car...
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can t update his facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,i dived many times for the ball managed to sav...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no it s not behaving at all i m mad why am i h...


## Tokenization

In [14]:
# Tokenize
tweets['cleaned_text'] = tweets['text'].apply(tokenize)

In [15]:
tweets.head()

Unnamed: 0,target,id,date,flag,user,text,cleaned_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,awww that s a bummer you shoulda got david car...,"[awww, that, s, a, bummer, you, shoulda, got, ..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can t update his facebook by ...,"[is, upset, that, he, can, t, update, his, fac..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,i dived many times for the ball managed to sav...,"[i, dived, many, times, for, the, ball, manage..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,"[my, whole, body, feels, itchy, and, like, its..."
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no it s not behaving at all i m mad why am i h...,"[no, it, s, not, behaving, at, all, i, m, mad,..."


## Stemming and Lemmatization

NLTK ships several stemmers; the three most commonly compared are Porter, Snowball, and Lancaster. They differ in how aggressively they strip suffixes. Porter is the gentlest and tends to preserve a word's original form when in doubt. Lancaster is the most aggressive and sometimes produces incorrect stems. Snowball sits in between. **In most cases you will use either Porter or Snowball**.
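One way to get a feel for the difference is to run the same words through all three stemmers side by side. The following is a minimal sketch; the sample words are arbitrary choices made for this illustration, and none of the stemmers requires an extra NLTK data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance of each stemmer discussed above.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words chosen for illustration.
words = ["running", "generously", "maximum", "fairly"]

# Print the stem each stemmer produces for every word.
for word in words:
    stems = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, stems)
```

Comparing the printed stems on words from your own corpus is usually a better guide than general rules when choosing a stemmer.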


In [16]:
tweets['cleaned_text'] = tweets['cleaned_text'].apply(stem_and_lemmatize)
tweets.head()

Unnamed: 0,target,id,date,flag,user,text,cleaned_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,awww that s a bummer you shoulda got david car...,"[awww, that, s, a, bummer, you, shoulda, got, ..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can t update his facebook by ...,"[is, upset, that, he, can, t, updat, his, face..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,i dived many times for the ball managed to sav...,"[i, dive, mani, time, for, the, ball, manag, t..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,"[my, whole, bodi, feel, itchi, and, like, it, ..."
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no it s not behaving at all i m mad why am i h...,"[no, it, s, not, behav, at, all, i, m, mad, wh..."



## Stop Words Removal

Stop words are the most commonly used words in a language; they contribute little to the main meaning of a text. Examples of English stop words are i, me, is, and, the, but, and here. We remove stop words before analysis because otherwise they would make up an overwhelming portion of the tokenized word list, and the NLP algorithms would have trouble identifying the truly important words.

NLTK has a stopwords package that allows us to import the most common stop words in over a dozen languages, including English, Spanish, French, German, Dutch, Portuguese, and Italian. These are the bare minimum stop words (100-150 words in each language) that can get beginners started. Some other NLP packages, such as stop-words and wordcloud, provide bigger lists of stop words.

Now in your Jupyter Notebook, create a function called remove_stopwords that loops through a list of words that have been stemmed and lemmatized and removes the stop words. Return a new list with the stop words removed.
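A minimal sketch of such a function is shown below. To keep it self-contained, it uses a tiny hand-picked stop word set (`EXAMPLE_STOPWORDS`, an assumption made for this illustration); in the notebook the set would be built from `stopwords.words('english')` instead.

```python
# A tiny, hand-picked stop word set used only for this illustration.
# In the notebook the words would come from stopwords.words('english').
EXAMPLE_STOPWORDS = frozenset({"i", "me", "my", "is", "it", "and", "the", "but", "on", "here"})


def remove_stopwords(tokens, stop_words=EXAMPLE_STOPWORDS):
    """Return a new list with every token found in stop_words removed."""
    return [token for token in tokens if token not in stop_words]


# Tokens resembling a stemmed tweet.
tokens = ["my", "whole", "bodi", "feel", "itchi", "and", "like", "it", "on", "fire"]
print(remove_stopwords(tokens))  # ['whole', 'bodi', 'feel', 'itchi', 'like', 'fire']
```

Using a set (or frozenset) makes each membership check a constant-time lookup, which matters when the function is applied to thousands of tweets.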


In [17]:
tweets['cleaned_text'] = tweets['cleaned_text'].apply(pass_stopwords)
tweets.head()

Unnamed: 0,target,id,date,flag,user,text,cleaned_text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,awww that s a bummer you shoulda got david car...,"[awww, bummer, shoulda, got, david, carr, thir..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can t update his facebook by ...,"[upset, updat, facebook, text, might, cri, res..."
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,i dived many times for the ball managed to sav...,"[dive, mani, time, ball, manag, save, 50, rest..."
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,"[whole, bodi, feel, itchi, like, fire]"
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,no it s not behaving at all i m mad why am i h...,"[behav, mad, whi, becaus, see]"


## Challenge 3: Sentiment Analysis

In [18]:
sentiment = SentimentIntensityAnalyzer()  # a tweet counts as positive when its 'pos' score exceeds 0.25

positive_tweet = lambda w: sentiment.polarity_scores(w)['pos']>.25
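`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The sketch below contrasts the rule used here (`pos` above 0.25) with a commonly cited alternative based on the `compound` score (at least 0.05). The score dictionaries are hand-written, hypothetical values rather than real VADER output, and the function names are introduced only for this illustration.

```python
def positive_by_pos(scores, threshold=0.25):
    """Rule used in this notebook: the 'pos' proportion must exceed the threshold."""
    return scores["pos"] > threshold


def positive_by_compound(scores, threshold=0.05):
    """Alternative rule: the normalized 'compound' score must reach the threshold."""
    return scores["compound"] >= threshold


# Hand-written, hypothetical score dictionaries (not real VADER output).
mostly_neutral = {"neg": 0.0, "neu": 0.85, "pos": 0.15, "compound": 0.4}
clearly_positive = {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.8}

for name, scores in [("mostly_neutral", mostly_neutral), ("clearly_positive", clearly_positive)]:
    print(name, positive_by_pos(scores), positive_by_compound(scores))
```

The two rules can disagree on mildly positive text, so the choice of rule and threshold directly shapes the `is_positive` labels that the classifier later learns from.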


## Creating Bag of Words

The purpose of this step is to create a bag of words from the processed data. The bag of words contains all the unique words in your whole text body (a.k.a. corpus) along with the number of occurrences of each word. It lets you see which words are the most important features across the whole corpus.

As you can imagine, this produces a massive set of words. The less important words (i.e. those that occur very rarely) do not contribute much to the sentiment. Therefore, you only need the most important words to build your feature set in the next step. In our case, we will use the 5,000 most frequent words to build the features.

In your Jupyter Notebook, combine all the words in text_processed and calculate the frequency distribution of all words. A convenient tool for calculating the term frequency distribution is NLTK's FreqDist class. Then select the top 5,000 words from the frequency distribution.
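The idea can be sketched on a toy corpus with the standard library's `Counter`, which `FreqDist` subclasses, so `most_common` behaves the same way. The third token list below is made up for illustration, and the features are built by exact token membership, which is an assumption of this sketch rather than the only possible approach.

```python
from collections import Counter
from itertools import chain

# Toy corpus: two processed tweets plus one made-up token list.
token_lists = [
    ["whole", "bodi", "feel", "itchi", "like", "fire"],
    ["behav", "mad", "whi", "becaus", "see"],
    ["feel", "like", "go", "work"],
]

# Flatten every tweet's tokens into one list and count occurrences.
all_tokens = list(chain.from_iterable(token_lists))
freq = Counter(all_tokens)

# Keep only the most frequent words (2 here; 5,000 in the lab).
top_words = [word for word, _ in freq.most_common(2)]

# One boolean feature per top word, based on exact token membership.
features = [{word: word in tokens for word in top_words} for tokens in token_lists]

print(top_words)    # ['feel', 'like']
print(features[1])  # {'feel': False, 'like': False}
```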


In [19]:
# Flattening the processed tokens of every tweet into a single list

w_list = tweets['cleaned_text'].explode().dropna().tolist()
#w_list

In [20]:
# Frequency of words

w_freq = FreqDist(w_list)

In [21]:
# Building the feature table: one boolean column per top-5,000 word (substring match on the cleaned text) plus the is_positive label

top_5000 = pd.concat([tweets['text'], 
                      pd.DataFrame({w:tweets['text'].str.contains(w) for w,c in w_freq.most_common(5000)}), 
                      pd.DataFrame({'is_positive':tweets['text'].apply(positive_tweet)})], axis=1)

In [22]:
# Result: the top 5,000 word columns with an is_positive column attached
top_5000.head()

Unnamed: 0,text,go,work,get,wa,day,today,like,miss,feel,...,ruddi,synch,bureaucrat,buyolog,premis,flog,unworthi,lastnight,tulip27,is_positive
0,awww that s a bummer you shoulda got david car...,True,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,is upset that he can t update his facebook by ...,False,False,False,False,True,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,i dived many times for the ball managed to sav...,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,my whole body feels itchy and like its on fire,False,False,False,False,False,False,True,False,True,...,False,False,False,False,False,False,False,False,False,False
4,no it s not behaving at all i m mad why am i h...,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [23]:
#To dictionary
top_5000_dict = top_5000.iloc[:,1:-1].to_dict(orient='records')

In [24]:
# Converting the data into a list of (feature_dict, label) tuples, the format NLTK classifiers expect
top_5000_dict = [(top_5000_dict[i], top_5000['is_positive'][i]) for i in range(len(top_5000_dict))]


## Testing Naïve Bayes Model

Now we'll test our classifier with the test dataset. This is done by calling nltk.classify.accuracy(classifier, test).

As mentioned in one of the tutorial videos, a Naive Bayes model is considered OK if your accuracy score is over 0.6. If your accuracy score is over 0.7, you've done a great job!
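The train-then-score workflow can be tried on a toy dataset first. The sketch below uses made-up feature dictionaries with two hypothetical features (`love` and `hate`) in the same `(feature_dict, label)` format that NLTK expects.

```python
import nltk

# Made-up training samples in NLTK's (feature_dict, label) format.
train = [
    ({"love": True, "hate": False}, True),
    ({"love": True, "hate": False}, True),
    ({"love": False, "hate": True}, False),
    ({"love": False, "hate": True}, False),
]

# Made-up held-out samples used only for scoring.
test = [
    ({"love": True, "hate": False}, True),
    ({"love": False, "hate": True}, False),
]

classifier = nltk.NaiveBayesClassifier.train(train)

# accuracy() classifies each test feature dict and compares it with its label.
accuracy = nltk.classify.accuracy(classifier, test)
print(accuracy)
```

On such a cleanly separated toy set the classifier should label both test samples correctly; real tweets are far noisier, which is why the lab treats anything above 0.6 as acceptable.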


In [25]:
#Split into 80% train and 20% test
train, test = top_5000_dict[:8000], top_5000_dict[8000:]
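The slice above keeps the rows in file order, so the test set is simply the last 2,000 tweets read. If the file is sorted in some way, a shuffled split may give a more representative test set. The helper below is a hypothetical sketch using only the standard library; `train_test_split` from scikit-learn, imported at the top of the notebook, is another option.

```python
import random


def shuffled_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of samples reproducibly, then split it into train and test."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: ten integers instead of (feature_dict, label) tuples.
train_part, test_part = shuffled_split(range(10))
print(len(train_part), len(test_part))  # 8 2
```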

In [26]:
# Training the classifier
classifier = nltk.NaiveBayesClassifier.train(train)

In [27]:
# Results of the training
classifier.show_most_informative_features()

Most Informative Features
                     xox = True             True : False  =     30.9 : 1.0
                  welcom = True             True : False  =     17.9 : 1.0
                     bff = True             True : False  =     17.9 : 1.0
                    kiss = True             True : False  =     14.7 : 1.0
                 congrat = True             True : False  =     14.7 : 1.0
                 alright = True             True : False  =     14.7 : 1.0
                    fave = True             True : False  =     11.4 : 1.0
                 medicin = True             True : False  =     11.4 : 1.0
                  mystic = True             True : False  =     11.4 : 1.0
                  inspir = True             True : False  =     11.4 : 1.0


In [28]:
# Accuracy score
nltk.classify.accuracy(classifier, test)
# About 84% accuracy against the VADER-derived labels, above the 0.7 mark the lab calls a great result

0.842


## Bonus Question 1 & 2: Improve Model Performance & Machine Learning Pipeline

If you still have energy left and want to dig deeper, try to improve your classifier's performance. There are many aspects you can look into, for example:

Improve stemming and lemmatization. Inspect your bag of words and the most important features. Are there any words you should further remove from the analysis? You can append these words to the stop words list.

Remember that we only used the top 5,000 features to build the model? Try using different numbers of top features. The goal is to use as few features as you can without compromising model performance. The fewer features you select, the faster your model trains. You can then use a larger sample size to improve your model's accuracy score.

In a new Jupyter Notebook, combine all your code into a function (or a class). Your new function will execute the complete machine learning pipeline: it receives the dataset location and outputs the classifier. **This will allow you to use your function to predict the sentiment of any tweet in real time**.
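One possible shape for such a pipeline is sketched below. It is not a drop-in copy of the cells above: to stay free of NLTK data downloads it splits on whitespace instead of calling `word_tokenize`, skips lemmatization, builds features by exact token membership, and takes the labeling rule as a `label_fn` argument (in this lab that would be `positive_tweet`). The names `preprocess`, `build_classifier`, and `predict` are hypothetical.

```python
import csv
import re
from collections import Counter

import nltk
from nltk.stem import SnowballStemmer

_PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")
_STEMMER = SnowballStemmer("english")


def preprocess(text, stop_words=frozenset()):
    """Clean the text, split on whitespace, stem, and drop stop words."""
    cleaned = " ".join(_PATTERN.sub(" ", text.lower()).split())
    stems = [_STEMMER.stem(token) for token in cleaned.split()]
    return [stem for stem in stems if stem not in stop_words]


def build_classifier(csv_path, label_fn, stop_words=frozenset(),
                     n_features=5000, text_column="text", max_rows=None):
    """Read tweets from csv_path and return (classifier, vocabulary)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    if max_rows is not None:
        rows = rows[:max_rows]

    texts = [row[text_column] for row in rows]
    token_lists = [preprocess(text, stop_words) for text in texts]

    # Vocabulary: the n_features most frequent processed tokens.
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocabulary = [word for word, _ in counts.most_common(n_features)]

    # One (feature_dict, label) pair per tweet.
    samples = []
    for text, tokens in zip(texts, token_lists):
        present = set(tokens)
        features = {word: word in present for word in vocabulary}
        samples.append((features, label_fn(text)))

    return nltk.NaiveBayesClassifier.train(samples), vocabulary


def predict(classifier, vocabulary, text, stop_words=frozenset()):
    """Classify a single new tweet with a trained classifier."""
    present = set(preprocess(text, stop_words))
    return classifier.classify({word: word in present for word in vocabulary})


# Example call (assumes the CSV path and positive_tweet from the cells above):
# classifier, vocabulary = build_classifier(
#     'Sentiment140.csv/Sentiment140.csv', label_fn=positive_tweet, max_rows=10000)
# predict(classifier, vocabulary, "what a lovely day")
```

Returning the vocabulary together with the classifier matters: a new tweet has to be converted into exactly the same feature columns that were used during training. Exposing `n_features` as a parameter also makes it easy to experiment with different numbers of top features.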
