# Bag of Words Lab

### Introduction

In this lab, we'll practice the bag of words model on a corpus of tweets about airlines.  Let's get started.

### Loading the Data

Let's begin by loading our data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/nlp-text-representation/master/Tweets.csv"
df = pd.read_csv(url)

Let's take a look at some of the content in our dataset.  Select the first three rows of the dataset.

In [2]:
df[:3]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)


We can see that each row of the dataset describes one tweet about an airline.  The `airline_sentiment` column holds a sentiment label that has already been assigned, and the `text` column holds the content of each tweet.

Let's assign `documents` to equal the column of text, and assign `sentiment` to equal the `airline_sentiment` column. 

In [3]:
documents = df.text

In [4]:
sentiment = df.airline_sentiment

In [5]:
sentiment[:3]

0     neutral
1    positive
2     neutral
Name: airline_sentiment, dtype: object

Next let's get an overview of the different `sentiment` values by calling `value_counts` on that series.

In [6]:
sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

So we can see that most of the tweets are negative (9,178 of 14,640, roughly 63%).  Bummer.

Now, as we know, all of the feature and target values need to be numeric.  To coerce the text, we'll need our NLP techniques.  But we should be able to make the target numeric using our knowledge of pandas.  Change the target so that negative is represented by 0, neutral by 1, and positive by 2.

> Assign the coerced series to the variable `y`.

In [7]:
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}

In [8]:
y = df.airline_sentiment.map(sentiment_map)
y[:4]

0    1
1    2
2    1
3    0
Name: airline_sentiment, dtype: int64

In [9]:
y.value_counts()

0    9178
1    3099
2    2363
Name: airline_sentiment, dtype: int64

### Employ Bag of Words

Now we'll need to coerce our documents so that each one is represented by a bag of words.  Import `CountVectorizer` from the `sklearn.feature_extraction.text` module.  Then assign an instance of `CountVectorizer` to the variable `bow_vectorizer`.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer 

In [11]:
bow_vectorizer = CountVectorizer()

Assign the transformed bag of words documents to the variable `X_vectors`.

In [12]:
X_vectors = bow_vectorizer.fit_transform(documents)

In [13]:
X_vectors.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Get a sense of which words were encoded by using the `vocabulary_` attribute.

In [14]:
vocab = bow_vectorizer.vocabulary_

In [15]:
list(vocab.items())[:3]

[('virginamerica', 14273), ('what', 14551), ('dhepburn', 4804)]

Remember that `vocabulary_` maps each encoded token to its column index in the vectorized documents.  Show this by retrieving the number of times that `virginamerica` (the lowercased token for `VirginAmerica`) appears in the first tweet.

In [16]:
X_vectors.toarray()[0][14273]

1
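To make the encoding steps above concrete, here is a minimal from-scratch sketch of a bag of words encoder. It is not scikit-learn's implementation. It assumes lowercasing, tokens of two or more word characters, and an alphabetically sorted vocabulary, which should mirror `CountVectorizer`'s defaults. The names `toy_docs`, `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, and the second toy document is made up for illustration.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase the document and split it into word tokens."""
    return TOKEN_PATTERN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Turn each document into a list of token counts, one slot per vocabulary entry."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocab)
        for tok, n in counts.items():
            if tok in vocab:  # tokens unseen during fitting are ignored
                row[vocab[tok]] = n
        rows.append(row)
    return rows


# Hypothetical two-document corpus; the first is the first tweet in the dataset.
toy_docs = ["@VirginAmerica What @dhepburn said.", "What a great flight, great crew!"]
toy_vocab = fit_vocabulary(toy_docs)
toy_vectors = transform(toy_docs, toy_vocab)

print(toy_vocab)       # token -> column index
print(toy_vectors[1])  # counts for the second document
```

Looking up `toy_vectors[0][toy_vocab["virginamerica"]]` is the toy equivalent of indexing `X_vectors.toarray()[0][14273]` above. Note that the one-letter word "a" never enters the vocabulary under this token pattern.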

Ok, enough messing around.  Now let's split our data into training and test sets.  Use the `stratify` argument so that each sentiment appears in the same proportion in both sets, and set the test size to .2.

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, stratify = y, test_size = .2, random_state = 2)

In [21]:
X_train.shape, X_test.shape

((11712, 15051), (2928, 15051))

And let's look at the value counts of y in the test and train sets.

In [22]:
y_train.value_counts(normalize = True), y_test.value_counts(normalize = True)

(0    0.626878
 1    0.211663
 2    0.161458
 Name: airline_sentiment, dtype: float64,
 0    0.627049
 1    0.211749
 2    0.161202
 Name: airline_sentiment, dtype: float64)
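The matching proportions above are what stratification buys us. As a rough illustration, here is a from-scratch sketch of a stratified split: shuffle the indices of each class separately, then send the same fraction of every class to the test set. This is a simplified stand-in, not the algorithm `train_test_split` actually uses, and `stratified_split` and `toy_labels` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_indices, test_indices) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within the class only
        n_test = round(len(indices) * test_size)    # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 60 negative, 25 neutral, 15 positive.
toy_labels = [0] * 60 + [1] * 25 + [2] * 15
train_idx, test_idx = stratified_split(toy_labels)

train_share = Counter(toy_labels[i] for i in train_idx)
test_share = Counter(toy_labels[i] for i in test_idx)
print(train_share, test_share)
```

Without stratification, a random split of an imbalanced dataset can leave the smaller classes over- or under-represented in the test set, which makes the test score harder to trust.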

Now let's initialize a logistic regression model, setting `max_iter` to 500, and then train it on the training set.

> Assign the model to the variable `lr`.

In [23]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 500)

In [24]:
lr.fit(X_train, y_train)

LogisticRegression(max_iter=500)

Now let's check the accuracy of the model on the test dataset.  

In [25]:
lr.score(X_test, y_test)

0.798155737704918
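To put this accuracy in context, it helps to compare it with a baseline that always predicts the most common class. The sketch below rebuilds the label counts reported by `value_counts` earlier and computes that baseline with a hand-written `accuracy` function (a hypothetical helper, equivalent in spirit to what `score` reports for a classifier).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Label counts taken from the value_counts output earlier in the lab.
class_counts = {0: 9178, 1: 3099, 2: 2363}
y_true = [label for label, n in class_counts.items() for _ in range(n)]

# Baseline: always predict the majority class (negative).
majority = Counter(y_true).most_common(1)[0][0]
baseline = accuracy(y_true, [majority] * len(y_true))
print(majority, round(baseline, 3))
```

Always guessing negative is right about 63% of the time on the full dataset, so a test accuracy near 0.80 shows the word counts are carrying real signal.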

### Model Interpretation

In a multiclass classification model like the one we have here, there are three different sets of parameters, one per class.

In [43]:
lr.coef_.shape

(3, 15051)

> We can see that three sets of coefficients were trained, each with one coefficient per feature: the first to predict the negative tweets, the second to predict the neutral tweets, and the third to predict the positive tweets.
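One common way these three rows of coefficients turn into a prediction is the softmax scheme sketched below: each row produces a score for its class, and the scores are normalized into probabilities. Whether scikit-learn uses exactly this scheme depends on the version and the `multi_class` setting, so treat this as an illustration under that assumption. All weights, intercepts, and feature names here are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, coef, intercept):
    """Score each class as dot(coef_row, x) + intercept, then apply softmax."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(coef, intercept)
    ]
    return softmax(scores)


# Hypothetical 3 x 3 coefficient matrix: rows are negative, neutral, positive;
# columns are counts of the tokens "worst", "thanks", "flight".
toy_coef = [
    [2.0, -1.0, 0.0],
    [0.0, 0.0, 0.5],
    [-1.5, 2.0, 0.0],
]
toy_intercept = [0.5, 0.0, -0.5]

x = [1, 0, 1]  # a tweet containing "worst" and "flight" once each
probs = predict_proba(x, toy_coef, toy_intercept)
predicted = max(range(len(probs)), key=lambda k: probs[k])
print(probs, predicted)
```

A large positive coefficient in a class's row pushes that class's score up whenever the token appears, which is why sorting a row of `coef_` surfaces the tokens most associated with that class.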

Let's select the first row of coefficients, which corresponds to class 0 (negative), pair it with the associated feature names, and sort from highest to lowest, selecting the top 20.

In [36]:
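# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
pd.Series(lr.coef_[0], bow_vectorizer.get_feature_names_out()).sort_values(ascending = False)[:20]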

worst           2.214460
nothing         1.543403
delayed         1.463210
hours           1.414857
ridiculous      1.399412
terrible        1.362820
fail            1.351004
suck            1.347511
unacceptable    1.335372
sucks           1.291287
fix             1.266558
answer          1.259342
unless          1.231424
paid            1.229213
hold            1.228098
rude            1.219338
worse           1.208478
disappointed    1.204609
luggage         1.179982
frustrated      1.179896
dtype: float64

Now do the same thing for features of the positive tweets.  Select the top twenty features of the positive tweets and display their corresponding scores.

In [39]:
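# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
pd.Series(lr.coef_[-1], bow_vectorizer.get_feature_names_out()).sort_values(ascending = False)[:20]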

thank         2.224737
awesome       2.152502
thanks        2.013645
great         1.828529
worries       1.812265
thnx          1.693800
amazing       1.672030
excellent     1.670677
best          1.667961
love          1.666248
cool          1.548519
appreciate    1.522806
thx           1.501574
wonderful     1.489044
kudos         1.363512
thankful      1.302682
loved         1.221681
refunded      1.215347
happy         1.166732
sweet         1.152855
dtype: float64
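The ranking in these cells amounts to pairing each feature name with its coefficient and sorting by the coefficient. A plain-Python sketch of the same idea follows. The `top_features` helper and the toy weights are hypothetical and are not taken from the trained model.

```python
def top_features(weights, names, k=3):
    """Return the k (name, weight) pairs with the largest weights."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical coefficients for the negative class.
toy_names = ["thanks", "delayed", "flight", "worst", "great"]
toy_negative_weights = [-1.8, 1.5, 0.1, 2.2, -1.6]

# Tokens that push hardest toward the negative class.
top_negative = top_features(toy_negative_weights, toy_names, k=2)

# Flipping the sign surfaces the tokens that push hardest away from it.
least_negative = top_features([-w for w in toy_negative_weights], toy_names, k=2)

print(top_negative)
print([name for name, _ in least_negative])
```

Looking at the bottom of a class's ranking can be as informative as the top: the most negative coefficients mark tokens whose presence argues against that class.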

### Using Ngrams

Ok, now let's move to using ngrams and stop words in our model.  Initialize a new `CountVectorizer`, this time one that splits each document into ngrams of one and two words.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizor = CountVectorizer(ngram_range = (1, 2))

In [27]:
X_ngrams = ngram_vectorizor.fit_transform(documents)

We can see that there are now many more features to represent each document.

In [28]:
X_ngrams.shape

(14640, 117630)

> Initially, we had only 15,051 features.

In [29]:
X_vectors.shape

(14640, 15051)
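The jump in feature count comes from every pair of adjacent words becoming its own feature alongside the single words. Here is a small from-scratch sketch of how word ngrams can be generated for a range such as `(1, 2)`. It is an approximation of what `CountVectorizer` does, assuming the same two-or-more-character token pattern, and `word_ngrams` is a hypothetical helper. The example sentence is made up.

```python
import re

# Assumed to mirror CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(doc, ngram_range=(1, 2)):
    """Return all word ngrams whose length falls within ngram_range (inclusive)."""
    tokens = TOKEN_PATTERN.findall(doc.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams("Still waiting on hold for hours")
print(grams)

# Passing (2, 3) keeps only pairs and triples, with no single words.
longer = word_ngrams("Still waiting on hold for hours", ngram_range=(2, 3))
print(longer)
```

A six-word document yields six unigrams and five bigrams. Across thousands of tweets, most bigrams are rare, which is why the vocabulary grows so much faster than the number of distinct words.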

Ok, now let's split our data into training and test sets, still using `stratify`, and setting the `test_size` to .2.  Set the `random_state` to 2.

In [33]:
from sklearn.model_selection import train_test_split

X_train_ngrams, X_test_ngrams, y_train_ngrams, y_test_ngrams = train_test_split(X_ngrams, y,
                                                                                stratify = y, test_size = .2, random_state = 2)

Initialize a new logistic regression model, setting the maximum iterations to 1000.  Train the model on the training set.

In [36]:
from sklearn.linear_model import LogisticRegression

lr_ngrams = LogisticRegression(max_iter = 1000).fit(X_train_ngrams, y_train_ngrams)

Score the model on the test set.

In [37]:
lr_ngrams.score(X_test_ngrams, y_test_ngrams)
# 0.8114754098360656

0.8155737704918032

We can see that the model performs slightly better with the bag of ngrams representation than the 0.798 it scored with single words alone.

Now, let's take a look at the top thirty features of this model for the negative class.

In [95]:
# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names()
pd.Series(lr_ngrams.coef_[0], ngram_vectorizor.get_feature_names_out()).sort_values(ascending = False)[:30]

# delayed         1.565812
# worst           1.341883
# nothing         1.245329
# hours           1.237040
# sucks           1.138270
# delay           1.137530
# not             1.098006
# doesn           1.092324
# no              1.073342
# lost            1.062731
# suck            1.060646
# luggage         0.982798
# rude            0.955572
# why             0.954224
# stop            0.933671
# again           0.928162
# terrible        0.923423
# bags            0.905553
# hour            0.903597
# days            0.864337
# cancelled       0.849927
# website         0.849439
# answer          0.834583
# ridiculous      0.833111
# customers       0.818718
# the worst       0.810770
# paid            0.799298
# unacceptable    0.799168
# hrs             0.799022
# fail            0.795682
# dtype: float64

on hold                1.616944
the worst              1.324803
late flight            1.214466
my bag                 1.159465
cancelled flightled    1.119606
due to                 1.076597
my luggage             1.049506
cancelled flighted     1.036410
with no                0.923594
no one                 0.906188
customer service       0.903545
late flightr           0.896845
is not                 0.871396
this is                0.847528
my bags                0.844569
call back              0.836916
and no                 0.815591
delayed flight         0.785795
waiting for            0.779686
no response            0.768646
usairways you          0.759591
hour delay             0.755366
usairways your         0.754920
still waiting          0.748263
an hour                0.737140
for hours              0.732473
not even               0.731768
why do                 0.730159
united not             0.706476
had to                 0.706439
dtype: float64

### Using Stop Words

Now, without much guidance, use the `CountVectorizer` with stop words removed.  Split the data into training and test sets, and see how the model scores.

In [72]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ngram_vectorizor_stop = CountVectorizer(ngram_range = (1, 2), stop_words=ENGLISH_STOP_WORDS)

In [73]:
X_ngram_stop = ngram_vectorizor_stop.fit_transform(documents)

In [74]:
from sklearn.model_selection import train_test_split

lr_stop_X_train, lr_stop_X_test, lr_stop_y_train, lr_stop_y_test = train_test_split(X_ngram_stop, y, stratify = y, test_size = .2)

In [75]:
lr = LogisticRegression(max_iter = 1000).fit(lr_stop_X_train, lr_stop_y_train)

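Scoring the model shows that it performs slightly worse with stop words removed.  Because this split does not set a `random_state`, the exact score varies from run to run.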

In [76]:
lr.score(lr_stop_X_test, lr_stop_y_test)
# 0.7817622950819673

0.7916666666666666
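One plausible reason for the drop is that stop words are filtered out before ngrams are built, so phrases that depend on small words disappear and new pairs form across the gaps. The sketch below illustrates that ordering with a tiny made-up stop list. `TOY_STOP_WORDS` is hypothetical and much shorter than `ENGLISH_STOP_WORDS`, and the helper is a simplified stand-in for the vectorizer.

```python
import re

# Assumed to mirror CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical stop list for illustration only.
TOY_STOP_WORDS = {"on", "for", "is", "not", "no", "the"}


def ngrams_without_stop_words(doc, stop_words, ngram_range=(1, 2)):
    """Drop stop words first, then build word ngrams from what remains."""
    tokens = [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


kept = ngrams_without_stop_words("Still waiting on hold for hours", TOY_STOP_WORDS)
print(kept)
```

With this toy list, the bigram "on hold" can no longer be a feature, and "waiting hold" appears in its place. If negations and short function words carry sentiment in tweets, removing them may cost more than it saves.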

> Bonus:  

Even though longer ngrams may give us a less accurate model, they may give us a better understanding of what leads to a positive or a negative sentiment.  Change the `ngram_range` above to `(2, 3)`, so that we don't train on any individual words.  Then look at the most influential negative features.

### Summary

In this lab, we practiced using the `CountVectorizer` in sklearn.  We saw that we can use it both to classify tweets as negative, neutral, or positive and to understand which words are associated with each sentiment.  We specified parameters such as `ngram_range` and `stop_words` in our `CountVectorizer`.  We found that the logistic regression model performed best with an `ngram_range` of (1, 2) and without removing stop words.
