# Bag of Words Lab

### Introduction

In this lab, we'll practice applying the bag of words model to a corpus of tweets about an airline. Let's get started.

### Loading the Data

Let's begin by loading our data.

In [1]:
import pandas as pd

df = pd.read_csv('./Tweets.csv')

In [41]:
documents = df.text

In [7]:
df.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [13]:
sentiment_map = {'negative': 0, 'neutral': 1, 'positive': 2}

In [14]:
y = df.airline_sentiment.map(sentiment_map)

### Employ Bag of Words

In [18]:
from sklearn.feature_extraction.text import CountVectorizer 

In [19]:
bow = CountVectorizer()

In [20]:
X_vectors = bow.fit_transform(documents)

In [22]:
X_vectors.toarray()[:2]

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [23]:
from sklearn.model_selection import train_test_split

# Stratify on y so the train and test sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(X_vectors, y, stratify = y)

In [26]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 500)

In [27]:
lr.fit(X_train, y_train)

lr.score(X_test, y_test)

0.7956284153005464
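
To put this score in context, it helps to compare it with a model that always predicts the most frequent class. The sketch below computes that baseline from the class counts printed earlier; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
# negative 0.6269
```

Because the split is stratified, the test set should have roughly the same class mix, so the logistic regression improves on a baseline of about 0.63.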

### Model Interpretation

In [36]:
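# coef_[0] is the negative class (label 0); get_feature_names() was removed in scikit-learn 1.2.
pd.Series(lr.coef_[0], bow.get_feature_names_out()).sort_values(ascending = False)[:20]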

worst           2.214460
nothing         1.543403
delayed         1.463210
hours           1.414857
ridiculous      1.399412
terrible        1.362820
fail            1.351004
suck            1.347511
unacceptable    1.335372
sucks           1.291287
fix             1.266558
answer          1.259342
unless          1.231424
paid            1.229213
hold            1.228098
rude            1.219338
worse           1.208478
disappointed    1.204609
luggage         1.179982
frustrated      1.179896
dtype: float64

In [39]:
# coef_[-1] is the positive class (label 2).
pd.Series(lr.coef_[-1], bow.get_feature_names_out()).sort_values(ascending = False)[:20]

thank         2.224737
awesome       2.152502
thanks        2.013645
great         1.828529
worries       1.812265
thnx          1.693800
amazing       1.672030
excellent     1.670677
best          1.667961
love          1.666248
cool          1.548519
appreciate    1.522806
thx           1.501574
wonderful     1.489044
kudos         1.363512
thankful      1.302682
loved         1.221681
refunded      1.215347
happy         1.166732
sweet         1.152855
dtype: float64

### Using Ngrams

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizor = CountVectorizer(ngram_range = (1, 2))

In [42]:
X_ngrams = ngram_vectorizor.fit_transform(documents)

In [43]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_ngrams, y, stratify = y)

In [45]:
from sklearn.linear_model import LogisticRegression

lr_ngrams = LogisticRegression(max_iter = 1000).fit(X_train, y_train)

lr_ngrams.score(X_test, y_test)

0.8024590163934426

In [46]:
# coef_[0] is the negative class; get_feature_names() was removed in scikit-learn 1.2.
pd.Series(lr_ngrams.coef_[0], ngram_vectorizor.get_feature_names_out()).sort_values(ascending = False)[:20]

delayed     1.358320
worst       1.353864
suck        1.214716
nothing     1.205090
delay       1.176671
hours       1.168483
sucks       1.125183
doesn       1.124958
luggage     1.115510
not         1.098714
lost        1.088311
no          1.084192
why         1.073450
stop        1.024607
bags        0.947827
on hold     0.934132
again       0.920254
hour        0.916932
terrible    0.906476
missing     0.905110
dtype: float64

### Resources

[Five-part spaCy tutorial (part 5: stop words)](https://medium.com/@makcedward/nlp-pipeline-stop-words-part-5-d6770df8a936)

[Kaggle: Amazon Fine Food Reviews](https://www.kaggle.com/snap/amazon-fine-food-reviews)

[Dataquest: Text classification in Python using spaCy](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)