# Applying TFIDF

### Introduction

In this lesson, we'll apply tf-idf to predict the ratings of our Amazon reviews, and compare how well it performs against our bag of words representation.  Let's get started.

### Loading the data

Let's start by loading up our coconut water reviews data.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/nlp-text-representation/master/coconut_water.csv"
coconut_df = pd.read_csv(url, index_col = 0)

Looking at the first couple of reviews, we see that with each review we have a rating ranging from one to five, and corresponding text.

In [4]:
coconut_df[:2]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
47836,47837,B004SRH2B6,AKACGHPVILE9R,"Sophronia ""Euphemia""",1,1,1,1314144000,Switched to O.N.E.,Must admit the taste of O.N.E. coconut water i...
47837,47838,B004SRH2B6,A2GO0AIHB846UX,vinny,1,1,5,1313884800,WOW!!,I love this stuff! Perfect blend of dark choc...


Let's assign our Text column to the variable documents.

In [5]:
documents = coconut_df.Text

And assign the `Score` to the target $y$.

In [6]:
y = coconut_df.Score

### Loading the library

Now let's load up our spacy library, including the spacy stop words.

In [7]:
from spacy.lang.en.stop_words import STOP_WORDS
import spacy 

nlp = spacy.load("en_core_web_sm")

Next define a `spacy_tokenizer` that parses each document and returns a list of the lemmas of each word.  In addition, make sure that each lemma is lowercased and stripped of any whitespace.

In [85]:
def spacy_tokenizer(document):
    tokens = [word.lemma_.lower().strip() for word in nlp(document)]
    return tokens

In [87]:
spacy_tokenizer(document)

['must',
 'admit',
 'the',
 'taste',
 'of',
 'o.n.e.',
 'coconut',
 'water',
 'be',
 'well',
 '.',
 '',
 'take',
 'a',
 'long',
 'time',
 'to',
 'get',
 'through',
 'the',
 'supply',
 'of',
 'coconut',
 'water',
 '.']

> Notice that this output contains an empty string.  Change the tokenizer so that it only returns a token if its lowercased, stripped lemma is not an empty string.

In [88]:
def spacy_tokenizer_content(document):
    tokens = [word.lemma_.lower().strip() for word in nlp(document) if word.lemma_.lower().strip()]
    return tokens

In [89]:
spacy_tokenizer_content(document)

['must',
 'admit',
 'the',
 'taste',
 'of',
 'o.n.e.',
 'coconut',
 'water',
 'be',
 'well',
 '.',
 'take',
 'a',
 'long',
 'time',
 'to',
 'get',
 'through',
 'the',
 'supply',
 'of',
 'coconut',
 'water',
 '.']

Notice that we could also remove the periods by only returning tokens that return true for `is_alpha`.
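As a rough sketch of that idea: spaCy documents `is_alpha` as equivalent to calling `isalpha()` on the token's text, so the effect can be previewed on a plain list of strings without loading a model. The helper name `keep_alpha` and the sample list are illustrative assumptions, not part of the lesson's code.

```python
# Illustrative only: preview the effect of an is_alpha filter on plain strings.
# With spaCy tokens, the condition inside the comprehension would be `word.is_alpha`.
def keep_alpha(tokens):
    """Keep only tokens made up entirely of alphabetic characters."""
    return [token for token in tokens if token.isalpha()]

sample_tokens = ['must', 'admit', 'the', 'taste', 'of', 'o.n.e.',
                 'coconut', 'water', 'be', 'well', '.']
alpha_tokens = keep_alpha(sample_tokens)
print(alpha_tokens)
```

Note that this kind of filter drops more than periods: a token such as `o.n.e.` is also removed because it contains punctuation.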

### Splitting the data

Let's now split the data into training and test sets, using the `stratify` parameter so that each rating appears in the same proportion in both sets.  Set the `test_size` to $.2$ and `random_state` to 1.

> Because our documents are not yet numeric, name the split features `documents_train` and `documents_test`.

In [95]:
from sklearn.model_selection import train_test_split

documents_train, documents_test, y_train, y_test = train_test_split(documents,  y, 
                                                                    stratify = y,
                                                                    test_size = .2, random_state = 1)
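To see what `stratify` does, here is a small sketch on made-up labels (the `toy_` names and the 10/40 class split are assumptions for illustration, not the coconut water data). It counts the labels on each side of the split.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10 one-star and 40 five-star reviews.
toy_y = [1] * 10 + [5] * 40
toy_docs = [f"review {i}" for i in range(len(toy_y))]

docs_train, docs_test, toy_y_train, toy_y_test = train_test_split(
    toy_docs, toy_y, stratify=toy_y, test_size=.2, random_state=1)

# Count how many reviews of each rating landed in each split.
train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts, test_counts)
```

With stratification, both splits keep the original 1:4 ratio of one-star to five-star labels, which matters when some ratings are rare.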

Now let's transform the text using the TF-IDF vectorizer.  We won't remove stop words, and for now we'll keep the default of single-word features (unigrams).  Use the `spacy_tokenizer` defined above as the tokenizer, and set `decode_error` to `ignore` so that any characters that cannot be decoded are skipped instead of raising an error.

In [172]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer, decode_error = 'ignore')

Now call `fit_transform` on the `documents_train`.

In [173]:
vectors_train = vectorizer.fit_transform(documents_train)

In [174]:
vectors_train.shape

(364, 2741)

> We can see that we have a training set of 364 documents, each with 2,741 features.

Now remember that TF-IDF gives each word in a document a score based on how often the word appears in that document and how rare the word is across all of the documents.

> So we can see how these words are encoded by passing the transformed training data, along with the feature names, into a dataframe.
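As a minimal sketch of that scoring, the following computes tf-idf by hand for three made-up documents. It assumes the smoothed idf that scikit-learn uses by default, $\ln\frac{1 + n}{1 + df} + 1$, followed by L2 normalization; the `toy_docs` data and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy documents.
toy_docs = [
    ["coconut", "water", "great"],
    ["coconut", "water", "plastic"],
    ["coconut", "taste"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    counts = Counter(doc)
    # Smoothed idf (assumed scikit-learn default): ln((1 + n) / (1 + df)) + 1
    raw = {word: tf * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
           for word, tf in counts.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf(toy_docs[1])
print(weights)
```

In the second toy document each word appears once, so the ordering comes entirely from rarity: `plastic` (one document) outweighs `water` (two documents), which outweighs `coconut` (all three).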

In [175]:
df_vectors = pd.DataFrame(vectors_train.toarray(), columns = vectorizer.get_feature_names())
df_vectors[:5]

Unnamed: 0,Unnamed: 1,!,"""","""<br",#,$,$1.00,%,&,',...,zeco,zero,zica,zico,"zico""",zico-,zico.<br,zicos,zito,zombie
0,0.188759,0.06999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.062303,0.0,0.0,0.0,0.0,0.0,0.0
1,0.126623,0.0,0.0,0.0,0.0,0.0,0.0,0.144329,0.0,0.0,...,0.0,0.0,0.0,0.125383,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.050544,0.0,0.0,0.0,0.0,0.0,0.0
3,0.097159,0.270191,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.096207,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.078257,0.0,0.0,0.0,0.080189,0.0,0.0,0.041222,0.0,...,0.0,0.0,0.0,0.087078,0.0,0.0,0.0,0.0,0.0,0.0


From here we can see some of the words given the highest values.  

In [176]:
df_vectors.iloc[0].sort_values(ascending = False)[:10]

order          0.232879
good!<br       0.225073
/>cancelled    0.225073
plastic        0.222743
recipe         0.210369
-pron-         0.193645
today          0.191846
auto           0.191846
               0.188759
delivery       0.174802
Name: 0, dtype: float64

In [177]:
df_vectors.iloc[1].sort_values(ascending = False)[:10]

by          0.242019
about       0.229052
popular     0.211680
talk        0.211680
puree       0.211680
mixer       0.211680
40          0.201183
everyone    0.175891
as          0.172758
light       0.171593
Name: 1, dtype: float64

### Training a model

Now let's train a logistic regression model.

In [178]:
from sklearn.linear_model import LogisticRegression

In [179]:
model = LogisticRegression()
model.fit(vectors_train, y_train)

LogisticRegression()

Now we haven't yet encoded our test set, so use the `vectorizer` to apply the same transformation to the `documents_test`. 

In [180]:
vectors_test = vectorizer.transform(documents_test)
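A small sketch of what `transform` does with words it did not see during `fit`. It uses a toy vectorizer with scikit-learn's default tokenizer rather than the spaCy one, and the `toy_` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(["coconut water"])

# "zombie" never appeared during fit, so it has no column and is dropped.
toy_test = toy_vectorizer.transform(["coconut zombie"])
vocab = toy_vectorizer.vocabulary_
row = toy_test.toarray()[0]
print(vocab, row)
```

The test matrix always has the same columns as the training matrix, which is why the model trained on `vectors_train` can score `vectors_test` directly.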

In [181]:
model.score(vectors_test, y_test)

0.6304347826086957

Now let's look at some of the most important features for the lowest rating.  Our ratings range from one to five, so the first row of `model.coef_` corresponds to a rating of 1.  Create a series of the words with the largest coefficients, and select the top 20.

In [182]:
pd.Series(model.coef_[0], vectorizer.get_feature_names()).sort_values(ascending = False)[:20]

plastic        1.244543
concentrate    1.111643
from           0.758279
"              0.738002
buy            0.726199
.              0.683141
new            0.680499
bottle         0.674728
product        0.654184
this           0.628896
sip            0.616454
ever           0.595003
?              0.574656
terrible       0.570207
what           0.567574
finish         0.544019
disgusting     0.538367
taste          0.509934
horrible       0.471150
money          0.462101
dtype: float64
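To check which rating each row of `coef_` belongs to, we can compare it against `classes_`. The sketch below uses made-up numeric data (the `toy_` names and values are assumptions) rather than our review vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 6 samples, 2 features, labels 1, 3 and 5.
toy_X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 0.], [2., 1.]])
toy_y = np.array([1, 1, 3, 3, 5, 5])

toy_model = LogisticRegression().fit(toy_X, toy_y)

# classes_ lists the labels in sorted order; row i of coef_ belongs to classes_[i].
print(toy_model.classes_)
print(toy_model.coef_.shape)
```

So indexing with `coef_[0]` selects the coefficients for the smallest label and `coef_[-1]` the largest, whatever those labels happen to be.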

Now select the top ten features for the five-star reviews, which correspond to the last row of `model.coef_`.

In [183]:
pd.Series(model.coef_[-1], vectorizer.get_feature_names()).sort_values(ascending = False)[:10]

chocolate    1.574490
!            1.399597
great        1.395714
love         1.034309
and          0.740629
delicious    0.727754
favorite     0.596369
more         0.554068
healthy      0.533834
for          0.533679
dtype: float64

### Only Bag of Words

Now let's see if we perform any better using the old technique of bag of words.  Use a CountVectorizer along with the `spacy_tokenizer` and set the `decode_error` to `ignore`.

In [185]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_bow = CountVectorizer(tokenizer = spacy_tokenizer,
                                    decode_error = 'ignore'
                                    )


Now fit and transform the `documents_train`.

In [186]:
vectors_bow_train = vectorizer_bow.fit_transform(documents_train)

And transform the `documents_test` by the vectorizer.

In [187]:
vectors_bow_test = vectorizer_bow.transform(documents_test)

Let's see how this performs.  Train a logistic regression model and score it on the test set.  Set the maximum number of iterations at `2000`. 

In [188]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter = 2000)
model.fit(vectors_bow_train, y_train)

LogisticRegression(max_iter=2000)

In [189]:
model.score(vectors_bow_test, y_test)

0.6521739130434783

> We can see a slight bump in the score, from about 0.63 with TF-IDF to about 0.65 with bag of words.

Now let's see if we can do any better using regularization.  Let's use the LogisticRegressionCV model for this.

We can get a sense of how the regularization works from the documentation:
> Each of the values in Cs describes the inverse of regularization strength. If Cs is as an int, then a grid of Cs values are chosen in a logarithmic scale between 1e-4 and 1e4. Like in support vector machines, smaller values specify stronger regularization.

In [195]:
from sklearn.linear_model import LogisticRegressionCV

model_cv = LogisticRegressionCV(max_iter=2000)

In [196]:
model_cv.fit(vectors_bow_train, y_train)

LogisticRegressionCV(max_iter=2000)

In [197]:
model_cv.score(vectors_bow_test, y_test)

0.6521739130434783

We can see that the model used a range of regularization parameters.

In [200]:
model_cv.Cs_.round(2)

array([0.00000e+00, 0.00000e+00, 1.00000e-02, 5.00000e-02, 3.60000e-01,
       2.78000e+00, 2.15400e+01, 1.66810e+02, 1.29155e+03, 1.00000e+04])

But a relatively high value, about 2.78, was chosen for each class.  Because `C` is the inverse of regularization strength, this shows that not a lot of regularization was used.

In [206]:
model_cv.C_

array([2.7825594, 2.7825594, 2.7825594, 2.7825594, 2.7825594])
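The candidate grid can be reproduced with NumPy. This sketch assumes the default `Cs=10`, which, per the documentation quoted above, means ten values on a logarithmic scale between 1e-4 and 1e4; the name `candidate_Cs` is ours.

```python
import numpy as np

# Assuming the default Cs=10: ten values evenly spaced on a log scale from 1e-4 to 1e4.
candidate_Cs = np.logspace(-4, 4, 10)
print(candidate_Cs.round(2))
print(candidate_Cs[5])
```

The sixth value in this grid is the one that cross-validation selected for every class.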

### Using Ngrams

Now let's use bag of words with ngrams to see if this helps at all.

In [209]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_bow_ngram = CountVectorizer(tokenizer = spacy_tokenizer,
                                    decode_error = 'ignore',
                                 ngram_range = (1, 2))
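As a quick sketch of what `ngram_range = (1, 2)` adds, here is a toy vectorizer fit on a single made-up phrase. It uses scikit-learn's default tokenizer instead of `spacy_tokenizer` to stay self-contained, and the `toy_` name is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default tokenizer here (not the spaCy one), to keep the sketch self-contained.
toy_ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
toy_ngram_vectorizer.fit(["do not buy"])

# The vocabulary holds the unigrams plus every pair of adjacent words.
ngram_features = sorted(toy_ngram_vectorizer.vocabulary_)
print(ngram_features)
```

Bigrams such as `not buy` let the model pick up on negation that single words miss, at the cost of many more columns.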


In [214]:
vectorizer_ngram_train = vectorizer_bow_ngram.fit_transform(documents_train)

In [215]:
vectorizer_ngram_test = vectorizer_bow_ngram.transform(documents_test)

In [216]:
from sklearn.linear_model import LogisticRegressionCV

model_cv = LogisticRegressionCV(max_iter=2000)
model_cv.fit(vectorizer_ngram_train, y_train)

LogisticRegressionCV(max_iter=2000)

In [217]:
model_cv.score(vectorizer_ngram_test, y_test)

0.6195652173913043

> We see that the score slightly decreases.

Finally, let's look at some of the most important features when someone gives a rating of 1.

In [218]:
pd.Series(model_cv.coef_[0],vectorizer_bow_ngram.get_feature_names()).sort_values(ascending = False)[:10]

plastic             0.977640
concentrate         0.891844
new                 0.855902
from concentrate    0.843802
what                0.822078
not buy             0.773720
bottle              0.718263
taste               0.672921
this                0.671672
ever                0.637346
dtype: float64

In [220]:
pd.Series(model_cv.coef_[-1],vectorizer_bow_ngram.get_feature_names()).sort_values(ascending = False)[:20]

great        1.430512
chocolate    1.335300
!            0.989884
on           0.899702
the good     0.816733
with         0.806251
all          0.789968
and          0.749770
delicious    0.749574
's           0.743754
be -pron-    0.664415
favorite     0.626084
for          0.610332
calorie      0.587700
in           0.580491
more         0.579082
can          0.554747
-pron- do    0.540177
/>i          0.533376
have be      0.513296
dtype: float64

### Summary

In this lesson, we used TF-IDF with our coconut water reviews dataset.  We saw that our model performed slightly better by simply using a bag of words text representation.  We also saw little benefit from regularization.  We were able to get a sense of what people liked and did not like about the product by looking at the most significant features, both when ngrams were used and when they weren't.