# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering with existing Turi Create functions.

In this notebook you will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

* Use SFrames to do some feature engineering
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

Let's get started!


In [58]:
import pandas as pd
import numpy as np
import string
import math
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import json


In [59]:
dtype_dict = {'name':str, 'review':str, 'rating':int }

In [60]:
products = pd.read_csv("amazon_baby.csv", dtype = dtype_dict)

In [61]:
products

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


In [62]:
products[269:270]

Unnamed: 0,name,review,rating
269,The First Years Massaging Action Teether,A favorite in our house!,5


## Build the word count vector for each review

In [63]:
import string 
def remove_punctuation(text):
    try: # python 2.x
        text = text.translate(None, string.punctuation) 
    except: # python 3.x
        translator = text.maketrans('', '', string.punctuation)
        text = text.translate(translator)
    return text

#review_without_punctuation = products['review'].apply(remove_punctuation)
#products['word_count'] = turicreate.text_analytics.count_words(review_without_punctuation)

IMPORTANT. Make sure to fill n/a values in the review column with empty strings (if applicable). The n/a values indicate empty reviews. For instance, Pandas's the fillna() method lets you replace all N/A's in the review columns as follows:

In [64]:
products = products.fillna({'review':''})

In [65]:
products['review_clean'] = products['review'].astype(str).apply(remove_punctuation)

In [66]:
products.isnull().sum() ## To check the number of empty values in each column

name            318
review            0
rating            0
review_clean      0
dtype: int64

In [67]:
products[products['review'].isnull() == True] ## Print the data for which the value is empty

Unnamed: 0,name,review,rating,review_clean


In [68]:
products[38:39]

Unnamed: 0,name,review,rating,review_clean
38,SoftPlay Twinkle Twinkle Elmo A Bedtime Book,,5,


## Extract sentiments

We will **ignore** all reviews with *rating = 3*, since they tend to have a neutral sentiment.
We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment. In SFrame, for instance,

In [69]:
products = products[products['rating'] != 3]

In [70]:
len(products)

166752

In [71]:
set(products['rating'])

{1, 2, 4, 5}

In [72]:
set(products['rating'])

{1, 2, 4, 5}

4. Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column. In SFrame, you would use apply():

In [73]:
products['sentiment'] = products['rating'].apply(lambda x: +1 if x > 3 else -1)

In [74]:
products[products['rating'] == 2].head()

Unnamed: 0,name,review,rating,review_clean,sentiment
21,Nature's Lullabies Second Year Sticker Calendar,I only purchased a second-year calendar for my...,2,I only purchased a secondyear calendar for my ...,-1
41,"SoftPlay Giggle Jiggle Funbook, Happy Bear",This bear is absolutely adorable and I would g...,2,This bear is absolutely adorable and I would g...,-1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,Would not purchase again or recommend The deca...,-1
78,Cloth Diaper Pins Stainless Steel Traditional ...,These were good quality--worked fine--heavy d...,2,These were good qualityworked fineheavy dutya...,-1
80,Cloth Diaper Pins Stainless Steel Traditional ...,"While the diaper pins are attractive, the meta...",2,While the diaper pins are attractive the metal...,-1


5. Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. If you are using SFrame, make sure to use seed=1 so that you get the same result as everyone else does. (This way, you will get the right numbers for the quiz.)

In [75]:
!ls

[1m[32mCLA02-NB01.ipynb[m[m
Classification Resources.ipynb
Week1_logistic_Regression_assignment_1.ipynb
_1ccb9ec834e6f4b9afb46f4f5ab56402_module-2-assignment-test-idx.json.zip
_1ccb9ec834e6f4b9afb46f4f5ab56402_module-2-assignment-train-idx.json.zip
_35bdebdff61378878ea2247780005e52_amazon_baby.gl.zip
_596922c59a7068d4eb7baa104c01a685_amazon_baby.csv.zip
_5f6530087fcb6e7c03d5925fcee9c692_logistic-regression-model-annotated.pdf
[1m[32mamazon_baby.csv[m[m
[1m[34mamazon_baby.gl[m[m
[1m[34mamazon_baby.sframe[m[m
intro.pdf
module-2-assignment-test-idx.json
module-2-assignment-train-idx.csv
module-2-assignment-train-idx.json
module-2-assignment-train-idx.txt
ogistic-regression-model-annotated.pdf
xNzmahsUS8Kc5mobFGvCag_c9cb67842a5d46cfa9e0013faf338b5a_CLA02-NB01.ipynb.zip
z2MqguI0Eemm5A4ynZyB2A_fb01f0fbfc2b4701913139e2e54281af_amazon_baby.sframe.zip


In [76]:
# open output file for reading
with open('module-2-assignment-train-idx.json', 'r') as filehandle:
    train_index = json.load(filehandle)

In [77]:
train_data = products.iloc[train_index]

In [78]:
# open output file for reading
with open('module-2-assignment-test-idx.json', 'r') as filehandle:
    test_index = json.load(filehandle)

In [79]:
test_data = products.iloc[test_index]

In [80]:
train_data.shape

(133416, 5)

In [81]:
test_data.shape

(33336, 5)

In [82]:
#train_data, test_data = train_test_split(products_without_3, test_size = 0.2, random_state = 0)

 # Train a sentiment classifier with logistic regression
 
 We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
Compute the occurrences of the words in each review and collect them into a row vector.
Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.
The following cell uses CountVectorizer in scikit-learn. Notice the token_pattern argument in the constructor.

In [83]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
     # Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])


In [84]:
print(len(vectorizer.get_feature_names()))

121712


In [85]:
train_matrix.shape

(133416, 121712)

In [86]:
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

7. Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix)as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

In [109]:
## Instansiate the model
sentiment_model = LogisticRegression()

In [110]:
sentiment_model.fit(train_matrix, train_data['sentiment'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients.

Quiz question: How many weights are >= 0?

In [111]:
sentiment_model.coef_

array([[-6.45685548e-01,  1.11368336e-03,  8.85726835e-03, ...,
         1.53742345e-03,  1.11041598e-03, -4.15427957e-04]])

In [112]:
weights = sentiment_model.coef_

In [113]:
weights = weights.flatten()

In [114]:
weights_positive = [num for num in weights if num >= 0]

In [115]:
len(weights_positive)

90358

In [116]:
len(weights)

121712

## Making predictions with logistic regression

9. Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the SFrame test_data and print their content:

In [117]:
sample_test_data = test_data[10:13]
print (sample_test_data)

                                                 name  \
59                          Our Baby Girl Memory Book   
71  Wall Decor Removable Decal Sticker - Colorful ...   
91  New Style Trailing Cherry Blossom Tree Decal R...   

                                               review  rating  \
59  Absolutely love it and all of the Scripture in...       5   
71  Would not purchase again or recommend. The dec...       2   
91  Was so excited to get this product for my baby...       1   

                                         review_clean  sentiment  
59  Absolutely love it and all of the Scripture in...          1  
71  Would not purchase again or recommend The deca...         -1  
91  Was so excited to get this product for my baby...         -1  


Let's dig deeper into the first row of the sample_test_data. Here's the full review:

In [118]:
# products_without_3[products_without_3['name'] == 'Our Baby Girl Memory Book']
# products_without_3[products_without_3['name'] == 'Wall Decor Removable Decal Sticker - Colorful Butterflies']
# train_data[train_data['name'] == 'Wall Decor Removable Decal Sticker - Colorful Butterflies']
# test_data[test_data['name'] == 'Wall Decor Removable Decal Sticker - Colorful Butterflies']

In [119]:
sample_test_data['review']

59    Absolutely love it and all of the Scripture in...
71    Would not purchase again or recommend. The dec...
91    Was so excited to get this product for my baby...
Name: review, dtype: object

We will now make a **class** prediction for the **sample_test_data**. The `sentiment_model` should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. Recall from the lecture that the **score** (sometimes called **margin**) for the logistic regression model  is defined as:

$$
\mbox{score}_i = \mathbf{w}^T h(\mathbf{x}_i)
$$ 

where $h(\mathbf{x}_i)$ represents the features for example $i$.  We will write some code to obtain the **scores** using Turi Create. For each row, the **score** (or margin) is a number in the range **[-inf, inf]**.

In [120]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print (scores)

[  4.86911646  -3.22670736 -10.16656684]


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

Using scores, write code to calculate $\hat{y}$, the class predictions:

In [121]:
predictions_sample_test = [1 if score > 0 else -1 for score in scores]

In [122]:
predictions_sample_test

[1, -1, -1]

In [123]:
predictions_model = sentiment_model.predict(sample_test_matrix)
predictions_model

array([ 1, -1, -1])

**Checkpoint**: Make sure your class predictions match with the one obtained from Turi Create.

### Probability predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range **[0, 1]**.

In [124]:
def calc_proba(scores):
    proba = [1/(1+np.exp(-score)) for score in scores]
    return proba

In [125]:
calc_proba(scores)

[0.9923783871392199, 0.03817295539717425, 3.8432570468759424e-05]

In [126]:
predictions_probabilities = sentiment_model.predict_proba(sample_test_matrix)

In [127]:
predictions_probabilities[:,1]

array([9.92378387e-01, 3.81729554e-02, 3.84325705e-05])

** Quiz Question:** Of the three data points in **sample_test_data**, which one (first, second, or third) has the **lowest probability** of being classified as a positive review?

In [128]:
#--->The third one


13. We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points.

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:

Make probability predictions on test_data using the sentiment_model.
Sort the data according to those predictions and pick the top 20.


Quiz Question: Which of the following products are represented in the 20 most positive reviews?

In [129]:
test_scores = sentiment_model.decision_function(test_matrix)

In [130]:
test_scores

array([ 2.32667214, 14.83724448,  2.84575242, ..., 10.28183724,
       11.01529378,  4.4617744 ])

In [131]:
predictions_test = sentiment_model.predict(test_matrix)

In [132]:
predictions_test.shape

(33336,)

In [139]:
test_data['sentiment_predicted'] = predictions_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [140]:
test_data['proba_predicted'] = sentiment_model.predict_proba(test_matrix)[:,1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [141]:
test_data.head(10)

Unnamed: 0,name,review,rating,review_clean,sentiment,proba_predicted,sentiment_predicted
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1,0.911062,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1,1.0,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,I love this little calender you can keep track...,1,0.945099,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,I had a hard time finding a second year calend...,1,0.999905,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of babys first and favorite books and it i...,1,0.977628,1
36,"Lamaze Peekaboo, I Love You",My son loved this book as an infant. It was p...,5,My son loved this book as an infant It was pe...,1,0.99925,1
37,"Lamaze Peekaboo, I Love You",Our baby loves this book & has loved it for a ...,5,Our baby loves this book has loved it for a w...,1,0.998812,1
41,"SoftPlay Giggle Jiggle Funbook, Happy Bear",This bear is absolutely adorable and I would g...,2,This bear is absolutely adorable and I would g...,-1,0.921633,1
43,SoftPlay Peek-A-Boo Where's Elmo A Children's ...,I bought two for recent baby showers! The boo...,5,I bought two for recent baby showers The book...,1,0.997983,1
56,Baby's First Year Undated Wall Calendar with S...,I searched high and low for a first year calen...,5,I searched high and low for a first year calen...,1,0.997683,1


In [142]:
#predicted_proba_test = calc_proba(test_scores)

In [143]:
#test_data['predicted_proba'] = predicted_proba_test

In [144]:
test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,proba_predicted,sentiment_predicted
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1,0.911062,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1,1.0,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,I love this little calender you can keep track...,1,0.945099,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,I had a hard time finding a second year calend...,1,0.999905,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of babys first and favorite books and it i...,1,0.977628,1


In [145]:
test_data.sort_values('proba_predicted', ascending = False).iloc[0:20]

Unnamed: 0,name,review,rating,review_clean,sentiment,proba_predicted,sentiment_predicted
52631,Evenflo X Sport Plus Convenience Stroller - Ch...,After seeing this in Parent's Magazine and rea...,5,After seeing this in Parents Magazine and read...,1,1.0,1
140816,"Diono RadianRXT Convertible Car Seat, Plum",I bought this seat for my tall (38in) and thin...,5,I bought this seat for my tall 38in and thin 2...,1,1.0,1
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1,1.0,1
119182,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco!!!!!!I bought this pram from ...,5,Great Pram RoccoI bought this pram from Europe...,1,1.0,1
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n Play la...,1,1.0,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1,1.0,1
168697,Graco FastAction Fold Jogger Click Connect Str...,Graco's FastAction Jogging Stroller definitely...,5,Gracos FastAction Jogging Stroller definitely ...,1,1.0,1
87017,Baby Einstein Around The World Discovery Center,I am so HAPPY I brought this item for my 7 mon...,5,I am so HAPPY I brought this item for my 7 mon...,1,1.0,1
180646,Mamas &amp; Papas 2014 Urbo2 Stroller - Black,After much research I purchased an Urbo2. It's...,4,After much research I purchased an Urbo2 Its e...,1,1.0,1
14008,"Stork Craft Beatrice Combo Tower Chest, White",I bought the tower despite the bad reviews and...,5,I bought the tower despite the bad reviews and...,1,1.0,1


In [146]:
test_data_positive.head(20)

Unnamed: 0,name,review,rating,review_clean,sentiment,sentiment_predicted
52631,Evenflo X Sport Plus Convenience Stroller - Ch...,After seeing this in Parent's Magazine and rea...,5,After seeing this in Parents Magazine and read...,1,1.0
140816,"Diono RadianRXT Convertible Car Seat, Plum",I bought this seat for my tall (38in) and thin...,5,I bought this seat for my tall 38in and thin 2...,1,1.0
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1,1.0
119182,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco!!!!!!I bought this pram from ...,5,Great Pram RoccoI bought this pram from Europe...,1,1.0
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n Play la...,1,1.0
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1,1.0
168697,Graco FastAction Fold Jogger Click Connect Str...,Graco's FastAction Jogging Stroller definitely...,5,Gracos FastAction Jogging Stroller definitely ...,1,1.0
87017,Baby Einstein Around The World Discovery Center,I am so HAPPY I brought this item for my 7 mon...,5,I am so HAPPY I brought this item for my 7 mon...,1,1.0
180646,Mamas &amp; Papas 2014 Urbo2 Stroller - Black,After much research I purchased an Urbo2. It's...,4,After much research I purchased an Urbo2 Its e...,1,1.0
14008,"Stork Craft Beatrice Combo Tower Chest, White",I bought the tower despite the bad reviews and...,5,I bought the tower despite the bad reviews and...,1,1.0


**Quiz Question**: Which of the following products are represented in the 20 most positive reviews? [multiple choice]


Now, let us repeat this exercise to find the "most negative reviews." Use the prediction probabilities to find the  20 reviews in the **test_data** with the **lowest probability** of being classified as a **positive review**. Repeat the same steps above but make sure you **sort in the opposite order**.

In [148]:
test_data_negative = test_data.sort_values('proba_predicted')
test_data_negative.head(20)

Unnamed: 0,name,review,rating,review_clean,sentiment,proba_predicted,sentiment_predicted
94560,The First Years True Choice P400 Premium Digit...,Note: we never installed batteries in these un...,1,Note we never installed batteries in these uni...,-1,1.115711e-14,-1
16042,Fisher-Price Ocean Wonders Aquarium Bouncer,We have not had ANY luck with Fisher-Price pro...,2,We have not had ANY luck with FisherPrice prod...,-1,1.527143e-14,-1
155287,VTech Communications Safe &amp; Sounds Full Co...,"This is my second video monitoring system, the...",1,This is my second video monitoring system the ...,-1,1.678581e-12,-1
120209,Levana Safe N'See Digital Video Baby Monitor w...,This is the first review I have ever written o...,1,This is the first review I have ever written o...,-1,2.324164e-12,-1
48694,Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the...,2,I will try to write an objective review of the...,-1,9.558468e-11,-1
95420,One Step Ahead Hide-Away Extra Long Bed Rail,"I bought a brand new 56"" hide-away bed safety ...",1,I bought a brand new 56 hideaway bed safety ra...,-1,3.359313e-10,-1
53207,Safety 1st High-Def Digital Monitor,We bought this baby monitor to replace a diffe...,1,We bought this baby monitor to replace a diffe...,-1,3.689456e-10,-1
81332,Cloth Diaper Sprayer--styles may vary,I bought this sprayer out of desperation durin...,1,I bought this sprayer out of desperation durin...,-1,6.80889e-10,-1
176046,Baby Trend Inertia Infant Car Seat - Horizon,"I really wanted to love this seat; however, I ...",1,I really wanted to love this seat however I wo...,-1,7.04691e-10,-1
167249,Samsung SEW-3037W Wireless Pan Tilt Video Baby...,Reviewers. You failed me!This thing worked for...,1,Reviewers You failed meThis thing worked for 2...,-1,8.061841e-10,-1


## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points when the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

Complete the function below to compute the classification accuracy:

In [287]:
test_data[test_data['sentiment'] == test_data['sentiment_predicted']].shape[0]

31063

In [288]:
accuracy = (test_data[test_data['sentiment'] == test_data['sentiment_predicted']].shape[0])/test_data.shape[0] 

In [289]:
accuracy

0.9318154547636189

**Quiz Question**: What is the accuracy of the **sentiment_model** on the **test_data**? Round your answer to 2 decimal places (e.g. 0.76)

**Quiz Question**: Does a higher accuracy value on the **training_data** always imply that the classifier is better?

## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of words that occur in the reviews. For this assignment, we selected a 20 words to work with. These are:

In [290]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [291]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

In [292]:
print((vectorizer_word_subset.get_feature_names()))

['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 'work', 'product', 'money', 'would', 'return']


In [296]:
train_data['review_clean'][1]

'it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it'

In [298]:
train_matrix_word_subset.toarray()[0]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

## Train a logistic regression model on a subset of data
We will now build a classifier with **word_count_subset** as the feature and **sentiment** as the target. 

In [299]:
## Instansiate the model
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [300]:
simple_model_weights = simple_model.coef_

In [301]:
simple_model_weights

array([[ 1.36369679,  0.94395038,  1.19221941,  0.08542375,  0.52017372,
         1.51026262,  1.67326913,  0.50375976,  0.19093732,  0.05881344,
        -1.65214402, -0.20934844, -0.51145646, -2.03448908, -2.34847753,
        -0.62130739, -0.32049066, -0.89806176, -0.36215714, -2.10981455]])

In [361]:
df_simple_model_weights = pd.DataFrame(simple_model_weights.flatten(), significant_words, columns= ['simple_model_coefficients'] )

**Quiz Question**: Consider the coefficients of **simple_model**. There should be 21 of them, an intercept term + one for each word in **significant_words**. How many of the 20 coefficients (corresponding to the 20 **significant_words** and *excluding the intercept term*) are positive for the `simple_model`?

In [362]:
df_simple_model_weights = df_simple_model_weights.reset_index()
df_simple_model_weights = df_simple_model_weights.rename(columns={'index':'features'})

In [363]:
df_simple_model_weights = df_simple_model_weights.sort_values('features')

In [1]:
df_simple_model_weights.head()

NameError: name 'df_simple_model_weights' is not defined

**Quiz Question**: Are the positive words in the **simple_model** (let us call them `positive_significant_words`) also positive words in the **sentiment_model**?

In [365]:
df_sentiment_model_weights = pd.DataFrame(weights.flatten(),vectorizer.get_feature_names() 
                                          , columns= ['sentiment_model_coefficients'] )

In [366]:
df_sentiment_model_weights.shape

(121712, 1)

In [367]:
df_sentiment_model_weights.head()

Unnamed: 0,sentiment_model_coefficients
0,-0.645686
0,0.001114
0,0.008857
1,0.013365
1,0.001572


In [368]:
df_sentiment_model_weights.columns

Index(['sentiment_model_coefficients'], dtype='object')

In [369]:
df_sentiment_model_weights = df_sentiment_model_weights.reset_index()

In [370]:
df_sentiment_model_weights = df_sentiment_model_weights.rename(columns={'index':'features'})

### Note

** to form the data frame using the names of the columns **

**simple_model_coef_table = pd.DataFrame({'word':significant_words, 
'coefficient':simple_model.coef_.flatten()})** 

In [371]:
df_sentiment_model_weights.head()

Unnamed: 0,features,sentiment_model_coefficients
0,0,-0.645686
1,0,0.001114
2,0,0.008857
3,1,0.013365
4,1,0.001572


In [372]:
df_sentiment_model_weights_subset = df_sentiment_model_weights[df_sentiment_model_weights['features'].isin (significant_words)]

In [374]:
df_sentiment_model_weights_subset.head()

Unnamed: 0,features,sentiment_model_coefficients
7386,able,0.323124
20190,broke,-1.173417
22122,car,0.108676
34453,disappointed,-2.202657
37640,easy,1.449064


In [376]:

df_combined_weights = pd.merge(left=df_sentiment_model_weights_subset, right=df_simple_model_weights, 
                               on = 'features')

In [377]:
df_combined_weights

Unnamed: 0,features,sentiment_model_coefficients,simple_model_coefficients
0,able,0.323124,0.190937
1,broke,-1.173417,-1.652144
2,car,0.108676,0.058813
3,disappointed,-2.202657,-2.348478
4,easy,1.449064,1.192219
5,even,-0.361602,-0.511456
6,great,1.205939,0.94395
7,less,-0.314856,-0.209348
8,little,0.484516,0.520174
9,love,1.651704,1.363697


# Comparing models

We will now compare the accuracy of the **sentiment_model** and the **simple_model** using the `get_classification_accuracy` method you implemented above.

First, compute the classification accuracy of the **sentiment_model** on the **train_data**:

In [378]:
train_scores = sentiment_model.decision_function(train_matrix)
predictions_train = sentiment_model.predict(train_matrix)
train_data['sentiment_predicted'] = predictions_train
predicted_proba_train = calc_proba(train_scores)
train_data['predicted_proba'] = predicted_proba_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


In [379]:
train_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,sentiment_predicted,predicted_proba
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1,1,0.812801
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1,1,0.976625
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1,1,0.999871
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1,1,0.995399
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1,1,0.999982


In [382]:
train_accuracy = (train_data[train_data['sentiment'] == train_data['sentiment_predicted']].shape[0])/train_data.shape[0] 

In [383]:
train_accuracy

0.9479597649457336

In [385]:

simple_predictions_train = simple_model.predict(train_matrix_word_subset)
train_data['simple_model_predicted'] = simple_predictions_train


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [388]:
train_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,sentiment_predicted,predicted_proba,simple_model_predicted
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1,1,0.812801,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1,1,0.976625,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1,1,0.999871,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1,1,0.995399,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1,1,0.999982,1


In [389]:
simple_train_accuracy = (train_data[train_data['sentiment'] == train_data['simple_model_predicted']].
                  shape[0])/train_data.shape[0] 

In [390]:
simple_train_accuracy

0.8668225700065959

**Quiz Question**: Which model (**sentiment_model** or **simple_model**) has higher accuracy on the TRAINING set?

20. Now, we will repeat this exercise on the test_data. Start by computing the classification accuracy of the sentiment_model on the test_data.

Next, compute the classification accuracy of the simple_model on the test_data.

**Quiz Question: Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?**

In [391]:
test_accuracy = (test_data[test_data['sentiment'] == test_data['sentiment_predicted']].shape[0])/test_data.shape[0] 

In [392]:
test_accuracy

0.9318154547636189

In [393]:

simple_predictions_test = simple_model.predict(test_matrix_word_subset)
test_data['simple_model_predicted'] = simple_predictions_test

In [394]:
test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,sentiment_predicted,predicted_proba,simple_model_predicted
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1,1,1.0,1
140816,"Diono RadianRXT Convertible Car Seat, Plum",I bought this seat for my tall (38in) and thin...,5,I bought this seat for my tall 38in and thin 2...,1,1,1.0,1
87017,Baby Einstein Around The World Discovery Center,I am so HAPPY I brought this item for my 7 mon...,5,I am so HAPPY I brought this item for my 7 mon...,1,1,1.0,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1,1,1.0,1
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n Play la...,1,1,1.0,1


In [395]:
simple_test_accuracy = (test_data[test_data['sentiment'] == test_data['simple_model_predicted']].
                  shape[0])/test_data.shape[0] 

In [396]:
simple_test_accuracy

0.8693604511639069

## Baseline: Majority class prediction

It is quite common to use the **majority class classifier** as the a baseline (or reference) model for comparison with your classifier model. The majority classifier model predicts the majority class for all data points. At the very least, you should healthily beat the majority class classifier, otherwise, the model is (usually) pointless.

What is the majority class in the **train_data**?

In [398]:
positive_label = len(test_data[test_data['sentiment']>0])
negative_label = len(test_data[test_data['sentiment']<0])
print ("positive_label is {}, negative_label is {}".format(positive_label, negative_label))

positive_label is 28095, negative_label is 5241


In [399]:
baseline_accuracy = positive_label*1./(positive_label+negative_label)
print ("baseline_accuracy is {}".format(baseline_accuracy))

baseline_accuracy is 0.8427825773938085
