# Sentiment analsyis of Amazon product reviews using Logistic Regression

In this assignment, we will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

## Load amazon dataset

#### import pandas 

In [1]:
import pandas as pd 

#### read csv into a pandas Dataframe instance 

In [2]:
products = pd.read_csv('amazon_baby.csv')

#### Size of the data 

In [3]:
products.shape

(183531, 3)

There are 183531 reviews for this product

#### A look into the first 5 rows of the dataframe 

The first column is the title of the review, the second is the review itself and the third the rating the reviewer gave to the product.

In [4]:
products.head(5)

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## A look into some of the reviews and ratings

In [5]:
print(products['review'][4],"\n")
print("rating: ",products['rating'][4])

All of my kids have cried non-stop when I tried to ween them off their pacifier, until I found Thumbuddy To Love's Binky Fairy Puppet.  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from it.This is a must buy book, and a great gift for expecting parents!!  You will save them soo many headaches.Thanks for this book!  You all rock!! 

rating:  5


In [6]:
print(products['review'][331],"\n")
print("rating: ", products['rating'][331])

Granted our 3-month old isn't producing really stinky diapers yet, but the Champ is much better than using a trash can.  My only issue is that its hard to flip the handle over using one hand - kind of a must when holding on to a squirmy infant. 

rating:  4


In [7]:
print(products['review'][4125],"\n")
print("rating: ",products['rating'][4125])

The gate did not fit into the entrance of our staircase and although the website of the manufacturer claims that there are extensions, none could be found on the site.Even though I could not complete the installation, the product itself did not inspire much confidence that it was going to withstand what we demanded of it, namely keeping our toddler daughter from falling down the stairs.In addition, the pieces used to put together the gate were of poor quality and tolerance.  One only wishes that the manufacturers of such products understood that when dealing with safety, short-cuts in quality should never be tolerated. 

rating:  2


In [8]:
print(products['review'][8340],"\n")
print("rating: ",products['rating'][8340])

It is PERFECT for infants or toddlers!  As a new first time mother I carried everything in the house with us when she was a baby!  As a toddler we still have a pretty full bag!  This is the ONLY bag we have found that everything fits in and the bag itself it not huge!The side pockets were great for bottles, sippys or a drink for mom.  The front pouch is perfect to carry clippers, medicine and other small stuff.  The inside pockets are perfect jars of food or for a bottle/sippy for that transition period.  Plus you'll still have room for your wallet, a couple of outfits, toys and anything else you might need.It's also a pretty good color.  It doesn't really show alot of dirty and it washes nicely.  Overall, we love the bag and wish they had it available in more colors! 

rating:  5


#### What is the range of possible review scores?

In [9]:
products.rating.unique()

array([3, 5, 4, 2, 1])

## Perform text cleaning

We start by removing punctuation, so that words "cake." and "cake!" are counted as the same word. The cleaned review will be added into a new column called "review_clean". Let's write a function that removes all puncutations from the text.

In [10]:
products = products.fillna({'review':""}) #Remove NA rows with an empty string

#Write a function remove_punctuation that strips punctuation from a line of text
def remove_punctuation(text):
    import string #package that contains list of punctuations
    translator = str.maketrans({key: None for key in string.punctuation}) #repace all punctuations with an 
                                                                          #empty string
    return text.translate(translator).lower() #transforms all the words into lower-case

#### Now let's apply the function we just created to every row in the column 'review' and assign the new text to the colum 'review clean' 

In [11]:
products['review_clean']=products['review'].apply(remove_punctuation)

#### This is how the dataframe looks with the new cleaned column 

In [12]:
products.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,these flannel wipes are ok but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,this is a product well worth the purchase i h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,all of my kids have cried nonstop when i tried...


## Extract Sentiments

Lets start by removing the reviews with a rating=3, since they tend to have a neutral sentiment. Then, let's assign reviews with a rating of 4 or higher to be __positive__ while the ones with a rating 2 or lower to be __negative__ .

In [13]:
products = products[products['rating']!=3] # exludes rows in which ratings are equal to 3
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating>3 else -1) # creates an anonimous 
                                                                                        # function that assigns +1
                                                                                        # to ratings higher than 3
                                                                                        # and -1 to ratings lower
                                                                                        # than 3. The results are 
                                                                                        # stored in a new column
                                                                                        # named "sentiment"

## Split into training and test sets

In [14]:
from urllib.request import urlopen

with urlopen("https://s3.amazonaws.com/static.dato.com/files/coursera/course-3/"
                   "indices-json/module-2-assignment-train-idx.json") as f:
    train_idx = f.read().decode('ascii')
    train_idx = train_idx.split(", ")
    
    
with urlopen("https://s3.amazonaws.com/static.dato.com/files/coursera/course-3/"
                   "indices-json/module-2-assignment-test-idx.json") as f:
    test_idx = f.read().decode('ascii')
    test_idx = test_idx.split(", ")

In [15]:
train_idx[0]=train_idx[0].strip("[")
train_idx[-1]=train_idx[-1].strip("]")
test_idx[0]=test_idx[0].strip("[")
test_idx[-1]=test_idx[1].strip("]")

In [16]:
train_idx = [int(i) for i in train_idx]
test_idx = [int(i) for i in test_idx]

In [261]:
train_data, test_data = products.iloc[train_idx],products.iloc[test_idx]

In [262]:
train_data[0:4]

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,this is a product well worth the purchase i h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,all of my kids have cried nonstop when i tried...,1


## Build the word count vector for each review

### Learn a vocabulary (set of all words) from the training data. Create bag of words

In learing from text one of the fundamental problems we have is that the lenght of each review, or any text,is not standarized. So you can't use each individual words as an input feature because long reviews would require a different input space than short reviews. This is where a cool trick called Bag of Words comes into rescue. The basic idea is to get a review and count the frequency of words in it and to create a dictonary of the words we are interested in mapping. Then we can match each review into a frequency count defined over the word dictionary. So in the end we have a vector (the dictionary) of size $n$ that can be applied to different reviews. This way all the input features have the same dimension. 

SckikitLearn has a class that let's us a create a Bag of Words, called the CountVectorizer.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')  

# First, learn vocabulary from the training data and assign columns to words
# then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean']) #the input parameter is a list of texts

# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [20]:
from sklearn.externals import joblib
joblib.dump(train_matrix.tocsr(), 'dataset.joblib')

['dataset.joblib',
 'dataset.joblib_01.npy',
 'dataset.joblib_02.npy',
 'dataset.joblib_03.npy']

In [135]:

train_matrix.shape

(133416, 121712)

Now we have a feature matrix with 133416 features. In other words, we are going to train our algorithm with 133416 different words and each word will receive a weight. 

In [140]:
print(train_matrix.data)

[3 1 1 ..., 2 1 2]


## Train a sentiment classifier with logistic regression

#### Import Logistic Regression class from ScikitLearn

In [22]:
from sklearn.linear_model import LogisticRegression

####  Create instance of the LogisticRegression class and assign it to the variable "sentiment_model"

In [23]:
sentiment_model = LogisticRegression()

#### Fit the model with the features (train_matrix) and the target variables (train_data['sentiment'])

In [24]:
sentiment_model.fit(train_matrix,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### how many weights are bigger than zero? 

In [67]:
def count_coeff():
    counter_bigger = 0
    counter_smaller = 0
    for coef in sentiment_model.coef_[0]:
        if coef>=0:
            counter_bigger +=1
        else:
            counter_smaller += 1
    print("how many >= 0;", "how many < 0")
    return counter_bigger,counter_smaller

In [129]:
sentiment_model.coef_ # some of the weights that where trained after the fit method was called
                      # each weight corresponds to a single word

array([[ -1.23889324e+00,   1.59863291e-04,   2.63828080e-02, ...,
          1.17685365e-02,   3.10346626e-03,  -6.36644403e-05]])

In [69]:
count_coeff()

how many >= 0; how many < 0


(85911, 35801)

85911 words have a positive weight. These are words that are associated with positive reviews.

## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the SFrame test_data and print their content:

In [79]:
sample_test_data = test_data[10:13]
sample_test_data

Unnamed: 0,name,review,rating,review_clean,sentiment
59,Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in...,5,absolutely love it and all of the scripture in...,1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,would not purchase again or recommend the deca...,-1
91,New Style Trailing Cherry Blossom Tree Decal R...,Was so excited to get this product for my baby...,1,was so excited to get this product for my baby...,-1


In [105]:
print(sample_test_data[:1]['review'].values)

[ 'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.']


That review seems pretty positive. Now, let's see what the next row of the sample_test_data looks like. As we could guess from the rating (-1), the review is quite negative.

In [106]:
print(sample_test_data[1:2]['review'].values)

[ 'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.']


### We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. Each review $i$ will receive a score based on the function $score_i= weights \times features_i$

In [110]:
# convert sample_test_data into the sparse matrix format first
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean']) # converts into numpy matrix

In [120]:
# this is how test_data looks before conversion
sample_test_data['review_clean'] # a pandas Series

59    absolutely love it and all of the scripture in...
71    would not purchase again or recommend the deca...
91    was so excited to get this product for my baby...
Name: review_clean, dtype: object

In [125]:
# and after conversion
sample_test_matrix.data # a numpy array

array([1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 3, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 5, 1, 2, 3, 2, 2, 1, 2, 1, 1, 1])

In [163]:
print(sample_test_matrix.shape) # 3 reviews and 121712 words of bag of words each.

(3, 121712)


In [142]:
scores = sentiment_model.decision_function(sample_test_matrix) # calculates the score of each data point

In [144]:
print(scores)

[  5.60798627  -3.1429946  -10.44043584]


If the score is > 1 the prediciton will be of a positive review and if it is < 1, of a negative review.

### Using scores, write code to calculate predicted labels for sample_test_data.


In [152]:
def predict_labels():
    predictions = []
    for score in scores:
        if score > 0:
            predictions.append(1)
        else:
            predictions.append(-1)
    return predictions

In [290]:
sample_predictions = predict_labels()

In [291]:
print(sample_predictions)

[1, -1, -1]


## Probability Predictions

We can calculate the probability predictions from the scores using: $$P(y_i = +1| x_i,w) = \frac{1}{(1+exp(-w_th(x_i))}$$ 


#### Using the scores calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range [0, 1]. 

In [169]:
from math import exp 

def get_probability(scores):
    """this function will transform each score into a number 
    between 0 and 1. It is know as the sigmoid"""    
    
    probabilities = []
    
    for score in scores:
        probability = 1/(1 + exp(score)) 
        probabilities.append(probability)
        
    return probabilities

In [178]:
sample_probabilities = get_probability(scores)
print(sample_probabilities)

[0.003655040847281396, 0.958631800317721, 0.9999707743866006]


The first review has a 0,3% probability of being postive 
The second review has a 95,86% probability of being postive 
The third review has a 99,99% probability of being postive 

Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

first one

##  Find the most positive (and negative) review

We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

In [274]:
probabilities = sentiment_model.predict_proba(test_matrix)[:,0]

 Which of the following products are represented in the 20 most positive reviews?

In [278]:
test_data = test_data.copy()
test_data.loc[:,'probabilities (%)'] = probabilities[::-1]*100

In [279]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
test_data

Unnamed: 0,name,review,rating,review_clean,sentiment,probabilities (%)
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,this has been an easy way for my nanny to reco...,1,0.000
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,i love this journal and our nanny uses it ever...,1,0.000
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,i love this little calender you can keep track...,1,0.001
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,i had a hard time finding a second year calend...,1,0.090
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,one of babys first and favorite books and it i...,1,2.473
36,"Lamaze Peekaboo, I Love You",My son loved this book as an infant. It was p...,5,my son loved this book as an infant it was pe...,1,0.212
37,"Lamaze Peekaboo, I Love You",Our baby loves this book & has loved it for a ...,5,our baby loves this book has loved it for a w...,1,0.935
41,"SoftPlay Giggle Jiggle Funbook, Happy Bear",This bear is absolutely adorable and I would g...,2,this bear is absolutely adorable and i would g...,-1,0.006
43,SoftPlay Peek-A-Boo Where's Elmo A Children's ...,I bought two for recent baby showers! The boo...,5,i bought two for recent baby showers the book...,1,0.031
56,Baby's First Year Undated Wall Calendar with S...,I searched high and low for a first year calen...,5,i searched high and low for a first year calen...,1,0.002


###  Which of the following products are represented in the 20 most positive reviews?

In [286]:
most_positive = test_data.sort_values(['probabilities (%)'],ascending=False)
print(most_positive[0:20]['name'])

167384         Baby Jogger Summit X3 Double Stroller, Black
64332                     Stokke Tripp Trapp Highchair, Red
107322                        Maclaren Volo Stroller, Black
134858    Infant Optics DXR-5 2.4 GHz Digital Video Baby...
27892                Medela Ice Pack for Breastmilk Storage
90039                      Kidco Anti-Tip TV Strap - 2 Pack
130677               Summer Infant Deluxe Piddle Pad, Black
103052                             Toy Story Twin Sheet Set
70351     American Baby Company 100% Organic Cotton Inte...
172995    Gerber 2 Pack Cotton Knit Fitted Crib Sheets P...
173848                         Yo Gabba Gabba Brobee Pillow
124516             Munchkin Shampoo Rinser, Colors May Vary
108536                                Bugaboo Transport Bag
11489            Fisher-Price Booster Seat, Blue/Green/Gray
143401               RSVP Endurace Spring Whisk, 9-1/4-Inch
33215     Lamaze Play &amp; Grow Jacques the Peacock Tak...
28259     DaVinci Emily 4-in-1 Convertib

Now, let us repeat this exercise to find the "most negative reviews."

### Which of the following products are represented in the 20 most negative reviews?

In [288]:
most_negative = test_data.sort_values(['probabilities (%)'],ascending=True)
print(most_negative[0:20]['name'])

104258                Elegant Baby Cross with Diamond, Gold
65390     The First Years Compass B540 Booster Seat, Abs...
2957                            BABYBJORN Potty Chair - Red
49952         Medela 5 oz Breastmilk Bottle Set (3 Bottles)
87337     Sunshine Kids Radian XTSL Convertible Car Seat...
84138     Thermos Funtainer Straw Bottle, Dora The Explo...
14783          North States Supergate Expandable Swing Gate
42881                 Wubbanub Infant Pacifier - Pink Horse
131125    Fisher-Price Discover 'n Grow Storybook Projec...
69539                                Jeep Stroller Mesh Bag
133302    Child Craft London Euro Style Stationary Crib,...
15289     Graco Baby Einstein Discover and Play Entertainer
118560                  Sorelle Lynn Changing Table, Merlot
46636     Fisher-Price Rainforest Color Changing Sun Shades
97590     Bright Starts Ingenuity Automatic Bouncer, Bel...
35459     Cloud b Sleep Sheep On The Go Travel Sound Mac...
17851            Prince Lionheart Multi-

### Compute accuracy of the classifier



In [295]:
predictions = sentiment_model.predict(test_matrix)
print(predictions.shape)

(33336,)


In [296]:
test_data.loc[:,'predictions'] = predictions

In [301]:
test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,probabilities (%),predictions
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,this has been an easy way for my nanny to reco...,1,0.0,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,i love this journal and our nanny uses it ever...,1,0.0,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,i love this little calender you can keep track...,1,0.001,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,i had a hard time finding a second year calend...,1,0.09,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,one of babys first and favorite books and it i...,1,2.473,1


### Count the number of data points when the predicted class labels match the ground truth labels.


In [306]:
def accuracy(data):
    """caculate the accuracy of the sentiment's predicitons"""
    
    correctly_classified = 0
    total_examples = data.shape[0]
    
    for sentiment,prediction in zip(data.sentiment,data.predictions):
        if prediction == sentiment:
            correctly_classified += 1
        else:
            continue
       
    return correctly_classified/total_examples

In [310]:
print(round(accuracy(test_data),2))

0.93


### Does a higher accuracy value on the training_data always imply that the classifier is better?

It doesn't!!! The classifier may be overfitted. This is to say that it captures too much random error into it. In anything we try to model there will always be a part that is generated due to random noise. We don't want to capture this because these errors are unpredictible by definition. In other words, we might get a model that is excelent at predidicting the data we trained it weith but bad if it tries to predict new data.

### Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subet of words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [339]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

Compute a new set of word count vectors using only these words. The CountVectorizer class has a parameter that lets you limit the choice of words when building word count vectors:

In [312]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

### Train a logistic regression model on a subset of data

In [313]:
simple_model = LogisticRegression()

In [314]:
simple_model.fit(train_matrix_word_subset,train_data.sentiment)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Let us inspect the weights (coefficients) of the simple_model and build a table to store (word, coefficient) pair

In [329]:
print(train_matrix_word_subset[0:8]) # 133416 reviews, 20 words

  (0, 0)	1
  (0, 14)	1
  (2, 0)	2
  (2, 4)	1
  (2, 6)	1
  (2, 7)	1
  (2, 16)	2
  (3, 1)	1
  (3, 2)	1
  (3, 6)	1
  (3, 15)	1
  (4, 1)	1
  (4, 16)	1
  (4, 18)	1
  (5, 8)	1
  (6, 5)	1
  (6, 8)	1
  (6, 18)	1
  (7, 16)	1


"(2, 16) 2" should read: the word with the index 16 appears 2 times in the review number 2. 

In [369]:
simple_model_coef_table = pd.DataFrame()
simple_model_coef_table['word'] = significant_words
simple_model_coef_table['coefficient'] = simple_model.coef_[0]
simple_model_coef_table.sort_values(by='coefficient',ascending=False)

Unnamed: 0,word,coefficient
6,loves,1.673
5,perfect,1.51
0,love,1.364
2,easy,1.193
1,great,0.944
4,little,0.52
7,well,0.504
8,able,0.191
3,old,0.086
9,car,0.059
