In [2]:
import sframe
products = sframe.SFrame('amazon_baby.gl/')

###Removing punctuation

In [3]:
#for simplicity remove all punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)


In [4]:
#Ignore all reviews with rating = 3, since they tend to have a neutral sentiment.
products = products[products['rating'] != 3]

In [5]:
'''Rating 4-5 will be positive, rating <= 2 negarive. 
Create a class label with value +1 for positive and -1 for negative'''
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

In [10]:
#sanity check to look at neg reviews
products[products['sentiment'] != 1]

name,review,rating,review_clean,sentiment
Nature's Lullabies Second Year Sticker Calendar ...,I only purchased a second-year calendar for ...,2.0,I only purchased a secondyear calendar for ...,-1
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,This bear is absolutely adorable and I would ...,-1
"SoftPlay Cloth Book, Love",This book is boring. Nothing to stimulate my ...,1.0,This book is boring Nothing to stimulate my ...,-1
Hunnt&reg; Falling Flowers and Birds Kids ...,The reason:Small sizeHard to apply on the wall ...,1.0,The reasonSmall sizeHard to apply on the wall ...,-1
Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The decals ...,2.0,Would not purchase again or recommend The decals ...,-1
Cloth Diaper Pins Stainless Steel ...,These were good quality --worked fine--heavy ...,2.0,These were good qualityworked fineheavy ...,-1
Cloth Diaper Pins Stainless Steel ...,"While the diaper pins are attractive, the metal in ...",2.0,While the diaper pins are attractive the metal in ...,-1
Cloth Diaper Pins Stainless Steel ...,"The steel part is not strong at all, unlike ...",1.0,The steel part is not strong at all unlike the ...,-1
Cloth Diaper Pins Stainless Steel ...,I really thought I was getting a dozen ...,2.0,I really thought I was getting a dozen pinst ...,-1
Super Mario Game Nintendo Wall Sticker and Decal ...,These do not stick to the wall. They start to peel ...,1.0,These do not stick to the wall They start to peel ...,-1


###Splitting data into training and testing sets

In [11]:
train_data, test_data = products.random_split(.8, seed=1)

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
Compute the occurrences of the words in each review and collect them into a row vector.
Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# learn vocabulary from the training data and assign columns to words
# convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

###Train a sentiment classifier with logistic regression

In [14]:
from sklearn.linear_model import LogisticRegression 

In [24]:
clf = LogisticRegression()

In [194]:
sentiment_model = clf.fit(train_matrix, train_data['sentiment'])

In [195]:
len(sentiment_model.coef_[0])

121712

In [32]:
#check how many of the weights are positive
sum(sentiment_model.coef_[0]>=0)

85752

###Making predictions with logistic regression

In [33]:
#pick 3 data points to make a prediction
sample_test_data = test_data[10:13]
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+
|          review_clean         | sentiment |
+-------------------------------+-----------+
| Absolutely love it and all... |     1     |
| Would not purchase again o... |     -1    |
| Was so excited to get this... |     -1    |
+-------------------------------+-----------+
[3 rows x 5 columns]



In [34]:
#examine more thoroughly the sample data
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [35]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [36]:
'''convert sample_test_data into the sparse matrix format and 
calculate the score of each data point in sample_test_data.'''
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print scores

[  5.59095054  -3.12647284 -10.42233483]


###Prediciting sentiment

In [48]:
#Using sample test data we predict that y^ is +1 if score is >0 and y^ is -1 if score is <=0 
predicted_sample_sent = []
for i in range(len(scores)):
    val = 1 if scores[i] >0 else -1
    predicted_sample_sent.append(val)

In [49]:
predicted_sample_sent

[1, -1, -1]

In [51]:
#comparing these values to predictive model:
predictions_sent_mod = sentiment_model.predict(test_matrix)
predictions_sent_mod[10:13]

array([ 1, -1, -1])

###Probability predictions

In [105]:
'''calculating the probabilities of positive prediction using sigmoid function'''
from math import exp, pow
def calculate_probs(scores):
    predicted_positive_prob = []
    for score in (scores):
        predicted_positive_prob.append(1/(1+exp(-score)))
    return predicted_positive_prob

In [106]:
print calculate_probs(scores)

[0.9962823929095075, 0.042028388468810046, 2.9759427222786967e-05]


In [88]:
scores

array([  5.59095054,  -3.12647284, -10.42233483])

Indeed, for the sentient categorized as negative, the probability of them to be positive is very close to 0

###Finding the most positive (and negative) review

In [78]:
'''Using the sentiment model to predict the sentiment of entire testing set. '''
scores_test = sentiment_model.decision_function(test_matrix)
scores_test

array([  1.27953357,  14.11629773,   2.65732344, ...,  12.09962799,
        12.85986045,   3.94599332])

In [114]:
test_data['probability'] = calculate_probs(scores_test)

In [233]:
'''Check which are the review with the highest probablity for a positive sentiment'''
test_data.sort('probability', ascending = False).print_rows(20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Evenflo X Sport Plus Conve... | After seeing this in Paren... |  5.0   |
| Diono RadianRXT Convertibl... | I bought this seat for my ... |  5.0   |
| Roan Rocco Classic Pram St... | Great Pram Rocco!!!!!!I bo... |  5.0   |
| P'Kolino Silly Soft Seatin... | I've purchased both the P'... |  4.0   |
| Infantino Wrap and Tie Bab... | I bought this carrier when... |  5.0   |
| Baby Einstein Around The W... | I am so HAPPY I brought th... |  5.0   |
| Mamas &amp; Papas 2014 Urb... | After much research I purc... |  4.0   |
| Graco Pack 'n Play Element... | My husband and I assembled... |  4.0   |
| Buttons Cloth Diaper Cover... | We are big Best Bottoms fa... |  4.0   |
| Graco FastAction Fold Jogg... | Graco's FastAction Jogging... |  5.0   |
| Simple Wishes Hands-Fre

In [244]:
'''Check which are the review with the lowest probablity for a positive sentiment'''
test_data.sort('probability').print_rows(20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| Fisher-Price Ocean Wonders... | We have not had ANY luck w... |  2.0   |
| Levana Safe N'See Digital ... | This is the first review I... |  1.0   |
| Safety 1st Exchangeable Ti... | I thought it sounded great... |  1.0   |
| Adiri BPA Free Natural Nur... | I will try to write an obj... |  2.0   |
| VTech Communications Safe ... | This is my second video mo... |  1.0   |
| The First Years True Choic... | Note: we never installed b... |  1.0   |
| Safety 1st High-Def Digita... | We bought this baby monito... |  1.0   |
| Cloth Diaper Sprayer--styl... | I bought this sprayer out ... |  1.0   |
| Motorola Digital Video Bab... | DO NOT BUY THIS BABY MONIT... |  1.0   |
| Philips AVENT Newborn Star... | It's 3am in the morning an... |  1.0   |
| Cosco Alpha Omega Elite

In [252]:
import pandas as pd
tmp = test_data.to_dataframe()

In [258]:
tmp1 = tmp.sort_values(by = 'probability')[0:20]

In [260]:
tmp1.sort_values(by = 'name')

Unnamed: 0,name,review,rating,review_clean,sentiment,probability
8818,Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the...,2,I will try to write an objective review of the...,-1,1.544472e-13
31226,Belkin WeMo Wi-Fi Baby Monitor for Apple iPhon...,I read so many reviews saying the Belkin WiFi ...,2,I read so many reviews saying the Belkin WiFi ...,-1,6.545745e-10
7310,Chicco Cortina KeyFit 30 Travel System in Adve...,My wife and I have used this system in two car...,1,My wife and I have used this system in two car...,-1,6.832724e-10
14711,Cloth Diaper Sprayer--styles may vary,I bought this sprayer out of desperation durin...,1,I bought this sprayer out of desperation durin...,-1,4.134443e-11
1810,Cosco Alpha Omega Elite Convertible Car Seat,I bought this car seat after both seeing the ...,1,I bought this car seat after both seeing the ...,-1,4.505826e-10
10814,Ellaroo Mei Tai Baby Carrier - Hershey,This is basically an overpriced piece of fabri...,1,This is basically an overpriced piece of fabri...,-1,4.746746e-10
2931,Fisher-Price Ocean Wonders Aquarium Bouncer,We have not had ANY luck with Fisher-Price pro...,2,We have not had ANY luck with FisherPrice prod...,-1,8.806168e-16
21700,Levana Safe N'See Digital Video Baby Monitor w...,This is the first review I have ever written o...,1,This is the first review I have ever written o...,-1,1.848758e-15
20594,Motorola Digital Video Baby Monitor with Room ...,DO NOT BUY THIS BABY MONITOR!I purchased this ...,1,DO NOT BUY THIS BABY MONITORI purchased this m...,-1,9.448434e-11
27231,NUK Cook-n-Blend Baby Food Maker,It thought this would be great. I did a lot of...,1,It thought this would be great I did a lot of ...,-1,8.072012e-10


####Computing accuracy of the classifier

In [124]:
test_predictions = sentiment_model.predict(test_matrix)

In [138]:
accuracy = sum(test_predictions == test_data['sentiment'])/ float(len(test_data))
print round(accuracy, 2)

0.93


####Learning another classifier with fewer words

In [139]:
"""Training a simpler logistic regression model using only a subset of words that occur in the reviews."""
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
                     'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
                     'work', 'product', 'money', 'would', 'return']

In [140]:
'''Compute a new set of word count vectors using only these words'''
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

####Training a logistic regression model on a subset of data

In [209]:
clf1 = LogisticRegression()
simple_model = clf1.fit(train_matrix_word_subset, train_data['sentiment'])

In [210]:
#Inspecting the weights (coefficients) of the simple_model
simple_model_coef_table = sframe.SFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})

In [211]:
simple_model_coef_table.sort('coefficient', ascending = False)

coefficient,word
1.67307389259,loves
1.50981247669,perfect
1.36368975931,love
1.19253827349,easy
0.943999590573,great
0.520185762718,little
0.503760457768,well
0.190908572065,able
0.0855127794633,old
0.058854671153,car


In [212]:
simple_model_coef_table.print_rows(20)

+-----------------+--------------+
|   coefficient   |     word     |
+-----------------+--------------+
|  1.36368975931  |     love     |
|  0.943999590573 |    great     |
|  1.19253827349  |     easy     |
| 0.0855127794633 |     old      |
|  0.520185762718 |    little    |
|  1.50981247669  |   perfect    |
|  1.67307389259  |    loves     |
|  0.503760457768 |     well     |
|  0.190908572065 |     able     |
|  0.058854671153 |     car      |
|  -1.65157634496 |    broke     |
| -0.209562864534 |     less     |
| -0.511379631799 |     even     |
|  -2.03369861394 |    waste     |
|  -2.3482982195  | disappointed |
| -0.621168773642 |     work     |
| -0.320556236734 |   product    |
| -0.898030737715 |    money     |
| -0.362166742274 |    would     |
|  -2.10933109032 |    return    |
+-----------------+--------------+
[20 rows x 2 columns]



In [213]:
sum(simple_model_coef_table['coefficient'] > 0)

10

In [214]:
simple_positive = list(simple_model_coef_table[simple_model_coef_table['coefficient']>0]['word'])

In [215]:
sentiment_model.coef_

array([[ -1.23707677e+00,   1.96133895e-04,   2.59841044e-02, ...,
          1.14844613e-02,   3.17099575e-03,  -6.98805068e-05]])

In [216]:
simple_model.coef_

array([[ 1.36368976,  0.94399959,  1.19253827,  0.08551278,  0.52018576,
         1.50981248,  1.67307389,  0.50376046,  0.19090857,  0.05885467,
        -1.65157634, -0.20956286, -0.51137963, -2.03369861, -2.34829822,
        -0.62116877, -0.32055624, -0.89803074, -0.36216674, -2.10933109]])

In [None]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# learn vocabulary from the training data and assign columns to words
# convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [196]:
sentiment_model_coef_table = sframe.SFrame({'word':vectorizer.vocabulary_,
                                         'coefficient':sentiment_model.coef_.flatten()})

In [237]:
sentiment_model_coef_table.sort('coefficient')

coefficient,word
-2.74568257921,valuable
-2.68181334275,finds
-2.62259886043,12202012
-2.61558898482,2022
-2.45151415383,silentgood
-2.4322338759,prove
-2.32632336344,happierour
-2.29951053064,achievements
-2.29388037711,spitty
-2.28615453328,thingbunnies


In [238]:
sentiment_positive = list(sentiment_model_coef_table[sentiment_model_coef_table['coefficient']>0]['word'])

In [239]:
s = set(sentiment_positive)
temp = [x for x in simple_positive if x not in s]
print temp

['little', 'perfect', 'able', 'car']


Above words have poistive coefficient in simple model, but negative in sentiment model

####Comparing models

Computing the classification accuracy of the sentiment_model and
simple_model on the train_data.

In [220]:
simple_prediction = simple_model.predict(train_matrix_word_subset)
sentiment_prediction = sentiment_model.predict(train_matrix)

In [222]:
simple_train_accuracy = sum(simple_prediction == train_data['sentiment'])/ float(len(train_data))
print round(simple_train_accuracy, 2)

0.87


In [223]:
sentiment_train_accuracy = sum(sentiment_prediction == train_data['sentiment'])/ float(len(train_data))
print round(sentiment_train_accuracy, 2)

0.97


Computing the classification accuracy of the sentiment_model and
simple_model on the test_data.

In [224]:
simple_prediction_test = simple_model.predict(test_matrix_word_subset)
sentiment_prediction_test = sentiment_model.predict(test_matrix)

In [225]:
simple_test_accuracy = sum(simple_prediction_test == test_data['sentiment'])/ float(len(test_data))
print round(simple_test_accuracy, 2)

0.87


In [226]:
sentiment_test_accuracy = sum(sentiment_prediction_test == test_data['sentiment'])/ float(len(test_data))
print round(sentiment_test_accuracy, 2)

0.93


####Majority class prediction

In [228]:
#find majority class:
sum(test_data['sentiment'] > 0)/float(len(test_data))

0.8427825773938085

Since the proportion of positive sentiment is .84, the majority class will be positive review.
To check the accuracy, we compare how many true posivites are in data set versus total number of data points, since the majority class will predict all data points to be positive.

In [229]:
majority_class_accuracy = round(sum(test_data['sentiment'] > 0)/float(len(test_data)),2)
print majority_class_accuracy

0.84
