### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [2]:
print(baby_df['review'][4])

All of my kids have cried non-stop when I tried to ween them off their pacifier, until I found Thumbuddy To Love's Binky Fairy Puppet.  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from it.This is a must buy book, and a great gift for expecting parents!!  You will save them soo many headaches.Thanks for this book!  You all rock!!


## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ratings ($\geq4$) to 1 and all negative ratings ($\leq2$) to -1.

In [3]:
# b) replace all nan's with empty strings
baby_df['review'] = baby_df['review'].fillna('')

# a) remove punctuation
trans = str.maketrans('','',string.punctuation)

# use a separator character that does not occur in our data
sep = chr(0)

# join all reviews, translate once, and split again: much faster than the apply method
baby_df['review'] = sep.join(baby_df['review'].tolist()).translate(trans).split(sep)



print(baby_df['review'][4])

All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock
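
To see why the join/translate/split trick gives the same result as translating each review separately, here is a minimal sketch on a made-up list of strings (the sample reviews are illustrative, not taken from the dataset). It relies on the same assumption as the cell above: the separator `chr(0)` must not occur in any review.

```python
import string

# Translation table that deletes every ASCII punctuation character.
trans = str.maketrans('', '', string.punctuation)

# Illustrative reviews (hypothetical, not from amazon_baby.csv).
reviews = ["Great product!!", "Broke after 2 days... don't buy.", ""]

# Variant 1: translate each review on its own.
per_review = [r.translate(trans) for r in reviews]

# Variant 2: join everything, translate once, split back.
sep = chr(0)
assert not any(sep in r for r in reviews)  # the trick breaks if sep occurs in the data
joined = sep.join(reviews).translate(trans).split(sep)

print(per_review)
print(joined == per_review)
```

Note that `string.punctuation` covers ASCII punctuation only, so other symbols survive, and remains of HTML entities can show up as stray digits (such as the `8220` visible in one of the reviews printed later).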


In [4]:
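# c) drop neutral reviews; .copy() avoids pandas' SettingWithCopyWarning on the later assignments
baby_df = baby_df[baby_df['rating'] != 3].copy()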
baby_df.head()

Unnamed: 0,name,review,rating
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5
5,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,5


In [5]:
# d)
# neutral ratings (3) were dropped in c), so negative means < 3 and positive means > 3

baby_df.loc[ baby_df['rating'] < 3, 'rating' ] = -1
baby_df.loc[ baby_df['rating'] > 3, 'rating' ] = 1
baby_df.describe()


Unnamed: 0,rating
count,166752.0
mean,0.682247
std,0.731124
min,-1.0
25%,1.0
50%,1.0
75%,1.0
max,1.0
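
Because the labels are now only -1 and 1, the mean in the summary above determines the class balance: if $p$ is the share of positive reviews, the mean equals $p - (1 - p) = 2p - 1$. A small sketch of that arithmetic, using the rounded values printed by `describe()` (so the results are approximate):

```python
# Values copied from the describe() output above.
count = 166752
mean_label = 0.682247

# mean = p * 1 + (1 - p) * (-1) = 2p - 1  =>  p = (mean + 1) / 2
positive_share = (mean_label + 1) / 2
negative_share = 1 - positive_share

print(f"positive share: {positive_share:.3f}")
print(f"negative share: {negative_share:.3f}")
print(f"approx. number of positive reviews: {round(positive_share * count)}")
```

The dataset is therefore imbalanced: roughly 84% of the remaining reviews are positive, so a classifier that always predicts 1 would already be right about that often. This is a useful reference point when reading accuracy scores.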


## CountVectorizer
In order to analyze strings, we need to represent them numerically. We will use one of the simplest representations, which maps each string to an $n$-dimensional vector. The number of dimensions is the size of our dictionary, and each entry of the vector is the number of appearances of the corresponding word in the string.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

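# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())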
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]




In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


A few facts are worth noting. First, CountVectorizer does not take word order into account. Second, by default it lowercases the text and ignores one-letter words (this can be changed at initialization, e.g. via the `token_pattern` parameter). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.
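
These behaviours can be reproduced with a hand-rolled sketch in plain Python. This is not scikit-learn's implementation, and the function names are hypothetical; it only mimics the default settings as documented: lowercasing, the token pattern `\b\w\w+\b` (at least two word characters), and an alphabetically sorted dictionary.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token_pattern: two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens; one-letter words are dropped."""
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Collect every token seen in the training documents, sorted alphabetically."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def transform(docs, vocabulary):
    """Count known words per document; words outside the vocabulary are ignored."""
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokenize(doc)).items():
            if word in index:
                row[index[word]] = n
        rows.append(row)
    return rows

train = ["We like apples", "We hate oranges", "I adore bananas",
         "We like like apples and oranges", "They dislike bananas"]

vocabulary = fit_vocabulary(train)
print(vocabulary)                                  # 'i' is missing: one-letter word
print(transform(["We love bananas"], vocabulary))  # 'love' is unknown and ignored
```

Since only counts are stored, "We like apples" and "apples like we" would map to the same vector.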

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# a)
# X = baby_df['review'].values
X = baby_df['review']
# y = baby_df['rating'].values
y = baby_df['rating']
X_train,X_test, y_train,y_test = train_test_split(X,y, test_size = 0.25, random_state = 4)
# keep the raw test reviews, so they can be printed after X_test is vectorized
X_test_ = X_test

In [9]:
# b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
# with the default number of iterations (max_iter=100) the solver fails to converge
model = LogisticRegression(max_iter = 2000)
model.fit(X_train,y_train)

LogisticRegression(max_iter=2000)

In [11]:
#b)
coef = model.coef_[0]
tmp = np.argsort(coef)
names = vectorizer.get_feature_names_out()

worst = tmp[:10]
best = tmp[-10:]

print('Most negative words:', ', '.join(np.take(names,worst)))
print('Most positive words:', ', '.join(np.take(names,best)))

Most negative words: worst, dissapointed, theory, useless, worthless, disappointing, poorly, unusable, unacceptable, terrible
Most positive words: perfect, awesome, hesitate, pleasantly, skeptical, lifesaver, penny, saves, rich, ply
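
Note that `np.argsort` sorts in ascending order, so `tmp[-10:]` lists the positive words from weakest to strongest: the most positive word is the last one printed. A minimal sketch of the same ranking in plain Python, on made-up weights (the words and values are illustrative only), with the positive list reversed so that the strongest word comes first:

```python
# Hypothetical word weights, for illustration only.
weights = {"awful": -2.1, "broke": -1.4, "fine": 0.1, "nice": 0.9, "perfect": 1.8}

words = list(weights)
coef = [weights[w] for w in words]

# Plain-Python equivalent of np.argsort(coef): indices ordered by ascending weight.
order = sorted(range(len(coef)), key=lambda i: coef[i])

k = 2
most_negative = [words[i] for i in order[:k]]         # strongest negative first
most_positive = [words[i] for i in order[-k:][::-1]]  # reversed: strongest positive first

print("Most negative words:", ", ".join(most_negative))
print("Most positive words:", ", ".join(most_positive))
```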


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
y_pred = model.predict(X_test)
print(y_pred[:10])

[-1 -1 -1  1  1  1  1  1  1  1]


In [13]:
#b)
print(model.classes_)
prob_pred = model.predict_proba(X_test)
print(prob_pred[:10])
#hint: model.predict_proba()

[-1  1]
[[9.99619411e-01 3.80588670e-04]
 [9.99602958e-01 3.97041856e-04]
 [9.93656973e-01 6.34302679e-03]
 [1.00428253e-01 8.99571747e-01]
 [5.33195222e-02 9.46680478e-01]
 [1.36344068e-01 8.63655932e-01]
 [1.16180959e-04 9.99883819e-01]
 [1.49475310e-08 9.99999985e-01]
 [9.46040089e-02 9.05395991e-01]
 [1.93133068e-02 9.80686693e-01]]
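
Each row of `predict_proba` holds the class probabilities in the order given by `model.classes_`, here `[-1, 1]`, so the two columns sum to 1. For a binary logistic regression, the second column is the logistic (sigmoid) function of the linear score $w \cdot x + b$. The sketch below illustrates that relationship: it recovers the score implied by the first probability printed above and maps it back. The helper names are hypothetical.

```python
import math

def sigmoid(score):
    """Map a linear score (log-odds) to P(y = 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def logit(p):
    """Inverse of sigmoid: map P(y = 1) to the log-odds."""
    return math.log(p / (1.0 - p))

# P(y = 1) for the first test review, copied from the output above.
p_positive = 3.80588670e-04

score = logit(p_positive)
print(f"implied score (log-odds): {score:.2f}")
print(f"back to probability:      {sigmoid(score):.6e}")
print(f"P(y = -1) = 1 - P(y = 1): {1 - p_positive:.6f}")
```

A score of 0 corresponds to a probability of 0.5, which is the threshold `predict` effectively uses in the binary case.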


In [14]:
#c) 
negative = prob_pred[:,0]
positive = prob_pred[:,1]

print('Most positive reviews:')
print(np.take(X_test_,np.argsort(positive)[-5:]))
print('')
print('Most negative reviews:')
print(np.take(X_test_,np.argsort(negative)[-5:]))

Most positive reviews:
180646    After much research I purchased an Urbo2 Its e...
144112    Background Ive been using Grovia diapers for f...
147949    Amazing Love Love Love it  All 5 STARS all the...
176040    I went back to work full time just six weeks a...
180953    AMAZING stroller  It took me about 2 minutes t...
Name: review, dtype: object

Most negative reviews:
175191    I had to return this stroller for three reason...
87026     First off I did manage to find this product fo...
10180     Please see my email to the companyHelloI am wr...
120707    The previous reviewers laud the piece of mind ...
178360    I have rated and left reviews for many items o...
Name: review, dtype: object


In [15]:
#d)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.9306754941469967
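
`accuracy_score` is the fraction of predictions that match the true labels. A minimal plain-Python sketch on made-up labels (the values are illustrative, and `accuracy` is a hypothetical helper, not the scikit-learn function):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Illustrative labels only: 6 of the 8 predictions are correct.
y_true = [1, 1, -1, 1, -1, 1, 1, 1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]

print(accuracy(y_true, y_pred))
```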


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)
limited_vec = CountVectorizer()
# learn words from dictionary
limited_vec.fit(significant_words)

X = baby_df['review']
y = baby_df['rating']

# split data
X_train_l,X_test_l,y_train_l,y_test_l = train_test_split(X,y, random_state = 15, test_size = 0.2)

X_test_l_ = X_test_l
# transform
X_train_l = limited_vec.transform(X_train_l)
X_test_l = limited_vec.transform(X_test_l)

model_limited = LogisticRegression(max_iter = 2000)

model_limited.fit(X_train_l,y_train_l)



LogisticRegression(max_iter=2000)

Model fitting is much faster with the limited dictionary: 12.5 s, compared with 98.9 s for the full dictionary.

In [18]:
coef = model_limited.coef_[0]

tmp = np.argsort(coef)
names = limited_vec.get_feature_names_out() 

worst = tmp[:10]
best = tmp[-10:]
print('Most negative words:', ', '.join(np.take(names,worst)))
print('Most positive words:', ', '.join(np.take(names,best)))

Most negative words: disappointed, return, waste, broke, money, work, even, would, product, less
Most positive words: car, old, able, well, little, great, easy, love, perfect, loves


In [19]:
y_pred_proba = model_limited.predict_proba(X_test_l)

lPositive = np.argsort(y_pred_proba[:,1])
lNegative = np.argsort(y_pred_proba[:,0])

print('Most positive reviews:')
print(np.take(X_test_l_,lPositive[-5:]))
print('')
print('Most negative reviews:')
print(np.take(X_test_l_,lNegative[-5:]))


Most positive reviews:
27089     I am 8 months pregnant and after our most rece...
68018     i purchased the black BUILT NY diaper bag and ...
147975    Let me start by saying that I have gone throug...
151056    I did tons of research on strollers I knew I w...
103297    I was very excited when I heard Chicco was fin...
Name: review, dtype: object

Most negative reviews:
167296    the message 8220Out of Range 8220 is in the mo...
113785    We have the Britax Marathon Convertible Car Se...
24325     This never worked right and I tried to return ...
121156    I added this product Dr Browns BPA Free Deluxe...
22491     My wife has been sucessful using the pump but ...
Name: review, dtype: object


In [20]:
#b)
tDF = pd.DataFrame({'word':names,'impact':coef})
print(tDF.sort_values('impact').to_string(index=False))

        word    impact
disappointed -2.305434
      return -2.060137
       waste -1.978250
       broke -1.675744
       money -0.889217
        work -0.623915
        even -0.535602
       would -0.333011
     product -0.319120
        less -0.167600
         car  0.066429
         old  0.104351
        able  0.211537
        well  0.502414
      little  0.509981
       great  0.931909
        easy  1.183541
        love  1.372507
     perfect  1.511950
       loves  1.716172


We can see that 'less', 'car' and 'old' are the most neutral words: their coefficients are the closest to zero.
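
The coefficients are contributions to the log-odds of a positive review: each occurrence of a word adds its coefficient to the score, which multiplies the odds by $e^{\text{coef}}$. A small sketch using a few values copied from the table above; the example review is made up, and the model's intercept is left out because it is not printed here, so only the word contributions are computed.

```python
import math

# Coefficients copied from the table above (subset).
impact = {"disappointed": -2.305434, "broke": -1.675744,
          "great": 0.931909, "love": 1.372507}

# Odds multiplier for a single occurrence of each word.
for word, coef in impact.items():
    print(f"{word:>12}: odds x {math.exp(coef):.2f}")

# Word counts of a hypothetical review: "love it, great seat, but the buckle broke".
counts = {"love": 1, "great": 1, "broke": 1}

# Sum of count * coefficient; the intercept (not shown above) is omitted.
contribution = sum(n * impact[w] for w, n in counts.items())
print(f"log-odds contribution of these words: {contribution:.6f}")
```

A single "disappointed" cuts the odds of a positive prediction to about a tenth, while a word such as 'car' (coefficient 0.066) barely moves them.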

Timing prediction and accuracy calculation

Model with all words

In [21]:
%%time
#c)
y_pred = model.predict(X_test)
print(accuracy_score(y_test,y_pred))
#hint: %time, %timeit

0.9306754941469967
Wall time: 29 ms


Model with limited dictionary

In [22]:
%%time
y_pred_l = model_limited.predict(X_test_l)
print(accuracy_score(y_test_l,y_pred_l))

0.8691493508440526
Wall time: 4 ms


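Comparing the full-dictionary model with the limited one:  
Longer prediction time: 29 ms vs 4 ms  
Better accuracy: 0.93 vs 0.87  

Note that the two models were evaluated on different test splits (`test_size = 0.25, random_state = 4` vs `test_size = 0.2, random_state = 15`), so the comparison is indicative rather than exact.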
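
A single `%%time` measurement in the millisecond range is noisy. For a steadier estimate, or when working outside a notebook, the standard-library `timeit` module can repeat the call and report the best run. A minimal sketch; `predict_stub` is a hypothetical stand-in for a call such as `model.predict(X_test)`:

```python
import timeit

def predict_stub():
    """Hypothetical stand-in for the call being timed, e.g. model.predict(X_test)."""
    return sum(i * i for i in range(10_000))

# Run the call 20 times per measurement and repeat the measurement 5 times.
calls_per_run = 20
runs = timeit.repeat(predict_stub, number=calls_per_run, repeat=5)

# The minimum is the run least disturbed by other activity on the machine.
best_per_call = min(runs) / calls_per_run
print(f"best time per call: {best_per_call * 1000:.3f} ms")
```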