### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [272]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


Here we see the first five rows of the dataset.

## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with an empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq 4$) ratings to 1 and all negative ($\leq 2$) ratings to -1.

In [273]:
#a)
# Assign the whole column at once instead of writing through chained indexing
# (baby_df['review'][i] = ...), which triggers a SettingWithCopyWarning.
# Note: str() turns a missing review (NaN) into the text 'nan', so after this
# step the column no longer contains NaN values for b) to fill.
baby_df['review'] = baby_df['review'].apply(lambda r: remove_punctuation(str(r)))

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'


True

In [274]:
#test
(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [275]:
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,These flannel wipes are OK but in my opinion n...,3
1,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5


In [276]:
baby_df.isnull().sum()

name      318
review      0
rating      0
dtype: int64

In [277]:
#b)
# Assign the result back; fillna(inplace=True) on a column selection may not update baby_df
baby_df["review"] = baby_df["review"].fillna("")

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

In [278]:
baby_df.isnull().sum()

name      318
review      0
rating      0
dtype: int64

In [279]:
#c)

baby_df.drop(baby_df.loc[baby_df["rating"] == 3].index, inplace = True)


#short test:
sum(baby_df["rating"] == 3)



0

In [280]:
baby_df.describe()

Unnamed: 0,rating
count,166752.0
mean,4.233191
std,1.295527
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


In [281]:
#d) 

# Assign the result back; replace(inplace=True) on a column selection may not update baby_df
baby_df["rating"] = baby_df["rating"].replace({1: -1, 2: -1, 4: 1, 5: 1})

#short test:
sum(baby_df["rating"]**2 != 1)


0

The rating now takes only two values: 1 for positive reviews and -1 for negative ones.
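
As a rough sketch, the four preparation steps can also be written as vectorized pandas operations. The toy DataFrame below is a hypothetical stand-in for amazon_baby.csv, and the `sentiment` column name is an assumption made for illustration. Unlike the cells above, this sketch fills missing reviews before removing punctuation, so a missing review ends up as an empty string rather than the text 'nan'.

```python
import string

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Same idea as the helper defined at the top of the notebook
    return text.translate(str.maketrans('', '', string.punctuation))


# Hypothetical rows, for illustration only
toy_df = pd.DataFrame({
    "review": ["Great wipes, love them!", np.nan, "It's OK.", "Broke in a week..."],
    "rating": [5, 4, 3, 1],
})

# b) then a): fill missing reviews first, then strip punctuation
toy_df["review"] = toy_df["review"].fillna("").apply(remove_punctuation)

# c) keep only the non-neutral ratings
toy_df = toy_df[toy_df["rating"] != 3].copy()

# d) ratings >= 4 become 1, the remaining ones (<= 2) become -1
toy_df["sentiment"] = np.where(toy_df["rating"] >= 4, 1, -1)

print(toy_df)
```

Writing the label to a separate column keeps the original rating available for later checks; overwriting `rating` as in the exercise works just as well.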

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [282]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

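# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())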
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]




In [283]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)
print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should note a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words such as "I" (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary, such as "love" in the last test sentence.
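
These three properties can be checked on a tiny example. This is only a sketch: the two sentences are made up, and the custom `token_pattern` shown is one possible way to keep one-letter words, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like apples", "apples like I"]

# Default settings: text is lowercased and only tokens with 2+ characters are kept
default_vec = CountVectorizer()
X_default = default_vec.fit_transform(docs).toarray()

# 1) Word order is ignored: both sentences map to the same vector
same_rows = (X_default[0] == X_default[1]).all()
print(sorted(default_vec.vocabulary_), same_rows)

# 2) A token pattern that also accepts single-character tokens keeps "i"
one_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
one_char_vec.fit(docs)
print(sorted(one_char_vec.vocabulary_))

# 3) Words unseen during fit ("we", "love") are silently dropped by transform
X_new = default_vec.transform(["We love apples"]).toarray()
print(X_new)
```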

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [284]:
#a)
from sklearn.model_selection import train_test_split

reviews = baby_df.review.values.reshape(-1,1)
rating = baby_df.rating.values.reshape(-1)

reviews_train, reviews_test, y_train, y_test = train_test_split(reviews,rating, test_size=0.3)

The dataset is split so that 70% of the reviews go to the training set and 30% to the test set.
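
Because `train_test_split` shuffles the data randomly, the numbers reported in the rest of this notebook change from run to run. A minimal sketch of a reproducible split is shown below; the toy data, the seed value 0 and the use of `stratify` are assumptions for illustration and are not part of the original solution.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 positive and 4 negative reviews
texts = [f"review {i}" for i in range(20)]
labels = [1] * 16 + [-1] * 4

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

# Repeating the call with the same seed gives the same split
tr_x2, te_x2, _, _ = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

print(len(tr_x), len(te_x), te_x == te_x2)
```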


In [285]:
#b)

vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(reviews_train.ravel())

# print(vectorizer.get_feature_names())
# print(X_train.todense())


## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [286]:
#a)
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)


We create a logistic regression model with a larger iteration limit (max_iter=2000 instead of the default 100), because on a dataset this large the solver may need more iterations to converge.

In [287]:
#b)
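# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out().tolist()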
coef = model.coef_.reshape(-1).tolist()

dic = {}
for F, C in zip(feature_names, coef):
    dic[F] = float(C)

sorted_dict = dict(sorted(dic.items(), key=lambda item: item[1]))
lowest = list(sorted_dict.items())[:10]
highest = list(sorted_dict.items())[-10:]

print("Negative: \nword\t\tcoef")
for elem in lowest:
    print(elem)


print("Positive: \nword\t\tcoef")
for elem in highest:
    print(elem)





Negative: 
word		coef
('dissapointed', -2.869900335535926)
('worst', -2.7357007953485994)
('worthless', -2.6242619912601226)
('disappointing', -2.5043846004027444)
('useless', -2.480909383366112)
('poor', -2.4407002284006096)
('theory', -2.3905292606590316)
('poorly', -2.3738154639875897)
('shame', -2.271842241736584)
('disappointed', -2.1927525595269546)
Positive: 
word		coef
('con', 1.8068141253097656)
('perfect', 1.8143354853374791)
('perfectly', 1.8494611041241396)
('minor', 1.8627812790578622)
('penny', 1.9059388844080818)
('ply', 1.9585718496227098)
('excellent', 2.0102268675151906)
('worry', 2.0340061491520944)
('skeptical', 2.1840167171366707)
('rich', 2.2259654712212242)


Here we see the 10 most positive words (largest coefficients) and the 10 most negative words (smallest coefficients).

Most of these words make sense, such as 'worst', 'disappointing' and 'poorly'. Others are less obvious: 'theory', for example, is not negative in itself, but the model still gives it a strongly negative weight. The same holds for the positive words: some are clearly positive, like 'excellent' or 'perfect', while others, such as 'skeptical' or 'worry', presumably become positive only in context. Since the train/test split is random, the exact lists can differ between runs.

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [288]:
#a) Predict the sentiment of test data reviews.
X_test = vectorizer.transform(reviews_test.ravel())
y_pred = model.predict(X_test)
print(y_pred)

[ 1  1  1 ...  1  1 -1]


In [289]:
#b) Predict the sentiment of test data reviews in terms of probability.
probability_arr = model.predict_proba(X_test)
print(probability_arr)
#hint: model.predict_proba()

[[7.51939513e-05 9.99924806e-01]
 [3.74526008e-02 9.62547399e-01]
 [1.75043759e-02 9.82495624e-01]
 ...
 [1.41483347e-01 8.58516653e-01]
 [2.28083766e-03 9.97719162e-01]
 [8.59815936e-01 1.40184064e-01]]


Each row of this array holds the two class probabilities for one review. The columns follow the order of model.classes_, which here is [-1, 1]: the first column is the probability that the review is negative, the second the probability that it is positive.
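
A small check of the column order, as a sketch on made-up one-dimensional data (the arrays `X` and `y` and the name `p_positive` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small feature values are negative, large ones positive
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X, y)

# classes_ gives the column order used by predict_proba
print(clf.classes_)

proba = clf.predict_proba(X)

# Look up the column of the positive class instead of hard-coding its position
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]
print(p_positive)
```

Looking the column up through `classes_` keeps the code correct even if the labels are encoded differently, for example as 0 and 1.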

In [290]:
#c) Find five most positive and most negative reviews.
class1=[]
class2=[]
for row in probability_arr:
  class1.append(row[0])
  class2.append(row[1])

class1 = [(idx, item) for idx,item in enumerate(class1)]
class2 = [(idx, item) for idx,item in enumerate(class2)]

class1.sort(key=lambda a: a[1], reverse=True)
class2.sort(key=lambda a: a[1], reverse=True)

print("------ Negative reviews ------ ")
for elem in class1[:5]:
  print(reviews_test[elem[0]][0], '\n')

print("------ Positive reviews ------ ")
for elem in class2[:5]:
  print(reviews_test[elem[0]][0], '\n')

#hint: use the results of b)

------ Negative reviews ------ 
I have NEVER written a review before for anything DO NOT BUY THIS PRODUCTThis is a very expensive monitor and the features are awesome It is really a fantastic monitor truly The clarity is great the VOX feature is awesome the intercom is fabulous for older kids although there is a delay on it so after you speak you cannot hear the response as it wont pick up sounds for several seconds after you let go of the talk button And I REALLY hate that you cannot mute the volume The lullaby feature is nice but I do NOT want to sit there and listen to the lullaby myself But you must No muteBUTafter only 10 months of use our camera just stopped working I unplugged the camera to move it to a safer location as our son had become mobile and I was concerned he could reach the cord and pull it down onto himself When I plugged the camera back in nothing The green power light would not come onI naturally thought it was the outlet Checked the fuse box etc But it was not the

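These are the five most positive and the five most negative reviews. They were correctly classified. They are also quite long, which is probably why they come out as the **most** positive or negative: a long review contains more words that push the prediction in one direction.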
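
Selecting the extreme reviews can also be done with `numpy.argsort` on the positive-class probability. The sketch below uses made-up reviews and probabilities; in the notebook the probabilities would come from the second column of `probability_arr`.

```python
import numpy as np

# Hypothetical reviews and their predicted probability of being positive
reviews = np.array(["bad", "fine", "great", "awful", "good"])
p_positive = np.array([0.10, 0.55, 0.98, 0.02, 0.80])

k = 2
order = np.argsort(p_positive)        # ascending: most negative first

most_negative_idx = order[:k]         # lowest probability of being positive
most_positive_idx = order[::-1][:k]   # highest probability of being positive

print("Negative:", reviews[most_negative_idx].tolist())
print("Positive:", reviews[most_positive_idx].tolist())
```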

In [291]:
#d) Calculate the accuracy of predictions.
correct_num = 0
total_num = len(y_test)
for x,y in zip(y_pred, y_test):
  if x == y:
    correct_num += 1

print("Accuracy: ", correct_num/total_num)

Accuracy:  0.9316555391196578


The accuracy is high: about 93% of the test reviews are classified correctly.
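
The manual loop is equivalent to `sklearn.metrics.accuracy_score`. When the classes are unbalanced, it can also help to compare the result with the accuracy of always predicting the most frequent class. The sketch below uses made-up labels; the baseline comparison is an addition for illustration and was not computed in the original notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
y_true = np.array([1, 1, 1, -1, 1, -1, 1, 1])
y_hat = np.array([1, 1, 1, -1, 1, 1, 1, 1])

acc = accuracy_score(y_true, y_hat)

# Same value, computed as the fraction of matching labels
acc_manual = np.mean(y_true == y_hat)

# Accuracy of a trivial model that always predicts the majority class
baseline = max(np.mean(y_true == 1), np.mean(y_true == -1))

print(acc, acc_manual, baseline)
```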

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [292]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [293]:
#a)
vectorizer_limited = CountVectorizer(vocabulary=significant_words)
X_train_limited = vectorizer_limited.fit_transform(reviews_train.ravel())
X_test_limited = vectorizer_limited.transform(reviews_test.ravel())

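# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer_limited.get_feature_names_out().tolist())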



['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 'work', 'product', 'money', 'would', 'return']


These are all the words in our limited dictionary.

In [294]:
model_limited = LogisticRegression() 
model_limited.fit(X_train_limited, y_train)

In [295]:
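# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer_limited.get_feature_names_out().tolist()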
coef = model_limited.coef_.reshape(-1).tolist()

dic = {}
for F, C in zip(feature_names, coef):
    dic[F] = float(C)

sorted_dict = dict(sorted(dic.items(), key=lambda item: item[1]))
lowest = list(sorted_dict.items())[:10]
highest = list(sorted_dict.items())[-10:]

print("Negative: \nword\t\tcoef")
for elem in lowest:
    print(elem)


print("Positive: \nword\t\tcoef")
for elem in highest:
    print(elem)


Negative: 
word		coef
('disappointed', -2.332719838494596)
('return', -2.031332706376337)
('waste', -2.0001616257882153)
('broke', -1.6468126065831814)
('money', -0.9293952568578915)
('work', -0.6315629357156675)
('even', -0.5250012409028297)
('would', -0.3390215413130481)
('product', -0.3318396622410652)
('less', -0.16181879869297328)
Positive: 
word		coef
('car', 0.0696477394780101)
('old', 0.0788006838351267)
('able', 0.23642895361518534)
('well', 0.47641565543494624)
('little', 0.4918529480871129)
('great', 0.9652752186988836)
('easy', 1.1842615364431641)
('love', 1.357962848392536)
('perfect', 1.495127575171717)
('loves', 1.671256459482649)




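b) Here we can see the impact of every word in the dictionary. Words such as 'car', 'old' and 'less' have coefficients close to zero, so they barely influence the prediction. 'disappointed', 'return' and 'waste' are the strongest negative signals, and 'loves', 'perfect' and 'love' the strongest positive ones.
Compared with the full-dictionary model above, the coefficient of 'disappointed' is slightly more negative (about -2.33 versus -2.19), while the coefficient of 'perfect' is lower (about 1.50 versus 1.81).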

In [296]:
#a) Predict the sentiment of test data reviews.
y_pred_limited = model_limited.predict(X_test_limited)
print(y_pred_limited)

#b) Predict the sentiment of test data reviews in terms of probability.
probability_arr_limited = model_limited.predict_proba(X_test_limited)
print(probability_arr_limited)

[1 1 1 ... 1 1 1]
[[0.00155491 0.99844509]
 [0.21430537 0.78569463]
 [0.14294781 0.85705219]
 ...
 [0.08963436 0.91036564]
 [0.08880771 0.91119229]
 [0.27684922 0.72315078]]


In [297]:
#c) Find five most positive and most negative reviews.
class1=[]
class2=[]
for row in probability_arr_limited:
  class1.append(row[0])
  class2.append(row[1])

class1 = [(idx, item) for idx,item in enumerate(class1)]
class2 = [(idx, item) for idx,item in enumerate(class2)]

class1.sort(key=lambda a: a[1], reverse=True)
class2.sort(key=lambda a: a[1], reverse=True)

print("------ Negative reviews ------ ")
for elem in class1[:5]:
  print(reviews_test[elem[0]][0])
  print("class:", y_test[elem[0]], '\n')

print("------ Positive reviews ------ ")
for elem in class2[:5]:
  print(reviews_test[elem[0]][0])
  print("class:", y_test[elem[0]], '\n')


------ Negative reviews ------ 
My wife has been sucessful using the pump but generally speaking the pump has numerous idiosyncrasies that you should consider before purchasing it  Had we done more homework we likely would have chosen a different pump  If only you could return a personal item to Babies r Us once youve opened it we wouldMy main complaint is that the pump is not very practical  While my wife expresses a fair amount of milk using the product she obtained the same amount of milk using the Medela pump Symphony  There is a long unrelated story why we rented a Medela pump for 2 weeks but it provided a useful period to compare the products sidebysideIn order of importanceSetup  A lot of parts to plug in connect and ultimately clean not to mention misplace or loose  Medela has this pump beat handsdown on this front  Managing this process at home is a chore if youre pumping at work I would imagine this becomes even more difficult  I suppose this is why Avent dont provide the opt

One of the reviews listed as most negative is misclassified, so this model does somewhat worse than the one trained on the full dictionary.

In [298]:
#d) Calculate the accuracy of predictions.

correct_num = 0
total_num = len(y_test)
for x,y in zip(y_pred_limited, y_test):
  if x == y:
    correct_num += 1

print("Accuracy: ", correct_num/total_num)

Accuracy:  0.8686682924879063


Here the accuracy (about 86.9%) is clearly lower than before (about 93.2%), so the predictions were better with the full dictionary.

c)  
Fitting logistic regression with the full dictionary: 2m 2.4s  
Fitting logistic regression with the limited dictionary: 0.2s

The difference in time is significant: with the limited dictionary the model is fitted in a fraction of a second, compared with about two minutes for the full dictionary. The price is lower accuracy.
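
One way to take such timings in code is `time.perf_counter`. The sketch below is an assumption about how the measurement could be done, not the method used above. It runs on a tiny made-up corpus, so the absolute times it prints are not comparable with the figures reported here.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def timed_fit(vectorizer, texts, labels):
    """Vectorize the texts, fit a model, and return (model, n_features, seconds)."""
    start = time.perf_counter()
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression(max_iter=2000).fit(X, labels)
    return clf, X.shape[1], time.perf_counter() - start


# Hypothetical toy corpus
texts = [
    "love it great product",
    "waste of money broke",
    "great and easy",
    "broke fast would return",
] * 50
labels = [1, -1, 1, -1] * 50

_, n_full, t_full = timed_fit(CountVectorizer(), texts, labels)
_, n_small, t_small = timed_fit(
    CountVectorizer(vocabulary=["love", "great", "waste", "broke"]), texts, labels
)

print(f"full dictionary:    {n_full} features, {t_full:.4f}s")
print(f"limited dictionary: {n_small} features, {t_small:.4f}s")
```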