### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression
from timeit import default_timer as timer

def remove_punctuation(text):
    # string is already imported at the top of the cell
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()


Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and all negative ($\leq$2) ratings to -1.
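
The four steps can be sketched on a tiny hand-made frame before touching the real data. The `toy` frame below is invented for illustration and is not part of amazon_baby.csv; the punctuation removal mirrors the `remove_punctuation` helper defined above, and `np.where` is just one possible way to recode the ratings.

```python
import string

import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "review": ["Great, love it!", np.nan, "It's OK.", "Broke in a week..."],
    "rating": [5, 4, 3, 1],
})

translator = str.maketrans("", "", string.punctuation)

# a) strip punctuation from real strings, leaving missing values untouched
toy["review"] = toy["review"].apply(
    lambda r: r.translate(translator) if isinstance(r, str) else r
)
# b) replace missing reviews with the empty string
toy["review"] = toy["review"].fillna("")
# c) drop neutral ratings; .copy() gives an independent frame to modify
toy = toy[toy["rating"] != 3].copy()
# d) map ratings >= 4 to 1 and ratings <= 2 to -1
toy["rating"] = np.where(toy["rating"] >= 4, 1, -1)

print(toy)
```

Handling missing values before or separately from the string processing matters: calling `str()` on a missing value would turn it into the literal text "nan".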

In [None]:
#a)
# Only real strings are processed; missing values stay NaN so that b) can replace them
baby_df["review"] = baby_df["review"].apply(lambda r: remove_punctuation(r) if isinstance(r, str) else r)


#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [None]:
#b)
baby_df['review'] = baby_df['review'].fillna("")

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

In [None]:
#c)
baby_df = baby_df[baby_df.rating != 3].copy()

#short test:
sum(baby_df["rating"] == 3)

0

In [None]:
#d) 
baby_df['rating'] = baby_df['rating'].replace([2,1],-1)
baby_df['rating'] = baby_df['rating'].replace([4,5],1)


print(np.unique(baby_df['rating']))

#short test:
sum(baby_df["rating"]**2 != 1)

[-1  1]


0

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
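
The idea can be reproduced by hand in a few lines. This is a simplified sketch, not what CountVectorizer does internally: it splits on whitespace only, and the helper name `to_count_vector` is made up for this example.

```python
from collections import Counter

sentences = ["we like apples", "we like like apples and oranges"]

# The dictionary is the sorted set of all words seen in the sentences
dictionary = sorted({word for s in sentences for word in s.split()})

def to_count_vector(sentence, dictionary):
    """Return one count per dictionary word, in dictionary order."""
    counts = Counter(sentence.split())
    return [counts[word] for word in dictionary]

vectors = [to_count_vector(s, dictionary) for s in sentences]
print(dictionary)
print(vectors)
```

Each vector has as many entries as the dictionary has words, and a repeated word ("like" in the second sentence) simply produces a larger count.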

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out().tolist())
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [None]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(vectorizer.get_feature_names_out().tolist())
print(X_test_example.todense())

['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in its dictionary.
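
These three facts can be checked on a tiny example. The sentences are invented for illustration; the custom `token_pattern` shown here is one way to keep one-letter words, assuming the default pattern requires at least two word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like apples", "apples like I"]

# Default settings: tokens need at least two word characters, so "I" is dropped
default_vec = CountVectorizer()
X_default = default_vec.fit_transform(docs).toarray()

# A token pattern that also accepts one-character tokens keeps "i"
one_letter_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
one_letter_vec.fit(docs)

# Words unseen during fit ("love") are silently ignored at transform time
X_unseen = default_vec.transform(["apples love"]).toarray()

print(sorted(default_vec.vocabulary_))
print(sorted(one_letter_vec.vocabulary_))
print(X_default)
print(X_unseen)
```

Both sentences contain the same words in a different order, so their rows in `X_default` are identical.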

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [None]:
#a)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(baby_df['review'],baby_df['rating'],
                                                    test_size=0.33, random_state=42)


In [None]:
#b)
vectorizer = CountVectorizer()

X_train_v = vectorizer.fit_transform(X_train)

X_test_v = vectorizer.transform(X_test)

## Exercise 3 
a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print the 10 most positive and the 10 most negative words.
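
Ranking words by their coefficient comes down to sorting indices. A minimal sketch, with hypothetical words and coefficients laid out like `model.coef_` (one row, one column per word):

```python
import numpy as np

# Hypothetical feature names and coefficients, invented for illustration
words = np.array(["bad", "fine", "great", "awful", "nice"])
coef = np.array([[-1.2, 0.1, 2.0, -2.5, 0.8]])

# argsort returns indices from the smallest to the largest coefficient
order = np.argsort(coef[0])
most_negative = words[order[:2]]
# Take the last entries and reverse them so the largest coefficient comes first
most_positive = words[order[-2:][::-1]]

print(most_negative)
print(most_positive)
```

Because `argsort` sorts in ascending order, the most negative words sit at the start of `order` and the most positive ones at the end.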

In [None]:
#a)
model = LogisticRegression(max_iter=10000)

start = timer()

model.fit(X_train_v, y_train)

end = timer()

time_1 = end - start

In [None]:
#b)

best = np.argsort(model.coef_)
feature_names = vectorizer.get_feature_names_out()
print("10 most positive: ", feature_names[best[0,-10:]])
print("10 most negative: ",feature_names[best[0, 0:10]])

#hint: model.coef_, vectorizer.get_feature_names_out()

10 most positive:  ['perfect' 'sooner' 'glad' 'rich' 'saves' 'ply' 'minor' 'lifesaver' 'con'
 'wonderfully']
10 most negative:  ['dissapointed' 'worst' 'worthless' 'useless' 'poorly' 'nope'
 'disappointing' 'disappointed' 'disappointment' 'slowflow']


In [None]:
"""
As we can see, while the 10 most negative words are clearly negative, some of the 10 most positive
ones are not, such as "sooner" or "minor".
"""

## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.
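
Accuracy is simply the fraction of reviews whose predicted label equals the true one; for a scikit-learn classifier, `score` returns this mean accuracy. A minimal sketch with hypothetical labels in the same -1 / 1 coding:

```python
import numpy as np

# Hypothetical true labels and predictions, invented for illustration
y_true = np.array([1, 1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, 1])

# Comparing element-wise gives booleans; their mean is the share of correct predictions
accuracy = float(np.mean(y_pred == y_true))
print(accuracy)
```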

In [None]:
#a)
y_predict = model.predict(X_test_v)

In [None]:
#b)
y_proba = model.predict_proba(X_test_v)
y_proba
#hint: model.predict_proba()

array([[0.51498469, 0.48501531],
       [0.82814556, 0.17185444],
       [0.98794345, 0.01205655],
       ...,
       [0.38584744, 0.61415256],
       [0.0120209 , 0.9879791 ],
       [0.00985932, 0.99014068]])
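
The order of the two columns follows `classes_` of the fitted model, so with labels -1 and 1 the second column is the probability of a positive review. A small sketch on a toy one-feature problem (data invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy problem: larger x means a positive label
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Columns of predict_proba are ordered like clf.classes_
print(clf.classes_)
positive_col = list(clf.classes_).index(1)
print(proba[:, positive_col])
```

Looking up the column through `classes_` instead of hard-coding an index keeps the code correct if the label coding changes.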

In [None]:
"""
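Each row corresponds to one review and holds two probabilities: that the review is negative
(first column, class -1) and that it is positive (second column, class 1).
The column order follows model.classes_.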
"""

In [None]:
#c)

data = np.argsort(y_proba[:,1])[-5:]

print(X_test.iloc[data])
print(y_test.iloc[data])

data = np.argsort(y_proba[:,1])[:5]

print(X_test.iloc[data])
print(y_test.iloc[data])


#hint: use the results of b)

57108     I started wearing the Babyplus when I was 18 w...
180646    After much research I purchased an Urbo2 Its e...
100166    I bought this carrier when my daughter was abo...
129722    This is a review of the 2012 Bumbleride Flite ...
168086    Buttons vs Best Bottoms reviewFirst thing I wa...
Name: review, dtype: object
57108     1
180646    1
100166    1
129722    1
168086    1
Name: rating, dtype: int64
147902    My disappointment with this product prompted m...
175191    I had to return this stroller for three reason...
77072     I thought it sounded great to have different t...
89902     I am so incredibly disappointed with the strol...
133297    The first monitor broke within 1 month of use ...
Name: review, dtype: object
147902   -1
175191   -1
77072    -1
89902    -1
133297   -1
Name: rating, dtype: int64


In [None]:
"""
The true ratings of the 5 most positive and the 5 most negative reviews agree with the predictions:
all five top reviews are rated 1 and all five bottom reviews are rated -1.
"""

In [None]:
#d) 
print(model.score(X_train_v, y_train))
print(model.score(X_test_v, y_test))

0.9708743947083411
0.930945501462865


In [None]:
"""
The accuracy is about 0.04 higher on the training set (0.971) than on the test set (0.931),
which points to mild overfitting.
"""

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.
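
Passing a `vocabulary` to CountVectorizer fixes both the set of counted words and the column order. A minimal sketch with a short, hypothetical vocabulary and invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A short, hypothetical vocabulary; the columns follow the order of this list
small_vocabulary = ["love", "great", "broke"]
small_vec = CountVectorizer(vocabulary=small_vocabulary)

# Only the listed words are counted; everything else is ignored
X_small = small_vec.fit_transform(
    ["I love love this great product", "it broke"]
).toarray()
print(X_small)
```

With only a handful of columns the resulting matrix is much narrower than the full bag of words, which is why training on it can be expected to be much faster.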

In [None]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [None]:
#a)
vectorizer_2 = CountVectorizer(vocabulary=significant_words)
X_train_2v = vectorizer_2.fit_transform(X_train)
X_test_2v = vectorizer_2.transform(X_test)

In [None]:
model_2 = LogisticRegression(max_iter=10000)

start_2 = timer()

model_2.fit(X_train_2v, y_train)

end_2 = timer()

time_2 = end_2 - start_2

In [None]:
best_2 = np.argsort(model_2.coef_)
feature_names_2 = vectorizer_2.get_feature_names_out()
print("10 most positive: ", feature_names_2[best_2[0,-10:]])
print("10 most negative: ",feature_names_2[best_2[0, 0:10]])

y_predict_2 = model_2.predict(X_test_2v)
y_proba_2 = model_2.predict_proba(X_test_2v)

# Keep the two index sets in separate variables so that neither overwrites the other
most_positive_2 = np.argsort(y_proba_2[:,1])[-5:]
most_negative_2 = np.argsort(y_proba_2[:,1])[:5]

# Five most negative reviews; most_positive_2 can be inspected the same way
print(X_test.iloc[most_negative_2])
print(y_test.iloc[most_negative_2])


10 most positive:  ['old' 'car' 'able' 'well' 'little' 'great' 'easy' 'love' 'perfect'
 'loves']
10 most negative:  ['disappointed' 'return' 'waste' 'broke' 'money' 'work' 'even' 'would'
 'product' 'less']
41581     Looks really cute however the cloth smells fun...
168391    I loved all the features of the car seat  It i...
35763     Day 1 Assembled it Had it up and running playi...
56798     I was excited to give these instruments to my ...
111079    I searched for Baby Blanket Made in the USA an...
Name: review, dtype: object
41581    -1
168391    1
35763    -1
56798    -1
111079   -1
Name: rating, dtype: int64

In [None]:
"""
As you can see, the model assigned sensible signs to the words in the dictionary.
Among the 5 reviews it rates as most negative, one (index 168391)
actually has a positive rating.
"""

In [None]:
#b)
for w, k in sorted(zip(significant_words, model_2.coef_[0]), key = lambda x: x[1]):
  print("{} and impact: {}".format(w, k))

disappointed and impact: -2.387758373661248
return and impact: -2.0878135743686914
waste and impact: -1.9981697871878579
broke and impact: -1.6429197636994468
money and impact: -0.9403220306130846
work and impact: -0.6213267170831115
even and impact: -0.49910674931051097
would and impact: -0.3508212078627339
product and impact: -0.3075479144735849
less and impact: -0.20315523738278388
old and impact: 0.06100797468095392
car and impact: 0.08451117821537883
able and impact: 0.2062102887977539
well and impact: 0.49598179029584777
little and impact: 0.5190223589367308
great and impact: 0.912380499218431
easy and impact: 1.1796233668038638
love and impact: 1.3584748195710263
perfect and impact: 1.5133629421180903
loves and impact: 1.7033230640221935


In [None]:
"""
The word with the most negative impact is "disappointed" and the word with
the most positive impact is "loves".
Some words, like "old" and "car", have a positive impact, but it is very small.
"""

In [None]:
#c)
print("Accuracy of first model on training set: ", model.score(X_train_v, y_train))
print("Accuracy of first model on test set: ", model.score(X_test_v, y_test))
print("Training time of first model (s): ", time_1)
print()
print("Accuracy of second model on training set: ", model_2.score(X_train_2v, y_train))
print("Accuracy of second model on test set: ", model_2.score(X_test_2v, y_test))
print("Training time of second model (s): ", time_2)

#hint: %time, %timeit

Accuracy of first model on training set:  0.9708743947083411
Accuracy of first model on test set:  0.930945501462865
Training time of first model (s):  83.07514877200003

Accuracy of second model on training set:  0.8672341415823063
Accuracy of second model on test set:  0.8675062239909865
Training time of second model (s):  0.42288639699995656


In [None]:
"""
The first model took about 83 s to train, roughly 200 times longer than the second (about 0.42 s).
The first model is more accurate by about 0.10 on the training set and about 0.06 on the test set.
Interestingly, the training and test accuracy of the second model are almost identical (both about 0.87),
which suggests that it does not overfit.
We can conclude that a reasonably good model can be trained very quickly, while a better one
takes much longer to train.
"""