### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [1]:
import pandas as pd
import time
from sklearn.linear_model import LogisticRegression
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [2]:
#a)
baby_df.review = remove_punctuation(baby_df.review.str)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

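# remove_punctuation() calls .translate() on its argument. Passing the Series' .str accessor applies the translation
# to every review at once (Series.str.translate) and leaves missing values as NaN.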

True
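
As a small illustration of what `remove_punctuation` does, the sketch below applies the same `str.maketrans` table to a single made-up string (the sample sentence is an assumption, not taken from the dataset). Words separated only by punctuation, with no space, end up glued together, which is where tokens such as `itThis` in the cleaned review above come from.

```python
import string

# Translation table that deletes every character in string.punctuation.
translator = str.maketrans('', '', string.punctuation)

sample = "Non-stop crying...until now! It's great."  # made-up example text
cleaned = sample.translate(translator)
print(cleaned)
```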

In [3]:
#b)
baby_df.review = baby_df.review.fillna('')

#short test (NaN is the only value that is not equal to itself, so this would be False if row 38 were still NaN):
baby_df["review"][38] == baby_df["review"][38]

# .fillna('') replaces NaN values with the empty string ''.

True

In [4]:
#c)
baby_df = baby_df.drop(baby_df[baby_df.rating == 3].index)
baby_df.head()

#short test:
sum(baby_df["rating"] == 3)

# .drop() removes all entries with rating = 3, because they have neutral sentiment.

0

In [5]:
#d)
baby_df.loc[baby_df.rating <= 2, 'rating'] = 1
baby_df.loc[baby_df.rating >= 4, 'rating'] = -1

#short test:
sum(baby_df["rating"]**2 != 1)

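# .loc assigns the new labels in place. Note that this cell maps negative ratings (<= 2) to 1 and positive ratings (>= 4) to -1,
# i.e. the reverse of what d) asks for. The rest of the notebook, including the printed outputs, relies on this encoding:
# class 1 = negative, class -1 = positive.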

0
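
The exercise asks for positive ratings to become 1 and negative ones -1. One way to do that in a single step, sketched here on a made-up frame (the `toy` data and the `sentiment` column name are assumptions for illustration), is `numpy.where`. A single-step mapping also avoids a pitfall of two successive assignments: a value written by the first assignment can be matched again by the second condition.

```python
import numpy as np
import pandas as pd

# Made-up ratings with the neutral 3s already removed.
toy = pd.DataFrame({'rating': [1, 2, 4, 5, 5]})

# Map in one pass: ratings >= 4 become 1, everything else (<= 2) becomes -1.
toy['sentiment'] = np.where(toy['rating'] >= 4, 1, -1)
print(toy['sentiment'].tolist())

# Pitfall: assigning 1 first and then testing "<= 2" also catches the new 1s.
naive = toy['rating'].copy()
naive[naive >= 4] = 1
naive[naive <= 2] = -1
print(naive.tolist())
```

In the second variant every row ends up as -1, so the order of the two assignments matters whenever the new labels overlap with the old rating values.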

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms strings into $n$-dimensional vectors. The number of dimensions is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

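# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist())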
print(X_train_example.todense())



['adore', 'and', 'apples', 'bananas', 'dislike', 'hate', 'like', 'oranges', 'they', 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [7]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words which are not in its dictionary.
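
The second and third points can be checked directly. The sketch below, using made-up sentences, compares the default tokenizer with one built from a custom `token_pattern` that also accepts one-character tokens; the pattern shown is one possible choice, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I adore bananas", "We love a banana"]  # made-up sentences

# Default pattern keeps only tokens of two or more word characters.
default_vec = CountVectorizer().fit(docs)

# Custom pattern (an illustrative choice) also keeps one-letter tokens.
single_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

# vocabulary_ maps each token to its column index.
print(sorted(default_vec.vocabulary_))
print(sorted(single_vec.vocabulary_))

# Words unseen during fit ("mangoes") are silently dropped at transform time.
unseen_total = default_vec.transform(["We love mangoes"]).sum()
print(unseen_total)
```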

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [8]:
#a)
from sklearn.model_selection import train_test_split

ratings_train, ratings_test, reviews_train, reviews_test = train_test_split(baby_df.rating, baby_df.review)

# train_test_split splits the ratings and the reviews into training and test sets (by default 25% of the rows go to the test set).
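
`train_test_split` returns a train/test pair for each array it is given, in the order the arrays were passed. The sketch below uses made-up data and two optional arguments that the cell above does not use: `random_state` (for a reproducible split) and `stratify` (to keep the class ratio the same in both parts). The chosen values are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up labels and texts: 8 examples of class 1 and 4 of class -1.
labels = [1] * 8 + [-1] * 4
texts = [f"review {i}" for i in range(12)]

y_train, y_test, x_train, x_test = train_test_split(
    labels, texts,
    test_size=0.25,      # 3 of the 12 examples go to the test set
    random_state=0,      # fixed seed, so the split is reproducible
    stratify=labels,     # preserve the 2:1 class ratio in both parts
)

print(len(x_train), len(x_test))
print(sorted(y_test))
```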

In [9]:
#b)
vectorizer = CountVectorizer()

vector_train_rev = vectorizer.fit_transform(list(reviews_train))

vector_test_rev = vectorizer.transform(list(reviews_test))

# CountVectorizer turns each review into a vector of word counts. The dictionary is built from the training reviews only
# (fit_transform); the test reviews are then encoded with that same dictionary (transform).

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [10]:
#a)
model = LogisticRegression(solver='lbfgs', max_iter=1100)
model.fit(vector_train_rev, ratings_train)

# Logistic regression models the probability of a binary outcome, here the sentiment label of a review.
# solver is set explicitly and max_iter is raised so that lbfgs has enough iterations to converge; with the defaults
# scikit-learn printed warnings for this data.

In [11]:
#b)
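# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
words = list(zip(model.coef_[0], vectorizer.get_feature_names_out()))
sorted_words = sorted(words, key=lambda x: x[0])

# coef_ refers to class 1, which in this notebook marks the negative reviews (see In [5]),
# so the lowest coefficients belong to the most positive words.
most_positive_words = [word[1] for word in sorted_words[:10]]
most_negative_words = [word[1] for word in sorted_words[-10:]]

print('Most positive words:', most_positive_words)
print('Most negative words:', most_negative_words)

#hint: model.coef_, vectorizer.get_feature_names_out()

# Pairing each coefficient with its word and sorting by coefficient gives the 10 most positive and 10 most negative words.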

Most positive words: ['lifesaver', 'saves', 'thankful', 'amazed', 'rich', 'excellent', 'worry', 'breeze', 'pleasantly', 'perfect']
Most negative words: ['unacceptable', 'unusable', 'worthless', 'intelligent', 'useless', 'theory', 'poorly', 'disappointing', 'dissapointed', 'worst']
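
When reading `model.coef_`, it helps to remember that for a binary problem the coefficients refer to the second entry of `model.classes_`: a positive coefficient pushes the prediction towards that class. The sketch below shows this on four made-up documents; the data and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good good", "good", "bad bad", "bad"]  # made-up training texts
labels = [1, 1, -1, -1]                          # here 1 = positive, -1 = negative

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# classes_ is sorted, so coef_[0] describes evidence for classes_[1].
print(clf.classes_)
for word, col in sorted(vec.vocabulary_.items()):
    print(word, round(float(clf.coef_[0][col]), 3))
```

If the two labels were swapped, the signs of the coefficients would swap too, so which end of a sorted coefficient list counts as "positive" depends on the label encoding.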


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [12]:
#a)
begin = time.time()
predictions = model.predict(vector_test_rev)
predictions_time_to_do = time.time() - begin
print(predictions)

# .predict() returns the predicted label for each test review. time.time() is called before and after the call so that
# the evaluation time can be compared in exercise 5.

[-1 -1 -1 ... -1 -1 -1]


In [13]:
#b)
probability_predictions = model.predict_proba(vector_test_rev)
print(probability_predictions)

#hint: model.predict_proba()

# predict_proba() returns an array with one row per review and one column per class. The columns follow model.classes_,
# so here they hold the probabilities of class -1 and class 1, in that order.

[[9.93564437e-01 6.43556330e-03]
 [9.99999998e-01 1.59579556e-09]
 [9.99999880e-01 1.19526503e-07]
 ...
 [9.99814359e-01 1.85641203e-04]
 [9.94388893e-01 5.61110684e-03]
 [9.76840033e-01 2.31599670e-02]]


In [14]:
#c)
reviews = list(zip(probability_predictions, reviews_test))
sorted_reviews = sorted(reviews, key=lambda x: x[0][1])

most_positive_reviews = [review[1] for review in sorted_reviews[:5]]
most_negative_reviews = [review[1] for review in sorted_reviews[-5:]]

print('The most positive reviews:')
for x, review in enumerate(most_positive_reviews):
    print(f'{x+1}. {review}')

print('\nThe most negative reviews:')
for x, review in enumerate(most_negative_reviews):
    print(f'{x+1}. {review}')

#hint: use the results of b)

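# The reviews are sorted by the probability of class 1 (the second column from 4b). Since class 1 marks the negative reviews
# in this notebook (see In [5]), the first five are the most positive reviews and the last five the most negative ones.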

The most positive reviews:
1. updated 32213 After extensive research trial and error and even a class on cloth diapering at the local birth center I settled on the Grovia system and have not regretted that decision My son is now 22 months and we have been using Grovia since about 8 weeks We have lots of cloth diapering friends so I have had the chance to see lots of cloth diapers in action and I am very glad that I invested in grovia to begin with Here are the products we use why we like them and in some cases how they could be improved There is a lot of material covered in this review I am doing it this way because I wish there had been something comprehensive like this on Amazon when I was shopping for my diaper stuff If you are considering buying a Grovia system this will be helpful for you If you are looking for a comprehensive review on just one item other reviews may be more suitableOur total system consists of the followingEssentials17 Hybrid shells in a variety of prints and so
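
An alternative to zipping and sorting Python lists is `numpy.argsort` on one probability column. The sketch below uses made-up probabilities and reviews; `classes` stands in for `model.classes_` and is an assumption, as are all the values.

```python
import numpy as np

classes = np.array([-1, 1])                 # stand-in for model.classes_
proba = np.array([[0.90, 0.10],             # made-up predict_proba output
                  [0.20, 0.80],
                  [0.55, 0.45],
                  [0.05, 0.95]])
texts = np.array(["r0", "r1", "r2", "r3"])  # made-up reviews

# Pick the column that belongs to class 1 instead of hard-coding its index.
col = int(np.where(classes == 1)[0][0])

order = np.argsort(proba[:, col])           # ascending probability of class 1
lowest_two = texts[order[:2]].tolist()
highest_two = texts[order[::-1][:2]].tolist()
print(lowest_two, highest_two)
```

Looking up the column through the class array keeps the ranking correct even if the label encoding changes.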

In [15]:
#d) 
from sklearn.metrics import accuracy_score

predictions_accuracy_score = accuracy_score(ratings_test, predictions)
print('Accuracy of predictions equals:', predictions_accuracy_score)

# accuracy_score returns the fraction of test reviews whose predicted label matches the true one, here about 93%.

Accuracy of predictions equals: 0.9294761082325849
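
Accuracy is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below computes both on made-up labels; none of the numbers come from the notebook.

```python
from collections import Counter
from sklearn.metrics import accuracy_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]   # made-up labels, 80% class 1
y_pred = [1, 1, 1, 1, 1, 1, 1, -1, -1, -1]  # made-up model predictions

# Baseline: always predict the most common label in y_true.
majority_label = Counter(y_true).most_common(1)[0][0]
baseline_pred = [majority_label] * len(y_true)

model_acc = accuracy_score(y_true, y_pred)
baseline_acc = accuracy_score(y_true, baseline_pred)
print(model_acc, baseline_acc)
```

On imbalanced data the baseline can already be high, so the gap between the two numbers says more than the model accuracy alone.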


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [16]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [17]:
#a)
ratings_train, ratings_test, reviews_train, reviews_test = train_test_split(baby_df.rating, baby_df.review)

vectorizer = CountVectorizer(vocabulary=significant_words)
vector_train_rev = vectorizer.fit_transform(list(reviews_train))
vector_test_rev = vectorizer.transform(list(reviews_test))

model = LogisticRegression()
model.fit(vector_train_rev, ratings_train)

words = list(zip(model.coef_[0], vectorizer.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2
sorted_words = sorted(words, key=lambda x: x[0])

most_positive_words = [word[1] for word in sorted_words[:10]]
most_negative_words = [word[1] for word in sorted_words[-10:]]

print('Most positive words:', most_positive_words)
print('Most negative words:', most_negative_words)

begin = time.time()
predictions_small = model.predict(vector_test_rev)
predictions_small_time_to_do = time.time() - begin
print(predictions_small)

probability_predictions = model.predict_proba(vector_test_rev)
print(probability_predictions)

reviews = list(zip(probability_predictions, reviews_test))
sorted_reviews = sorted(reviews, key=lambda x: x[0][1])

most_positive_reviews = [review[1] for review in sorted_reviews[:5]]
most_negative_reviews = [review[1] for review in sorted_reviews[-5:]]

print('The most positive reviews:')
for x, review in enumerate(most_positive_reviews):
    print(f'{x+1}. {review}')

print('\nThe most negative reviews:')
for x, review in enumerate(most_negative_reviews):
    print(f'{x+1}. {review}')

predictions_small_accuracy_score = accuracy_score(ratings_test, predictions_small)
print('\nAccuracy of predictions equals:', predictions_small_accuracy_score)

# This repeats the steps of the previous exercises with the vocabulary restricted to the 20 significant words.
# Accuracy is clearly lower than with the full dictionary (about 87% versus about 93%). Note that train_test_split is called
# again without a fixed random_state, so this model is trained and tested on a different random split than the first one.

Most positive words: ['loves', 'perfect', 'love', 'easy', 'great', 'well', 'little', 'able', 'old', 'car']
Most negative words: ['less', 'product', 'would', 'even', 'work', 'money', 'broke', 'waste', 'return', 'disappointed']
[-1 -1 -1 ... -1  1 -1]
[[0.97329915 0.02670085]
 [0.97301164 0.02698836]
 [0.88690006 0.11309994]
 ...
 [0.79718743 0.20281257]
 [0.43004588 0.56995412]
 [0.75952016 0.24047984]]
The most positive reviews:
1. As parents of two little ones Id like to say we are experts in appreciating different baby bottle designs and what works best for baby  Heres our criteria for a baby bottle and why the purpleredesigned Lansinoh mOmma natural wave bottle is 5 stars for usOur Baby Bottle CriteriaFirst and foremostBABY MUST LIKE ITMunchkin Latch BottleFails here because the nipple easily collapses  Just a very slight pressure with your finger and the nipple collapses  Now imagine a moving baby with a slurping motion and this nipple collapses too easilyAvent Classic BottleWe hav

In [18]:
#b)
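# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
for word, coeff in zip(vectorizer.get_feature_names_out(), model.coef_[0]):
    print('Word {0} has impact {1:.5f}'.format(word, abs(coeff)))

# The absolute value of each coefficient shows how strongly the word influences the prediction; the sign (direction) is dropped here.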

Word love has impact 1.39464
Word great has impact 0.93729
Word easy has impact 1.18922
Word old has impact 0.08672
Word little has impact 0.49968
Word perfect has impact 1.47738
Word loves has impact 1.68581
Word well has impact 0.51642
Word able has impact 0.18310
Word car has impact 0.07277
Word broke has impact 1.63485
Word less has impact 0.17046
Word even has impact 0.53084
Word waste has impact 1.92660
Word disappointed has impact 2.36668
Word work has impact 0.63175
Word product has impact 0.32060
Word money has impact 0.94596
Word would has impact 0.34181
Word return has impact 2.09393


In [19]:
#c)
from prettytable import PrettyTable
accuracy_all = predictions_accuracy_score * 100
accuracy_limited = predictions_small_accuracy_score * 100
accuracy_diff = accuracy_all - accuracy_limited
accuracy_diff_x = accuracy_all / accuracy_limited
time_all = predictions_time_to_do * 1000
time_limited = predictions_small_time_to_do * 1000
time_diff = (time_all - time_limited)
time_diff_x = time_all / time_limited

t = PrettyTable(['', 'Dictionary with all words', 'Dictionary with limited words', 'All words difference with limited dictionary by'])
t.add_row(['Accuracy', f'{accuracy_all:.2f}%', f'{accuracy_limited:.2f}%', f'{accuracy_diff:.2f}% / {accuracy_diff_x:.2f}x'])
t.add_row(['Time of evaluation', f'{time_all:.2f}ms', f'{time_limited:.2f}ms', f'{time_diff:.2f}ms / {time_diff_x:.2f}x'])
print(t)

#hint: %time, %timeit

# As the table below shows, the dictionary with all words gives about 6 percentage points higher accuracy (92.95% vs 86.82%),
# but prediction takes about 5 times longer (12.04 ms vs 2.38 ms) than with the 20-word dictionary.
# Both timings are single runs, so the ratio is only indicative.


+--------------------+---------------------------+-------------------------------+-------------------------------------------------+
|                    | Dictionary with all words | Dictionary with limited words | All words difference with limited dictionary by |
+--------------------+---------------------------+-------------------------------+-------------------------------------------------+
|      Accuracy      |           92.95%          |             86.82%            |                  6.13% / 1.07x                  |
| Time of evaluation |          12.04ms          |             2.38ms            |                  9.66ms / 5.05x                 |
+--------------------+---------------------------+-------------------------------+-------------------------------------------------+
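
A single `time.time()` measurement of a call that takes a few milliseconds is noisy. The hint mentions `%timeit`; outside a notebook a similar effect can be had with the standard `timeit` module. The sketch below times a made-up function several times and keeps the best run (`work` is a placeholder for something like `model.predict`, and the repeat counts are arbitrary choices).

```python
import timeit

def work():
    # Placeholder workload standing in for a call such as model.predict(...).
    return sum(i * i for i in range(10_000))

# Run the call 20 times per measurement and repeat the measurement 5 times.
runs = timeit.repeat(work, number=20, repeat=5)

# The minimum is the usual summary: it is least affected by background noise.
best_ms = min(runs) / 20 * 1000
print(f"best of 5: {best_ms:.3f} ms per call")
```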
