### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [614]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) reviews with empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [615]:
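# b) Replace all missing (nan) reviews with empty "" string.
# We replace the missing reviews first, so that remove_punctuation
# only ever receives strings (calling .translate on NaN would fail).

baby_df["review"] = baby_df["review"].fillna("")

# a) Remove punctuation from reviews using the given function.
# Note: apply() returns a new Series and does not modify the column in place.
# The result is not assigned here, so baby_df["review"] keeps its punctuation;
# that is why the first test below prints False. To store the cleaned text,
# assign it back: baby_df["review"] = baby_df["review"].apply(remove_punctuation)
baby_df["review"].apply(remove_punctuation)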

#short test: 
print(baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock')
print(remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock')

False
True
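
The `False` / `True` pair above follows from how `Series.apply` works: it returns a new Series and leaves the column untouched unless the result is assigned back. A minimal sketch on a toy frame (the `toy_df` data is hypothetical and not taken from amazon_baby.csv):

```python
import string
import pandas as pd

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Hypothetical stand-in for baby_df, including one missing review
toy_df = pd.DataFrame({"review": ["Great, really great!", None, "Non-stop crying..."]})
toy_df["review"] = toy_df["review"].fillna("")

# Without assignment the column is unchanged
toy_df["review"].apply(remove_punctuation)
unchanged = toy_df["review"].tolist()

# Assigning the result back stores the cleaned text
toy_df["review"] = toy_df["review"].apply(remove_punctuation)
cleaned = toy_df["review"].tolist()

print(unchanged)
print(cleaned)
```

Whether the cleaned text should be stored is a design choice; CountVectorizer's default tokenizer already splits on punctuation, although the resulting vocabularies differ (for example "don't" yields `don` without the cleaning step and `dont` with it).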


In [616]:
#b)
"""
implemented above
"""

#short test: NaN != NaN, so this comparison is True only when the entry is not NaN
baby_df["review"][38] == baby_df["review"][38]

True

In [617]:
#c) Drop all the entries with rating = 3, as they have neutral sentiment
baby_df = baby_df.drop(baby_df[baby_df["rating"] == 3].index)

#short test:
sum(baby_df["rating"] == 3)

0

In [618]:
#d) Set all positive >= 4 ratings to 1 and negative <= 2 to -1
"""
Here we map the ratings to either positive (1) or negative (-1).
This is why we had to drop rating 3 first: it is neutral, and
np.where below would otherwise label it as positive.
"""

baby_df["rating"] = np.where(baby_df["rating"] < 3, -1, 1)
#short test:
sum(baby_df["rating"]**2 != 1) # no element is different from 1

0
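
The order of steps c) and d) matters, which a small sketch can make concrete. The ratings below are hypothetical toy values; the point is only that `np.where(rating < 3, -1, 1)` sends a rating of 3 to the positive class unless it has been dropped beforehand.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one of each value
ratings = pd.Series([1, 2, 3, 4, 5], name="rating")

# Drop the neutral rating first, then map to -1 / 1
kept = ratings[ratings != 3]
labels = np.where(kept < 3, -1, 1)

# Mapping without dropping silently labels the neutral rating as positive
naive = np.where(ratings < 3, -1, 1)

print(labels.tolist())
print(naive.tolist())
```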

## CountVectorizer
To analyze strings, we need to give them a numerical representation. We will use one of the simplest representations, which transforms each string into an $n$-dimensional vector. The number of dimensions is the size of our dictionary (the vocabulary), and each entry of the vector is the number of appearances of the corresponding word in the sentence.

In [619]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [620]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


A few facts are worth noting. First, CountVectorizer does not take word order into account. Second, it ignores one-letter words (this can be changed at initialization through the `token_pattern` parameter). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [621]:
#a)Split dataset into training and test sets.
train_df = baby_df.sample(frac=.8, random_state=42)
test_df = baby_df.drop(train_df.index)
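
As an alternative to `sample` / `drop`, scikit-learn's `train_test_split` can produce the same kind of 80/20 split and can also keep the class proportions equal in both parts. This is a sketch on hypothetical toy data (`toy_df` is not the real dataset), not the method used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 positive and 5 negative reviews
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "rating": [1] * 15 + [-1] * 5,
})

# stratify keeps the positive/negative ratio the same in both parts
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["rating"]
)

print(len(toy_train), len(toy_test))
print(sorted(toy_test["rating"].unique()))
```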

In [622]:
#b)Transform reviews into vectors using CountVectorizer. 
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_example = vectorizer.fit_transform(train_df.review)
# only transform on test

print(vectorizer.get_feature_names_out()[3000:4000])
# print(X_train_example.todense())
print(vectorizer.get_feature_names_out().shape)
print(X_train_example.shape) #too big to do todense()
"""
size is too big, so we cant do .todense
Instead I print small portion of dictionary
"""

['acorde' 'acordian' 'acorn' 'acorns' 'acorss' 'acosco' 'acoss'
 'acostumbrarla' 'acound' 'acoustic' 'acoustics' 'acquaintance'
 'acquaintances' 'acquainted' 'acquainting' 'acquard' 'acquire' 'acquired'
 'acquiring' 'acquisition' 'acquisitions' 'acre' 'acreage' 'acres' 'acrid'
 'acrobat' 'acrobatic' 'acrobatics' 'acrobats' 'acronym' 'acronyms'
 'across' 'acrossed' 'acrossthe' 'acrylic' 'acrylonitrile' 'act' 'actally'
 'actaully' 'actaulyl' 'acted' 'actident' 'actied' 'acting' 'action'
 'actions' 'activ3' 'activate' 'activated' 'activatei' 'activates'
 'activating' 'activation' 'activator' 'active' 'actived' 'actively'
 'actives' 'activiation' 'activies' 'activited' 'activites' 'activities'
 'activitites' 'activity' 'activitygym' 'actm' 'actoually' 'acts'
 'actting' 'actual' 'actuality' 'actualize' 'actuallt' 'actually'
 'actuallydon' 'actuallyy' 'actualy' 'actuate' 'actuated' 'actully'
 'actuly' 'actvities' 'acual' 'acually' 'acupuncture' 'acura' 'acurate'
 'acurately' 'acure' 'acustom

'\nsize is too big, so we cant do .todense\nInstead I print small portion of dictionary\n'

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [623]:
import time

#a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were)
"""
I had to increase max_iter because the model was failing to converge
"""
no_dict_train_0 = time.time()
model = LogisticRegression(max_iter=1200)
model.fit(X_train_example, train_df["rating"])
no_dict_train_1 = time.time()

In [624]:
#b) Print 10 most positive and 10 most negative words
sorted_coef, sorted_features = (list(x) for x in zip(*sorted(zip(model.coef_.tolist()[0], vectorizer.get_feature_names_out()))))

top_positive = sorted_features[-10:]
top_negative = sorted_features[:10]

print(top_positive)
print(top_negative)
#hint: model.coef_, vectorizer.get_feature_names_out()

['excellent', 'skeptical', 'pleased', 'saves', 'outstanding', 'highly', 'pleasantly', 'worry', 'amazed', 'rich']
['dissapointed', 'worst', 'worthless', 'poorly', 'intelligent', 'unusable', 'disappointing', 'concept', 'falsely', 'poor']
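
The same ranking can be written with `np.argsort`, which avoids building and sorting a list of tuples. The feature names and coefficient values below are hypothetical and serve only to illustrate the indexing.

```python
import numpy as np

# Hypothetical features and coefficients
feature_names = np.array(["broke", "easy", "love", "return", "waste", "well"])
coefs = np.array([-1.67, 1.15, 1.36, -2.07, -2.01, 0.53])

# argsort returns indices that sort the coefficients in ascending order
order = np.argsort(coefs)
k = 2
most_negative = feature_names[order[:k]].tolist()
most_positive = feature_names[order[-k:][::-1]].tolist()

print(most_positive)
print(most_negative)
```

With a fitted model, `coefs` would be `model.coef_[0]` and `feature_names` would be `vectorizer.get_feature_names_out()`.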


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [625]:
#a)
# Here I predict the sentiment of the test data reviews.
# As the first 20 predictions show, each review is labelled either positive (1) or negative (-1).
X_test_example = vectorizer.transform(test_df.review) # create test data vectorizer
no_dict_predict_0 = time.time()
sentiments = model.predict(X_test_example)
no_dict_predict_1 = time.time()
print(sentiments[:20])



[ 1  1  1  1  1  1  1 -1  1  1 -1  1  1  1  1  1  1  1 -1  1]


In [626]:
#b)
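# Here I compute the probability assigned to each sentiment.
# Column 0 is the probability of the negative class (-1), column 1 of the positive class (1).
# Note: reviews_test_example holds the toy sentences from the CountVectorizer demo,
# while the probabilities below belong to rows 0 and 2 of test_df, so the printed
# sentences do not correspond to the scored reviews.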
probability_reviews_sentiment = model.predict_proba(X_test_example)

print(f"Sentence: {reviews_test_example[0]}\n"
      f"Probability for negative sentence: "
      f"{probability_reviews_sentiment[0][0]}\n"
      f"Probability for positive sentence: "
      f"{probability_reviews_sentiment[0][1]}\n")

# The model is very confident that the first test review is positive.

print(f"Sentence: {reviews_test_example[2]}\n"
      f"Probability for negative sentence: "
      f"{probability_reviews_sentiment[2][0]}\n"
      f"Probability for positive sentence: "
      f"{probability_reviews_sentiment[2][1]}")

# For the third test review (index 2) the model also assigns a high probability to the positive class.
# Since the printed sentence is not the scored review, this output alone does not confirm that the prediction is correct.

#hint: model.predict_proba()

Sentence: They like bananas
Probability for negative sentence: 0.00797114819307787
Probability for positive sentence: 0.9920288518069221

Sentence: We love bananas
Probability for negative sentence: 0.05755111956822989
Probability for positive sentence: 0.9424488804317701


In [627]:
#c) Find five most positive and most negative reviews.
sorted_sentiment, sorted_reviews = (list(x) for x in zip(*sorted(zip(probability_reviews_sentiment[:,0], baby_df["review"]), reverse=True)))
# sorting in descending order of column 0, which is the probability of the NEGATIVE class

top_positive_review = sorted_reviews[0:5]
top_negative_review = sorted_reviews[-6:-1]

print(top_positive_review, end="\n\n\n\n")
print(top_negative_review)

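# These are not the real most positive and most negative reviews, for three reasons:
# 1. the probabilities come from test_df but are zipped with baby_df["review"], so scores and texts are misaligned;
# 2. column 0 of predict_proba is the probability of the negative class, so the descending sort puts the most negative score first;
# 3. the slice [-6:-1] skips the last element ([-5:] would take the final five).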


["These bowls are nice and sturdy, just like I expected them to be.  They are just the right size for my daughter's lunch.  The only issue is that the lids are somewhat hard to get on.  Just takes a little practice, though.  I definitely think we'll get more than our money's worth out of these.", 'These bottles are excellent bottles for to prevent colic in babies.  The price is to expensive.', 'Its cool, refuses to stay on my faucet though as my facuet starts small and gets bigger... Still Trying to find a way to rig it up to stay on', 'Of course, no accidents since we purchased this.  But the sale was easy, the pad was easy to put on.', 'I think by far this is the best rattle we own. The handle is small enough that a 2 month old starting to learn to grasp can practice on, they can chew on it ( which they will all eventually do), the rattle sound is loud enough to get their attention without annoying the grown ups in the room. It is colorful enough that they can learn to follow it when
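
One way to rank reviews by predicted sentiment is to keep the probabilities next to the reviews they were computed for and to select the probability column through `classes_`. The sketch below trains on a tiny hypothetical corpus; all data and variable names (`toy_train`, `toy_test`, `clf`, `vec`) are illustrative assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training and test data
toy_train = pd.DataFrame({
    "review": ["great love", "love it", "great", "broke waste", "waste", "broke"],
    "rating": [1, 1, 1, -1, -1, -1],
})
toy_test = pd.DataFrame(
    {"review": ["love great great", "broke broke waste", "it"]},
    index=[10, 11, 12],  # non-contiguous index, as after sample/drop
)

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(toy_train["review"]), toy_train["rating"])

# Score the same rows whose text will be displayed
proba = clf.predict_proba(vec.transform(toy_test["review"]))
pos_col = list(clf.classes_).index(1)  # column holding P(rating == 1)

ranked = toy_test.assign(p_positive=proba[:, pos_col]).sort_values(
    "p_positive", ascending=False
)
most_positive = ranked.head(1)["review"].tolist()
most_negative = ranked.tail(1)["review"].tolist()

print(most_positive)
print(most_negative)
```

Storing the score as a column of the test frame keeps text and probability aligned even after sorting, which the `zip` over two separately indexed objects does not guarantee.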

In [628]:
#d) Calculate the accuracy of predictions.
from sklearn.metrics import precision_score

y_pred = model.predict(X_test_example)
y_true = test_df["rating"]

first_score = precision_score(y_true,  y_pred)
print(first_score)

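# precision_score reports the precision for the positive class (about 94.9 %), not the accuracy;
# accuracy_score(y_true, y_pred) from sklearn.metrics would give the fraction of all reviews classified correctly.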

0.9491554997208264
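
Precision and accuracy answer different questions, so the two numbers generally differ. A small sketch with hypothetical labels shows the distinction: accuracy counts every correct prediction, while precision looks only at the reviews predicted as positive.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical true labels and predictions
y_true_toy = [1, 1, 1, 1, -1, -1]
y_pred_toy = [1, 1, 1, -1, 1, -1]

# 4 of 6 predictions are correct
acc = accuracy_score(y_true_toy, y_pred_toy)

# 4 reviews are predicted positive, 3 of them truly are
prec = precision_score(y_true_toy, y_pred_toy)

print(acc, prec)
```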


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [629]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [630]:
#a) Re-create the same split as before (the same random_state gives identical train and test sets).

train_df = baby_df.sample(frac=.8, random_state=42)
test_df = baby_df.drop(train_df.index)

vectorizer_subset = CountVectorizer()

# each word is treated as a one-word document, so the learned vocabulary is exactly significant_words
vectorizer_subset.fit(significant_words)

X_train_subset = vectorizer_subset.transform(train_df.review)
X_test_subset = vectorizer_subset.transform(test_df.review)

with_dict_train_0 = time.time()
second_model = LogisticRegression(max_iter=1200)
second_model.fit(X_train_subset,  train_df["rating"])
with_dict_train_1 = time.time()

sorted_coef_subset, sorted_features_subset = (list(t) for t in zip(*sorted(zip(second_model.coef_.tolist()[0], vectorizer_subset.get_feature_names_out()))))

top_positive = sorted_features_subset[-10:]
top_negative = sorted_features_subset[:10]

print('Positive: ', top_positive)
print('Negative: ', top_negative)

# find X_train_subset; X_test_subset, using vectorizer_subset


Positive:  ['car', 'old', 'able', 'little', 'well', 'great', 'easy', 'love', 'perfect', 'loves']
Negative:  ['disappointed', 'return', 'waste', 'broke', 'money', 'work', 'even', 'would', 'product', 'less']


In [631]:
with_dict_predict_0 = time.time()
sentiments = second_model.predict(X_test_subset)
with_dict_predict_1 = time.time()
print(sentiments[:20])

# The predictions mostly agree with the full-vocabulary model;
# among the first 20, the entries at positions 7 and 18 change from -1 to 1.


[ 1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1]


In [632]:
probability_reviews_sentiment = second_model.predict_proba(X_test_subset)

print(f"Sentence: {reviews_test_example[0]}\n"
      f"Probability for negative sentence: "
      f"{probability_reviews_sentiment[0][0]}\n"
      f"Probability for positive sentence: "
      f"{probability_reviews_sentiment[0][1]}\n")

# The model is fairly confident (about 0.82) that the first test review is positive.
# (The printed sentence is again taken from the toy list reviews_test_example, not from test_df.)

print(f"Sentence: {reviews_test_example[2]}\n"
      f"Probability for negative sentence: "
      f"{probability_reviews_sentiment[2][0]}\n"
      f"Probability for positive sentence: "
      f"{probability_reviews_sentiment[2][1]}")

# For the third test review the limited-dictionary model is less sure than the full model (about 0.78 versus 0.94).


Sentence: They like bananas
Probability for negative sentence: 0.18146233040711224
Probability for positive sentence: 0.8185376695928878

Sentence: We love bananas
Probability for negative sentence: 0.21641058825873238
Probability for positive sentence: 0.7835894117412676


In [633]:
proba_sentiments = second_model.predict_proba(X_test_subset)

sorted_sentiment, sorted_reviews = (list(x) for x in zip(*sorted(zip(proba_sentiments[:,0], baby_df["review"]), reverse=True)))
# sorting in descending order of column 0, which is the probability of the NEGATIVE class

top_positive_review = sorted_reviews[0:5]
top_negative_review = sorted_reviews[-6:-1]

print(top_positive_review, end="\n\n\n\n")
print(top_negative_review)

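# Same issue as in exercise 4: the probabilities are computed for test_df but zipped with baby_df["review"],
# and column 0 is the probability of the negative class, so the printed reviews are not the true extremes.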

['This is great for calming the fears of a little one when bedtime is near.  Lovely display covers the ceiling.', "It's great! I love it! My niece loves it! Her parents love it! All my family love it! It's great for kids!", 'I love this product. When I prepare food for my baby, I feel so happy.Yes. It is right. You have to add some water in there to get better results. However, I just broke the gasket. It is still working but it is leaking a little bit:( I do not know whether I can find the gasket!I guess it was my fault so I still recommend this product.', "We love this bath tub!  We had the First Year convertible bath tub first...horrible.  Our son slipped all over the place and you could not get all areas cleaned because of the set up of the tub.  We love the Euro Bath!  It is large enough for baby to have splash and play time without worrying about them slipping.  I love the fact that our son can continue to grow with this tub!  Don't worry about buying anything else!!!!", 'This ha

In [634]:
#b)

for coef, feature in zip(second_model.coef_[0], vectorizer_subset.get_feature_names_out()):
    print(feature, coef)
    
    
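# The strongest coefficients have the expected sign (e.g. 'loves' and 'perfect' positive, 'disappointed' and 'return' negative).
# Neutral-looking words such as 'product', 'would' and 'work' come out mildly negative,
# presumably because of the contexts in which they appear in negative reviews.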

able 0.21976363349928654
broke -1.6703338320502181
car 0.05753770019024416
disappointed -2.3983054566077855
easy 1.1451412454555527
even -0.5168710532218316
great 0.9483502452216717
less -0.15665853311237765
little 0.49983953018095795
love 1.3566339961091352
loves 1.714153155127098
money -0.878863514014164
old 0.09387289540906824
perfect 1.5331518579535348
product -0.3212417688019006
return -2.0672539101190512
waste -2.0128900120635658
well 0.5282632108466655
work -0.6163190781420029
would -0.33534385634315766


In [635]:
#c)

print(f'Train, no dict: {no_dict_train_1 - no_dict_train_0}\n'
      f'Predict, no dict: {no_dict_predict_1 - no_dict_predict_0}\n'
      f'Train, dict: {with_dict_train_1 - with_dict_train_0}\n'
      f'Predict, dict: {with_dict_predict_1 - with_dict_predict_0}')

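# As the output shows, training with the limited dictionary is far quicker than with the full vocabulary
# (about 0.19 s versus 28.7 s, roughly 150 times faster); prediction is also faster (about 1.3 ms versus 5.3 ms).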



Train, no dict: 28.676074981689453
Predict, no dict: 0.005292177200317383
Train, dict: 0.19220709800720215
Predict, dict: 0.0012595653533935547
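
For short measurements like the prediction times above, `time.perf_counter` is generally preferable to `time.time` because it uses the highest-resolution clock available. A small helper, sketched here with the hypothetical name `timed`, also removes the need for paired start/stop variables:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a cheap built-in; a model call such as
# timed(model.predict, X_test) would be used the same way
total, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.6f} s")
```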


In [636]:
# checking the precision of the predictions for comparison with the first model
y_pred = second_model.predict(X_test_subset)
y_true = test_df["rating"]

second_score = precision_score(y_true,  y_pred)

print(second_score)

0.8736418638537806


In [637]:
if first_score > second_score:
    print("First model was more accurate")
else:
    print("Second model was more accurate")

First model was more accurate
