### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [474]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [475]:
baby_df["rating"].value_counts()

5    107054
4     33205
3     16779
1     15183
2     11310
Name: rating, dtype: int64

## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.   
b) Replace all missing (NaN) reviews with an empty string "".  
c) Drop all entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and all negative ($\leq$2) ratings to -1.

In [476]:
#b)
baby_df.replace(np.nan, "", inplace=True)

In [477]:
#a)
baby_df["review"] = baby_df['review'].apply(remove_punctuation)

In [478]:
#c)
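# .copy() gives an independent frame, so the column assignment in d) does not raise SettingWithCopyWarning
baby_df = baby_df[baby_df["rating"] != 3].copy()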
#short test:
sum(baby_df["rating"] == 3)

0

In [479]:
#d) 
baby_df["rating"] = np.where(baby_df["rating"] >= 4, 1, -1)
sum(baby_df["rating"]**2 != 1)

0

In [480]:
np.unique(baby_df["rating"])

array([-1,  1])

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions $n$ is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the sentence.
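
To make the representation concrete, here is a minimal pure-Python sketch of the same idea. The helper `bag_of_words` is hypothetical and only splits on whitespace; it is not how CountVectorizer is implemented, just an illustration of what the count vectors contain.

```python
from collections import Counter

def bag_of_words(sentences):
    """Build a sorted vocabulary and one count vector per sentence."""
    tokenized = [sentence.lower().split() for sentence in sentences]
    # The dictionary is the set of all distinct words, in alphabetical order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per dictionary word; missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["we like apples", "we like like oranges"])
print(vocabulary)
print(vectors)
```

Each row has as many entries as there are words in the dictionary, and a repeated word ("like" in the second sentence) simply raises the count in its column.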

In [481]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())

['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [482]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words such as "I" (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary, which is why "love" leaves no trace in the last test row.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [483]:
from sklearn.model_selection import train_test_split
#a)
X = baby_df["review"]
y = baby_df["rating"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=44)

In [484]:
#b)
vectorizer = CountVectorizer()
reviews_train = vectorizer.fit_transform(X_train)

In [485]:
vectorizer.get_feature_names_out().shape

(121421,)

In [486]:
reviews_train.shape

(133401, 121421)

In [487]:
reviews_test = vectorizer.transform(X_test)
reviews_test.shape  # the sparse matrix already exposes .shape; a dense copy of this size would need a lot of memory

(33351, 121421)

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [488]:
#a)
model = LogisticRegression(max_iter=300)
model.fit(reviews_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
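
The warning above means the solver stopped at `max_iter=300` before it converged. Below is a sketch of the remedies the message hints at, run on a made-up toy corpus. Whether any of them removes the warning on the full dataset has not been checked here, and the toy reviews and the use of `MaxAbsScaler` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy data standing in for the real reviews
toy_reviews = ["love it great", "great product love",
               "broke waste of money", "waste broke return"]
toy_labels = [1, 1, -1, -1]

candidates = {
    # Option 1: simply allow more iterations
    "more_iterations": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)),
    # Option 2: scale the counts; MaxAbsScaler keeps the matrix sparse
    "scaled_counts": make_pipeline(
        CountVectorizer(), MaxAbsScaler(), LogisticRegression(max_iter=300)),
    # Option 3: try a different solver
    "liblinear_solver": make_pipeline(
        CountVectorizer(), LogisticRegression(solver="liblinear")),
}

predictions = {
    name: int(pipe.fit(toy_reviews, toy_labels).predict(["love great"])[0])
    for name, pipe in candidates.items()
}
print(predictions)
```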


In [489]:
len(model.coef_[0])

121421

In [490]:
len(vectorizer.get_feature_names_out())

121421

In [491]:
#b)
# mapping coefficients to words
occurences = sorted(zip(model.coef_[0], vectorizer.get_feature_names_out()), key=lambda x: x[0])

### 10 words associated with negative rating

In [492]:
for coef, feature in occurences[0:10]:
    print(f'{ feature }: { coef }')

dissapointed: -3.2886056722081047
nope: -2.986897200159852
worst: -2.8918858474064284
ineffective: -2.755979076789322
disappointing: -2.7450768396017007
theory: -2.6556338054405915
pointless: -2.626789554825489
worthless: -2.609001799847551
useless: -2.5393594650694427
poorly: -2.4434575625776223


### 10 words associated with positive rating

In [493]:
for coef, feature in occurences[-10:]:
    print(f'{ feature }: { coef }')

met: 2.0743560896033078
minor: 2.0899967631659084
downside: 2.0992495664725968
amazed: 2.1499148663682677
saves: 2.1992331497642406
penny: 2.205617297760363
cleans: 2.2581062318440903
pleasantly: 2.378258952543135
thankful: 2.3947134389075058
rich: 2.439677475465357


## Exercise 4 
a) Predict the sentiment of the test data reviews.   
b) Predict the sentiment of the test data reviews in terms of probability.   
c) Find the five most positive and the five most negative reviews.   
d) Calculate the accuracy of the predictions.

In [494]:
#a)
model.predict(reviews_test)

array([ 1, -1,  1, ...,  1,  1,  1])

In [495]:
#b)
predicted = model.predict_proba(reviews_test)

In [496]:
predicted

array([[8.60090359e-06, 9.99991399e-01],
       [9.99999993e-01, 6.66345705e-09],
       [1.08824977e-04, 9.99891175e-01],
       ...,
       [6.66646116e-05, 9.99933335e-01],
       [1.67791562e-01, 8.32208438e-01],
       [2.85173668e-02, 9.71482633e-01]])
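
The columns of `predict_proba` follow the order of `model.classes_`, so with labels -1 and 1 the first column is the probability of a negative review and the second the probability of a positive one. A small sketch on made-up one-dimensional data (the names `X_toy`, `y_toy`, and `toy_model` are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: small feature values are negative, large ones positive
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([-1, -1, 1, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

print(toy_model.classes_)  # column order of predict_proba

proba = toy_model.predict_proba(X_toy)
# Look the column up by label instead of hard-coding 0 or 1
positive_col = list(toy_model.classes_).index(1)
print(proba[:, positive_col])
```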

## Most negative reviews

In [497]:
# finding 5 rows with the highest probability of having a negative review
negative_idx = sorted(zip(predicted.T[0], X_test.index), key=lambda x: x[0])[-5:]
# getting indexes of those rows
_, n_indices = zip(*negative_idx)

In [498]:
n_indices

(120209, 57234, 120707, 133297, 87026)

In [499]:
X_test[X_test.index.isin(n_indices)]

120707    The previous reviewers laud the piece of mind ...
120209    This is the first review I have ever written o...
57234     My husband and I are VERY disappointed and sho...
133297    The first monitor broke within 1 month of use ...
87026     First off I did manage to find this product fo...
Name: review, dtype: object
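
Note that filtering with `isin` returns the rows in their original order, not ranked by probability. If the ranking matters, one option is `Series.nlargest` combined with `.loc`, sketched here on made-up data (`reviews` and `p_negative` are hypothetical stand-ins for `X_test` and `predicted[:, 0]`).

```python
import pandas as pd

# Hypothetical stand-ins for X_test and the negative-class probabilities
reviews = pd.Series(["bad", "fine", "awful", "great"], index=[10, 20, 30, 40])
p_negative = pd.Series([0.90, 0.40, 0.99, 0.05], index=reviews.index)

# The 2 highest probabilities, most negative first
top = p_negative.nlargest(2)

ranked = reviews.loc[top.index]                       # keeps the ranking
unranked = reviews[reviews.index.isin(top.index)]     # keeps the original row order

print(list(ranked))
print(list(unranked))
```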

### Worst review


In [500]:
X_test[n_indices[-1]]

'First off I did manage to find this product for 20 less at the big box store than the price here Secondly this product is horribly manufactured  Absolute waste of money  The only thing Ill give it credit for is the assembly instruction sheet is as clear cut as Ive come across in assembleathome baby products  But thats where the quality ends  In the first step I noticed the foot pieces did not attach as described in the instructions and this is because the 3 pieces in question were supposed to be uniform to fit in the same manner in all 3 places  But poor manufacturing threw that uniformity out of the window  One piece fit fine the second piece not so much and the third piece not at all  Theyre supposed to snap into place but I noticed the 3 supposedly identical pieces slightly varied in size on the locking parts causing one piece to fit where it was supposed to but the other 2 not to  All pieces on this product snap into place using the old indented tab system for locking into slots  

## Most positive reviews

In [501]:
# finding 5 rows with the highest probability of having a positive review
positive_idx = sorted(zip(predicted.T[1], X_test.index), key=lambda x: x[0])[-5:]
_, p_indices = zip(*positive_idx)

In [502]:
p_indices

(97394, 41763, 116083, 79836, 80881)

In [503]:
X_test[X_test.index.isin(p_indices)]

97394     I first bought these when I again had to repla...
41763     After considering several lightweight stroller...
116083    Ive had this stroller for a little more than s...
79836     This handsfree breastpump bra is hands down or...
80881     I spent weeks trying to find a cribdresser mat...
Name: review, dtype: object

### Best review


In [504]:
X_test[p_indices[-1]]

'I spent weeks trying to find a cribdresser matching set that I liked I wanted to purchase a dresser that could also be used as a changing table Changing tables seem like such a waste of money since you only want to use them while your child is little With this dresser I can put a changing pad on top and then my child can use it as a dresser until she leaves homeI purchased the dresser in the cherry espresso color and loved it It still has the dark color of espresso with a tint of red I think it is gorgeous I also purchased the Delta Canton 4 in 1 Crib in Cherry EspressoDelta Canton 4in1 Convertible Crib Espresso Cherry I was worried that since the drawer was called Delta Eclipse in Cherry Espresso that the colors may not match but they matched perfectly I would highly recommend the crib as wellI was a little worried about the size of the dresser because I couldnt find the dimensions but it is the perfect size for my babys room It is 48 12 inches wide 20 14 inches deep and 34 12 inches

## Accuracy

In [505]:
#d) 
model.score(reviews_test, y_test)

0.9322359149650685

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of each word from the dictionary.   
c) Compare the accuracy of predictions and the time of evaluation.

In [506]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [507]:
#a)
X = baby_df["review"]
y = baby_df["rating"]

# splitting data
X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X, y, train_size=0.8, random_state=44)

# vectorizer is only going to look for words contained in significant_words 
vectorizer_light = CountVectorizer(vocabulary=significant_words)
reviews_train_l = vectorizer_light.fit_transform(X_train_l)
vectorizer_light.get_feature_names_out()

array(['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
       'well', 'able', 'car', 'broke', 'less', 'even', 'waste',
       'disappointed', 'work', 'product', 'money', 'would', 'return'],
      dtype=object)

In [508]:
reviews_test_l = vectorizer_light.transform(X_test_l)

In [509]:
reviews_test_l.todense().shape # we are working with 20 words, so this array has 20 columns

(33351, 20)

In [510]:
model_light = LogisticRegression(max_iter=300)
model_light.fit(reviews_train_l, y_train_l)

In [511]:
occurences = sorted(zip(model_light.coef_[0], vectorizer_light.get_feature_names_out()), key=lambda x: x[0])

In [512]:
for coef, feature in occurences:
    print(f'{ feature }: { coef }')

disappointed: -2.3502365302452364
return: -2.169777760390085
waste: -2.0334975133573363
broke: -1.7196312817840118
money: -0.908461138703059
work: -0.6331122200545758
even: -0.5189343589682366
would: -0.3457761818183093
product: -0.3122389486741367
less: -0.14399625404378702
car: 0.05796887113326901
old: 0.08736832940245315
able: 0.18887582262258745
little: 0.48692385074921823
well: 0.4908486695207657
great: 0.940714935249715
easy: 1.160998897328867
love: 1.327913710254084
perfect: 1.4670967566473758
loves: 1.7233483260556846


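<ul>
<li>The 10 words with negative coefficients push a review toward the negative class; the first 5 (disappointed, return, waste, broke, money) do so most strongly</li>
<li>The 10 words with positive coefficients push a review toward the positive class; the last 5 (great, easy, love, perfect, loves) do so most strongly</li>
<li>The magnitude of a coefficient tells us how much each occurrence of the word shifts the prediction; words with coefficients close to 0 (car, old, less) have almost no impact</li>
</ul>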
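
In logistic regression each coefficient is added to the log-odds of the positive class once per occurrence of the word, so `exp(coef)` is the factor by which one occurrence multiplies the odds of a positive review. A small worked sketch using three coefficients copied from the output above:

```python
import math

# Coefficients copied from the printed output of model_light
coefficients = {
    "disappointed": -2.3502365302452364,
    "car": 0.05796887113326901,
    "loves": 1.7233483260556846,
}

# exp(coef) = multiplicative change in the odds of a positive review
odds_ratios = {word: math.exp(coef) for word, coef in coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f"{word}: odds multiplied by {ratio:.3f} per occurrence")
```

A factor below 1 lowers the odds of a positive prediction, a factor above 1 raises them, and a factor near 1 (as for "car") leaves them almost unchanged.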

In [513]:
model_light.predict(reviews_test_l)

array([1, 1, 1, ..., 1, 1, 1])

In [514]:
predicted = model_light.predict_proba(reviews_test_l)
predicted

array([[0.00196845, 0.99803155],
       [0.40136299, 0.59863701],
       [0.01870812, 0.98129188],
       ...,
       [0.05164511, 0.94835489],
       [0.21347838, 0.78652162],
       [0.01476858, 0.98523142]])

### Best reviews

In [515]:
positive_idx = sorted(zip(predicted.T[1], X_test_l.index), key=lambda x: x[0])[-5:]
_, p_indices = zip(*positive_idx)

In [516]:
p_indices

(90795, 73754, 181875, 174637, 12035)

In [517]:
X_test_l[X_test_l.index.isin(p_indices)]

90795     This diaper bag is PERFECT There are a lot of ...
174637    Over the last five years I have tried many dif...
181875    There is probably nothing more scrutinized the...
73754     Before you buy test out any walker on your bab...
12035     My daughter 8 months old loves this toy and ha...
Name: review, dtype: object

In [518]:
X_test_l[p_indices[-1]]

'My daughter 8 months old loves this toy and has for a few months I love it too  Its not only a stacking toy but several toys in one each providing development benefits The removable elephant head serves as a funnel into which you drop balls  The head itself is fun since the ears crinkle and the hair is little knots that provide tactile sensation  My daughter loves to grab it and wave it around Each stackable ring is soft and easy for my daughter to grab and wave around which she loves to do  Two of the rings the elephant feet have Velcro joining the feet that my daughter can separate and rejoin The 4 different colored balls rattle and each has a different patter pressed on it  My daughter loves to shake the balls which fit perfectly into her hands  but not her mouth and she also loves to feel the patterns on the balls When you drop the balls into the funnelchute it plays a song which you can disable  The balls then roll out the door which can be closed  Great for teaching causeandeffe

### Worst reviews

In [519]:
negative_idx = sorted(zip(predicted.T[0], X_test_l.index), key=lambda x: x[0])[-5:]
_, n_indices = zip(*negative_idx)

In [520]:
n_indices

(111079, 140418, 35763, 119175, 22491)

In [521]:
X_test_l[X_test_l.index.isin(n_indices)]

119175    Babyhaven knew that the items were never taken...
140418    I am a researchaholic in general and have rese...
22491     My wife has been sucessful using the pump but ...
35763     Day 1 Assembled it Had it up and running playi...
111079    I searched for Baby Blanket Made in the USA an...
Name: review, dtype: object

In [522]:
X_test_l[n_indices[-1]]

'My wife has been sucessful using the pump but generally speaking the pump has numerous idiosyncrasies that you should consider before purchasing it  Had we done more homework we likely would have chosen a different pump  If only you could return a personal item to Babies r Us once youve opened it we wouldMy main complaint is that the pump is not very practical  While my wife expresses a fair amount of milk using the product she obtained the same amount of milk using the Medela pump Symphony  There is a long unrelated story why we rented a Medela pump for 2 weeks but it provided a useful period to compare the products sidebysideIn order of importanceSetup  A lot of parts to plug in connect and ultimately clean not to mention misplace or loose  Medela has this pump beat handsdown on this front  Managing this process at home is a chore if youre pumping at work I would imagine this becomes even more difficult  I suppose this is why Avent dont provide the option of a car power adapter  Ave

## Comparison


In [523]:
model_light.score(reviews_test_l, y_test_l)

0.869449191928278

In [524]:
model.score(reviews_test, y_test)

0.9322359149650685
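
Because most reviews are positive, it helps to compare both accuracies with a majority-class baseline, i.e. a model that always predicts 1. The sketch below uses the rating counts printed at the top of the notebook; these cover the whole dataset, so the share in the test split may differ slightly.

```python
# Rating counts from value_counts() at the top of the notebook (rating 3 dropped)
n_positive = 107054 + 33205   # ratings 5 and 4
n_negative = 15183 + 11310    # ratings 1 and 2

# Accuracy of always predicting the positive class
baseline = n_positive / (n_positive + n_negative)

full_accuracy = 0.9322359149650685
light_accuracy = 0.869449191928278

print(f"baseline:    {baseline:.3f}")
print(f"full model:  {full_accuracy:.3f} (+{full_accuracy - baseline:.3f})")
print(f"light model: {light_accuracy:.3f} (+{light_accuracy - baseline:.3f})")
```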

In [525]:
import sys, time

In [526]:
%%time
%%timeit
model_light.predict(reviews_test_l)

543 µs ± 7.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
CPU times: total: 4.36 s
Wall time: 4.42 s


In [527]:
%%time
%%timeit
model.predict(reviews_test)

7.47 ms ± 607 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
CPU times: total: 5.83 s
Wall time: 6.05 s
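
The speed-up follows directly from the two per-loop times reported by `%%timeit`. Outside a notebook, a similar measurement can be made with the standard `timeit` module. The helper `time_call` below is a hypothetical sketch, and the lambda is only a placeholder for a call such as `model.predict(reviews_test)`.

```python
import timeit

# Per-loop times reported above
light_seconds = 543e-6   # model_light.predict
full_seconds = 7.47e-3   # model.predict
speedup = full_seconds / light_seconds
print(f"speed-up: {speedup:.1f}x")

def time_call(func, repeat=5, number=100):
    """Return the best observed time per call, in seconds."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number

# Placeholder workload; replace it with the prediction call to be measured
per_call = time_call(lambda: sum(range(1000)))
print(f"{per_call * 1e6:.1f} µs per call")
```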


### After simplifying our model, the accuracy decreased by about 6 percentage points (from 93.2% to 86.9%), but prediction runs roughly 14 times faster per loop (543 µs vs. 7.47 ms)