### Today we are going to perform a simple classification of the sentiment of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [332]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

In [333]:
baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [334]:
baby_df

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
...,...,...,...
183526,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea! very handy to have and look...,5
183527,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks! It is a great blend of fu...,5
183528,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kids.......,5
183529,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product. I have...,5


## Exercise 1 (data preparation)
a) Remove punctuation from the reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and all negative ($\leq$2) ratings to -1.
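The four steps can be sketched on a tiny, made-up frame before touching the real data. This is only an illustration: `toy_df` and its contents are hypothetical, and the sketch fills missing reviews first so that they never turn into the string `'nan'`.

```python
import string

import numpy as np
import pandas as pd


def remove_punctuation(text):
    """Strip ASCII punctuation, same idea as the helper given above."""
    return text.translate(str.maketrans('', '', string.punctuation))


# Hypothetical miniature of the dataset, for illustration only
toy_df = pd.DataFrame({
    'review': ["Great, works well!", np.nan, "It's OK.", "Broke in a week..."],
    'rating': [5, 4, 3, 1],
})

# b) replace missing reviews with the empty string
toy_df['review'] = toy_df['review'].fillna('')
# a) strip punctuation from every review
toy_df['review'] = toy_df['review'].apply(remove_punctuation)
# c) drop neutral reviews (rating == 3)
toy_df = toy_df[toy_df['rating'] != 3].copy()
# d) map ratings >= 4 to 1 and the remaining ratings (<= 2) to -1
toy_df['rating'] = np.where(toy_df['rating'] >= 4, 1, -1)

print(toy_df)
```

Using `np.where` for step d) avoids the intermediate index lookups, at the cost of assuming that step c) has already removed every rating of 3.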

In [335]:
#a)

baby_df['review'] = baby_df['review'].apply(lambda text: remove_punctuation(str(text)))
baby_df["review"][6]

'Lovely book its bound tightly so you may not be able to add alot of photoscards aside from the designated spaces in the book Shop around before you purchase as it is currently listed at Barnes  Noble for 2995'

In [336]:
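#b)
# str() in a) turned missing reviews into the string 'nan', so that string is replaced here
baby_df["review"] = baby_df['review'].replace(['nan'], [""])
#short test (NaN is the only value that is not equal to itself):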
print(baby_df["review"][38] == baby_df["review"][38])
baby_df["review"][38]

True


''

In [337]:
#c)
baby_df = baby_df.drop(baby_df[baby_df["rating"] == 3].index)
#short test:
sum(baby_df["rating"] == 3)

0

In [338]:
#d) 
ones = baby_df.loc[baby_df["rating"] >= 4]
minus_one = baby_df.loc[baby_df["rating"] <= 2]

baby_df.loc[ones.index, 'rating'] = 1
baby_df.loc[minus_one.index, 'rating'] = -1
#short test:
sum(baby_df["rating"]**2 != 1)

0

## CountVectorizer
To analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions $n$ is the size of our dictionary, and each component of the vector is the number of appearances of the corresponding word in the sentence.
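To make the representation concrete, here is a hand-rolled sketch of the same idea using only the standard library. It is not how CountVectorizer is implemented; the helper `to_count_vector` and the two sentences are made up for illustration.

```python
from collections import Counter

sentences = ["we like apples", "we like like apples and oranges"]

# The dictionary: every distinct word, in alphabetical order
vocabulary = sorted({word for s in sentences for word in s.split()})


def to_count_vector(sentence, vocabulary):
    """Count how often each dictionary word appears in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


vectors = [to_count_vector(s, vocabulary) for s in sentences]

print(vocabulary)
print(vectors)
```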

In [339]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [340]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should note a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, by default it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2 
a) Split the dataset into training and test sets.     
b) Transform the reviews into vectors using CountVectorizer. 

In [341]:
#a)
X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.3, random_state=34)



# X_train = np.array(X_train)
print(X_train[:5])

18352     I dont know why people complain about these Ba...
48654     This item took a long time to come and then it...
4646      your service is great and your products that a...
82322     My little man loves his giraffe it often sooth...
102169    I have had 2 of these for a few years now and ...
Name: review, dtype: object


In [342]:
#b)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train.astype('U'))
X_test = vectorizer.transform(X_test.astype('U'))

X_train
    

<116726x111387 sparse matrix of type '<class 'numpy.int64'>'
	with 6216170 stored elements in Compressed Sparse Row format>

In [343]:
print(X_train[0])

  (0, 32510)	1
  (0, 54728)	1
  (0, 108753)	1
  (0, 72071)	1
  (0, 24793)	1
  (0, 6649)	2
  (0, 97817)	1
  (0, 13679)	1
  (0, 67975)	3
  (0, 97184)	13
  (0, 9141)	1
  (0, 81938)	1
  (0, 40738)	1
  (0, 98310)	1
  (0, 76507)	2
  (0, 11348)	2
  (0, 39429)	1
  (0, 77703)	1
  (0, 85851)	1
  (0, 67286)	5
  (0, 14442)	8
  (0, 78920)	4
  (0, 13889)	1
  (0, 9432)	4
  (0, 15320)	1
  :	:
  (0, 91688)	1
  (0, 108475)	1
  (0, 52196)	1
  (0, 85133)	1
  (0, 99668)	2
  (0, 87135)	1
  (0, 69493)	2
  (0, 104910)	1
  (0, 50345)	1
  (0, 83494)	1
  (0, 93155)	1
  (0, 77419)	1
  (0, 103121)	1
  (0, 59712)	1
  (0, 107527)	1
  (0, 64492)	1
  (0, 104524)	1
  (0, 8546)	1
  (0, 56004)	1
  (0, 37461)	1
  (0, 38793)	1
  (0, 75792)	1
  (0, 94002)	1
  (0, 63326)	1
  (0, 93984)	1


## Exercise 3 
a) Train a LogisticRegression model on the training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print the 10 most positive and the 10 most negative words.

In [344]:
#a)
model = LogisticRegression(max_iter=300, verbose=1, n_jobs=4)

model.fit(X_train, y_train)


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:    7.8s finished


In [345]:
y_train

18352     1
48654    -1
4646      1
82322     1
102169    1
         ..
42271     1
90543     1
168868    1
47832    -1
113975    1
Name: rating, Length: 116726, dtype: int64

In [346]:
#b)
print(model.coef_)
top_positive_coef_index = np.argpartition(model.coef_[0], -10)[-10:]
top_negative_coef_index = np.argpartition(model.coef_[0], 10)[:10]

top_positive_coef = vectorizer.get_feature_names_out()[top_positive_coef_index]
top_negative_coef = vectorizer.get_feature_names_out()[top_negative_coef_index]


print(f'\n10 most positive words:\n{top_positive_coef}')
print(f'\n10 most negative words:\n{top_negative_coef}')

[[0.00087309 0.02614315 0.02176042 ... 0.00032197 0.00699404 0.00692097]]

10 most positive words:
['excellent' 'con' 'girly' 'thankful' 'minor' 'amazing' 'tied' 'skeptical'
 'ply' 'pleasantly']

10 most negative words:
['unacceptable' 'poor' 'terrible' 'unusable' 'useless' 'dissapointed'
 'returning' 'worst' 'disappointing' 'concept']
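Note that `np.argpartition` returns the ten extreme coefficients, but not in sorted order. If a ranked list is wanted, `np.argsort` is one option. The sketch below uses made-up coefficients and words (`coef` and `words` are hypothetical stand-ins for `model.coef_[0]` and the vectorizer's feature names):

```python
import numpy as np

# Hypothetical coefficients and vocabulary, for illustration only
coef = np.array([0.3, -1.2, 2.5, 0.0, -0.4, 1.1])
words = np.array(['fine', 'broke', 'love', 'the', 'meh', 'great'])

k = 2
order = np.argsort(coef)                 # indices from lowest to highest weight
most_negative = words[order[:k]]         # most negative word first
most_positive = words[order[::-1][:k]]   # most positive word first

print(most_positive)
print(most_negative)
```

Sorting all coefficients is slower than partitioning, but for a vocabulary of this size the difference is unlikely to matter.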


## Exercise 4 
a) Predict the sentiment of the test data reviews.   
b) Predict the sentiment of the test data reviews in terms of probability.   
c) Find the five most positive and the five most negative reviews.   
d) Calculate the accuracy of the predictions.

In [347]:
#a)
model.predict(X_test)

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

In [348]:
#b)
result = model.predict_proba(X_test)
result

array([[3.13248661e-03, 9.96867513e-01],
       [1.62366131e-02, 9.83763387e-01],
       [4.09691123e-02, 9.59030888e-01],
       ...,
       [4.74905865e-03, 9.95250941e-01],
       [3.03182372e-06, 9.99996968e-01],
       [2.54937281e-04, 9.99745063e-01]])
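The columns of `predict_proba` follow the order of `model.classes_`, so with labels -1 and 1 the first column is the probability of a negative review and the second of a positive one. A minimal sketch on made-up one-feature data (`X_toy`, `y_toy` and `toy_model` are hypothetical) shows how the column can be looked up instead of hard-coded, and how the extremes can be picked from it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: a larger feature value means a more positive sample
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)
proba = toy_model.predict_proba(X_toy)

# Column order of predict_proba is given by classes_
print(toy_model.classes_)
positive_col = list(toy_model.classes_).index(1)
positive_prob = proba[:, positive_col]

# Two most positive samples (highest P(1)) and two most negative (lowest P(1))
ranking = np.argsort(positive_prob)
most_positive = ranking[-2:][::-1]
most_negative = ranking[:2]

print(most_positive, most_negative)
```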

In [349]:
#c)
ones_prob = result[:,1]
minus_ones_prob = result[:,0]

y_test_index = y_test.reset_index()

most_positive_reviews_indices = y_test_index['index'][np.argpartition(ones_prob, -5)[-5:]]
# the five largest probabilities of class -1 belong to the most negative reviews
most_negative_reviews_indices = y_test_index['index'][np.argpartition(minus_ones_prob, -5)[-5:]]

most_positive_reviews = np.array(baby_df["review"][most_positive_reviews_indices])
most_negative_reviews = np.array(baby_df["review"][most_negative_reviews_indices])

print(f'****The most positive****')
for review in most_positive_reviews:
    print(f'\n{review}')

print(f'\n\n****The most negative****')
for review in most_negative_reviews:
    print(f'\n{review}')

#hint: use the results of b)

****The most positive****

This chair is wonderful  I own a 2006 Nissan Pathfinder and a BMW 330i  I bought this seat mainly because I watched a video by a man who bought two for two kids  One being a 45 year old girl still keeping her rear facing because its the safest way to go  She said she was still comfortable and loved the chair  I figured I can keep my son rear facing as long as possible and with the air pads for his head its the safest chair  Other chairs had thin padding and underneath hard plasticUpon opening the box I was quite satisfied already  There was nothing to put together no plastic to take off of it It was ready to go  The instructions are on the side for easy install into your vehicle and the booklet is hidden on the back of the chair  You can use any of the three types of seat seat belts for this seat  For the Pathfinder I have the option of the chest cross belt or lapbelt  It also has a way to install the seat with hook beltsIn my Pathfinder it turned out that th

In [350]:
#d)

y_pred = model.predict(X_test)
print(f'Accuracy: {sum([1 for x, y in zip(y_pred, y_test) if x == y]) / X_test.shape[0]}')



Accuracy: 0.9292168072602247
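The same number can be obtained in a few equivalent ways. The sketch below compares the manual count used above with a vectorized NumPy version and scikit-learn's `accuracy_score`, on made-up labels (`y_true` and `y_hat` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, -1, 1, 1, -1])
y_hat = np.array([1, 1, 1, -1, -1])

# Manual count of matching pairs, as in the cell above
manual = sum(1 for p, t in zip(y_hat, y_true) if p == t) / len(y_true)
# Fraction of positions where prediction and label agree
vectorized = np.mean(y_hat == y_true)
# Library helper computing the same fraction
library = accuracy_score(y_true, y_hat)

print(manual, vectorized, library)
```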


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words defined below.

a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare the accuracy of the predictions and the time of evaluation.

In [351]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [358]:
#a)
X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.3, random_state=117)

vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(X_train.astype('U'))
X_test = vectorizer.transform(X_test.astype('U'))

X_train

<116726x20 sparse matrix of type '<class 'numpy.int64'>'
	with 256279 stored elements in Compressed Sparse Row format>

In [359]:
simple_model = LogisticRegression(max_iter=300, verbose=1, n_jobs=4)

simple_model.fit(X_train, y_train)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:    0.0s finished


In [354]:
print(simple_model.coef_)
top_positive_coef_index = np.argpartition(simple_model.coef_[0], -10)[-10:]
top_negative_coef_index = np.argpartition(simple_model.coef_[0], 10)[:10]

top_positive_coef = vectorizer.get_feature_names_out()[top_positive_coef_index]
top_negative_coef = vectorizer.get_feature_names_out()[top_negative_coef_index]


print(f'\n10 most positive words:\n{top_positive_coef}')
print(f'\n10 most negative words:\n{top_negative_coef}')

[[ 1.35243318  0.92049801  1.17721726  0.1014735   0.49700554  1.45864627
   1.67847724  0.49962968  0.20382521  0.07720214 -1.69448849 -0.19128761
  -0.52732117 -2.05375238 -2.38047283 -0.64254248 -0.3041989  -0.88540029
  -0.33833047 -2.02408438]]

10 most positive words:
['car' 'great' 'able' 'well' 'loves' 'perfect' 'little' 'old' 'easy'
 'love']

10 most negative words:
['broke' 'return' 'would' 'money' 'product' 'work' 'disappointed' 'waste'
 'even' 'less']


In [360]:
print(simple_model.predict(X_test))
result = simple_model.predict_proba(X_test)
print(result)

[1 1 1 ... 1 1 1]
[[0.0029949  0.9970051 ]
 [0.05626221 0.94373779]
 [0.13265557 0.86734443]
 ...
 [0.04107997 0.95892003]
 [0.18813909 0.81186091]
 [0.02520938 0.97479062]]


In [361]:
# Reviews

ones_prob = result[:,1]
minus_ones_prob = result[:,0]

y_test_index = y_test.reset_index()

most_positive_reviews_indices = y_test_index['index'][np.argpartition(ones_prob, -5)[-5:]]
# the five largest probabilities of class -1 belong to the most negative reviews
most_negative_reviews_indices = y_test_index['index'][np.argpartition(minus_ones_prob, -5)[-5:]]

most_positive_reviews = np.array(baby_df["review"][most_positive_reviews_indices])
most_negative_reviews = np.array(baby_df["review"][most_negative_reviews_indices])

print(f'****The most positive****')
for review in most_positive_reviews:
    print(f'\n{review}')

print(f'\n\n****The most negative****')
for review in most_negative_reviews:
    print(f'\n{review}')

****The most positive****

My husband and I did COPIOUS research for car seats before finally purchasing this one for my car and the Diono Radian RXT for his car  We also looked at the Clek Foonf and Recaro ProRide  We knew we wanted extending rear facing seats so that is how we narrowed our search to just these seats  We also wanted something that was solidly built so that eliminated many of the cheaper seatsHeres the run down1 Peg Perego Premium Convertible Car SeatWhat we LIKED VERY well built solid seat RF to 45lbs FF to 70lbs or 4934 though your child will likely outgrow this seat in height before weight    the average 6 year old is 4934 Easy clean fabric fresco jersey SUPER SAFE shock absorbing foam that crumples upon impact though I hope we never test this SUPER QUICK AND EASY to install  I cant underscore this enough  For RF I just stand behind the seat leaning into it click the two latch straps into place then rock side to side while cinching the straps tight  With this method

In [362]:
y_pred = simple_model.predict(X_test)
print(f'Accuracy: {sum([1 for x, y in zip(y_pred, y_test) if x == y]) / X_test.shape[0]}')


Accuracy: 0.8676288330068365


In [363]:
#b)
# a negative weight pushes the prediction towards a negative review and a positive weight
# towards a positive review; the larger the magnitude, the stronger the effect
for name, weight in zip(significant_words, simple_model.coef_[0]):
    print(f'{name}\t\t\t : \t\t\t {weight}')

love			 : 			 1.3524331829934642
great			 : 			 0.9204980111412695
easy			 : 			 1.1772172633468225
old			 : 			 0.10147349500384725
little			 : 			 0.4970055407163873
perfect			 : 			 1.4586462701924245
loves			 : 			 1.6784772353646706
well			 : 			 0.4996296751282036
able			 : 			 0.20382520903415788
car			 : 			 0.07720213823143963
broke			 : 			 -1.6944884937577147
less			 : 			 -0.19128761186521287
even			 : 			 -0.5273211694704582
waste			 : 			 -2.0537523777790323
disappointed			 : 			 -2.3804728263154917
work			 : 			 -0.6425424775809814
product			 : 			 -0.304198901331846
money			 : 			 -0.8854002931774407
would			 : 			 -0.33833046729765315
return			 : 			 -2.0240843805152995
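One way to read these weights: logistic regression models the log-odds of the positive class as $\log\frac{p}{1-p} = b + \sum_j w_j x_j$, so each additional occurrence of word $j$ multiplies the odds of a positive review by $e^{w_j}$. A small sketch with three weights copied (rounded) from the output above:

```python
import math

# Weights copied (rounded) from the printed coefficients above
weights = {'loves': 1.678, 'disappointed': -2.380, 'car': 0.077}

# exp(weight) is the factor applied to the odds of a positive review
odds_factor = {word: math.exp(w) for word, w in weights.items()}

for word, factor in odds_factor.items():
    print(f'{word}: odds of a positive review multiplied by {factor:.3f} per occurrence')
```

A factor far above 1 marks a strongly positive word, a factor close to 0 a strongly negative one, and a factor near 1 (as for `car`) a word that barely moves the prediction.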


In [367]:
%%time
#c)
# the whole pipeline (split, vectorization, training, prediction) is timed to see the difference

X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.3, random_state=117)

vectorizer = CountVectorizer(vocabulary=significant_words)
X_train = vectorizer.fit_transform(X_train.astype('U'))
X_test = vectorizer.transform(X_test.astype('U'))

model = LogisticRegression(max_iter=300, verbose=1, n_jobs=4)
model.fit(X_train, y_train)

print(f'Accuracy: {sum([1 for x, y in zip(model.predict(X_test), y_test) if x == y]) / X_test.shape[0]}')

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:    0.7s finished


Accuracy: 0.8676288330068365
CPU times: total: 7.02 s
Wall time: 8.35 s


In [368]:
%%time

X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.3, random_state=117)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train.astype('U'))
X_test = vectorizer.transform(X_test.astype('U'))

model = LogisticRegression(max_iter=300, verbose=1, n_jobs=4)
model.fit(X_train, y_train)

print(f'Accuracy: {sum([1 for x, y in zip(model.predict(X_test), y_test) if x == y]) / X_test.shape[0]}')

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:    7.5s finished


Accuracy: 0.9297965058169751
CPU times: total: 5.95 s
Wall time: 14 s


## Accuracy is higher for the more complex model (about 0.930 vs. 0.868), but the total time needed to train and evaluate it is also higher (wall time 14 s vs. 8.35 s).
## The execution time of evaluating the test data alone is the same for both models.
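The `%%time` cells above measure the split, vectorization, training and prediction together. To look at prediction alone, only the `predict` call can be wrapped in a timer. The sketch below does this on random count data standing in for the vectorized reviews; the helper `time_predict`, the array shapes and the data are all assumptions made for illustration, and the measured times will differ from machine to machine.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical random counts standing in for the two vectorized datasets
X_small = rng.poisson(0.5, size=(1000, 20))      # limited dictionary
X_large = rng.poisson(0.05, size=(1000, 2000))   # full dictionary
y = rng.choice([-1, 1], size=1000)


def time_predict(X, y):
    """Fit a model, then return the seconds spent in predict() only."""
    model = LogisticRegression(max_iter=300).fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    return time.perf_counter() - start


small_time = time_predict(X_small, y)
large_time = time_predict(X_large, y)

print(f'predict with 20 features:   {small_time:.4f} s')
print(f'predict with 2000 features: {large_time:.4f} s')
```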