### Today we are going to perform the simple classification of the amazon reviews' sentiment.

### Please, download the dataset amazon_baby.csv.

In [129]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    import string
    text = str(text) #dodana linia, bo w df mamy typ object
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (nan) revies with empty "" string.  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative($\leq$2) to -1.

In [130]:
#a)
baby_df['review'] = baby_df['review'].apply(remove_punctuation)

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [131]:
#b)
baby_df['review'] = baby_df['review'].fillna('')

#short test:
baby_df["review"][38] == baby_df["review"][38]

True

In [132]:
#c)
baby_df = baby_df[baby_df.rating != 3]

#short test:
sum(baby_df["rating"] == 3)

0

In [133]:
#d) 
baby_df['rating'] = baby_df['rating'].replace([1, 2], -1)
baby_df['rating'] = baby_df['rating'].replace([4, 5], 1)

#short test:
sum(baby_df["rating"]**2 != 1)

0

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representation, which transforms strings into the $n$ dimensional vectors. The number of dimensions will be the size of our dictionary, and then the values of the vector will represent the number of appereances of the given word in the sentence.

In [134]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())


['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [135]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge few facts. Firstly, CountVectorizer does not take order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in it's dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [136]:
#a)
from sklearn.model_selection import train_test_split

X = baby_df['review']
y = baby_df['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

In [137]:
#b)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [138]:
#a)
from sklearn.preprocessing import StandardScaler

#bez skalowania danych nawet 200 iteracji regresji logistycznej nie wystarczyło, aby osiągnąć zbieżność
scaler = StandardScaler(with_mean = False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter = 200)
model.fit(X_train, y_train)

In [139]:
#b)
#uzyskanie słów i ich ratio
words = vectorizer.get_feature_names_out()
values = model.coef_[0]

#stworzenie dataframe z przypisanymi do słów ich ratio, po czym sortowanie
df_val = pd.DataFrame({'word': words, 'val': values})
df_val = df_val.sort_values(by='val')

print("Most positive words:")
result = df_val.tail(10)['word'].tolist()

for word in result:
    print(word)

print("\nMost negative words:")
result = df_val.head(10)['word'].tolist()

for word in result:
    print(word)

#można zauważyć, że nie wszystkie słowa z tego grona mają faktycznie wydźwięk pejoratywny. "And" niekoniecznie jest słowem pozytywnym
#not, something, bag, off - nie są to słowa negatywne
#ze względu na takie przypisanie wartości słowom prawdopodobna jest nieprawidłowa klasyfikacja opinii

Most positive words:
best
perfectly
and
bonus
highly
perfect
easy
loves
great
love

Most negative words:
not
poor
doesnt
waste
something
return
disappointed
bag
shame
off


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [140]:
#a)
import time

start = time.time()
y_exp = model.predict(X_test)
end = time.time()

time_1 = end-start

print("results:")
print(y_exp)
print("time of process: ", time_1)

#metoda zwraca jedynie przypisanie recenzji jako pozytywnej lub nie

results:
[ 1 -1  1 ...  1 -1  1]
time of process:  0.009692192077636719


In [141]:
#b)
start = time.time()
y_prob = model.predict_proba(X_test)
end = time.time()

time_2 = end - start

print("results:")
print(y_prob)
print("time of process: ",time_2)
#metoda dla każdej recenzji zeraca prawdopodobieństwa, że jest w klasie pozytywnych oraz negatywnych recenzji

results:
[[7.93076715e-12 1.00000000e+00]
 [1.00000000e+00 1.02883983e-21]
 [0.00000000e+00 1.00000000e+00]
 ...
 [0.00000000e+00 1.00000000e+00]
 [1.00000000e+00 1.34559848e-13]
 [2.22044605e-15 1.00000000e+00]]
time of process:  0.02027750015258789


In [142]:
#c) 
df = pd.DataFrame(y_prob)
df['review'] = baby_df['review']

best = df.nlargest(5, 1)['review'].tolist()

print("BEST REVIEWS:\n")
for r in best:
    print(r, '\n')

worst = df.nlargest(5, 0)['review'].tolist()

print("WORST REVIEWS:\n")
for r in worst:
    print(r, '\n')

#Można zauważyć, że niektóre z opinii, które sklasyfikowano jako negatywne, są właściwie pozytywne
#Nie każda z najlepszych opinii jest ponadto faktycznie jednoznacznie pozytywna
#Wynika to najprawdopodobniej z niewłaściwej klasyfikacji niektórych słów nwutralnych jako pejoratywnych
#(co zostało pokazane powyżej)

BEST REVIEWS:

Very soft and comfortable and warmer than it looksfit the full size bed perfectlywould recommend to anyone looking for this type of quilt 

When the Binky Fairy came to our house we didnt have any special gift and book to help explain to her about how important it is to stop using a pacifier This book does a great job to help prepare your child for the loss of their favorite item The doll is adorable and we made lots of cute movies with the Binky Fairy telling our daughter about what happens when the Binky Fairy comes I would highly recommend this product for any parent trying to break the pacifier or thumb sucking habit 

I love this journal and our nanny uses it everyday to track on our daughters sleep eating and other activities The layout and design make it very easy to fill in quickly with a comments column to add in details ie we ask the nanny to specify what food she had for lunch amount of milk she took specifics of play timetummy time walk to park etc I love kno

In [143]:
#d) 
from sklearn.metrics import accuracy_score

acc_1 = accuracy_score(y_test, y_exp)

print("Accuracy: ", acc_1)

Accuracy:  0.8805221284931836


## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-5 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [144]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [145]:
#a)
X = baby_df['review']
y = baby_df['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

#stworzenie wektoryzatora opartego o konkretne słowa
vectorizer = CountVectorizer()
vectorizer.fit_transform(significant_words)

X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)

scaler = StandardScaler(with_mean = False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter = 200)
model.fit(X_train, y_train)

start = time.time()
y_exp = model.predict(X_test)
end = time.time()

time_3 = end - start

start = time.time()
y_prob = model.predict_proba(X_test)
end = time.time()

time_4 = end - start

In [146]:
#b)
words = vectorizer.get_feature_names_out()
values = model.coef_[0]

df_val = pd.DataFrame({'word': words, 'val': values})
df_val = df_val.sort_values(by='val')

for x, item in df_val.iterrows():
    print(item['word'], " ", item['val'])

disappointed   -0.3459376674294728
return   -0.3162803572983196
waste   -0.2720057548578965
would   -0.2199340384599866
even   -0.21812471013186222
broke   -0.21750355660595372
money   -0.213613249865728
work   -0.2107410241825163
product   -0.15813326286239945
less   -0.029214472991915326
old   0.04236143114015611
able   0.05441341416172299
car   0.0551485971615435
well   0.24522682049492536
little   0.2861927394342769
perfect   0.48987561882017006
great   0.5719733416500196
easy   0.627480051795521
loves   0.6588816176969945
love   0.7854248883242763


In [147]:
#c)
acc_2 = accuracy_score(y_test, y_exp)

print("ACCURACY COMPARISION\n")
print("unlimited dictionary: ", acc_1)
print("limited dictionary: ", acc_2)

print("\nTIME COMPARISION\n")
print("unlimited dictionary: ", time_1)
print("unlimited dictionary, with probaility: ", time_2)
print("limited dictionary: ", time_3)
print("limited dictionary, with probability: ", time_4)


ACCURACY COMPARISION

unlimited dictionary:  0.8805221284931836
limited dictionary:  0.8659297165473954

TIME COMPARISION

unlimited dictionary:  0.009692192077636719
unlimited dictionary, with probaility:  0.02027750015258789
limited dictionary:  0.0007295608520507812
limited dictionary, with probability:  0.0009098052978515625
