## Data Science Course
### Classification 
#### Author: Pawel Jelonek


In [11]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import sys

In [2]:
def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

| Unnamed: 0 | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


### Exercise 1 (data preparation)
#### b) Replace all missing (NaN) reviews with the empty string "".

In [3]:
baby_df["review"] = baby_df["review"].fillna("")
# Short test: the review at index 38 should now be an empty string
baby_df["review"][38] == ""

True

#### a) Remove punctuation from reviews using the given function.   

In [4]:
baby_df["review"] = baby_df["review"].apply(remove_punctuation)
# Short test: review 4 should equal the punctuation-free text below
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

#### c) Drop all the entries with $rating = 3$, as they have neutral sentiment.  

In [5]:
baby_df = baby_df.drop(baby_df[baby_df.rating == 3].index)
sum(baby_df["rating"] == 3) == 0

True

#### d) Set all positive ($\geq 4$) ratings to 1 and all negative ($\leq 2$) ratings to -1.

In [6]:
def standardize_ratings(rating):
    if rating >= 4: 
        return 1
    else:
        return -1
baby_df["rating"] = baby_df["rating"].map(standardize_ratings)
# Short test: every rating is now +1 or -1, so this count should be 0
sum(baby_df["rating"]**2 != 1)

0
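
The same mapping can also be written without a helper function. The sketch below is only an illustration on a hypothetical toy `ratings` Series (not the Amazon data) and assumes the neutral rating 3 has already been dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; the neutral value 3 is assumed to be removed already
ratings = pd.Series([5, 1, 4, 2, 5])

# +1 for ratings of 4 or 5, -1 for ratings of 1 or 2
labels = pd.Series(np.where(ratings >= 4, 1, -1), index=ratings.index)

print(labels.tolist())
```

`np.where` evaluates the condition for the whole column at once, which tends to be faster than calling a Python function per row on large frames.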

### CountVectorizer
#### In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions $n$ is the size of our dictionary, and each component of the vector is the number of appearances of the corresponding word in the sentence.

In [8]:
vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

In [10]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

#### We should acknowledge a few facts. First, CountVectorizer does not take word order into account. Second, it ignores one-letter words by default (this can be changed at initialization through the `token_pattern` parameter). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.
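
These three properties can be checked directly. The following is a small sketch with default `CountVectorizer` settings; the sentences and variable names are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = ["We like apples",
                   "I adore bananas",
                   "We like like apples and oranges"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_sentences)

# Vocabulary listed in column order (words are lowercased by default)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)

# 1) Word order does not matter: both sentences give the same count vector
a = vectorizer.transform(["We like apples"]).toarray()[0]
b = vectorizer.transform(["apples like We"]).toarray()[0]
print((a == b).all())

# 2) "I" is a one-letter word, so it never enters the vocabulary
print("i" in vectorizer.vocabulary_)

# 3) "love" was not seen during fitting, so it is ignored at transform time
c = vectorizer.transform(["I love bananas"]).toarray()[0]
print(dict(zip(vocab, c.tolist())))
```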

### Exercise 2 
#### a) Split dataset into training and test sets.     

In [12]:
xTrain, xTest, yTrain, yTest = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.2, random_state=0)
vectorizer = CountVectorizer()

#### b) Transform reviews into vectors using CountVectorizer. 

In [13]:
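np.set_printoptions(threshold=sys.maxsize)
# Note: the vocabulary is learned from xTest here, and the model below is
# also fit on this matrix, so the recorded results describe the 20% split only.
# For a held-out evaluation, fit_transform on xTrain and only transform xTest.
reviews_vectorized = vectorizer.fit_transform(xTest)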

### Exercise 3 

#### a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   

In [14]:
%%time
model = LogisticRegression(solver="lbfgs",max_iter=10000).fit(reviews_vectorized,yTest)

Wall time: 7.6 s
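
For comparison, a held-out workflow learns both the vocabulary and the model from the training split and only transforms the test split. The sketch below uses hypothetical toy reviews (the names `train_texts`, `test_texts` and the data are assumptions made for illustration, not part of the exercise):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for xTrain / yTrain / xTest
train_texts = ["love it works perfectly", "perfect gift love it",
               "terrible waste of money", "poor quality returned it"]
train_labels = [1, 1, -1, -1]
test_texts = ["love this perfect stroller", "terrible poor quality"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # learn the vocabulary here only
X_test = vectorizer.transform(test_texts)         # reuse it; unseen words are ignored

model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X_train, train_labels)
predictions = model.predict(X_test)
print(predictions)
```

Because `transform` reuses the fitted vocabulary, `X_train` and `X_test` have the same number of columns, which is what the model expects.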


#### b) Print the 10 most positive and the 10 most negative words.

In [15]:
print("Top 10 positive words")
# Pair each word with its coefficient (use get_feature_names() on scikit-learn < 1.0)
words = list(zip(vectorizer.get_feature_names_out(), model.coef_.flatten()))
words.sort(key=lambda w: w[1], reverse=True)

dictionary = ""
for x in words[:10]:
    print(x)
    dictionary +=str(x[0])+" "

print("\nTop 10 negative words")
for x in words[-10:]:
    print(x)
    dictionary +=str(x[0])+" "

Top 10 positive words
('lifesaver', 2.213955598161604)
('perfectly', 2.0679116719841493)
('highly', 2.0231771059892654)
('perfect', 1.9260397456299283)
('plenty', 1.9070858570960028)
('unlike', 1.8895098295785855)
('glad', 1.8385306190195196)
('love', 1.8124556174362176)
('downside', 1.7841225417277407)
('complaint', 1.751481701331609)

Top 10 negative words
('returning', -1.9486352651833716)
('returned', -1.9983037445648912)
('poorly', -2.0723897659000947)
('worst', -2.077773012469216)
('terrible', -2.10755677212824)
('useless', -2.159703740524685)
('disappointed', -2.1970904031803564)
('waste', -2.3592952789588235)
('disappointing', -2.4127341938902522)
('poor', -2.6422263937989223)
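
The ranking above pairs each vocabulary word with its learned weight and sorts by that weight. An equivalent sketch with `np.argsort`, again on hypothetical toy data, might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews and labels
texts = ["love it works perfectly", "perfect gift love it",
         "terrible waste of money", "poor quality returned it"]
labels = [1, 1, -1, -1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X, labels)

# Column j of X belongs to the word whose vocabulary_ index is j
vocab = np.array(sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get))
coefs = model.coef_[0]        # one weight per vocabulary word
order = np.argsort(coefs)     # ascending: most negative weight first

k = 3
print("Most negative:", list(zip(vocab[order[:k]], coefs[order[:k]])))
print("Most positive:", list(zip(vocab[order[::-1][:k]], coefs[order[::-1][:k]])))
```

A positive weight pushes the prediction toward class 1 each time the word occurs, and a negative weight pushes it toward class -1.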


### Exercise 4 
#### a) Predict the sentiment of test data reviews.   

In [17]:
%%time
np.set_printoptions(threshold=sys.maxsize)

predictions = model.predict(reviews_vectorized)
# Count the predictions that match the true labels
suma0 = int((predictions == yTest.values).sum())

print("Prediction accuracy: "+str((suma0/len(yTest))*100)+"%")

Prediction accuracy: 98.38385655602531%
Wall time: 51.9 ms
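
Accuracy is the fraction of predictions that equal the true labels. As a small sketch on made-up label arrays, the element-wise comparison and scikit-learn's `accuracy_score` give the same number:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (4 of 5 agree)
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])

acc_manual = (y_pred == y_true).mean()        # fraction of matching entries
acc_sklearn = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn

print(acc_manual, acc_sklearn)
```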


#### b) Predict the sentiment of test data reviews in terms of probability.  

In [19]:
%%time
predicted_sentiment = model.predict_proba(reviews_vectorized)

Wall time: 15 ms
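
`predict_proba` returns one column per class, ordered as in `model.classes_`. With labels -1 and 1, column 1 therefore holds the probability of the positive class, which is what the next step relies on. A minimal sketch on hypothetical one-dimensional data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: larger feature values belong to the positive class
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = LogisticRegression(solver="lbfgs").fit(X, y)

proba = model.predict_proba(X)
print(model.classes_)                      # column order of predict_proba

pos_col = list(model.classes_).index(1)    # look the column up instead of hard-coding it
p_positive = proba[:, pos_col]
print(p_positive)
```

Looking the column up through `classes_` keeps the code correct even if the label encoding changes.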


#### c) Find the five most positive and the five most negative reviews.

In [20]:
%%time
revs = list(zip(xTest, predicted_sentiment[:,1]))
revs.sort(key = lambda x: x[1], reverse=True)
print("Top 5 reviews")
for i, x in enumerate(revs[:5]):
    print('Review {}:'.format(i+1), x[0][:50], x[1])
print("\nTop 5 worst reviews")
for i, x in enumerate(reversed(revs[-5:])):
    print('Review {}:'.format(i+1), x[0][:50], x[1])

Top 5 reviews
Review 1: This is a review of the 2012 Bumbleride Flite in R 1.0
Review 2: We LOVE this seat As parents to 8 children ranging 1.0
Review 3: Before my daughter was born in 2010 I bought the R 1.0
Review 4: Were keeping this stroller After much research we  1.0

Top 5 worst reviews
Review 1: This product should be in the hall of fame solely  1.1611051105337778e-16
Review 2: Edited to Add 642010  Just wanted to add that Peg  6.676728298780986e-15
Review 3: My husband and I are VERY disappointed and shocked 4.786323188982842e-14
Review 4: Initially I thought these angled bottles make a lo 3.168521884487184e-13
Review 5: THIS BASSINET IS OVERPRICED AND RIDICULOUS  If we  1.001302463424443e-12
Wall time: 38.9 ms


### Exercise 5
#### In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.

#### a) Redo exercises 2-4 using the limited dictionary.   
#### b) Check the impact of all the words from the dictionary.   
#### c) Compare the accuracy of predictions and the evaluation time.

In [21]:
xTrain, xTest, yTrain, yTest = train_test_split(baby_df['review'], baby_df['rating'], test_size=0.2, random_state=0)

# Restrict the vocabulary to the 20 words collected in Exercise 3b
vectorizer = CountVectorizer(vocabulary=dictionary.split())
X_vectorized_to_our_dict = vectorizer.fit_transform(xTest)
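
When a `vocabulary` is passed, CountVectorizer does not learn new words; the matrix has exactly one column per listed word, in the given order. A small sketch with a hypothetical four-word list:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical list of significant words (illustration only)
significant_words = ["love", "perfect", "waste", "poor"]

vectorizer = CountVectorizer(vocabulary=significant_words)
X = vectorizer.fit_transform(["I love love this perfect seat",
                              "poor quality"])

# Columns follow the order of significant_words; all other words are dropped
print(X.toarray())
```

Here the first review is reduced to the counts of "love" (2) and "perfect" (1), which shows how much information a limited dictionary discards.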

In [22]:
%%time
model_with_limited_dir = LogisticRegression(solver="lbfgs",max_iter=10000).fit(X_vectorized_to_our_dict, yTest)

Wall time: 142 ms


In [23]:
%%time
predictions_with_limited_dir = model_with_limited_dir.predict(X_vectorized_to_our_dict)
# Count the predictions that match the true labels
suma = int((predictions_with_limited_dir == yTest.values).sum())

print("Prediction accuracy: "+str((suma/len(yTest))*100)+"%")
predicted_sentiment = model_with_limited_dir.predict_proba(X_vectorized_to_our_dict)

Prediction accuracy: 87.13381907588978%
Wall time: 55.9 ms


In [24]:
%%time
revs2 = list(zip(xTest, predicted_sentiment[:,1]))
revs2.sort(key = lambda x: x[1], reverse=True)
print("Top 5 reviews")
for i, x in enumerate(revs2[:5]):
    print('Review {}:'.format(i+1), x[0][:50], x[1])
print("\nTop 5 worst reviews")
for i, x in enumerate(reversed(revs2[-5:])):
    print('Review {}:'.format(i+1), x[0][:50], x[1])

Top 5 reviews
Review 1: I researched strollers for months and months befor 0.9999994053446656
Review 2: We LOVE this seat As parents to 8 children ranging 0.9999991209510133
Review 3: This diaper bag is PERFECT There are a lot of pock 0.9999960334017929
Review 4: I bought this gym for our grandaughter  Her mother 0.9999906420766114
Review 5: i love love love recaro and i love love love this  0.9999867999990523

Top 5 worst reviews
Review 1: I have five children We have owned several strolle 1.361683647842494e-05
Review 2: I registered for this product and received it and  4.1426989855789376e-05
Review 3: If we only knew when we registered how terrible th 0.00019288833706413138
Review 4: Basically I just want to second Dressy Grad Studen 0.00024966938898068566
Review 5: This was possibly the most disappointing electroni 0.00025022700525714173
Wall time: 26.9 ms


In [19]:
print("Data for limited dict")
# Coefficients of the model trained on the limited dictionary
# (use get_feature_names() on scikit-learn < 1.0)
feature_names = vectorizer.get_feature_names_out()
for i, name in enumerate(feature_names):
    print(str(i+1)+" = "+str(name)+" = "+str(model_with_limited_dir.coef_[0][i]))
print("\nData for unlimited dict")
# Coefficients of the same 20 words in the full-vocabulary model
for index, x in enumerate(words[:10] + words[-10:], start=1):
    print(str(index)+" = "+x[0]+" = "+str(x[1]))

Data for limited dict
1 = lifesaver = 1.9739819111588028
2 = perfectly = 1.3599040828476456
3 = highly = 1.7338576222649003
4 = perfect = 1.339694409210868
5 = plenty = 1.5045715293359114
6 = unlike = 1.0689782392214933
7 = glad = 1.175944820931581
8 = love = 1.3790911093516982
9 = downside = 1.9334764431053262
10 = complaint = 1.2036867753882554
11 = returning = -2.728921469081398
12 = returned = -2.329746552102015
13 = poorly = -2.647683464759656
14 = worst = -2.351636691726289
15 = terrible = -2.2474315296611747
16 = useless = -2.006925464008881
17 = disappointed = -2.437097349989623
18 = waste = -2.5792479431372084
19 = disappointing = -2.5642139743772567
20 = poor = -2.2297623286054433

Data for unlimited dict
1 = lifesaver = 2.213955598161604
2 = perfectly = 2.0679116719841493
3 = highly = 2.0231771059892654
4 = perfect = 1.9260397456299283
5 = plenty = 1.9070858570960028
6 = unlike = 1.8895098295785855
7 = glad = 1.8385306190195196
8 = love = 1.8124556174362176
9 = downside = 1.