## AMAZON BABY SENTIMENT ANALYSIS

Here we compare and evaluate three models (Logistic Regression, Support Vector Machines, and K-Nearest Neighbors) using two types of vectorizers (Count Vectorizer and TF-IDF Vectorizer) for classifying the sentiment of Amazon baby product reviews.

### LOAD PACKAGES AND DATA

In [2]:
from zipfile import ZipFile
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

SEED = 100

In [3]:
with ZipFile('amazon_baby.csv.zip', 'r') as zipObj:
    # Extract all the contents of the zip file into the current directory
    zipObj.extractall()

df = pd.read_csv('amazon_baby.csv')

### PREPROCESS DATA

In [4]:
# Getting rid of null values
df = df.dropna()

# Taking a 30% random sample (seeded for reproducibility)
np.random.seed(SEED)
df1 = df.sample(frac = 0.3)

# Adding the sentiments column: ratings 1-2 are negative (0), ratings 3-5 are positive (1)
df1['sentiments'] = df1.rating.apply(lambda x: 0 if x in [1, 2] else 1)

In [5]:
df1.head()

| | name | review | rating | sentiments |
|---|---|---|---|---|
| 182314 | Levana Sophia 2.4" Digital Video Baby Mon... | The Picture quality is amazing but the screen ... | 3 | 1 |
| 14599 | Single Animal Jingle Bell (Assorted Styles) | My little guy loves this. He enjoys shaking it... | 5 | 1 |
| 51886 | Baby Einstein Take Along Tunes | This is the BEST toy! My son has had it for al... | 5 | 1 |
| 27494 | Fisher-Price Rainforest Melodies and Lights De... | Easy to assemble (though I believe Fisher-Pric... | 5 | 1 |
| 84879 | Sassy Crib and Floor Mirror | This was for my 5 month old - it is cute, she ... | 4 | 1 |


In [6]:
X = df1['review']
y = df1['sentiments']

In [8]:
y.unique()

array([1, 0], dtype=int64)

### COUNT VECTORIZER

* Transforms each document into a vector of word counts: one column per distinct word in the vocabulary, where each entry is the number of times that word occurs in the document.
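To make the counting idea concrete, here is a minimal standard-library sketch of a bag-of-words matrix. It is not scikit-learn's implementation; the helper name `bag_of_words` and the two toy documents are hypothetical, and the tokenization (lowercasing, tokens of two or more word characters) is assumed to mirror the `CountVectorizer` defaults.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count rows) for a list of text documents."""
    # Assumption: lowercase and keep tokens of 2+ word characters,
    # intended to mirror CountVectorizer's default behaviour.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

    # One column per distinct token, in alphabetical order.
    vocab = sorted({tok for toks in tokenized for tok in toks})

    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing tokens count as 0
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


# Hypothetical toy reviews, not taken from the dataset.
toy_docs = ["Great stroller, great price", "The stroller broke"]
vocab, rows = bag_of_words(toy_docs)
print(vocab)  # ['broke', 'great', 'price', 'stroller', 'the']
print(rows)   # [[0, 2, 1, 1, 0], [1, 0, 0, 1, 1]]
```

Each row is one document and each column one vocabulary word, which is the same layout as the sparse matrix returned by `cv.fit_transform`.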

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size = 0.5, random_state=SEED)
cv = CountVectorizer()

# Vectorizing the text data: the vocabulary is learned from the training set only and reused for the test set
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

**LOGISTIC REGRESSION**

In [23]:
#Training the model
lr = LogisticRegression(solver='lbfgs', max_iter=1000)
lr.fit(ctmTr, y_train)

#Accuracy score
lr_score = lr.score(X_test_dtm, y_test)
print("Results for Logistic Regression with CountVectorizer")
print(lr_score)

#Predicting the labels for test data
y_pred_lr = lr.predict(X_test_dtm)

#Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)

#True positive and true negative rates
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)

Results for Logistic Regression with CountVectorizer
0.9042327655530376
2306 1618 1002 22432
0.9572 0.5877
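
As a worked check of the numbers above, the sketch below recomputes accuracy, the true positive rate, and the true negative rate directly from the four confusion-matrix counts. The helper name `summarize` is hypothetical, and the majority-class baseline is an extra reference point added here as an assumption about what is useful, not something the notebook computes.

```python
def summarize(tn, fp, fn, tp):
    """Derive summary rates from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),  # share of positive reviews found
        "tnr": tn / (tn + fp),  # share of negative reviews found
        # Accuracy of always predicting the larger class (added for context).
        "majority_baseline": max(tp + fn, tn + fp) / total,
    }


# Counts printed above for Logistic Regression with CountVectorizer.
stats = summarize(tn=2306, fp=1618, fn=1002, tp=22432)
for name, value in stats.items():
    print(name, round(value, 4))
```

With 23,434 positive and 3,924 negative test reviews, always predicting "positive" would already score roughly 0.857 accuracy, so the true negative rate is the more informative number when comparing the models.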


**SUPPORT VECTOR MACHINE**

In [24]:
#Training the model
svcl = svm.SVC()
svcl.fit(ctmTr, y_train)
svcl_score = svcl.score(X_test_dtm, y_test)
print("Results for Support Vector Machine with CountVectorizer")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_dtm)

#Confusion matrix
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)

Results for Support Vector Machine with CountVectorizer
0.891549089845749
1159 2765 202 23232
0.9914 0.2954


**K-NEAREST NEIGHBOR**

In [26]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(ctmTr, y_train)
knn_score = knn.score(X_test_dtm, y_test)
print("Results for KNN Classifier with CountVectorizer")
print(knn_score)
y_pred_knn = knn.predict(X_test_dtm)

#Confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

Results for KNN Classifier with CountVectorizer
0.8563491483295562
269 3655 275 23159
0.9883 0.0686


### TFIDF VECTORIZER

Text vectorizer that transforms text into a numeric vector. It combines two concepts, Term Frequency (TF) and Document Frequency (DF):
1. Term frequency is the number of occurrences of a specific term in a document, and indicates how important that term is within the document. Term frequencies represent the data as a matrix with one row per document and one column per distinct term across all documents.
2. Document frequency is the number of documents containing a specific term, and indicates how common the term is.
    * Inverse document frequency (IDF) is the weight of a term; it reduces the weight of a term whose occurrences are spread across many of the documents.
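
The weighting described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which scikit-learn documents as its default behaviour. The helper name `tfidf` and the toy count matrix are hypothetical.

```python
import math


def tfidf(count_rows):
    """Turn a dense count matrix (list of rows) into TF-IDF weights."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])

    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]

    # Assumed smoothed IDF: rarer terms get a larger weight.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

    weighted = []
    for row in count_rows:
        w = [tf * i for tf, i in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0  # guard empty rows
        weighted.append([x / norm for x in w])  # L2-normalize each document
    return idf, weighted


# Toy counts for the terms: broke, great, price, stroller, the
toy_counts = [[0, 2, 1, 1, 0], [1, 0, 0, 1, 1]]
idf, weights = tfidf(toy_counts)
print([round(v, 4) for v in idf])
print([[round(v, 4) for v in row] for row in weights])
```

In this toy matrix the fourth term appears in both documents, so it receives the smallest IDF and a lower weight than terms unique to a single document.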

In [30]:
# tfidf vectorizer
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

**LOGISTIC REGRESSION**

In [31]:
lr = LogisticRegression()
lr.fit(X_train_vec, y_train)
lr_score = lr.score(X_test_vec, y_test)
print("Results for Logistic Regression with tfidf")
print(lr_score)
y_pred_lr = lr.predict(X_test_vec)

# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)

Results for Logistic Regression with tfidf
0.9030996417866803
1676 2248 403 23031
0.9828 0.4271


**SUPPORT VECTOR MACHINE**

In [32]:
#params = {'kernel':('linear', 'rbf'), 'C':[1, 10, 100]}
svcl = svm.SVC(kernel = 'rbf')

#clf_sv = GridSearchCV(svcl, params)
svcl.fit(X_train_vec, y_train)
svcl_score = svcl.score(X_test_vec, y_test)
print("Results for Support Vector Machine with tfidf")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_vec)

#Confusion matrix
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)

Results for Support Vector Machine with tfidf
0.9050734702829154
1679 2245 352 23082
0.985 0.4279


**K-NEAREST NEIGHBOR**

In [33]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_vec, y_train)
knn_score = knn.score(X_test_vec, y_test)
print("Results for KNN Classifier with tfidf")
print(knn_score)
y_pred_knn = knn.predict(X_test_vec)

#Confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

Results for KNN Classifier with tfidf
0.8631478909276994
610 3314 430 23004
0.9817 0.1555
