# <strong><big>A SENTIMENT ANALYSIS MINI-PROJECT USING AN AMAZON DATASET</big></strong>
<big><em>A real-world data science project in Python using the scikit-learn library.<br> 
    We build a model that automatically classifies review text as having positive or negative sentiment.<br>
    It also classifies the product type from the input text.<br>
    Amazon reviews serve as the training data.</em></big><br><br>
    <big>Models trained: Support Vector Machines (SVM), Decision Trees, Naive Bayes Classifier, Logistic Regression</big><br><br>
    <big>Team Members:</big><br>

    
<table border="1" align="left" style="width: 25%">
	  <tr>
	    <th><big>Name</big></th>
	    <th><big>Roll Number</big></th>
	  </tr>
	  <tr>
	    <td align="center"><em><strong>Anushka Khuspe</strong></em></td>
	    <td align="center"><em><strong>106</strong></em></td>
	  </tr>
	  <tr>
	    <td align="center"><em><strong>Om Parghale</strong></em></td>
	    <td align="center"><em><strong>099</strong></em></td>
	  </tr>
	  <tr>
	    <td align="center"><em><strong>Shivani Nyamgouda</strong></em></td>
	    <td align="center"><em><strong>103</strong></em></td>
	  </tr>
</table>   


## **Load In Data**

### Data Class

In [1]:
import random

class Sentiment:
    NEGATIVE="NEGATIVE"
    POSITIVE="POSITIVE"
    NEUTRAL="NEUTRAL"
    

class Review:
    def __init__(self,text,score):
        self.text=text
        self.score=score
        self.sentiment=self.get_sentiment()
        
    def get_sentiment(self):
        if self.score<=2:
            return Sentiment.NEGATIVE
        elif self.score==3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self,reviews):
        self.reviews=reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distributed(self):
        """Balance the classes: keep every negative review and an equal number of positive ones."""
        negative = list(filter(lambda x: x.sentiment==Sentiment.NEGATIVE,self.reviews))
        positive = list(filter(lambda x: x.sentiment==Sentiment.POSITIVE,self.reviews))

        # Neutral reviews are dropped; positives are truncated to the negative count
        # (this assumes there are at least as many positive reviews as negative ones).
        positive_shrunk=positive[:len(negative)]
        self.reviews=negative+positive_shrunk
        random.shuffle(self.reviews)
        
        

### Load in Data

In [2]:
import json
import numpy as np

reviews=[]
with open('Books_small_10000.json') as f:
    for line in f:
        review=json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))
        
reviews[10].text,reviews[10].sentiment

("My only complaint about this book is that it is much too short. I love this author and this series, and I can't wait for the next installment.",
 'POSITIVE')

## **Data Prep**

### Splitting the data for training and testing

In [3]:
from sklearn.model_selection import train_test_split

training,testing=train_test_split(reviews,test_size=0.33,random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(testing)


In [4]:
train_container.evenly_distributed()
x_train=train_container.get_text()
y_train=train_container.get_sentiment()

test_container.evenly_distributed()
x_test=test_container.get_text()
y_test=test_container.get_sentiment()

print(y_train.count(Sentiment.POSITIVE))
print(y_train.count(Sentiment.NEGATIVE))

436
436
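
`evenly_distributed` balances the two classes by keeping the first `len(negative)` positive reviews. As a minimal sketch of the same idea, the hypothetical helper below (`balance_binary` is not part of this notebook) undersamples the larger class at random with a fixed seed, so the kept reviews do not depend on file order. The toy texts and labels are made up for illustration.

```python
import random

def balance_binary(items, labels, seed=42):
    """Randomly undersample so POSITIVE and NEGATIVE have equal counts.

    Hypothetical helper for illustration; any other label (e.g. NEUTRAL) is dropped.
    """
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(items, labels) if y == "POSITIVE"]
    neg = [(x, y) for x, y in zip(items, labels) if y == "NEGATIVE"]
    n = min(len(pos), len(neg))          # size of the smaller class
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys

# Toy data: 4 positive, 1 negative, 1 neutral
texts = ["a", "b", "c", "d", "e", "f"]
labels = ["POSITIVE"] * 4 + ["NEGATIVE", "NEUTRAL"]

xs, ys = balance_binary(texts, labels)
print(ys.count("POSITIVE"), ys.count("NEGATIVE"))
```

Sampling instead of slicing matters mainly when the input file is sorted in some way, for example by date or by product.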


### TFIDF Vectorization

In [5]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

vectorizer = TfidfVectorizer()
x_train_vectors=vectorizer.fit_transform(x_train)
x_test_vectors=vectorizer.transform(x_test)
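
The vectorizer is fitted on the training text only and then reused, unchanged, on the test text. The small sketch below shows what that means on a made-up corpus (the `toy_*` names and sentences are assumptions for illustration): the vocabulary comes from the training documents, and words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, only for illustration
toy_train = ["great book loved it", "boring book hated it", "great story"]
toy_test = ["great unseen words"]

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
test_matrix = toy_vectorizer.transform(toy_test)        # reuses them, learns nothing new

print(train_matrix.shape)                    # (documents, vocabulary size)
print(test_matrix.shape)                     # same number of columns as the training matrix
print(sorted(toy_vectorizer.vocabulary_))    # words learned from the training text
```

Only "great" from the test sentence is in the learned vocabulary, so the test row has a single non-zero entry.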

## Classification

### Support Vector Machines

In [6]:
from sklearn import svm

# create the classifier
clf_svm=svm.SVC(kernel='rbf',C=128.0)
# fit the SVM on the training vectors
clf_svm.fit(x_train_vectors,y_train)


In [7]:
# Prediction testing on a random vector
print(x_test[12],'\n',"True Label =",y_test[12])
print("\n Predicted Label = ",clf_svm.predict(x_test_vectors[12]),'\n\n')

print(x_test[54],'\n',"True Label =",y_test[54])
print("\n Predicted Label = ",clf_svm.predict(x_test_vectors[54]))

Several stories beautifully combined into a great read.  Robie and Reel are such a good pairing.  And now I'd love to go back to them as the trio of Jerome, Julie, and Min progress in their unique relationship.  There has to be another story with the five of them. 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE'] 


I really liked this book. It's not your typical kind of romance novel. It's down-to-earth and funny and tells a story that could actually happen. Cindy is a smart woman who never gives up and learns from her mistakes. This is why she's ultimately able to enter into a relationship that's good for her, and find love on equal terms. But there's wisdom in here for anyone who wants to change their life for the better. I'm off to buy The Sugar Ticket to see what happens to Cindy next! 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE']


### Decision Trees

In [8]:
from sklearn.tree import DecisionTreeClassifier

# create the classifier
clf_dec = DecisionTreeClassifier()
# fit the decision tree on the training vectors
clf_dec.fit(x_train_vectors,y_train)

In [9]:
# Prediction testing on a random vector, using the decision tree trained above
print(x_test[37],'\n',"True Label =",y_test[37])
print("\n Predicted Label = ",clf_dec.predict(x_test_vectors[37]),'\n\n')

print(x_test[61],'\n',"True Label =",y_test[61])
print("\n Predicted Label = ",clf_dec.predict(x_test_vectors[61]))

The 2 star doesn't have anything to do with Tris' decision. It was disappointing because he previous 2 books had such an intensity that I actually felt let down by the author. The explanation on how the 'experiments' were conducted was a bit insulting, being a person in science myself, which is big part on the reason for the low score. I felt too much explanation was given on the last book that wasn't even hinted on the previous ones. I do think a lot of it could have been utilized earlier in the story and not have to rush everything for the end.On e other hand, I did like how Four came down from the pedestal and joined the rest of the world, but thought extreme and unnecessary. Did appreciate the fights and doubts on their relationship, making it more real to me, but again I felt it was to extreme and too dull. It did continue building on the importance of forgiveness and self-confidence and therefore the second star, but after the hype on the first 2 books, I don't feel this one live

### Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB

# create the classifier
clf_nb=GaussianNB()

# GaussianNB requires dense input, so convert the sparse TF-IDF matrices to arrays before fitting
x_train_vectors_dense = x_train_vectors.toarray()
x_test_vectors_dense  = x_test_vectors.toarray()
clf_nb.fit(x_train_vectors_dense, y_train)
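
`GaussianNB` needs the dense conversion above, which can use a lot of memory on a large vocabulary. As an alternative sketch (not the model used in this notebook), scikit-learn's `MultinomialNB` accepts the sparse TF-IDF matrix directly. The toy corpus and the `toy_*` / `clf_mnb` names are assumptions for illustration, and its accuracy on the Amazon data has not been measured here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus, only for illustration
toy_texts = [
    "loved this great book",
    "great story loved it",
    "boring and dull",
    "dull boring plot hated it",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)  # stays a sparse matrix

clf_mnb = MultinomialNB()
clf_mnb.fit(toy_vectors, toy_labels)                   # no .toarray() needed

toy_pred = clf_mnb.predict(toy_vectorizer.transform(["great book"]))
print(toy_pred)
```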

In [11]:
# Prediction testing on a random vector, using the Naive Bayes model trained above
# (GaussianNB expects a dense 2-D array, hence the [[i]] row selection)
print(x_test[69],'\n',"True Label =",y_test[69])
print("\n Predicted Label = ",clf_nb.predict(x_test_vectors_dense[[69]]),'\n\n')

print(x_test[88],'\n',"True Label =",y_test[88])
print("\n Predicted Label = ",clf_nb.predict(x_test_vectors_dense[[88]]))

fast paced with an ending you don't see coming.  Well written and engaging in so many levels.  Great job Jeff Carson 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE'] 


Right away one could tell that this book was not true historical fiction. The innocent , beautiful, chaste, but learned bath maid enraptured everyone she met including a prince, a doctor, and a brewer. In the meantime religion fights with science and bloodletting fights with medicine.  Many of the characters were so stock it was like shopping at a big box store. However it is a quick read and if you like ripped bodice type of romance novels this will fit the bill. 
 True Label = NEGATIVE

 Predicted Label =  ['POSITIVE']


### Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression

# create the classifier
clf_log = LogisticRegression(max_iter=1000)

# fit the logistic regression model on the training vectors
clf_log.fit(x_train_vectors,y_train)

In [13]:
# Prediction testing on a random vector, using the logistic regression model trained above
print(x_test[35],'\n',"True Label =",y_test[35])
print("\n Predicted Label = ",clf_log.predict(x_test_vectors[35]),'\n\n')

print(x_test[99],'\n',"True Label =",y_test[99])
print("\n Predicted Label = ",clf_log.predict(x_test_vectors[99]))

If you ever wondered why people are possessed with mountaim climbing Andy Kirkpatrick's book will help you understand. Andy explores his inner self with a discerning eye toward answering the &#34;why do I do it&#34; question. This memoir is well written and properly edited and is a worthwhile read. He brings the reader along with him on a number of interesting journeys. Highly recommended 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE'] 


All you ever wanted to know about surrendering yourself to someone else. This is not a new world for some. 
 True Label = POSITIVE

 Predicted Label =  ['NEGATIVE']


## **Evaluation**

In [14]:
# Mean accuracy on the given test data and labels
print("Mean Accuracy:\n")
print("Support Vector Machine score = ",clf_svm.score(x_test_vectors,y_test),'\n')
print("Decision Tree score = ",clf_dec.score(x_test_vectors,y_test),'\n')
print("Naive Bayes score = ",clf_nb.score(x_test_vectors_dense,y_test),'\n')
print("Logistic regression score = ",clf_log.score(x_test_vectors,y_test),'\n')

Mean Accuracy:

Support Vector Machine score =  0.8197115384615384 

Decision Tree score =  0.6370192307692307 

Naive Bayes score =  0.6610576923076923 

Logistic regression score =  0.8052884615384616 



In [15]:
# F1 Scores:
from sklearn.metrics import f1_score
print("Support Vector Machine F1 score = ",f1_score(y_test,clf_svm.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Decision Tree F1 score = ",f1_score(y_test,clf_dec.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Naive Bayes F1 score = ",f1_score(y_test,clf_nb.predict(x_test_vectors_dense),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Logistic regression F1 score = ",f1_score(y_test,clf_log.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')

Support Vector Machine F1 score =  [0.82269504 0.81662592] 

Decision Tree F1 score =  [0.63961814 0.63438257] 

Naive Bayes F1 score =  [0.65693431 0.66508314] 

Logistic regression F1 score =  [0.80291971 0.80760095] 
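
With `average=None`, `f1_score` returns one value per class in the order given by `labels`, here POSITIVE first and NEGATIVE second. The sketch below, on made-up labels, shows how those per-class values relate to a confusion matrix; `y_true`, `y_pred`, and `order` are illustrative names, not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels, only for illustration
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
order = ["POSITIVE", "NEGATIVE"]

# Rows are true classes, columns are predicted classes, both in `order`
cm = confusion_matrix(y_true, y_pred, labels=order)
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=order)

print(cm)
print(per_class_f1)
```

Each class here has precision 2/3 and recall 2/3, so its F1 score, 2PR/(P+R), is also 2/3.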



### Optimization using GridSearch (cross validation)

In [16]:
from sklearn.model_selection import GridSearchCV
abc_svc=svm.SVC(kernel='rbf',C=4.0)
abc_svc.fit(x_train_vectors,y_train)

print(abc_svc.score(x_test_vectors,y_test))

0.8197115384615384


In [31]:
kernel=['linear','rbf','poly']
for i in kernel:
    model=svm.SVC(kernel=i,C=4.0)
    model.fit(x_train_vectors,y_train)
    print('For kernel',i)
    print("Accuracy is: ",model.score(x_test_vectors,y_test))

For kernel linear
Accuracy is:  0.8052884615384616
For kernel rbf
Accuracy is:  0.8197115384615384
For kernel poly
Accuracy is:  0.7716346153846154


In [36]:
# Sweep the polynomial degree (0 to 9) and compare training vs. testing accuracy
for i in range(0,10):
    model=svm.SVC(kernel='poly',degree=i,C=100)
    model.fit(x_train_vectors,y_train)
    print("Accuracy on training data: ",model.score(x_train_vectors,y_train))
    print("Accuracy on testing data: ",model.score(x_test_vectors,y_test))
    

Accuracy on training data:  0.5
Accuracy on testing data:  0.5
Accuracy on training data:  1.0
Accuracy on testing data:  0.8028846153846154
Accuracy on training data:  1.0
Accuracy on testing data:  0.8197115384615384
Accuracy on training data:  1.0
Accuracy on testing data:  0.7716346153846154
Accuracy on training data:  1.0
Accuracy on testing data:  0.7067307692307693
Accuracy on training data:  1.0
Accuracy on testing data:  0.6658653846153846
Accuracy on training data:  1.0
Accuracy on testing data:  0.6514423076923077
Accuracy on training data:  1.0
Accuracy on testing data:  0.6298076923076923
Accuracy on training data:  1.0
Accuracy on testing data:  0.6201923076923077
Accuracy on training data:  1.0
Accuracy on testing data:  0.6201923076923077
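
The loops above choose settings by comparing accuracy on the test set. A cross-validated search over the training data only is one way to avoid tuning against the test split. The sketch below runs `GridSearchCV` (imported earlier in this section) on a small made-up corpus; the toy texts, the parameter grid, and `cv=3` are illustrative assumptions, not values from this notebook.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up corpus, only for illustration: 6 positive and 6 negative reviews
toy_texts = [
    "loved this wonderful book", "great story and great characters",
    "amazing read loved it", "wonderful and inspiring",
    "great book highly recommended", "an amazing wonderful story",
    "boring and dull book", "terrible plot hated it",
    "dull characters and boring story", "hated this terrible book",
    "boring read not recommended", "a dull terrible waste",
]
toy_labels = ["POSITIVE"] * 6 + ["NEGATIVE"] * 6

toy_vectors = TfidfVectorizer().fit_transform(toy_texts)

# Every combination of kernel and C is scored with 3-fold cross-validation
param_grid = {"kernel": ["linear", "rbf"], "C": [1.0, 4.0, 16.0]}
search = GridSearchCV(svm.SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(toy_vectors, toy_labels)

print(search.best_params_)   # best combination found by cross-validation
print(search.best_score_)    # mean cross-validated accuracy of that combination
```

In the notebook itself, the same call would take `x_train_vectors` and `y_train`, leaving `x_test_vectors` for a single final evaluation of `search.best_estimator_`.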


## **Qualitative Testing** 

In [17]:
test_review=["Such a boring read,couldn't finish!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


In [18]:
test_review=["Had an amazing experience reading this thrilling book!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['POSITIVE']


In [19]:
test_review=["What the hell was that!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


In [20]:
test_review=[" This is wonderful book, inspiring and wise. My uncle was taken by the Nazis and was almost dead due to typhus when the camp at Dachau was liberated. He was discovered in a heap of bodies by a doctor who noticed a flicker of his eyelids. He was taken to hospital in Budapest and survived until 1967. This book gave me an insight into what he must have suffered. He never complained was always cheerful and full of mischief. The second half of the book about logotherapy is also very interesting and worth reading. "]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['POSITIVE']


In [21]:
test_review=['Too many technical terms in physchology domain makes reading very uncomfortable. Not much of a self help book as tagged.']
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


## Saving Model

In [22]:
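import os
import pickle

# Create the output directory if it does not exist yet, so open() does not fail
os.makedirs('./models', exist_ok=True)

with open('./models/sentiment_class_svm.pkl', 'wb') as f:
    pickle.dump(clf_svm, f)
    
with open('./models/sentiment_class_svm_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)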
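
The classifier and the vectorizer are saved together because a reloaded model can only score text that has been transformed with the same fitted vocabulary. The sketch below shows the round trip on a made-up toy model written to a temporary directory; the `toy_*` names and the file name are assumptions for illustration, not the files saved above.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus and model, only for illustration
toy_texts = ["loved this great book", "great story", "boring and dull", "hated this dull book"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = TfidfVectorizer()
toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    # Save the fitted vectorizer and classifier side by side
    with open(path, "wb") as f:
        pickle.dump((toy_vectorizer, toy_clf), f)
    # Load them back, as a separate script or service would
    with open(path, "rb") as f:
        loaded_vectorizer, loaded_clf = pickle.load(f)

sample = ["a great book"]
original_pred = toy_clf.predict(toy_vectorizer.transform(sample))
reloaded_pred = loaded_clf.predict(loaded_vectorizer.transform(sample))
print(original_pred, reloaded_pred)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.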

# **Thank You 😄**