### In this project, we build a model that automatically classifies text as having a positive or negative sentiment. It also identifies the product type from the input text.
#### We accomplish this by using Amazon reviews as training data.
#### Models trained: Support Vector Machines (SVM), Decision Trees, Naive Bayes Classifier, Logistic Regression

#### <em>A project by Om Parghale</em>

## **Load In Data**

### Data Class

In [1]:
import random

class Sentiment:
    NEGATIVE="NEGATIVE"
    POSITIVE="POSITIVE"
    NEUTRAL="NEUTRAL"
    

class Review:
    def __init__(self,text,score):
        self.text=text
        self.score=score
        self.sentiment=self.get_sentiment()
        
    def get_sentiment(self):
        if self.score<=2:
            return Sentiment.NEGATIVE
        elif self.score==3:
            return Sentiment.NEUTRAL
        else: #Score of 4 or 5
            return Sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self,reviews):
        self.reviews=reviews
        
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
    def evenly_distributed(self):
        negative = list(filter(lambda x: x.sentiment==Sentiment.NEGATIVE,self.reviews))
        positive = list(filter(lambda x: x.sentiment==Sentiment.POSITIVE,self.reviews))
        #neutral = list(filter(lambda x: x.sentiment==Sentiment.NEUTRAL,self.reviews))
        
        positive_shrunk=positive[:len(negative)]
        # neutral_shrunk=neutral[:len(negative)]
        # self.reviews=negative+positive_shrunk+neutral_shrunk
        self.reviews=negative+positive_shrunk
        random.shuffle(self.reviews)
        # print(negative[0].text)
        # print(len(negative))
        # print(len(neutral))
        # print(len(positive))
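
The score thresholds and the balancing step can be tried in isolation. The sketch below is a minimal standalone version, assuming hypothetical helper names `score_to_sentiment` and `balance` that are not part of the notebook. Note one deliberate difference: it draws a random sample from the larger class, whereas `evenly_distributed` keeps the first `len(negative)` positive reviews.

```python
import random

def score_to_sentiment(score):
    """Map a 1-5 star rating to a label, using the same thresholds as Review.get_sentiment."""
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"  # score of 4 or 5

def balance(labelled, seed=0):
    """Return a shuffled list with equal numbers of NEGATIVE and POSITIVE items.

    `labelled` is a list of (text, label) pairs; NEUTRAL items are dropped.
    """
    rng = random.Random(seed)
    negative = [pair for pair in labelled if pair[1] == "NEGATIVE"]
    positive = [pair for pair in labelled if pair[1] == "POSITIVE"]
    n = min(len(negative), len(positive))
    # Random downsampling (an assumption; the notebook slices the first n positives)
    balanced = rng.sample(negative, n) + rng.sample(positive, n)
    rng.shuffle(balanced)
    return balanced

scores = [5, 4, 1, 3, 5, 2, 4, 5]
labelled = [(f"review {i}", score_to_sentiment(s)) for i, s in enumerate(scores)]
balanced = balance(labelled)
labels = [label for _, label in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"))  # prints: 2 2
```

Taking `min` of the two class sizes also covers the case where negative reviews outnumber positive ones, which the slicing approach does not handle.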
        
        

### Load in Data

In [2]:
import json
import numpy as np

reviews=[]
with open('Books_small_10000.json') as f:
    for line in f:
        review=json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))
        
reviews[10].text,reviews[10].sentiment

("My only complaint about this book is that it is much too short. I love this author and this series, and I can't wait for the next installment.",
 'POSITIVE')
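
The loading loop assumes every line carries both `reviewText` and `overall`. A more defensive variant might look like the sketch below, where `load_reviews` is a hypothetical helper and the inline sample records stand in for `Books_small_10000.json`.

```python
import io
import json

def load_reviews(lines):
    """Parse JSON-lines records into (text, score) pairs, skipping incomplete ones."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        text = record.get("reviewText")
        score = record.get("overall")
        if text is None or score is None:
            continue  # skip records missing either field
        pairs.append((text, float(score)))
    return pairs

# Made-up sample data: the second record has no review text, the third line is blank
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"overall": 2.0}\n'
    '\n'
    '{"reviewText": "Too slow", "overall": 2.0}\n'
)
pairs = load_reviews(sample)
print(pairs)  # prints: [('Loved it', 5.0), ('Too slow', 2.0)]
```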

## **Data Prep**

### Splitting the data for training and testing

In [3]:
from sklearn.model_selection import train_test_split

training,testing=train_test_split(reviews,test_size=0.33,random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(testing)


In [4]:
train_container.evenly_distributed()
x_train=train_container.get_text()
y_train=train_container.get_sentiment()

test_container.evenly_distributed()
x_test=test_container.get_text()
y_test=test_container.get_sentiment()

print(y_train.count(Sentiment.POSITIVE))
print(y_train.count(Sentiment.NEGATIVE))

436
436


### TF-IDF Vectorization

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
x_train_vectors=vectorizer.fit_transform(x_train)
x_test_vectors=vectorizer.transform(x_test)
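
To make the weighting concrete, here is a rough pure-Python sketch of the TF-IDF computation. It assumes the smoothed formula documented for scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, and it uses a plain lowercase whitespace split instead of the vectorizer's real tokenizer, so the numbers are illustrative only. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]  # simplified tokenizer
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        vectors.append({term: weight / norm for term, weight in raw.items()})
    return vectors

docs = ["great book great story", "boring book"]
vectors = tfidf(docs)
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the second document, "boring" ends up with a higher weight than "book" because "book" appears in both documents and therefore has a lower idf.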

## Classification

### Support Vector Machines

In [6]:
from sklearn import svm

# Create the classifier
clf_svm=svm.SVC(kernel='rbf',C=128.0)
# Fit the SVM on the training vectors
clf_svm.fit(x_train_vectors,y_train)


In [7]:
# Prediction testing on a random vector
print(x_test[12],'\n',"True Label =",y_test[12])
print("\n Predicted Label = ",clf_svm.predict(x_test_vectors[12]),'\n\n')

print(x_test[54],'\n',"True Label =",y_test[54])
print("\n Predicted Label = ",clf_svm.predict(x_test_vectors[54]))

Sweet Southern Betrayal is the third book in author Robin Covington&#8217;s The Boys Are Back in Town Series and oh my, these boys can back into my town any time they&#8217;d like!  I adored this book!Successful attorney Teague Elliot has the world in the palm of his hand.  He&#8217;s just about to land the title of youngest partner in his uber successful Washington D.C. law firm and he couldn&#8217;t be happier.  Of course, this is just another stepping stone in his plan to enter the world of politics.  He&#8217;s also managed to keep his nose clean and the skeletons out of his closet his entire life in prep to take The Oval Office one day.  This brings us to the present&#8230;when he heads to Las Vegas with his friends and ends up in bed with a Vegas showgirl and absolutely no memory of why she&#8217;s here.  Or what he may have done the night before.  Will &#8220;what happens in Vegas&#8221; actually stay in Vegas?  Never has Teague needed it that saying to be more true than right n

### Decision Trees

In [8]:
from sklearn.tree import DecisionTreeClassifier

#object creation
clf_dec = DecisionTreeClassifier()
#fitting the model for training in a DecTree
clf_dec.fit(x_train_vectors,y_train)

In [9]:
# Prediction testing on a random vector, using the decision tree
print(x_test[37],'\n',"True Label =",y_test[37])
print("\n Predicted Label = ",clf_dec.predict(x_test_vectors[37]),'\n\n')

print(x_test[61],'\n',"True Label =",y_test[61])
print("\n Predicted Label = ",clf_dec.predict(x_test_vectors[61]))

If you are a Winterson fan, you will love this little book of stories. Intelligent, well-written, wildly imaginative and insightful, each story is a meal in itself and somehow part of a greater thread of intuitive wisdom that holds the collection together. Wonderfully enjoyable read, thought-provoking and delightful. 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE'] 


This book is not relevant to modern life. It would be very difficult to recreate such a lifestyle as most people no longer live on farms. 
 True Label = NEGATIVE

 Predicted Label =  ['NEGATIVE']


### Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB

# object creation
clf_nb=GaussianNB()

# GaussianNB does not accept sparse input, so convert the TF-IDF matrices to dense arrays before fitting
x_train_vectors_dense = x_train_vectors.toarray()
x_test_vectors_dense  = x_test_vectors.toarray()
clf_nb.fit(x_train_vectors_dense, y_train)

In [11]:
# Prediction testing on a random vector, using Naive Bayes (dense rows, kept 2-D)
print(x_test[69],'\n',"True Label =",y_test[69])
print("\n Predicted Label = ",clf_nb.predict(x_test_vectors_dense[[69]]),'\n\n')

print(x_test[88],'\n',"True Label =",y_test[88])
print("\n Predicted Label = ",clf_nb.predict(x_test_vectors_dense[[88]]))

This odd and pretentious novel is based on the true case of an innocent man who falsely confessed to a series of homicides. The nation is on edge in the wake of a series of mysterious disappearances. The targets, all older, solitary sorts, vanish and their presumed abductor leaves nary a clue but for a marked playing card. Oda Sotatsu is a young man living a life both unfulfilling and uninteresting. That is until he meets a troublesome couple, the supposedly charismatic Sato Kakuzo and his girlfriend, the alluring,Jito Joo. Clearly disturbed, they play games and place wagers where the loser has to physically harm himself. They attach to Oda, inducing him into  a wager after plying him with alcohol. After losing the game, he signs a detailed confession admitting culpability in the disappearances. Joo delivers the confession to the police and Oda is soon arrested, imprisoned, abused, tried and convicted. He is subsequently sentenced to death by hanging and executed, remaining silent thro

### Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression

# object creation
clf_log = LogisticRegression(max_iter=1000)

# Fit the logistic regression model on the training vectors
clf_log.fit(x_train_vectors,y_train)

In [13]:
# Prediction testing on a random vector, using logistic regression
print(x_test[35],'\n',"True Label =",y_test[35])
print("\n Predicted Label = ",clf_log.predict(x_test_vectors[35]),'\n\n')

print(x_test[99],'\n',"True Label =",y_test[99])
print("\n Predicted Label = ",clf_log.predict(x_test_vectors[99]))

I really have enjoyed this series and can't wait for the next one to come out. I especially liked that the author gave Kid and Jason their own stories, even though they occurred at the same time. She could have easily put them both into one book and jumped between locations. I am so glad she didn't.The last two installments have not had as much sensuality as the previous 3 books, but the story line is there, so you don't necessarily miss it. I do like how she has not just dropped the other characters in sacrifice to the current beau.This is a great series. 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE'] 


This book is outstanding. The format is user friendly and it covers the core with lots of examples. It is well worth the purchase. 
 True Label = POSITIVE

 Predicted Label =  ['POSITIVE']


## **Evaluation**

In [14]:
# Mean accuracy on the given test data and labels
print("Mean Accuracy:\n")
print("Support Vector Machine score = ",clf_svm.score(x_test_vectors,y_test),'\n')
print("Decision Tree score = ",clf_dec.score(x_test_vectors,y_test),'\n')
print("Naive Bayes score = ",clf_nb.score(x_test_vectors_dense,y_test),'\n')
print("Logistic regression score = ",clf_log.score(x_test_vectors,y_test),'\n')

Mean Accuracy:

Support Vector Machine score =  0.8197115384615384 

Decision Tree score =  0.6706730769230769 

Naive Bayes score =  0.6610576923076923 

Logistic regression score =  0.8052884615384616 



In [15]:
# F1 Scores:
from sklearn.metrics import f1_score
print("Support Vector Machine F1 score = ",f1_score(y_test,clf_svm.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Decision Tree F1 score = ",f1_score(y_test,clf_dec.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Naive Bayes F1 score = ",f1_score(y_test,clf_nb.predict(x_test_vectors_dense),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')
print("Logistic regression F1 score = ",f1_score(y_test,clf_log.predict(x_test_vectors),average=None,labels=[Sentiment.POSITIVE,Sentiment.NEGATIVE]),'\n')

Support Vector Machine F1 score =  [0.82269504 0.81662592] 

Decision Tree F1 score =  [0.66666667 0.67458432] 

Naive Bayes F1 score =  [0.65693431 0.66508314] 

Logistic regression F1 score =  [0.80291971 0.80760095] 
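
With `average=None`, `f1_score` returns one value per class in the order given by `labels`, so each array above reads as [POSITIVE, NEGATIVE]. As a worked illustration, the sketch below computes accuracy and per-class F1 by hand on a small made-up set of predictions. `per_class_f1` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one label: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration only
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_pos = per_class_f1(y_true, y_pred, "POSITIVE")
f1_neg = per_class_f1(y_true, y_pred, "NEGATIVE")
print(round(accuracy, 3), round(f1_pos, 3), round(f1_neg, 3))  # prints: 0.667 0.667 0.667
```

Because the training and test sets were balanced beforehand, accuracy and the two per-class F1 scores stay close to each other, which matches the pattern in the results above.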



### Optimization using GridSearch (cross validation)

In [16]:
from sklearn.model_selection import GridSearchCV
abc_svc=svm.SVC(kernel='rbf',C=4.0)
abc_svc.fit(x_train_vectors,y_train)

print(abc_svc.score(x_test_vectors,y_test))

0.8197115384615384


In [17]:
kernel=['linear','rbf','poly']
for i in kernel:
    model=svm.SVC(kernel=i,C=4.0)
    model.fit(x_train_vectors,y_train)
    print('For kernel',i)
    print("Accuracy is: ",model.score(x_test_vectors,y_test))

For kernel linear
Accuracy is:  0.8052884615384616
For kernel rbf
Accuracy is:  0.8197115384615384
For kernel poly
Accuracy is:  0.7716346153846154


In [18]:
# Sweep polynomial degrees 0 through 9 and compare training and testing accuracy
for i in range(0,10):
    model=svm.SVC(kernel='poly',degree=i,C=100)
    model.fit(x_train_vectors,y_train)
    print("Accuracy on training data: ",model.score(x_train_vectors,y_train))
    print("Accuracy on testing data: ",model.score(x_test_vectors,y_test))
    

Accuracy on training data:  0.5
Accuracy on testing data:  0.5
Accuracy on training data:  1.0
Accuracy on testing data:  0.8028846153846154
Accuracy on training data:  1.0
Accuracy on testing data:  0.8197115384615384
Accuracy on training data:  1.0
Accuracy on testing data:  0.7716346153846154
Accuracy on training data:  1.0
Accuracy on testing data:  0.7067307692307693
Accuracy on training data:  1.0
Accuracy on testing data:  0.6658653846153846
Accuracy on training data:  1.0
Accuracy on testing data:  0.6514423076923077
Accuracy on training data:  1.0
Accuracy on testing data:  0.6298076923076923
Accuracy on training data:  1.0
Accuracy on testing data:  0.6201923076923077
Accuracy on training data:  1.0
Accuracy on testing data:  0.6201923076923077
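
The loops above tune `kernel` and `degree` by hand and score each candidate on the test set. A cross-validated search with `GridSearchCV` could look like the sketch below. The toy reviews, the parameter grid, and `cv=3` are assumptions chosen so the example runs on its own; in the notebook the search would be fitted on `x_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in data (an assumption, not the Amazon reviews)
texts = [
    "loved this wonderful book", "great story and great characters",
    "an amazing and thrilling read", "beautifully written, highly recommend",
    "wonderful plot, loved every page", "great book, amazing ending",
    "boring and far too long", "terrible writing, could not finish",
    "a dull and boring story", "waste of time, awful plot",
    "awful characters and terrible ending", "dull, slow and disappointing",
]
labels = ["POSITIVE"] * 6 + ["NEGATIVE"] * 6

# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed with the "svc__" prefix
pipeline = make_pipeline(TfidfVectorizer(), SVC())
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [1, 4, 16],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so no vocabulary leaks from the held-out fold, and the test set stays untouched until the final evaluation.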


## **Qualitative Testing** 

In [19]:
test_review=["Such a boring read,couldn't finish!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


In [20]:
test_review=["Had an amazing experience reading this thrilling book!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['POSITIVE']


In [21]:
test_review=["What the hell was that!!"]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


In [22]:
test_review=[" This is wonderful book, inspiring and wise. My uncle was taken by the Nazis and was almost dead due to typhus when the camp at Dachau was liberated. He was discovered in a heap of bodies by a doctor who noticed a flicker of his eyelids. He was taken to hospital in Budapest and survived until 1967. This book gave me an insight into what he must have suffered. He never complained was always cheerful and full of mischief. The second half of the book about logotherapy is also very interesting and worth reading. "]
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['POSITIVE']


In [23]:
test_review=['Too many technical terms in physchology domain makes reading very uncomfortable. Not much of a self help book as tagged.']
new_test=vectorizer.transform(test_review)
print(clf_svm.predict(new_test))

['NEGATIVE']


## Saving Model

In [24]:
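import os
import pickle

# Create the output directory if it does not exist yet
os.makedirs('./models', exist_ok=True)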

with open('./models/sentiment_class_svm.pkl', 'wb') as f:
    pickle.dump(clf_svm, f)
    
with open('./models/sentiment_class_svm_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
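
Loading the saved objects back is the mirror image of saving them. The sketch below shows a round trip under stated assumptions: a tiny stand-in model and a temporary directory replace `clf_svm`, `vectorizer`, and the `./models/` paths, and the file names are placeholders.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny stand-in model (an assumption; the notebook saves clf_svm and vectorizer)
texts = ["loved this book", "great read", "boring and dull", "terrible plot"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
vectorizer = TfidfVectorizer()
clf = SVC(kernel="linear").fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as model_dir:
    model_path = os.path.join(model_dir, "model.pkl")
    vectorizer_path = os.path.join(model_dir, "vectorizer.pkl")

    # Save both objects: the classifier is useless without its fitted vectorizer
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)
    with open(vectorizer_path, "wb") as f:
        pickle.dump(vectorizer, f)

    # Load them back from disk
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)
    with open(vectorizer_path, "rb") as f:
        loaded_vectorizer = pickle.load(f)

new_texts = ["a boring book", "a great book"]
before = list(clf.predict(vectorizer.transform(new_texts)))
after = list(loaded_clf.predict(loaded_vectorizer.transform(new_texts)))
print(before == after)  # prints: True
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code, and load them with the same scikit-learn version that produced them where possible.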

# **Thank You 😄**