<h1>Sentiment analysis on Amazon product reviews</h1>

<h6>In this notebook I build a real-world Python machine learning project using the scikit-learn library. The model automatically classifies text as having either a positive or a negative sentiment. I use Amazon reviews as the training data and compare multiple machine learning algorithms for classifying them. After finding the best model, I store it in a pkl file so that it doesn't need to be retrained.</h6>

<p>For this analysis, I took a dataset of 10,000 user reviews and trimmed it to two columns: the review text and the product rating.</p>

<p>The enumeration class below stores the sentiment types. This analysis uses three: negative, neutral and positive.</p>

In [1]:
import random
class Sentiment:
    Negative="Negative"
    Neutral="Neutral"
    Positive="Positive"
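
The class above stores the three labels as plain string attributes. As an alternative sketch (not what this notebook uses), Python's standard `enum` module provides a true enumeration that can be iterated and looked up by value. The name `SentimentEnum` is hypothetical and chosen only for this illustration.

```python
from enum import Enum


class SentimentEnum(str, Enum):
    # Hypothetical alternative to the notebook's Sentiment class.
    # Mixing in str keeps comparisons with plain strings working.
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"
    POSITIVE = "Positive"


# Members can be listed in definition order and looked up by value.
labels = [member.value for member in SentimentEnum]
print(labels)
print(SentimentEnum("Positive") is SentimentEnum.POSITIVE)
```

The rest of the notebook keeps the plain-attribute version, which is enough for passing labels to scikit-learn.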


<p>Since the ratings range from 1 to 5, I convert them into three categorical values (positive, negative and neutral) using the get_sentiment method of the class below.</p>

In [2]:
class Review:
    def __init__(self,text,score):
        self.text=text
        self.score=score
        self.sentiment=self.get_sentiment()
    def get_sentiment(self):
        if self.score<=2:
            return Sentiment.Negative
        elif self.score==3:
            return Sentiment.Neutral
        else:
            return Sentiment.Positive
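
To see how the thresholds behave, here is a minimal standalone sketch that applies the same rule to a handful of made-up ratings and counts the resulting labels. The function name `score_to_sentiment` and the sample ratings are assumptions for illustration only.

```python
from collections import Counter


def score_to_sentiment(score):
    # Same thresholds as Review.get_sentiment in the notebook.
    if score <= 2:
        return "Negative"
    elif score == 3:
        return "Neutral"
    return "Positive"


sample_scores = [5.0, 4.0, 1.0, 3.0, 5.0, 2.0]  # made-up ratings
sentiments = [score_to_sentiment(s) for s in sample_scores]

# Count how many reviews fall into each class.
distribution = Counter(sentiments)
print(distribution)
```

Running the same count over the real reviews is a quick way to check how unbalanced the classes are.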

<h5>The training data is heavily biased towards positive reviews: about 85% of the reviews are positive and fewer than 15% are negative.</h5>


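<p>The container class below returns the review texts and the sentiment labels as separate lists. Its evenly_distribute method balances the data by keeping all negative reviews and an equal number of positive ones.</p>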

In [3]:
class ReviewContainer:
    def __init__(self,reviews):
        self.reviews=reviews
    def get_text(self):
        return [x.text for x in self.reviews]
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.Negative, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.Positive, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
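
The balancing step is a form of undersampling. The sketch below reproduces the idea on made-up data. One difference is an assumption of this sketch, not the notebook's method: it draws the positive reviews at random with `random.sample`, whereas `evenly_distribute` takes the first ones in the list.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up (id, label) pairs: 8 positive, 2 negative, 1 neutral.
toy = [(i, "Positive") for i in range(8)]
toy += [(8, "Negative"), (9, "Negative"), (10, "Neutral")]

negative = [r for r in toy if r[1] == "Negative"]
positive = [r for r in toy if r[1] == "Positive"]

# Assumption: sample positives at random instead of slicing the first ones.
positive_sample = random.sample(positive, len(negative))
balanced = negative + positive_sample
random.shuffle(balanced)

print([label for _, label in balanced])
```

Either way, neutral reviews are dropped, and the result has as many positive reviews as negative ones.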

<p>Loading the dataset from the file, one JSON object per line, and storing it in the list 'reviews'</p>

In [4]:
import json
file_name='books_small_10000.json'
reviews=[]
with open(file_name) as f:
    for line in f:
        review=json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))
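
The loop above reads the file one line at a time because each line holds its own JSON object. A minimal sketch of that pattern follows, using an in-memory file with two made-up records in place of the real dataset.

```python
import io
import json

# Two made-up records in the same one-object-per-line layout.
fake_file = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

rows = []
for line in fake_file:
    record = json.loads(line)  # parse one line as one JSON object
    rows.append((record["reviewText"], record["overall"]))

print(rows)
```

Calling `json.load` on the whole file would fail here, since the file as a whole is not a single JSON document.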

<p>Creating sets for training and test data</p>

In [5]:

from sklearn.model_selection import train_test_split
training,test=train_test_split(reviews,test_size=0.33,random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()
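
As a small illustration of the split itself, the sketch below runs `train_test_split` twice on a made-up list of integers standing in for the reviews. It shows that a fixed `random_state` makes the split repeatable.

```python
from sklearn.model_selection import train_test_split

items = list(range(100))  # stand-in for the list of Review objects

train_a, test_a = train_test_split(items, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(items, test_size=0.33, random_state=42)

print(len(train_a), len(test_a))
# The same random_state gives the same split on every run.
print(train_a == train_b and test_a == test_b)
```

Note that the notebook balances the test set as well as the training set, so the reported scores describe performance on evenly distributed data rather than on the original class mix.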


<p>Tokenizing the text and converting it to TF-IDF vectors with sklearn</p>

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text only
train_x_vectors = vectorizer.fit_transform(train_x)
# Reuse the fitted vocabulary for the test text
test_x_vectors = vectorizer.transform(test_x)
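
To make the vectorization step concrete, here is a minimal sketch on three made-up reviews. It shows the shape of the resulting matrix, the learned vocabulary, and what happens to a word that was not seen during fitting. The name `toy_vectorizer` is used only in this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["great book", "terrible book", "great great story"]  # made-up reviews

toy_vectorizer = TfidfVectorizer()
matrix = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct word.
print(matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# A word absent from the fitted vocabulary is ignored at transform time.
unseen = toy_vectorizer.transform(["awful"])
print(unseen.nnz)
```

This is why the vectorizer is fitted on the training text only and then reused, unchanged, for the test text.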

In [None]:

# sum_words = train_x_vectors.sum(axis=0) 
# words_freq = [(word, sum_words[0, idx]) for word, idx in     vectorizer.vocabulary_.items()]
# words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
# for i in range(0,len(words_freq)):
#     print(words_freq[i] )

## Classification of data with multiple algorithms


#### SVM algorithm

In [None]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

# Predict the whole test set in one call, then count the matches
predictions = clf_svm.predict(test_x_vectors)
count = sum(p == y for p, y in zip(predictions, test_y))
print(f'Correctly predicted {count} times.')
print(f'Accuracy is {count*100//len(test_x)} %')
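
The manual count above computes the same quantity as the classifier's built-in `score` method, which returns the mean accuracy. The sketch below checks this on a tiny made-up dataset. It scores on the training data purely to show the equivalence, which is not a valid way to evaluate a model.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["great book", "loved it", "terrible book", "hated it"]  # made-up
labels = ["Positive", "Positive", "Negative", "Negative"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
model = svm.SVC(kernel="linear").fit(X, labels)

# Manual accuracy: fraction of predictions that match the labels.
predictions = model.predict(X)
manual_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Built-in accuracy from the estimator.
builtin_accuracy = model.score(X, labels)
print(manual_accuracy, builtin_accuracy)
```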
    

#### Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

# Predict the whole test set in one call, then count the matches
predictions = clf_dec.predict(test_x_vectors)
count = sum(p == y for p, y in zip(predictions, test_y))
print(f'Correctly predicted {count} times.')
print(f'Accuracy is {count*100//len(test_x)} %')

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

# Predict the whole test set in one call, then count the matches
predictions = clf_log.predict(test_x_vectors)
count = sum(p == y for p, y in zip(predictions, test_y))
print(f'Correctly predicted {count} times.')
print(f'Accuracy is {count*100//len(test_x)} %')

## Evaluation

In [None]:
#mean accuracy
print(clf_svm.score(test_x_vectors,test_y))
print(clf_dec.score(test_x_vectors,test_y))
print(clf_log.score(test_x_vectors,test_y))

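<p>For now, let's skip neutral reviews, as they express an average opinion. The evenly_distribute step has already removed them from the training and test sets.</p>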

###### Finding the F1 score for positive and negative reviews using each model

In [None]:
from sklearn.metrics import f1_score
f1_score(test_y,clf_svm.predict(test_x_vectors),average=None,labels=[Sentiment.Positive,Sentiment.Negative])


In [None]:

f1_score(test_y,clf_dec.predict(test_x_vectors),average=None,labels=[Sentiment.Positive,Sentiment.Negative])


In [None]:

f1_score(test_y,clf_log.predict(test_x_vectors),average=None,labels=[Sentiment.Positive,Sentiment.Negative])
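
For reference, the per-class F1 score is the harmonic mean of precision and recall for that class. The following sketch computes it by hand on made-up labels. The helper `f1_for_label` is hypothetical and written only for this illustration.

```python
def f1_for_label(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one label.
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["Positive", "Positive", "Negative", "Negative"]  # made-up labels
y_pred = ["Positive", "Negative", "Negative", "Negative"]

f1_positive = f1_for_label(y_true, y_pred, "Positive")
f1_negative = f1_for_label(y_true, y_pred, "Negative")
print(f1_positive, f1_negative)
```

With `average=None`, scikit-learn's `f1_score` returns one such value per entry in `labels`, in the order given.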

###### <p>Since the F1 score for the SVM is the highest, let's check whether a linear or an RBF kernel works better</p>

In [None]:
from sklearn.model_selection import GridSearchCV
parameters={'kernel':('linear','rbf'),'C':(1,4,8,16,32)}
svc=svm.SVC()
clf=GridSearchCV(svc,parameters,cv=5)
clf.fit(train_x_vectors,train_y)
print(clf.score(test_x_vectors,test_y))
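
After fitting, a `GridSearchCV` object also records which parameter combination won, in `best_params_`, along with its mean cross-validated score in `best_score_`. The sketch below shows this on synthetic numeric data that stands in for the TF-IDF vectors. The data and the reduced grid are assumptions for illustration.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic numeric data stands in for the TF-IDF vectors.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

param_grid = {"kernel": ("linear", "rbf"), "C": (1, 4)}
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)           # winning kernel and C
print(round(search.best_score_, 3))  # mean cross-validated accuracy
```

Inspecting `best_params_` answers the linear-versus-RBF question directly, instead of inferring it from the test score.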

Classification using an RBF SVM

In [None]:

from sklearn import svm

clf_svmr = svm.SVC(kernel='rbf')
clf_svmr.fit(train_x_vectors, train_y)

# Check accuracy on the test data: predict once, then count the matches
predictions = clf_svmr.predict(test_x_vectors)
count = sum(p == y for p, y in zip(predictions, test_y))
print(f'Correctly predicted {count} times.')
print(f'Accuracy is {count*100//len(test_x)} %')
  

The RBF kernel increases accuracy by 1%.

In [None]:

print(clf_svmr.score(test_x_vectors,test_y))
f1_score(test_y,clf_svmr.predict(test_x_vectors),average=None,labels=[Sentiment.Positive,Sentiment.Negative])


###### The RBF SVM model gives the highest accuracy, so we save it with pickle so that we don't need to train it again

In [None]:
import pickle
with open('sentiment_classifier.pkl','wb') as f:
    pickle.dump(clf_svmr,f)

###### Loading the saved, trained model

In [None]:
with open('sentiment_classifier.pkl','rb') as f:
    loaded_clf=pickle.load(f)
  

In [None]:
#check if it's working
loaded_clf.predict(test_x_vectors[0])
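
One caveat: the pickled classifier expects TF-IDF vectors, so classifying new raw text later also requires the fitted vectorizer. A possible approach, sketched here as an assumption rather than as what this notebook does, is to wrap both steps in a scikit-learn pipeline and pickle that single object. The training texts are made up.

```python
import pickle

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = ["great book", "loved this story", "terrible book", "hated this story"]
labels = ["Positive", "Positive", "Negative", "Negative"]  # made-up data

# Vectorizer and classifier travel together as one object.
pipeline = make_pipeline(TfidfVectorizer(), svm.SVC(kernel="rbf"))
pipeline.fit(texts, labels)

# Round-trip through pickle in memory instead of a file.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

new_reviews = ["great story", "terrible story"]
print(restored.predict(new_reviews))  # raw text in, labels out
```

The restored pipeline accepts raw strings directly, so no separately saved vectorizer is needed at prediction time.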