# SCIKIT PYTHON MACHINE LEARNING

+ Data Class 

In [97]:
import random

class Sentiment:
  NEGATIVE="NEGATIVE"
  NEUTRAL="NEUTRAL"
  POSITIVE="POSITIVE"

class Review:
  def __init__(self, text, score):
    self.text=text
    self.score=score
    self.sentiment=self.get_sentiment()

  def get_sentiment(self):
    if self.score<=2:
      return Sentiment.NEGATIVE
    elif self.score==3:
      return Sentiment.NEUTRAL
    else: # Score of 4 or 5
      return Sentiment.POSITIVE

class ReviewContainer:
  def __init__(self, reviews):
    self.reviews=reviews

  def get_text(self):
    return [x.text for x in self.reviews]

  def get_sentiment(self):
    return [x.sentiment for x in self.reviews]

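  def evenly_distribute(self):
    # Balance the classes: keep every negative review and an equal number of positives.
    # Neutral reviews are dropped. This assumes positives are at least as numerous as negatives.
    negative=list(filter(lambda x: x.sentiment==Sentiment.NEGATIVE, self.reviews))
    positive=list(filter(lambda x: x.sentiment==Sentiment.POSITIVE, self.reviews))
    positive_shrunk=positive[:len(negative)]
    self.reviews=negative+positive_shrunk
    random.shuffle(self.reviews)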

## Load Data

In [96]:
import json

file_name='./Books_Data_10000.json'

# Array to store the reviews
reviews=[]

# The with statement closes the file automatically; the for loop reads it line by line (one JSON object per line)
with open(file_name) as f:
  for line in f:
    # print(line)
    # json.loads() parses a JSON-formatted string into a Python object
    review=json.loads(line)
    # print(f"Reviewer Text: {review['reviewText']}")
    # print(f"Overall: {review['overall']}")
    reviews.append(Review(review['reviewText'], review['overall']))

# A Review object holds the review text, the overall score, and the derived sentiment
# print(reviews[7])

# Review text only
print(reviews[7].text)

# Overall score only
print(reviews[7].score)

# Sentiment derived from the overall score
print(reviews[7].sentiment)


This is the First book in the Trilogy, and I'm looking forward to reading the second book.  I liked how the main characters interacted with famous characters in western history.
5.0
POSITIVE
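
The loader above relies on the file holding one JSON object per line. As a minimal sketch of that parsing step, the block below reads two made-up records from an in-memory string instead of the real file; the record contents are hypothetical, and only the `reviewText` and `overall` keys come from the notebook.

```python
import io
import json

# Two made-up records standing in for lines of the data file (hypothetical data).
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

parsed = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines, which json.loads() would reject
        continue
    record = json.loads(line)
    parsed.append((record["reviewText"], record["overall"]))

print(parsed)
```

Skipping blank lines is an extra guard added here for illustration; the original loop does not need it if the file has no empty lines.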


## Prepare Data

In [95]:
# len(reviews)
from sklearn.model_selection import train_test_split

# Split the reviews into training (67%) and test (33%) sets; random_state makes the split reproducible
training,test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container=ReviewContainer(training)
test_container=ReviewContainer(test)


In [65]:
# The data that will be used for testing
print(len(test))

# The data that will be used for training
print(len(training))

3300
6700


In [66]:
print(training[0].text)
print(training[0].score)
print(training[0].sentiment)

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

In [108]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436
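
The equal counts come from `evenly_distribute()`, which truncates the positive reviews to the number of negative ones and drops neutral reviews. A rough standalone sketch of the same idea on toy data follows. The `balance` helper, the `(label, item)` tuples, and the `seed` parameter are illustrative assumptions, not part of the notebook; unlike the original method, this version uses `min()` so it also copes with negatives outnumbering positives, and it seeds the shuffle so reruns give the same order.

```python
import random

def balance(items, seed=0):
    """Keep an equal number of NEGATIVE and POSITIVE items; drop everything else."""
    negative = [x for x in items if x[0] == "NEGATIVE"]
    positive = [x for x in items if x[0] == "POSITIVE"]
    n = min(len(negative), len(positive))  # size of the smaller class
    balanced = negative[:n] + positive[:n]
    random.Random(seed).shuffle(balanced)  # seeded, so the order is reproducible
    return balanced

# Toy data (hypothetical): 5 positive, 2 negative, 1 neutral
toy = (
    [("POSITIVE", i) for i in range(5)]
    + [("NEGATIVE", i) for i in range(2)]
    + [("NEUTRAL", 0)]
)

balanced = balance(toy)
labels = [label for label, _ in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"), labels.count("NEUTRAL"))
```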


+ Bag of words vectorization

In [142]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

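# Examples of the kind of text being vectorized:
#   This book is great!
#   This book was so bad

# TF-IDF weights each word count by how rare the word is across the reviews
vectorizer = TfidfVectorizer()
# Learn the vocabulary from the training text and vectorize it
train_x_vectors = vectorizer.fit_transform(train_x)

# Reuse the fitted vocabulary for the test text (transform, not fit_transform)
test_x_vectors=vectorizer.transform(test_x)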

print(train_x[0])
print(train_x_vectors[0].toarray())

Soup..Er..Myrtle. Ms. Myrlte was talking to her dog Mtlock when Bettie Easton called. She called about the M.E.L.O.N.S (the letters stand for Mature Elegant Ladies Open Nice Suggestion) The first time she told me about it, I said it made us sound like old hookers.Bettie had been right about one thing.Doris Phillips met me at the door of the Soup kitchen just tickld pink to have a little help.Myrtle had discovered a identity theft ring but did not know who was doing it.
[[0. 0. 0. ... 0. 0. 0.]]
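
The real vectors are too wide to read, so here is a small sketch on the two example sentences from the comments in the cell above. It compares plain bag-of-words counts (`CountVectorizer`) with the TF-IDF weighting the notebook actually uses. The variable names are illustrative, and the vocabulary is read from the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["This book is great!", "This book was so bad"]

# Plain bag of words: one column per word, values are raw counts
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# Column order of the matrix, recovered from the word -> index mapping
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
# ['bad', 'book', 'great', 'is', 'so', 'this', 'was']
print(counts.toarray())
# [[0 1 1 1 0 1 0]
#  [1 1 0 0 1 1 1]]

# TF-IDF: same columns, but words shared by both sentences get lower weight
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))
```

Words such as "this" and "book" appear in both sentences, so TF-IDF gives them less weight than distinguishing words such as "great" or "bad".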


## Classification

+ Linear SVM


In [143]:
from sklearn import svm

# Linear Kernel
clf_svm=svm.SVC(kernel='linear')

# Fit the data
clf_svm.fit(train_x_vectors, train_y)

# Predict the sentiment of the first test review
clf_svm.predict(test_x_vectors[0])


array(['POSITIVE'], dtype='<U8')

+ Decision Tree

In [144]:
from sklearn.tree import DecisionTreeClassifier

clf_dec=DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

+ Naive Bayes

In [145]:
from sklearn.naive_bayes import GaussianNB

clf_gnb=GaussianNB()
# GaussianNB does not accept sparse matrices, so convert the vectors to dense arrays
clf_gnb.fit(train_x_vectors.toarray(), train_y)

clf_gnb.predict(test_x_vectors[0].toarray())

array(['NEGATIVE'], dtype='<U8')

+ Logistic Regression

In [146]:
from sklearn.linear_model import LogisticRegression

clf_log=LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Evaluation

In [147]:
# Mean accuracy of each model on the balanced test set

# Linear SVM
eva_svm=clf_svm.score(test_x_vectors, test_y)

# Decision tree
eva_dec=clf_dec.score(test_x_vectors, test_y)

# Naive Bayes (GaussianNB needs a dense array)
eva_gnb=clf_gnb.score(test_x_vectors.toarray(), test_y)

# Logistic regression
eva_log=clf_log.score(test_x_vectors, test_y)

print(eva_svm)
print(eva_dec)
print(eva_gnb)
print(eva_log)

0.8076923076923077
0.6490384615384616
0.6610576923076923
0.8052884615384616


In [139]:
# F1 Scores 
from sklearn.metrics import f1_score

# Per-class F1 scores, returned in the order given by labels: [POSITIVE, NEGATIVE]
f1_svm=f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

f1_dec=f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

f1_gnb=f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

f1_log=f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

print(f1_svm)
print(f1_dec)
print(f1_gnb)
print(f1_log)

[0.8028169  0.79310345]
[0.64470588 0.62899263]
[0.59574468 0.66666667]
[0.82051282 0.808933  ]
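
Each pair above is the F1 score for POSITIVE followed by the F1 score for NEGATIVE. F1 is the harmonic mean of precision and recall for one class. The sketch below works that out by hand on a made-up set of six labels; `f1_for_label` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for a single class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: the model over-predicts POSITIVE
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]

f1_pos = f1_for_label(y_true, y_pred, "POSITIVE")  # precision 3/5, recall 3/3
f1_neg = f1_for_label(y_true, y_pred, "NEGATIVE")  # precision 1/1, recall 1/3
print(round(f1_pos, 2), round(f1_neg, 2))
# 0.75 0.5
```

Accuracy for this toy case is 4/6, yet the two classes score quite differently, which is why the per-class view is worth printing alongside the mean accuracy.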


In [140]:
print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))
print(test_y.count(Sentiment.NEUTRAL))

208
208
0


In [138]:
test_set=['I thoroughly enjoyed this, 5 stars', 'it was very bored', 'very fun']
new_test=vectorizer.transform(test_set)
clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'POSITIVE'], dtype='<U8')
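
New text has to go through the already fitted vectorizer with `transform()`, so that its columns line up with the ones the classifier was trained on. A minimal end-to-end sketch on a made-up four-sentence corpus shows this; all `toy_*` names and sentences are illustrative assumptions, and a model trained on so little data is not expected to predict well.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data (hypothetical)
toy_x = [
    "great book, loved it",
    "wonderful and fun read",
    "terrible, boring plot",
    "bad writing, very dull",
]
toy_y = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# Fit the vocabulary on the training text only
toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_x)

toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(toy_vectors, toy_y)

# transform() reuses the fitted vocabulary; unseen words are simply ignored
new_text = ["fun book", "completely unknown words"]
new_vectors = toy_vectorizer.transform(new_text)
predictions = toy_clf.predict(new_vectors)

print(toy_vectors.shape[1] == new_vectors.shape[1])  # same number of columns
print(new_vectors[1].nnz)  # number of non-zero entries for the all-unknown sentence
print(predictions)
```

The second sentence contains no word from the training vocabulary, so its vector is all zeros and the prediction for it carries no information about the text.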