In [115]:
import random
'''
    Enums Class
'''
class SentimentEnums:
    NEGATIVE = 'NEGATIVE'
    POSITIVE = 'POSITIVE'
    NEUTRAL = 'NEUTRAL'

'''
    Data Class
'''

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return SentimentEnums.NEGATIVE
        elif self.score == 3:
            return SentimentEnums.NEUTRAL
        else: #Score of 4 or 5
            return SentimentEnums.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
        
    def evenly_distribute(self):
#         look at all the reviews and mapp every sentiment that are negative
        negative = list(filter(lambda x: x.sentiment == SentimentEnums.NEGATIVE, self.reviews))
#         print(negative)
        positive = list(filter(lambda x: x.sentiment == SentimentEnums.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews) #shuffle so that we don't know if we will be having positive or negative next so as to balance the data... this is simply data preparation and preprocessing
        print(negative[0].text)
        print(len(negative))
        print(len(positive))
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
        

In [101]:
import json

file_name = './book_small_bigdata.json'

reviews = []

with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
reviews[5].text

'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'

In [80]:
len(reviews)

10000

## Data Preparation

In [116]:
from sklearn.model_selection import train_test_split

# creates feature encoding on the data inform of bag of words since ML models doesn't work well when dealing with texts so we need to represent each reviews in a bag of words --> one hot encoding
training, test = train_test_split(reviews, test_size=0.33, random_state=42) # using 33% in test
train_container = ReviewContainer(training)
test_container = ReviewContainer(test) # shows us that we have 436 pieces of text and 5611 positives
len(train_container.reviews)

6700

In [82]:
#len(test) #len(training)
print(training[0].sentiment)

POSITIVE


In [129]:
# pass the training set into our bag of words/one hot encoding
# this represent our test dataset which is 67% of of the entire 1000 reviews we have
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

# this represent our test dataset which is 33% of of the entire 1000 reviews we have
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

test_container.evenly_distribute()

train_y.count(SentimentEnums.POSITIVE)
train_y.count(SentimentEnums.NEGATIVE)

print(train_y.count(SentimentEnums.POSITIVE))
print(train_y.count(SentimentEnums.NEGATIVE))

train_x[0]  #display the text
train_y[0]  #display the sentiment

# test_x[0]


It was just one of those books that never went anywhere. I like books that get your attention in the beginning and not drag out until a quarter way through. I decided to give it an early death - delete!
436
5611
Story is very inaccurate with modern words, phrases and actions.  In the second chapter the author has the bagpipes playing "Amazing Grace" and according to her it is a song as old as time.  As someone who learned to play Amazing Grace on the piano I can state for a fact the song is not old as time. It was not even published until 1779; author has the book set in 1714. 65 years before John Newton wrote and published the songFiona and Juliet speak like they are in the 21 century. Not a young miss in the early 18th century.I have no problem reading about God in books. My problem is when authors take too much leeway and write using modern phrases in historical books.Really, wondering if this author did any 'real' research or just used what she remembered from high school world his

'NEGATIVE'

## Bag of words/ One hot encoding vectorization

In [146]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# This book is great!
# This book was so bad
vectorizer = TfidfVectorizer()
# fix the vectorizer into our training data and then transform it using the fit_transform method on the training data u could use vectorizer.fit(train_x) and train_x_vectors = vectorizer.transform(train_x) but this would is a long process
train_x_vectors = vectorizer.fit_transform(train_x) # <670(rows)x7372(columns) sparse matrix of typ
print(train_x[0]) # Review in text -> Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
print(train_x_vectors[0].toarray()) # Feature encoding result of the above text -> [[0 0 0 ... 0 0 0]]

# train_x_vectors
# train_y

Too bogged down with details that don't matter.  Not his best work.  But..... It it's a good story, with a great ending.
[[0. 0. 0. ... 0. 0. 0.]]


## Classification

### Linear Support Vector Machine (SVM)

In [147]:
from sklearn import svm
classifier_svm = svm.SVC(kernel='linear') #returns a classifier
classifier_svm.fit(train_x_vectors, train_y)
test_x_vectors = vectorizer.transform(test_x)
test_x[0]
classifier_svm.predict(test_x_vectors[0])
# classifier_svm.predict()

array(['POSITIVE'], dtype='<U8')

#### Decision Tree

In [148]:
from sklearn.tree import DecisionTreeClassifier
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
# clf_dec.predict(test_x_vectors[0])

DecisionTreeClassifier()

#### Naive Bayes

In [149]:
from sklearn.naive_bayes import GaussianNB
clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(train_x_vectors, train_y)
clf_gnb.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [150]:
from sklearn.linear_model import LogisticRegression
clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

## Evaluation

In [151]:
# Mean Accuracy
print(classifier_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y)) #checking the accuracy of our model
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7412121212121212
0.566969696969697
0.546060606060606
0.7333333333333333


In [152]:
# F1 Scores -> data scientists watch for this
from sklearn.metrics import f1_score
f1_score(test_y, classifier_svm.predict(test_x_vectors), average=None, labels=[SentimentEnums.POSITIVE, SentimentEnums.NEGATIVE])# array([0.91319444, 0.21052632, 0.22222222]) this means our model for f1score is very good for positive but trash for neutral and negative labels
# f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[SentimentEnums.POSITIVE, SentimentEnums.NEUTRAL, SentimentEnums.NEGATIVE])# array([0.86971831, 0.13793103, 0.05882353]) this means our model for f1score is very good for positive but trash for neutral and negative labels
# f1_score(test_y, clf_gnb.predict(test_x_vectors), average=None, labels=[SentimentEnums.POSITIVE, SentimentEnums.NEUTRAL, SentimentEnums.NEGATIVE])# array([0.86619718, 0.1       , 0.        ]) this means our model for f1score is very good for positive but trash for neutral and negative labels

# Basically our model is bad and biased because it only predicts positive sentiments alone and bad at other labels


array([0.87656461, 0.3142329 ])

In [153]:
train_y[0:5] # we have alot of positive labels  in our first five training set this means that since our model keeps doing too well on the positive label its because of the kind of dataset we've got ['POSITIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE']
train_y.count(SentimentEnums.POSITIVE) # 552 -> meaning, we've got 552 positive result set. Rememmber that we had 670 rows labels in total and 552 of them were positive this implies that our model will definitely be heavily biased! No matter the ML Library used to predict it either Decision tree or SVM. e.t.c
train_y.count(SentimentEnums.NEGATIVE) # 47 NEGATIVE label -> so the rest labels (neutral) will be the remaining numbers 

436

In [154]:
test_y.count(SentimentEnums.POSITIVE)

2767

In [155]:
test_test = ['very fun', "bad book do not buy", 'horrible waste of time']
new_test = vectorizer.transform(test_test)
classifier_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

### Tuning our model (with Grid Search)

In [156]:
from sklearn.model_selection import GridSearchCV

# checks the linear and rbf kernel and compare with the C values
parameters = {'kernel' : ('linear', 'rbf'), 'C' : (1,4,8,16,32)} 
svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [157]:
print(clf.score(test_x_vectors, test_y))

0.7493939393939394


## Saving Model

In [158]:
import pickle 

with open('./models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f) #dumps all parameter in the clf model into our sentiment file we are about to create

### Load Model

In [159]:
with open('./models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [160]:
print(test_x[0])
loaded_clf.predict(test_x_vectors[0])

was sent an Arc of this book for an honest review and here it is = This is the kind of book that you want to read while sitting in front of the fire with a cup of hot apple cider and a blanket over your legs.I have read many of Jaci Burton's books and have never been disappointed. This first book in her new Hope series does not disappoint either.This is the story of Emma, a new vet who has come back home to open her own practice and Luke McCormack, a police officer in the same town.Both have been previously burned by love so both have issues but, that doesn't stop them from acting on that attraction.This book pulls you in from the first page, wraps you up and doesn't let you go until the end.I loved it!


array(['POSITIVE'], dtype='<U8')