### **Data Class**

In [16]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

# Positive reviews outnumber negative ones, so the container can rebalance the classes before training
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    def evenly_distribute(self):
        # Keep every negative review and an equal number of positive ones;
        # neutral reviews are dropped.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
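
The balancing step above keeps the first `len(negative)` positive reviews. As a minimal sketch of a variant, the hypothetical helper below (`undersample` is not part of the notebook) draws a random subset from every class instead and uses a seeded generator so the result is reproducible. Unlike `evenly_distribute`, it keeps every label it is given, so neutral items would need to be filtered out beforehand.

```python
import random
from collections import Counter


def undersample(items, label_of, seed=0):
    """Return a shuffled list in which every label appears equally often."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    by_label = {}
    for item in items:
        by_label.setdefault(label_of(item), []).append(item)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Random subset of each class rather than the first N items
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy (text, sentiment) pairs with a 4:2 imbalance
toy = [("a", "POSITIVE"), ("b", "POSITIVE"), ("c", "POSITIVE"),
       ("d", "POSITIVE"), ("e", "NEGATIVE"), ("f", "NEGATIVE")]
balanced = undersample(toy, label_of=lambda pair: pair[1])
counts = Counter(label for _, label in balanced)
print(counts["POSITIVE"], counts["NEGATIVE"])  # 2 2
```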

### **Load Data**

In [17]:
import json

file_name = './data/sentiment/Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))

reviews[0].text

"I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with."

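The loader reads the file one line at a time and parses each line as its own JSON object, which suggests a JSON Lines layout. The sketch below reproduces that loop on two made-up records held in memory; only the `reviewText` and `overall` keys come from the notebook, and the record contents are invented for illustration.

```python
import io
import json

# Two made-up records in the shape the loader reads: one JSON object per line
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

pairs = []
for line in sample:
    record = json.loads(line)
    pairs.append((record["reviewText"], record["overall"]))

print(pairs)  # [('Loved it', 5.0), ('Not for me', 2.0)]
```
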
### **Prep Data**

In [18]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(train)
test_container = ReviewContainer(test)

train_container.evenly_distribute()
X_train = train_container.get_text()
y_train = train_container.get_sentiment()

test_container.evenly_distribute()
X_test = test_container.get_text()
y_test = test_container.get_sentiment()

y_train.count(Sentiment.POSITIVE)

436

#### Bag of Words Vectorization (TF-IDF)

In [34]:
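from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TF-IDF weighting is used here; CountVectorizer() would give raw term counts instead
vectorizer = TfidfVectorizer()
# Learn the vocabulary from the training text only, then reuse it for the test text
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)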
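
The vocabulary is learned from the training text, and the test text is only transformed with it. A small sketch on a made-up corpus (the texts and the `toy_` names are illustrative, not from the notebook) shows the consequence: a word that never appeared during fitting contributes nothing to the test vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great book", "bad book", "great great read"]
test_texts = ["great story"]  # "story" never appears in train_texts

toy_vectorizer = TfidfVectorizer()
train_vectors = toy_vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_vectors = toy_vectorizer.transform(test_texts)        # reuses it unchanged

print(sorted(toy_vectorizer.vocabulary_))       # ['bad', 'book', 'great', 'read']
print(train_vectors.shape, test_vectors.shape)  # (3, 4) (1, 4)
print(test_vectors.nnz)                         # 1, only "great" is known
```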

### **Classification**

#### Linear SVM

In [35]:
from sklearn.svm import SVC

svm = SVC(kernel='linear')
svm.fit(X_train_vectors, y_train)

print(X_test[0])
print(svm.predict(X_test_vectors[0]))

In this book, James Rollins displays his skill at catering to carefree, undiscerning consumers. That is, people who think that if a movie or a book is not saturated with action, it is not worth their time. So, even when the story does not require an action sequence, Rollins inserts one. For discerning readers, the result is poisonous, agonizing pulp.Not only is the action excessive, much of it is implausible. The early boat chase on the Yangtze River? Not only can it be deleted without affecting the story, the end of the chase is silly. It is inappropriate for a serious novel, though fine for a comic book. Later on Rollins inserts a sequence reminiscent of John Wayne's prominent Rooster Cogburn sequence in TRUE GRIT. The difference is that Rooster Cogburn's is more realistic. And near the end of the book, we have a super-sandstorm that is miraculously cooperative. Its timing is too perfect. It seems to tell the characters, "Tell me when I should arrive, to make your story suspenseful. 

#### Decision Tree

In [36]:
from sklearn.tree import DecisionTreeClassifier

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_vectors, y_train)

decision_tree.predict(X_test_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [37]:
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB, MultinomialNB

# GaussianNB needs a dense array, so the sparse TF-IDF matrix is converted with .toarray()
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train_vectors.toarray(), y_train)

gaussian_nb.predict(X_test_vectors[0].toarray())

bernoulli_nb = BernoulliNB()
bernoulli_nb.fit(X_train_vectors.toarray(), y_train)

bernoulli_nb.predict(X_test_vectors[0].toarray())

# CategoricalNB expects integer-encoded categorical features rather than
# continuous TF-IDF weights, so its prediction is left commented out.
categorical_nb = CategoricalNB()
categorical_nb.fit(X_train_vectors.toarray(), y_train)

#categorical_nb.predict(X_test_vectors[0].toarray())

multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train_vectors.toarray(), y_train)

multinomial_nb.predict(X_test_vectors[0].toarray())

array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [38]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(max_iter=1000)
logistic.fit(X_train_vectors, y_train)

logistic.predict(X_test_vectors[0])

array(['NEGATIVE'], dtype='<U8')

### **Evaluation**

#### Mean Accuracy

In [39]:
print(f"SVM Score : {svm.score(X_test_vectors, y_test)}")
print(f"Decision Tree Score : {decision_tree.score(X_test_vectors, y_test)}")
print(f"Gaussian Naive Bayes Score : {gaussian_nb.score(X_test_vectors.toarray(), y_test)}")
print(f"Bernoulli Naive Bayes Score : {bernoulli_nb.score(X_test_vectors.toarray(), y_test)}")
#print(f"Categorical Naive Bayes Score : {categorical_nb.score(X_test_vectors.toarray(), y_test)}")
print(f"Multinomial Naive Bayes Score : {multinomial_nb.score(X_test_vectors.toarray(), y_test)}")
print(f"Logistic Regression Score : {logistic.score(X_test_vectors, y_test)}")

SVM Score : 0.8076923076923077
Decision Tree Score : 0.6658653846153846
Gaussian Naive Bayes Score : 0.6610576923076923
Bernoulli Naive Bayes Score : 0.8269230769230769
Multinomial Naive Bayes Score : 0.8125
Logistic Regression Score : 0.8052884615384616
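
For scikit-learn classifiers, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label equals the true label. Because the test set was balanced with `evenly_distribute`, guessing a single class would score about 0.5. The plain-Python sketch below computes the same quantity on made-up labels; the `accuracy` helper is illustrative and not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


toy_true = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
toy_pred = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
print(accuracy(toy_true, toy_pred))  # 0.75
```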


#### F1 Score

In [40]:
from sklearn.metrics import f1_score

f1_score(y_test, svm.predict(X_test_vectors), average=None, 
         labels=(Sentiment.POSITIVE, Sentiment.NEGATIVE))

array([0.80582524, 0.80952381])
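
With `average=None`, `f1_score` returns one value per label, in the order given by `labels`. Each value is the harmonic mean of that label's precision and recall. The sketch below works this out by hand on made-up labels; `f1_for_label` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_true = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
toy_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

print(f1_for_label(toy_true, toy_pred, "POSITIVE"))  # 0.75
print(f1_for_label(toy_true, toy_pred, "NEGATIVE"))  # 0.5
```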

In [41]:
# Spot-check the SVM on a few hand-written reviews
test_set = ["I thoroughly enjoyed this, 5 stars", "bad book do not buy", "horrible waste of time", "not great",
            "very light and enjoyable read", "2 stars", "this book is badly underrated"]
new_test = vectorizer.transform(test_set)

svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE', 'POSITIVE',
       'POSITIVE', 'POSITIVE'], dtype='<U8')

### **Model Tuning (Grid Search)**

In [43]:
from sklearn.model_selection import GridSearchCV

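# Candidate hyperparameters: every kernel is tried with every value of C
parameters = {'kernel': ('linear', 'rbf'),
              'C': (1, 4, 8, 16, 32)}

svc = SVC()
# 5-fold cross-validation on the training vectors for each combination
classifier = GridSearchCV(svc, parameters, cv=5)
classifier.fit(X_train_vectors, y_train)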

In [44]:
classifier.score(X_test_vectors, y_test)

0.8100961538461539
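
After fitting, a `GridSearchCV` object exposes `best_params_`, `best_score_` (the mean cross-validated score of the winning combination) and `best_estimator_`; with the default `refit=True`, calling `score` on the search object evaluates that refitted estimator. The sketch below shows how these attributes might be inspected. It runs on a tiny made-up corpus with `cv=3` so that it stays self-contained; the texts, labels, and `toy_` names are assumptions for illustration, and the values it prints say nothing about the book-review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up corpus: six positive and six negative snippets
texts = [
    "great book", "loved this story", "wonderful read",
    "great characters", "really enjoyable", "loved the ending",
    "bad book", "terrible story", "boring read",
    "awful characters", "really boring", "hated the ending",
]
labels = ["POSITIVE"] * 6 + ["NEGATIVE"] * 6

toy_vectors = TfidfVectorizer().fit_transform(texts)

# Same grid as the notebook, but 3 folds because the toy corpus is so small
toy_search = GridSearchCV(
    SVC(),
    {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)},
    cv=3,
)
toy_search.fit(toy_vectors, labels)

print(toy_search.best_params_)     # winning kernel and C
print(toy_search.best_score_)      # mean cross-validated accuracy of that combination
print(toy_search.best_estimator_)  # SVC refitted on all of the toy data
```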