# **Detecting if a statement is positive or negative**
## with sklearn

### [Amazon dataset link](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews)

### [books_small.json source](https://github.com/KeithGalli/sklearn/blob/master/data/sentiment/Books_small.json)

## Data Classes

In [1]:
import random
import pickle

class sentiment:
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:
            return sentiment.NEGATIVE
        elif self.score == 3:
            return sentiment.NEUTRAL
        else:
            return sentiment.POSITIVE
        
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def evenly_dual(self):
        # Keep only negative and positive reviews, then trim the larger class to the size of the smaller one
        negative = list(filter(lambda x: x.sentiment == sentiment.NEGATIVE, self.reviews))
        len_neg = len(negative)
        positive = list(filter(lambda x: x.sentiment == sentiment.POSITIVE, self.reviews))
        len_pos = len(positive)
        prime_len = min(len_neg, len_pos)
        if prime_len == len_pos:
            neg_shrunk = negative[:prime_len]
            self.reviews = positive + neg_shrunk
        else:
            pos_shrunk = positive[:prime_len]
            self.reviews = negative + pos_shrunk
        random.shuffle(self.reviews)


    def evenly_trio(self):
        negative = list(filter(lambda x: x.sentiment == sentiment.NEGATIVE, self.reviews))
        len_neg = len(negative)
        positive = list(filter(lambda x: x.sentiment == sentiment.POSITIVE, self.reviews))
        len_pos = len(positive)
        neutral = list(filter(lambda x: x.sentiment == sentiment.NEUTRAL, self.reviews))
        len_neu = len(neutral)
        prime_len = min(len_pos, len_neg, len_neu)
        if prime_len == len_pos:
            neu_shrunk = neutral[:prime_len]
            neg_shrunk = negative[:prime_len]
            self.reviews = positive + neg_shrunk + neu_shrunk
        elif prime_len == len_neg:
            neu_shrunk = neutral[:prime_len]
            pos_shrunk = positive[:prime_len]
            self.reviews = negative + pos_shrunk + neu_shrunk
        else:
            neg_shrunk = negative[:prime_len]
            pos_shrunk = positive[:prime_len]
            self.reviews = neutral + pos_shrunk + neg_shrunk
        random.shuffle(self.reviews)

    def get_text(self):
        return [ x.text for x in self.reviews]
    
    def get_y(self):
        return [ y.sentiment for y in self.reviews]
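
The balancing idea in `evenly_dual` and `evenly_trio` can probably be written once for any number of classes. Below is a minimal sketch, not the code used in this notebook: `balance_by_label` is a hypothetical helper, and plain `(text, label)` tuples stand in for `Review` objects. Unlike `evenly_dual`, it does not drop any class, so neutral reviews would have to be filtered out first to get the two-class behaviour.

```python
import random
from collections import defaultdict

def balance_by_label(items, get_label, seed=None):
    """Downsample every class to the size of the smallest one, then shuffle."""
    groups = defaultdict(list)
    for item in items:
        groups[get_label(item)].append(item)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group[:smallest])  # keep the first `smallest` items of each class
    random.Random(seed).shuffle(balanced)
    return balanced

# Toy data (assumption): (text, label) pairs standing in for Review objects.
toy = [("a", "POSITIVE")] * 5 + [("b", "NEGATIVE")] * 2 + [("c", "NEUTRAL")] * 3
balanced = balance_by_label(toy, get_label=lambda pair: pair[1], seed=0)
counts = {label: sum(1 for _, item_label in balanced if item_label == label)
          for label in ("POSITIVE", "NEGATIVE", "NEUTRAL")}
print(counts)  # {'POSITIVE': 2, 'NEGATIVE': 2, 'NEUTRAL': 2}
```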
    

## Load Data

In [2]:
import json

file_name_s = 'books_small.json'
file_name = 'books_small_10000.json'

# We only need each review's text and rating.
reviews = []


with open(file_name) as f:
    for line in f:
        inline = json.loads(line)
        reviews.append(Review(inline['reviewText'],inline['overall']))

In [3]:
reviews[5].text

'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\'s trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\'s.'

## Bag of words

> Bag of Words (BoW) is a method for extracting features from text documents. It creates a "bag" of the individual words in each document, disregarding grammar and word order, to create a numerical representation of the text. This representation can then be used in natural language processing (NLP) applications, such as sentiment analysis, word embedding, or topic modelling.
>
> The bag of words approach is simple to understand and implement. It requires minimal preprocessing compared to other methods and often works well for tasks such as sentiment classification, although it ignores word order and context.

<table>
<tr>
<th></th>
<th>This</th>
<th>book</th>
<th>is</th>
<th>great</th>
<th>was</th>
<th>so</th>
<th>bad</th>
</tr>
<tr>
<td>This book is great!</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>This book was so bad</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>

<tr>
<td>was a very great book</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
</table>

> Each row of 1s and 0s is the vector for one sentence; every position in the vector maps to the same vocabulary word across all sentences.
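
To make the table concrete, here is a small pure-Python sketch that rebuilds it. It is only an illustration of the idea, not how `CountVectorizer` is implemented: the helper names (`tokenize`, `build_vocabulary`, `to_vector`) are made up for this example, the vocabulary is kept in order of first appearance to match the table columns, and it records presence (1/0) rather than word counts.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def build_vocabulary(sentences):
    """Collect the unique words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in tokenize(sentence):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def to_vector(sentence, vocabulary):
    """Mark 1 where a vocabulary word occurs in the sentence, else 0."""
    words = set(tokenize(sentence))
    return [1 if word in words else 0 for word in vocabulary]

# The vocabulary is learned from the first two sentences only.
vocabulary = build_vocabulary(["This book is great!", "This book was so bad"])
print(vocabulary)
# ['this', 'book', 'is', 'great', 'was', 'so', 'bad']

# Words outside the vocabulary ("a", "very") are simply ignored.
third_row = to_vector("was a very great book", vocabulary)
print(third_row)
# [0, 1, 0, 1, 1, 0, 0]
```

The third sentence shows why unseen words do not add columns: the vocabulary is fixed once it has been built.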


In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split( reviews, test_size=0.33, random_state=42)

In [5]:
train_container_2 = ReviewContainer(train)
test_container_2 = ReviewContainer(test)
# train_container_3 = ReviewContainer(train)
# test_container_3 = ReviewContainer(test)

In [6]:
train_container_2.evenly_dual()
# train_container_3.evenly_trio()

train_x = train_container_2.get_text()
train_y = train_container_2.get_y()

# train_x3 = train_container_3.get_text()
# train_y3 = train_container_3.get_y()

test_container_2.evenly_dual()
# test_container_3.evenly_trio()

test_x = test_container_2.get_text()
test_y = test_container_2.get_y()

# test_x3 = test_container_3.get_text()
# test_y3 = test_container_3.get_y()

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

## Using something other than CountVectorizer
[Source for TfidfVectorizer](https://youtu.be/M9Itm95JzL0?t=5005)

In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
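
For intuition on how TF-IDF differs from plain counts, here is a rough pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is my understanding of the `TfidfVectorizer` defaults. The tokenisation here is a plain whitespace split, so the numbers will not match sklearn exactly, and `tfidf_vectors` is a made-up helper name.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows

docs = ["this book is great", "this book was so bad", "this was a great book"]
vocabulary, rows = tfidf_vectors(docs)
first = dict(zip(vocabulary, rows[0]))

# "this" appears in every document, "great" in two, "is" in only one,
# so within the first document the rarer words get the larger weights.
print(first["is"] > first["great"] > first["this"])  # True
```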

In [9]:
# Run this after the fit_transform cell below so that the saved vectorizer is already fitted.
with open('./models/vectorized_text.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)

In [8]:

# Fit the vectorizer on the training text and transform it into a matrix
train_x_vector = vectorizer.fit_transform(train_x)
# train_x3_vector = vectorizer.fit_transform(train_x3)

# Only transform the test data; the vectorizer must not be fitted on it
test_x_vector = vectorizer.transform(test_x)
# test_x3_vector = vectorizer.transform(test_x3)

## Classification

### Linear SVM

In [59]:
from sklearn import svm
linear_svm_model = svm.SVC(kernel='linear')
linear_svm_model.fit(train_x_vector, train_y)

# linear_svm_model3 = svm.SVC(kernel='linear')
# linear_svm_model3.fit(train_x3_vector, train_y3)

In [97]:

print(test_x[0], test_y[0], sep="\n")
sup = linear_svm_model.predict(test_x_vector[0])

the book was meant for young adults  and I as an older adult still enjoyed the story and I have always enjoyed Sandra Dallas books and her style of writing whether for the young or old
POSITIVE


In [98]:
sup

array(['POSITIVE'], dtype='<U8')

## Decision Tree

In [60]:
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

# dec_tree3 = DecisionTreeClassifier()
# dec_tree3.fit(train_x3_vector, train_y3)

In [10]:
dec_tree.predict(test_x_vector[0])

array(['POSITIVE'], dtype='<U8')

## Naive Bayes

In [61]:
from sklearn.naive_bayes import GaussianNB

guass = GaussianNB()
guass.fit(train_x_vector.toarray(), train_y)

# guass3 = GaussianNB()
# guass3.fit(train_x3_vector.toarray(), train_y3)

In [12]:
guass.predict(test_x_vector[0].toarray())

array(['POSITIVE'], dtype='<U8')

In [62]:
import warnings
warnings.filterwarnings('ignore')

## Logistic regression

In [9]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(train_x_vector,train_y)

# log_reg3 = LogisticRegression()
# log_reg3.fit(train_x3_vector,train_y3)

In [10]:
log_reg.predict(test_x_vector[0])

array(['NEGATIVE'], dtype='<U8')

### Linear regression

Linear regression is retired here; the cells below are kept commented out for reference.

In [None]:
# def turn_no(sentiment):
#     if sentiment == 'POSITIVE':
#         return 2
#     if sentiment == 'NEGATIVE':
#         return 0
#     if sentiment == 'NEUTRAL':
#         return 1

In [None]:
# from sklearn.linear_model import LinearRegression

# train_y_no = [ turn_no(i) for i in train_y]

# lin_reg = LinearRegression().fit(train_x_vector, train_y_no)

In [None]:
# lin_reg.predict(test_x_vector[0])

## RATING THE MODELS

### SCORES

In [64]:
# svm 2
linear_svm_model.score(test_x_vector, test_y)

0.8076923076923077

In [None]:
# svm 3
# linear_svm_model3.score(test_x3_vector, test_y3)

In [65]:
# decision tree
dec_tree.score(test_x_vector, test_y)

0.6322115384615384

In [None]:
# decision tree 3
# dec_tree3.score(test_x3_vector, test_y3)

In [66]:
# gaussian naive bayes 2
guass.score(test_x_vector.toarray(), test_y)

0.6610576923076923

In [None]:
# gaussian naive bayes 3
# guass3.score(test_x3_vector.toarray(), test_y3)

In [11]:
#logistic regression
log_reg.score(test_x_vector, test_y)

0.8149038461538461

In [None]:
#Linear regression
# ty = [ turn_no(i) for i in test_y]
# tx = test_x_vector

# def scoring(model,x,y,way='round'):
#     if way == 'round':
#         y_pred = np.round(model.predict(x))
#     if way == 'ceil':
#         y_pred = np.ceil(model.predict(x))
#     accuracy = np.sum(y_pred ==y)/len(y)
#     return accuracy

# print('with ceil', scoring(lin_reg,tx,ty,'ceil'))
# print('with round', scoring(lin_reg,tx,ty))

### F1 scores

In [68]:

from sklearn.metrics import f1_score
pnn = ['POSITIVE','NEUTRAL','NEGATIVE']
pn = ['POSITIVE','NEGATIVE']

# svm
f1_score(test_y, linear_svm_model.predict(test_x_vector), average=None, labels=pn)

array([0.80582524, 0.80952381])

In [69]:
#logistic regression
f1_score(test_y, log_reg.predict(test_x_vector), average=None, labels=pn)

array([0.80291971, 0.80760095])

In [70]:
#decision tree
f1_score(test_y, dec_tree.predict(test_x_vector), average=None, labels=pn)

array([0.62407862, 0.64      ])

In [71]:
# Gaussian naive Bayes
f1_score(test_y, guass.predict(test_x_vector.toarray()), average=None, labels=pn)

array([0.65693431, 0.66508314])
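
With `average=None`, `f1_score` returns one value per label, in the order given by `labels`. As a rough illustration of what each value means, here is a hand-rolled sketch on made-up toy labels (not the notebook's data); `f1_per_label` is a hypothetical helper, not part of sklearn.

```python
def f1_per_label(y_true, y_pred, labels):
    """Compute F1 = 2 * precision * recall / (precision + recall) for each label."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
        fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return scores

# Toy example (assumption): a model that leans towards POSITIVE.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
scores = f1_per_label(y_true, y_pred, labels=["POSITIVE", "NEGATIVE"])
print([round(s, 2) for s in scores])  # [0.75, 0.5]
```

Accuracy for the same toy predictions is 4/6, which hides how much worse the NEGATIVE class does; that is the reason for looking at F1 per label as well as `.score()`.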

> Judging by the class counts, the 'books_small.json' file contains far more positive reviews than negative ones.
>
> This biases the model towards the positive class: it tends to label reviews as positive more often than negative.
>
> We can balance it out either by trimming the classes to equal length or by finding more data. I chose to find more data,
> so I used the 'books_small_10000.json' file.
>
> The classes are then trimmed to an equal number of reviews, using the negative class as the reference length.
>
> I will make two models: one for positive and negative only, and one for positive, negative and neutral.

In [None]:
# print(sentiment.POSITIVE, train_y.count(sentiment.POSITIVE))
# print(sentiment.NEGATIVE, train_y.count(sentiment.NEGATIVE))
# print(sentiment.NEUTRAL, train_y.count(sentiment.NEUTRAL))

### SVM and logistic regression seem to work best here
## TESTING IT

In [12]:
test_set = ['good reviews','try to get this for mental health ','it sucks ']
new_test = vectorizer.transform(test_set)

linear_svm_model.predict(new_test)

NameError: name 'linear_svm_model' is not defined

In [16]:
test_set = ['bad','ugh ','it sucks ']
new_test = vectorizer.transform(test_set)

In [17]:
log_reg.predict(new_test)

array(['NEGATIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')
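
When testing on raw strings like this, it is easy to forget the `vectorizer.transform` step or to use a different vectorizer than the one the model was trained with. One possible alternative, which this notebook does not use, is to bundle both steps with sklearn's `make_pipeline`. The sketch below trains on a tiny made-up corpus, so its predictions say nothing about the real models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (assumption), only to show the mechanics.
toy_texts = [
    "great book, loved the story",
    "wonderful characters and a great ending",
    "boring plot and bad writing",
    "it sucks, a bad and boring read",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline fits the vectorizer and the classifier together on the training text.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_clf.fit(toy_texts, toy_labels)

# Raw strings go straight into predict; the pipeline applies transform internally.
predictions = text_clf.predict(["great story", "bad and boring"])
print(predictions)
```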

### Logistic regression with CountVectorizer
### SVM with TfidfVectorizer

## Tuning model with Grid search

In [None]:
import pandas as pd

In [82]:
from sklearn.model_selection import GridSearchCV

parameters =  {'kernel': ('linear', 'rbf', 'sigmoid'), 'C': (1,4,8,16,32)}
svc = svm.SVC()

grid_svm = GridSearchCV(svc, parameters, cv=5)
grid_svm.fit(train_x_vector, train_y)
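
To get a feel for how much work this grid search does, the candidate settings can be enumerated by hand. This is only a sketch of the idea using `itertools.product`, not sklearn's internal code; the `parameters` dict is copied from the cell above and `n_folds` mirrors `cv=5`.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf', 'sigmoid'), 'C': (1, 4, 8, 16, 32)}
n_folds = 5  # same value as cv=5 above

# Every combination of one kernel with one C value is a candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(candidates))            # 15
print(candidates[0])              # {'kernel': 'linear', 'C': 1}
print(len(candidates) * n_folds)  # 75
```

So the search trains 75 models during cross-validation, plus one final refit of the best candidate on the whole training set (the default `refit=True`).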

In [None]:
sorted(grid_svm.cv_results_.keys())

In [None]:
print(grid_svm.cv_results_)

In [89]:

results_df = pd.DataFrame(grid_svm.cv_results_)
results_df = results_df.sort_values(by=['rank_test_score'])
results_df = results_df.set_index(
    results_df["params"].apply(lambda x: "_".join(str(val) for val in x.values()))
).rename_axis("kernel")
results_df[["params", "rank_test_score", "mean_test_score", "std_test_score"]]

| kernel | params | rank_test_score | mean_test_score | std_test_score |
|---|---|---|---|---|
| 4_rbf | {'C': 4, 'kernel': 'rbf'} | 1 | 0.836007 | 0.031859 |
| 8_rbf | {'C': 8, 'kernel': 'rbf'} | 1 | 0.836007 | 0.031859 |
| 16_rbf | {'C': 16, 'kernel': 'rbf'} | 1 | 0.836007 | 0.031859 |
| 32_rbf | {'C': 32, 'kernel': 'rbf'} | 1 | 0.836007 | 0.031859 |
| 1_rbf | {'C': 1, 'kernel': 'rbf'} | 5 | 0.831435 | 0.04112 |
| 1_sigmoid | {'C': 1, 'kernel': 'sigmoid'} | 6 | 0.829149 | 0.030447 |
| 1_linear | {'C': 1, 'kernel': 'linear'} | 7 | 0.824552 | 0.029452 |
| 4_sigmoid | {'C': 4, 'kernel': 'sigmoid'} | 8 | 0.815369 | 0.020905 |
| 4_linear | {'C': 4, 'kernel': 'linear'} | 9 | 0.814233 | 0.019899 |
| 8_linear | {'C': 8, 'kernel': 'linear'} | 9 | 0.814233 | 0.019899 |


In [90]:
grid_svm.score(test_x_vector,test_y)

0.8197115384615384

In [91]:
# svm 2
linear_svm_model.score(test_x_vector, test_y)

0.8076923076923077

## SAVING THE MODEL

In [94]:
with open('./models/svm_tfipvector_k_rbf_c4.pickle', 'wb') as f:
    pickle.dump(grid_svm, f)

In [96]:
with open('./models/log_reg_tfipvector.pickle','wb') as f:
    pickle.dump(log_reg, f)

In [33]:
# with open('./models/log_reg_countVectorizer.pickle','wb') as f:
#     pickle.dump(log_reg, f)

## LOAD MODEL

In [18]:
with open('./models/svm_tfipvector_k_rbf_c4.pickle', 'rb') as f:
    loaded_svm = pickle.load(f)

In [21]:
with open('./models/log_reg_tfipvector.pickle', 'rb') as f:
    loaded_og = pickle.load(f)

In [None]:
## THE BEST ONE
with open('./models/log_reg_countVectorizer.pickle','rb') as f:
    loaded_log_CountVectorizer = pickle.load(f)
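
A loaded classifier only makes sense together with the vectorizer it was trained with, and here the two live in separate pickle files. One option (not what this notebook does) is to save both in a single file so they cannot get out of sync. The sketch below shows the round trip with plain dicts standing in for the fitted objects; the file name and the dict contents are made up.

```python
import os
import pickle
import tempfile

# Stand-ins (assumption) for the fitted vectorizer and the trained model.
bundle = {
    "vectorizer": {"vocabulary": ["bad", "book", "great"]},
    "model": {"name": "log_reg", "labels": ["NEGATIVE", "POSITIVE"]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_bundle.pickle")  # hypothetical file name
    # Save both objects in one file...
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    # ...and load them back together.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == bundle)  # True
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.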

In [34]:
log_reg.predict(test_x_vector[188])

array(['NEGATIVE'], dtype='<U8')

In [35]:
loaded_og.predict(test_x_vector[188])

array(['NEGATIVE'], dtype='<U8')

In [36]:
print(test_x[188], test_y[188], sep="\n")

Not sufficiently interesting for me to read more of this author. The humour was broad and predictable. The setting , in rural US and 'gee whiz' big cities are different, was rather twee. Murder, tick, humour, tick, romance, tick, - writing by tick-box. Not badly written, but not very interesting either.
NEGATIVE
