# Logistic Regression & Naive Bayes for Sentiment Analysis


In this analysis we compare a Naive Bayes model and a logistic regression model for predicting whether an Amazon review of a pillow product is good or bad.

Amazon pillow: https://www.amazon.com/Utopia-Bedding-Gusseted-Premium-Quality/dp/B09DSRLTQH/ref=sr_1_6?crid=1QEB93GTYC4U2&keywords=pillow&qid=1642479006&sprefix=pillow%2Caps%2C98&sr=8-6&th=1

# Steps: 


1.   Extract the review data from Amazon using Amazon Review Export (https://chrome.google.com/webstore/detail/amazon-review-export/ikphihiljfhlmpokjbmkhliphckfpcph?hl=en-US) to build the training and testing data sets.
2.   Download the data and define good vs. bad reviews by adding a column that maps a star rating of 4 or 5 to "**Positive**", 3 to "**Neutral**", and 1 or 2 to "**Negative**".

3.   Store the data in a GitHub repository: https://github.com/mazal-tov/amazon_reviews

4.   Import the Python packages needed to clean the data and run the models.

5.   Perform data cleanup and structure the data for model implementation.

6.   Train both models on the training set and review the results after evaluating them on the testing set.
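
The labeling rule in step 2 can be sketched as a small function. This is a minimal illustration, not the code used to build the dataset: the name `label_from_rating` is hypothetical, and the lowercase label strings are chosen to match the `pos_neg` values that appear in the loaded data.

```python
def label_from_rating(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label (hypothetical helper)."""
    if stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a star rating from 1 to 5, got {stars!r}")
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"


labels = [label_from_rating(s) for s in [5, 4, 3, 2, 1]]
print(labels)
```

With pandas, the same rule could be applied to a whole column as `df['review_rating'].map(label_from_rating)`.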




# Results Summary: 


*   The Naive Bayes model has an unbalanced output. Precision and recall mirror each other across the two classes (precision 0.87 and recall 0.41 for positive reviews; precision 0.40 and recall 0.87 for negative reviews), which we attribute to the data being concentrated in the positive reviews. To reduce this imbalance, the 4-star ratings were removed. The accuracy of this model is 55%, although its weighted-average precision is 73%. Tuning hyperparameters, or further balancing the data set, may help improve this model in the future.

*   The Logistic Regression model performed very well. It had an accuracy of 89%, with weighted-average precision and recall also at 89%.

Overall, the logistic regression model performs better for this case. This result is consistent with the distinction between generative and discriminative models: here Naive Bayes, the generative model, does not perform as well as logistic regression, the discriminative model, at classifying the reviews.
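
The generative versus discriminative distinction can be written out explicitly. What follows is the standard textbook formulation rather than anything specific to this notebook; the notation is an assumption of this sketch: $x_1, \dots, x_n$ are the word-count features of a review, $y$ is its label, and $w$ and $b$ are the logistic regression weights and bias.

Naive Bayes models how each class generates the features and then picks the most probable class:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Logistic regression models the class probability directly from the features:

$$P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$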



**References:**



1. https://www.youtube.com/watch?v=Xh6wFH3Fh7Q&list=WL&index=51&t=527s

2. https://www.youtube.com/watch?v=RZYjsw6P4nI&list=WL&index=48

3. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

4. https://towardsdatascience.com/sentiment-analysis-with-text-mining-13dd2b33de27

5. https://colab.research.google.com/github/guillermo-carrasco/logistic-sentiment/blob/master/Sentiment%20analysis%20with%20Logistic%20Regression.ipynb#scrollTo=p-tLdKhRVAdk

6. https://web.stanford.edu/~jurafsky/slp3/

7. https://towardsdatascience.com/natural-language-processing-count-vectorization-with-scikit-learn-e7804269bb5e

8. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html





# Import the Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn import linear_model 
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import re

# Download the tokenizer and stop-word data used by NLTK
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

# Load the data from Github

In [2]:
#load data from amazon with the new pos_neg label
df = pd.read_csv('https://raw.githubusercontent.com/mazal-tov/amazon_reviews/main/amazon_review.csv')

#print the dataframe
df

Unnamed: 0,review_ID,review_name,reviewers_name,review_rating,review_date,review_text,pos_neg
0,R2CFE5CNT75T4H,"Order them, you won't be sorry!",Special K,5,17-Apr-18,I was very skeptical about ordering these lol ...,positive
1,R32OAT2YT696AH,Beyond Impressed. I Put These Pillows Through...,DavesNotHere,5,21-Dec-17,As far as pillows go these would be 4 or 5 sta...,positive
2,R3NE66NT41LBRA,Dont BUY this pillow,garrick hileman,1,25-Jun-19,DONT BUY THEESE PILLOWS THEY CAME VACUUM SEALE...,negative
3,R26PM7HDCQ7UMN,Don't buy,M workman,1,23-Sep-18,I wouldn't know never received all of them. Th...,negative
4,R7QGLJKP9VI73,Wonderfully comfortable,Kindle Customer,5,5-Feb-18,My husband & I are both side sleepers who have...,positive
...,...,...,...,...,...,...,...
4928,R1DOCOGSYXR8C1,"Absolutely disgusting to have a ""brand new"" p...",vivian,1,16-Feb-17,Have not even taken it out of the plastic and ...,negative
4929,R12JNUNPFR3VZA,.......Two months later. Flat.,MatthewGB,1,24-Jul-17,I started to feel about a week ago that these ...,negative
4930,R30WWZR8ASUL3U,Not Impressed. Pillow Appears to be Used and ...,--wolverine--,1,15-Oct-17,These Utopia Bedding gusseted pillows are subs...,negative
4931,R2O6QWWCL32IAP,FLAT,J. Foster,1,18-Nov-17,I spent the good part of a Thursday night tryi...,negative


In [3]:
#check counts of unique values in the set --- our label should have 3 values 
df.nunique()

review_ID         4933
review_name       3360
reviewers_name    4201
review_rating        5
review_date       1639
review_text       4723
pos_neg              3
dtype: int64

In [4]:
#Donut plot of the rating distribution in the data
amazon_donut = px.pie(df, names='pos_neg', height=300, width=600, hole=0.7,
                      title='Amazon Reviews',
                      color_discrete_sequence=['#b20710', '#221f1f'])

amazon_donut.update_traces(hovertemplate=None, textposition='outside',
                           textinfo='percent+label', rotation=90)

amazon_donut.update_layout(margin=dict(t=60, b=30, l=0, r=0), showlegend=False,
                           plot_bgcolor='#333', paper_bgcolor='#333',
                           title_font=dict(size=25, color='#5a8d93', family="Lato, sans-serif"),
                           font=dict(size=17, color='#5a8d93'),
                           hoverlabel=dict(bgcolor="#640", font_size=13,
                                           font_family="Lato, sans-serif"))

In [5]:
df[["pos_neg"]].value_counts()

pos_neg 
positive    3290
negative    1196
neutral      447
dtype: int64

In [6]:
# Encode the labels as integers: positive -> 0, negative -> 1, neutral -> 2
df['pos_neg'] = df['pos_neg'].replace(['positive','negative', 'neutral'],[0, 1, 2])

In [7]:
#Remove the neutral ratings because they are probably going to be difficult to identify. 
#This is due to some people writing positive reviews but only providing 3 stars and vice versa for negative reviews. 
df = df.drop(df[df.pos_neg == 2].index)

In [8]:
# Removing the 4 star rating reviews to help balance the data
df = df.drop(df[df.review_rating == 4].index)

In [9]:
#check the new counts of the data with less positive reviews
df[["pos_neg"]].value_counts()

pos_neg
0          2739
1          1196
dtype: int64
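
As a quick check on how much dropping the 4-star reviews rebalanced the labels, the class shares can be computed from the counts printed in this notebook. This is a minimal sketch; the helper name `positive_share` is hypothetical, and the "before" figure already excludes the 447 neutral reviews.

```python
def positive_share(n_positive: int, n_negative: int) -> float:
    """Fraction of reviews labeled positive in a two-class set."""
    return n_positive / (n_positive + n_negative)


# Counts taken from the value_counts() outputs in this notebook.
before = positive_share(3290, 1196)  # neutral removed, 4-star kept
after = positive_share(2739, 1196)   # neutral and 4-star removed

print(f"positive share before: {before:.3f}")
print(f"positive share after:  {after:.3f}")
```

The set is still roughly 70/30 after the drop, so a classifier that always predicts "positive" would score about 70% accuracy on it. That figure is a useful baseline when reading the model results.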

In [10]:
#preview the column cast to string (the result is not assigned here; the next cell applies the cast)
df['review_text'].astype(str)

0       I was very skeptical about ordering these lol ...
1       As far as pillows go these would be 4 or 5 sta...
2       DONT BUY THEESE PILLOWS THEY CAME VACUUM SEALE...
3       I wouldn't know never received all of them. Th...
4       My husband & I are both side sleepers who have...
                              ...                        
4928    Have not even taken it out of the plastic and ...
4929    I started to feel about a week ago that these ...
4930    These Utopia Bedding gusseted pillows are subs...
4931    I spent the good part of a Thursday night tryi...
4932    I don't know if I would hate the pillows or no...
Name: review_text, Length: 3935, dtype: object

In [11]:
df['review_text']=df['review_text'].apply(str)

In [12]:
#Clean the text: remove punctuation, lowercase everything, and drop English stop words.
#This matters because Amazon reviewers can write in multiple languages and add non-word text; only English is explored here.
#A Snowball stemmer (love, loved, loves -> love) is created below but is not applied in this cell.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stop_words = set(stopwords.words("english"))

#regex=True is passed explicitly because the pandas default is changing from True to False
df['review_text'] = df.review_text.str.replace(r"[^\w\s]", "", regex=True).str.lower()

df['review_text'] = df['review_text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))



In [13]:
#print the new dataframe with cleaned reviews
df['review_text']

0       skeptical ordering lol im side sleeper major p...
1       far pillows go would 4 5 stars 2 almost 35 mon...
2       dont buy theese pillows came vacuum sealed inc...
3       wouldnt know never received others seem ok cal...
4       husband side sleepers disk nerve issues necks ...
                              ...                        
4928    even taken plastic see stain either dirt mold ...
4929    started feel week ago flattening compared set ...
4930    utopia bedding gusseted pillows substandard es...
4931    spent good part thursday night trying fluff pi...
4932    dont know would hate pillows since received bl...
Name: review_text, Length: 3935, dtype: object
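
The same cleaning steps can be sketched with only the standard library, which makes each step easy to inspect in isolation. This is an illustration under assumptions: `clean_review` is a hypothetical helper, and `STOP_WORDS` is a tiny stand-in for NLTK's English stop-word list.

```python
import re

# Tiny stand-in for nltk.corpus.stopwords.words("english") (assumption).
STOP_WORDS = {"i", "was", "very", "about", "these", "the", "a", "and"}


def clean_review(text: str) -> str:
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = re.sub(r"[^\w\s]", "", text)  # keep word characters and whitespace
    tokens = no_punct.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


print(clean_review("I was very skeptical about ordering these, lol!"))
print(clean_review("Don't buy"))
```

Because punctuation is stripped before the stop-word filter runs, contractions such as "don't" become "dont". That is likely why tokens like `dont` and `wouldnt` survive in the cleaned reviews printed above.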

In [14]:
#split the dataframe between labels and text
df_y = df['pos_neg']
df_x = df['review_text']

In [15]:
#Use count vectorizer to count words in corpus and generate a vector for our models
cv = CountVectorizer(max_df=0.90, min_df=2, max_features=2000, stop_words='english')

In [16]:
#check data type
type(df_x)

pandas.core.series.Series

In [17]:
#change data type to list
df_x = df_x.values.tolist()

In [18]:
#check data type
type(df_x)

list

In [19]:
#fit the count vectorizer and convert the result to a dense array
#each row is one review; each column holds the number of times one vocabulary word appears in that review
x_traincv=cv.fit_transform(df_x).toarray()
x_traincv

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [20]:
#check total count of each word in corpus
print(x_traincv.sum(axis=0))

[29 15  2 ... 29  3  5]
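
To make the vectorizer output concrete, here is a minimal bag-of-words sketch in plain Python. It mirrors the idea behind `CountVectorizer` (one column per vocabulary word, one row per document) but is not scikit-learn's implementation: `build_vocabulary` and `count_vector` are hypothetical names, the two documents are made up, and the `max_df`, `min_df`, and `max_features` filtering is omitted.

```python
from collections import Counter


def build_vocabulary(docs: list[str]) -> list[str]:
    """Sorted list of unique whitespace-separated tokens across all documents."""
    return sorted({token for doc in docs for token in doc.split()})


def count_vector(doc: str, vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


docs = ["pillow flat flat", "love pillow"]  # made-up examples
vocab = build_vocabulary(docs)
matrix = [count_vector(d, vocab) for d in docs]

print(vocab)   # column labels
print(matrix)  # one row of counts per document
```

Summing each column of `matrix` gives the per-word totals across the corpus, which is what `x_traincv.sum(axis=0)` reports for the real data.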


In [21]:
#le = LabelEncoder()
#y = le.fit_transform(df_y)

#set up input y for model - labels
y = df_y

In [22]:
#print y
y

0       0
1       0
2       1
3       1
4       0
       ..
4928    1
4929    1
4930    1
4931    1
4932    1
Name: pos_neg, Length: 3935, dtype: int64

In [23]:
#split the data into testing and training
x_train, x_test, y_train, y_test = train_test_split(x_traincv,y,test_size= 0.30, random_state = 0)

In [24]:
#train the model with the training set
bayes_classifier = GaussianNB()
bayes_classifier.fit(x_train, y_train)

GaussianNB()

In [25]:
# test the model
# run the confusion matrix
y_pred = bayes_classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
cm

array([[332, 480],
       [ 48, 321]])

In [26]:
# print the accuracy of the model
print ("Accuracy : %0.5f \n\n" % accuracy_score(y_test, y_pred))
print (classification_report(y_test, y_pred))

Accuracy : 0.55292 


              precision    recall  f1-score   support

           0       0.87      0.41      0.56       812
           1       0.40      0.87      0.55       369

    accuracy                           0.55      1181
   macro avg       0.64      0.64      0.55      1181
weighted avg       0.73      0.55      0.55      1181
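
The figures in this report can be reproduced by hand from the confusion matrix, which helps explain why precision and recall mirror each other across the two classes. This is a minimal sketch using the matrix printed above, with rows as true labels and columns as predicted labels (the scikit-learn convention); the variable names are chosen for this example.

```python
# Confusion matrix from the GaussianNB run: rows = true label, columns = predicted.
cm = [[332, 480],
      [48, 321]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Class 0 = positive reviews, class 1 = negative reviews.
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])

print(f"accuracy    {accuracy:.2f}")
print(f"class 0     precision {precision_0:.2f}  recall {recall_0:.2f}")
print(f"class 1     precision {precision_1:.2f}  recall {recall_1:.2f}")
```

The 480 positive reviews that were predicted as negative appear in both the recall denominator for class 0 and the precision denominator for class 1, so that single cell drives both low scores.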



# Logistic Regression Model

In [27]:
#import additional libraries
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [28]:
#set up inputs -- rerun count vectorizer (kept as a sparse matrix this time)
vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=2000, stop_words='english')
x=vectorizer.fit_transform(df_x)

In [29]:
# split the data into training and testing
Xtrain, Xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=42)

In [30]:
#train the LR model 
lr = LogisticRegression()
lr.fit(Xtrain,ytrain)

LogisticRegression()

In [31]:
#print model results
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest,lr.predict(Xtest)))
print (classification_report(ytest, lr.predict(Xtest)))

Accuracy : 0.89162 


              precision    recall  f1-score   support

           0       0.90      0.95      0.92       821
           1       0.87      0.76      0.81       360

    accuracy                           0.89      1181
   macro avg       0.89      0.85      0.87      1181
weighted avg       0.89      0.89      0.89      1181
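
Note that the two models above were evaluated on different splits (`random_state=0` for Naive Bayes, `random_state=42` for logistic regression), and that the vectorizer was fit on the full corpus before splitting. A hedged sketch of a like-for-like comparison follows: both models share one split, and each `Pipeline` fits its vectorizer on the training fold only. The toy `texts` and `labels` are made up for illustration, and `MultinomialNB` is swapped in for `GaussianNB` as a variant commonly used with word counts. None of this is the notebook's original method, and no result is claimed for the Amazon data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (assumption); 0 = positive, 1 = negative as in the notebook.
texts = [
    "love pillow soft comfortable", "great pillow sleep well",
    "comfortable soft great value", "love great sleep",
    "flat pillow returned", "dont buy flat lumpy",
    "lumpy smell returned", "dont buy smell bad",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# One split shared by both models, stratified to keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

models = {
    "naive_bayes": Pipeline([("cv", CountVectorizer()), ("clf", MultinomialNB())]),
    "logistic_regression": Pipeline(
        [("cv", CountVectorizer()), ("clf", LogisticRegression(max_iter=1000))]
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)  # vectorizer vocabulary comes from X_train only
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

To try this on the review data, `texts` and `labels` would be replaced by `df_x` and `y`, keeping a single `random_state` for both models.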

