# Naive Bayes Model for Amazon Reviews

This is a Naive Bayes model built to classify reviews left on Amazon for a variety of products as positive or negative. The data are from the [UC-Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). There are a total of 1,000 reviews: 500 positive and 500 negative.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import sklearn
import string
from collections import Counter
%matplotlib inline

In [2]:
reviews = pd.read_csv('sentiment labelled sentences/amazon_cells_labelled.txt', 
                      delimiter='\t',
                      header=None)

In [3]:
reviews.head()

| | 0 | 1 |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [4]:
# None disables column-width truncation (-1 is deprecated in pandas 1.0+)
pd.set_option('display.max_colwidth', None)
reviews.columns = ['review', 'sentiment']

In [5]:
reviews.describe()

| | sentiment |
|---|---|
| count | 1000.0 |
| mean | 0.5 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


In [6]:
good = reviews.loc[reviews['sentiment'] == 1]
bad = reviews.loc[reviews['sentiment'] == 0]

In [28]:
# Rerun this cell to cycle through good or bad reviews
good['review'].sample(10)

112    its a little geeky but i think thats its sex on toast and it rocks and oozes sex right down to its battery embedded sleek stylish leather case.
877    Excellent!.                                                                                                                                    
284    I got it because it was so small and adorable.                                                                                                 
301    Now I know that I made a wise decision.                                                                                                        
673    It is well made, easy to access the phone and has a handy, detachable belt clip.                                                               
918    Works for me.                                                                                                                                  
811    #1 It Works - #2 It is Comfortable.                                                    

In [26]:
def words(review):
    """Strip punctuation, lowercase, and return the review as a list of words."""
    translator = str.maketrans('', '', string.punctuation)
    return review.translate(translator).lower().split()

def words_as_string(review):
    """Same cleaning as words(), but return one space-joined string."""
    return ' '.join(words(review))
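
To make the cleaning step concrete, here is a small standalone check of what the tokenizer does to a review. It repeats the same `str.translate` logic as `words` so it can run on its own; the second sample sentence is made up for illustration. Note that apostrophes count as punctuation, which is why tokens such as `dont`, `doesnt`, and `ive` show up in the word counts later on.

```python
import string

def words(review):
    # Same cleaning as the notebook's words(): strip punctuation, lowercase, split
    translator = str.maketrans('', '', string.punctuation)
    return review.translate(translator).lower().split()

sample = "Good case, Excellent value."
tokens = words(sample)
print(tokens)            # ['good', 'case', 'excellent', 'value']
print(' '.join(tokens))  # good case excellent value

# Apostrophes are punctuation too, so contractions collapse into one token
print(words("It doesn't work!"))  # ['it', 'doesnt', 'work']
```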

In [9]:
# .assign returns a new DataFrame, which avoids the SettingWithCopyWarning
# that assigning a column directly on a filtered slice would raise
good = good.assign(review_stripped=good['review'].apply(words))
bad = bad.assign(review_stripped=bad['review'].apply(words))


In [10]:
# Flatten the per-review word lists into one long list per sentiment class
full_list_good_review_words = [w for r in good['review_stripped'] for w in r]
full_list_bad_review_words = [w for r in bad['review_stripped'] for w in r]

In [11]:
# Using Counter to programmatically identify high-usage words
Counter(full_list_good_review_words).most_common(75)

[('the', 237),
 ('and', 188),
 ('i', 154),
 ('is', 141),
 ('it', 128),
 ('this', 105),
 ('a', 105),
 ('great', 92),
 ('to', 86),
 ('phone', 86),
 ('my', 72),
 ('very', 69),
 ('for', 66),
 ('with', 65),
 ('good', 62),
 ('of', 49),
 ('works', 46),
 ('on', 44),
 ('have', 38),
 ('was', 36),
 ('in', 34),
 ('product', 33),
 ('that', 32),
 ('quality', 31),
 ('well', 31),
 ('headset', 31),
 ('sound', 27),
 ('excellent', 26),
 ('so', 26),
 ('price', 25),
 ('has', 24),
 ('its', 24),
 ('one', 23),
 ('are', 22),
 ('battery', 22),
 ('nice', 22),
 ('you', 21),
 ('use', 21),
 ('best', 21),
 ('had', 21),
 ('but', 21),
 ('recommend', 20),
 ('as', 20),
 ('all', 20),
 ('love', 20),
 ('ive', 19),
 ('than', 19),
 ('case', 18),
 ('like', 18),
 ('would', 17),
 ('from', 16),
 ('ear', 16),
 ('any', 15),
 ('not', 15),
 ('really', 15),
 ('comfortable', 14),
 ('easy', 14),
 ('your', 14),
 ('happy', 13),
 ('these', 13),
 ('new', 12),
 ('up', 12),
 ('fine', 12),
 ('bluetooth', 12),
 ('just', 12),
 ('been', 12),
 ...]

In [12]:
# Manual review of the Counter list
list_of_good_words = ['great', 'good', 'works', 'quality', 'well', 'excellent', 
                      'best', 'recommend', 'love', 'like', 'really', 'comfortable', 
                      'easy', 'happy', 'new', 'fine', 'better']
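
The keyword list above was picked by hand from the raw counts, where common words like "the" and "phone" dominate both classes. One possible way to shortlist candidates, shown here only as a sketch and not as what this notebook did, is to rank words by how much more often they appear in one class than in the other. The function `distinctive_words`, its `min_count` and `top_n` parameters, and the toy word lists are all hypothetical.

```python
from collections import Counter

def distinctive_words(target_words, other_words, min_count=2, top_n=5):
    """Rank words by how much more often they occur in target_words than in other_words."""
    target_counts = Counter(target_words)
    other_counts = Counter(other_words)
    scored = []
    for word, count in target_counts.items():
        if count < min_count:
            continue
        # Add-one smoothing keeps the ratio finite for words absent from other_words
        ratio = (count + 1) / (other_counts[word] + 1)
        scored.append((word, ratio))
    # Highest ratio first; ties broken alphabetically
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Toy lists standing in for full_list_good_review_words / full_list_bad_review_words
toy_good = ['the', 'phone', 'is', 'great', 'great', 'the', 'love', 'love', 'it']
toy_bad = ['the', 'phone', 'is', 'poor', 'poor', 'the', 'waste', 'it']

print(distinctive_words(toy_good, toy_bad))
# [('great', 3.0), ('love', 3.0), ('the', 1.0)]
```

Words with a ratio near 1.0, such as "the" here, appear about equally in both classes and carry little signal.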

In [13]:
Counter(full_list_bad_review_words).most_common(75)

[('the', 276),
 ('i', 162),
 ('it', 153),
 ('and', 122),
 ('a', 113),
 ('to', 110),
 ('is', 102),
 ('not', 102),
 ('this', 101),
 ('phone', 76),
 ('my', 71),
 ('of', 70),
 ('for', 55),
 ('in', 54),
 ('was', 54),
 ('that', 48),
 ('you', 47),
 ('with', 47),
 ('on', 45),
 ('have', 35),
 ('very', 34),
 ('had', 27),
 ('dont', 26),
 ('as', 25),
 ('but', 25),
 ('work', 25),
 ('if', 24),
 ('battery', 23),
 ('product', 22),
 ('all', 21),
 ('after', 21),
 ('me', 20),
 ('are', 20),
 ('use', 20),
 ('ear', 19),
 ('does', 19),
 ('its', 19),
 ('money', 18),
 ('your', 18),
 ('quality', 18),
 ('one', 17),
 ('from', 17),
 ('would', 17),
 ('out', 17),
 ('only', 17),
 ('so', 16),
 ('time', 16),
 ('headset', 16),
 ('at', 16),
 ('be', 16),
 ('or', 15),
 ('then', 15),
 ('do', 15),
 ('first', 15),
 ('poor', 15),
 ('service', 15),
 ('when', 15),
 ('no', 14),
 ('get', 14),
 ('up', 14),
 ('what', 14),
 ('waste', 14),
 ('sound', 14),
 ('doesnt', 14),
 ('buy', 14),
 ('bad', 14),
 ('worst', 14),
 ('could', 13),
 ...]

In [14]:
list_of_bad_words = ['not', 'dont', 'work', 'out', 'poor', 'no', 'doesnt', 'bad', 'worst']

In [15]:
# Remove upper case and punctuation from reviews
reviews['review_stripped'] = reviews['review'].apply(words_as_string)

In [16]:
# Turning sentiment from 1/0 to boolean
reviews['sentiment'] = (reviews['sentiment'] == 1)

In [17]:
for gw in list_of_good_words:
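    # Adding spaces around each word to eliminate pattern matching inside other words.
    # Caveat: a word at the very start or end of a review has no space on that side,
    # so it is missed (e.g. 'great' in 'the mic is great').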
    reviews[str(gw)] = reviews['review_stripped'].str.contains(' ' + str(gw) + ' ')
    
    
    
# Found model had poor performance when the list of bad words was included
#
# for bw in list_of_bad_words:
#     reviews[str(bw)] = reviews['review_stripped'].str.contains(' ' + str(bw) + ' ')
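
Padding each keyword with spaces has a side effect: a keyword that is the first or last word of a review is never matched. A regular-expression word boundary (`\b`) is one way around this. The sketch below compares the two checks using only the standard library; `contains_word` and `contains_padded` are hypothetical helper names, and the third sample sentence is made up.

```python
import re

def contains_word(text, phrase):
    """Return True if phrase occurs in text as a whole word or phrase."""
    # \b matches at the start and end of the string as well as next to spaces
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return re.search(pattern, text) is not None

def contains_padded(text, phrase):
    """The space-padded check used in the cell above, for comparison."""
    return (' ' + phrase + ' ') in text

samples = ['the mic is great', 'great for the jawbone', 'it was greatly overpriced']
for review in samples:
    print(review, contains_padded(review, 'great'), contains_word(review, 'great'))
# the mic is great False True
# great for the jawbone False True
# it was greatly overpriced False False
```

The same pattern string could be passed to `reviews['review_stripped'].str.contains(...)`, which treats its argument as a regular expression by default. The results reported in this notebook come from the space-padded version, so switching would change the counts.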

In [18]:
reviews.head()

| | review | sentiment | review_stripped | great | good | works | quality | well | excellent | best | recommend | love | like | really | comfortable | easy | happy | new | fine | better |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | So there is no way for me to plug it in here in the US unless I go by a converter. | False | so there is no way for me to plug it in here in the us unless i go by a converter | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 1 | Good case, Excellent value. | True | good case excellent value | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False |
| 2 | Great for the jawbone. | True | great for the jawbone | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 3 | Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!! | False | tied to charger for conversations lasting more than 45 minutesmajor problems | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | The mic is great. | True | the mic is great | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |


In [19]:
# Setting up the Bernoulli Naive Bayes model
from sklearn.naive_bayes import BernoulliNB

data = reviews.iloc[:, 3:]
target = reviews.iloc[:, 1]

bnb = BernoulliNB()

bnb.fit(data, target)

y_pred = bnb.predict(data)

print("Number of mislabeled points out of {} points: {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

Number of mislabeled points out of 1000 points: 344


In [30]:
# Adding more good words to try to increase model performance
more_good_words = ['extended battery', 'capacity', 'must have', 'exceptional', 'perfectly',
                   'satisfied', 'it works']

for mgw in more_good_words:
    reviews[str(mgw)] = reviews['review_stripped'].str.contains(' ' + str(mgw) + ' ')

In [31]:
data2 = reviews.iloc[:, 3:]
target2 = reviews.iloc[:, 1]

bnb2 = BernoulliNB()

bnb2.fit(data2, target2)
y_pred2 = bnb2.predict(data2)

print("Number of mislabeled points out of {} points: {}".format(
    data2.shape[0],
    (target2 != y_pred2).sum()
))

Number of mislabeled points out of 1000 points: 338


# Model Performance
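The model does not perform well. The second version mislabels 338 of 1,000 reviews, an accuracy of about 66%, only slightly better than the 344 mislabeled by the first version. Because the classes are balanced, always predicting one class would already score 50%. The 66% figure is also measured on the same data used to fit the model, so it likely overstates performance on unseen reviews.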
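
The mislabeled counts above come from predicting on the same rows the model was fit on. A common way to get a less optimistic estimate is to hold out part of the data for testing. The sketch below shows the mechanics with scikit-learn's `train_test_split`, `accuracy_score`, and `confusion_matrix`. It uses synthetic boolean features as a stand-in for the keyword columns, so its numbers say nothing about the Amazon reviews; the array names and the 25% test size are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the keyword features: 200 rows, 5 binary columns
rng = np.random.default_rng(0)
present = rng.random((200, 5)) < 0.3
X = present.astype(int)
# Label is True when either of the first two "keywords" is present,
# with roughly 10% of labels flipped as noise
y = (present[:, 0] | present[:, 1]) ^ (rng.random(200) < 0.1)

# Hold out 25% of the rows; stratify keeps the class balance in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = BernoulliNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Held-out accuracy:", accuracy_score(y_test, y_pred))
# Rows are true classes (False, True); columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```

In the notebook, `data2` and `target2` would take the place of `X` and `y`. The confusion matrix also shows whether errors are concentrated in positive or negative reviews, which a single mislabeled count hides.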