## 3.6.3 Amazon Reviews

In this kernel I train a model on reviews of musical instruments to predict whether a review is good or bad. I then apply the same approach to other review sets to see whether it performs well in other situations or overfits to this particular product type.

In [1]:
import pandas as pd
import numpy as np
import gzip
import json
from pandas.io.json import json_normalize
from pprint import pprint
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn import ensemble
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

In [29]:
#loading data from JSON can be awkward. Luckily, the Amazon site that hosts the data provides code to load it directly into a
#dataframe
def parse(path):
    g = gzip.open(path)
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
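
The loader above passes every line to `eval`, which executes whatever the file contains. If the file is strict JSON with one object per line (an assumption; check it for your copy of the data), a sketch using `json.loads` avoids that. The helper name `parse_json_lines` and the in-memory sample are hypothetical, for illustration only.

```python
import gzip
import io
import json

def parse_json_lines(fileobj):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines stream."""
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as g:
        for line in g:
            line = line.strip()
            if line:
                yield json.loads(line)

# Build a tiny gzip file in memory so the sketch runs without the real data set.
sample = [
    {"reviewText": "Works great", "overall": 5.0},
    {"reviewText": "Broke in a week", "overall": 1.0},
]
buf = io.BytesIO()
with gzip.open(buf, mode="wt", encoding="utf-8") as g:
    for row in sample:
        g.write(json.dumps(row) + "\n")
buf.seek(0)

records = list(parse_json_lines(buf))
print(records[1]["overall"])
```

`gzip.open` also accepts a file path, and a list of dicts like `records` can be handed to `pd.DataFrame(records)` to get the same kind of frame as `getDF`.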

In [30]:
#the path is passed in rather than hard-coded in the function, so it is easier to load other data sets later
df = getDF(r"C:\Users\jmfra\Downloads\reviews_Musical_Instruments_5.json.gz")

In [31]:
#let's take a look
df.head()

Unnamed: 0,reviewerName,asin,helpful,reviewText,reviewTime,reviewerID,summary,overall,unixReviewTime
0,"cassandra tu ""Yeah, well, that's just like, u...",1384719342,"[0, 0]","Not much to write about here, but it does exac...","02 28, 2014",A2IBPI20UZIR0U,good,5.0,1393545600
1,Jake,1384719342,"[13, 14]",The product does exactly as it should and is q...,"03 16, 2013",A14VAT5EAX3D9S,Jake,5.0,1363392000
2,"Rick Bennette ""Rick Bennette""",1384719342,"[1, 1]",The primary job of this device is to block the...,"08 28, 2013",A195EZSQDW3E21,It Does The Job Well,5.0,1377648000
3,"RustyBill ""Sunday Rocker""",1384719342,"[0, 0]",Nice windscreen protects my MXL mic and preven...,"02 14, 2014",A2C00NNG1ZQQG2,GOOD WINDSCREEN FOR THE MONEY,5.0,1392336000
4,SEAN MASLANKA,1384719342,"[0, 0]",This pop filter is great. It looks and perform...,"02 21, 2014",A94QU4C90B1AX,No more pops when I record my vocals.,5.0,1392940800


In [32]:
#the first thing we notice is that some columns are not going to help identify a good review.
#asin is just a product ID.
#reviewerID is just an arbitrary ID assigned to each account. Both it and reviewerName will be useless unless someone is
#writing a large number of reviews and giving them all similar scores
print(df['reviewerName'].groupby(df['reviewerName']).count().describe())
df['reviewerID'].groupby(df['reviewerID']).count().describe()

count    1397.000000
mean        7.325698
std         4.289890
min         1.000000
25%         5.000000
50%         6.000000
75%         8.000000
max        66.000000
Name: reviewerName, dtype: float64


count    1429.000000
mean        7.180546
std         3.731858
min         5.000000
25%         5.000000
50%         6.000000
75%         8.000000
max        42.000000
Name: reviewerID, dtype: float64

In [33]:
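#in roughly 10,000 reviews, the most any single account (reviewerID) wrote is 42. The maximum of 66 under reviewerName comes
#from display names shared by more than one account. Either way no reviewer dominates, so we can ignore these columns.
#The day and time of the review shouldn't be useful either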
df = df.drop(['asin', 'reviewerID', 'reviewerName', 'reviewTime', 'unixReviewTime'], axis=1)

In [34]:
#let's reevaluate
df.head()

Unnamed: 0,helpful,reviewText,summary,overall
0,"[0, 0]","Not much to write about here, but it does exac...",good,5.0
1,"[13, 14]",The product does exactly as it should and is q...,Jake,5.0
2,"[1, 1]",The primary job of this device is to block the...,It Does The Job Well,5.0
3,"[0, 0]",Nice windscreen protects my MXL mic and preven...,GOOD WINDSCREEN FOR THE MONEY,5.0
4,"[0, 0]",This pop filter is great. It looks and perform...,No more pops when I record my vocals.,5.0


In [35]:
#let's take a look at the counts for each rating
df['overall'].groupby(df['overall']).count()

overall
1.0     217
2.0     250
3.0     772
4.0    2084
5.0    6938
Name: overall, dtype: int64

In [36]:
#it will be much easier to perform sentiment analysis on the text than to run a regression on the remaining data, so let's
#drop helpful, the last remaining non-text column
df = df.drop('helpful', axis=1)

In [10]:
#actually coming up with good and bad identifiers would take a very long time, with no better method than guess and check
#to test which ones improve the model. Instead of doing this, I will load some positively toned and negatively toned
#words from a list I found online. In previous models, these lists have performed very well as a baseline and will be
#a good indicator of which model fits the data best
good = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\positive.txt", header=None)
good.columns = ['x']
bad = pd.read_csv(r"C:\Users\jmfra\OneDrive\Documents\Thinkful Data Science Files\2.2.7\negative.txt", header=None)
bad.columns = ['x']
goodwords = list(good.values.flatten())
badwords = list(bad.values.flatten())

In [37]:
#sentiment analysis works much better when the outcome variable is a boolean. Based on the counts above, we are going to
#define a good rating as 4 or above, since the counts drop off so much below that
df['overall']  = np.where(df['overall'] >= 4,1,0)

In [38]:
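#let's make a true/false version of overall to make it easier to model. Note that overall_bool is True for a bad review
#(overall == 0), so "bad" is the positive class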
df['overall_bool'] = (df['overall'] == 0)
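
With the threshold at 4, the two classes are far from balanced, which matters when reading accuracy scores later on. A short worked calculation from the rating counts printed earlier shows what a model that always answers "good" would score. The variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the value counts printed earlier in this notebook.
rating_counts = {1.0: 217, 2.0: 250, 3.0: 772, 4.0: 2084, 5.0: 6938}

good = sum(n for rating, n in rating_counts.items() if rating >= 4)
total = sum(rating_counts.values())
bad = total - good

# Accuracy of a model that always predicts the majority class ("good").
majority_baseline = good / total
print(good, bad, total)
print(round(majority_baseline, 4))
```

Any accuracy reported on this data is best compared against this figure rather than against 50%.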

In [13]:
#these next two cells create a boolean column for every word in the lists we created above. Their purpose is to tell us whether the
#word shows up in the review, and then our model will hopefully choose good or bad based on the total number of matching words
#from each list. For example, if a review says "good", "bad" and "amazing", and these are the only words that show up in the
#lists, we would expect the model to call it a good review because the count of good words is 2 and the count of bad words is 1
keywords_good = goodwords

for key in keywords_good:
    df[str(key)] = df.reviewText.str.contains(str(key), case=False)

In [14]:
keywords_bad = badwords

#note: case=True makes the bad-word match case-sensitive, unlike the good-word match above
for key in keywords_bad:
    df[str(key)] = df.reviewText.str.contains(str(key), case=True)
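
By default `str.contains` treats each keyword as a regular expression and matches it anywhere in the text, so "bad" also fires on "badly" and a keyword containing a character such as "+" is read as a pattern. A minimal sketch of whole-word, case-insensitive matching with the standard library follows. `has_word` and the sample strings are hypothetical and not part of the notebook.

```python
import re

def has_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    # re.escape makes characters such as "+" literal; the lookarounds require
    # that the match is not glued to another word character on either side.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

reviews = ["Sounds GOOD for the price", "Badly made strap", "I give it an A+"]
for review in reviews:
    print(review, [w for w in ("good", "bad", "a+") if has_word(review, w)])
```

The same pattern string could be passed to `df.reviewText.str.contains(pattern, case=False, na=False)` to keep the one-column-per-word layout used here.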

In [15]:
#the next 4 cells select the dependent and independent variables and then train 3 different models on these
#variables
X = df[keywords_good + keywords_bad]
y = df['overall_bool']

In [16]:
bnb = BernoulliNB()
x_b = cross_val_score(bnb, X, y, cv=10).mean()

In [17]:
decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    random_state = 300
)
x_d = cross_val_score(decision_tree, X, y, cv=10).mean()

In [18]:
rfc = ensemble.RandomForestClassifier()
x_r = cross_val_score(rfc, X, y, cv=10).mean()

In [19]:
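#now that we have the base scores for each classifier, let's loop over several train/test splits to see whether the scores
#on a subset are similar to those on the entire set, and which test size performs best across the board.
#note: cross_val_score clones and refits the estimator on folds of the data it is given, so the scores below are 10-fold
#cross-validation scores on the test split alone; the fits on X_train do not affect them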
dfr2 = pd.DataFrame()
bernoulli_scores = [x_b]
tree_scores = [x_d]
forest_scores = [x_r]
test_sizes = [.1,.2,.3,.4,.5]
for p in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=p, random_state=0)
    bernoulli_fit = bnb.fit(X_train, y_train)
    x_b2 = cross_val_score(bernoulli_fit, X_test, y_test, cv=10).mean()
    decision_tree_fit = decision_tree.fit(X_train,y_train)
    x_d2 = cross_val_score(decision_tree_fit, X_test, y_test, cv=10).mean()
    forest_fit = rfc.fit(X_train, y_train)
    x_r2 = cross_val_score(forest_fit, X_test, y_test, cv=10).mean()
    bernoulli_scores.append(x_b2)
    tree_scores.append(x_d2)
    forest_scores.append(x_r2)
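
`cross_val_score` clones the estimator it is given and refits it on folds of the data passed in, so calling it on `X_test` does not score the model that was fit on `X_train`. A minimal sketch of scoring a fitted model on held-out data follows. It uses synthetic 0/1 features rather than the review data, and `X_demo`, `y_demo` and the labeling rule are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the keyword columns: 500 rows of 0/1 features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 20))
# Illustrative rule: the label is True when either of the first two features is set.
y_demo = (X_demo[:, 0] | X_demo[:, 1]).astype(bool)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = BernoulliNB().fit(X_tr, y_tr)       # fit on the training split only
held_out_accuracy = model.score(X_te, y_te)  # accuracy on rows the model never saw
print(held_out_accuracy)
```

In the loop above, the equivalent call would be `bernoulli_fit.score(X_test, y_test)`; the numbers it produces would differ from the cross-validated ones recorded in this notebook.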

In [20]:
#let's add these scores to a data frame so they are easy to analyze
dfr2['bernoulli'] = bernoulli_scores
dfr2.index = ['total_set','test_set=.1','test_set=.2','test_set=.3','test_set=.4','test_set=.5']
dfr2['decision_tree'] = tree_scores
dfr2['random_forest'] = forest_scores
dfr2

Unnamed: 0,bernoulli,decision_tree,random_forest
total_set,0.851766,0.817855,0.87438
test_set=.1,0.883179,0.808135,0.886111
test_set=.2,0.884569,0.826581,0.887974
test_set=.3,0.876915,0.824283,0.882752
test_set=.4,0.867724,0.829962,0.88039
test_set=.5,0.86533,0.829667,0.876632


In [21]:
#as you can see from the numbers, the random forest classifier performs better than the other two in every row,
#although the Bernoulli classifier holds its own. Both peak at a test size of .2.
#now let's test on two other data sets to see if we can get good results with these word lists and the classifier of our choice
df_2 = getDF(r"C:\Users\jmfra\Downloads\reviews_Amazon_Instant_Video_5.json.gz")
df_3 = getDF(r"C:\Users\jmfra\Downloads\reviews_Digital_Music_5.json.gz")

In [22]:
#we have to do exactly the same data preparation for these two sets as for the original one
for key in keywords_good:
    df_2[str(key)] = df_2.reviewText.str.contains(str(key), case=False)    

In [23]:
for key in keywords_good:
    df_3[str(key)] = df_3.reviewText.str.contains(str(key), case=False)

In [24]:
for key in keywords_bad:
    df_2[str(key)] = df_2.reviewText.str.contains(str(key), case=True)

In [25]:
for key in keywords_bad:
    df_3[str(key)] = df_3.reviewText.str.contains(str(key), case=True)

In [26]:
df_2['overall']  = np.where(df_2['overall'] >= 4,1,0)
df_2['overall_bool'] = (df_2['overall'] == 0)
df_3['overall']  = np.where(df_3['overall'] >= 4,1,0)
df_3['overall_bool'] = (df_3['overall'] == 0)
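#the features and targets must come from df_2 and df_3; building them from df would only re-score the musical instruments set
X2 = df_2[keywords_good + keywords_bad]
y2 = df_2['overall_bool']
X3 = df_3[keywords_good + keywords_bad]
y3 = df_3['overall_bool']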

In [27]:
#cross_val_score clones the estimator and refits it on folds of whatever set it is given, so the fit on X_train below does not
#carry over; each score is a 10-fold cross-validation of a fresh random forest on that set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
forest_fit = rfc.fit(X_train, y_train)
x_r22 = cross_val_score(forest_fit, X2, y2, cv=10).mean()
x_r23 = cross_val_score(forest_fit, X3, y3, cv=10).mean()
print('instant video review score:',x_r22)
print('digital music review score:',x_r23)

instant video review score: 0.877985308939
digital music review score: 0.876036464792


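As you can see, both scores are within roughly .01 of the random forest score on the musical instruments set, which suggests that the word lists are a reasonably good identifier of good and bad reviews. The result should be read with some caution. Only the random forest was scored on these sets, and cross_val_score refits it on each one, so this does not show that the model trained on musical instruments transfers, or that random forests will outperform the other classifiers here. Scores this close are also what re-scoring the musical instruments set itself would give, so confirm that X2 and X3 were built from df_2 and df_3 before relying on them.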
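
Testing whether a model trained on one review set carries over to another means fitting once on the source set and scoring on the target set without refitting. A sketch with synthetic stand-ins for the two sets follows. `make_domain` and its parameters are hypothetical, and the numbers it prints say nothing about the Amazon data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_domain(seed, n_rows, flip_rate):
    """Synthetic 0/1 features; the label follows feature 0, with some labels flipped."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_rows, 10))
    y = X[:, 0].astype(bool)
    flips = rng.random(n_rows) < flip_rate
    return X, np.where(flips, ~y, y)

X_source, y_source = make_domain(seed=0, n_rows=400, flip_rate=0.05)
X_target, y_target = make_domain(seed=1, n_rows=400, flip_rate=0.20)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_source, y_source)                    # trained on the source set only

source_score = forest.score(X_source, y_source)
target_score = forest.score(X_target, y_target)   # no refit on the target set

# Share of the most common label in the target set: the score to beat.
target_baseline = max(y_target.mean(), 1 - y_target.mean())
print(source_score, target_score, target_baseline)
```

Comparing `target_score` with `target_baseline` shows whether the transferred model does better than always guessing the majority class on the new set.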