## Amazon Review Sentiment Analysis

This project creates a model used to predict whether an Amazon review is helpful based on the words in the review.

### Methods/Libraries Used

* gzip to extract data from .gz Amazon file
* Pandas to see/transform the data
* NLTK to remove stopwords from reviews
* Scikit-learn's CountVectorizer to create a bag of words model used for the decision trees
* Scikit-learn's RandomForestClassifier to create the prediction model

In [1]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import gzip
import re
import nltk
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
pd.options.mode.chained_assignment = None  # default='warn'

### 1. Read in the data into a dataframe
Note: Data is located here http://jmcauley.ucsd.edu/data/amazon/

Use gzip and pandas to get a dataframe of the data.

In [2]:
#Taken from Julian McAuley: http://jmcauley.ucsd.edu/data/amazon/
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('health_and_personal.json.gz')

In [3]:
print df.shape
df.head()

(346355, 9)


Unnamed: 0,reviewerID,asin,reviewerName,helpful,unixReviewTime,reviewText,overall,reviewTime,summary
0,ALC5GH8CAMAI7,159985130X,AnnN,"[1, 1]",1294185600,This is a great little gadget to have around. ...,5,"01 5, 2011",Handy little gadget
1,AHKSURW85PJUE,159985130X,"AZ buyer ""AZ buyer""","[1, 1]",1329523200,I would recommend this for a travel magnifier ...,4,"02 18, 2012",Small & may need to encourage battery
2,A38RMU1Y5TDP9,159985130X,"Bob Tobias ""Robert Tobias""","[75, 77]",1275955200,What I liked was the quality of the lens and t...,4,"06 8, 2010",Very good but not great
3,A1XZUG7DFXXOS4,159985130X,Cat lover,"[56, 60]",1202428800,Love the Great point light pocket magnifier! ...,4,"02 8, 2008",great addition to your purse
4,A1MS3M7M7AM13X,159985130X,Cricketoes,"[1, 1]",1313452800,This is very nice. You pull out on the magnifi...,5,"08 16, 2011",Very nice and convenient.


### 2. Create missing columns
In order to determine if a review is helpful, the ratio of the helpful to unhelpful counts for each review needs to be calculated. That is done by dividing helpful votes/unhelpful votes, but that data is stored in one column, so new columns needs to be created from the exisiting 'helpful' column. A simple for loop and lists are used to do so.

In [4]:
#We need to have some sort of label in order to figure out if a review is helpful or not helpful, so we'll need to define an
#arbitrary ratio + 
a = df['helpful'][:]
helpful = []
notHelpful = []
for i in a:
    helpful.append(i[0])
    notHelpful.append(i[1])
df['helpUp'] = helpful
df['helpDown'] = notHelpful
df['hRatio'] = df.helpUp/df.helpDown
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,unixReviewTime,reviewText,overall,reviewTime,summary,helpUp,helpDown,hRatio
0,ALC5GH8CAMAI7,159985130X,AnnN,"[1, 1]",1294185600,This is a great little gadget to have around. ...,5,"01 5, 2011",Handy little gadget,1,1,1.0
1,AHKSURW85PJUE,159985130X,"AZ buyer ""AZ buyer""","[1, 1]",1329523200,I would recommend this for a travel magnifier ...,4,"02 18, 2012",Small & may need to encourage battery,1,1,1.0
2,A38RMU1Y5TDP9,159985130X,"Bob Tobias ""Robert Tobias""","[75, 77]",1275955200,What I liked was the quality of the lens and t...,4,"06 8, 2010",Very good but not great,75,77,0.974026
3,A1XZUG7DFXXOS4,159985130X,Cat lover,"[56, 60]",1202428800,Love the Great point light pocket magnifier! ...,4,"02 8, 2008",great addition to your purse,56,60,0.933333
4,A1MS3M7M7AM13X,159985130X,Cricketoes,"[1, 1]",1313452800,This is very nice. You pull out on the magnifi...,5,"08 16, 2011",Very nice and convenient.,1,1,1.0


In [5]:
df['helpUp'][df.helpUp == 0].count()
df['helpUp'][df.helpDown == 0].count()

185557

### 3. Subset the data
Determining what constitutes a 'helpful' review is tough, and if we're going off ratios reviews where 1 out of 1 person found it helpful wil lbe considered very helpful even though it might not be. I arbitrarily decided that a review is more credible if at least 3 people voted it to be helpful. I then subsetted the dataframe for those reviews. 

After that, I created four classifiers for the data and assigned the column to the dataframe:
* 0% - 25% is considered useless
* 25% - 50% is considered unhelpful
* 50% - 75% is considered moderarely helpful
* 75% - 100% is considered helpful

Lastly, I subsetted the first 10000 rows as the training dataset.

In [6]:
#Remove all 0 - 2 reviews
df2 = df.loc[df.helpUp > 2]
df2 = df2.reset_index(drop=True)
#0-25% = useless, 25-50% is unhelpful, 50-75% is moderate,  75-100% is helpful
senti = pd.Series(np.random.randn(len(df2.asin)))
df2 = df2.assign(sentiment = senti.values)
df2.loc[:,'sentiment'][df2['hRatio']<0.25] = 'useless'
df2.loc[:,'sentiment'][(df2['hRatio']>0.25) & (df2['hRatio']<0.50)] = 'unhelpful'
df2.loc[:,'sentiment'][(df2['hRatio']>0.50) & (df2['hRatio']<0.75)] = 'moderate'
df2.loc[:,'sentiment'][(df2['hRatio']>0.75) & (df2['hRatio']<= 1)] = 'helpful'
train = df2[0:10000]

### 4. Clean the reviews 
Clean the reviews so they can be used for a bag of words model.

In [8]:
#Function to clean a review. 
#1. Use regex to remove all non letters
#2. Lowercase all words and split the review into words
#3. remove all "stopwords" from the review
#4. Join back the final words into the cleaned review
samp = train['reviewText'][0]
def to_words(review):
    letters = re.sub("[^a-zA-Z]", " ", review)
    words = letters.lower().split()
    good_words = [w for w in words if not w in stopwords.words("english")]
    return( " ".join( good_words ))   

clean_review = to_words( samp )
print clean_review

#Clean all the reviews and append the reviews to a list.
print "\n Cleaning and parsing the training set Amazon reviews...\n"
train_rvws = []
num_rvws = train['reviewText'].size
for i in xrange( 0, num_rvws ):
    if( (i+1)%1000 == 0 ):
        print "Review %d of %d\n" % ( i+1, num_rvws )                                                                    
    train_rvws.append( to_words( train["reviewText"][i] ))


liked quality lens built light lens discernable distortion anywhere magnified everything evenly without ripples distortion seen low cost magnifiers light nice touch easy use want pull lens bit focused close center look provides nice even coverage like brightness actually dimmness light focused leds lots brighter know seen also light focuses center field view lens close focused properly bottom line good value magnifier could made great better quality control btw feel honest effective reviews take place first hand experiences lacking online shopping always appreciated help received reviewers work hard return favor best hope found review helpful anything thought lacking unclear leave comment fix

 Cleaning and parsing the training set Amazon reviews...

Review 1000 of 10000

Review 2000 of 10000

Review 3000 of 10000

Review 4000 of 10000

Review 5000 of 10000

Review 6000 of 10000

Review 7000 of 10000

Review 8000 of 10000

Review 9000 of 10000

Review 10000 of 10000



### 5. Train the model
In order to train the model, we will give the model certain words (features) to look for to classify the review. These words are chosen from the cleaned reviews.

CountVectorizer is an object by scikit-learn used to create the features from text, and you can see the attributes can be modified to fit certain needs. fit_transform() trains the model and creates vectors from the reviews. We convert the vectors to numpy arrays for faster processing.

Finally use RandomForestClassifier to create the model in an object called forest.

In [9]:
#Get the features to train the decision tree using CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 1000) 
train_data_features = vectorizer.fit_transform(train_rvws)
train_data_features = train_data_features.toarray()

In [11]:
#See the bag of words and features
vocab = vectorizer.get_feature_names()
print vocab

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it appears in the training set
# Uncomment if you want to print
#for tag, count in zip(vocab, dist):
#    print count, tag

[u'able', u'absolutely', u'absorption', u'accurate', u'acid', u'acids', u'acne', u'across', u'action', u'active', u'actually', u'add', u'added', u'addition', u'age', u'ago', u'aid', u'air', u'alcohol', u'allergic', u'allergies', u'allergy', u'almost', u'alone', u'along', u'already', u'also', u'alternative', u'although', u'always', u'amazing', u'amazon', u'amino', u'amount', u'another', u'anti', u'antibiotics', u'anxiety', u'anymore', u'anyone', u'anything', u'anyway', u'anywhere', u'applied', u'apply', u'area', u'areas', u'around', u'arrived', u'asleep', u'available', u'average', u'avoid', u'away', u'baby', u'back', u'bacteria', u'bad', u'bag', u'bar', u'bars', u'base', u'based', u'basically', u'bath', u'bathroom', u'batteries', u'battery', u'beard', u'become', u'bed', u'began', u'behind', u'believe', u'benefit', u'benefits', u'best', u'better', u'big', u'bit', u'black', u'blade', u'blades', u'blood', u'blue', u'body', u'bottle', u'bottles', u'bottom', u'bought', u'box', u'brain', u'br

In [12]:
print "Training the random forest..."
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100) 

forest = forest.fit( train_data_features, train["sentiment"] )

Training the random forest...


### Repeat on the test dataset
Here we repeat all the same steps on the test dataset and use the forest model on the features gotten from the test data. The ouput is then converted into a dictionary with the ASIN and results.

In [13]:
test = df2[25000:35000]
test.index = range(10000)
test_rvws = []
num_rvws = test['reviewText'].size
for i in xrange(0, num_rvws):
    if( (i+1)%1000 == 0 ):
        print "Review %d of %d\n" % ( i+1, num_rvws )                                                                    
    test_rvws.append( to_words( test["reviewText"][i] ))

test_data_features = vectorizer.transform(test_rvws)
test_data_features = test_data_features.toarray()
result = forest.predict(test_data_features)
output = pd.DataFrame( data={"id":test["asin"], "sentiment":result} )

Review 1000 of 10000

Review 2000 of 10000

Review 3000 of 10000

Review 4000 of 10000

Review 5000 of 10000

Review 6000 of 10000

Review 7000 of 10000

Review 8000 of 10000

Review 9000 of 10000

Review 10000 of 10000



### 6. Test the accuracy
See how the results of the test line up with the results of the model using a for loop.

In [14]:
#confirm ASIN line up
print output.head()
print test[['asin','sentiment']].head()
count = 0.0

#loop through each row in the sentiment column and compare if they are equal, if so, add + 1 to the count
for i in xrange(0,len(output)):
    if test['sentiment'][i] == output['sentiment'][i]:
        count += 1
acc = 100*(count/len(output))
print "Model is {}% accurate".format(acc)

           id sentiment
0  B001OBZG6C   helpful
1  B001OBZHGG   helpful
2  B001OBZHGG   helpful
3  B001OCEVAS   helpful
4  B001OCEVAS   helpful
         asin sentiment
0  B001OBZG6C   helpful
1  B001OBZHGG   helpful
2  B001OBZHGG   helpful
3  B001OCEVAS  moderate
4  B001OCEVAS   helpful
Model is 69.52% accurate
