# Project: Amazon Product Review Sentiment Analyzer

As part of the assessment for this course, you will work on a mini project using an Amazon Product Review dataset (`toys_n_games.csv`). The dataset was originally taken from https://jmcauley.ucsd.edu/data/amazon/ but has been processed in a form easy to read in. 

The main objective of this project is **to create a sentiment analyzer for Amazon Product Reviews**.

We have broken up this notebook into different section to guide you in doing this project. Together in a group, work on the project by completing this notebook and present your work.

## Load the Dataset

In [1]:
import pandas as pd

In [2]:
#load in the dataset
df = pd.read_csv("toys_n_games.csv")
df.head()

Unnamed: 0,summary,rating,review
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...
1,it works pretty good for moving to different a...,4.0,Love the magnet easel... great for moving to d...
2,love this!,5.0,Both sides are magnetic. A real plus when you...
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...
4,Great to have so he can play with his alphabet...,4.0,I have a stainless steel refrigerator therefor...


## Generate the Training Set and Testing Set
When working with sentiment analysis, we often need to find a quick and easy way to generate training data for building our sentiment analyzer. For product reviews, one easy way to generate the training data is to treat:
- reviews with rating 5/5 as positive reviews
- reviews with rating 1/5 as negative reviews

It is not always the case that the data is balanced. So, we should do some **data exploration** to get the sensing of the dataset we have.

### Data Exploration

In [3]:
#there's 102k reviews with rating 5.0
df[df["rating"] == 5.0]

Unnamed: 0,summary,rating,review
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...
2,love this!,5.0,Both sides are magnetic. A real plus when you...
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...
7,Great,5.0,My granddaughter really really likes this. I l...
10,Works great!,5.0,Very nice to use with magnet letters and pract...
...,...,...,...
167591,.,5.0,"Ive owned other helicopters in the past, the o..."
167592,Very fun,5.0,This drone is very fun and super duarable. Its...
167593,Coolest toy on the market,5.0,This is my brother's most prized toy. It's ext...
167594,A great idea for kids!,5.0,This Panther Drone toy is awesome. I definitel...


In [4]:
#but there's only 4.7k reviews with rating 1.0
df[df["rating"] == 1.0]

Unnamed: 0,summary,rating,review
157,A crappy cardboard ghost of the original. Har...,1.0,A crappy cardboard ghost of the original. Har...
165,Lots of FUN!! Very CHEAP made!!!,1.0,We have this same game but it was made in 1967...
186,Booooorrrring,1.0,Hated this product.Predictable. Not fun. It ...
191,Disappointing,1.0,"I had high hopes for this game, as I am a big ..."
298,not too impressed,1.0,thought this was a book with pages to illustra...
...,...,...,...
167211,A Waste of Money,1.0,Don't waste your time or money with this ill-c...
167215,Disappointed.,1.0,It look interesting and fun in the beginning b...
167220,Little let down Rocket Racoon!,1.0,Man this figure is a big let down. He has no a...
167407,Flimsy and not exciting,1.0,Very flimsy and not that exciting. If you hav...


So in order to ensure that the data is balanced, we will sample the same number of reviews for each category. For example 4500 positive reviews and 4500 negative reviews.

### Task: Generate the Training Dataset and Testing Dataset
To keep things simple, we will just use the `review` column and ignore the `summary` column.

In [5]:
#here are some additional pandas dataframe manipulation codes to help you along

#note that the index is not in continuous numbering because we have filtered it by rating == 5.0
first_10rows = df[df["rating"] == 5.0][0:10]
first_10rows

Unnamed: 0,summary,rating,review
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...
2,love this!,5.0,Both sides are magnetic. A real plus when you...
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...
7,Great,5.0,My granddaughter really really likes this. I l...
10,Works great!,5.0,Very nice to use with magnet letters and pract...
11,Perfect magnetic board!!!,5.0,Bought this board along with Melissa and Doug ...
13,great magnet board,5.0,sturdy and perfect for coffee table. magnets s...
14,Every child should have this!,5.0,We had purchased Melissa & Doug magnets & had ...
15,Quality product,5.0,We purchased this to go with the magnetic lett...
16,great size,5.0,This easel is the perfect size for my three ye...


In [6]:
#add a new "polarity" column
first_10rows["polarity"] = "positive"
first_10rows

Unnamed: 0,summary,rating,review,polarity
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...,positive
2,love this!,5.0,Both sides are magnetic. A real plus when you...,positive
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...,positive
7,Great,5.0,My granddaughter really really likes this. I l...,positive
10,Works great!,5.0,Very nice to use with magnet letters and pract...,positive
11,Perfect magnetic board!!!,5.0,Bought this board along with Melissa and Doug ...,positive
13,great magnet board,5.0,sturdy and perfect for coffee table. magnets s...,positive
14,Every child should have this!,5.0,We had purchased Melissa & Doug magnets & had ...,positive
15,Quality product,5.0,We purchased this to go with the magnetic lett...,positive
16,great size,5.0,This easel is the perfect size for my three ye...,positive


In [7]:
another_10rows = df[df["rating"] == 5.0][100:110]
another_10rows["polarity"] = "positive"
another_10rows

Unnamed: 0,summary,rating,review,polarity
180,This boardgame summarizes Ravenloft D&D,5.0,"Let""s face it- D&D; Ravenloft campaign setting...",positive
182,"Can be difficult, but always fun",5.0,"This is a very fun game, and a great introduct...",positive
183,Great value,5.0,I am new to Dungeons & Dragons concept so I wa...,positive
184,good buy,5.0,I got this game from a local shop and I must s...,positive
185,"A fast, enjoyable board game",5.0,Wrath of Ashardalon gets everything right for ...,positive
187,Another great D&D board game.,5.0,I have loved all the D&D board games from Wiza...,positive
188,Fun fun fun!,5.0,"I own this and Castle Ravenloft, which are the...",positive
190,Best Dungeon Crawler,5.0,This is in my opinion the best dungeon crawler...,positive
195,Legend of Drizzt,5.0,"Well, this is an awesome D&D dungeon crawling ...",positive
196,Great value. Good replay.,5.0,I was lucky enough to snag this off Amazon for...,positive


In [8]:
#this is how you can concatentate 2 dataframes
pd.concat([first_10rows, another_10rows])

Unnamed: 0,summary,rating,review,polarity
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...,positive
2,love this!,5.0,Both sides are magnetic. A real plus when you...,positive
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...,positive
7,Great,5.0,My granddaughter really really likes this. I l...,positive
10,Works great!,5.0,Very nice to use with magnet letters and pract...,positive
11,Perfect magnetic board!!!,5.0,Bought this board along with Melissa and Doug ...,positive
13,great magnet board,5.0,sturdy and perfect for coffee table. magnets s...,positive
14,Every child should have this!,5.0,We had purchased Melissa & Doug magnets & had ...,positive
15,Quality product,5.0,We purchased this to go with the magnetic lett...,positive
16,great size,5.0,This easel is the perfect size for my three ye...,positive


### Task: Perform Data Modeling
**Tip: when defining the feature extraction function, it's a good idea to convert the input into `str` type before tokenization**

In [9]:
#here are some additional pandas dataframe manipulation codes to help you along

#this will allow you to convert a pandas dataframe column into a list
first_10rows["review"].tolist()

['I like the item pricing. My granddaughter wanted to mark on it but I wanted it just for the letters.',
 "Both sides are magnetic.  A real plus when you're entertaining more than one child.  The four-year old can find the letters for the words, while the two-year old can find the pictures the words spell.  (I bought letters and magnetic pictures to go with this board).  Both grandkids liked it a lot, which means I like it a lot as well.  Have not even introduced markers, as this will be used strictly as a magnetic board.",
 'Bought one a few years ago for my daughter and she loves it, still using it today. For the holidays we bought one for our niece and she loved it too.',
 'My granddaughter really really likes this. I love that you can just fold it up and put it away. Would definately recommend.',
 'Very nice to use with magnet letters and practice spelling. Holds the letters great and folds together nicely for storage. You can use it on a table or even the floor.',
 'Bought this bo

In [10]:
#you can iterate 2 list using the zip() function
for review, polarity in zip(first_10rows["review"].tolist(), first_10rows["polarity"].tolist()):
    print(review, polarity)

I like the item pricing. My granddaughter wanted to mark on it but I wanted it just for the letters. positive
Both sides are magnetic.  A real plus when you're entertaining more than one child.  The four-year old can find the letters for the words, while the two-year old can find the pictures the words spell.  (I bought letters and magnetic pictures to go with this board).  Both grandkids liked it a lot, which means I like it a lot as well.  Have not even introduced markers, as this will be used strictly as a magnetic board. positive
Bought one a few years ago for my daughter and she loves it, still using it today. For the holidays we bought one for our niece and she loved it too. positive
My granddaughter really really likes this. I love that you can just fold it up and put it away. Would definately recommend. positive
Very nice to use with magnet letters and practice spelling. Holds the letters great and folds together nicely for storage. You can use it on a table or even the floor. 

### Task: Evaluate the Performance of your Model(s)
Try different text preprocessing techniques to generalize the data and report your findings how each type of feature configuration affects the performance

In [99]:
import pandas as pd
import random
#load in the dataset
df = pd.read_csv("toys_n_games.csv")
df.head()

Unnamed: 0,summary,rating,review
0,Magnetic board,5.0,I like the item pricing. My granddaughter want...
1,it works pretty good for moving to different a...,4.0,Love the magnet easel... great for moving to d...
2,love this!,5.0,Both sides are magnetic. A real plus when you...
3,Daughters love it,5.0,Bought one a few years ago for my daughter and...
4,Great to have so he can play with his alphabet...,4.0,I have a stainless steel refrigerator therefor...


In [100]:
positive_rows = df[df["rating"] == 5.0][0:1000]
positive_rows = positive_rows["review"].tolist()
# positive_rows["polarity"] = "positive"
# positive_rows

In [101]:
for i in positive_rows:
    output = nltk.word_tokenize(i)
    print(output)

['I', 'like', 'the', 'item', 'pricing', '.', 'My', 'granddaughter', 'wanted', 'to', 'mark', 'on', 'it', 'but', 'I', 'wanted', 'it', 'just', 'for', 'the', 'letters', '.']
['Both', 'sides', 'are', 'magnetic', '.', 'A', 'real', 'plus', 'when', 'you', "'re", 'entertaining', 'more', 'than', 'one', 'child', '.', 'The', 'four-year', 'old', 'can', 'find', 'the', 'letters', 'for', 'the', 'words', ',', 'while', 'the', 'two-year', 'old', 'can', 'find', 'the', 'pictures', 'the', 'words', 'spell', '.', '(', 'I', 'bought', 'letters', 'and', 'magnetic', 'pictures', 'to', 'go', 'with', 'this', 'board', ')', '.', 'Both', 'grandkids', 'liked', 'it', 'a', 'lot', ',', 'which', 'means', 'I', 'like', 'it', 'a', 'lot', 'as', 'well', '.', 'Have', 'not', 'even', 'introduced', 'markers', ',', 'as', 'this', 'will', 'be', 'used', 'strictly', 'as', 'a', 'magnetic', 'board', '.']
['Bought', 'one', 'a', 'few', 'years', 'ago', 'for', 'my', 'daughter', 'and', 'she', 'loves', 'it', ',', 'still', 'using', 'it', 'today',

['I', 'highly', 'recommend', 'The', 'Elf', 'on', 'the', 'Shelf', '.', 'We', 'have', 'done', 'it', 'for', '8', 'days', 'now', 'and', 'my', 'daughter', 'loves', 'waking', 'up', 'every', 'morning', 'and', 'finding', 'out', 'where', 'and', 'what', 'Joey', 'the', 'elf', 'has', 'done', '.', 'I', 'ca', "n't", 'wait', 'to', 'carry', 'on', 'this', 'tradition', '.']
['Loved', 'it', '!', 'Too', 'bad', 'the', 'Elf', 'forgets', 'some', 'nights', 'to', 'move', 'on', 'a', 'different', 'shelf', ',', 'oops', '...', 'Oh', ',', 'and', 'when', 'the', 'kids', 'ask', "'Mommy", ',', 'why', 'does', 'the', 'Elf', 'has', 'a', 'tag', 'on', 'his', 'back', '?', "'", 'just', 'answer', 'with', "'It", 'says', "'if", 'found', ',', 'return', 'to', 'the', 'North', 'Pole', '!', "''"]
['I', "'ll", 'admit', ',', 'I', 'like', 'it', ',', 'we', 'have', 'a', '4', 'and', '6', 'year', 'old', 'and', 'they', 'ate', 'it', 'up', '.', 'It', 'made', 'Christmas', 'really', 'fun', 'this', 'year', '.', 'Some', 'drawbacks', ':', '1', ')',

['I', 'do', "n't", 'think', 'I', 'have', 'ever', 'played', 'a', 'board', 'game', 'that', 'so', 'perfectly', 'captured', 'the', 'feel', 'of', 'the', 'source', 'story', '.', 'Lies', '.', 'Betrayal', '.', 'Paranoia', '.', 'Enemies', 'in', 'every', 'corner', 'and', 'the', 'universe', 'stacked', 'against', 'you', '.', 'As', 'the', 'human', 'player', ',', 'you', 'know', 'full', 'well', 'that', 'if', 'you', 'win', 'it', 'is', 'only', 'going', 'to', 'be', 'by', 'the', 'skin', 'of', 'your', 'teeth', ',', 'and', 'certainly', 'not', 'with', 'all', 'of', 'your', 'crew', 'intact', '.', 'In', 'fact', ',', 'after', 'several', 'games', 'we', 'finally', 'had', 'our', 'first', '``', 'win', "''", 'last', 'night', 'after', 'a', 'hard', 'fought', 'battle', '.', 'It', 'was', 'a', 'brilliant', ',', 'suspenseful', 'game', 'with', 'the', 'power', 'balance', 'of', 'the', 'Cylon', 'players', 'and', 'the', 'human', 'players', 'coming', 'down', 'to', 'a', 'single', 'flipped', 'card.The', 'game', 'play', 'is', 'com

In [102]:
rows_negative = df[df["rating"] == 1.0][0:1000]
rows_negative= rows_negative["review"].tolist()

In [103]:
testing_data = [(nltk.word_tokenize(row),"pos") for row in positive_rows]

In [106]:
test_data = [(nltk.word_tokenize(str(row)),"pos") for row in positive_rows] + [(nltk.word_tokenize(str(row)),"neg") for row in rows_negative] 
len(testing_data)

2000

In [108]:
size_dataset = len(test_data)
size_dataset

train_size = int(0.7*size_dataset)
test_size = size_dataset - train_size

print("size:", size_dataset, "train_size:", train_size, "test_size:", test_size)
training_data = test_data*train_size
testing_data = test_data*test_size

size: 2000 train_size: 1400 test_size: 600


In [112]:
def extract_unigram_feature(words):
    #each word is a feature
    #there are many entries so need to use dict() instead of {} literal
    return dict((word, True) for word in words)

In [113]:
training_data[0]

(['I',
  'like',
  'the',
  'item',
  'pricing',
  '.',
  'My',
  'granddaughter',
  'wanted',
  'to',
  'mark',
  'on',
  'it',
  'but',
  'I',
  'wanted',
  'it',
  'just',
  'for',
  'the',
  'letters',
  '.'],
 'pos')

In [114]:
training_feature_vector = [(extract_unigram_feature(words), label) for (words, label) in training_data]
print(training_feature_vector[:1])

[({'I': True, 'like': True, 'the': True, 'item': True, 'pricing': True, '.': True, 'My': True, 'granddaughter': True, 'wanted': True, 'to': True, 'mark': True, 'on': True, 'it': True, 'but': True, 'just': True, 'for': True, 'letters': True}, 'pos')]


In [115]:
testing_feature_vector = [(extract_unigram_feature(words), label) for (words, label) in testing_data]
print(testing_feature_vector[:1])

[({'I': True, 'like': True, 'the': True, 'item': True, 'pricing': True, '.': True, 'My': True, 'granddaughter': True, 'wanted': True, 'to': True, 'mark': True, 'on': True, 'it': True, 'but': True, 'just': True, 'for': True, 'letters': True}, 'pos')]


In [None]:
nb_classifier = nltk.NaiveBayesClassifier.train(training_feature_vector)

In [None]:
prediction = nb_classifier.classify(testing_feature_vector[0][0])
actual_class = testing_feature_vector[0][1]
print("Prediction:", prediction, " Actual class:", actual_class)

In [None]:
nltk.classify.accuracy(nb_classifier, testing_feature_vector)

### Confusion matrix

In [None]:
predictions = [nb_classifier.classify(feature_vect[0]) for feature_vect in testing_feature_vector]
actual = [feature_vect[1] for feature_vect in testing_feature_vector]

In [None]:
labels = ["pos","neg"]

In [None]:
cm = confusion_matrix(actual, predictions, labels=labels)
cm

In [None]:
pd.DataFrame(cm,
             index = labels, 
             columns = labels)

### Submission By: Andrea Er, Phey Song wei, Sergio Kun, Glen Loy