# 422 - Big Data - Bonus Assignment - Amazon Review Sentiment Analysis

Team Nugget - Crystal, Daria, Mei, Pallavi, Sarah

March 10th, 2018

### Requirements

* http://jmcauley.ucsd.edu/data/amazon/
* Amazon Reviews and Ratings Data
* Includes metadata for things like ‘related products’
* Use the ‘small’ 5-core books dataset
* Use the free text of the reviews to do a sentiment analysis
* Group sentiment into Negative, Neutral, Positive
* Understand characteristics of the hardware you’re running this on
* Time execution and measure memory usage of both training phase and model usage phase
* Estimate (in words and numbers, show work) time (and resources) required to handle full book dataset with daily updates
* Estimate (in words and numbers, show work) time and resources required to use daily-built models to classify sentiment of a new review within 100 milliseconds

<img src="System-Config.png">

### Import Libraries & Data

In [13]:
import json
import pandas as pd
import numpy as np
import resource
from memory_profiler import profile
from time import time
%load_ext memory_profiler

import re, nltk
from nltk.stem.porter import PorterStemmer
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


In [5]:
#data = pd.read_json("review_Books_5.json", lines = True)
data = pd.read_json("sample_100k.json", lines = True) # sampled 100k rows from 5-core book using terminal
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 9 columns):
asin              100000 non-null object
helpful           100000 non-null object
overall           100000 non-null int64
reviewText        100000 non-null object
reviewTime        100000 non-null object
reviewerID        100000 non-null object
reviewerName      99951 non-null object
summary           100000 non-null object
unixReviewTime    100000 non-null int64
dtypes: int64(2), object(7)
memory usage: 7.6+ MB


In [6]:
data.head()

| | asin | helpful | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 000100039X | [0, 0] | 5 | Spiritually and mentally inspiring! A book tha... | 12 16, 2012 | A10000012B7CGYKOMPQ4L | Adam | Wonderful! | 1355616000 |
| 1 | 000100039X | [0, 2] | 5 | This is one my must have books. It is a master... | 12 11, 2003 | A2S166WSCFIFP5 | adead_poet@hotmail.com "adead_poet@hotmail.com" | close to god | 1071100800 |
| 2 | 000100039X | [0, 0] | 5 | This book provides a reflection that you can a... | 01 18, 2014 | A1BM81XB4QHOA3 | Ahoro Blethends "Seriously" | Must Read for Life Afficianados | 1390003200 |
| 3 | 000100039X | [0, 0] | 5 | I first read THE PROPHET in college back in th... | 09 27, 2011 | A1MOSTXNIO5MPJ | Alan Krug | Timeless for every good and bad time in your l... | 1317081600 |
| 4 | 000100039X | [7, 9] | 5 | A timeless classic. It is a very demanding an... | 10 7, 2002 | A2XQ5LZHTD4AFT | Alaturka | A Modern Rumi | 1033948800 |


In [7]:
data.overall.unique()

array([5, 3, 2, 4, 1])

### Pre-process Data

In [8]:
def assign_sentiment(row):
    """Map one review's star rating (1-5) to a sentiment label."""
    if row['overall'] > 3:
        value = 1                 # 1 implies POSITIVE SENTIMENT
    elif row['overall'] < 3:
        value = -1                # -1 implies NEGATIVE SENTIMENT
    else:
        value = 0                 # 0 implies NEUTRAL SENTIMENT
    return value

# Apply the rule row by row (axis=1) to label every review
data['sentiment'] = data.apply(assign_sentiment, axis=1)
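
The row-wise `apply` above calls a Python function once per review. A minimal sketch of a vectorized alternative follows, assuming the same thresholds as `assign_sentiment`; the `toy` frame and its ratings are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings standing in for the `overall` column.
toy = pd.DataFrame({"overall": [5, 4, 3, 2, 1]})

# Same rule as assign_sentiment: > 3 is positive, < 3 is negative, 3 is neutral.
toy["sentiment"] = np.select(
    [toy["overall"] > 3, toy["overall"] < 3],  # conditions, checked in order
    [1, -1],                                   # label for each condition
    default=0,                                 # neither condition: neutral
)

print(toy)
```

On the full frame the same call would read `np.select([data.overall > 3, data.overall < 3], [1, -1], default=0)`; whether it is noticeably faster on this machine has not been measured here.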

In [9]:
review = data[['reviewText','sentiment']]
review.head()

| | reviewText | sentiment |
|---|---|---|
| 0 | Spiritually and mentally inspiring! A book tha... | 1 |
| 1 | This is one my must have books. It is a master... | 1 |
| 2 | This book provides a reflection that you can a... | 1 |
| 3 | I first read THE PROPHET in college back in th... | 1 |
| 4 | A timeless classic. It is a very demanding an... | 1 |


In [11]:
nltk.download('punkt')

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    # reduce each token to its Porter stem
    return [stemmer.stem(item) for item in tokens]

def tokenize(text):
    # remove non letters
    text = re.sub("[^a-zA-Z]", " ", text)
    # tokenize
    tokens = nltk.word_tokenize(text)
    # stem
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vectorizer = CountVectorizer(
    analyzer = 'word',
    tokenizer = tokenize,
    lowercase = True,
    stop_words = 'english',
    max_features = 85
)

[nltk_data] Downloading package punkt to /Users/pallavi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Model Training Phase

In [12]:
t1 = time()
%memit corpus_data_features = vectorizer.fit_transform(review.reviewText.tolist())
%memit corpus_data_features_nd = corpus_data_features.toarray()
corpus_data_features_nd.shape
%memit X_train, X_test, y_train, y_test  = train_test_split(corpus_data_features_nd, review.sentiment, train_size=0.7, random_state=1)
%memit log_model = LogisticRegression()
%memit log_model = log_model.fit(X=X_train, y=y_train)
t2 = time()
print("Time:",format(t2-t1))  # wall-clock seconds for the whole cell, including %memit overhead

peak memory: 640.33 MiB, increment: 1.66 MiB
peak memory: 649.57 MiB, increment: 129.46 MiB
peak memory: 587.59 MiB, increment: 2.88 MiB
peak memory: 458.90 MiB, increment: 0.05 MiB
peak memory: 532.99 MiB, increment: 74.09 MiB
Time: 489.6602680683136
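
The `.toarray()` step accounts for the largest memory increment in the cell above (129.46 MiB). `LogisticRegression` in scikit-learn also accepts the sparse matrix that `fit_transform` returns, so the dense copy may be avoidable. The sketch below shows the idea on a made-up six-review corpus; `texts`, `labels`, `toy_vectorizer` and `toy_model` are illustrative names, and the effect on time, memory or accuracy for the 100k sample has not been measured here.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus with the same three labels used in this notebook.
texts = [
    "great book loved it",
    "terrible boring plot",
    "it was okay",
    "loved the great plot",
    "boring and terrible",
    "okay read",
]
labels = [1, -1, 0, 1, -1, 0]

toy_vectorizer = CountVectorizer()
X_sparse = toy_vectorizer.fit_transform(texts)  # stays a SciPy sparse matrix

# Fit directly on the sparse counts; no .toarray() call is needed.
toy_model = LogisticRegression().fit(X_sparse, labels)

# A new review is vectorized with transform (not fit_transform) and classified.
pred = toy_model.predict(toy_vectorizer.transform(["loved it"]))
print(issparse(X_sparse), pred)
```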


### Model Usage Phase

In [14]:
t1 = time()
%memit y_pred = log_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
t2 = time()
print("Time:",format(t2-t1))  # wall-clock seconds for predict + accuracy_score, including %memit overhead

peak memory: 563.96 MiB, increment: 37.80 MiB
0.7923
Time: 0.41533803939819336
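
The requirements ask for estimates with the work shown. A rough starting point, using only the timings printed above, is sketched below. It assumes cost grows linearly with the number of reviews, which is an assumption and not something measured in this notebook. `FULL_ROWS` is also an assumption (the review count for the full 5-core Books file should be checked against the dataset page), and `linear_estimate` is a helper introduced only for this sketch.

```python
# Measured values copied from the cell outputs above.
TRAIN_SECONDS = 489.66   # whole training cell on the 100k sample
SAMPLE_ROWS = 100_000
PREDICT_SECONDS = 0.4153  # predict + accuracy_score on the test split
TEST_ROWS = 30_000        # 30% of 100k, since train_size=0.7

# ASSUMPTION: number of reviews in the full 5-core Books file.
FULL_ROWS = 8_898_041


def linear_estimate(seconds, rows, target_rows):
    """Scale a measured time to a different row count, assuming linear cost."""
    return seconds * target_rows / rows


full_train_hours = linear_estimate(TRAIN_SECONDS, SAMPLE_ROWS, FULL_ROWS) / 3600
per_review_ms = PREDICT_SECONDS / TEST_ROWS * 1000

print(f"Estimated full retrain: {full_train_hours:.1f} hours")
print(f"Measured predict cost per review: {per_review_ms:.4f} ms")
```

The per-review figure covers only `predict` on rows that were already vectorized. A new review would first pass through `tokenize` and `vectorizer.transform`, so the time to compare against the 100 millisecond budget still needs to be measured end to end.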
