# Dataset

#### This dataset consists of reviews of fine foods from Amazon. <br>The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. <br> It also includes reviews from all other Amazon categories. <br><br>Dataset: https://www.kaggle.com/snap/amazon-fine-food-reviews

#### Dataset info:

Number of reviews: 568,454 <br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10<br><br>

<b>Attribute Information:</b><br>

Id - unique identifier for the review<br>
ProductId - unique identifier for the product<br>
UserId - unique identifier for the user<br>
ProfileName - profile name of the user<br>
HelpfulnessNumerator - number of users who found the review helpful<br>
HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not<br>
Score - rating between 1 and 5<br>
Time - timestamp for the review<br>
Summary - brief summary of the review<br>
Text - text of the review<br>

<br>
The dataset is a .sqlite file.

In [109]:
# Imports

import warnings
warnings.filterwarnings('ignore')
import re
import os
import string

# normal DS imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3 

# sklearn imports
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

# NLP imports
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk import FreqDist

# For Word2Vec
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

In [4]:
# Load the dataset using sqlite

con = sqlite3.connect('database.sqlite')

original_data = pd.read_sql_query(""" SELECT * FROM Reviews """, con)

original_data.shape

(568454, 10)

# Objective

#### Given a review, our objective is to determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2)

Since we only want to predict whether a review is positive or negative, we can drop the rows where 'Score' is 3 (neutral). <br>
We will then recode the 'Score' column as a binary label (1 = positive, 0 = negative), which we aim to predict from the sentiment in the 'Text' column.

In [19]:
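# Keep only non-neutral reviews; .copy() avoids pandas' SettingWithCopyWarning below
filtered_data = original_data[original_data['Score'] != 3].copy()


# Let's recode 'Score': ratings above 3 become 1 (positive), the rest 0 (negative)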
def partition(x):
    if x > 3:
        return 1
    return 0

actualScores = filtered_data['Score']
positiveNegative = actualScores.map(partition)
filtered_data['Score'] = positiveNegative
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


# Data Cleaning: Deduplication

In [20]:
cleaning= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId = 'AR5J8UI46CURR'
ORDER BY ProductId
""", con)
cleaning.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


<b>Observation 1:</b> When a product has variations, Amazon shares the same review across all of them, so one review appears under several ProductIds (the query above shows one user's identical review attached to five different products). It makes sense to keep a single copy of each such review and drop the rest; otherwise the copies would be trivially easy for our ML model to predict from the one it has already seen. <b>Let's remove the duplicates.</b>

In [21]:
# Let's sort the data by ProductId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [23]:
# Let's do the deduplication: keep the first row for each (UserId, ProfileName, Time, Text)
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first", inplace=False)

final.shape

(364173, 10)

In [24]:
# Check how much data is preserved after deduplication
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation 2: </b>'HelpfulnessNumerator' should never be greater than 'HelpfulnessDenominator'.<br>A few rows violate this. <b>Let's drop them.</b>

In [29]:
final[final['HelpfulnessNumerator'] > final['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
64421,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,1,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
44736,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,1,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [30]:
final = final[final['HelpfulnessNumerator'] <= final['HelpfulnessDenominator']]

In [32]:
final.shape

(364171, 10)

In [33]:
final['Score'].value_counts()

1    307061
0     57110
Name: Score, dtype: int64

 <b>Observation 3: </b>The food reviews data contains some reviews of books. We can use the 'Text' and 'Summary' columns to flag likely book reviews by looking for words like 'book', 'read', 'reading'.<br>
 It is also possible that some genuine food reviews contain these words, so it is up to us whether to delete this data or not. Since we got this data from the data team, we will assume that these reviews are for food and not for books.<br>
 <b>If this data ever needs to be deleted, change the cell below to a 'Code' cell and run it.</b>

def apply_mask_summary(data, regex_string):
    # Drop rows whose Summary matches the pattern (missing values never match)
    mask = data.Summary.str.lower().str.contains(regex_string, na=False)
    data.drop(data[mask].index, inplace=True)

def apply_mask_text(data, regex_string):
    # Drop rows whose Text matches the pattern (missing values never match)
    mask = data.Text.str.lower().str.contains(regex_string, na=False)
    data.drop(data[mask].index, inplace=True)


# One pattern covers book, books, read, reads and reading as whole words
book_pattern = re.compile(r"\b(?:books?|reads?|reading)\b")

apply_mask_summary(final, book_pattern)
apply_mask_text(final, book_pattern)
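The `\b` word-boundary anchors in the patterns above keep the filter from firing on food words that merely contain 'read' or 'book'. A minimal sketch with made-up summaries (the sample strings are illustrative assumptions, not rows from the dataset):

```python
import re

# Same style of pattern as above: whole words only
pattern = re.compile(r"\bread\b")

# Hypothetical summaries, invented for illustration
samples = ["great bread mix", "a book to read aloud", "ready in minutes"]

# True only where "read" appears as a standalone word
matches = [bool(pattern.search(s.lower())) for s in samples]
print(matches)  # [False, True, False]
```

Without the anchors, every review mentioning 'bread' would be flagged as a book review.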

# Text pre-processing

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , . #
3. Check that the word consists only of English letters (i.e. it is not alphanumeric)
4. Check that the word is longer than 2 characters (reportedly, there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)

<b>Observation 1: </b>A few of the text reviews include hyperlinks/URLs. <b>Let's remove those.</b>

<b>Observation 2: </b>A few of the text reviews include HTML tags. <b>Let's remove those.</b>

In [69]:
def cleanhtml(sentence):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
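Observation 1 mentions URLs, but no cell here removes them. One possible sketch, assuming links start with `http://`, `https://` or `www.`; the helper name `cleanurl` and its pattern are illustrative assumptions, not part of the original notebook:

```python
import re

def cleanurl(sentence):
    # Replace anything that looks like a link with a space.
    # The pattern is a simple heuristic and will miss bare domains like example.com
    url_pattern = re.compile(r'(?:https?://|www\.)\S+')
    return re.sub(url_pattern, ' ', sentence)

example = 'See http://www.amazon.com/gp/product/B000 for details'
cleaned = cleanurl(example)
print(cleaned.split())  # ['See', 'for', 'details']
```

It would be applied after `cleanhtml`, so that links sitting inside tag attributes are already gone and the pattern does not swallow the tag's visible text.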

<b>Observation 3: </b>Let's remove punctuation<br>

In [67]:
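def cleanpunc(sentence):
    # Delete these marks outright (inside [...] a | is a literal character, not "or")
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    # Replace separators with a space so the words on either side stay apart
    cleaned = re.sub(r'[.,)(/]', r' ', cleaned)
    return cleaned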

<b>Observation 4: </b>The text reviews include contractions like he's, they're, you've. <br>We need to expand these contractions so that "he's" is treated the same as "he is".<br>
https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490
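A minimal sketch of contraction expansion, assuming a small hand-written lookup table; the `CONTRACTIONS` dictionary and the `expand_contractions` helper are hypothetical names, and a real table (such as the one in the linked answer) would be much longer. It would have to run before `cleanpunc`, which strips apostrophes:

```python
import re

# Tiny illustrative table; "he's" is ambiguous (he is / he has), "he is" is assumed here
CONTRACTIONS = {
    "he's": "he is",
    "they're": "they are",
    "you've": "you have",
    "we're": "we are",
    "don't": "do not",
}

# One alternation over all keys, matched case-insensitively on word boundaries
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(sentence):
    # Look up the lower-cased match; the replacement is always lower case
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1).lower()], sentence)

print(expand_contractions("He's sure they're fresh"))  # he is sure they are fresh
```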

<b>Observation 5: </b>Remove stop words & do stemming

In [66]:
stop = set(stopwords.words('english'))
sno = SnowballStemmer('english')

In [93]:
# Let's apply all of the above steps now

final_string = []          # cleaned review text, one entry per review
all_positive_words = []    # stemmed words from positive reviews
all_negative_words = []    # stemmed words from negative reviews

for sent, score in zip(final['Text'].values, final['Score'].values):
    filtered_sentence = []
    sent = cleanhtml(sent)  # Remove HTML tags

    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # Keep alphabetic words longer than 2 characters that are not stop words
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if cleaned_words.lower() not in stop:
                    # Stem, then encode to bytes (hence the b'...' prefix in the output)
                    s = sno.stem(cleaned_words.lower()).encode('utf-8')
                    filtered_sentence.append(s)
                    if score == 1:
                        all_positive_words.append(s)
                    if score == 0:
                        all_negative_words.append(s)

    final_string.append(b" ".join(filtered_sentence))  # Final string of cleaned words

In [94]:
final['CleanedText'] = final_string

In [95]:
final.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
150523,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh loud recit c...
150505,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak book watch realli rosi movi...
150506,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
150507,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
150508,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month year goe month cute littl ...


In [96]:
# Let's store this final table in SQLite

conn = sqlite3.connect('final.sqlite')
c = conn.cursor()
conn.text_factory = str

final.to_sql('Reviews', conn, schema=None, if_exists = 'replace')

### Quick analysis to decide n-gram range

Now that we have lists of the words used in positive and negative reviews, let's analyse them.
<br>Let's begin by getting the frequency distribution of the words.

In [97]:
freq_dist_positive = FreqDist(all_positive_words)
freq_dist_negative = FreqDist(all_negative_words)

In [98]:
print(freq_dist_positive.most_common(20))

[(b'like', 139429), (b'tast', 129047), (b'good', 112766), (b'flavor', 109624), (b'love', 107357), (b'use', 103888), (b'great', 103870), (b'one', 96726), (b'product', 91033), (b'tri', 86791), (b'tea', 83888), (b'coffe', 78814), (b'make', 75107), (b'get', 72125), (b'food', 64802), (b'would', 55568), (b'time', 55264), (b'buy', 54198), (b'realli', 52715), (b'eat', 52004)]


In [99]:
print(freq_dist_negative.most_common(20))

[(b'tast', 34585), (b'like', 32330), (b'product', 28218), (b'one', 20569), (b'flavor', 19575), (b'would', 17972), (b'tri', 17753), (b'use', 15302), (b'good', 15041), (b'coffe', 14716), (b'get', 13786), (b'buy', 13752), (b'order', 12871), (b'food', 12754), (b'dont', 11877), (b'tea', 11665), (b'even', 11085), (b'box', 10844), (b'amazon', 10073), (b'make', 9840)]


Notice that 'like' is also among the most frequent words in negative reviews, which suggests that it often occurs there as 'not like'. This indicates that bi-grams can make the model stronger.
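To illustrate why bi-grams help, here is a plain-Python sketch on two made-up token lists; `ngrams` is a hypothetical helper, not the `CountVectorizer` used in the next section. The uni-gram 'like' appears in both reviews, while the bi-gram 'not like' appears only in the negative one:

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented examples, not rows from the dataset
positive = ["really", "like", "this", "tea"]
negative = ["did", "not", "like", "this", "tea"]

pos_bigrams = Counter(ngrams(positive, 2))
neg_bigrams = Counter(ngrams(negative, 2))

print("like" in positive and "like" in negative)  # True
print(neg_bigrams["not like"], pos_bigrams["not like"])  # 1 0
```

The toy tokens keep the word 'not'. NLTK's English stop-word list includes 'not', so in the pipeline above a negation bi-gram would only survive if such words were kept out of the `stop` set.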

# Bag of Words (BoW)

In [100]:
# BoW for uni-gram
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['CleanedText'].values)

final_counts.shape

(364171, 71624)

In [101]:
# BoW for bi-gram
count_vect = CountVectorizer(ngram_range=(1,2))
final_counts = count_vect.fit_transform(final['CleanedText'].values)

final_counts.shape

(364171, 2923725)

The number of features grows massively, from 71,624 to 2,923,725, when bi-grams are added.

# TF-IDF 

In [102]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(final['CleanedText'].values)

In [103]:
final_tf_idf.shape

(364171, 2923725)

In [104]:
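# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = list(tf_idf_vect.get_feature_names_out())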
len(features)

2923725

In [105]:
features[1000:1010]

['abit celeri',
 'abit cheaper',
 'abit expens',
 'abit fancier',
 'abit get',
 'abit idiot',
 'abit larg',
 'abit noth',
 'abit pricey',
 'abit reason']

In [106]:
# Get top TFIDF features
def top_tfidf_feats(row, features, top_n=25):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0], features, 25)

In [107]:
top_tfidf

Unnamed: 0,feature,tfidf
0,page open,0.192673
1,read sendak,0.192673
2,movi incorpor,0.192673
3,paperback seem,0.192673
4,version paperback,0.192673
5,flimsi take,0.192673
6,incorpor love,0.192673
7,rosi movi,0.192673
8,keep page,0.192673
9,grew read,0.192673


# Word2Vec

We can either train our own Word2Vec model on our corpus or reuse someone else's pretrained model, e.g. Google's Word2Vec.

In [110]:
# Let's build the tokenized sentences for training our own Word2Vec model on this corpus
list_of_sent = []
for sent in final['Text'].values:
    filtered_sentence = []
    sent = cleanhtml(sent)  # Remove HTML tags
    
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            # Keep purely alphabetic tokens, lower-cased (no stop-word removal or stemming)
            if cleaned_words.isalpha():
                filtered_sentence.append(cleaned_words.lower())
    list_of_sent.append(filtered_sentence)

In [111]:
print(final['Text'].values[0])
print("*"*50)
print(list_of_sent[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
**************************************************
['this', 'witty', 'little', 'book', 'makes', 'my', 'son', 'laugh', 'at', 'loud', 'i', 'recite', 'it', 'in', 'the', 'car', 'as', 'were', 'driving', 'along', 'and', 'he', 'always', 'can', 'sing', 'the', 'refrain', 'hes', 'learned', 'about', 'whales', 'india', 'drooping', 'i', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'of', 'it', 'all', 'this', 'is', 'a', 'classic', 'book', 'i', 'am', 'willing', 'to', 'bet', 'my', 'son', 'will', 'still', 'be', 'able', 'to', 'recite', 'from', 'memory', 'when', 'he', 'is', 'in', 'colleg

In [112]:
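# Let's train our model
# vector_size was called size in gensim < 4.0
w2v_model = gensim.models.Word2Vec(list_of_sent, min_count=5, vector_size=50, workers=4)
# workers is the number of worker threads used while building the model

In [113]:
words = list(w2v_model.wv.key_to_index)  # wv.vocab in gensim < 4.0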
len(words)

33783

In [114]:
w2v_model.wv.most_similar('tasty')

[('tastey', 0.8893786072731018),
 ('yummy', 0.8561932444572449),
 ('satisfying', 0.8441725373268127),
 ('delicious', 0.8217435479164124),
 ('filling', 0.8205041289329529),
 ('flavorful', 0.8001201748847961),
 ('addicting', 0.7849034070968628),
 ('tasteful', 0.7772332429885864),
 ('nutritious', 0.7601455450057983),
 ('delectable', 0.745866596698761)]
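
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that measure in plain Python, using made-up 3-dimensional vectors rather than the 50-dimensional ones trained above:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only
tasty = [1.0, 2.0, 0.0]
yummy = [2.0, 4.0, 0.0]   # same direction as `tasty`
stale = [0.0, 0.0, 3.0]   # orthogonal to `tasty`

print(cosine_similarity(tasty, yummy))
print(cosine_similarity(tasty, stale))
```

Vectors pointing the same way score close to 1 and unrelated (orthogonal) ones score 0, which is why near-synonyms such as 'tastey' and 'yummy' rank highest for 'tasty'.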