# Amazon Fine Food Reviews Analysis
Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews
The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10
Attribute Information:
1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review
Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).







In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

!pip install nltk
import nltk
!pip install seaborn
import seaborn as sns

df = pd.read_csv("Reviews.csv")


In [2]:

df.describe()

Unnamed: 0,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
count,568454.0,568454.0,568454.0,568454.0,568454.0
mean,284227.5,1.743817,2.22881,4.183199,1296257000.0
std,164098.679298,7.636513,8.28974,1.310436,48043310.0
min,1.0,0.0,0.0,1.0,939340800.0
25%,142114.25,0.0,0.0,4.0,1271290000.0
50%,284227.5,0.0,1.0,5.0,1311120000.0
75%,426340.75,2.0,2.0,5.0,1332720000.0
max,568454.0,866.0,923.0,5.0,1351210000.0
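
The Time column stores Unix timestamps in seconds. As a quick check against the stated timespan, the sketch below converts two values that appear in this notebook's outputs (the minimum reported by `describe()` and the timestamp of the review with Id 10) to UTC dates using only the standard library. The helper name `to_utc_date` is illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

def to_utc_date(ts):
    """Convert a Unix timestamp in seconds to an ISO date string (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

earliest = to_utc_date(939340800)    # minimum of the Time column
latest = to_utc_date(1351209600)     # Time of the review with Id 10
print(earliest, latest)
```

Both dates fall within the Oct 1999 - Oct 2012 range given in the dataset description.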


In [3]:
# Scores of 1 or 2 are treated as negative reviews and scores of 4 or 5 as positive.
# A score of 3 is considered neutral, so those reviews are dropped.
filtered_data = df.loc[df['Score'] !=3]
filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


In [4]:
filtered_data["Score"].value_counts()

5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64
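
The counts above are heavily skewed toward positive reviews, which matters later when judging a classifier. A small sketch of the arithmetic, with the counts copied by hand from the output above (the variable names are illustrative):

```python
# Counts copied from the value_counts() output above
counts = {5: 363122, 4: 80655, 1: 52268, 2: 29769}

positive = counts[4] + counts[5]   # ratings treated as positive
negative = counts[1] + counts[2]   # ratings treated as negative
total = positive + negative

print(f"positive: {positive} ({positive / total:.1%})")
print(f"negative: {negative} ({negative / total:.1%})")
```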

In [5]:
def partition(x):
    if x < 3:
        return "negative"
    else:
        return "positive"

In [6]:
score_column = filtered_data["Score"]
l = list()
for i in score_column:
    l.append(partition(i))
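
The same labels can be produced without an explicit loop. A minimal sketch, assuming a small hypothetical Series standing in for `filtered_data["Score"]`; `Series.map` applies `partition` to each element:

```python
import pandas as pd

def partition(x):
    # Scores below 3 are negative; 3 has already been removed, so the rest are positive
    return "negative" if x < 3 else "positive"

scores = pd.Series([5, 1, 4, 2])   # hypothetical stand-in for filtered_data["Score"]
labels = scores.map(partition)
print(labels.tolist())
```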
    

In [7]:
se = pd.Series(l)
# assign() returns a new DataFrame, avoiding the SettingWithCopyWarning raised by assigning to a filtered slice
filtered_data = filtered_data.assign(score=se.values)


In [8]:
filtered_data=filtered_data.drop("Score",axis = 1)


In [9]:
filtered_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text,score
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,positive
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,negative
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,positive
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,negative
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1350777600,Great taffy,Great taffy at a great price. There was a wid...,positive
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...,positive
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...,positive
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...,positive
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...,positive
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...,positive


In [10]:
# The numeric Score column (ratings 1, 2, 4, 5) has now been replaced by a score column holding 'positive' or 'negative'.


# Exploratory Data Analysis
## Data Cleaning: Deduplication
(It is important to remove duplicates in order to avoid biased results.)


In [11]:

filtered_data[filtered_data["UserId"] == "AR5J8UI46CURR" ]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text,score
73790,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...,positive
78444,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...,positive
138276,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...,positive
138316,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...,positive
155048,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...,positive


In [12]:
# In the output above, the same UserId reviewed several different products at exactly
# the same timestamp with identical text, which points to duplicated reviews.


As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that:
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8), and so on.
It was inferred that reviews identical in every parameter other than ProductId belong to the same product, differing only in flavour or quantity. Hence, to reduce redundancy, rows with the same parameters are eliminated.
The method is to first sort the data by ProductId, then keep only the first of each group of identical reviews and delete the others. For example, in the output above only the review for ProductId=B000HDL1RQ remains. Sorting first ensures that there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
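
The sort-then-deduplicate method described above can be sketched on a toy frame. This is an illustration under stated assumptions, not the notebook's own cell: the frame `toy` is hypothetical, its Text value is a shortened placeholder, and only three of the ProductIds from the output above are used.

```python
import pandas as pd

# Hypothetical toy frame: one user posting the same review under three ProductIds
toy = pd.DataFrame({
    "ProductId": ["B000HDOPZG", "B000HDL1RQ", "B000HDOPYM"],
    "UserId": ["AR5J8UI46CURR"] * 3,
    "ProfileName": ["Geetha Krishnan"] * 3,
    "Time": [1199577600] * 3,
    "Text": ["DELICIOUS WAFERS."] * 3,   # placeholder text
})

# Sort by ProductId (stable sort), then keep the first row of each duplicate group
sorted_toy = toy.sort_values("ProductId", kind="mergesort")
deduped = sorted_toy.drop_duplicates(
    subset=["UserId", "ProfileName", "Time", "Text"], keep="first"
)
print(deduped["ProductId"].tolist())
```

Because B000HDL1RQ sorts first, it is the row that survives, matching the example in the text.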


In [13]:
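# Rows sharing UserId, ProfileName, Time and Text are treated as duplicates.
# No sort is applied in this cell, so keep='first' retains the earliest row in the current order.
final = filtered_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep='first')
final.shape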

(364173, 10)

It has been observed that in some cases the helpfulness numerator is greater than the helpfulness denominator, which is practically impossible.


In [14]:
final = final[final["HelpfulnessNumerator"]<= final["HelpfulnessDenominator"]]

In [15]:
final.shape

(364171, 10)
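
The two shapes above differ by two rows, which are the reviews removed by the filter. A sketch of how such rows could be counted before dropping them, using hypothetical toy data rather than the real frame:

```python
import pandas as pd

# Hypothetical toy data; the second row has more helpful votes than total votes
toy = pd.DataFrame({
    "HelpfulnessNumerator": [1, 3, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# Boolean mask marking the inconsistent rows
invalid = toy["HelpfulnessNumerator"] > toy["HelpfulnessDenominator"]
n_invalid = int(invalid.sum())
valid = toy[~invalid]            # same effect as keeping Numerator <= Denominator
print(n_invalid, valid.shape)
```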

# Text Preprocessing: Stemming, Stop-word Removal and Lemmatization

Now that deduplication is finished, the data requires some preprocessing before we go further with the analysis and build the prediction model.
Hence, in the preprocessing phase we do the following, in the order below:
1. Begin by removing the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was researched that there is no adjective with 2 letters)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)
After this, we collect the words used to describe positive and negative reviews.
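
Steps 1 to 6 above can be sketched on a single sentence with the standard library alone. This is an illustration under stated assumptions: `STOPWORDS` is a tiny hypothetical list (the notebook uses NLTK's English stopwords), `preprocess` is an illustrative name, and step 7 (stemming) is left out so the example does not depend on NLTK.

```python
import re

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"this", "is", "it", "so", "very", "and", "the"}

def preprocess(text):
    text = re.sub(r"<.*?>", " ", text)        # 1. remove HTML tags
    text = re.sub(r"[?!'\"#|]", "", text)     # 2a. delete these characters
    text = re.sub(r"[.,()/]", " ", text)      # 2b. replace these with a space
    tokens = []
    for word in text.split():
        # 3-4. keep purely alphabetic words longer than 2 characters
        if word.isalpha() and len(word) > 2:
            word = word.lower()               # 5. lowercase
            if word not in STOPWORDS:         # 6. drop stopwords
                tokens.append(word)
    return tokens

tokens = preprocess("<br />This taffy is SO good, it is very soft and chewy!")
print(tokens)
```

Note that `str.isalpha()` also accepts non-English Unicode letters, so step 3 is only approximated here.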


In [16]:
final['score'].value_counts()

positive    307061
negative     57110
Name: score, dtype: int64

In [17]:
final["Text"].values[0]

'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.'

In [18]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer


In [19]:
nltk.download('stopwords')
stop = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [20]:
sno = nltk.stem.SnowballStemmer('english')

In [21]:
def cleanhtml(sentence):
    # Replace anything that looks like an HTML tag with a space
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, " ", sentence)
    return cleantext

def cleanpunc(sentence):
    # Delete ? ! ' " # | and replace . , ) ( / with a space
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    cleaned = re.sub(r'[.,)(|/]', r' ', cleaned)
    return cleaned

print(stop)
print("##############################################")
print(sno.stem('tasty'))
    

{'at', 'be', 'up', 'now', 'should', 'was', 'can', 'hasn', 'some', 'do', 'yourself', "you're", 'by', 'all', 's', 'such', 'nor', 'mustn', 'here', 'having', 'ours', 'how', 'will', 'hadn', 'yourselves', 'shouldn', 'into', "don't", 'so', 'were', 'from', 'with', 'she', "hasn't", 'itself', 'why', 'few', 'just', 'shan', "didn't", "you'll", 'myself', 've', 'weren', 'yours', "it's", 'am', 'being', 'not', 'have', 'what', 'because', 'is', 'these', 'or', 'ma', 'them', 'down', 'aren', "that'll", 'me', 'won', 'against', "haven't", 'out', 'after', 'y', 'had', "mightn't", 'which', 'each', "wasn't", 'doing', 'a', 'again', 'o', "aren't", 'haven', "wouldn't", 'm', 'but', 'ain', 'there', 'wasn', 'theirs', 'whom', 'that', 'over', 'once', 'off', 'ourselves', 'until', 'couldn', 'an', 'above', "she's", "you've", 'who', 'does', 'no', 'between', "couldn't", 'hers', 'too', 'did', 'wouldn', 'on', 'your', 'doesn', 'needn', 'then', 'its', 'and', 'are', 'during', 'as', "mustn't", "hadn't", 'about', 'other', "should'v

We now implement every step of the preprocessing phase described above.

In [22]:
final_string = []          # cleaned text of each review, stored as UTF-8 bytes
all_positive_words = []    # stemmed words from positive reviews
all_negative_words = []    # stemmed words from negative reviews
scores = final['score'].values
for i, sent in enumerate(final['Text'].values):
    filtered_sentence = []
    sent = cleanhtml(sent)  # step 1: strip HTML tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():  # step 2: strip punctuation
            # steps 3-4: keep purely alphabetic words longer than 2 characters
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                # steps 5-6: lowercase and drop stopwords
                if cleaned_words.lower() not in stop:
                    # step 7: Snowball stemming
                    s = sno.stem(cleaned_words.lower()).encode('utf8')
                    filtered_sentence.append(s)
                    if scores[i] == 'positive':
                        all_positive_words.append(s)
                    if scores[i] == 'negative':
                        all_negative_words.append(s)
    str1 = b" ".join(filtered_sentence)
    final_string.append(str1)
                    
                    

In [29]:
final['CleanedText'] = final_string

In [31]:
final.shape

(364171, 11)

In [32]:
final.to_pickle('finaldf.pk')

In [49]:
final = pd.read_pickle('finaldf.pk')


In [50]:
final

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Time,Summary,Text,score,CleanedText
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,positive,b'bought sever vital can dog food product foun...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,negative,b'product arriv label jumbo salt peanut peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,positive,b'confect around centuri light pillowi citrus ...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,negative,b'look secret ingredi robitussin believ found ...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1350777600,Great taffy,Great taffy at a great price. There was a wid...,positive,b'great taffi great price wide assort yummi ta...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...,positive,b'got wild hair taffi order five pound bag taf...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...,positive,b'saltwat taffi great flavor soft chewi candi ...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...,positive,b'taffi good soft chewi flavor amaz would defi...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...,positive,b'right most sprout cat eat grass love rotat a...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...,positive,b'healthi dog food good digest also good small...


# Bag of Words (BoW)


In [52]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
# Note: the vectorizer is fit on the raw Text column here, not on CleanedText
final_counts = count_vect.fit_transform(final['Text'].values)
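
To make clear what a bag-of-words matrix holds, here is a plain-Python sketch on a hypothetical two-document corpus. It is not the scikit-learn implementation; it only mimics what I understand to be CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), and the names `docs`, `tokenize`, `vocabulary` and `matrix` are illustrative.

```python
import re
from collections import Counter

docs = ["Great taffy at a great price", "Nice taffy"]   # hypothetical mini-corpus

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One column per distinct token, in alphabetical order
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document; each cell is the count of that token in the document
matrix = []
for doc in docs:
    term_counts = Counter(tokenize(doc))
    matrix.append([term_counts[term] for term in vocabulary])

print(vocabulary)
print(matrix)
```

Most cells are zero even in this tiny example, which is why the real matrix is stored in a sparse format.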


In [53]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [54]:
final_counts.get_shape()

(364171, 115281)

In [55]:
final_counts

<364171x115281 sparse matrix of type '<class 'numpy.int64'>'
	with 19341760 stored elements in Compressed Sparse Row format>
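
The numbers in this output show how sparse the matrix is. A short sketch of the arithmetic, with the shape and stored-element count copied by hand from the output above (the variable names are illustrative):

```python
rows, cols = 364171, 115281       # shape reported above
stored = 19341760                 # stored elements reported above

density = stored / (rows * cols)        # fraction of cells that are stored
avg_terms_per_review = stored / rows    # average stored entries per row

print(f"density: {density:.6f}")
print(f"average distinct terms per review: {avg_terms_per_review:.1f}")
```

Well under 0.1% of the cells are stored, so a dense array of the same shape would waste almost all of its memory on zeros.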