Amazon Food Reviews Analysis

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

Reference: https://www.appliedaicourse.com/

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/

1. Understanding the downloaded data
2. Attribute and format information
3. Objective 
4. Initial steps like loading and reading the data
5. EDA with Data cleaning and Data Pre-processing
6. Text Analysis

Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2)

In [3]:
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os


In [5]:
# using the SQLite Table to read data.
con=sqlite3.connect('database.sqlite')

#filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con) 
# for tsne assignment you can take 5k data points
filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE score != 3 LIMIT 500000 """, con)


# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
filtered_data 
def partition(x):
    if x<3:
        return 'negative';
    return 'positive'



#changing reviews with score less than 3 to be positive and vice-versa

actualScore=filtered_data['Score']
positiveNegative=actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

Number of data points in our data (500000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [6]:
#ART of finding ways to clean the data
#we have to work, query and understand the data; so that we will come through somes idea how and where we can clean up the data


display = pd.read_sql_query("""

SELECT UserId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY userId
HAVING count(*)>1
""",con)



In [7]:
print(display.shape)
display.head(50)

(80668, 6)


Unnamed: 0,UserId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2
5,#oc-R12MGTQS5KZZRV,"SKY2110 ""SKY2110""",1344211200,5,This is the highest PH level I can find withou...,3
6,#oc-R13EBF129DBX88,mary,1344729600,2,This coffee is not what I expected. I thought...,2
7,#oc-R13NNUL4EKL4FL,N. Chernyavskaya,1348358400,1,I tested the pH of this water. I am very disap...,3
8,#oc-R14ZSRYW2YB41B,A. Crafton,1346284800,5,I drank this on ice after a workout. It was ve...,3
9,#oc-R15343ZW0UTLMR,"Lisa L. Nolen ""SimplyLisaLisa""",1346457600,1,"I shouldn't label myself a coffee connoisseur,...",2


In [8]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [9]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(348262, 10)

In [10]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.6524

In [11]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [12]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [13]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(348260, 10)


positive    293516
negative     54744
Name: Score, dtype: int64

In [25]:
#the next part, we will be dealing with concepts like bag of words, tfid, words2vectors

# printing some random reviews

#from the column text; we are selecting some random reviews as in .values[..]
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1001 = final['Text'].values[1001]
print(sent_1001)
print("="*50)

sent_1002 = final['Text'].values[1002]
print(sent_1002)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)




This book was purchased as a birthday gift for a 4 year old boy. He squealed with delight and hugged it when told it was his to keep and he did not have to return it to the library.
I've purchased both the Espressione Espresso (classic) and the 100% Arabica.  My vote is definitely with the 100% Arabica.  The flavor has more bite and flavor (much more like European coffee than American).
These pods have got to be the best invention yet. They are compact, extremely easy to use & is the best solution to making several very quick cups of coffee.<br /><br />Fresh ground coffee is nice, but there is virtually no mess with these pods. The crema is just as good as ground coffee & these can be used in a pod holder as well as a one cup filter basket.<br /><br />What more could a person want?!
This is a great product. It is very healthy for all of our dogs, and it is the first food that they all love to eat. It helped my older dog lose weight and my 10 year old lab gain the weight he needed to be

In [24]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1001 = re.sub(r"http\S+", "", sent_1001)
sent_1002 = re.sub(r"http\S+", "", sent_1002)
sent_1500 = re.sub(r"http\S+", "", sent_1500)

print(sent_1002)

These pods have got to be the best invention yet. They are compact, extremely easy to use & is the best solution to making several very quick cups of coffee.<br /><br />Fresh ground coffee is nice, but there is virtually no mess with these pods. The crema is just as good as ground coffee & these can be used in a pod holder as well as a one cup filter basket.<br /><br />What more could a person want?!


In [29]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1001, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1002, 'lxml')
text = soup.get
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get
print(text)
print("="*50)

This book was purchased as a birthday gift for a 4 year old boy. He squealed with delight and hugged it when told it was his to keep and he did not have to return it to the library.
I've purchased both the Espressione Espresso (classic) and the 100% Arabica.  My vote is definitely with the 100% Arabica.  The flavor has more bite and flavor (much more like European coffee than American).
<bound method Tag.get of <html><body><p>These pods have got to be the best invention yet. They are compact, extremely easy to use &amp; is the best solution to making several very quick cups of coffee.<br/><br/>Fresh ground coffee is nice, but there is virtually no mess with these pods. The crema is just as good as ground coffee &amp; these can be used in a pod holder as well as a one cup filter basket.<br/><br/>What more could a person want?!</p></body></html>>
<bound method Tag.get of <html><body><p>This is a great product. It is very healthy for all of our dogs, and it is the first food that they all