## Amazon Fine Food Reviews Analysis and Review Prediction

#### Objective: 
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).
    
#### [Q] How to determine if a review is positive or negative?
[Ans] We can use the Score (rating). A rating of 4 or 5 is considered positive, and a rating of 1 or 2 is considered negative. A rating of 3 is neutral and is ignored. The score is only an approximate proxy for the polarity (positivity/negativity) of a review.
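
The labeling rule above can be sketched as a small function. This is a minimal illustration, not code from the notebook: the name `score_to_polarity` and the `"positive"`/`"negative"` label strings are assumptions.

```python
from typing import Optional

def score_to_polarity(score: int) -> Optional[str]:
    """Map a 1-5 star rating to a polarity label (hypothetical helper)."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    return None  # a rating of 3 (or anything out of range) is ignored

labels = [score_to_polarity(s) for s in (5, 1, 3, 4, 2)]
print(labels)  # ['positive', 'negative', None, 'positive', 'negative']
```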

To check the details of any product, follow the steps below:
1. Open the URL: https://www.amazon.com/dp/
2. Append the product ID, for example: https://www.amazon.com/dp/B00004CI84
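
The same lookup can be scripted when many product IDs need checking. A minimal sketch; `product_url` is a hypothetical helper name, and the base URL is the one given in the steps above.

```python
BASE_URL = "https://www.amazon.com/dp/"

def product_url(product_id: str) -> str:
    """Build the product page URL for a ProductId value (hypothetical helper)."""
    return BASE_URL + product_id.strip()

print(product_url("B00004CI84"))  # https://www.amazon.com/dp/B00004CI84
```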

In [23]:
# importing required libraries
import re
import sqlite3
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup

# suppress warnings to keep the notebook output readable
warnings.filterwarnings('ignore')


#### Loading Data

In [2]:
# reading the data from the Reviews table of the SQLite database
con = sqlite3.connect('C:/DevelopmentPlayground/Datasets/AmazonFoodReviewDataset/database.sqlite')

start_time = time.time()

raw_data = pd.read_sql_query(""" Select * from Reviews """, con)

print("Time taken to load data (seconds):", time.time() - start_time)

Time taken to load data (seconds): 7.706194877624512
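
Since neutral reviews (rating of 3) are ignored in this analysis, one option is to filter them out in the SQL query itself. The notebook loads every row, so the sketch below is an alternative and not the author's method. It uses an in-memory database with a made-up three-row `Reviews` table so that it runs without the dataset.

```python
import sqlite3

# In-memory stand-in for database.sqlite; the rows are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "bad")],
)

# Parameterized query that skips the neutral rating.
rows = con.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != ? ORDER BY Id", (3,)
).fetchall()
con.close()

print(rows)  # [(1, 5), (3, 1)]
```

With the real database, the same `WHERE Score != 3` clause could be passed to `pd.read_sql_query`.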


### Exploratory Data Analysis

In [3]:
# checking the number of records

raw_data.shape

(568454, 10)

In [4]:
# checking top 5 records

raw_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
# checking for rows that repeat an earlier row in every column except Id

dup_cols = ['ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
            'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text']
duplicate = raw_data[raw_data.duplicated(dup_cols)].sort_values(by=['ProfileName'])

duplicate.shape

(281, 10)

In [6]:
duplicate

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
80331,80332,B000LRO5O4,ADLVFFE4VBT8,"A. Dent ""Aragorn""",0,0,5,1222992000,Say goodbye to sugar,I could not use any of the not-to-be-named pre...
424019,424020,B003YT0POA,A2GSNN6EH9K2HD,A. Meyer,0,0,5,1301875200,Very Tasty,These cereal bars are fantastic. All of their ...
546112,546113,B003YSV5ZY,A2GSNN6EH9K2HD,A. Meyer,0,0,5,1301875200,Very Tasty,These cereal bars are fantastic. All of their ...
522348,522349,B002498PVQ,A2YLC7T12FRDKJ,A. Smith,0,2,5,1267142400,the best cookies,To my European taste these and other Grisbi co...
72198,72199,B000LKUAK4,AWM1KZ2MDOVWJ,"A. Winters ""Be good humans.""",0,0,4,1236556800,Great fake jerkey,This is a great jerkey substitute if you're ve...
...,...,...,...,...,...,...,...,...,...,...
509731,509732,B001GL6GBE,AM820RV0VN0U,windie809,0,0,5,1339459200,love these protein bars!,if you are looking for a protein bar that does...
496515,496516,B001181NBA,AM820RV0VN0U,windie809,0,0,5,1339459200,love these protein bars!,if you are looking for a protein bar that does...
353537,353538,B000UVBYRM,AM820RV0VN0U,windie809,0,0,5,1339459200,love these protein bars!,if you are looking for a protein bar that does...
413165,413166,B000FRSSFC,AM820RV0VN0U,windie809,0,0,5,1339459200,love these protein bars!,if you are looking for a protein bar that does...


### There are 281 rows with duplicate data, which need to be dropped.

In [7]:
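# mask rows found in `duplicate` to NaN and drop them (dropna() also drops rows that already had missing values)
data = raw_data[~raw_data.isin(duplicate)].dropna()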
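
The `isin(...)` plus `dropna()` pattern works by turning matched cells into NaN, which is why the integer columns appear as floats (for example `44737.0`) in later outputs. A possible alternative, sketched here on a hypothetical toy frame and not taken from the notebook, is `DataFrame.drop_duplicates`, which keeps the first occurrence and leaves dtypes and unrelated missing values untouched.

```python
import pandas as pd

# Hypothetical toy frame; the real data has ten columns.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "UserId": ["u1", "u1", "u2"],
    "Text": ["tasty", "tasty", "stale"],
})

# Keep the first occurrence of each (UserId, Text) pair.
deduped = toy.drop_duplicates(subset=["UserId", "Text"], keep="first")

print(deduped["Id"].tolist())  # [1, 3]
print(deduped["Id"].dtype)     # still an integer dtype
```

On the real data, `subset` would be the list of columns passed to `duplicated` above. Row counts may differ slightly from the notebook's, because rows with pre-existing missing values would no longer be dropped as a side effect.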

In [8]:
# checking for records where HelpfulnessNumerator is greater than HelpfulnessDenominator (which should not be possible)

result = data[(data['HelpfulnessNumerator'] > data['HelpfulnessDenominator'])]

In [9]:
result

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
44736,44737.0,B001EQ55RW,A2V0I904FH7ABY,Ram,3.0,2.0,4.0,1212883000.0,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
64421,64422.0,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3.0,1.0,5.0,1224893000.0,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


In [10]:
# deleting the records where HelpfulnessNumerator is greater than HelpfulnessDenominator

data = data[~data.isin(result)].dropna()
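
The same filter can also be written as a single boolean mask that keeps only the valid rows. A minimal sketch on a hypothetical toy frame; the column names match the dataset, the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "HelpfulnessNumerator": [1, 3, 0],
    "HelpfulnessDenominator": [1, 2, 0],
})

# Keep rows where the helpful votes do not exceed the total votes.
valid = toy[toy["HelpfulnessNumerator"] <= toy["HelpfulnessDenominator"]]

print(valid.index.tolist())  # [0, 2]
```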

In [11]:
# checking for the same review (same user, time, summary and text) repeated under different products

duplicate = data[data.duplicated(['UserId', 'ProfileName', 'Time', 'Summary', 'Text'])].sort_values(by=['ProfileName'])

duplicate.shape

(173001, 10)

In [12]:
duplicate.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
358218,358219.0,B0032CJPOK,A2JKR0W0EQ6QQM,,0.0,2.0,5.0,1334966000.0,good formula!!,this formula can help baby happy every day<br ...
451901,451902.0,B00004CXX9,A34NBH479RB0E,"""dmab6395""",0.0,1.0,5.0,977184000.0,FUNNY,"I THOUGHT THIS MOVIE WAS SO FUNNY, MICHAEL KEA..."
374382,374383.0,B00004CI84,A34NBH479RB0E,"""dmab6395""",0.0,1.0,5.0,977184000.0,FUNNY,"I THOUGHT THIS MOVIE WAS SO FUNNY, MICHAEL KEA..."
374329,374330.0,B00004CI84,AAI57M3OXP5NK,"""gibraud""",0.0,0.0,5.0,1025654000.0,Love This Movie!,This movie is a very odd movie but I love it b...
451849,451850.0,B00004CXX9,AAI57M3OXP5NK,"""gibraud""",0.0,0.0,5.0,1025654000.0,Love This Movie!,This movie is a very odd movie but I love it b...


In [13]:
data = data[~data.isin(duplicate)].dropna()

data.shape

(395170, 10)

## Text Preprocessing

In [None]:
1. Removing HTML tags and URLs
2. Removing punctuation
3. Removing alphanumeric tokens (words that contain digits)
4. Converting the words to lowercase and removing stop words
5. Lemmatizing the words
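
Steps 1 to 4 can be sketched with the standard library alone. This is an illustration under several assumptions: the tag-stripping regex is a simplified stand-in for an HTML parser, "alpha-numeric" is read as "tokens containing a digit", and `STOP_WORDS` is a tiny hypothetical list. Digit-bearing tokens are removed before punctuation so that a token such as `2nd` is dropped whole. Lemmatization (step 5) is left out because it needs a third-party lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "to", "of"}

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)      # step 1: HTML tags (regex stand-in)
    text = re.sub(r"http\S+", " ", text)      # step 1: URLs
    text = re.sub(r"\S*\d\S*", " ", text)     # step 3: tokens containing digits
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # step 2: punctuation
    # step 4: lowercase, then drop stop words
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(words)

print(preprocess("This is GREAT!<br />Buy 2 at http://example.com today."))
# great buy at today
```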

In [20]:
# checking a few sample reviews at fixed positions

sent_0 = data['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = data['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = data['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = data['Text'].values[4900]
print(sent_4900)
print("="*50)

sent_20000 = data['Text'].values[20000]
print(sent_20000)
print("="*50)

sent_10000 = data['Text'].values[10000]
print(sent_10000)
print("="*50)

sent_15000 = data['Text'].values[15000]
print(sent_15000)
print("="*50)

sent_395126 = data['Text'].values[395126]
print(sent_395126)
print("="*50)

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
I have a whole box of peanut butter dog cookies and she wont touch them. She loves these and begs for them so it was a good buy. She is a little chihuhua and they are not too big for her mouth. About 20 per box.
When my daughter was an infant, she was allergic to something in my breastmilk and was put on Neocate which worked wonders for her allergic colitis.  For my son, we were therefore very careful when it came time to picking out a formula to supplement breastmilk.  His stomach does react when I have accidentally had more dairy than I can tolerate so my pediatrician recommended a lactose free formula, especially given our Asian background.  I wanted something organic, not soy, lactose free, and with as clean of a label as 

In [22]:
# remove html tags

def remove_html_tags(text_data):
    soup = BeautifulSoup(text_data, 'lxml')
    return soup.get_text()

In [25]:
# remove URL

def remove_url(text_data):
    return re.sub(r"http\S+", "", text_data)
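
The pattern `http\S+` does not match links written without a scheme (for example `www.example.com`), and removing a URL from the middle of a sentence leaves a double space behind. A possible variant that covers both cases is sketched below; `remove_url_broad` and `URL_PATTERN` are hypothetical names, not part of the notebook.

```python
import re

# Matches http://..., https://... and bare www.... links.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_url_broad(text_data: str) -> str:
    without_urls = URL_PATTERN.sub(" ", text_data)
    # collapse the whitespace left behind by removed URLs
    return " ".join(without_urls.split())

print(remove_url_broad("See www.example.com or http://example.com/x for details"))
# See or for details
```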