## Amazon Fine Food Reviews Analysis

### Context

This dataset consists of reviews of fine foods from amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

### Information about dataset

1. Reviews from Oct 1999 - Oct 2012
2. 568,454 reviews
3. 56,059 users
4. 74,258 products
5. 260 users with > 50 reviews

### Attribution Information

1. ID
2. ProductId
3. UserId
4. ProfileName
5. HelpfulnessNumerator - Number of users who found the review helpful
6. HelpfulnessDenominator - Number of users who indicated whether they found the review helpful or not
7. Score - Rating between 1 and 5 ****
8. Time - Timestamp for the review
9. Summary - Brief summary of the review
10. Text - Text of the review *****


In [1]:
# Objective : Given a review, we have to determine the review is either positive (4 or 5) or negative (1 or 2)
# Review 3 : neutral and we have to ignore them

In [2]:
import os,sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings ('ignore')

import sqlite3
import nltk
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords


### Reading Data :

In [3]:
con = sqlite3.connect ('database.sqlite')
con

<sqlite3.Connection at 0x21001baaa80>

In [4]:
# SQL

filtered_data = pd.read_sql_query ('''select * from reviews where score != 3''', con)
filtered_data.shape


(525814, 10)

In [5]:
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [6]:
filtered_data ['Score'].value_counts()

5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64

In [7]:
def partition(x):
    if x < 3:
        return 0
    return 1
    

In [8]:
actualScore = filtered_data ['Score']
positiveNegative = actualScore.map (partition)
filtered_data ['Score'] = positiveNegative
print ("Number of data points in the data", filtered_data.shape)
print()
print (filtered_data ['Score'].value_counts())
print()
print (filtered_data.head)


Number of data points in the data (525814, 10)

1    443777
0     82037
Name: Score, dtype: int64

<bound method NDFrame.head of             Id   ProductId          UserId                      ProfileName  \
0            1  B001E4KFG0  A3SGXH7AUHU8GW                       delmartian   
1            2  B00813GRG4  A1D87F6ZCVE5NK                           dll pa   
2            3  B000LQOCH0   ABXLMWJIXXAIN  Natalia Corres "Natalia Corres"   
3            4  B000UA0QIQ  A395BORC6FGVXV                             Karl   
4            5  B006K2ZZ7K  A1UQRSCLF8GW1T    Michael D. Bigham "M. Wassir"   
...        ...         ...             ...                              ...   
525809  568450  B001EO7N10  A28KG5XORO54AY                 Lettie D. Carter   
525810  568451  B003S1WTCU  A3I8AFVPEE8KI5                        R. Sawyer   
525811  568452  B004I613EE  A121AA1GQV751Z                    pksd "pk_007"   
525812  568453  B004I613EE   A3IBEVCTXKNOH          Kathy A. Welch "katwel"   
52

In [9]:
filtered_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [10]:
# Sorting the data into ascending order basis product ID

sorted_data = filtered_data.sort_values ('ProductId', axis = 0, ascending = True, inplace = False,
                                        kind = 'quicksort', na_position = 'last')


In [11]:
sorted_data

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
138706,150524,0006641040,ACITT7DI6IDDL,shari zychinski,0,0,1,939340800,EVERY book is educational,this witty little book makes my son laugh at l...
138688,150506,0006641040,A2IW4PEEKO2R0U,Tracy,1,1,1,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc..."
138689,150507,0006641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,1,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...
138690,150508,0006641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,1,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...
138691,150509,0006641040,A3CMRKGE0P909G,Teresa,3,4,1,1018396800,A great way to learn the months,This is a book of poetry about the months of t...
...,...,...,...,...,...,...,...,...,...,...
176791,191721,B009UOFTUI,AJVB004EB0MVK,D. Christofferson,0,0,0,1345852800,weak coffee not good for a premium product and...,"This coffee supposedly is premium, it tastes w..."
1362,1478,B009UOFU20,AJVB004EB0MVK,D. Christofferson,0,0,0,1345852800,weak coffee not good for a premium product and...,"This coffee supposedly is premium, it tastes w..."
303285,328482,B009UUS05I,ARL20DSHGVM1Y,Jamie,0,0,1,1331856000,Perfect,The basket was the perfect sympathy gift when ...
5259,5703,B009WSNWC4,AMP7K1O84DH1T,ESTY,0,0,1,1351209600,DELICIOUS,Purchased this product at a local store in NY ...


In [12]:
sorted_data.isnull().sum()

Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64

In [13]:
sorted_data.shape

(525814, 10)

In [14]:
# check duplicate entries and drop them

final = sorted_data.drop_duplicates (subset = {'UserId', 'ProfileName', 'Text'}, keep = 'first',
                                    inplace = False)


In [15]:
final.shape

(363899, 10)

### Text Analytics :

In [16]:
# Printing some random sample reviews

sent_0 = final ['Text'].values[0]
print (sent_0)
print ("="*20)

sent_200 = final ['Text'].values[200]
print (sent_200)
print ("="*20)

sent_1500 = final ['Text'].values[1500]
print (sent_1500)
print ("="*20)

sent_3000 = final ['Text'].values[3000]
print (sent_3000)
print ("="*20)

sent_4110 = final ['Text'].values[4110]
print (sent_4110)
print ("="*20)

sent_4800 = final ['Text'].values[4800]
print (sent_4800)
print ("="*20)


this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
You asked for my review of this purchase already, I said fast shipping and good product. Do not bother me with these reviews any more, if i have a problem i will let you know. I will stop using amozon in the future if you don't leave me alone.
Great ingredients although, chicken should have been 1st rather than chicken broth, the only thing I do not think belongs in it is Canola oil. Canola or rapeseed is not someting a dog would ever find in nature and if it did find rapeseed in nature and eat it, it would poison them. Today's Food industries have convinced the masses that Canola oil is a safe and even better oil th

In [17]:
# Data Cleaning

import re
sent_0 = re.sub (r"http\S+", "", sent_0)
print (sent_0)


this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [18]:
# webscrapping package - BeautifulSoup

from bs4 import BeautifulSoup
soup = BeautifulSoup (sent_0, 'lxml')
text = soup.get_text()
print (text)


this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [19]:
sent_3000 = final ['Text'].values[3000]
print (sent_3000)
print ("="*20)

I discovered this at a nearby restaurant and had to track it down. It has a light and flowery flavor. Not overpowering, and not too light. It's just right!


In [20]:
# remove all special symbols and numbers

sent_3000 = re.sub ('[^A-Za-z]+', ' ', sent_3000)
print (sent_3000)

I discovered this at a nearby restaurant and had to track it down It has a light and flowery flavor Not overpowering and not too light It s just right 


In [21]:
def decontracted (phrase):
    phrase = re.sub (r"don't", "do not", phrase)
    phrase = re.sub (r"doesn't", "does not", phrase)
    phrase = re.sub (r"won't", "will not", phrase)
    phrase = re.sub (r"it's", "it is", phrase)
    phrase = re.sub (r"haven't", "have not", phrase)
    phrase = re.sub (r"i've", "i have", phrase)
    phrase = re.sub (r"hasn't", "has not", phrase)
    phrase = re.sub (r"can't", "cannot", phrase)
    phrase = re.sub (r"\ve", "have", phrase)
    phrase = re.sub (r"\'s", "is", phrase)
    phrase = re.sub (r"\'m", "am", phrase)
    phrase = re.sub (r"\'t", "not", phrase)
    phrase = re.sub (r"\'ll", "will", phrase)
    phrase = re.sub (r"\'re", "are", phrase)
    return phrase
    

In [22]:
sent_3000 = final ['Text'].values[3000]
print (sent_3000)

I discovered this at a nearby restaurant and had to track it down. It has a light and flowery flavor. Not overpowering, and not too light. It's just right!


In [23]:
sent_3000 = decontracted (sent_3000)
sent_3000

'I discovered this at a nearby restaurant and had to track it down. It has a light and flowery flavor. Not overpowering, and not too light. Itis just right!'

In [24]:
!pip install contractions








In [25]:
import contractions

# Example usage

text = "I can't believe its raining today"
expanded_text = contractions.fix (text)

print (expanded_text)


I cannot believe its raining today


In [26]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.5.4-cp39-cp39-win_amd64.whl (12.2 MB)
     ---------------------------------------- 12.2/12.2 MB 3.6 MB/s eta 0:00:00
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.5.3
    Uninstalling spacy-3.5.3:
      Successfully uninstalled spacy-3.5.3




Successfully installed spacy-3.5.4


In [27]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 3.7 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')




In [28]:
# remove stopwords

import spacy

In [29]:
from nltk.corpus import stopwords
stop_words = set (stopwords.words ('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [30]:
stopwords = set (['okay', 'okiey', 'yup', 'br','a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 "shan't",
 'she',
 "she's",
 'should',
 "should've",
 'shouldn',
 "shouldn't",
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 "that'll",
 'the',
 'their',
 'theirs',
 'them',
 'themselves',
 'then',
 'there',
 'these',
 'they',
 'this',
 'those',
 'through',
 'to',
 'too',
 'under',
 'until',
 'up',
 've',
 'very',
 'was',
 'wasn',
 "wasn't",
 'we',
 'were',
 'weren',
 "weren't",
 'what',
 'when',
 'where',
 'which',
 'while',
 'who',
 'whom',
 'why',
 'will',
 'with',
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'y',
 'you',
 "you'd",
 "you'll",
 "you're",
 "you've",
 'your',
 'yours',
 'yourself',
 'yourselves'])

In [31]:
# apply all the cleaning methods at one go with entire text

from tqdm import tqdm
preprocessed_reviews = []

# tqdm is for printing the status bar

for sentence in tqdm (final ['Text'].values):
    sentence = re.sub (r"http\S+", "", sentence)
    sentence = BeautifulSoup (sentence, 'lxml').get_text()
    sentence = contractions.fix (sentence)
    sentence = re.sub ("S\*d\S*", ' ', sentence).strip()
    sentence = re.sub ("[^A-Za-z]+", ' ', sentence)
    sentence = ' '.join (e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_reviews.append (sentence.strip())


100%|██████████| 363899/363899 [04:29<00:00, 1351.93it/s]


In [32]:
preprocessed_reviews[0]

'witty little book makes son laugh loud recite car driving along always sing refrain learned whales india drooping roses love new words book introduces silliness classic book willing bet son still able recite memory college'

In [33]:
preprocessed_reviews[1284]

'two dogs looove thing wierd thing expecting rex got stegasaurus really complaint heads want rex might need specify look like material nylabone products bought seems little harder bones bought make difference dogs never get tired'

In [34]:
# Feature Extraction
## Bag of words
## n-grams => uni-gram, bi-grams, tri-grams
## TF-IDF
## Word2Vec/Glove/BERT => Deep Learning

### TF-IDF :

In [35]:
tf_idf_vect = TfidfVectorizer (ngram_range = (1, 1))
tf_idf_vect.fit (preprocessed_reviews)
print ("some feature names : ", tf_idf_vect.get_feature_names())



In [36]:
final_counts_tfidf = tf_idf_vect.transform (preprocessed_reviews)
print ("The type of count vectorizer", type (final_counts_tfidf))
print ("="*50)
print ("The shape of the out text by using TFIDF", final_counts_tfidf.get_shape())
print ("="*50)
print ("The number of unique words", final_counts_tfidf.get_shape()[1])


The type of count vectorizer <class 'scipy.sparse.csr.csr_matrix'>
The shape of the out text by using TFIDF (363899, 116616)
The number of unique words 116616


In [None]:
final_counts_tfidf = tf_idf_vect.transform (preprocessed_reviews).toarray()

In [None]:
final_counts_tfidf

In [None]:
pd.DataFrame (final_counts_tfidf).shape

In [None]:
pd.DataFrame (final_counts_tfidf)