# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

EDA: https://nycdatascience.com/blog/student-works/amazon-fine-foods-visualization/


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use Score/Rating. A rating of 4 or 5 can be cosnidered as a positive review. A rating of 1 or 2 can be considered as negative one. A review of rating 3 is considered nuetral and such reviews are ignored from our analysis. This is an approximate and proxy way of determining the polarity (positivity/negativity) of a review.




In [2]:
import nltk
import pandas as pd
pd.set_option("display.max_colwidth", 200)
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
import seaborn as sns              
from nltk import FreqDist
nltk.download('stopwords') # run this one time

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
#import spacy
#import gensim
#from gensim import corpora

In [4]:
import os
os.chdir('C:\\Users\\Administrator\\Desktop\\Data\\AmazonFineFood\\Reviews')
#os.chdir('C:\\Users\\prudi\\Desktop\\Data Sets\\amazon-fine-food-reviews')
#df = pd.read_json('Reviews.json', lines=True)
df=pd.read_csv('Reviews.csv')

In [5]:
df.loc[df['ProductId'] == 'B000HDOPZG',:]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text


In [6]:
df.loc[df['ProductId'] == 'B000HDL1RQ',:]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text


In [7]:
#Sorting data according to ProductId in ascending order
sorted_data=df.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [8]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(49, 10)

In [9]:
sorted_data[sorted_data.duplicated(['UserId','ProfileName',"Time","Text"])].shape

(1, 10)

In [10]:
final.shape

(49, 10)

In [11]:
final=final.reset_index()

In [12]:
final=final.loc[:10,:]

In [13]:
def partition(x):
    if x < 3:
        return 0
    return 1

#changing reviews with score less than 3 to be positive and vice-versa
actualScore = final['Score']
positiveNegative = actualScore.map(partition) 
final['Score'] = positiveNegative
print("Number of data points in our data", final.shape)
final.head(3)

Number of data points in our data (11, 11)


Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,10,11,B0001PB9FE,A3HDKO7OW0QNK4,Canadian Fan,1,1,1,1107820800,The Best Hot Sauce in the World,"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind! We picked up a bottle once on a trip we wer..."
1,12,13,B0009XLVG0,A327PCT23YH90,LT,1,1,0,1339545600,My Cats Are Not Fans of the New Food,My cats have been happily eating Felidae Platinum for more than two years. I just got a new bag and the shape of the food is different. They tried the new food when I first put it in their bowls a...
2,11,12,B0009XLVG0,A2725IB4YY9JEB,"A Poeng ""SparkyGoHome""",4,4,1,1282867200,"My cats LOVE this ""diet"" food better than their regular food","One of my boys needed to lose some weight and the other didn't. I put this food on the floor for the chubby guy, and the protein-rich, no by-product food up higher where only my skinny boy can ju..."


In [14]:
from nltk.corpus import stopwords
Stopwords=stopwords.words('english')

In [15]:
for i in range(len(final['Text'])):
    words=nltk.word_tokenize(final['Text'][i])
    withoutstopwords=[word for word in words if word not in Stopwords]
    final['Text'][i]=' '.join(withoutstopwords)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [16]:
#remove spacial character: https://stackoverflow.com/a/5843547/4084039
## Remove the special characters
final['Text'] = final['Text'].str.replace("[^a-zA-Z0-9]+", " ")

In [17]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
## Remove the URL'S
import re
final['Text']=final['Text'].apply(lambda x: re.sub(r"http\S+", "", x))

In [18]:
from nltk.stem import PorterStemmer
Stemmer=PorterStemmer()
for i in range(len(final['Text'])):
    tokens=nltk.word_tokenize(final['Text'][i])
    stem_tokens=[Stemmer.stem(token) for token in tokens]
    final['Text'][i]=' '.join(stem_tokens)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [19]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
## Remove xml tags from the reviews
from bs4 import BeautifulSoup
final['Text']=final['Text'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

In [20]:
# remove short words (length < 3)
final['Text'] = final['Text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

In [21]:
# make entire text lowercase
final['Text'] = [r.lower() for r in final['Text']]

In [22]:
final.head(1)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,10,11,B0001PB9FE,A3HDKO7OW0QNK4,Canadian Fan,1,1,1,1107820800,The Best Hot Sauce in the World,know cactu tequila uniqu combin ingredi flavour hot sauc make one kind pick bottl trip brought back home total blown away when realiz simpli could find anywher citi bum now magic internet case sau...


# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[10]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[20]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[30]
print(sent_4900)
print("="*50)

In [23]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [24]:
final['Text']=final['Text'].map(decontracted)

In [25]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
final['Text']=final['Text'].apply(lambda x: re.sub("\S*\d\S*", "", x))

In [26]:
final.head(2)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,10,11,B0001PB9FE,A3HDKO7OW0QNK4,Canadian Fan,1,1,1,1107820800,The Best Hot Sauce in the World,know cactu tequila uniqu combin ingredi flavour hot sauc make one kind pick bottl trip brought back home total blown away when realiz simpli could find anywher citi bum now magic internet case sau...
1,12,13,B0009XLVG0,A327PCT23YH90,LT,1,1,0,1339545600,My Cats Are Not Fans of the New Food,cat happili eat felida platinum two year got new bag shape food differ they tri new food first put bowl bowl sit full kitti touch food notic similar review relat formula chang past unfortun need f...


In [27]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
import re
sent_0 = re.sub("\S*\d\S*", "", "This are number 123")
print(sent_0)

This are number 


In [28]:
final.head(1)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,10,11,B0001PB9FE,A3HDKO7OW0QNK4,Canadian Fan,1,1,1,1107820800,The Best Hot Sauce in the World,know cactu tequila uniqu combin ingredi flavour hot sauc make one kind pick bottl trip brought back home total blown away when realiz simpli could find anywher citi bum now magic internet case sau...


In [29]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in Stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|████████████████████████████████████████| 11/11 [00:00<00:00, 1047.08it/s]


In [30]:
final['Text']=preprocessed_reviews

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

tfidf = TfidfVectorizer(lowercase=False, )
tfidf.fit_transform(final['Text'])

# dict key:word and value:tf-idf score
word2tfidf = dict(zip(tfidf.get_feature_names(), tfidf.idf_))

In [32]:
import warnings
# en_vectors_web_lg, which includes over 1 million unique vectors.
import spacy
nlp = spacy.load('en_core_web_sm')
#import en_core_web_sm
#nlp = en_core_web_sm.load()

In [88]:
vecs1 = []
# https://github.com/noamraph/tqdm
# tqdm is used to print the progress bar
for qu1 in tqdm(range(len(final['Text']))):
    doc1 = nlp(final['Text'][qu1]) 
    # 384 is the number of dimensions of vectors 
    mean_vec1 = np.zeros([len(doc1), 384])
    for word1 in doc1:
        # word2vec
        vec1 = word1.vector
        # fetch df score
        try:
            idf = word2tfidf[str(word1)]
        except:
            idf = 0
        # compute final vec
        mean_vec1[qu1] = vec1 * idf
    mean_vec1 = mean_vec1.mean(axis=0)
    vecs1.append(mean_vec1)
#df['q1_feats_m'] = list(vecs1)


100%|██████████████████████████████████████████| 11/11 [00:00<00:00, 29.45it/s]


In [89]:
w2v_final=pd.DataFrame(vecs1)

<h2><font color='red'>[3.2] Preprocessing Review Summary</font></h2>

In [90]:
w2v_final.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,0.040597,0.029095,-0.027385,0.008699,-0.000762,0.043959,-0.1255,0.049812,0.064907,0.086278,...,0.008569,0.012811,0.045857,-0.029848,-0.015161,-0.009784,-0.009713,-0.035239,0.035038,0.015609
1,0.101272,-0.016836,0.083388,-0.060804,0.253146,0.034564,-0.119985,0.092053,0.003815,-0.006391,...,0.037583,-0.011276,0.009697,-0.008541,-0.009426,0.017738,-0.003649,0.007103,0.00747,-0.00844


In [35]:
## Similartly you can do preprocessing for review summary also.

# [4] Featurization

## [4.1] WeightedW2V

In [91]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(w2v_final, final['Score'],test_size=0.3,random_state=42)

In [92]:
print('Shape of X_train is',X_train.shape)
print('Shape of y_train is',y_train.shape)

print('*'*100)

print('Shape of X_test is',X_test.shape)
print('Shape of y_test is',y_test.shape)

Shape of X_train is (7, 384)
Shape of y_train is (7,)
****************************************************************************************************
Shape of X_test is (4, 384)
Shape of y_test is (4,)


In [93]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [94]:
## Predicting the values for the X_test
predicted_xtest=list(neigh.predict(X_test))
actual_test=list(y_test)

In [95]:
## Predicting the values for the X_train
predicted_xtrain=list(neigh.predict(X_train))
actual_train=list(y_train)

In [96]:
## Passing the first value from the X_train
Predicted_value_X_train=neigh.predict(X_train.iloc[:1,:])
print('Predicted value for the first record is ',Predicted_value_X_train)

Predicted value for the first record is  [1]


In [97]:
## Passing the first value from the X_test
Predicted_value_X_test=neigh.predict(X_test.iloc[:1,:])
print('Predicted value for the first record is ',Predicted_value_X_test)

Predicted value for the first record is  [1]


In [98]:
## Passing the first value from the X_test and Predicting the Probelities
Predicted_value_X_test=neigh.predict_proba(X_test.iloc[1:20,:])
print('Predicted value for the first record is \n',Predicted_value_X_test)

Predicted value for the first record is 
 [[0.33333333 0.66666667]
 [0.33333333 0.66666667]
 [0.33333333 0.66666667]]


In [99]:
neigh.kneighbors(X_train.iloc[:1,:],return_distance=True)

(array([[0.        , 2.22819736, 2.27891552]]),
 array([[0, 6, 3]], dtype=int64))

In [100]:
neigh.kneighbors(X_train.iloc[:1,:],return_distance=False)

array([[0, 6, 3]], dtype=int64)

In [101]:
print('Classes we have to predict ',neigh.classes_)

Classes we have to predict  [0 1]


In [102]:
# Confusion metrics for the Test data

from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

results =confusion_matrix(actual_test, predicted_xtest) 

print('Confusion Matrix :')
print(results) 
print('Accuracy Score :',accuracy_score(actual_test, predicted_xtest) )
print('Report : ')
print (classification_report(actual_test, predicted_xtest) )


Confusion Matrix :
[[0 1]
 [0 3]]
Accuracy Score : 0.75
Report : 


  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.75      1.00      0.86         3

    accuracy                           0.75         4
   macro avg       0.38      0.50      0.43         4
weighted avg       0.56      0.75      0.64         4



In [103]:
# Confusion metrics for the Test data

from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

results =confusion_matrix(actual_train, predicted_xtrain) 

print('Confusion Matrix :')
print(results) 
print('Accuracy Score :',accuracy_score(actual_train, predicted_xtrain) )
print('Report : ')
print (classification_report(actual_train, predicted_xtrain) )

Confusion Matrix :
[[0 1]
 [0 6]]
Accuracy Score : 0.8571428571428571
Report : 
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.86      1.00      0.92         6

    accuracy                           0.86         7
   macro avg       0.43      0.50      0.46         7
weighted avg       0.73      0.86      0.79         7

