## Amazon Fine Food Reviews 

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

1) Id

2) ProductId - unique identifier for the product

3) UserId - unique identifier for the user

4) ProfileName

5) HelpfulnessNumerator - number of users who found the review helpful

6) HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not

7) Score - rating between 1 and 5

8) Time - timestamp for the review

9) Summary - brief summary of the review

10) Text - text of the review


## Importing required libraries

In [10]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

import os

In [11]:
# Load the full reviews dataset (it is split into train and test sets later)

df = pd.read_csv('Reviews.csv')

In [12]:
df.shape[0]

568454

## Summary of the dataset

In [13]:
df.describe()

| | Id | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time |
|---|---|---|---|---|---|
| count | 568454.0 | 568454.0 | 568454.0 | 568454.0 | 568454.0 |
| mean | 284227.5 | 1.743817 | 2.22881 | 4.183199 | 1296257000.0 |
| std | 164098.679298 | 7.636513 | 8.28974 | 1.310436 | 48043310.0 |
| min | 1.0 | 0.0 | 0.0 | 1.0 | 939340800.0 |
| 25% | 142114.25 | 0.0 | 0.0 | 4.0 | 1271290000.0 |
| 50% | 284227.5 | 0.0 | 1.0 | 5.0 | 1311120000.0 |
| 75% | 426340.75 | 2.0 | 2.0 | 5.0 | 1332720000.0 |
| max | 568454.0 | 866.0 | 923.0 | 5.0 | 1351210000.0 |
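The `Time` values in the summary look like Unix epoch seconds, which would be consistent with the Oct 1999 - Oct 2012 timespan stated at the top. A minimal sketch under that assumption, using only the standard library (the helper name `epoch_to_date` is illustrative and not part of the notebook):

```python
from datetime import datetime, timezone

def epoch_to_date(seconds):
    """Interpret a number as Unix epoch seconds (UTC) and return the calendar date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()

# Minimum and (rounded) maximum of the Time column as printed by df.describe()
earliest = epoch_to_date(939340800)
latest = epoch_to_date(1351210000)

print(earliest, latest)
```

If the assumption holds, both dates should fall inside the stated timespan, which is a quick sanity check on the column before it is dropped.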


## Stripping HTML (and XML) markup from the review text

In [14]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

## Removing text enclosed in square brackets

In [15]:
def remove_between_square_brackets(text):
    # Raw string avoids invalid-escape warnings for \[ and \]
    return re.sub(r'\[[^]]*\]', '', text)

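## Removing URLs

In [16]:
def remove_urls(text):
    # Named remove_urls so it no longer overwrites remove_between_square_brackets
    return re.sub(r'http\S+', '', text)

## Removing the noisy text

In [17]:
def denoise_text(text):
    # Strip HTML markup, then bracketed text, then URLs
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    text = remove_urls(text)
    return text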
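To see what the two regular-expression steps do, here is a small standalone sketch on a made-up sentence. It re-declares the two helpers so that it runs on its own; the name `remove_urls` and the sample string are chosen here for illustration only.

```python
import re

def remove_between_square_brackets(text):
    # Drop anything from "[" up to the next "]"
    return re.sub(r'\[[^]]*\]', '', text)

def remove_urls(text):
    # Drop "http" followed by any run of non-whitespace characters
    return re.sub(r'http\S+', '', text)

sample = "Great taffy [updated] see http://example.com/taffy for details"
cleaned = remove_urls(remove_between_square_brackets(sample))
print(repr(cleaned))
```

Both substitutions leave the surrounding whitespace in place, so the cleaned string can contain double spaces. That is harmless here because the later steps split the text on whitespace.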

In [18]:
#Apply function on review column

df['Text']=df['Text'].apply(denoise_text)

In [19]:
del df['Id']
del df['Time']
del df['UserId']
del df['ProductId']
del df['HelpfulnessNumerator']
del df['HelpfulnessDenominator']

In [20]:
df.head()

| | ProfileName | Score | Summary | Text |
|---|---|---|---|---|
| 0 | delmartian | 5 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | dll pa | 1 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | Natalia Corres "Natalia Corres" | 4 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | Karl | 2 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | Michael D. Bigham "M. Wassir" | 5 | Great taffy | Great taffy at a great price. There was a wid... |


In [21]:
df['Text'] = df['Text'] + ' ' + df['Summary'] + ' ' + df['ProfileName']

del df['Summary']

del df['ProfileName']

df.head()

| | Score | Text |
|---|---|---|
| 0 | 5 | I have bought several of the Vitality canned d... |
| 1 | 1 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 4 | This is a confection that has been around a fe... |
| 3 | 2 | If you are looking for the secret ingredient i... |
| 4 | 5 | Great taffy at a great price. There was a wid... |


## Replacing scores of 1, 2, 3 with 0 (not good) and 4, 5 with 1 (good)

In [22]:
def score_sentiment(score):
    # Scores of 1-3 map to 0 (not good); anything else (4 or 5) maps to 1 (good)
    if score in (1, 2, 3):
        return 0
    return 1

In [23]:
df.Score = df.Score.apply(score_sentiment)

In [24]:
df.head()

| | Score | Text |
|---|---|---|
| 0 | 1 | I have bought several of the Vitality canned d... |
| 1 | 0 | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 1 | This is a confection that has been around a fe... |
| 3 | 0 | If you are looking for the secret ingredient i... |
| 4 | 1 | Great taffy at a great price. There was a wid... |


In [25]:
df.isna().sum()

Score     0
Text     43
dtype: int64

In [26]:
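# Replace missing review text with an empty string.
# Assigning the result avoids calling fillna(inplace=True) on a column accessor.
df['Text'] = df['Text'].fillna("")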
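The 43 missing `Text` values reported above plausibly come from the earlier concatenation step. In pandas, adding a string to a missing value yields a missing value, so a row whose `Summary` or `ProfileName` is absent loses its review text as well. The sketch below illustrates this on a hypothetical toy frame (the rows are made up) and shows one possible way to avoid it, by filling the optional columns before concatenating.

```python
import pandas as pd

# Hypothetical toy data: the second row has no Summary
toy = pd.DataFrame({
    "Text": ["great taffy", "stale chips"],
    "Summary": ["Yum", None],
    "ProfileName": ["Karl", "dll pa"],
})

# Concatenating as-is: the missing Summary makes the whole result missing
naive = toy["Text"] + " " + toy["Summary"] + " " + toy["ProfileName"]

# Filling the optional columns first keeps the review text
safe = toy["Text"] + " " + toy["Summary"].fillna("") + " " + toy["ProfileName"].fillna("")

print(naive.isna().sum(), safe.isna().sum())
```

With this ordering, filling `Text` afterwards would only be needed for rows where the review text itself is missing.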

In [27]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Karan
[nltk_data]     khatri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [28]:
stop = set(stopwords.words('english'))

punctuation = list(string.punctuation)

stop.update(punctuation)

## Removing stop words and stemming (stripping common morphological and inflectional endings from English words)

In [29]:
stemmer = PorterStemmer()
def stem_text(text):
    final_text = []
    for i in text.split():
        if i.strip().lower() not in stop:
            word = stemmer.stem(i.strip())
            final_text.append(word)
    return " ".join(final_text)
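One detail of this approach is worth seeing in isolation: `text.split()` splits on whitespace only, so punctuation attached to a word stays attached, and adding `string.punctuation` to the stop set only removes punctuation marks that stand alone as tokens. The sketch below shows just the filtering step, with a tiny hypothetical stop set standing in for the NLTK list and with stemming left out so that it runs without NLTK.

```python
import string

# Hypothetical stand-in for stopwords.words('english'), kept tiny on purpose
stop = {"a", "at", "the", "was"}
stop.update(string.punctuation)

def drop_stop_tokens(text):
    # Same test as in stem_text: compare the lowercased token against the stop set
    return [tok for tok in text.split() if tok.strip().lower() not in stop]

tokens = drop_stop_tokens("Great taffy at a great price . There was a wide assortment!")
print(tokens)
```

The standalone "." is dropped, while "assortment!" keeps its exclamation mark, so the vectorizer downstream is what ultimately decides how such tokens are split.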

In [31]:
df.Text = df.Text.apply(stem_text)

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [33]:
x_train,x_test,y_train,y_test = train_test_split(df.Text,df.Score,random_state = 0)
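Since neither `test_size` nor `train_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows, rounding the test count up. A quick arithmetic check with the row count from `df.shape[0]`:

```python
import math

n_samples = 568454     # rows in the dataset, from df.shape[0]
test_fraction = 0.25   # default when test_size and train_size are both unset

n_test = math.ceil(n_samples * test_fraction)  # test rows are rounded up
n_train = n_samples - n_test                   # the remainder is used for training

print(n_train, n_test)
```

These counts agree with the row dimensions of the train and test matrices printed in the feature-extraction step.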

## Feature extraction

In [34]:
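# Bag of words over unigrams and bigrams.
# NOTE: max_df=1 is an int, so scikit-learn treats it as an absolute document
# count and keeps only terms that occur in a single training document.
# max_df=1.0 (a float) would instead keep terms however common they are.
# The outputs recorded in this notebook were produced with max_df=1.
cv=CountVectorizer(min_df=0,max_df=1,ngram_range=(1,2))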

#transformed train reviews
cv_train_reviews=cv.fit_transform(x_train)

#transformed test reviews
cv_test_reviews=cv.transform(x_test)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

BOW_cv_train: (426340, 2485966)
BOW_cv_test: (142114, 2485966)
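The `max_df` argument behaves differently depending on its type: scikit-learn reads an int as an absolute document count and a float as a proportion of documents. The pure-Python sketch below mimics that rule on a made-up three-document corpus. `filter_by_document_frequency` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
from collections import Counter
import numbers

def filter_by_document_frequency(docs, max_df):
    """Keep terms whose document frequency does not exceed max_df.

    An int max_df is an absolute document count; a float is a fraction of the corpus.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    limit = max_df if isinstance(max_df, numbers.Integral) else max_df * len(docs)
    return sorted(term for term, count in doc_freq.items() if count <= limit)

docs = ["good taffy", "bad taffy", "good coffee"]
rare_only = filter_by_document_frequency(docs, max_df=1)    # int: at most one document
all_terms = filter_by_document_frequency(docs, max_df=1.0)  # float: up to 100% of documents

print(rare_only)
print(all_terms)
```

With the int form, the words shared across documents ("good", "taffy") are discarded and only the one-off terms survive, so the choice between `1` and `1.0` changes the vocabulary substantially.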


In [35]:
from sklearn.linear_model import LogisticRegression

## Fitting a logistic regression model on the bag-of-words features

In [36]:
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=0)

#Fitting the model for Bag of words

lr_bow=lr.fit(cv_train_reviews,y_train)

print(lr_bow)

LogisticRegression(C=1, max_iter=500, random_state=0)


## Predicting with the bag-of-words model

In [37]:
#Predicting the model for bag of words

lr_bow_predict=lr.predict(cv_test_reviews)

In [38]:
#Accuracy score for bag of words

lr_bow_score=accuracy_score(y_test,lr_bow_predict)

print("lr_bow_score :",lr_bow_score)

lr_bow_score : 0.8005052281970813


## Classification report for bag of words

In [39]:
lr_bow_report = classification_report(y_test,lr_bow_predict)

print(lr_bow_report)

              precision    recall  f1-score   support

           0       0.98      0.09      0.17     31133
           1       0.80      1.00      0.89    110981

    accuracy                           0.80    142114
   macro avg       0.89      0.55      0.53    142114
weighted avg       0.84      0.80      0.73    142114
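The class supports in the report make it possible to put the 0.80 accuracy in context. Always predicting the majority class (1) would already be right for every positive review, so the useful comparison is against that baseline. A short calculation using only the numbers printed above:

```python
support_negative = 31133    # class 0 rows in the test split
support_positive = 110981   # class 1 rows in the test split
reported_accuracy = 0.8005052281970813

# Accuracy of a classifier that always predicts class 1
baseline_accuracy = support_positive / (support_negative + support_positive)
lift = reported_accuracy - baseline_accuracy

print(f"baseline={baseline_accuracy:.4f} lift={lift:.4f}")
```

The model beats the majority-class baseline by only about two percentage points, which fits the recall of 0.09 for class 0: almost every review is being labelled as positive.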

