## Sentiment Analysis of Amazon Fine Food Reviews
____


**by Michael Xiao**

### Overview:

1. How do our data look?
2. Importing libraries
3. Cleaning and EDA
4. Feature Engineering
5. Preprocessing (NLTK)
6. VADER Sentiment Analysis
7. TextBlob Sentiment Analysis
8. Hybrid Score
9. Removing Users and Products with few transactions
10. Final DF

## How do our data look?

![](../pictures/fine_food_review_screenshot.jpg)

![](../pictures/fine_food_review_illustration.jpg)

## Importing libraries

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

pd.set_option('display.width', 5000)
pd.set_option('display.max_rows', 500)     #ease of viewing
pd.set_option('display.max_columns', 120)
pd.set_option('display.max_colwidth', 500)

## 1.0 Cleaning and EDA

In [30]:
df = pd.read_csv('../data/fine_food_reviews.csv')
df['Summary'] = df['Summary'].fillna(' ')       #fill missing summaries so the concatenation below never yields NaN
df.drop_duplicates("Text", inplace=True)        #drop rows whose review text is repeated (possible fake or duplicated reviews)
df['Review'] = df['Summary'] + ' ' + df['Text'] #analyse the summary and the review text together as one string
df.info()                                       #174,875 rows with repeated review text were dropped

<class 'pandas.core.frame.DataFrame'>
Int64Index: 393579 entries, 0 to 568453
Data columns (total 11 columns):
Id                        393579 non-null int64
ProductId                 393579 non-null object
UserId                    393579 non-null object
ProfileName               393568 non-null object
HelpfulnessNumerator      393579 non-null int64
HelpfulnessDenominator    393579 non-null int64
Score                     393579 non-null int64
Time                      393579 non-null int64
Summary                   393579 non-null object
Text                      393579 non-null object
Review                    393579 non-null object
dtypes: int64(5), object(6)
memory usage: 36.0+ MB


## 1.1 Feature Engineering

In [31]:
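df['helpfulness_score'] = df.HelpfulnessNumerator/df.HelpfulnessDenominator #share of voters who found the review helpful
df.isnull().sum() #184,496 reviews have a null helpfulness_score: they received no votes, so the ratio is 0/0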

Id                             0
ProductId                      0
UserId                         0
ProfileName                   11
HelpfulnessNumerator           0
HelpfulnessDenominator         0
Score                          0
Time                           0
Summary                        0
Text                           0
Review                         0
helpfulness_score         184496
dtype: int64
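
The null counts above come from dividing by a denominator of zero. As a minimal sketch in plain Python (the function name `helpfulness_ratio` and the sample votes are illustrative assumptions, not part of the notebook), the same ratio can be written so that the "no votes" case is explicit:

```python
import math

def helpfulness_ratio(numerator, denominator):
    """Share of voters who found a review helpful, or NaN when nobody voted."""
    if denominator == 0:
        return math.nan  # no votes: the ratio is undefined
    return numerator / denominator

# Hypothetical (numerator, denominator) pairs
votes = [(1, 1), (0, 0), (3, 4)]
ratios = [helpfulness_ratio(n, d) for n, d in votes]
print(ratios)
```

Reviews whose ratio is NaN carry no helpfulness signal, which is why the next cell drops them.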

In [32]:
df.drop(labels=['ProfileName','HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time', #drop features 
               'Summary','Text'], axis=1, inplace=True)
print('Unique users: ', df.UserId.nunique())
print('Unique products: ', df.ProductId.nunique())
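df.dropna(inplace=True) #only helpfulness_score still has nulls, so this drops the reviews that received no helpfulness votes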
df.info()

Unique users:  256044
Unique products:  67554
<class 'pandas.core.frame.DataFrame'>
Int64Index: 209083 entries, 0 to 568452
Data columns (total 6 columns):
Id                   209083 non-null int64
ProductId            209083 non-null object
UserId               209083 non-null object
Score                209083 non-null int64
Review               209083 non-null object
helpfulness_score    209083 non-null float64
dtypes: float64(1), int64(2), object(3)
memory usage: 11.2+ MB


In [5]:
df.to_csv('../data/fine_food_reviews_cleaned.csv')

## 1.2 Preprocessing with NLTK

In [6]:
#df = pd.read_csv('fine_food_reviews_cleaned.csv')

In [28]:
tokenizer = RegexpTokenizer(r'\w+')
lemmatizer = WordNetLemmatizer()

def standardize_text(df, text_field):
    
    df['tokenized_reviews'] = df[text_field].apply(tokenizer.tokenize)                              #tokenize first; the \w+ pattern already drops punctuation
    #df[text_field] = df[text_field].str.lower()
    #the four replacements below clean text_field itself; the tokens above were taken before them, so digits and link fragments remain in the tokens
    df[text_field] = df[text_field].str.replace(r"\(https?:\/\/.*[\r\n]*", "", regex=True)          #links in parentheses
    df[text_field] = df[text_field].str.replace(r"<a.*?</a>", "", regex=True)                       #html anchor tags
    df[text_field] = df[text_field].str.replace(r"[^\w\s]", "", regex=True)                         #punctuation
    df[text_field] = df[text_field].str.replace(r"\d+", "", regex=True)                             #digits
    
    df['lemmatized'] = df['tokenized_reviews'].apply(lambda x: [lemmatizer.lemmatize(i) for i in x]) #lemmatize each token
    
    words = stopwords.words('english') #stopwords
    words.extend(['was','have','has','were', 'im', 'ill', 'id', 'wa', 'i', 'I','ha', 'The', 'br'])  #extending stopwords (matching is case-sensitive)
    words = set(words)                                                                              #set lookups are faster than scanning a list
    df['lemmatized'] = df['lemmatized'].apply(lambda x: [w for w in x if w not in words])           #applying back to df
    df['reviews'] = df['lemmatized'].apply(lambda x: ' '.join(x))
    
    return df

In [33]:
standardize_text(df, 'Review') 
df.drop(['Review','tokenized_reviews','lemmatized'], axis=1, inplace=True)
df.head(3)

Unnamed: 0,Id,ProductId,UserId,Score,helpfulness_score,reviews
0,1,B001E4KFG0,A3SGXH7AUHU8GW,5,1.0,Good Quality Dog Food bought several Vitality canned dog food product found good quality product look like stew processed meat smell better My Labrador finicky appreciates product better
2,3,B000LQOCH0,ABXLMWJIXXAIN,4,1.0,Delight say This confection around century It light pillowy citrus gelatin nut case Filberts And cut tiny square liberally coated powdered sugar And tiny mouthful heaven Not chewy flavorful highly recommend yummy treat If familiar story C S Lewis Lion Witch Wardrobe treat seduces Edmund selling Brother Sisters Witch
3,4,B000UA0QIQ,A395BORC6FGVXV,2,1.0,Cough Medicine If looking secret ingredient Robitussin believe found got addition Root Beer Extract ordered good made cherry soda flavor medicinal
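
For comparison, here is a minimal sketch of a clean-then-tokenize order using only the standard library. It is not the notebook's pipeline: the tiny `STOPWORDS` set, the function name `clean_and_tokenize`, and the sample sentence are assumptions for illustration, and no lemmatization is performed. It shows how stripping markup, links and digits before tokenizing keeps them out of the tokens:

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list plus extras.
STOPWORDS = {"the", "a", "is", "i", "it", "this", "br"}

def clean_and_tokenize(text):
    """Strip markup, links and digits, then tokenize and drop stopwords."""
    text = re.sub(r"<a.*?</a>", " ", text)      # anchor tags together with their contents
    text = re.sub(r"<[^>]+>", " ", text)        # any remaining tags such as <br />
    text = re.sub(r"https?://\S+", " ", text)   # bare URLs
    text = re.sub(r"\d+", " ", text)            # digits
    tokens = re.findall(r"\w+", text)           # same pattern as RegexpTokenizer(r'\w+')
    return [t for t in tokens if t.lower() not in STOPWORDS]

sample = "This is the BEST tea<br />I bought 2 boxes, see http://example.com"
print(clean_and_tokenize(sample))
```

Comparing stopwords in lowercase while keeping the original token preserves capitalisation, which VADER can use as an intensity cue.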


In [9]:
df.to_csv('../data/fine_food_reviews_nltk.csv')

## 1.3 VADER Sentiment Analysis 

![](../pictures/Vader.jpeg)

In [10]:
#df = pd.read_csv('fine_food_reviews_nltk.csv')

In [11]:
vader = SentimentIntensityAnalyzer() 

vader_score = []
for review in df.reviews:
    vader_score.append(vader.polarity_scores(review))               #get polarity scores for each review in df.reviews
    
vader_score = pd.DataFrame(vader_score)                             #put results into a dataframe
vader_score.columns = ['V_'+col for col in vader_score.columns] 
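#number the scores 1..N; df.Id has gaps after the drops above, so a score only lands on its own review while Id is consecutive
#(assigning df['Id'].values instead would keep each score with the review it was computed from)
vader_score['Id'] = np.arange(1, len(vader_score)+1) 
df1 = df.merge(vader_score, on='Id',how='right')                    #right merge back to become df1; Ids with no matching review get nulls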

In [12]:
df1.dropna(inplace=True) 
#drop rows with nulls: after the right merge these are the Ids in vader_score (1..N) that have no matching Id in df
df1.shape

(88874, 10)
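
Because the merge joins on `Id`, each score has to carry the `Id` of the review it was computed from. The sketch below illustrates the difference in plain Python; `toy_score` is a hypothetical stand-in for VADER's compound score and the three reviews are made up:

```python
def toy_score(text):
    """Hypothetical stand-in for vader.polarity_scores(text)['compound']."""
    positive, negative = {"good", "great"}, {"bad", "nasty"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

# Ids with gaps, as left behind by drop_duplicates / dropna
reviews = [(1, "good dog food"), (3, "great treat"), (4, "nasty cough medicine")]

# Carry the real Id alongside each score
scores_by_id = {review_id: toy_score(text) for review_id, text in reviews}

# Renumbering 1..N instead attaches the third review's score to Id 3
renumbered = {i + 1: toy_score(text) for i, (_, text) in enumerate(reviews)}

print(scores_by_id)
print(renumbered)
```

In the renumbered mapping, Id 3 ("great treat") receives the score computed for "nasty cough medicine", while keying by the real Id keeps the two apart.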

In [13]:
df1.sort_values(by='V_compound')

Unnamed: 0,Id,ProductId,UserId,Score,helpfulness_score,reviews,V_compound,V_neg,V_neu,V_pos
44734,96436,B0001WA6D8,A2ZFSD8JS441HI,5.0,1.000000,Really good first Sweden year ago love flavor Lemon least favorite good,-0.9997,0.243,0.658,0.099
84167,194777,B0046IIPMW,A20B83UTKGOZ48,5.0,1.000000,Creamy buttery flavorful heaven tried number Ghee brand hand favorite It taste like melted butter removed froth top Luscious use make egg saute vegetable It never burned oxidized even high heat browning Another benefit maker use grass fed cow make ghee totally hooked,-0.9978,0.254,0.688,0.058
86606,202560,B000MWYAMU,AKJXIL2VS3Y36,5.0,1.000000,Awesome taste love stuff Wild Ride Hoppin Hickory NO nitrite NO MSG It made natural ingredient healthy get still eating beef It also low calorie An entire 4 ounce bag 240 calorie 60 calorie per serving high protein low fat watching weight Wild Ride welcome addition balanced diet Essentially sliced steak bag An average piece bite size one inch square 1 8 inch thick Keep mind Wild Ride definitely NOT shoe leather tough like jerky product thus traditional beef jerky lover may put tender quality...,-0.9963,0.226,0.685,0.089
9437,18875,B000F4H5FY,A2T34X1FM76O2Z,5.0,1.000000,great tea This tea good quit good mighty leaf good Would buy,-0.9932,0.263,0.520,0.217
9678,19393,B002GP405S,A1FDVT0DLJWV78,4.0,0.500000,Good ingredient okay taste actually ordered accident thinking getting original Tandoor Chef Naan While okay really prefer original Naan still gave four star really like garlic great make really good pizza base However original continues king book,-0.9932,0.270,0.620,0.110
16766,35165,B000E1DSQS,A17VOMKQE7ZB83,5.0,1.000000,Maxwell House Coffee Hazelnut flavor coffee still favorite coffee It good expected would order Nice LOTS available kitchen D Heliotis,-0.9920,0.437,0.438,0.125
41462,88924,B000F3WS7K,A2GZ3O8AHD3PI2,5.0,1.000000,Yorkshire Gold Really good kind strong bitter Has mouthfeel tea might make good substitute coffee,-0.9918,0.260,0.604,0.135
74913,170005,B0012MV2RA,ACPFMGGSK8BA1,3.0,0.000000,Too much caffeine Taste good think stick regular bean without caffeine Good need pick training though,-0.9903,0.269,0.666,0.065
8710,17603,B000CMIYWC,A2695DQ3DE1Q05,5.0,0.928571,Greatest Tea Ever It truly work wonder seasonal asthma ever since child tried rather unsuccessfully overcome excercise inhaler medication NOTHING worked well Yogi Breathe Deep tea It like miracle When breathing feeling restricted drink hot cup delicious tea bed sends nice deep sleep When wake breathing 100 better feel great next 1 2 day That exaggeration reccommend tea asthmatic anyone suffering stress anxiety lack sleep anyone appreciates good tea Thank thank thank Yogi Tea people May succe...,-0.9899,0.295,0.583,0.121
41431,88880,B001HTI58M,AYUPCH12CIR4F,1.0,0.000000,gimmick It gimmick Do waste money Water regulated company claim doe thing,-0.9898,0.210,0.602,0.188


In [14]:
df1.to_csv('../data/fine_food_reviews_vader.csv')

In [15]:
df1.head(1)

Unnamed: 0,Id,ProductId,UserId,Score,helpfulness_score,reviews,V_compound,V_neg,V_neu,V_pos
0,1,B001E4KFG0,A3SGXH7AUHU8GW,5.0,1.0,Good Quality Dog Food bought several Vitality canned dog food product found good quality product look like stew processed meat smell better My Labrador finicky appreciates product better,0.9565,0.0,0.516,0.484


## 1.4 TextBlob Sentiment Analysis 

In [16]:
#df1 = pd.read_csv('fine_food_reviews_vader1.csv')

In [17]:
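def sentiment_text(text):
    try:
        return TextBlob(text).sentiment                        #namedtuple of (polarity, subjectivity)
    except Exception:                                          #e.g. non-string input; avoid a bare except that also hides KeyboardInterrupt
        return None

df['sentiment'] = df['reviews'].apply(sentiment_text)          #applying function to df.reviews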

sentiment = df['sentiment'].tolist()                           #breaking polarity and subjectivity scores up into another DataFrame
columns = ['T_polarity', 'T_subjectivity']
df2 = pd.DataFrame(sentiment, columns=columns, index=df.index) 

In [18]:
df2.reset_index(inplace=True)
df1.reset_index(inplace=True)
df1 = df1.merge(df2, on='index')    #merge back to df1
df1.head(1)

Unnamed: 0,index,Id,ProductId,UserId,Score,helpfulness_score,reviews,V_compound,V_neg,V_neu,V_pos,T_polarity,T_subjectivity
0,0,1,B001E4KFG0,A3SGXH7AUHU8GW,5.0,1.0,Good Quality Dog Food bought several Vitality canned dog food product found good quality product look like stew processed meat smell better My Labrador finicky appreciates product better,0.9565,0.0,0.516,0.484,0.48,0.44
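
The cell above relies on each sentiment being a `(polarity, subjectivity)` pair. As a hedged sketch, the split into two columns can be made explicit, including the `None` that `sentiment_text` returns on failure. The `Sentiment` namedtuple here is a local stand-in mirroring the shape of TextBlob's result, and `split_sentiments` is a hypothetical helper:

```python
from collections import namedtuple

# Local stand-in mirroring the (polarity, subjectivity) pair TextBlob returns
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

def split_sentiments(sentiments):
    """Turn a list of Sentiment-or-None values into two parallel lists."""
    polarity, subjectivity = [], []
    for s in sentiments:
        polarity.append(None if s is None else s.polarity)
        subjectivity.append(None if s is None else s.subjectivity)
    return polarity, subjectivity

results = [Sentiment(0.48, 0.44), None, Sentiment(-0.5, 0.9)]
t_polarity, t_subjectivity = split_sentiments(results)
print(t_polarity, t_subjectivity)
```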


In [19]:
df1[['reviews', 'V_compound', 'T_polarity' ]]

Unnamed: 0,reviews,V_compound,T_polarity
0,Good Quality Dog Food bought several Vitality canned dog food product found good quality product look like stew processed meat smell better My Labrador finicky appreciates product better,0.9565,0.480000
1,Cough Medicine If looking secret ingredient Robitussin believe found got addition Root Beer Extract ordered good made cherry soda flavor medicinal,0.8225,0.187000
2,Yay Barley Right mostly sprouting cat eat grass They love rotate around Wheatgrass Rye,0.8271,0.150000
3,Strawberry Twizzlers Yummy Strawberry Twizzlers guilty pleasure yummy Six pound around son,0.9819,0.428571
4,Nasty No flavor candy red No flavor Just plan chewy would never buy,0.9451,0.262500
5,Great Bargain Price glad Amazon carried battery hard time finding elsewhere unique size need garage door opener Great deal price,0.8720,0.196875
6,THIS IS MY TASTE This offer great price great taste thanks Amazon selling product Staral,0.9499,0.099621
7,Best Instant Oatmeals McCann Instant Oatmeal great must oatmeal scrape together two three minute prepare There escaping fact however even best instant oatmeal nowhere near good even store brand oatmeal requiring stovetop preparation Still McCann good get instant oatmeal It even better organic natural brand tried All variety McCann variety pack taste good It prepared microwave adding boiling water convenient extreme time issue McCann use actual cane sugar instead high fructose corn syrup help...,0.9271,0.533333
8,Good Instant This good instant oatmeal best oatmeal brand It us cane sugar instead high fructouse corn syrup doe better sweetness doctor say form sugar better Great cold morning time make McCann Steel Cut Oats apple cinnamon best maple brown sugar regular good Plus require doctoring actually tell three flavor apart,0.7783,-0.500000
9,Great Irish oatmeal hurry Instant oatmeal become soggy minute water hit bowl McCann Instant Oatmeal hold texture excellent flavor good time McCann regular oat meal excellent may take bit longer prepare time morning This best instant brand ever eaten close second non instant variety McCann Instant Irish Oatmeal Variety Pack Regular Apples Cinnamon Maple Brown Sugar 10 Count Boxes Pack 6,0.0258,0.316667


## 1.5 Hybrid Score

In [20]:
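df1['composite'] = df1['V_compound'] * df1['T_polarity']     #product of two scores in [-1, 1], so it also lies in [-1, 1]; positive when both agree in sign (including two negatives), negative when they disagree
df1.sort_values(by='composite')                              #returns a sorted copy for inspection; df1 itself is unchanged
df1['hybrid_score'] = df1['composite'] * df1['helpfulness_score']    #weight the composite by how helpful voters found the review
print(df1[['Score','hybrid_score']].corr())                                 #weak correlation (about 0.11): Score may not be an accurate indicator of buyers' sentiments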

                 Score  hybrid_score
Score         1.000000      0.107987
hybrid_score  0.107987      1.000000
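
To make the behaviour of the score concrete, here is a small sketch of the same arithmetic on single values. The function name `hybrid_score` and the inputs after the first call are illustrative assumptions; the first call uses the scores of the first review shown earlier:

```python
def hybrid_score(v_compound, t_polarity, helpfulness):
    """Product of the two sentiment scores, weighted by the helpfulness ratio."""
    return v_compound * t_polarity * helpfulness

print(hybrid_score(0.9565, 0.48, 1.0))   # both scores positive -> positive
print(hybrid_score(-0.8, -0.5, 1.0))     # both scores negative -> also positive
print(hybrid_score(0.9, -0.5, 0.5))      # scores disagree -> negative, halved by helpfulness
print(hybrid_score(0.9, 0.5, 0.0))       # nobody found the review helpful -> 0
```

Because the score is a product, it measures agreement between the two analysers rather than direction alone: two strongly negative scores yield a positive value.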


In [21]:
df1 = df1[df1.hybrid_score != 0] #removing reviews with a hybrid_score of 0
df1.ProductId.nunique() #number of unique products

8929

## 1.6 Removing Users and Products with few transactions

To make the recommender system more robust, we removed products that had only one review and users that had only one purchase (measured as one review).

Caveat:
- The RecSys only recommends similar products to **existing users**.
- It cannot accurately infer what new users may be looking for; recommendations are based on users' purchase history and hybrid scores.

In [22]:
reviews_count = df1['reviews'].groupby(df1['ProductId']).count().reset_index() #number of reviews per product
one_review = reviews_count[reviews_count['reviews'] == 1]
one_review_list = list(one_review['ProductId'])
df2 = df1[~df1.ProductId.isin(one_review_list)] #drop items with only one review
df2.nunique()

index                32108
Id                   32108
ProductId             4985
UserId               28395
Score                    5
helpfulness_score      467
reviews              32099
V_compound            4776
V_neg                  486
V_neu                  754
V_pos                  761
T_polarity           16660
T_subjectivity       14391
composite            31508
hybrid_score         31721
dtype: int64

In [23]:
user_count = df2['reviews'].groupby(df2['UserId']).count().reset_index() #number of purchases per user (in terms of reviews given)
one_purch = user_count[user_count['reviews'] == 1]
one_purch_list = list(one_purch['UserId'])
df3 = df2[~df2.UserId.isin(one_purch_list)] #drop users with only one purchase (one review)
df3.nunique()

index                5908
Id                   5908
ProductId            2569
UserId               2195
Score                   5
helpfulness_score     197
reviews              5899
V_compound           1844
V_neg                 352
V_neu                 646
V_pos                 662
T_polarity           4049
T_subjectivity       3657
composite            5882
hybrid_score         5886
dtype: int64
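
The two filtering steps above can be sketched in plain Python with `collections.Counter`. The helper name `drop_singletons` and the sample rows are hypothetical; like the cells above, it makes a single pass (products first, then users) and does not recount products after users are removed:

```python
from collections import Counter

def drop_singletons(rows):
    """rows: list of (user_id, product_id, score). Products are filtered first, then users."""
    product_counts = Counter(p for _, p, _ in rows)
    rows = [r for r in rows if product_counts[r[1]] > 1]   # keep products with 2+ reviews
    user_counts = Counter(u for u, _, _ in rows)
    return [r for r in rows if user_counts[r[0]] > 1]      # keep users with 2+ reviews

rows = [
    ("u1", "p1", 0.4), ("u1", "p2", 0.1),
    ("u2", "p1", 0.3), ("u3", "p2", 0.2),
    ("u2", "p3", 0.5),  # p3 has a single review
]
kept = drop_singletons(rows)
print(kept)
```

In this toy data only `u1` survives, and each remaining product is then left with a single review. Repeating the two filters until nothing changes would be one way to guarantee both conditions at once, at the cost of a smaller dataset.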

## Final DF

In [46]:
final_df = df3[['UserId', 'ProductId', 'hybrid_score']]
final_df.to_csv('../data/cleaned_df_with_SA.csv')
final_df.head()

Unnamed: 0,UserId,ProductId,hybrid_score
7,AOVROBZ8BNTP7,B001EO5QW8,0.494453
17,A27TKQHFW0FB5N,B001GVISJC,0.069048
74,A37AO20OXS51QA,B001UJEN6C,0.223402
80,A2LFHPZFG1OHBZ,B001UJEN6C,0.092079
106,ALSAOZ1V546VT,B001ELL6O8,0.403563


