# Sentiment Analysis via food.com Word Embedding
In this notebook, we load a word embedding and a sentiment analyzer, both trained specifically on food.com reviews, to predict sentiments on reviews for the NYT Salted Tahini Chocolate Chip Cookies recipe (https://cooking.nytimes.com/recipes/1018055-salted-tahini-chocolate-chip-cookies). Our hypothesis is that this would produce better accuracy than using a generic word embedding and/or sentiment analyzer.

The data set used to train these models can be found here: https://www.kaggle.com/irkaal/foodcom-recipes-and-reviews

This notebook is part of a collaborative project completed for The Erdos Institute's Code 2021 Data Science Boot camp. The work in this notebook was completed by Anila Yadavalli. The other teammates are Shirley Li, Nida Obtake, and Enkhzaya Enkhtaivan.

In [1]:
import re 
import pandas as pd

from time import time 
from collections import defaultdict 

import spacy 
import numpy as np
import gensim
from tqdm.notebook import tqdm



We are reading in the .csv file which contains the cleaned reviews from the food.com dataset as a dataframe. This is so that we can train the sentiment analyzer later. 

The cleaning consists of these steps:

- Remove non-alphabetic characters.

- Lemmatize the words (e.g. 'ran', 'run', 'running', 'runs' all become 'run').

- Create bigrams of common words that appear together (e.g. 'chocolate chip' becomes 'chocolate_chip').
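
As a rough illustration of the first and third steps, here is a minimal standard-library sketch. It is not the cleaning code that produced the .csv file: the function names `strip_non_alpha` and `join_bigrams`, the `min_count` cutoff, and the two toy reviews are all made up for this example, and lemmatization is left out because it needs a language model.

```python
import re
from collections import Counter

def strip_non_alpha(text):
    # Replace anything that is not a letter or whitespace with a space,
    # then collapse runs of whitespace.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", letters_only).strip()

def join_bigrams(token_lists, min_count=2):
    # Count adjacent word pairs over the whole (toy) corpus.
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))

    joined = []
    for tokens in token_lists:
        out, i = [], 0
        while i < len(tokens):
            # Merge a pair into one token if it is frequent enough.
            if i + 1 < len(tokens) and pair_counts[(tokens[i], tokens[i + 1])] >= min_count:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        joined.append(out)
    return joined

# Hypothetical reviews, for illustration only.
reviews = ["Best chocolate chip cookies!!!", "I added 2 cups of chocolate chip."]
tokens = [strip_non_alpha(r).lower().split() for r in reviews]
bigrammed = join_bigrams(tokens)
print(bigrammed)
```

Here 'chocolate chip' is merged because the pair occurs twice in the toy corpus, while pairs that occur only once are left alone.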

In [2]:
df = pd.read_csv('cleaned_reviews_with_ratings_and_stops.csv') 
df.head()

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified,clean
0,2,992,2008,gayg msft,5,better than any you can get at a restaurant!,2000-01-25T21:44:00Z,2000-01-25T21:44:00Z,well than any you can get at a restaurant
1,7,4384,1634,Bill Hilbrich,4,"I cut back on the mayo, and made up the differ...",2001-10-17T16:49:59Z,2001-10-17T16:49:59Z,I cut back on the mayo and make up the differe...
2,9,4523,2046,Gay Gilmore ckpt,2,i think i did something wrong because i could ...,2000-02-25T09:00:00Z,2000-02-25T09:00:00Z,I think I do something wrong because I could t...
3,13,7435,1773,Malarkey Test,5,easily the best i have ever had. juicy flavor...,2000-03-13T21:15:00Z,2000-03-13T21:15:00Z,easily the good I have ever have juicy flavorf...
4,14,44,2085,Tony Small,5,An excellent dish.,2000-03-28T12:51:00Z,2000-03-28T12:51:00Z,an excellent dish


We remove any reviews that have a 0-star rating. A 0-star rating indicates that the reviewer did not select a star rating, so the sentiment for such reviews cannot be classified. This removes about 76,000 reviews from the food.com dataset.

In [3]:
print(len(df))
df = df[df.Rating != 0].reset_index(drop = True)
df.sample(20)
print(len(df))

1401768
1325520


We also assign a numeric score to the sentiment of a review based on the star-rating. Ratings of 4 and 5 are classified as positive and ratings of 1-3 are classified as negative. 

In [4]:
df['Sentiment'] = (df['Rating'] > 3).astype(int)
df.head()

Unnamed: 0,ReviewId,RecipeId,AuthorId,AuthorName,Rating,Review,DateSubmitted,DateModified,clean,Sentiment
0,2,992,2008,gayg msft,5,better than any you can get at a restaurant!,2000-01-25T21:44:00Z,2000-01-25T21:44:00Z,well than any you can get at a restaurant,1
1,7,4384,1634,Bill Hilbrich,4,"I cut back on the mayo, and made up the differ...",2001-10-17T16:49:59Z,2001-10-17T16:49:59Z,I cut back on the mayo and make up the differe...,1
2,9,4523,2046,Gay Gilmore ckpt,2,i think i did something wrong because i could ...,2000-02-25T09:00:00Z,2000-02-25T09:00:00Z,I think I do something wrong because I could t...,0
3,13,7435,1773,Malarkey Test,5,easily the best i have ever had. juicy flavor...,2000-03-13T21:15:00Z,2000-03-13T21:15:00Z,easily the good I have ever have juicy flavorf...,1
4,14,44,2085,Tony Small,5,An excellent dish.,2000-03-28T12:51:00Z,2000-03-28T12:51:00Z,an excellent dish,1


The cleaning generated empty values for some reviews. We drop those here. 

In [5]:
df.dropna(inplace = True)
df.reset_index(drop = True, inplace = True)

Now we load our word embedding which was trained on the food.com data. See https://colab.research.google.com/drive/1uqw557Y0l4dOIxO_jTZHUX6Zec9T7Pkl?usp=sharing
for details.

In [6]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec

# have to change the path if connecting to Google Drive.
model = Word2Vec.load("recipes2.model")

Eventually, we need to create an array whose rows consist of word vectors for each review, so we need to know the length of the longest review. Alternatively, you can just pick a maximum length.

In [7]:
# find max sentence length
max_val = 0
idx = 0
for i, review in tqdm(enumerate(df['clean'])):
    if len(review.split(' ')) > max_val:
        max_val = len(review.split(' '))
        idx = i
print("max:", max_val, "index:", idx)


max: 1276 index: 1213989
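
If you would rather pick a maximum length than use the longest review, one option is to choose a cap that covers most reviews. The sketch below is only an illustration: `length_cap`, the `coverage_pct` parameter, and the toy lengths are assumptions, not part of the original notebook.

```python
def length_cap(lengths, coverage_pct=99):
    # Smallest length L such that at least coverage_pct percent
    # of the reviews have L words or fewer.
    ordered = sorted(lengths)
    k = max(0, (coverage_pct * len(ordered) + 99) // 100 - 1)
    return ordered[k]

# Hypothetical word counts with one very long outlier.
toy_lengths = [5, 8, 12, 9, 7, 1276, 15, 6, 10, 11]
cap = length_cap(toy_lengths, coverage_pct=90)
print(max(toy_lengths), cap)
```

A smaller cap shrinks every sentence matrix, at the cost of truncating the few reviews that are longer than the cap.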


We create a function that reads in a review and produces a matrix whose rows are the word vectors for each word in the review.

In [8]:
def sentence_to_matrix(sentence, maxlen = max_val, model = model):
    # Takes in a sentence as a string and outputs a matrix whose rows
    # are word vectors for each word in the sentence.
    sentence_matrix = np.zeros((maxlen, 300))

    # Split the input sentence into words and reverse the list once.
    # We populate the matrix from the back because it needs to be
    # front-padded with zeros (otherwise the trailing zeros would
    # erase everything the RNN learned from the words).
    rev = sentence.split(' ')[::-1]
    shift = 0

    # If the sentence is longer than maxlen, only its last maxlen words are used.
    for i in range(min(len(rev), maxlen)):
        # key_to_index is a dict, so this lookup is much faster than
        # searching the index_to_key list.
        if rev[i] in model.wv.key_to_index:
            sentence_matrix[maxlen - (i+1) + shift,:] = model.wv[rev[i]]
        else:
            shift += 1 # skip any words that aren't in the dictionary

    return sentence_matrix
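
To see what the front-padding looks like, here is a toy sketch of the same idea. It uses a hypothetical 2-dimensional embedding stored in a plain dict (`toy_vectors`) instead of the Word2Vec model, and `toy_sentence_to_matrix` is a stand-in name for this example only.

```python
import numpy as np

# Hypothetical 2-dimensional embedding, standing in for model.wv.
toy_vectors = {"good": np.array([1.0, 0.0]), "cookie": np.array([0.0, 1.0])}

def toy_sentence_to_matrix(sentence, maxlen, vectors, dim=2):
    matrix = np.zeros((maxlen, dim))
    rev = sentence.split(' ')[::-1]
    shift = 0
    for i in range(min(len(rev), maxlen)):
        if rev[i] in vectors:
            # Fill rows from the bottom up, closing gaps left by unknown words.
            matrix[maxlen - (i + 1) + shift, :] = vectors[rev[i]]
        else:
            shift += 1
    return matrix

# 'zzz' is not in the toy vocabulary, so it is skipped.
m = toy_sentence_to_matrix("good zzz cookie", maxlen=4, vectors=toy_vectors)
print(m)
```

The two known words end up in the last two rows, in their original order, and the rows above them stay zero.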

Now we are ready to train the sentiment analyzer! The first step is to create a validation data frame; we chose to use a balanced set of 5-star and 1-star reviews. We used the most extreme ratings to make sure that the validation data was actually positive or negative, since reviews in the 2-4 range could be more neutral.

Alternatively, one could use 'Sentiment' == 0 or 1.

In [9]:
# create validation set of 1000 positive, 1000 negative
df_positive = df.loc[df['Rating']==5]
df_negative = df.loc[df['Rating']==1]
df_val = pd.concat([df_negative.sample(1000), df_positive.sample(1000)])

# ensures training set and validation set are disjoint
df_train = df.drop(df_val.index)
print(len(df), len(df_train), len(df_val))

1315921 1313921 2000


To train our model, we need to create training sets that are balanced between negative and positive reviews. We do this in batches of 1000 positive and 1000 negative reviews; otherwise the data is too big for a laptop to handle.

We create a balanced data set so that the model doesn't 'learn' to just predict positive every time.

In [10]:
# get a sample with 1000 positive reviews and 1000 negative reviews
def get_sample(df, size=1000):
    X_train = np.zeros((size*2, max_val, 300))
    y_train = np.zeros(size*2)
    # Could use 'Rating' == 5 or 1 instead (as in the validation set)
    df_positive = df.loc[df['Sentiment']==1]
    df_negative = df.loc[df['Sentiment']==0]
    #creates a data frame containing 50% positive and 50% negative reviews
    df_small = pd.concat([df_negative.sample(size), df_positive.sample(size)])
    for i in range(len(df_small)):
        row = df_small.iloc[i] #loops through df_small and takes out each row as a dictionary
        X_train[i] = sentence_to_matrix(row['clean'])
        y_train[i] = row['Sentiment']
    return X_train, y_train
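
The reason for balancing can be seen with a quick toy calculation. The class counts below are hypothetical and chosen only for illustration; they are not the actual counts in the food.com data.

```python
def always_positive_accuracy(n_positive, n_negative):
    # A "model" that ignores its input and always predicts positive
    # is right on every positive review and wrong on every negative one.
    return n_positive / (n_positive + n_negative)

# Hypothetical imbalanced set versus a balanced one.
imbalanced = always_positive_accuracy(900, 100)
balanced = always_positive_accuracy(1000, 1000)
print(imbalanced, balanced)
```

On the imbalanced set, predicting positive every time already scores well, so the model has little incentive to learn anything else; on the balanced set, the same shortcut is no better than a coin flip.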

In [11]:
from keras import models
from keras import layers
from keras import optimizers
from keras import losses
from keras import metrics
from keras import regularizers

Now we set up the recurrent neural network (RNN) to train. An RNN is a model designed to work with sequence data, which is why we can use it to pass in sequences of word vectors (i.e. one review).

The output of the model is a single value between 0 and 1 which tells you the probability of a review being positive. 

We added L2 regularization to the RNN in order to limit the size of the parameters, to prevent overfitting. 

In [14]:
# added in regularization (L2 like ridge regression!)
# Regularizers encourage parameters to stay small using L2 norm.
# This just prevents overfitting.

RNN = models.Sequential()
RNN.add(layers.SimpleRNN(300,
                         kernel_regularizer=regularizers.L2(0.01),
                         bias_regularizer=regularizers.L2(0.01),
                         recurrent_regularizer=regularizers.L2(0.01),
                         return_sequences=False))
RNN.add(layers.Dense(1,activation = 'sigmoid'))
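
For intuition, an L2 regularizer with factor 0.01 adds 0.01 times the sum of the squared weights to the loss. The sketch below computes that penalty by hand; `l2_penalty` and the toy weight matrices are made up for this example and are not the weights of our RNN.

```python
def l2_penalty(weights, factor=0.01):
    # factor * (sum of the squared entries of the weight matrix)
    return factor * sum(w * w for row in weights for w in row)

# Hypothetical weight matrices: the second is 10 times the first.
small = [[0.1, -0.2], [0.0, 0.3]]
large = [[10 * w for w in row] for row in small]
print(l2_penalty(small), l2_penalty(large))
```

Scaling the weights by 10 scales the penalty by 100, which is how the regularizer pushes the parameters to stay small.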

The optimizer ```rmsprop``` is a standard optimization algorithm.

The loss function ```binary_crossentropy``` is specific to binary classification (e.g. positive/negative).

We use ```accuracy``` as a metric to evaluate our model during training. Alternatively, you could use ```AUC```.

In [15]:
RNN.compile(optimizer = 'rmsprop',
            loss = 'binary_crossentropy',
            metrics =['accuracy'])
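
To get a feel for what the loss measures, here is a minimal hand-rolled sketch of binary cross-entropy. It is for intuition only: the function below and its `eps` clipping value are assumptions for this example, and the Keras implementation may differ in its details.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so that log() stays finite.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions for one positive and one negative review.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized heavily.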

We turn our validation data frame into sentence matrices that we can input into the model.

In [16]:
X_val, y_val = get_sample(df_val)

This is where we are actually training the model. For better accuracy, change ```range(6)``` to a higher number, but 6 is the maximum I could do on my laptop without it crashing. 

In [None]:
for i in range(6):
    X_train, y_train = get_sample(df_train)
    RNN.fit(X_train, y_train,
            epochs = 1,
            batch_size = 128,
            validation_data=(X_val, y_val))
    #print(RNN.predict(X_val[:10]))

In [None]:
# Now that the model is trained, we can save it for future use. 
RNN.save('rnnmodel')

# Using the model on NYT reviews
Now we can use this model to predict sentiments on NYT reviews!

In [12]:
# load the model
RNN = models.load_model('rnnmodel')

In [13]:
# load in the cleaned tahini dataset here
tahini = pd.read_csv("tahini_cleaned_comments.csv")
tahini.head()

Unnamed: 0,user,comment,sentiment,clean
0,lmk,Yum. These took much longer than 16 minutes t...,pos,yum take long minute cook denver ft altitude g...
1,Sonya,If you follow the recipe as written the tahini...,pos,follow recipe write tahini sesame flavour cook...
2,KV,I have made these cookies 5 times. My advice i...,pos,cookie time advice recipe say don t tell step ...
3,MaryN,I liked this- the tahini is slightly more subt...,pos,like tahini slightly subtle pb cookie combine ...
4,Maggie B,Used Shaila M's tweaks. Baked first tray strai...,pos,shaila m tweak bake tray straight mix deliciou...


For this project, since we are testing our model, we assigned sentiments to the NYT reviews manually. Here we are assigning a Sentiment_Score of 0 (negative) or 1 (positive). Leaving neutral comments in, and classifying them as negative was agreed upon for comparison in the context of the overall project.

In [14]:
tahini['Sentiment_Score'] = (tahini['sentiment'] == 'pos').astype(int)
# dropna(inplace=True) returns None, so we reassign instead of chaining on it
tahini = tahini.dropna().reset_index(drop=True)

In [15]:
# Use sentence_to_matrix to convert NYT reviews to vectors
# Replace 'tahini' with 'tahini_no_neu' (created further below) if you decide to leave out neutrals.

tahini_size = len(tahini)
tahini_max_val = 0
tahini_idx = 0

for i, review in tqdm(enumerate(tahini['clean'])):
    if len(review.split(' ')) > tahini_max_val:
        tahini_max_val = len(review.split(' '))
        tahini_idx = i
print("max:", tahini_max_val, "index:", tahini_idx)



max: 57 index: 170


We are turning the ```tahini``` data frame into a matrix so we can input it into ```RNN```.

In [16]:
X_test = np.zeros((tahini_size, tahini_max_val, 300))
y_test = np.zeros(tahini_size)

for i in range(len(tahini)):
    row = tahini.iloc[i] #loops through tahini and takes out each row as a dictionary
    X_test[i] = sentence_to_matrix(row['clean'],tahini_max_val)
    y_test[i] = row['Sentiment_Score']



Now we make a prediction!

In [None]:
# do RNN.predict to see the prediction!
preds = RNN.predict(X_test)

In [None]:
for i in range(len(tahini)):
    print("P(positive) = ", preds[i], "; Actual Sentiment = ", tahini.loc[i, 'Sentiment_Score'],
         "\n Actual comment: ", tahini.loc[i, 'comment'])
    print()

Now we can look at the confusion matrix to determine how well our model did on the Tahini Cookie data. Note that the accuracy score alone doesn't tell us the whole story, since the data had far more positive reviews than negative reviews.

In [19]:
from sklearn.metrics import confusion_matrix

y_preds = (preds > 0.5) # Play around with 0.5 for different accuracy scores. 
tahini_matrix = confusion_matrix(y_test, y_preds)
tahini_matrix

array([[ 79,  32],
       [110, 127]], dtype=int64)
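
In scikit-learn's convention, the rows of the confusion matrix are the true labels and the columns are the predicted labels, so the matrix above reads as [[TN, FP], [FN, TP]]. The short sketch below unpacks it by hand; the variable names are ours.

```python
# Confusion matrix from the cell above: rows = true label, columns = predicted label.
matrix = [[79, 32],
          [110, 127]]
(tn, fp), (fn, tp) = matrix

accuracy = (tn + tp) / (tn + fp + fn + tp)
negative_recall = tn / (tn + fp)  # share of negative reviews predicted negative
positive_recall = tp / (tp + fn)  # share of positive reviews predicted positive
print(round(accuracy, 4), round(negative_recall, 4), round(positive_recall, 4))
```

At the 0.5 threshold, the model catches a larger share of the negative reviews than of the positive ones.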

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

Although the accuracy is modest, the AUC (area under the ROC curve) score looks decent, which suggests that a lower threshold for predicting positive could give us better accuracy.

In [21]:
accuracy_score(y_test, y_preds), roc_auc_score(y_test, preds)

(0.5919540229885057, 0.6714562663929753)
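
One way to read the AUC: it is the probability that a randomly chosen positive review gets a higher predicted score than a randomly chosen negative review, with ties counted as half. This means it does not depend on any particular threshold. The sketch below computes it by brute force over all pairs; `auc_by_pairs` and the toy labels and scores are made up for illustration.

```python
def auc_by_pairs(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_by_pairs(y_toy, scores_toy)
print(toy_auc)
```

In the toy data, three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.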

When we lower the threshold, we trade accuracy on negatives for better accuracy on positives.

In [22]:
y_preds = (preds > 0.35) # Threshold lowered from 0.5 to 0.35; play around with this value.
tahini_matrix = confusion_matrix(y_test, y_preds)
tahini_matrix

array([[ 67,  44],
       [ 89, 148]], dtype=int64)
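
The effect of the threshold can be reproduced on a toy example. The labels and probabilities below are hypothetical, and `confusion_counts` is a hand-written stand-in that uses the same [[TN, FP], [FN, TP]] layout as scikit-learn's confusion matrix.

```python
def confusion_counts(y_true, probs, threshold):
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p > threshold else 0
        if y == 0 and pred == 0:
            tn += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]

# Hypothetical labels and predicted probabilities.
y_toy = [0, 0, 0, 1, 1, 1, 1]
p_toy = [0.10, 0.30, 0.45, 0.40, 0.48, 0.60, 0.90]
at_050 = confusion_counts(y_toy, p_toy, 0.5)
at_035 = confusion_counts(y_toy, p_toy, 0.35)
print(at_050, at_035)
```

Lowering the threshold moves reviews from the left column (predicted negative) to the right column (predicted positive) in both rows: false negatives become true positives, and some true negatives become false positives.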

Now we repeat the process, but drop the neutral sentiment reviews. We see this improves the AUC. 

In [24]:
tahini_no_neu = tahini[tahini.sentiment != 'neu'].reset_index(drop = True)

tahini_size = len(tahini_no_neu)
tahini_max_val = 0
tahini_idx = 0

for i, review in tqdm(enumerate(tahini_no_neu['clean'])):
    if len(review.split(' ')) > tahini_max_val:
        tahini_max_val = len(review.split(' '))
        tahini_idx = i
print("max:", tahini_max_val, "index:", tahini_idx)

X_test = np.zeros((tahini_size, tahini_max_val, 300))
y_test = np.zeros(tahini_size)

for i in range(len(tahini_no_neu)):
    row = tahini_no_neu.iloc[i] #loops through tahini and takes out each row as a dictionary
    X_test[i] = sentence_to_matrix(row['clean'],tahini_max_val)
    y_test[i] = row['Sentiment_Score']



max: 57 index: 133


In this case, the confusion matrix tells us that the model predicts negative reviews very well, but there are a lot of false negatives: many positive reviews are predicted as negative.

This could mean that according to the model, a user has to really love the cookie recipe for their review to be detected as positive! 🤷🏽‍♀️

In [27]:
preds = RNN.predict(X_test)

y_preds = (preds > 0.5) # Play around with 0.5 for different accuracy scores. 
tahini_matrix = confusion_matrix(y_test, y_preds)
tahini_matrix

array([[ 24,   2],
       [110, 127]], dtype=int64)

The AUC score looks much better!

In [28]:
accuracy_score(y_test, y_preds), roc_auc_score(y_test, preds)

(0.5741444866920152, 0.8377150275884453)

Just like above, when we lower the threshold, we trade accuracy on negatives for better accuracy on positives.

In [29]:
y_preds = (preds > 0.35) # Threshold lowered from 0.5 to 0.35; play around with this value.
tahini_matrix = confusion_matrix(y_test, y_preds)
tahini_matrix

array([[ 23,   3],
       [ 89, 148]], dtype=int64)