# Natural Language Processing on Restaurant Reviews

In this Jupyter notebook, we will carry out some natural language processing on a dataset containing reviews of a certain restaurant. The dataset was obtained from Kirill Eremenko's machine learning website: https://www.superdatascience.com/machine-learning/

The aim of this NLP task is to classify reviews by whether the reviewer liked the restaurant or not. We will create a Bag-of-Words model, which breaks each review down into its constituent terms and records how often each term occurs. A classifier then relates these term frequencies to our target variable: whether or not the reviewer liked the restaurant.
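
As a minimal sketch of what a Bag-of-Words model produces, the snippet below vectorizes three made-up sentences (they are not from the dataset) with scikit-learn's `CountVectorizer`. Each row is a sentence, each column is a term, and each cell counts how often that term occurs in that sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration only; not taken from the dataset.
toy_reviews = [
    "the food was great",
    "the service was not great",
    "great food great service",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_reviews).toarray()

# Column order follows the indices stored in vocabulary_ (alphabetical).
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(terms)   # ['food', 'great', 'not', 'service', 'the', 'was']
print(counts)
# [[1 1 0 0 1 1]
#  [0 1 1 1 1 1]
#  [1 2 0 1 0 0]]
```

Word order is discarded: the third sentence is represented only by the fact that "great" occurs twice and "food" and "service" once each.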


## Loading in Libraries and Data

First we will load the required libraries and data. After the standard numpy, pandas, and matplotlib imports, the libraries dedicated to NLP are loaded. These libraries will help us form a corpus out of our dataset, remove very common words (stopwords), remove extra whitespace and punctuation (including the apostrophes in contractions), and stem words to their morphological base (loved -> love).

This clean-up of the data is necessary so that different forms of the same word are not counted as separate terms. For example, many people may use the past tense of a word while others use the present tense. Using the Porter stemmer maps these forms to the same stem, so they are grouped together for classification.
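
To illustrate the grouping described above, here is a small sketch using NLTK's `PorterStemmer` on a few forms of one verb. The word list is my own example, not taken from the reviews.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word forms; all of them should collapse to one stem.
word_forms = ["love", "loved", "loves", "loving"]
stems = [stemmer.stem(word) for word in word_forms]

print(stems)  # ['love', 'love', 'love', 'love']
```

After stemming, all four forms contribute to a single "love" column in the Bag-of-Words matrix instead of four separate ones.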

After the data is cleaned and preprocessed, we will use the Naive Bayes and Random Forest algorithms to classify whether a customer liked the restaurant or not.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Cleaning the texts
import re
import nltk
# Only needed the first time
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
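
The `quoting = 3` argument above is the integer value of `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters. The sketch below shows the effect on a two-row, made-up TSV snippet (the sample text is an assumption for illustration, not part of `Restaurant_Reviews.tsv`).

```python
import csv
import io

import pandas as pd

# Made-up tab-separated sample containing double quotes inside a review.
sample_tsv = (
    "Review\tLiked\n"
    "\"Best\" burger in town.\t1\n"
    "The \"special\" was cold.\t0\n"
)

# csv.QUOTE_NONE == 3, so this matches quoting = 3 in the cell above.
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter='\t', quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)       # 3
print(sample['Review'][0])  # "Best" burger in town.
```

With quoting disabled, the quote characters stay in the review text and cannot be misread as field delimiters.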

Let's look at a snapshot of our data to see what kind of reviews were left.

In [12]:
print(dataset.head(10))

                                              Review  Liked
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1
5     Now I am getting angry and I want my damn pho.      0
6              Honeslty it didn't taste THAT fresh.)      0
7  The potatoes were like rubber and you could te...      0
8                          The fries were great too.      1
9                                     A great touch.      1


## Data Preprocessing

We will define a function which will iterate through our dataset. For each review, it will accomplish the following:

1. Replace every non-letter character (digits, punctuation) with a space
2. Change all the words to lowercase
3. Split the review into its constituent words
4. Remove 'stopwords'
5. Stem the remaining words to their morphological base
6. Re-attach our split words back into one string
7. Create a matrix of word frequencies from the cleaned reviews (sparse from `CountVectorizer`, then converted to a dense array), keeping the 1500 most frequent terms
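
The per-review steps above can be sketched on a single sentence as follows. To keep the snippet runnable without downloading NLTK data, it uses a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical stand-in); the notebook itself uses NLTK's full English list. The helper name `clean_review` is also my own.

```python
import re

from nltk.stem.porter import PorterStemmer

# Hypothetical stand-in for stopwords.words('english'), for illustration only.
TOY_STOPWORDS = {"this", "the", "is", "was", "a", "and"}


def clean_review(text, stop_words=TOY_STOPWORDS):
    """Apply the letter filter, lowercasing, stopword removal, and stemming."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)  # non-letters become spaces
    tokens = letters_only.lower().split()           # lowercase, split on whitespace
    stemmer = PorterStemmer()
    kept = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(kept)                           # back to a single string


cleaned = clean_review("Wow... Loved this place.")
print(cleaned)  # wow love place
```

The first review in the dataset shrinks to three stems, which is the form that is handed to the vectorizer.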


In [5]:
def PrepareData(dataset):
    # Build the stemmer and stopword set once, not once per word or review
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    corpus = []
    for text in dataset['Review']:
        # Keep letters only, lowercase, and split into words
        review = re.sub('[^a-zA-Z]', ' ', text).lower().split()
        # Drop stopwords, then stem what remains
        review = [ps.stem(word) for word in review if word not in stop_words]
        corpus.append(' '.join(review))

    # Creating the Bag of Words model (1500 most frequent terms)
    cv = CountVectorizer(max_features = 1500)
    X = cv.fit_transform(corpus).toarray()
    y = dataset['Liked'].values
    return X, y

In [6]:
X, y = PrepareData(dataset)

## Classification

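We will be using two classification algorithms commonly applied to text, Naive Bayes and Random Forest. We end up with similar average precision, recall, and F1 scores for both classifiers. However, after implementing a cross-validated grid search, I was able to edge out the Naive Bayes classifier slightly with the Random Forest (average F1 score of 0.73 versus 0.72). The two models err in opposite directions: Naive Bayes has high recall on positive reviews (0.88) but low recall on negative ones (0.57), while the Random Forest has high recall on negative reviews (0.93) but low recall on positive ones (0.56).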

Overall, I believe there is still some room for improvement for these classifiers. However, as this is my first NLP project, I am satisfied with the results!

In [8]:
# Split into training (80%) and test (20%) sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [9]:
# Naive Bayes (GaussianNB needs the dense array produced by .toarray())

clf = GaussianNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)


print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.82      0.57      0.67        97
          1       0.68      0.88      0.77       103

avg / total       0.75      0.73      0.72       200



In [16]:
# Random Forest, tuned with a 5-fold cross-validated grid search


clf = RandomForestClassifier()
params = {'n_estimators' : [500, 1000, 1500],
          'max_depth': [ 2, 4, 6, 8],
          'min_samples_split' : [2, 3,4,5]}

cv = GridSearchCV(clf, param_grid = params, scoring = 'f1', cv = 5)

cv.fit(X_train, y_train)

print(cv.best_params_)

{'max_depth': 8, 'min_samples_split': 4, 'n_estimators': 1500}
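
The grid above has 3 x 4 x 4 = 48 parameter combinations, so with `cv = 5` the search fits 240 models before refitting the best combination on the full training set. The sketch below shows the same mechanics on a much smaller grid and on synthetic data from `make_classification`, both of which are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook uses the Bag-of-Words matrix instead.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Deliberately tiny grid (2 x 2 = 4 combinations) to keep the run short.
toy_grid = {'n_estimators': [10, 20], 'max_depth': [2, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid=toy_grid, scoring='f1', cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)                # best combination found
print(len(search.cv_results_['params']))  # number of combinations tried: 4

# With the default refit=True, predict() uses the best model refit on all data.
toy_pred = search.predict(X_toy[:5])
```

This is why calling `predict` on the fitted `GridSearchCV` object works directly: it delegates to the refit `best_estimator_`.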


In [14]:
y_pred = cv.predict(X_test)

In [15]:

print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.67      0.93      0.78        97
          1       0.89      0.56      0.69       103

avg / total       0.78      0.74      0.73       200
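
As a quick sanity check, the F1 scores in the two reports can be recomputed from the printed precision and recall with F1 = 2PR / (P + R). The sketch below also recomputes the averaged F1, assuming the `avg / total` row is weighted by each class's support. The helper names `f1` and `weighted_f1` are my own.

```python
# (precision, recall, support) per class, copied from the reports above.
naive_bayes = {0: (0.82, 0.57, 97), 1: (0.68, 0.88, 103)}
random_forest = {0: (0.67, 0.93, 97), 1: (0.89, 0.56, 103)}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_f1(report):
    """Support-weighted average of the per-class F1 scores."""
    total = sum(support for _, _, support in report.values())
    return sum(f1(p, r) * support for p, r, support in report.values()) / total


for name, report in [("Naive Bayes", naive_bayes), ("Random Forest", random_forest)]:
    per_class = {label: round(f1(p, r), 2) for label, (p, r, _) in report.items()}
    print(name, per_class, round(weighted_f1(report), 2))
# Naive Bayes {0: 0.67, 1: 0.77} 0.72
# Random Forest {0: 0.78, 1: 0.69} 0.73
```

The recomputed values match the printed reports, with the Random Forest ahead by about 0.01 in averaged F1.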

