# Yelp Review Sentiment Classification

A classifier that can predict a user's rating of a given restaurant from their review.




![Example of a Yelp review](https://wordstream-files-prod.s3.amazonaws.com/s3fs-public/styles/simple_image/public/images/yelp-reviews-filtered.png)

In [None]:
#@title Import our libraries { display-mode: "both" }
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv). 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import spacy
import wordcloud
import os # Good for navigating your computer's files 
import sys

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nltk.download('wordnet')
nltk.download('punkt')

from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
!python -m spacy download en_core_web_md
import en_core_web_md



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



In [None]:
#@title Import our data

import gdown
gdown.download('https://drive.google.com/uc?id=1u0tnEF2Q1a7H_gUEH-ZB3ATx02w8dF4p', 'yelp_final.csv', True)
data_file  = 'yelp_final.csv'


## Data Exploration

Look at the data available.

In [None]:
yelp = pd.read_csv(data_file)

In [None]:
#@title Show data
yelp.head()

Unnamed: 0,business_id,stars,text,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,5,My wife took me here on my birthday for breakf...,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,5,I have no idea why some people give bad review...,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,_1QQZuf4zZOyFCvXc0o6Vg,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",uZetl9T0NcROGOyFfughhg,1,2,0
3,6ozycU1RpktNG2-1BroVtw,5,General Manager Scott Petello is a good egg!!!...,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
4,zp713qNhx8d9KCJJnrw1xA,5,Drop what you're doing and drive here. After I...,wFweIWhv2fREZV_dYkz_1g,7,7,4


We have access to 7 columns of data, but the business_id and user_id columns are not useful for predicting a rating from the review text, so we remove them.



In [None]:
#@title **Remove unnecessary columns** { display-mode: "both" }
yelp.drop(labels=['business_id','user_id'],inplace=True,axis=1)


In [None]:
#@title Check the text in differently rated reviews
num_stars =  1#@param {type:"integer"}

for t in yelp[yelp['stars'] == num_stars]['text'].head(20).values:
    print (t) 

U can go there n check the car out. If u wanna buy 1 there? That's wrong move! If u even want a car service from there? U made a biggest mistake of ur life!! I had 1 time asked my girlfriend to take my car there for an oil service, guess what? They ripped my girlfriend off by lying how bad my car is now. If without fixing the problem. Might bring some serious accident. Then she did what they said. 4 brand new tires, timing belt, 4 new brake pads. U know why's the worst? All of those above I had just changed 2 months before!!! What a trashy dealer is that? People, better off go somewhere!
Disgusting!  Had a Groupon so my daughter and I tried it out.  Very outdated and gaudy 80's style interior made me feel like I was in an episode of Sopranos.  The food itself was pretty bad.  We ordered pretty simple dishes but they just had no flavor at all!  After trying it out I'm positive all the good reviews on here are employees or owners creating them.
I've eaten here many times, but none as bad

Reading through the reviews, we can start to see differences between highly rated and poorly rated ones. Certain words, such as 'delightful', 'impressive', and 'amazing', are likely to be more common in 4 or 5 star reviews. However, the same words can also appear in a 2 star review. For example: "The seating and ambience were impressive, but the food served to us was not".


**It is not the mere presence of individual words that indicates the stars given to a review, but rather the *relative occurrence* of these words in each review.**
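
To make "relative occurrence" concrete, here is a minimal sketch that computes each word's share of all words in a review. The two reviews are hypothetical toy strings, not rows from the Yelp data, and `relative_frequencies` is an illustrative helper rather than part of the notebook.

```python
from collections import Counter

# Hypothetical toy reviews, not taken from the Yelp data.
reviews = {
    5: "amazing food amazing service impressive ambience",
    2: "impressive ambience but bland food and slow service",
}

def relative_frequencies(text):
    """Return each word's share of all words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

freqs = {stars: relative_frequencies(text) for stars, text in reviews.items()}

# Compare how prominent each word is in the two reviews.
for word in ("amazing", "impressive"):
    for stars in sorted(freqs):
        share = freqs[stars].get(word, 0.0)
        print(f"{word!r} in the {stars}-star review: {share:.2f}")
```

In this toy example 'impressive' appears in both reviews, but 'amazing' only carries weight in the 5 star one.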

#### Word Clouds

Another way to take a look at the most prominent words in any given star rating is through the use of word clouds. 

Edit the value in the cell below to see the word cloud for each star rating.

In [None]:
#@title Word cloud for differently rated reviews
num_stars = 5 #@param {type:"integer"}
# Join all reviews with this star rating into one string (try values 1 through 5).
this_star_text = ' '.join(yelp[yelp['stars'] == num_stars]['text'].values)

# Use a name that does not shadow the imported `wordcloud` module.
cloud = WordCloud()
cloud.generate_from_text(this_star_text)
plt.figure(figsize=(14,7))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()


**What are the differences between the reviews that have 1, 2, 3, 4, and 5 stars?**

As we can see, the word cloud does not give us much distinguishing information between reviews that have 1, 2, 3, 4, or 5 stars. All of them prominently feature words such as 'place', 'food', 'service' and 'table'. Human intuition will only get us so far.

Before we go any further, we will need to clean up our text.

## Text Preprocessing

#### Tokenization

- Convert each review from a single string into a list of words (a process known as tokenization). 
- Most NLP algorithms operate on a list of words (tokens) rather than on raw sentences. 

In [None]:
#@title Basic tokenization example
example_text = "All the people I spoke to were super nice and very welcoming." #@param {type:"string"}
tokens = word_tokenize(example_text)
tokens

['All',
 'the',
 'people',
 'I',
 'spoke',
 'to',
 'were',
 'super',
 'nice',
 'and',
 'very',
 'welcoming',
 '.']

#### Stopwords

Some words are associated with 4 or 5 star reviews, and others with 1 or 2 star reviews. At the same time, some words carry no relevant information for our problem. In NLP these are called "stop words": words that provide grammatical structure but convey little about the subject. Edit the cell below to see if a given word is a stop word.

In [None]:
#@title Check if a word is a stop word
example_word = "the" #@param {type:'string'}
if example_word.lower() in STOP_WORDS:
  print (example_word + " is a stop word.")
else:
  print (example_word + " is NOT a stop word.")

the is a stop word.


We would like to remove these stopwords from the user reviews.

Tokenization and removal of stop words are universal to nearly every NLP application. In some cases, additional cleaning may be required (for example, removal of proper nouns, removal of digits) but we can build a text preprocessing function with these "base" cleaning steps.

Putting all these together, we can come up with a text cleaning function that we can apply to all of our reviews.
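
As a rough illustration of these "base" cleaning steps, the sketch below lowercases the text, strips punctuation, splits on whitespace, and drops stop words. `STOP_WORDS_DEMO` is a tiny hypothetical list invented for this example (real lists such as spaCy's `STOP_WORDS` are much longer), and `clean_text` is an illustrative helper, not the function used later in the notebook.

```python
import string

# Tiny hypothetical stop-word list for illustration only.
STOP_WORDS_DEMO = {"the", "i", "to", "were", "and", "very", "all"}

def clean_text(text):
    """Lowercase, remove punctuation, split on whitespace, drop stop words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in stripped.split() if word not in STOP_WORDS_DEMO]

cleaned = clean_text("All the people I spoke to were super nice and very welcoming.")
print(cleaned)
```

Splitting on whitespace is a simplification; a proper tokenizer also handles contractions, hyphenation, and similar cases.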

#### Using spaCy


In [None]:
nlp = en_core_web_md.load()
doc = nlp(u"We are running out of time! Are we though?")
doc



The `doc` object has many useful properties. For instance, you can get the text of each token and its length in characters.

In [None]:
doc = nlp(u"We are running out of time! Are we though?")
token = doc[0] # Get the first word in the text.
assert token.text == u"We" # Check that the token text is 'We'.
assert len(token) == 2 # Check that the length of the token is 2.

The `en_core_web_md` model includes word vectors that we can use, though it does not have a vector for every word. (Downloading the model in the import cell may take a minute or so.)

We can get the word embedding of a particular word in our document as follows:

In [None]:
doc = nlp(u"I like apples")
print(doc)
appleVariable = doc[2]

print(appleVariable.vector) # Each word is being represented by 300 dimensional vector embedding

I like apples
[-0.6334     0.18981   -0.53544   -0.52658   -0.30001    0.30559
 -0.49303    0.14636    0.012273   0.96802    0.0040354  0.25234
 -0.29864   -0.014646  -0.24905   -0.67125   -0.053366   0.59426
 -0.068034   0.10315    0.66759    0.024617  -0.37548    0.52557
  0.054449  -0.36748   -0.28013    0.090898  -0.025687  -0.5947
 -0.24269    0.28603    0.686      0.29737    0.30422    0.69032
  0.042784   0.023701  -0.57165    0.70581   -0.20813   -0.03204
 -0.12494   -0.42933    0.31271    0.30352    0.09421   -0.15493
  0.071356   0.15022   -0.41792    0.066394  -0.034546  -0.45772
  0.57177   -0.82755   -0.27885    0.71801   -0.12425    0.18551
  0.41342   -0.53997    0.55864   -0.015805  -0.1074    -0.29981
 -0.17271    0.27066    0.043996   0.60107   -0.353      0.6831
  0.20703    0.12068    0.24852   -0.15605    0.25812    0.007004
 -0.10741   -0.097053   0.085628   0.096307   0.20857   -0.23338
 -0.077905  -0.030906   1.0494     0.55368   -0.10703    0.052234
  0.43407  

The word 'apples' is represented by a 300-dimensional embedding vector.


### Get similarity of Two Words

In [None]:
#@title  { display-mode: "code" }
similar_words_doc = nlp(u"apples oranges")
w1 = similar_words_doc[0]
w2 = similar_words_doc[1]
print(w1.similarity(w2))

dissimilar_words_doc = nlp(u"doorknob phone")
w3 = dissimilar_words_doc[0]
w4 = dissimilar_words_doc[1]
print(w3.similarity(w4))

0.77809423
0.14618301
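
spaCy's `similarity` defaults to the cosine similarity between the two vectors. The sketch below computes that quantity by hand. The 3-dimensional "embeddings" are hypothetical values made up for illustration; real vectors from `en_core_web_md` have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only.
fruit_a = [0.9, 0.1, 0.3]
fruit_b = [0.8, 0.2, 0.4]
tool = [-0.2, 0.9, -0.5]

print(cosine_similarity(fruit_a, fruit_b))  # vectors pointing the same way: close to 1
print(cosine_similarity(fruit_a, tool))     # vectors pointing apart: much lower
```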


As we saw before, the language in 4 star reviews is quite similar to the language in 5 star reviews. So the text in those reviews might not be very useful and we can drop those rows from our data.

Although the text in the 3 star reviews is not very similar to the 1 or 2 star reviews, it is quite different from the language used in the 5 star reviews. So we could actually group those reviews together with the 1 and 2 star reviews.

In order to reduce our problem to a **binary classification** problem, we will:

 - remove all 4 star reviews
 - label 5 star reviews as 'good'
 - label 1, 2, 3 star reviews as 'bad'

Get rid of 4 star reviews.

In [None]:
yelp = yelp[yelp.stars != 4]

### Re-categorize reviews



In [None]:
#@title  { display-mode: "code" }
def is_good_review(stars):
    # With 4 star reviews removed, only 5 star reviews exceed 3.
    return stars > 3
    
yelp['is_good_review'] = yelp['stars'].apply(is_good_review)
yelp.head()

Unnamed: 0,stars,text,cool,useful,funny,is_good_review
0,5,My wife took me here on my birthday for breakf...,2,5,0,True
1,5,I have no idea why some people give bad review...,0,0,0,True
2,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",1,2,0,True
3,5,General Manager Scott Petello is a good egg!!!...,0,0,0,True
4,5,Drop what you're doing and drive here. After I...,7,7,4,True


## One-Hot Vectors

How do we convert our text to numbers in a structured way that we can feed into a machine learning algorithm? One way is to use a concept called "one-hot encoding": each word in the vocabulary gets a vector that is 1 at that word's position and 0 everywhere else. Suppose we have the sentence "great tacos at this restaurant". Its one-hot encoding would be:

In [None]:
#@title Example: one-hot encoding of 'great tacos at this restaurant' { vertical-output: true, display-mode: "both" }
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('great', 'tacos', 'at','this','restaurant'))
print('--------------------------------------------')
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('1', '0', '0','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '1', '0','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '1','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '0','1','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '0','0','1'))

great|tacos| at |this|restaurant
--------------------------------------------
  1  |  0  | 0  | 0  |    0     
  0  |  1  | 0  | 0  |    0     
  0  |  0  | 1  | 0  |    0     
  0  |  0  | 0  | 1  |    0     
  0  |  0  | 0  | 0  |    1     
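
The same table can be generated rather than typed out. This is a minimal sketch in which the vocabulary is simply the words of the sentence in order, and `one_hot` is an illustrative helper name.

```python
sentence = "great tacos at this restaurant"
vocabulary = sentence.split()  # every word is unique in this sentence

def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's position and 0 elsewhere."""
    return [1 if word == entry else 0 for entry in vocabulary]

for word in vocabulary:
    print(f"{word:<10} {one_hot(word, vocabulary)}")
```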


## Bag of Words

- One-hot encoding is a way to represent individual words as vectors.
- Bag of words is a way to represent sentences (or larger pieces of text) as the **sum** of the one-hot encoding vectors of each of the words. 

Example:
**"The food was great. The ambience was also great."**

1. Define our vocabulary - This is *each unique word* in the review. 
  So our vocabulary is **[the, food, was, great, ambience, also]**.

2. Determine the one-hot encoding of each word:

      the = (1,0,0,0,0,0)
      food = (0,1,0,0,0,0)
      was = (0,0,1,0,0,0)
      great = (0,0,0,1,0,0)
      ambience = (0,0,0,0,1,0)
      also = (0,0,0,0,0,1)

3. Represent the review as a bag of words.
 - The bag of words vector has the same length as the vocabulary, so it is also 6 elements long.
 - To construct it, we start with a (0,0,0,0,0,0) vector and pass through each word in the review. For each word we encounter, we add its one-hot encoding to our vector. So for our review, the bag of words representation will be

      **(2,1,2,2,1,1)**
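
The three steps above can be sketched in a few lines of plain Python. Lowercasing, stripping punctuation, and splitting on whitespace are simplifying assumptions made for this example; the variable and helper names are illustrative.

```python
import string

review = "The food was great. The ambience was also great."

# Lowercase, remove punctuation, and split on whitespace (a simplification).
words = review.lower().translate(str.maketrans("", "", string.punctuation)).split()

# Step 1: vocabulary of unique words, in order of first appearance.
vocabulary = list(dict.fromkeys(words))

# Step 2: one-hot encoding of a single word.
def one_hot(word):
    return [1 if word == entry else 0 for entry in vocabulary]

# Step 3: start from all zeros and add each word's one-hot vector.
bag = [0] * len(vocabulary)
for word in words:
    bag = [total + bit for total, bit in zip(bag, one_hot(word))]

print(vocabulary)
print(bag)
```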

## Creating our Bag of Words

We want to select the features for our model and the output classes from our data. What are the features? We are only using the review text to make predictions for our model. And the output classes are the 'good' and 'bad' review classes we created just above. 

By convention, we represent our entire set of features as X, and our target output as y. Running the cell below will create the relevant X and y for our problem.

In [None]:
X = yelp['text']
y = yelp['is_good_review']

Running the cells below will create an object we can use to *transform* each piece of raw text into a bag of words vector.
`CountVectorizer` is a scikit-learn class that helps us create this object. Its `analyzer` parameter can be set to our `tokenize` function so that the raw text is preprocessed first.

In [None]:
#@title Initialize the text cleaning function { display-mode: "form" }
def tokenize(text):
    """Lemmatize text, dropping stop words, punctuation, and pronoun placeholders."""
    clean_tokens = []
    for token in nlp(text):
        # '-PRON-' is the placeholder lemma that spaCy 2.x assigns to any pronoun; we exclude these.
        if not token.is_stop and not token.is_punct and token.lemma_ != '-PRON-':
            clean_tokens.append(token.lemma_)
    return clean_tokens

In [None]:
bow_transformer = CountVectorizer(analyzer=tokenize, max_features=1600).fit(X)

View the entire vocabulary. Each value is the index (position) of that word in the vocabulary.

In [None]:
bow_transformer.vocabulary_

We can see the length of the vocabulary stored in the transformer object by running the cell below. 

In [None]:
len(bow_transformer.vocabulary_)

Use the transformer to convert our entire dataset (X) into a series of bag of words vectors:

In [None]:
X = bow_transformer.transform(X)

## Training a Baseline Classification Model (Logistic Regression)

Our classification problem is a classic two-class classification problem, and so we will use the tried and tested **Logistic Regression** machine learning model.
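
For intuition, logistic regression scores a bag of words vector by taking a weighted sum of its counts, adding a bias, and squashing the result into a probability with the sigmoid function. The sketch below uses a hypothetical three-word vocabulary with hand-picked weights; in the real model the weights are learned from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical vocabulary, weights, and bias, chosen by hand for illustration.
vocabulary = ["great", "bland", "food"]
weights = [1.5, -2.0, 0.1]
bias = -0.2

def predict_probability(bag_of_words):
    """Probability that a review with these word counts is 'good'."""
    z = bias + sum(w * x for w, x in zip(weights, bag_of_words))
    return sigmoid(z)

good_like = [2, 0, 1]  # "great" twice, "food" once
bad_like = [0, 1, 1]   # "bland" once, "food" once

print(predict_probability(good_like))  # above 0.5, so classified as good
print(predict_probability(bad_like))   # below 0.5, so classified as bad
```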

In [None]:
# Create the logistic regression model (the class was imported from scikit-learn above)
logistic_model = LogisticRegression()

We will use 20% of our data as test data. If you run the cell below, it will randomly split the data such that 80% of it is training data and 20% of it is data we can use to test the predictions from our trained model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

### Train Model


In [None]:
#@title  { vertical-output: true, display-mode: "code" }
logistic_model.fit(X_train, y_train)


## Get Predictions

In [None]:
# Get our predictions.
preds = logistic_model.predict(X_test)

# Get the confusion matrix.
cm = confusion_matrix(y_test, preds)

# Get the TP, FP, TN, and FN counts.
# Rows are true labels and columns are predictions, both ordered [False, True].
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]

# Calculate and print accuracy.
accuracy = (TP + TN)/(TP + TN + FP + FN)
print ("The accuracy of the model is " + str(accuracy*100) + "%")

Not perfect, but definitely better than we would have expected at random (50%).
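
Accuracy alone does not show which class the errors fall on. As a rough sketch, precision and recall for the 'good' class can be computed from the same four counts. The labels below are hypothetical toy values, not the model's actual predictions, and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical toy labels, for illustration only.
y_true_demo = [True, True, True, False, False, False, False, True]
y_pred_demo = [True, True, False, False, False, True, False, True]

metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

scikit-learn's `classification_report`, already imported at the top of the notebook, reports these per-class metrics directly.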

Enter an example review to see if our model predicts it as a positive one or a negative one.

In [None]:
#@title Enter an example review, and see if it is classified as good or bad
example_review = "good!!!!!!!!!!!!!" #@param {type:'string'}
prediction = logistic_model.predict(bow_transformer.transform([example_review]))

if prediction[0]:  # predict returns an array with one entry per input review
  print ("This was a GOOD review!")
else:
  print ("This was a BAD review!")

