# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [40]:
from __future__ import print_function

import pandas as pd
import numpy as np

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [41]:
# Read the CSV file and print the first rows
yelp_dataset = pd.read_csv("../data/yelp.csv")
yelp_dataset.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [42]:
# The dataset consists of 10,000 observations (rows) with 10 columns each
print(yelp_dataset.shape)

(10000, 10)


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9) explains how to do this.

In [43]:
# Filter a DataFrame with a boolean mask: foo[(foo.var1 > x) & (foo.var2 == y)]
# Use & for AND, | for OR, and keep each condition in parentheses
yelp_binary = yelp_dataset[(yelp_dataset.stars == 1) | (yelp_dataset.stars == 5)]
yelp_binary.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


In [44]:
# Only 4086 datapoints have 1 star or 5 stars.
print(yelp_binary.shape)

(4086, 10)
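
An equivalent filter can be written with `Series.isin`, which reads more easily when more than two values are allowed. The sketch below runs on a made-up DataFrame (`toy_reviews` and its contents are illustrative and not part of `yelp.csv`):

```python
import pandas as pd

# Hypothetical miniature dataset, for illustration only
toy_reviews = pd.DataFrame({
    "stars": [5, 3, 1, 4, 5, 2, 1],
    "text": ["great", "fine", "awful", "good", "superb", "meh", "never again"],
})

# Keep only the rows whose star rating is 1 or 5
toy_binary = toy_reviews[toy_reviews.stars.isin([1, 5])]
print(toy_binary.shape)
```

Applied to this notebook, the same idea would presumably be `yelp_dataset[yelp_dataset.stars.isin([1, 5])]`.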


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [45]:
# X is capitalized by convention for the feature data, but here it is a
# one-dimensional Series of 4086 review texts (shape (4086,)), not a 4086x1 matrix
X = yelp_binary.text
X.shape

(4086,)

In [46]:
X.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
6    Drop what you're doing and drive here. After I...
Name: text, dtype: object

In [47]:
y = yelp_binary.stars
y.shape

(4086,)

In [48]:
y.head()

0    5
1    5
3    5
4    5
6    5
Name: stars, dtype: int64

In [49]:
# Split the dataset into train and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Let's confirm the data shapes
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(3064,)
(3064,)
(1022,)
(1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [50]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [51]:
# Instantiate it
vect = CountVectorizer()

In [52]:
# Now we 'learn' the vocabulary from the training data
vect.fit(X_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [53]:
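# Let's examine the vocabulary we learned
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2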
feature_names = vect.get_feature_names()
print(len(feature_names))
# print(feature_names)

16825


In [54]:
# Create the document-term matrices
X_train_dtm = vect.transform(X_train)
X_test_dtm = vect.transform(X_test)
# TODO: examine the shapes of the matrix and visualise the dtm matrix using pandas
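
One way to address the TODO above is to check the shapes of both matrices and view a small document-term matrix as a DataFrame. The sketch below uses a made-up corpus (`toy_train` and `toy_test` are illustrative) because converting the full 3064 x 16825 matrix to a dense array would be wasteful:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature corpus, for illustration only
toy_train = pd.Series(["great food great service", "terrible service", "food was great"])
toy_test = pd.Series(["terrible food", "amazing service"])

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learn the vocabulary and count tokens
toy_test_dtm = toy_vect.transform(toy_test)        # reuse the training vocabulary

# get_feature_names_out() exists in scikit-learn 1.0+; older versions use get_feature_names()
if hasattr(toy_vect, "get_feature_names_out"):
    toy_tokens = list(toy_vect.get_feature_names_out())
else:
    toy_tokens = toy_vect.get_feature_names()

# Both matrices have one column per token in the training vocabulary
print(toy_train_dtm.shape)
print(toy_test_dtm.shape)

# Dense view of the sparse matrix: one row per document, one column per token
toy_dtm_df = pd.DataFrame(toy_train_dtm.toarray(), columns=toy_tokens)
print(toy_dtm_df)
```

Note that "amazing" appears only in the test documents, so it is not in the vocabulary and is silently dropped when the test set is transformed.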

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [55]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [56]:
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
# Confirm that both shapes are the same
print(y_pred_class.shape)
print(y_test.shape)

(1022,)
(1022,)


In [57]:
# Let's see how we did !!
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.91878669275929548

In [58]:
# Let's see the confusion Matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[126,  58],
       [ 25, 813]])

In [59]:
# Let's see the accuracy using logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
y_pred = logreg.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred)

0.92563600782778865

In [60]:
# It has a slightly higher accuracy, let's see the confusion matrix
metrics.confusion_matrix(y_test, y_pred)

array([[140,  44],
       [ 32, 806]])

In [61]:
# Logistic regression recognizes more of the 1-star reviews: true negatives rise from 126 to 140.
# False positives drop from 58 to 44, but false negatives rise from 25 to 32 and true positives fall from 813 to 806.
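
To make the comparison concrete, the two confusion matrices printed above can be turned into accuracy, sensitivity, and specificity. This is a sketch that assumes 5 stars is the positive class (the second row and column, since scikit-learn sorts the labels); `summarize` is a helper name introduced here for illustration:

```python
def summarize(confusion):
    """Compute basic metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "sensitivity": tp / (tp + fn),  # share of actual 5-star reviews predicted as 5
        "specificity": tn / (tn + fp),  # share of actual 1-star reviews predicted as 1
    }

# Values copied from the confusion matrices printed above
nb_summary = summarize([[126, 58], [25, 813]])
logreg_summary = summarize([[140, 44], [32, 806]])

for name, summary in [("Naive Bayes", nb_summary), ("Logistic regression", logreg_summary)]:
    print(name, {metric: round(value, 4) for metric, value in summary.items()})
```

The accuracies reproduce the `accuracy_score` outputs above, while the other two metrics show the trade-off: logistic regression gains specificity at a small cost in sensitivity.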

In [62]:
# Let's investigate which tokens push a review toward 1 or 5 stars.

In [63]:
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

16825

In [64]:
nb.feature_count_
nb.feature_count_.shape

(2, 16825)

In [65]:
# Array with each token count for each class
one_star_token_count = nb.feature_count_[0,:]
five_star_token_count = nb.feature_count_[1,:]

In [66]:
tokens = pd.DataFrame({'token':X_train_tokens, 'onestar':one_star_token_count, 'fivestar':five_star_token_count}).set_index('token')

In [67]:
# examine 10 DataFrame rows
tokens.sample(10)

token,fivestar,onestar
blue,54.0,11.0
curtains,1.0,1.0
sooooo,5.0,0.0
mel,2.0,0.0
thrilled,9.0,0.0
viagra,0.0,1.0
grits,15.0,1.0
nowhere,8.0,3.0
immune,4.0,0.0
passing,5.0,2.0


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [68]:
# Null accuracy is the accuracy of a classifier that always predicts the most frequent class.
# Find out which class is the most frequent (assuming that the classes are distributed in the
# same way in the training and testing sets)
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64

In [69]:
def get_null_accuracy(targets):
    """
    Returns the accuracy if we always
    predict the most frequent class
    
    TODO: Extend this method to general cases
    """
    return targets.value_counts().head(1) / len(targets)

null_accuracy = get_null_accuracy(y_test)
print(null_accuracy)

5    0.819961
Name: stars, dtype: float64
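
The function above returns a one-element Series. One possible way to address its TODO is to return a plain float that works for any number of classes. This is a sketch: `null_accuracy` is a name introduced here, and `toy_y_test` is rebuilt from the class counts printed above rather than taken from the real `y_test`:

```python
import pandas as pd

def null_accuracy(targets):
    """Share of the most frequent class, returned as a plain float."""
    return float(targets.value_counts(normalize=True).max())

# Stand-in for y_test, rebuilt from the value counts shown above (838 fives, 184 ones)
toy_y_test = pd.Series([5] * 838 + [1] * 184)
print(round(null_accuracy(toy_y_test), 6))
```

Because it relies only on `value_counts`, the same function should also work for the 5-class problem in Task 9.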


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [70]:
# Let's print some false positives: reviews predicted as 5 stars whose actual rating is 1 star.
# The Naive Bayes predictions are stored in y_pred_class
false_positives = X_test[y_pred_class > y_test]
list(false_positives)[:10]

['This has to be the worst restaurant in terms of hygiene. Two of my friends had food -poisoning after having dinner here. The food is just unhealthy with tons of oil floating on the top of curries, and I am not sure if any health/hygiene code is followed here. \nThe service is poor and the information on its website is incorrect, the owner does not allow dine-in after 9 or 10 even though it says that the restaurant is open till 11. \n\nOne night I saw the owner cleaning the place without gloves and she was nice enough to give us a to-go parcel without cleaning her hands (great example to the servers!). I had a peek inside the kitchen when the door was ajar, and it definitely looked dirty.\n\nI have been a lot of hole-in-the-wall places around this restaurant, including Haji Baba, the Vietnamese place and others, but neither any of my friends nor I have fallen sick coz of the food. If you need a spicy-food fix, i strongly recommend you do not try this place, lest you want a visit to th

In [71]:
# False negatives are reviews predicted as 1 star whose actual rating is 5 stars
false_negatives = X_test[y_pred_class < y_test]
list(false_negatives)[:10]

["I now consider myself an Arizonian. If you drive a lot on the 101 or 51 like I do, you'll get your fair share of chips on your windshield. You'll also have to replace a windshield like I had to do just recently. Apparently, chips and cracking windshields  is common in Arizona. In fact, I seem to recall my insurance agent telling me that insurance companies must provide this coverage in Arizona.\n\nI had a chip repaired about a year ago near the very bottom of the windshield. Just recently a small, very fine crack started traveling north on the windshield from the repaired chip (a different vendor repaired the chip). I called these guys over to my house and they said it was too long to fix, so they replaced the whole windshield the next day.\n\nWhat great service, they come out to your residence or place of business to repair or replace your windshield.",
 'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without excep
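
The selection works because, with only the labels 1 and 5, a prediction greater than the true value can only mean "predicted 5, actual 1", and a prediction smaller than the true value can only mean "predicted 1, actual 5". A small sketch with made-up data (all `toy_` names are illustrative) shows the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical test set with a non-contiguous index, as produced by train_test_split
toy_X_test = pd.Series(["awful", "loved it", "never again", "superb"], index=[10, 11, 12, 13])
toy_y_test = pd.Series([1, 5, 1, 5], index=[10, 11, 12, 13])
toy_pred = np.array([5, 5, 1, 1])  # predictions in the same row order as toy_y_test

# Comparing the array with the Series yields a boolean Series that keeps
# toy_y_test's index, so it can be used directly as a mask on toy_X_test
toy_false_positives = toy_X_test[toy_pred > toy_y_test]  # predicted 5, actual 1
toy_false_negatives = toy_X_test[toy_pred < toy_y_test]  # predicted 1, actual 5

print(list(toy_false_positives))
print(list(toy_false_negatives))
```

This comparison trick relies on there being exactly two ordered labels; for the 5-class problem, explicit conditions on the predicted and actual values would be needed.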

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [72]:
# Numpy arrays for the number of times that each token is in one/five star reviews
one_star_tokens = nb.feature_count_[0,:]
five_star_tokens = nb.feature_count_[1,:]
# First 5 tokens example
one_star_tokens[:5]

array([ 26.,   4.,   1.,   3.,   1.])

In [73]:
# We'll add 1 to avoid dividing by 0 when calculating the ratio
one_star_tokens = one_star_tokens + 1
five_star_tokens = five_star_tokens + 1

In [74]:
tokens = pd.DataFrame({'token': X_train_tokens, 'one_star': one_star_tokens, 'five_star': five_star_tokens}).set_index('token')
tokens.sample(10, random_state=7)
# After adding 1, the token 'opinion' has a count of 31 in 5-star reviews and 7 in 1-star reviews
# (that is, 30 and 6 raw occurrences)

token,five_star,one_star
teenager,3.0,2.0
vittles,2.0,1.0
quiona,2.0,1.0
cine,2.0,1.0
opinion,31.0,7.0
delish,37.0,1.0
rec,2.0,1.0
insists,2.0,1.0
patronize,7.0,4.0
usps,2.0,1.0


In [75]:
# Convert the counts into frequencies by dividing by the number of reviews in each class
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]
tokens.sample(5, random_state=7)

token,five_star,one_star
teenager,0.0012,0.00354
vittles,0.0008,0.00177
quiona,0.0008,0.00177
cine,0.0008,0.00177
opinion,0.012405,0.012389


In [76]:
# Now we can compute the ratio of 1-star to 5-star frequencies and add it as a new DataFrame column
tokens['one_to_five_ratio'] = tokens.one_star / tokens.five_star

In [77]:
tokens.sample(10, random_state=7)

token,five_star,one_star,one_to_five_ratio
teenager,0.0012,0.00354,2.948673
vittles,0.0008,0.00177,2.211504
quiona,0.0008,0.00177,2.211504
cine,0.0008,0.00177,2.211504
opinion,0.012405,0.012389,0.998744
delish,0.014806,0.00177,0.119541
rec,0.0008,0.00177,2.211504
insists,0.0008,0.00177,2.211504
patronize,0.002801,0.00708,2.527434
usps,0.0008,0.00177,2.211504


In [79]:
# The 10 tokens most predictive of 1-star reviews
tokens.sort_values('one_to_five_ratio', ascending=False).head(10)

token,five_star,one_star,one_to_five_ratio
staffperson,0.000400,0.030088,75.191150
refused,0.000400,0.024779,61.922124
disgusting,0.000800,0.042478,53.076106
filthy,0.000400,0.019469,48.653097
unacceptable,0.000400,0.015929,39.807080
acknowledge,0.000400,0.015929,39.807080
unprofessional,0.000400,0.015929,39.807080
ugh,0.000800,0.030088,37.595575
yuck,0.000800,0.028319,35.384071
fuse,0.000400,0.014159,35.384071
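
The output above covers only the 1-star side of the task. The 5-star side can be approached by inverting the ratio. The sketch below walks through the whole calculation (add-one smoothing, per-class frequencies, ratio, sort) on made-up numbers; the token counts and class sizes are hypothetical and chosen only to keep the arithmetic easy to follow:

```python
import pandas as pd

# Hypothetical raw token counts per class, for illustration only
toy_tokens = pd.DataFrame(
    {"one_star": [0.0, 4.0, 19.0], "five_star": [39.0, 19.0, 0.0]},
    index=pd.Index(["delish", "opinion", "filthy"], name="token"),
)
# Hypothetical number of training reviews in each class
toy_class_count = {"one_star": 100.0, "five_star": 400.0}

# Add 1 to every count so that no ratio divides by zero
toy_freq = toy_tokens + 1

# Convert the counts to frequencies relative to the size of each class
for column, n_reviews in toy_class_count.items():
    toy_freq[column] = toy_freq[column] / n_reviews

# A large ratio means the token is far more typical of 5-star reviews
toy_freq["five_to_one_ratio"] = toy_freq.five_star / toy_freq.one_star
top_five_star = toy_freq.sort_values("five_to_one_ratio", ascending=False)
print(top_five_star)
```

On the notebook's own `tokens` DataFrame, the analogous step would presumably be to add `tokens.five_star / tokens.one_star` as a column, sort by it in descending order, and take `.head(10)`.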


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
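
The steps above can be sketched as a single function. This is one possible outline rather than a reference solution: `evaluate_multiclass` is a name introduced here, the toy DataFrame is made up so that the example runs without `yelp.csv`, and `stratify=y` is an addition (not used earlier in this notebook) that keeps every class present in the tiny toy test set.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def evaluate_multiclass(df, labels=(1, 2, 3, 4, 5)):
    """Fit Multinomial Naive Bayes on df.text to predict df.stars and report metrics."""
    X = df.text
    y = df.stars
    # stratify=y is an assumption made for the small toy data below
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=1, stratify=y)

    # Learn the vocabulary on the training set only, then transform both sets
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred = nb.predict(X_test_dtm)

    return {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        # Null accuracy: share of the most frequent class in the testing set
        "null_accuracy": float(y_test.value_counts(normalize=True).max()),
        # Rows are actual classes, columns are predicted classes
        "confusion": metrics.confusion_matrix(y_test, y_pred, labels=list(labels)),
        "report": metrics.classification_report(
            y_test, y_pred, labels=list(labels), zero_division=0),
    }

# Hypothetical toy data: each star level has one distinctive word
toy_words = {1: "terrible", 2: "bad", 3: "okay", 4: "good", 5: "excellent"}
toy_df = pd.DataFrame(
    [{"stars": stars, "text": "{} place visit {}".format(word, i)}
     for stars, word in toy_words.items() for i in range(8)]
)

results = evaluate_multiclass(toy_df)
print(results["accuracy"], results["null_accuracy"])
print(results["confusion"])
print(results["report"])
```

On the real data, the call would presumably be `evaluate_multiclass(yelp_dataset)`. The toy reviews are trivially separable, so the accuracy printed here says nothing about how the model performs on the Yelp reviews.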