# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [1]:
import pandas as pd

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [2]:
# Forward slashes work on every platform and avoid invalid escapes such as '\d' and '\y'
df = pd.read_csv('data/yelp.csv')

In [3]:
df.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [4]:
reviews = df[(df['stars'] == 1) | (df['stars'] == 5)]

In [5]:
reviews['stars'].value_counts()

5    3337
1     749
Name: stars, dtype: int64
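As a side note, the same filter can be written with `Series.isin`, which stays readable when more than two values are allowed. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in, not the Yelp data) to show that both forms select the same rows.

```python
import pandas as pd

# Toy stand-in for the Yelp data (made-up rows, not taken from yelp.csv).
toy = pd.DataFrame({
    'stars': [5, 1, 3, 4, 1, 5, 2],
    'text': ['great', 'awful', 'ok', 'good', 'bad', 'superb', 'meh'],
})

# Two conditions combined with |; each condition needs its own parentheses.
by_or = toy[(toy['stars'] == 1) | (toy['stars'] == 5)]

# Equivalent filter using isin, which scales better to many allowed values.
by_isin = toy[toy['stars'].isin([1, 5])]

print(by_isin['stars'].value_counts())
```

Either form keeps the original row index, which is why the index values of the filtered reviews are not consecutive.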

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
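To see why the hint matters, here is a minimal sketch with made-up documents (`toy` is a hypothetical DataFrame). Iterating over a DataFrame yields its column labels rather than its rows, so CountVectorizer would treat the column name as the only document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews, used only to illustrate the shapes.
toy = pd.DataFrame({'text': ['good food', 'bad service', 'good service']})

# A Series is iterated value by value: one document per review.
from_series = CountVectorizer().fit_transform(toy['text'])

# A DataFrame is iterated by column label: the single "document" is 'text'.
from_frame = CountVectorizer().fit_transform(toy[['text']])

print(from_series.shape)
print(from_frame.shape)
```

The Series gives one row per review, while the one-column DataFrame collapses to a single row, which is rarely what you want.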

In [6]:
X = reviews['text']
y = reviews['stars']

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

In [8]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2860,)
(1226,)
(2860,)
(1226,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cv = CountVectorizer()

In [11]:
X_train_dtm = cv.fit_transform(X_train)
X_train_dtm

<2860x16025 sparse matrix of type '<class 'numpy.int64'>'
	with 221151 stored elements in Compressed Sparse Row format>

In [12]:
X_test_dtm = cv.transform(X_test)
X_test_dtm

<1226x16025 sparse matrix of type '<class 'numpy.int64'>'
	with 92579 stored elements in Compressed Sparse Row format>
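Both matrices have the same number of columns because the vocabulary is learned from the training text only, and `transform` reuses it. The sketch below illustrates this with made-up documents (`train_docs` and `test_docs` are hypothetical): words that appear only in the test documents are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents; 'pizza' and 'terrible' never occur in the training text.
train_docs = ['good food', 'bad service', 'good service']
test_docs = ['good pizza', 'terrible service']

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learns the vocabulary, then counts
test_dtm = vect.transform(test_docs)        # counts using the learned vocabulary

# vocabulary_ maps each token to its column index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(test_dtm.toarray())
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the columns would no longer line up with what the model was trained on.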

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [14]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

nb_pred = nb.predict(X_test_dtm)

In [15]:
print(metrics.accuracy_score(y_test, nb_pred))

0.9094616639477977


In [16]:
print(metrics.confusion_matrix(y_test, nb_pred))

[[138  93]
 [ 18 977]]
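As a quick check, the accuracy can be recomputed by hand from the confusion matrix printed above. This sketch assumes the usual scikit-learn layout (rows are actual classes, columns are predicted classes, both ordered 1-star then 5-star); the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: actual 1-star, actual 5-star. Columns: predicted 1-star, predicted 5-star.
cm = [[138, 93],
      [18, 977]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal.
accuracy = (cm[0][0] + cm[1][1]) / total

# Share of each actual class that was predicted correctly (per-class recall).
recall_one_star = cm[0][0] / sum(cm[0])
recall_five_star = cm[1][1] / sum(cm[1])

print(total, accuracy)
print(round(recall_one_star, 3), round(recall_five_star, 3))
```

The model gets almost all 5-star reviews right but only about 60% of the 1-star reviews, which the overall accuracy alone does not reveal.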


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [17]:
y_test.value_counts()

5    995
1    231
Name: stars, dtype: int64

In [18]:
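# This shortcut assumes 0/1 labels; y_test holds 1 and 5, so the result is a mean rating, not an accuracy
max(y_test.mean(), 1 - y_test.mean())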

4.246329526916803

In [19]:
y_test.value_counts().head(1) / len(y_test)

5    0.811582
Name: stars, dtype: float64
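The two approaches can be reconciled, as this sketch suggests. It rebuilds a Series with the same class counts as `y_test` (995 five-star and 231 one-star reviews, taken from the output above) and shows that the mean-based shortcut works once the labels are mapped to 0 and 1. The variable names are illustrative.

```python
import pandas as pd

# Series with the same class counts as y_test shown above.
y = pd.Series([5] * 995 + [1] * 231, name='stars')

# Works for any label values: share of the most frequent class.
null_acc = y.value_counts(normalize=True).max()

# The mean-based shortcut needs 0/1 labels, so map 5 -> 1 and 1 -> 0 first.
y_binary = (y == 5).astype(int)
null_acc_binary = max(y_binary.mean(), 1 - y_binary.mean())

print(y.mean())  # a mean star rating, not an accuracy
print(null_acc, null_acc_binary)
```

With a null accuracy of about 0.81, the model's testing accuracy of about 0.91 is a real but modest improvement over always predicting 5 stars.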

## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
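One way to explore the second hint is with a tiny made-up example. `confusion_matrix` orders the labels by sorting them, so with labels 1 and 5 the second row and column belong to 5. Reading the matrix in the conventional binary layout then treats 5-star as the "positive" class. The labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: two actual 1-star reviews and three actual 5-star reviews.
y_true = [1, 1, 5, 5, 5]
y_pred = [1, 5, 5, 5, 1]

# Rows are actual classes and columns are predicted classes, in sorted order [1, 5].
cm = confusion_matrix(y_true, y_pred)

# Conventional binary reading, with the second label (5) as the positive class.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Under that reading, a false positive is a 1-star review predicted as 5-star, and a false negative is a 5-star review predicted as 1-star.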

In [20]:
# True positives: actual 5-star reviews that were predicted as 5-star
pos = X_test[(y_test == 5) & (nb_pred == 5)]
pos.shape

(977,)

In [21]:
# False positives: actual 1-star reviews that were predicted as 5-star
fp = X_test[(y_test == 1) & (nb_pred == 5)]
fp.shape

(93,)

In [22]:
fp

7340    After landing in PHX but before embarking on a...
2430    Just because this place is called "Maria Maria...
9262    Very loud, crowded and quite unpleasant. I fel...
8681    As I promised myself, I'd go back again to try...
9818    Mucho Gusto es mucho mierda. \r\n\r\nLet me sa...
                              ...                        
6576    veerrrrrryyyyyyyyy dirty. I think I will buy t...
289     I'd say I've been to the Clubhouse a few times...
4562    despite it's billing as the 'largest thrift st...
7803    I'm sad to report that we dined here for lunch...
6159    Really, if I could, I would give this place ze...
Name: text, Length: 93, dtype: object

In [23]:
# False negatives: actual 5-star reviews that were predicted as 1-star
fn = X_test[(y_test == 5) & (nb_pred == 1)]
fn

2902    Southwest blows its competitors so far out of ...
2407    Recently bought a house and had issues with th...
7531    This was such a good experience\r\nI was up al...
9636    OK OK... as a Proud Italian I hope my momma do...
390     RIP AZ Coffee Connection.  :(  I stopped by tw...
5565    I`ve had work done by this shop a few times th...
9297    Fast service, the woman who did my hair was gr...
2404    I have been going to Physical Therapy for seve...
9765    You can't give anything less than 5 stars to a...
8083    After unsuccessfully attempting a walk in gel ...
5223    Brought my car here b/c of the reviews I read ...
8053    My wife called to have our vent cleaned since ...
750     This store has the most pleasant employees of ...
2306    There are certain people in your life that you...
2127    This place is great!  I called at 8:30 am to m...
8571    If it's an emergency, they will generally see ...
2444    EXCELLENT CUSTOMER SERVICE! \r\n\r\nEven with ...
2504    I've p

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
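To make the two attributes concrete, here is a minimal sketch on three made-up documents (the documents and the name `model` are hypothetical). Each row of `feature_count_` corresponds to a class in `classes_`, and each column to a token in the vectorizer's vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: one 1-star review and two 5-star reviews.
docs = ['bad bad service', 'great food', 'great great view']
stars = [1, 5, 5]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)
model = MultinomialNB().fit(dtm, stars)

# Tokens in column order.
token_names = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(token_names)
print(model.classes_)        # class labels, in the row order used below
print(model.class_count_)    # number of training documents per class
print(model.feature_count_)  # token counts per class (rows) and token (columns)
```

Dividing a row of `feature_count_` by the matching entry of `class_count_` gives a per-class frequency, which makes tokens comparable across classes of different sizes.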

In [24]:
nb.feature_count_

array([[24.,  3.,  3., ...,  0.,  0.,  0.],
       [26.,  7.,  2., ...,  4.,  1.,  1.]])

In [26]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_tokens = cv.get_feature_names_out()
len(X_train_tokens)

16025

In [29]:
one_star_count = nb.feature_count_[0,:]
five_star_count = nb.feature_count_[1,:]

In [30]:
print(one_star_count)
print(five_star_count)

[24.  3.  3. ...  0.  0.  0.]
[26.  7.  2. ...  4.  1.  1.]


In [31]:
tokens = pd.DataFrame({'token': X_train_tokens, '1 star': one_star_count, '5 star': five_star_count}).set_index('token')
tokens.head()

token,1 star,5 star
00,24.0,26.0
000,3.0,7.0
00am,3.0,2.0
00pm,1.0,4.0
01,1.0,2.0


In [41]:
# Add 1 to every count so that no token has a zero count (avoids division by zero in the ratios)
tokens['1 star'] = tokens['1 star'] + 1
tokens['5 star'] = tokens['5 star'] + 1

In [42]:
# Divide by the number of training reviews in each class to get per-class frequencies
tokens['1 star ratio'] = tokens['1 star'].div(nb.class_count_[0])
tokens['5 star ratio'] = tokens['5 star'].div(nb.class_count_[1])
tokens.head()

token,1 star,5 star,1 star ratio,5 star ratio
00,25.0,27.0,0.048263,0.011529
000,4.0,8.0,0.007722,0.003416
00am,4.0,3.0,0.007722,0.001281
00pm,2.0,5.0,0.003861,0.002135
01,2.0,3.0,0.003861,0.001281


In [46]:
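# Relative frequency of each token in one class versus the other
tokens['1 star values'] = tokens['1 star ratio']/tokens['5 star ratio']
tokens['5 star values'] = tokens['5 star ratio']/tokens['1 star ratio']
tokens.sort_values('5 star values', ascending=False).head(10)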

token,1 star,5 star,1 star ratio,5 star ratio,1 star values,5 star values
fantastic,2.0,189.0,0.003861,0.0807,0.047844,20.901366
perfect,3.0,234.0,0.005792,0.099915,0.057965,17.251921
yum,1.0,62.0,0.001931,0.026473,0.072923,13.713066
favorite,6.0,308.0,0.011583,0.131512,0.088076,11.353829
perfection,1.0,48.0,0.001931,0.020495,0.094192,10.616567
heaven,1.0,46.0,0.001931,0.019641,0.098288,10.17421
reasonably,1.0,41.0,0.001931,0.017506,0.110274,9.068318
view,1.0,40.0,0.001931,0.017079,0.113031,8.847139
knowledgeable,1.0,40.0,0.001931,0.017079,0.113031,8.847139
helped,1.0,38.0,0.001931,0.016225,0.11898,8.404782
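The numbers in this table can be reproduced by hand. The sketch below copies the smoothed counts for three rows and uses the training class sizes implied by the earlier value counts (749 - 231 = 518 one-star and 3337 - 995 = 2342 five-star reviews). It also shows that sorting by `'1 star values'` ranks tokens in the opposite direction, which is one possible way to approach the 1-star half of the task.

```python
import pandas as pd

# Smoothed counts copied from three rows of the table above.
tokens = pd.DataFrame(
    {'1 star': [2.0, 3.0, 6.0], '5 star': [189.0, 234.0, 308.0]},
    index=pd.Index(['fantastic', 'perfect', 'favorite'], name='token'),
)

# Training class sizes derived from the value counts shown earlier.
n_one_star, n_five_star = 518, 2342

tokens['1 star ratio'] = tokens['1 star'] / n_one_star
tokens['5 star ratio'] = tokens['5 star'] / n_five_star
tokens['1 star values'] = tokens['1 star ratio'] / tokens['5 star ratio']
tokens['5 star values'] = tokens['5 star ratio'] / tokens['1 star ratio']

# Highest '1 star values' first: tokens leaning most towards 1-star reviews.
by_one_star = tokens.sort_values('1 star values', ascending=False)
print(by_one_star)
```

On the full `tokens` DataFrame, the same sort followed by `.head(10)` would list the 10 tokens most associated with 1-star reviews.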


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!
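The steps above can be sketched as a single helper. This is only an outline under stated assumptions: `evaluate_all_classes` is a hypothetical function name, and the tiny `toy` DataFrame is made up so that the block runs on its own. On the real data you would pass the original `df` instead.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate_all_classes(frame, test_size=0.3, random_state=123):
    """Fit CountVectorizer + MultinomialNB on all star ratings and collect metrics."""
    X = frame['text']
    y = frame['stars']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # Learn the vocabulary on the training text only, then reuse it for the test text.
    vect = CountVectorizer()
    X_train_dtm = vect.fit_transform(X_train)
    X_test_dtm = vect.transform(X_test)

    model = MultinomialNB().fit(X_train_dtm, y_train)
    y_pred = model.predict(X_test_dtm)

    return {
        'accuracy': metrics.accuracy_score(y_test, y_pred),
        # Accuracy of always predicting the most frequent class in the test set.
        'null_accuracy': y_test.value_counts(normalize=True).max(),
        'confusion_matrix': metrics.confusion_matrix(y_test, y_pred),
        'report': metrics.classification_report(y_test, y_pred, zero_division=0),
    }


# Made-up reviews so the sketch runs; these are not rows from yelp.csv.
toy = pd.DataFrame({
    'stars': [1, 2, 3, 4, 5] * 6,
    'text': ['terrible rude awful', 'poor slow bland', 'okay average fine',
             'good tasty friendly', 'amazing perfect fantastic'] * 6,
})

results = evaluate_all_classes(toy)
print(results['accuracy'], results['null_accuracy'])
print(results['confusion_matrix'])
print(results['report'])
```

The toy results say nothing about the Yelp data; the point is only the order of the steps and which objects feed the comparison against null accuracy.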