# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [171]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [172]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/yelp.csv"
yelp = pd.read_csv(url)

In [173]:
yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.
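As a small illustration of the filtering idea, the sketch below builds a toy DataFrame (made-up rows, not taken from `yelp.csv`) and selects the 1-star and 5-star rows in two ways: by combining two conditions with `|`, and with `Series.isin`. The names `toy_reviews`, `by_or`, and `by_isin` are hypothetical.

```python
import pandas as pd

# toy data for illustration only (not from yelp.csv)
toy_reviews = pd.DataFrame({
    "stars": [5, 4, 1, 3, 5],
    "text": ["a", "b", "c", "d", "e"],
})

# option 1: combine two boolean conditions with | (each wrapped in parentheses)
by_or = toy_reviews[(toy_reviews.stars == 5) | (toy_reviews.stars == 1)]

# option 2: test membership in a list of allowed values
by_isin = toy_reviews[toy_reviews.stars.isin([1, 5])]

# both approaches keep the same rows and preserve the original index labels
print(by_or.equals(by_isin))
print(by_isin.index.tolist())
```

Either form should work for this task; `isin` tends to read more easily once there are more than two values to keep.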

In [174]:
df = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

In [175]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


In [190]:
df.shape

(4086, 10)

In [176]:
df.stars.unique()

array([5, 1])

In [177]:
df[df.stars == 5].shape

(3337, 10)

In [178]:
df[df.stars == 1].shape

(749, 10)

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
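To see why the hint matters, here is a minimal sketch on made-up data (the names `toy`, `series_dtm`, and `frame_dtm` are hypothetical). Iterating over a Series yields one review string per row, whereas iterating over a DataFrame yields its column names, so CountVectorizer would end up vectorizing the word "text" instead of the reviews.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# toy data for illustration only (not from yelp.csv)
toy = pd.DataFrame({
    "text": ["great food", "terrible service", "great service"],
    "stars": [5, 1, 5],
})

# Series: CountVectorizer sees three documents
series_dtm = CountVectorizer().fit_transform(toy["text"])

# one-column DataFrame: CountVectorizer sees a single "document", the column name
frame_dtm = CountVectorizer().fit_transform(toy[["text"]])

print(series_dtm.shape)
print(frame_dtm.shape)
```

Checking the shape after vectorizing, as the tip in the introduction suggests, is a quick way to catch this mistake.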

In [179]:
X = df["text"]
y = df["stars"]
print(type(X), type(y))


<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


In [180]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3064,) (1022,) (3064,) (1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [181]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [182]:
X_train_dtm.shape, X_test_dtm.shape

((3064, 16932), (1022, 16932))

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [183]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred)


0.9256360078277887

In [184]:
metrics.confusion_matrix(y_test, y_pred)


array([[116,  56],
       [ 20, 830]])
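
The accuracy can be recovered by hand from the confusion matrix printed above, which is a useful sanity check. The sketch below copies those four numbers into a plain list (the names `cm`, `correct`, and `total` are hypothetical) and assumes the usual scikit-learn layout: rows are true classes and columns are predicted classes, both in sorted label order (1, then 5).

```python
# confusion matrix copied from the output above
# rows: true class (1, 5); columns: predicted class (1, 5)
cm = [[116, 56],
      [20, 830]]

# correct predictions lie on the diagonal
correct = cm[0][0] + cm[1][1]

# every test observation falls in exactly one cell
total = sum(sum(row) for row in cm)

accuracy = correct / total
print(correct, total, accuracy)
```

The off-diagonal cells are the errors: 56 one-star reviews predicted as five-star, and 20 five-star reviews predicted as one-star.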

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [191]:
only_5s = np.full(y_test.shape, 5)
only_5s

array([5, 5, 5, ..., 5, 5, 5])

In [192]:
null_accuracy = metrics.accuracy_score(y_test, only_5s)
null_accuracy

0.8317025440313112

In [193]:
# examine the class distribution of the testing set
y_test.value_counts()

5    850
1    172
Name: stars, dtype: int64

In [194]:
# calculate null accuracy
y_test.value_counts().head(1) / len(y_test)

5    0.831703
Name: stars, dtype: float64

In [196]:
# calculate null accuracy manually
850 / float(850 + 172)

0.8317025440313112
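
Another way to get the same number, sketched here on a toy Series with the class counts shown above (850 five-star and 172 one-star), is to ask pandas for class proportions directly. The names `y_toy` and `null_acc` are hypothetical.

```python
import pandas as pd

# toy response vector with the same class counts as the testing set above
y_toy = pd.Series([5] * 850 + [1] * 172, name="stars")

# normalize=True returns proportions instead of counts;
# the largest proportion is the null accuracy
null_acc = y_toy.value_counts(normalize=True).max()
print(null_acc)
```

This version does not depend on the number of classes, so it should carry over unchanged to a multi-class problem.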

## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
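As a starting point, the sketch below labels false positives and false negatives on made-up arrays (`y_true`, `y_hat`, `false_pos`, and `false_neg` are hypothetical names). It assumes that 5-star is treated as the positive class and 1-star as the negative class, which matches the sorted label order used by the confusion matrix.

```python
import numpy as np

# toy true and predicted labels for illustration only
y_true = np.array([5, 5, 1, 1, 5, 1])
y_hat = np.array([5, 1, 5, 1, 5, 5])

# assumption: 5-star is the "positive" class, 1-star is the "negative" class
false_pos = (y_true == 1) & (y_hat == 5)  # truly 1-star, predicted 5-star
false_neg = (y_true == 5) & (y_hat == 1)  # truly 5-star, predicted 1-star

print(false_pos.sum(), false_neg.sum())
```

With only the labels 1 and 5, the comparisons `y_true < y_hat` and `y_true > y_hat` select the same observations, since a wrong prediction can only be off in one direction.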

In [128]:
falseN = X_test[y_test > y_pred]
falseN

3999    TJ was there for me when my water heater broke...
4034    "Fine dining" is not just a setting.  it isn't...
1477    I' m psycho for this place.  The sell cupcakes...
635     This place really made a terrible situation as...
6101    Easy 5 star fun at the Phoenix Zoo. I've never...
617     Tried going there for my 1st visit and they we...
4963    This is by far my favourite department store, ...
2444    EXCELLENT CUSTOMER SERVICE! \n\nEven with Happ...
2002    Y'know, rarely do I buy the "extended warranty...
3052    When I met some friends for dinner at this res...
66      This an incredible church that embraces the pr...
1914    Much more organized than the last time I was t...
9297    Fast service, the woman who did my hair was gr...
5637    I'm giving them a 5 star review for how they h...
7903    First, I'm sorry this review is lengthy, but i...
Name: text, dtype: object

In [130]:
falseN[2444]

'EXCELLENT CUSTOMER SERVICE! \n\nEven with Happy Hour in full swing and the place full of parched Winkies, the service at our large table was spot on. We barely had time to settle our bums and push the silver out to claim our table territory before iced H20 was before us, our drink orders were taken . . . and one quick blink of the eye and the "peacemakers" were before us!  \n\nNow that\'s what I call HAPPY HOUR!\n\nBravo Thirsty Lion!'

In [131]:
falseN[66]

"This an incredible church that embraces the principle that ......Christ Life Church exists to bring people into a life-changing relationship with Jesus Christ.  Three campuses allow Christ Life to reach into several locations...online, Tempe and Casa Grande.   The wonderful children's and youth programs appeal to a wide perspective of people."

In [132]:
falseN[6101]

"Easy 5 star fun at the Phoenix Zoo. I've never seen critters so closeup before. At the Monkey Village, they don't even have cages. The giraffes came right up to us. I spent half my beer money on fishheads and shrimp feeding the stingrays. It was a blast!\n\nWe had great time!\nyow, bill\n\nPS - I've never commented on other people's reviews before. But jeez. When someone goes to the zoo and complains about the cost of a hot dog or a snow cone... well, for spider monkey's sake, get a clue!"

In [133]:
falseN[635]

"This place really made a terrible situation as easy as possible.  Our cat was hit by a car or beaten, and my husband found her immobile and crying horribly in pain.  We rushed her to the clinic, and the doctor saw her right away.  She told us our options, but was very upfront about the liklihood that our kitty would pass no matter what we did.  She didn't pressure us one way or another, and left the room so that we could talk privately and make our decision.  My husband asked if they could give her some pain meds while we talked about it, and they gave her some immediately.\n\nOnce we made our difficult decision to put our kitty down, she brought the cat into a private room with us so that we could say goodbye.  She left us alone and gave us as much time as we needed.  She also let us be present in the room when they gave her the final shot.  At all times the doctor and the techs were compassionate and very respectful.  I was crying and they were very sensitive to my feelings.  They t

In [134]:
falseP = X_test[y_test < y_pred]
falseP

8233    Good prices but this please is honestly not wo...
1695    Honestly the drinks are overpriced and have a ...
8362    Some of my friends brought me here last night ...
4374    Cadillac Ranch looked really awesome from the ...
9183    The food is simple, pure and uncomplicated; yo...
                              ...                        
126     My friend kept telling me how good their lunch...
1563    This was my second visit to Arriba's and belie...
6051    This place has really bad service and the food...
9846    NO.  Don't go. Don't do it.  This was my first...
874     We went to American Junkie after we found a 50...
Name: text, Length: 63, dtype: object

In [135]:
falseP[8233]

"Good prices but this please is honestly not worth the hassle. First of all you have to leave your purse at the counter. I am not comfortable doing that anywhere. Who does that? The only place I have ever seen this before was ASU's bookstore. \n\nThen you walk in this warehouse that has NO air conditioning!!! I made the mistake of going yesterday in the middle of the day (111 degrees out, yikes!) Then you get inside and you have to carry a catalog to refer to for prices since items are unmarked in large boxes. \n\nIf they had a/c it would have been worth it to save a few bucks on the toys my dog destroys in minutes but no air conditioning??? IN ARIZONA?? Madness!"

In [136]:
falseP[9846]

"NO.  Don't go. Don't do it.  This was my first visit and it was on a whim because we drive by it often.  \n\nOne other reviewer here made a wise statement - go the grocery store and buy frozen fish.  I promise it will be better than this."

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [187]:
sortedfc = np.sort(nb.feature_count_, axis=-1)
sortedfc

array([[    0.,     0.,     0., ...,  2389.,  2709.,  4255.],
       [    0.,     0.,     0., ...,  6539., 10463., 14288.]])

In [188]:
sortedfc[0, -10:]

array([ 856.,  939.,  950.,  953., 1300., 1313., 1519., 2389., 2709.,
       4255.])

In [189]:
sortedfc[1, -10:]

array([ 2927.,  3349.,  3517.,  3541.,  4326.,  4448.,  4763.,  6539.,
       10463., 14288.])

In [139]:
nb.class_count_

array([ 555., 2509.])

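I don't know how to map these sorted counts back to tokens: `np.sort` returns the counts in order but discards which token each count belongs to.

Answers below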

In [217]:
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

16932

In [218]:
# first row is one-star reviews, second row is five-star reviews
nb.feature_count_.shape

(2, 16932)

In [219]:
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]

In [220]:
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')

In [221]:

# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1

In [222]:
# first number is one-star reviews, second number is five-star reviews
nb.class_count_

array([ 577., 2487.])

In [223]:
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]

In [224]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

In [225]:
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows

tokens.sort_values('five_star_ratio', ascending=False).head(10)

token,one_star,five_star,five_star_ratio
fantastic,0.001733,0.077201,44.545235
favorite,0.006932,0.139928,20.18456
perfect,0.005199,0.103739,19.952553
yum,0.001733,0.024528,14.152392
gem,0.001733,0.016084,9.280257
gluten,0.001733,0.016084,9.280257
notch,0.001733,0.015682,9.048251
organic,0.001733,0.015682,9.048251
creative,0.001733,0.014073,8.120225
mozzarella,0.001733,0.013671,7.888219


In [216]:
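# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
# note: the zeros in the five_star column of this output indicate it was produced before the add-1 step was applied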
tokens.sort_values('five_star_ratio', ascending=True).head(10)

token,one_star,five_star,five_star_ratio
spagetti,0.001733,0.0,0.0
unapologetic,0.003466,0.0,0.0
unanswered,0.001733,0.0,0.0
scout,0.001733,0.0,0.0
scowl,0.001733,0.0,0.0
brusqueness,0.001733,0.0,0.0
scraggly,0.003466,0.0,0.0
brutal,0.001733,0.0,0.0
unacceptable,0.010399,0.0,0.0
bryan,0.001733,0.0,0.0


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [140]:
X = yelp.text
X

0       My wife took me here on my birthday for breakf...
1       I have no idea why some people give bad review...
2       love the gyro plate. Rice is so good and I als...
3       Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4       General Manager Scott Petello is a good egg!!!...
                              ...                        
9995    First visit...Had lunch here today - used my G...
9996    Should be called house of deliciousness!\n\nI ...
9997    I recently visited Olive and Ivy for business ...
9998    My nephew just moved to Scottsdale recently so...
9999    4-5 locations.. all 4.5 star average.. I think...
Name: text, Length: 10000, dtype: object

In [141]:
type(X)

<class 'pandas.core.series.Series'>

In [142]:
y = yelp.stars

In [143]:
y.shape, X.shape

((10000,), (10000,))

In [144]:
y.unique()

array([5, 4, 2, 3, 1])

In [145]:
y.value_counts()

4    3526
5    3337
3    1461
2     927
1     749
Name: stars, dtype: int64

In [146]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [147]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((7500,), (2500,), (7500,), (2500,))

In [148]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [149]:
X_train_dtm.shape, y_train.shape

((7500, 25562), (7500,))

In [150]:
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

In [151]:
metrics.accuracy_score(y_test, y_pred)

0.4712

In [152]:
all4s = np.full((2500,), 4)

In [155]:
null_accuracy4s = metrics.accuracy_score(y_test, all4s)
null_accuracy4s

0.3396

In [156]:
all5s = np.full((2500,), 5)
null_accuracy5s = metrics.accuracy_score(y_test, all5s)
null_accuracy5s

0.332

In [153]:
print(metrics.confusion_matrix(y_test, y_pred))

[[ 57  27  26  64  16]
 [ 19  14  41 132  25]
 [  3  10  34 326  27]
 [ 13   0  26 636 174]
 [  4   2   5 382 437]]
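
Per-class precision and recall can be worked out by hand from the confusion matrix printed above. The sketch below copies the matrix into a plain list (the names `cm5`, `precision_by_class`, and `recall_by_class` are hypothetical) and assumes rows are true classes and columns are predicted classes, in label order 1 through 5. A classification report computed on a different random split will not match these numbers exactly.

```python
# confusion matrix copied from the output above
# rows: true class (1..5); columns: predicted class (1..5)
cm5 = [[57, 27, 26, 64, 16],
       [19, 14, 41, 132, 25],
       [3, 10, 34, 326, 27],
       [13, 0, 26, 636, 174],
       [4, 2, 5, 382, 437]]
class_labels = [1, 2, 3, 4, 5]

precision_by_class = {}
recall_by_class = {}
for i, label in enumerate(class_labels):
    true_pos = cm5[i][i]
    predicted_as_label = sum(row[i] for row in cm5)  # column sum
    actually_label = sum(cm5[i])                     # row sum
    precision_by_class[label] = true_pos / predicted_as_label
    recall_by_class[label] = true_pos / actually_label

for label in class_labels:
    print(label,
          round(precision_by_class[label], 2),
          round(recall_by_class[label], 2))
```

Reading down the fourth column shows where most errors go: many 2-star, 3-star, and 5-star reviews are predicted as 4-star, which lowers the precision for class 4 and the recall for the neighboring classes.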


In [103]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       0.65      0.32      0.43       178
           2       0.37      0.08      0.14       224
           3       0.34      0.09      0.14       384
           4       0.45      0.76      0.56       898
           5       0.63      0.56      0.59       816

    accuracy                           0.50      2500
   macro avg       0.49      0.36      0.37      2500
weighted avg       0.50      0.50      0.46      2500

