# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars means a better rating.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- The **cool** column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The **useful** and **funny** columns are similar to the **cool** column.

**Goal:** Predict the star rating of a review using **only** the review text. (We will not be using the cool, funny, or useful columns.)

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [41]:
# Libraries
from __future__ import print_function  # use print only as a function
import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [42]:
yelp_df = pd.read_csv("../data/yelp.csv")

In [43]:
yelp_df.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** You will need to filter the DataFrame using an OR condition. [Working with DataFrames](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/) has an example of this, and this [code snippet](http://chrisalbon.com/python/pandas_select_rows_multiple_filters.html) may also be helpful.

In [44]:
yelp_1_5_df = yelp_df[(yelp_df["stars"]==5) | (yelp_df["stars"]==1)]
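
The OR condition can also be written with `Series.isin`, which reads a little more cleanly when several values are allowed. The sketch below uses a small made-up DataFrame (`toy_df` is a hypothetical stand-in, not the real Yelp data) to show that both filters select the same rows.

```python
import pandas as pd

# Hypothetical toy data standing in for yelp_df
toy_df = pd.DataFrame({
    "stars": [5, 1, 3, 4, 5, 2, 1],
    "text": ["great", "awful", "ok", "good", "love it", "meh", "never again"],
})

# OR condition, as in the cell above
by_or = toy_df[(toy_df["stars"] == 5) | (toy_df["stars"] == 1)]

# Equivalent filter using isin
by_isin = toy_df[toy_df["stars"].isin([1, 5])]

print(by_isin["stars"].tolist())
print(by_or.equals(by_isin))
```

Both approaches keep the original row index, so the filtered DataFrame still lines up with the source rows.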

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [45]:
X = yelp_1_5_df["text"]
y = yelp_1_5_df["stars"]
print(X.shape)
print(y.shape)

(4086,)
(4086,)


In [46]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)

(3064,)
(1022,)


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [48]:
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<3064x16825 sparse matrix of type '<type 'numpy.int64'>'
	with 237720 stored elements in Compressed Sparse Row format>

In [49]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<1022x16825 sparse matrix of type '<type 'numpy.int64'>'
	with 77006 stored elements in Compressed Sparse Row format>
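
Both matrices have 16825 columns because the vocabulary is learned from the training set only, and the testing set is then encoded against that same vocabulary. The sketch below illustrates this on a few made-up sentences (the documents and the `vect_toy` name are assumptions for illustration): a word that appears only in the test documents is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents
train_docs = ["great food great service", "terrible service"]
test_docs = ["great pizza", "terrible terrible food"]

vect_toy = CountVectorizer()
train_dtm = vect_toy.fit_transform(train_docs)  # learn the vocabulary and encode
test_dtm = vect_toy.transform(test_docs)        # encode only, vocabulary is fixed

# Column order follows the alphabetically sorted vocabulary
vocab_toy = sorted(vect_toy.vocabulary_)
print(vocab_toy)
print(train_dtm.toarray())
print(test_dtm.toarray())  # "pizza" has no column because it was never seen in training
```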

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [50]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [51]:
%time nb.fit(X_train_dtm, y_train)

CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 7.59 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [52]:
y_pred_class = nb.predict(X_test_dtm)

In [53]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.918786692759


In [54]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[126  58]
 [ 25 813]]
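
As a cross-check, the accuracy can be recomputed from the confusion matrix printed above. This is a minimal sketch that assumes rows are actual classes and columns are predicted classes, in the order 1-star then 5-star, and that treats 5-star as the "positive" class.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows: actual (1-star, 5-star), columns: predicted (1-star, 5-star)
cm = np.array([[126, 58],
               [25, 813]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
sensitivity = tp / (tp + fn)      # share of actual 5-star reviews caught
specificity = tn / (tn + fp)      # share of actual 1-star reviews caught

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The gap between sensitivity and specificity suggests the model is much better at recognizing 5-star reviews than 1-star reviews.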


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [55]:
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64
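
An equivalent way to get the same number is `value_counts(normalize=True)`. The sketch below rebuilds a stand-in label Series from the row sums of the Task 5 confusion matrix (184 actual 1-star and 838 actual 5-star reviews); `y_toy` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Stand-in for y_test, rebuilt from the confusion matrix row sums
y_toy = pd.Series([1] * 184 + [5] * 838, name="stars")

# Proportion of the most frequent class = null accuracy
null_accuracy = y_toy.value_counts(normalize=True).max()
most_frequent = y_toy.value_counts().idxmax()

print(most_frequent, round(null_accuracy, 6))
```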

## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [56]:
# False positives: actual 1-star reviews that were predicted as 5-star
X_test[y_test < y_pred_class][:5]

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
Name: text, dtype: object

In [57]:
# False negatives: actual 5-star reviews that were predicted as 1-star
X_test[y_test > y_pred_class][:5]

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
Name: text, dtype: object

* Naive Bayes classifies a review as 1-star or 5-star based on how strongly the words in the review are associated with each class. For each class, it sums the log-probabilities of the tokens in the review (plus the log of the class prior) and predicts the class with the higher total. I think one reason for the misclassifications is that the reviews differ in length: every token adds to the sum, so in a long review the many incidental words can outweigh the few words that carry the actual sentiment, which may lead to inaccurate results. Another reason may be that Naive Bayes does not consider the relationships between tokens; it treats each token as independent of the others.

* In this case, false positives are reviews that were predicted as 5-star but were actually 1-star, and false negatives are reviews that were predicted as 1-star but were actually 5-star.
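
The scoring described above can be sketched by hand. The example below fits `MultinomialNB` on a few made-up reviews (the documents, labels, and `*_toy` names are assumptions for illustration) and rebuilds the per-class score as the log prior plus the count-weighted sum of token log-probabilities, using the fitted model's `class_log_prior_` and `feature_log_prob_` attributes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews and star labels
docs = ["terrible rude terrible", "great friendly great", "great food", "rude staff"]
labels = [1, 5, 5, 1]

vect_toy = CountVectorizer()
nb_toy = MultinomialNB().fit(vect_toy.fit_transform(docs), labels)

# "not", "at", and "all" are not in the vocabulary, so they are ignored,
# and "not great" is scored exactly like "great"
new_dtm = vect_toy.transform(["not great at all, rude staff"])

# Score per class: log prior + sum over tokens of count * log P(token | class)
manual_scores = nb_toy.class_log_prior_ + new_dtm.toarray() @ nb_toy.feature_log_prob_.T
manual_pred = nb_toy.classes_[np.argmax(manual_scores, axis=1)]

print(manual_scores)
print(manual_pred, nb_toy.predict(new_dtm))
```

Because each token contributes its own term regardless of its neighbours, negation and sarcasm have no way to change the score.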

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [58]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = vect.get_feature_names_out()

In [59]:
one_star = nb.feature_count_[0, :]
five_star = nb.feature_count_[1, :]

In [60]:
token_df = pd.DataFrame({'token':vocab, '1-star':one_star, '5-star':five_star}).set_index('token')

In [61]:
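# Add 1 to every count so that no token has a count of zero (avoids dividing by zero below)
token_df["1-star"] += 1
token_df["5-star"] += 1

In [62]:
# Convert counts to rates by dividing by the number of reviews in each class
token_df["1-star"] = token_df["1-star"]/nb.class_count_[0]
token_df["5-star"] = token_df["5-star"]/nb.class_count_[1]

In [63]:
# Ratio of the 1-star rate to the 5-star rate: large values point to 1-star, small values to 5-star
token_df["frequency"] = token_df["1-star"]/token_df["5-star"]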

In [64]:
#10 best keywords that predict 1-star reviews
token_df.sort_values("frequency", ascending=False).head(10)

token,1-star,5-star,frequency
staffperson,0.030088,0.0004,75.19115
refused,0.024779,0.0004,61.922124
disgusting,0.042478,0.0008,53.076106
filthy,0.019469,0.0004,48.653097
unacceptable,0.015929,0.0004,39.80708
acknowledge,0.015929,0.0004,39.80708
unprofessional,0.015929,0.0004,39.80708
ugh,0.030088,0.0008,37.595575
yuck,0.028319,0.0008,35.384071
fuse,0.014159,0.0004,35.384071


In [65]:
#10 best keywords that predict 5-star reviews
token_df.sort_values("frequency", ascending=True).head(10)

token,1-star,5-star,frequency
fantastic,0.00354,0.077231,0.045834
perfect,0.00531,0.098039,0.054159
yum,0.00177,0.02481,0.071339
favorite,0.012389,0.138055,0.089742
outstanding,0.00177,0.019608,0.090265
brunch,0.00177,0.016807,0.10531
gem,0.00177,0.016006,0.110575
mozzarella,0.00177,0.015606,0.11341
pasty,0.00177,0.015606,0.11341
amazing,0.021239,0.185274,0.114635
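
The same calculation can be condensed into a few lines. The sketch below repeats it on a tiny made-up corpus (the documents, labels, and `*_toy` names are hypothetical) so that the arithmetic is easy to follow: add 1 to each count, divide by the number of reviews per class, then take the ratio of the two rates.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews: two 1-star, two 5-star
docs = ["rude rude staff", "filthy and rude", "amazing food", "amazing amazing staff"]
labels = [1, 1, 5, 5]

vect_toy = CountVectorizer()
nb_toy = MultinomialNB().fit(vect_toy.fit_transform(docs), labels)

# Tokens in column order of the document-term matrix
tokens = sorted(vect_toy.vocabulary_, key=vect_toy.vocabulary_.get)

# Add 1 to avoid dividing by zero, then normalize by the number of reviews per class
rates = (nb_toy.feature_count_ + 1) / nb_toy.class_count_[:, None]

# Row 0 is the 1-star class, row 1 is the 5-star class
ratio = pd.Series(rates[0] / rates[1], index=tokens).sort_values(ascending=False)
print(ratio)
```

Tokens at the top of `ratio` lean toward 1-star reviews and tokens at the bottom lean toward 5-star reviews, mirroring the two sorted tables above.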


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results.
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [66]:
X = yelp_df["text"]
y = yelp_df["stars"]
print(X.shape)
print(y.shape)

(10000,)
(10000,)


In [67]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)

(7500,)
(2500,)


In [68]:
vect = CountVectorizer()

vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

X_test_dtm = vect.transform(X_test)

In [69]:
X_train_dtm

<7500x25797 sparse matrix of type '<type 'numpy.int64'>'
	with 622700 stored elements in Compressed Sparse Row format>

In [70]:
X_test_dtm

<2500x25797 sparse matrix of type '<type 'numpy.int64'>'
	with 200729 stored elements in Compressed Sparse Row format>

In [71]:
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)

CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 25.8 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [72]:
y_pred_class = nb.predict(X_test_dtm)

In [73]:
metrics.accuracy_score(y_test, y_pred_class)

0.47120000000000001

In [74]:
y_test.value_counts().head(1) / len(y_test)

4    0.3536
Name: stars, dtype: float64

In [75]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 55  14  24  65  27]
 [ 28  16  41 122  27]
 [  5   7  35 281  37]
 [  7   0  16 629 232]
 [  6   4   6 373 443]]


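* The testing accuracy (about 47%) is better than the null accuracy (about 35%), but far lower than the 92% achieved on the binary problem
* The model does best on 4-star reviews, correctly identifying about 71% of them (629 of 884)
* The model correctly identifies about 53% of the 5-star reviews (443 of 832)
* The model does poorly on 1-, 2-, and 3-star reviews, correctly identifying only about 30%, 7%, and 10% of them, respectively
* 281 3-star reviews were classified as 4-star
* 232 4-star reviews were classified as 5-star
* 373 5-star reviews were classified as 4-star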

In [76]:
from sklearn.metrics import classification_report

In [77]:
print(classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          1       0.54      0.30      0.38       185
          2       0.39      0.07      0.12       234
          3       0.29      0.10      0.14       365
          4       0.43      0.71      0.53       884
          5       0.58      0.53      0.55       832

avg / total       0.46      0.47      0.43      2500



* The model has its highest recall on 4-star reviews (71%), followed by 5-star reviews (53%)
* The model performs extremely poorly on 2-star reviews, with a recall of only 7%
* The f1-scores are highest for 4-star and 5-star reviews (0.53 and 0.55), indicating that the model handles these two classes better than the others
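
The task also suggests working these metrics out manually from the confusion matrix. A minimal sketch, assuming rows are actual classes and columns are predicted classes (1-star through 5-star), with the matrix copied from the output above:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[55, 14, 24, 65, 27],
               [28, 16, 41, 122, 27],
               [5, 7, 35, 281, 37],
               [7, 0, 16, 629, 232],
               [6, 4, 6, 373, 443]])

true_pos = np.diag(cm)                   # correct predictions per class
support = cm.sum(axis=1)                 # actual reviews per class (row sums)
recall = true_pos / support              # of the actual class-k reviews, share predicted as k
precision = true_pos / cm.sum(axis=0)    # of the reviews predicted as k, share that are actually k
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = true_pos.sum() / cm.sum()

for star, p, r, f, s in zip(range(1, 6), precision, recall, f1, support):
    print(star, round(p, 2), round(r, 2), round(f, 2), s)
print(round(accuracy, 4))
```

Rounded to two decimals, these values agree with the classification report, and the overall accuracy agrees with the 0.4712 reported earlier.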