## Yelp reviews - Text classification

**Description of the data:**

- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The stars column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars is better. In other words, it is the reviewer's rating of the business.
- The text column is the text of the review.

*Goal: Predict the star rating of a review using only the review text.*

In [1]:
import pandas as pd
import numpy as np

In [2]:
yelp = pd.read_csv('yelp.csv')

In [3]:
yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [4]:
yelp.shape

(10000, 10)

In [5]:
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

### Use only 5-star and 1-star reviews.

In [6]:
# filter the DataFrame using an OR condition
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# equivalently, use the 'loc' indexer
yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]

In [7]:
# examine the shape
yelp_best_worst.shape

(4086, 10)

In [8]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split X and y into training and testing sets
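# note: use sklearn.cross_validation instead of sklearn.model_selection for scikit-learn 0.17 and earlier
from sklearn.model_selection import train_test_split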
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# examine the object shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(3064,)
(1022,)
(3064,)
(1022,)
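
The shapes above are consistent with a 75/25 split. The following is a minimal sketch of that arithmetic, assuming `train_test_split` is using scikit-learn's default `test_size` of 0.25 and rounds the test count up; the helper name `split_sizes` is hypothetical and only used for illustration.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 4086 reviews remain after keeping only the 1-star and 5-star rows
print(split_sizes(4086))
```

Under the same assumption, the full dataset of 10000 reviews would split into 7500 training and 2500 testing rows.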




In [9]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english', lowercase=True, max_df=0.3, min_df=2)

In [10]:
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 8491)
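
The 8491 columns are the tokens that survive the `min_df=2` and `max_df=0.3` filters: a token must appear in at least 2 training documents and in no more than 30% of them. The sketch below illustrates that document-frequency rule in plain Python. It is a simplification, not `CountVectorizer`'s implementation: it splits on whitespace, skips stop-word removal, and uses a hypothetical helper (`filter_vocabulary`) and a made-up toy corpus.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.3):
    """Keep tokens found in at least `min_df` documents and in at most a `max_df` share of them."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # count each token once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        token for token, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    )

# hypothetical toy corpus, not taken from yelp.csv
toy_docs = [
    "great food great service", "terrible food", "great patio", "slow service",
    "food was cold", "loved patio", "friendly staff", "rude staff",
    "never again", "food rocks",
]
print(filter_vocabulary(toy_docs))
```

In the toy corpus, "food" is dropped for being too common (4 of 10 documents), and tokens seen in only one document are dropped for being too rare.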

In [11]:
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(1022, 8491)

In [12]:
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [13]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [15]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.91976516634050876

In [16]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[147,  37],
       [ 45, 793]])
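
The accuracy above can be recovered from the confusion matrix, which also shows how each class fares on its own. This is a small worked sketch in plain Python, relying on scikit-learn's convention that rows are true classes and columns are predicted classes, both in sorted label order (1-star, then 5-star).

```python
# confusion matrix from above: rows are true classes (1-star, 5-star),
# columns are predicted classes in the same order
confusion = [[147, 37],
             [45, 793]]

total = sum(sum(row) for row in confusion)

# correct predictions sit on the diagonal
accuracy = (confusion[0][0] + confusion[1][1]) / total

# share of each true class that the model identified correctly
one_star_recall = confusion[0][0] / sum(confusion[0])
five_star_recall = confusion[1][1] / sum(confusion[1])

print(round(accuracy, 4), round(one_star_recall, 4), round(five_star_recall, 4))
```

The model recovers roughly 95% of the 5-star reviews but only about 80% of the 1-star reviews, so most of its errors fall on the minority class.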

Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class.

In [17]:
# examine the class distribution of the testing set
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64

In [18]:
# calculate null accuracy
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64
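
The same baseline can be sketched without pandas. The helper name `null_accuracy` is hypothetical, and the label list is rebuilt from the class counts shown above rather than read from the data.

```python
from collections import Counter

def null_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# rebuild the testing-set labels from the class counts shown above
labels = [5] * 838 + [1] * 184
print(null_accuracy(labels))
```

The Naive Bayes accuracy of about 0.92 beats this baseline of about 0.82 by roughly 10 percentage points.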

In [19]:
# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)
X_test[y_test < y_pred_class].head(10)

1781    If you like the stuck up Scottsdale vibe this ...
2839    Never Again,\nI brought my Mountain Bike in (w...
1423    I hadn't been to Fuddruckers for about 20 year...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
7037    3 months, 22 emails, 10 plus calls and (I thin...
8755    Not lesbian/gay friendly at all. I should have...
9125    La Grande Orange Grocery has a problem. It can...
9185    For frozen yogurt quality, I give this place a...
436     this another place that i would give no stars ...
Name: text, dtype: object

In [20]:
# false positive: model is reacting to the words "good", "impressive", "nice"
X_test[1781]

"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating."

In [21]:
# false positive: model does not have enough data to work with
X_test[1919]

'D-scust-ing.'

In [22]:
# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)
X_test[y_test > y_pred_class].head(10)

1438    I really enjoyed my mani/pedi with Trisha!  Sh...
4533    You know the story about the cobbler who creat...
7148    I now consider myself an Arizonian. If you dri...
7046    What a great place to hike/ run/ shred! It's a...
1341    Just want to Say ThankYou Hawaiian Airlines fo...
4899    I'm sorry to say that they closed their doors ...
1       I have no idea why some people give bad review...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
Name: text, dtype: object

In [23]:
# false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
print(X_test[4963])

This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I'm in. The shoe SA's will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! 

I am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney's is not only the prompt attention of SA's, but the fact that they aren't rushing around trying to help 35 people at once. The SA's at Barney's are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also nev

Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.

In [24]:
# store the vocabulary of X_train
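# note: use get_feature_names() instead of get_feature_names_out() for scikit-learn versions before 1.0
X_train_tokens = vect.get_feature_names_out()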
len(X_train_tokens)

8491

In [25]:
# first row is one-star reviews, second row is five-star reviews
nb.feature_count_.shape

(2, 8491)

In [26]:
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]

In [27]:
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')

In [28]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1

In [29]:
# first number is one-star reviews, second number is five-star reviews
nb.class_count_

array([  565.,  2499.])

In [30]:
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]

In [31]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

In [32]:
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('five_star_ratio', ascending=False).head(10)

token,five_star,one_star,five_star_ratio
fantastic,0.077231,0.00354,21.817727
perfect,0.098039,0.00531,18.464052
yum,0.02481,0.00177,14.017607
favorite,0.138055,0.012389,11.143029
outstanding,0.019608,0.00177,11.078431
brunch,0.016807,0.00177,9.495798
gem,0.016006,0.00177,9.043617
mozzarella,0.015606,0.00177,8.817527
pasty,0.015606,0.00177,8.817527
amazing,0.185274,0.021239,8.723323


In [33]:
# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
tokens.sort_values('five_star_ratio', ascending=True).head(10)

token,five_star,one_star,five_star_ratio
refused,0.0004,0.024779,0.016149
disgusting,0.0008,0.042478,0.018841
filthy,0.0004,0.019469,0.020554
acknowledge,0.0004,0.015929,0.025121
unprofessional,0.0004,0.015929,0.025121
unacceptable,0.0004,0.015929,0.025121
ugh,0.0008,0.030088,0.026599
yuck,0.0008,0.028319,0.028261
disaster,0.0004,0.014159,0.028261
voucher,0.0004,0.014159,0.028261
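
The ratios in the two tables can be reproduced by hand from the smoothing and normalization steps in cells 28 through 31. The sketch below wraps those steps in a hypothetical helper, `five_star_ratio`. The raw counts passed to it are assumptions back-derived from the rounded table values (for example, 0.077231 × 2499 ≈ 193, so "fantastic" would have a raw five-star count of 192), not values read from `nb.feature_count_`.

```python
def five_star_ratio(one_star_count, five_star_count, n_one_star=565, n_five_star=2499):
    """Ratio of smoothed per-class token frequencies (five-star over one-star)."""
    # add 1 to each count to avoid dividing by 0, then normalize by class size
    one_star_freq = (one_star_count + 1) / n_one_star
    five_star_freq = (five_star_count + 1) / n_five_star
    return five_star_freq / one_star_freq

# assumed raw counts, inferred from the tables above
print(round(five_star_ratio(one_star_count=1, five_star_count=192), 6))   # "fantastic"
print(round(five_star_ratio(one_star_count=13, five_star_count=0), 6))    # "refused"
```

Dividing by the class counts matters here: there are more than four times as many 5-star as 1-star training reviews, so raw counts alone would overstate how strongly a token signals a 5-star review.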


### Predict the star rating for all reviews (1 through 5 stars)

In [35]:
# define X and y using the original DataFrame
X = yelp.text
y = yelp.stars

In [36]:
# check that y contains 5 different classes
y.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [37]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [38]:
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [39]:
# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [40]:
# make class predictions
y_pred_class = nb.predict(X_test_dtm)

In [41]:
# calculate the accuracy
metrics.accuracy_score(y_test, y_pred_class)


0.48520000000000002

In [42]:
# calculate the null accuracy
y_test.value_counts().head(1) / len(y_test)

4    0.3536
Name: stars, dtype: float64

In [43]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 82,  30,  32,  28,  13],
       [ 41,  41,  53,  77,  22],
       [ 15,  18,  90, 195,  47],
       [ 15,  14,  50, 561, 244],
       [ 19,   9,  23, 342, 439]])

**Confusion matrix comments:**

- More than 90% of the 4-star and 5-star reviews are classified as 4 or 5 stars, but the model has a hard time distinguishing between the two.
- 2-star and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data. 1-star reviews are most often classified correctly (82 of 185), although more than half of them are still misclassified.
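
The row-by-row pattern can be checked directly from the matrix. This is a minimal plain-Python sketch: for each true class it finds the most common predicted class and the share of reviews predicted as 4 or 5 stars. The variable names are illustrative only.

```python
labels = [1, 2, 3, 4, 5]

# confusion matrix from above: rows are true classes, columns are predicted classes
confusion = [
    [82, 30, 32, 28, 13],
    [41, 41, 53, 77, 22],
    [15, 18, 90, 195, 47],
    [15, 14, 50, 561, 244],
    [19, 9, 23, 342, 439],
]

summary = []
for true_label, row in zip(labels, confusion):
    # predicted class that receives the most reviews of this true class
    top_prediction = labels[row.index(max(row))]
    # share of this true class predicted as either 4 or 5 stars
    share_high = (row[3] + row[4]) / sum(row)
    summary.append((true_label, top_prediction, round(share_high, 2)))

for entry in summary:
    print(entry)
```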

In [44]:
# print the classification report
print(metrics.classification_report(y_test, y_pred_class))

             precision    recall  f1-score   support

          1       0.48      0.44      0.46       185
          2       0.37      0.18      0.24       234
          3       0.36      0.25      0.29       365
          4       0.47      0.63      0.54       884
          5       0.57      0.53      0.55       832

avg / total       0.48      0.49      0.47      2500



In [45]:
# manually calculate the precision for class 1 (first column of the confusion matrix)
precision = 82 / float(82 + 41 + 15 + 15 + 19)
print(round(precision, 4))

0.4767


In [46]:
# manually calculate the recall for class 1 (first row of the confusion matrix)
recall = 82 / float(82 + 30 + 32 + 28 + 13)
print(round(recall, 4))

0.4432


In [47]:
# manually calculate the F1 score for class 1
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 4))

0.4594


**Classification report comments:**

- Class 1 has similar precision (0.48) and recall (0.44), meaning that the model detects fewer than half of the 1-star reviews, and that when the model predicts a review is 1-star, it's correct a little under half of the time.
- Class 5 has the highest precision (0.57) and the highest F1 score (0.55), probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from. Its recall (0.53) is second only to class 4 (0.63).
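
The per-class figures in the classification report can be recomputed from the confusion matrix. The sketch below does this in plain Python with a hypothetical helper, `per_class_metrics`. It assumes every class is predicted at least once, so no division by zero is handled.

```python
def per_class_metrics(confusion):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(confusion)
    results = []
    for k in range(n):
        true_positives = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = true_positives / predicted_k
        recall = true_positives / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        results.append((precision, recall, f1))
    return results

# confusion matrix from above: rows are true classes, columns are predicted classes
confusion = [
    [82, 30, 32, 28, 13],
    [41, 41, 53, 77, 22],
    [15, 18, 90, 195, 47],
    [15, 14, 50, 561, 244],
    [19, 9, 23, 342, 439],
]

for star, scores in enumerate(per_class_metrics(confusion), start=1):
    print(star, [round(value, 2) for value in scores])
```

Precision divides by a column sum (everything predicted as that class) and recall divides by a row sum (everything that truly belongs to that class), which is why the two can differ sharply for the same class.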