#  Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df=pd.read_csv("yelp.csv")
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [3]:
df.shape

(10000, 10)

In [4]:
df.size

100000

In [5]:
df.ndim

2

In [6]:
df=df.drop(["business_id","date","review_id","cool","useful","funny","type","user_id"],axis=1)
df.head()

Unnamed: 0,stars,text
0,5,My wife took me here on my birthday for breakf...
1,5,I have no idea why some people give bad review...
2,4,love the gyro plate. Rice is so good and I als...
3,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!..."
4,5,General Manager Scott Petello is a good egg!!!...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
stars    10000 non-null int64
text     10000 non-null object
dtypes: int64(1), object(1)
memory usage: 156.3+ KB


In [8]:
df.describe()

Unnamed: 0,stars
count,10000.0
mean,3.7775
std,1.214636
min,1.0
25%,3.0
50%,4.0
75%,5.0
max,5.0


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [9]:
df1=df[(df.stars==5)|(df.stars==1)]
df1.head()

Unnamed: 0,stars,text
0,5,My wife took me here on my birthday for breakf...
1,5,I have no idea why some people give bad review...
3,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!..."
4,5,General Manager Scott Petello is a good egg!!!...
6,5,Drop what you're doing and drive here. After I...


In [10]:
df1.stars.value_counts()

5    3337
1     749
Name: stars, dtype: int64

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [32]:
x=df1["text"]
x.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
6    Drop what you're doing and drive here. After I...
Name: text, dtype: object

In [15]:
y=df1["stars"]
y.head()

0    5
1    5
3    5
4    5
6    5
Name: stars, dtype: int64

In [33]:
type(x)

pandas.core.series.Series

In [34]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=10)

In [35]:
print(x_train.shape,x_test.shape)

(3064,) (1022,)


In [36]:
print(y_train.shape,y_test.shape)

(3064,) (1022,)
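The split sizes above follow from `train_test_split` holding out 25% of the rows by default when neither `test_size` nor `train_size` is given. Below is a minimal arithmetic sketch, assuming the test count is rounded up and the remainder goes to training. The helper name `default_split_sizes` is hypothetical and is not part of scikit-learn.

```python
import math

def default_split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test) for a 25% hold-out, rounding the test count up."""
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 3337 five-star + 749 one-star reviews = 4086 rows in the filtered DataFrame
print(default_split_sizes(3337 + 749))  # (3064, 1022)
```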


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [38]:
vect.fit(x_train)
x_train_dtm=vect.transform(x_train)

In [39]:
x_train_dtm

<3064x16757 sparse matrix of type '<class 'numpy.int64'>'
	with 237131 stored elements in Compressed Sparse Row format>

In [40]:
print(x_train_dtm)

  (0, 51)	1
  (0, 169)	1
  (0, 279)	1
  (0, 815)	2
  (0, 1201)	1
  (0, 1277)	1
  (0, 2781)	1
  (0, 3330)	1
  (0, 4386)	1
  (0, 5658)	1
  (0, 5743)	1
  (0, 6021)	2
  (0, 6105)	1
  (0, 6391)	1
  (0, 6422)	1
  (0, 6533)	1
  (0, 7007)	1
  (0, 7341)	1
  (0, 7853)	1
  (0, 7967)	1
  (0, 9123)	1
  (0, 9303)	1
  (0, 9964)	1
  (0, 10108)	1
  (0, 10284)	1
  :	:
  (3063, 11808)	2
  (3063, 11845)	1
  (3063, 11893)	1
  (3063, 12593)	1
  (3063, 12846)	1
  (3063, 14115)	1
  (3063, 14121)	1
  (3063, 14470)	1
  (3063, 14771)	1
  (3063, 14783)	1
  (3063, 14971)	1
  (3063, 14988)	3
  (3063, 14996)	2
  (3063, 15021)	2
  (3063, 15057)	1
  (3063, 15081)	1
  (3063, 15168)	1
  (3063, 15178)	2
  (3063, 15230)	1
  (3063, 15438)	1
  (3063, 15783)	1
  (3063, 15910)	1
  (3063, 16213)	1
  (3063, 16234)	1
  (3063, 16680)	1


In [43]:
x_train_dtm.shape

(3064, 16757)

In [41]:
x_test_dtm=vect.transform(x_test)
x_test_dtm

<1022x16757 sparse matrix of type '<class 'numpy.int64'>'
	with 77494 stored elements in Compressed Sparse Row format>

In [42]:
print(x_test_dtm)

  (0, 2385)	1
  (0, 6082)	1
  (0, 7220)	1
  (0, 7324)	1
  (0, 7640)	1
  (0, 8783)	1
  (0, 10371)	1
  (0, 11156)	1
  (0, 16506)	1
  (0, 16706)	1
  (1, 337)	1
  (1, 720)	1
  (1, 815)	4
  (1, 891)	1
  (1, 990)	1
  (1, 1101)	1
  (1, 1544)	2
  (1, 2183)	1
  (1, 2335)	1
  (1, 2873)	1
  (1, 2923)	1
  (1, 3250)	1
  (1, 3764)	1
  (1, 4408)	1
  (1, 5064)	1
  :	:
  (1021, 9583)	1
  (1021, 9728)	1
  (1021, 9765)	1
  (1021, 10271)	1
  (1021, 10341)	1
  (1021, 10926)	1
  (1021, 11167)	2
  (1021, 11317)	1
  (1021, 11831)	1
  (1021, 12625)	1
  (1021, 12832)	1
  (1021, 12915)	1
  (1021, 14079)	1
  (1021, 14983)	1
  (1021, 14988)	6
  (1021, 15021)	1
  (1021, 15047)	1
  (1021, 15178)	1
  (1021, 15475)	1
  (1021, 16001)	1
  (1021, 16005)	1
  (1021, 16237)	2
  (1021, 16292)	3
  (1021, 16306)	1
  (1021, 16462)	1


In [44]:
x_test_dtm.shape

(1022, 16757)
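Both matrices have 16757 columns because the vocabulary is learned from the training text only, and the testing text is then counted against that same vocabulary. The following is a hand-rolled sketch of that idea, not scikit-learn's implementation. It assumes lowercasing and tokens of two or more word characters (which mirrors CountVectorizer's default settings), and the helper names and example reviews are hypothetical.

```python
import re
from collections import Counter

# words of 2+ characters, mirroring CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Return one row of token counts per document; unseen tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["Great food, great service", "Awful service"]  # hypothetical reviews
test_docs = ["Great burgers, awful parking"]

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)

print(vocab)      # {'awful': 0, 'food': 1, 'great': 2, 'service': 3}
print(train_dtm)  # [[0, 1, 2, 1], [1, 0, 0, 1]]
print(test_dtm)   # [[1, 0, 1, 0]]
```

In the test row, "burgers" and "parking" are ignored because they never appeared in the training text, so the column count stays fixed.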

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [45]:
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()

In [46]:
# %time must be on the same line as the statement it times; on a line by itself it times nothing
%time nb.fit(x_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [47]:
y_pred_class=nb.predict(x_test_dtm)

In [108]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
accuracy=accuracy_score(y_test,y_pred_class)
accuracy

0.9324853228962818

In [49]:
cm=confusion_matrix(y_test,y_pred_class)
cm

array([[136,  54],
       [ 15, 817]], dtype=int64)
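The accuracy can be re-derived by hand from the confusion matrix. The sketch below copies the four counts printed above and assumes the usual layout: rows are actual classes, columns are predicted classes, both in sorted label order (1-star first, then 5-star). Treating 5-star as the "positive" class is an assumption made here for naming the cells.

```python
# Counts copied from the confusion matrix above
cm = [[136, 54],
      [15, 817]]

# rows = actual (1-star, 5-star), columns = predicted (1-star, 5-star)
(tn, fp), (fn, tp) = cm  # 5-star treated as the positive class (assumption)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total          # correct predictions on the diagonal
recall_5_star = tp / (tp + fn)        # share of actual 5-star reviews found
recall_1_star = tn / (tn + fp)        # share of actual 1-star reviews found

print(total, round(accuracy, 4), round(recall_5_star, 4), round(recall_1_star, 4))
```

The accuracy agrees with the value returned by `accuracy_score`, while the per-class recalls show that the model is much better at recognising 5-star reviews than 1-star reviews.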

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

Null accuracy: accuracy that could be achieved by always predicting the most frequent class

In [51]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()

5    832
1    190
Name: stars, dtype: int64

In [52]:
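# calculate null accuracy as the share of the most frequent class in the testing set
# (the shortcut max(y.mean(), 1 - y.mean()) only works for classes coded as 0/1, not 1/5)
round(y_test.value_counts().max() / len(y_test), 4)

0.8141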

## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
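To make the second hint concrete, here is a small sketch that assumes 5-star is the positive class, as the sorted label order in the confusion matrix (1 before 5) suggests. The label lists are hypothetical toy data and `count_errors` is an illustrative helper, not a scikit-learn function.

```python
def count_errors(y_true, y_pred, positive=5):
    """Count false positives and false negatives for a chosen positive label."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    return false_pos, false_neg

# hypothetical toy labels, not taken from the dataset
y_true = [1, 1, 5, 5, 5, 1]
y_pred = [5, 1, 5, 1, 5, 5]

print(count_errors(y_true, y_pred))  # (2, 1)
```

Under this assumption, a false positive is a 1-star review predicted as 5-star, and a false negative is a 5-star review predicted as 1-star.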

In [55]:
# false positives: the model predicted 5 stars, but the review is actually 1-star
x_test[y_test<y_pred_class].shape

(54,)

In [66]:
x_test[y_test<y_pred_class]

7803    I'm sad to report that we dined here for lunch...
8545    Entering the store is visually overwhelming. T...
9026    Quick spot for a mall-made mayo-based Californ...
71      Yikes, reading other reviews I realize my bad ...
4473    It is what you would expect from any themed pl...
6584    Jimmy Johns is cheaper and better ... The Capr...
3125    Let me first say if I could have given 0 stars...
8532    The bill was 150.00 and that was after a free ...
8514                                      Out of business
8569    This place has the oiliest food I've ever eate...
8068    I'm not sure what all of the buzz is about bec...
7131    This was the best camera/photography store in ...
575     Here's the 1. 2. 3...\n\n1. Great Food. I love...
6051    This place has really bad service and the food...
7956    What was Dunkin Donuts thinking when they took...
126     My friend kept telling me how good their lunch...
7845    They wouldn't honor pet co's online sale price...
190     What a

In [56]:
# false negatives: the model predicted 1 star, but the review is actually 5-star
x_test[y_test>y_pred_class].shape

(15,)

In [67]:
x_test[y_test>y_pred_class]

2306    There are certain people in your life that you...
5805    One of our Lexus car keys/key fob was cracked ...
3149    I was told to see Greg after a local shop diag...
6334    I came here today for a manicure and pedicure....
696     This is the only auto repair place I've ever s...
4034    "Fine dining" is not just a setting.  it isn't...
8053    My wife called to have our vent cleaned since ...
3635    I love Coach.\nStylish and trendy without spen...
7903    First, I'm sorry this review is lengthy, but i...
241     I was sad to come back to lai lai's and they n...
8571    If it's an emergency, they will generally see ...
5332    I had a great experience.  Nice people.   My m...
5565    I`ve had work done by this shop a few times th...
5923    My daughter's 128i BMW was horribly keyed this...
354     We happened upon this location when meeting a ...
Name: text, dtype: object

The model predicted 5 stars for 54 reviews that are actually 1-star reviews (false positives).

The model predicted 1 star for 15 reviews that are actually 5-star reviews (false negatives).

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
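One way to turn those counts into a "predictiveness" score is to compare how often a token appears per review in each class. The sketch below uses hypothetical counts and a hypothetical helper name. It adds 1 to each raw count so that a token absent from one class does not cause a division by zero.

```python
def smoothed_ratio(count_5_star, count_1_star, n_5_star, n_1_star):
    """Ratio of per-class token frequencies after adding 1 to each raw count."""
    freq_5 = (count_5_star + 1) / n_5_star  # token occurrences per 5-star review
    freq_1 = (count_1_star + 1) / n_1_star  # token occurrences per 1-star review
    return freq_5 / freq_1

# hypothetical class sizes: 256 five-star and 128 one-star training reviews
print(smoothed_ratio(count_5_star=63, count_1_star=7, n_5_star=256, n_1_star=128))  # 4.0
print(smoothed_ratio(count_5_star=15, count_1_star=0, n_5_star=256, n_1_star=128))  # 8.0
```

A ratio well above 1 marks a token as more typical of 5-star reviews, and a ratio well below 1 marks it as more typical of 1-star reviews.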

In [57]:
x_train_tokens=vect.get_feature_names()
len(x_train_tokens)

16757

In [68]:
# nb.classes_ is [1, 5], so row 0 of feature_count_ holds the 1-star token counts
star1_token_count =nb.feature_count_[0,:]
star1_token_count

array([23.,  2.,  3., ...,  0.,  0.,  0.])

In [69]:
# row 1 of feature_count_ holds the 5-star token counts
star5_token_count =nb.feature_count_[1,:]
star5_token_count

array([36.,  4.,  1., ...,  4.,  1.,  1.])

In [75]:
tokens = pd.DataFrame({'token':x_train_tokens, '1_star':star1_token_count, '5_star':star5_token_count}).set_index('token')
tokens.head(10)

Unnamed: 0_level_0,1_star,5_star
token,Unnamed: 1_level_1,Unnamed: 2_level_1
00,23.0,36.0
000,2.0,4.0
00am,3.0,1.0
00pm,1.0,5.0
01,1.0,2.0
03,1.0,0.0
03342,1.0,0.0
04,0.0,2.0
05,1.0,2.0
06,0.0,2.0


In [76]:
tokens.sample(10)

Unnamed: 0_level_0,1_star,5_star
token,Unnamed: 1_level_1,Unnamed: 2_level_1
chickafila,0.0,1.0
physics,0.0,1.0
idea,14.0,48.0
mechanic,1.0,4.0
tailor,1.0,1.0
weigh,0.0,3.0
bundled,1.0,0.0
audition,1.0,2.0
arty,0.0,1.0
marriott,0.0,2.0


In [78]:
# add 1 to 1_star and 5_star counts to avoid dividing by 0
tokens['1_star'] = tokens['1_star'] + 1
tokens['5_star'] = tokens['5_star'] + 1
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,1_star,5_star
token,Unnamed: 1_level_1,Unnamed: 2_level_1
edwards,2.0,4.0
cones,2.0,5.0
pea,2.0,11.0
months,25.0,53.0
skeptic,2.0,3.0


In [79]:
# convert the 1_star and 5_star counts into frequencies
# (class_count_[0] is the number of 1-star reviews, class_count_[1] the number of 5-star reviews)
tokens['1_star'] = tokens['1_star'] / nb.class_count_[0]
tokens['5_star'] = tokens['5_star'] / nb.class_count_[1]
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,1_star,5_star
token,Unnamed: 1_level_1,Unnamed: 2_level_1
edwards,0.003578,0.001597
cones,0.003578,0.001996
pea,0.003578,0.004391
months,0.044723,0.021158
skeptic,0.003578,0.001198


In [80]:
# calculate the ratio of 5_star-to-1_star for each token
tokens['star_ratio'] = tokens['5_star'] / tokens['1_star']
tokens.sample(5, random_state=6)

Unnamed: 0_level_0,1_star,5_star,star_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
edwards,0.003578,0.001597,0.446307
cones,0.003578,0.001996,0.557884
pea,0.003578,0.004391,1.227345
months,0.044723,0.021158,0.473086
skeptic,0.003578,0.001198,0.334731


In [82]:
# highest star_ratio: the 10 tokens most predictive of 5-star reviews
tokens.sort_values('star_ratio', ascending=False).head(10)

Unnamed: 0_level_0,1_star,5_star,star_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
fantastic,0.005367,0.082635,15.397605
perfect,0.007156,0.094611,13.221856
favorite,0.010733,0.127345,11.864338
amazing,0.023256,0.183633,7.896208
art,0.003578,0.023154,6.471457
yum,0.003578,0.022355,6.248303
loved,0.010733,0.062275,5.801996
fabulous,0.005367,0.030339,5.653227
awesome,0.019678,0.105788,5.375975
great,0.116279,0.606786,5.218363


In [83]:
# lowest star_ratio: the 10 tokens most predictive of 1-star reviews
tokens.sort_values('star_ratio', ascending=False).tail(10)

Unnamed: 0_level_0,1_star,5_star,star_ratio
token,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
worst,0.112701,0.00479,0.042505
awful,0.076923,0.003194,0.041517
unacceptable,0.019678,0.000798,0.040573
worse,0.060823,0.002395,0.03938
rude,0.093023,0.003593,0.038623
remove,0.021467,0.000798,0.037192
disgusting,0.041145,0.001198,0.029107
horrible,0.128801,0.003593,0.027894
yuck,0.030411,0.000798,0.026253
staffperson,0.0322,0.000798,0.024795


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [84]:
df.stars.value_counts()

4    3526
5    3337
3    1461
2     927
1     749
Name: stars, dtype: int64

In [87]:
x1=df["text"]
x1.head()

0    My wife took me here on my birthday for breakf...
1    I have no idea why some people give bad review...
2    love the gyro plate. Rice is so good and I als...
3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
4    General Manager Scott Petello is a good egg!!!...
Name: text, dtype: object

In [88]:
y1=df["stars"]
y1.head()

0    5
1    5
2    4
3    5
4    5
Name: stars, dtype: int64

In [89]:
from sklearn.model_selection import train_test_split
x1_train,x1_test,y1_train,y1_test=train_test_split(x1,y1,random_state=10)

In [90]:
print(x1_train.shape,x1_test.shape)

(7500,) (2500,)


In [91]:
print(y1_train.shape,y1_test.shape)

(7500,) (2500,)


In [92]:
vect.fit(x1_train)
x1_train_dtm=vect.transform(x1_train)

In [93]:
x1_train_dtm

<7500x25668 sparse matrix of type '<class 'numpy.int64'>'
	with 621370 stored elements in Compressed Sparse Row format>

In [94]:
x1_train_dtm.shape

(7500, 25668)

In [95]:
x1_test_dtm=vect.transform(x1_test)
x1_test_dtm

<2500x25668 sparse matrix of type '<class 'numpy.int64'>'
	with 201877 stored elements in Compressed Sparse Row format>

In [96]:
x1_test_dtm.shape

(2500, 25668)

In [97]:
# %time must be on the same line as the statement it times; on a line by itself it times nothing
%time nb.fit(x1_train_dtm,y1_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [98]:
y1_pred_class=nb.predict(x1_test_dtm)

In [99]:
accuracy1=accuracy_score(y1_test,y1_pred_class)
accuracy1

0.4964

In [102]:
cm1=confusion_matrix(y1_test,y1_pred_class)
cm1

array([[ 59,  29,   8,  51,  26],
       [ 12,  18,  39, 125,  23],
       [  6,  11,  35, 264,  36],
       [  5,   2,  16, 648, 219],
       [  5,   2,   6, 374, 481]], dtype=int64)
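The task also asks for a comparison with the null accuracy. The sketch below derives it, together with per-class precision and recall, directly from the confusion matrix printed above. It assumes rows are actual classes and columns are predicted classes, in label order 1 through 5. The variable names are illustrative.

```python
labels = [1, 2, 3, 4, 5]
# counts copied from the confusion matrix above
cm = [[59, 29, 8, 51, 26],
      [12, 18, 39, 125, 23],
      [6, 11, 35, 264, 36],
      [5, 2, 16, 648, 219],
      [5, 2, 6, 374, 481]]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]                 # actual reviews per class (row sums)
predicted = [sum(col) for col in zip(*cm)]         # predicted reviews per class (column sums)
correct = [cm[i][i] for i in range(len(labels))]   # diagonal = true positives per class

accuracy = sum(correct) / total
null_accuracy = max(support) / total               # always predict the most frequent class

for label, tp, n_pred, n_true in zip(labels, correct, predicted, support):
    print(label, "precision", round(tp / n_pred, 2), "recall", round(tp / n_true, 2))
print("accuracy", accuracy, "null accuracy", null_accuracy)
```

With these counts, the accuracy of 0.4964 is above the null accuracy of 0.356, and the column sums show that the model predicts 4 stars for 1462 of the 2500 testing reviews.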

In [111]:
cr=classification_report(y1_test,y1_pred_class)
print(cr)

              precision    recall  f1-score   support

           1       0.68      0.34      0.45       173
           2       0.29      0.08      0.13       217
           3       0.34      0.10      0.15       352
           4       0.44      0.73      0.55       890
           5       0.61      0.55      0.58       868

   micro avg       0.50      0.50      0.50      2500
   macro avg       0.47      0.36      0.37      2500
weighted avg       0.49      0.50      0.46      2500



**Formulas (for each class k):**  
recall_k = TP_k / (TP_k + FN_k)  
precision_k = TP_k / (TP_k + FP_k)

**For star 1:**  
Precision is 68%: of the reviews predicted as 1-star, 68% are actually 1-star. Recall is 34%: only 34% of the actual 1-star reviews were found.

**For star 2:**  
Precision is 29%: of the reviews predicted as 2-star, 29% are actually 2-star. Recall is 8%.

**For star 3:**  
Precision is 34%: of the reviews predicted as 3-star, 34% are actually 3-star. Recall is 10%.

**For star 4:**  
Precision is 44%: of the reviews predicted as 4-star, 44% are actually 4-star. Recall is 73%.

**For star 5:**  
Precision is 61%: of the reviews predicted as 5-star, 61% are actually 5-star. Recall is 55%.