# Homework with Yelp reviews data

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the course repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

In [37]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report

In [2]:
pd.options.display.max_colwidth = 500

## Task 1

Read **`yelp.csv`** into a Pandas DataFrame and examine it.

In [3]:
path = r"C:\Users\Mark\Anaconda3\envs\Dataschool\work\data\yelp.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n\r\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ing...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\r\n\r\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We we...",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also dig their candy selection :),review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\r\n\r\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\r\n\r\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\r\n\r\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
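
Following the tip from the introduction, a quick sanity check after loading might look like the sketch below. It runs on a tiny made-up DataFrame (`toy` and its rows are hypothetical stand-ins, not the real `yelp.csv`); the same three checks can be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for yelp.csv: same column names, made-up rows
toy = pd.DataFrame({
    'stars': [5, 5, 4, 1, 3],
    'text': ['great', 'loved it', 'good', 'awful', 'ok'],
})

# Shape: (number of reviews, number of columns)
n_rows, n_cols = toy.shape

# How many reviews there are for each star rating, ordered by rating
star_counts = toy.stars.value_counts().sort_index()

# Missing review text would break the vectorizer later, so count it now
missing_text = toy.text.isnull().sum()

print(n_rows, n_cols)
print(star_counts)
print(missing_text)
```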


## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](https://www.youtube.com/watch?v=YPItfQ87qjM&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=9) explains how to do this.

In [4]:
df_5_1 = df[(df.stars == 5) | (df.stars == 1)]
df_5_1.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\r\n\r\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ing...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\r\n\r\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We we...",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\r\n\r\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\r\n\r\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\r\n\r\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,"Drop what you're doing and drive here. After I ate here I had to go back the next day for more. The food is that good.\r\n\r\nThis cute little green building may have gone competely unoticed if I hadn't been driving down Palm Rd to avoid construction. While waiting to turn onto 16th Street the ""Grand Opening"" sign caught my eye and my little yelping soul leaped for joy! A new place to try!\r\n\r\nIt looked desolate from the outside but when I opened the door I was put at easy by the decor...",review,wFweIWhv2fREZV_dYkz_1g,7,7,4
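
As an aside, the same filter can be written with `Series.isin`, which scales better when more than two values are kept. The sketch below uses a small made-up DataFrame (`toy` is hypothetical) to show that both forms select the same rows, and checks that only 1-star and 5-star rows remain.

```python
import pandas as pd

# Hypothetical stand-in for the full DataFrame
toy = pd.DataFrame({
    'stars': [5, 4, 1, 3, 5, 1, 2],
    'text': list('abcdefg'),
})

# Two boolean conditions combined with | (element-wise OR)
by_or = toy[(toy.stars == 5) | (toy.stars == 1)]

# Equivalent membership test
by_isin = toy[toy.stars.isin([1, 5])]

print(by_isin.shape)
print(sorted(by_isin.stars.unique()))
```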


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a Pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.

In [5]:
X = df_5_1.text
y = df_5_1.stars

X_train, X_test, y_train, y_test = train_test_split(X,y)
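
Note that `train_test_split` shuffles randomly, so each run produces a different split and slightly different scores. A minimal sketch of a reproducible split is shown below on made-up data; the `random_state=1` value is arbitrary, and `stratify` is an optional extra that keeps the class proportions equal in both sets. Applying either option to the cell above would change the outputs shown in the rest of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 16 five-star and 4 one-star reviews
toy_X = pd.Series(['review %d' % i for i in range(20)])
toy_y = pd.Series([5] * 16 + [1] * 4)

# Default test size is 25%, so 15 training rows and 5 testing rows
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, random_state=1, stratify=toy_y)

print(tX_train.shape, tX_test.shape)
print(ty_test.value_counts())
```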

## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [6]:
vect = CountVectorizer()

In [7]:
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
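
To see what a document-term matrix looks like, here is a small sketch on three made-up sentences (all `toy_` names and the sentences themselves are hypothetical). It also shows why the testing data is only transformed, never fitted: words that were not seen during training are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ['great food great service', 'terrible service']
toy_test = ['great pizza']

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learn the vocabulary, then count
toy_test_dtm = toy_vect.transform(toy_test)        # count using the training vocabulary only

# Columns are the vocabulary in alphabetical order
print(sorted(toy_vect.vocabulary_))
# One row per document, one column per token
print(toy_train_dtm.toarray())
# 'pizza' is not in the training vocabulary, so only 'great' is counted
print(toy_test_dtm.toarray())
```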

## Task 5

Use Multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [8]:
nb = MultinomialNB()

In [9]:
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
y_pred_class = nb.predict(X_test_dtm)

In [11]:
metrics.accuracy_score(y_test, y_pred_class)

0.93052837573385516

In [12]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[124  56]
 [ 15 827]]

## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!


In [13]:
null_accuracy = y_test.value_counts().max() / len(y_test)
print('Null accuracy =', null_accuracy)

Null accuracy = 0.823874755382
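
Both numbers can also be read off the confusion matrix printed for this split, which is a useful cross-check. The sketch below hard-codes that matrix (rows are true classes, columns are predicted classes): accuracy is the diagonal divided by the total, and null accuracy is the largest row sum divided by the total.

```python
import numpy as np

# Confusion matrix from this notebook: rows = true (1-star, 5-star), columns = predicted
cm = np.array([[124, 56],
               [15, 827]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Row sums are the true class counts; always predicting the largest class gets that share right
null_acc = cm.sum(axis=1).max() / cm.sum()

print(accuracy)
print(null_acc)
```

The model's accuracy is well above the null accuracy, so it is doing more than guessing the majority class.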


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?

In [14]:
test_pred = pd.DataFrame({'message': X_test, 'rating':y_test, 'predicted': y_pred_class})
test_pred.head()

Unnamed: 0,message,predicted,rating
2285,We purchased a Groupon deal and went and tried this restaurant . The food is excellent and the service is very good . If you are wanting real Texas BBQ then this is the place to go!,5,5
803,"The Thai menu is totally amazing! I had the Panang; and the coconut curry was to die for. I have been mistakenly driving past this wonderful place for a year. Do yourself a favor, try Na Rai, you will be so pleased you did. \r\nThey have a lunch menu as well.",5,5
3497,"This place was the worst place to EVER live! I lived here from Aug 2010 to may 2011, i was always in the office see when the grates would be fixed?? the office always gave me the run around very rude. You would think after the gun shoots, assaults, and car break ins that they would fix the gates but nope and to this day they still aren't fixed. When we moved in they said they would be fixed in a week and they lied. Also the gym is huge but NONE of the equipment works! living here was SO gro...",1,1
666,"I waited 45min and ended up with a tiny gross little serving of tortilla with vegetable scraps thrown in it, served in a aluminum take-out box. The tables are folding tables, like the kind you play beer pong on. And they have pictures of kids and babies on the walls and signs asking people to send in pictures of their kids for the walls too, which has nothing to do with Asian/Mexican cuisine, it creeped me out.",1,1
9688,Yummy!,5,5


In [15]:
false_pred = test_pred[test_pred.rating != test_pred.predicted]
false_pred.head(20)

Unnamed: 0,message,predicted,rating
2765,"Went last week, and ordered a dozen variety. I didn't care for any of them, very dry, the frosting wasn't very good,even my 9 yr old daughter spit her cupcake out!",5,1
3082,"Currently having a liquidation sale, but it's not really worth the trip. I found that most books were cheaper at Amazon then the sale price. A few were even cheaper at Barnes and Noble then the liquidation sale price! Which just made me say to myself well that's probably why you're going out of business try some competitive pricing! The place is an absolute disaster they've moved a lot so the sales staff doesn't even know where everything is. Also the place just smells dirty and old, it...",5,1
8519,"Overcrowded, sprawling mess of a mall (and I normally like malls). Traffic and parking are almost as bad as at Tempe Marketplace. The confusing layout makes it hard to find specific stores (even with the help of the directories).",5,1
212,"I had not been to an Oregano's in like 10 years. They seems to be popping up all over here in AZ now so my buddy was in from out of town and he wanted to try, so why not.\r\n\r\nI'll tell you ""Why NOT""... we has two thin crust pizzas... and I even hate to call them that. More like... crackers with melted cheese and some toppings.\r\n\r\nThat was some of the worst excuse for pizza I've ever had. Most of the fast-food delivery pizza I've had beats it -- and I do not like, at all, delivery pizz...",5,1
4374,"Cadillac Ranch looked really awesome from the outside, this nice red cadillac out front, nice patio outside, and an enclosed bar and grill patio on the side. Upon entering it was pretty dead, well, it was a Monday afternoon after all.\r\n\r\nI ordered the Southwestern Country Fried Steak, huge plate of food that came with your choice of House Salad or Ceaser Salad. I dove into the mashed potatos, which upon their many options of how you want your potatos to taste (garlic, cheese, horshradish...",5,1
8083,"After unsuccessfully attempting a walk in gel at the posh looking ""Sundrops"" nail salon with my two year old in tow, I was drawn to a half deserted strip mall just down the street and a small space in the center labeled Andy's Nails. I Yelped the place and saw the decent reviews so I walked in and was immediately attended to by the friendliest faces I've seen in a while. Despite the fact that I was short one babysitter, they began work on my nails right away and when my toddler got restless,...",1,5
119,"Take your money elsewhere, unless you've got kids. I really try to like this place. A family member signed me up for the discount card, so I've been going more often, but I just don't love it. It's simply ok, but the prices are outrageous. And the sounds and animatronics are a huge distraction from the so-so food. The cocktails are alright, but, again, the price is not right. The ony thing fun about the place for an adult is the gift shop and the light-up cocktail glasses (which cost e...",5,1
9953,"""Hipster,Trendy"" ????-I think NOT !!!! Very disappointing......weird crowd ( older men on the prowl? ) , and unfriendly bartenders w/ lots of attitude. I've given this a few tries thinking I just hit it on a bad night , but won't be going back. So many other great places in Scottsdale to visit !",5,1
165,This place is not there anymore.,5,1
4362,"I eat a lot of Asian food (of different sorts). Asian is actually my go-to cuisine since I have a lot of food allergies. This place deserves a thumbs-down, unless you love Americanized Asian. The food was bland, breaded and just not worth any money. Next time I'm in Phoenix, I definitely won't eat here.",5,1
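
The table above mixes both kinds of error. scikit-learn sorts the class labels, so with labels 1 and 5 the second label (5) plays the role of the "positive class": a false positive is a 1-star review predicted as 5-star, and a false negative is a 5-star review predicted as 1-star. A minimal sketch of separating the two, on a made-up `toy_pred` DataFrame with the same column names as `test_pred`:

```python
import pandas as pd

# Hypothetical predictions, same columns as test_pred
toy_pred = pd.DataFrame({
    'message': ['loved it', 'not good at all', 'awful', 'waited, but worth it'],
    'rating': [5, 1, 1, 5],
    'predicted': [5, 5, 1, 1],
})

# False positives: actually 1-star, predicted 5-star
false_positives = toy_pred[(toy_pred.rating == 1) & (toy_pred.predicted == 5)]

# False negatives: actually 5-star, predicted 1-star
false_negatives = toy_pred[(toy_pred.rating == 5) & (toy_pred.predicted == 1)]

print(false_positives.message.tolist())
print(false_negatives.message.tolist())
```

One plausible reason for such errors is that Naive Bayes scores each token independently and ignores word order, so a phrase like "not good" still contributes the positive token "good".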


## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [16]:
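# nb.classes_ is sorted, so row 0 of feature_count_ is the 1-star class ('low') and row 1 is the 5-star class ('high')
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tokens = pd.DataFrame({'feature': vect.get_feature_names_out(), 'low': nb.feature_count_[0, :], 'high': nb.feature_count_[1, :]})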
tokens.head()

Unnamed: 0,feature,high,low
0,00,38.0,34.0
1,000,9.0,5.0
2,00a,0.0,1.0
3,00am,2.0,3.0
4,00pm,6.0,1.0


In [17]:
tokens.nlargest(10, columns= 'high')

Unnamed: 0,feature,high,low
15026,the,14023.0,4398.0
809,and,10346.0,2719.0
15219,to,6433.0,2443.0
10288,of,4586.0,1351.0
8003,is,4301.0,820.0
8018,it,4218.0,1373.0
16257,was,3528.0,1506.0
7665,in,3444.0,981.0
6054,for,3299.0,992.0
16736,you,2911.0,674.0


In [18]:
tokens.nlargest(10, columns= 'low')

Unnamed: 0,feature,high,low
15026,the,14023.0,4398.0
809,and,10346.0,2719.0
15219,to,6433.0,2443.0
16257,was,3528.0,1506.0
8018,it,4218.0,1373.0
10288,of,4586.0,1351.0
15021,that,2541.0,1018.0
6054,for,3299.0,992.0
7665,in,3444.0,981.0
9888,my,2651.0,860.0
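
Sorting by raw counts, as in the two tables above, mostly surfaces common words ("the", "and", "to") that are frequent in both classes, so it does not yet answer which tokens are most *predictive*. One possible approach, following the hint, is to convert the counts into per-class frequencies and rank tokens by the ratio between classes. The sketch below uses made-up counts (`features`, `feature_count`, and `class_count` are hypothetical stand-ins for `vect.get_feature_names_out()`, `nb.feature_count_`, and `nb.class_count_`).

```python
import numpy as np
import pandas as pd

# Hypothetical counts: row 0 = 1-star class, row 1 = 5-star class
features = ['the', 'amazing', 'rude', 'food']
feature_count = np.array([[50., 0., 8., 10.],
                          [200., 30., 1., 45.]])
class_count = np.array([10., 40.])  # number of reviews in each class

toy_tokens = pd.DataFrame(
    {'low': feature_count[0], 'high': feature_count[1]}, index=features)

# Add 1 to every count to avoid dividing by zero, then normalize by class size
toy_tokens['low'] = (toy_tokens.low + 1) / class_count[0]
toy_tokens['high'] = (toy_tokens.high + 1) / class_count[1]

# Ratio > 1 leans 5-star, ratio < 1 leans 1-star
toy_tokens['high_ratio'] = toy_tokens.high / toy_tokens.low

ranked = toy_tokens.sort_values('high_ratio', ascending=False)
print(ranked)
```

With real data, `ranked.head(10)` would list the tokens most predictive of 5-star reviews and `ranked.tail(10)` those most predictive of 1-star reviews. Note how "the" ends up with a ratio close to 1 despite having the largest raw counts.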


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [19]:
df.columns

Index(['business_id', 'date', 'review_id', 'stars', 'text', 'type', 'user_id',
       'cool', 'useful', 'funny'],
      dtype='object')

In [20]:
X5 = df['text']
y5 = df['stars']

In [21]:
X_train5, X_test5, y_train5, y_test5 = train_test_split(X5,y5)

In [22]:
vect5 = CountVectorizer()

In [29]:
X_train5_dtm = vect5.fit_transform(X_train5)
X_test5_dtm = vect5.transform(X_test5)

In [25]:
nb5 = MultinomialNB()

In [30]:
nb5.fit(X_train5_dtm,y_train5)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
y_pred5_class = nb5.predict(X_test5_dtm)

In [32]:
metrics.accuracy_score(y_test5,y_pred5_class)

0.48120000000000002

In [36]:
null_accuracy5 = y_test5.value_counts().max() / len(y_test5)
null_accuracy5

0.34760000000000002

In [33]:
metrics.confusion_matrix(y_test5,y_pred5_class)

array([[ 71,   8,  19,  50,  16],
       [ 24,  11,  50, 157,  17],
       [  6,   4,  42, 273,  38],
       [  5,   3,  23, 636, 202],
       [  5,   0,   4, 393, 443]])
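
To comment on these results: the testing accuracy (about 0.48) is clearly above the null accuracy (about 0.35), but far below the binary case. The confusion matrix suggests why: most mistakes land in a neighboring class, especially between 4 and 5 stars. The sketch below hard-codes the matrix printed above and measures what share of the errors are off by exactly one star.

```python
import numpy as np

# Confusion matrix from this notebook: rows = true stars 1..5, columns = predicted stars 1..5
cm5 = np.array([
    [71, 8, 19, 50, 16],
    [24, 11, 50, 157, 17],
    [6, 4, 42, 273, 38],
    [5, 3, 23, 636, 202],
    [5, 0, 4, 393, 443],
])

# Everything off the main diagonal is an error
errors = cm5.sum() - np.trace(cm5)

# The diagonals directly above and below the main one hold the off-by-one-star errors
off_by_one = np.trace(cm5, offset=1) + np.trace(cm5, offset=-1)

share_off_by_one = off_by_one / errors
print(errors, off_by_one, share_off_by_one)
```

Roughly three quarters of the errors miss by a single star, which is a much milder failure than the accuracy figure alone suggests.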

In [46]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vect5.get_feature_names_out())



In [47]:
print(classification_report(y_test5, y_pred5_class))

             precision    recall  f1-score   support

          1       0.64      0.43      0.52       164
          2       0.42      0.04      0.08       259
          3       0.30      0.12      0.17       363
          4       0.42      0.73      0.53       869
          5       0.62      0.52      0.57       845

avg / total       0.49      0.48      0.44      2500
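
As the task suggests, these metrics can be calculated manually from the confusion matrix. In the sketch below (using the matrix printed earlier in this task), precision for a class is its diagonal entry divided by its column sum (everything predicted as that class), and recall is the diagonal entry divided by its row sum (everything that truly belongs to that class, which is the "support" column).

```python
import numpy as np

# Confusion matrix from this notebook: rows = true stars 1..5, columns = predicted stars 1..5
cm5 = np.array([
    [71, 8, 19, 50, 16],
    [24, 11, 50, 157, 17],
    [6, 4, 42, 273, 38],
    [5, 3, 23, 636, 202],
    [5, 0, 4, 393, 443],
])

true_pos = np.diag(cm5)
precision = true_pos / cm5.sum(axis=0)  # column sums = number predicted per class
recall = true_pos / cm5.sum(axis=1)     # row sums = number of true members per class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

for star, p, r, f in zip(range(1, 6), precision, recall, f1):
    print(star, round(p, 2), round(r, 2), round(f, 2))
```

The values match the report: recall is very low for 2-star and 3-star reviews because most of them are predicted as 4-star, while 4-star has high recall but mediocre precision for the same reason.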

