In [1]:
# usual imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


%matplotlib notebook

from sklearn.model_selection import train_test_split

# Two text-vectorization tools: Bag of Words (CountVectorizer) and TF-IDF (TfidfVectorizer)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
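
To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The three sentences and the variable names are made up for illustration and are not part of the Yelp data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration (not taken from the Yelp data)
docs = ["good food good service", "bad food", "good place"]

count_vect = CountVectorizer()
tfidf_vect = TfidfVectorizer()

counts = count_vect.fit_transform(docs)  # raw term counts per document
tfidf = tfidf_vect.fit_transform(docs)   # counts reweighted by inverse document frequency

# Columns follow the alphabetically sorted vocabulary
vocab = sorted(count_vect.vocabulary_)
print(vocab)
print(counts.toarray())

# With default settings, each TF-IDF row is scaled to unit L2 norm
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(tfidf.toarray().round(2))
print(row_norms)
```

Both vectorizers build the same vocabulary here; only the values stored in the matrix differ.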

#### Read the yelp_labelled data, splitting columns on the tab character (\t)

In [2]:
url = "https://raw.githubusercontent.com/ga-students/DS-SF-24/master/Data/yelp_labelled.txt"
Yelp_data = pd.read_csv(url , sep = "\t", names = ['text','sentiment'])
Yelp_data.head()

Unnamed: 0,text,sentiment
0,Wow... Loved this place.,1.0
1,I learned that if an electric slicer is used t...,
2,But they don't clean the chiles?,
3,Crust is not good.,0.0
4,Not tasty and the texture was just nasty.,0.0


#### Put your Yelp data into a DataFrame and drop NA values.

In [3]:
Yelp_data.dropna(inplace = True)
Yelp_data.head()

Unnamed: 0,text,sentiment
0,Wow... Loved this place.,1
3,Crust is not good.,0
4,Not tasty and the texture was just nasty.,0
10,Stopped by during the late May bank holiday of...,1
11,The selection on the menu was great and so wer...,1


In [4]:
Yelp_data.describe()

Unnamed: 0,sentiment
count,1000.0
mean,0.5
std,0.50025
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


#### Using Pipeline, RandomForestClassifier, and GridSearchCV, play with min_df and max_df on your Yelp data. Split your data into training and test sets. You can use either CountVectorizer or TfidfVectorizer.

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Hold out 20% of the reviews as a test set
X_train, X_test, y_train, y_test = train_test_split(Yelp_data['text'], Yelp_data['sentiment'], test_size=0.2)

In [7]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('clf', RandomForestClassifier())])

In [8]:
# Integer min_df/max_df values are absolute document counts; floats would be proportions
parameters = {'vect__min_df':[1,2,3],
              'vect__max_df':[5,10,100,200,500,1000],
              'clf__n_estimators':[1000]}

gs_clf = GridSearchCV(text_clf, parameters, cv = 10, n_jobs = -1)
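
As a rough illustration of what the grid above is varying, the sketch below shows how `min_df` and `max_df` prune a vocabulary. The four-sentence corpus and the `vocabulary` helper are hypothetical and exist only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
# Document frequencies: good=3, food=2, bad=1, place=1, service=1
docs = ["good food", "good service", "good place", "bad food"]

def vocabulary(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    vect = CountVectorizer(**kwargs)
    vect.fit(docs)
    return sorted(vect.vocabulary_)

print(vocabulary())                    # no pruning
print(vocabulary(min_df=2))            # keep terms found in at least 2 documents
print(vocabulary(max_df=2))            # keep terms found in at most 2 documents
print(vocabulary(min_df=2, max_df=2))  # both limits applied together
```

Because the notebook's `max_df` values are integers, they are read as document counts. With 1,000 labelled rows and a 20% test split, the training set has about 800 reviews, so `max_df = 1000` should never remove a term.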

In [9]:
fit_grid = gs_clf.fit(X_train, y_train)
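
Once a grid search has been fitted, the winning combination can be read from `best_params_` and its mean cross-validated score from `best_score_`. The sketch below shows this on a tiny made-up corpus with a deliberately small forest so it runs quickly; `texts`, `labels`, `toy_clf`, and `toy_grid` are illustrative names, not objects from this notebook.

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration (not the Yelp reviews)
texts = ["good food", "great service", "good place", "great food",
         "bad food", "terrible service", "bad place", "terrible food"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# Same pipeline shape as above, with a small forest to keep the example fast
toy_clf = Pipeline([("vect", CountVectorizer()),
                    ("clf", RandomForestClassifier(n_estimators=10, random_state=0))])

toy_grid = GridSearchCV(toy_clf, {"vect__min_df": [1, 2]}, cv=3)
toy_grid.fit(texts, labels)

print(toy_grid.best_params_)  # parameter combination with the best mean CV score
print(toy_grid.best_score_)   # that mean cross-validated accuracy
```

The same two attributes should be available on `fit_grid` after the fit above.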


#### How much test error do you get with the best estimator found above?

In [10]:
# score() reports mean accuracy on the test set; test error is 1 - accuracy
fit_grid.score(X_test, y_test)

0.78500000000000003
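
The value above is an accuracy, so the corresponding test error is 1 - 0.785 = 0.215. The sketch below shows the same conversion and a per-class breakdown with `classification_report`. The label arrays are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true labels and predictions, invented for illustration
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # fraction of correct predictions
test_error = 1 - accuracy                  # fraction of wrong predictions
print(accuracy, test_error)

# Precision, recall and F1 for each class
print(classification_report(y_true, y_pred))
```

In the notebook itself, the equivalent calls would presumably take `y_test` and `fit_grid.predict(X_test)`.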

#### Look over the first 5 X_test instances and compare the predicted category with the actual review sentence.

In [11]:
# Predict once instead of re-running the model on every loop iteration
predictions = fit_grid.predict(X_test)
for i in range(5):
    print(predictions[i])
    print(X_test.values[i])

1.0
definitely will come back here again.
0.0
I mean really, how do you get so famous for your fish and chips when it's so terrible!?!
0.0
My fella got the huevos rancheros and they didn't look too appealing.
0.0
The cashier had no care what so ever on what I had to say it still ended up being wayyy overpriced.
1.0
Good prices.


## Bonus Question: Can you find the test instances that are correctly classified and those that are misclassified?

In [12]:
# Misclassified instances
# Predict once, then compare each prediction with the true label
predictions = fit_grid.predict(X_test)
for i in range(len(y_test)):
    if predictions[i] != y_test.values[i]:
        print(X_test.values[i])


An absolute must visit!
I *heart* this place.
Main thing I didn't enjoy is that the crowd is of older crowd, around mid 30s and up.
The Veggitarian platter is out of this world!
The one down note is the ventilation could use some upgrading.
The Greek dressing was very creamy and flavorful.
The grilled chicken was so tender and yellow from the saffron seasoning.
The waiter wasn't helpful or friendly and rarely checked on us.
This is a disgrace.
These are the nicest restaurant owners I've ever come across.
What I really like there is the crepe station.
One of the few places in Phoenix that I would definately go back to again .
You can't beat that.
So they performed.
Hands down my favorite Italian restaurant!
So flavorful and has just the perfect amount of heat.
Omelets are to die for!
It was attached to a gas station, and that is rarely a good sign.
Level 5 spicy was perfect, where spice didn't over-whelm the soup.
First time going but I think I will quickly become a regular.
I would rec

In [13]:
# Correctly classified instances
# Predict once, then compare each prediction with the true label
predictions = fit_grid.predict(X_test)
for i in range(len(y_test)):
    if predictions[i] == y_test.values[i]:
        print(X_test.values[i])

definitely will come back here again.
I mean really, how do you get so famous for your fish and chips when it's so terrible!?!
My fella got the huevos rancheros and they didn't look too appealing.
The cashier had no care what so ever on what I had to say it still ended up being wayyy overpriced.
Good prices.
The servers are not pleasant to deal with and they don't always honor Pizza Hut coupons.
All in all, Ha Long Bay was a bit of a flop.
After two I felt disgusting.
The sergeant pepper beef sandwich with auju sauce is an excellent sandwich as well.
Their menu is diverse, and reasonably priced.
The selection was probably the worst I've seen in Vegas.....there was none.
Great Subway, in fact it's so good when you come here every other Subway will not meet your expectations.
The burger is good beef, cooked just right.
Definitely worth venturing off the strip for the pork belly, will return next time I'm in Vegas.
This place is a jewel in Las Vegas, and exactly what I've been hoping to f
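
The two loops above can also be written with a single boolean mask, which separates both groups in one pass. This is a minimal sketch on made-up stand-ins: `X_demo`, `y_demo`, and `pred_demo` are hypothetical names playing the roles of `X_test`, `y_test`, and the model's predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test, y_test and the predictions (invented data)
X_demo = pd.Series(["Tasty soup.", "Cold fries.", "Friendly staff."])
y_demo = pd.Series([1.0, 0.0, 1.0])
pred_demo = np.array([1.0, 1.0, 0.0])

# True where the prediction matches the label
correct_mask = pred_demo == y_demo.values

correct = X_demo[correct_mask]         # correctly classified sentences
misclassified = X_demo[~correct_mask]  # misclassified sentences

print(correct.tolist())
print(misclassified.tolist())
```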