# Yelp reviews



**Description of the data:**

- **`yelp.csv`** contains the dataset. 
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars is better. In other words, it is the reviewer's rating of the business.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.




Read yelp.csv into a pandas DataFrame and examine it.

In [1]:
import pandas as pd

In [2]:
path = 'data/yelp.csv'
yelp_df = pd.read_csv(path)

In [3]:
yelp_df.head()

,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [4]:
yelp_df.shape

(10000, 10)

In [5]:
yelp_df.dtypes

business_id    object
date           object
review_id      object
stars           int64
text           object
type           object
user_id        object
cool            int64
useful          int64
funny           int64
dtype: object

In [6]:
yelp_df.columns

Index(['business_id', 'date', 'review_id', 'stars', 'text', 'type', 'user_id',
       'cool', 'useful', 'funny'],
      dtype='object')



Create a new DataFrame that contains only the **5-star** and **1-star** reviews.



In [7]:
yelp_5_or_1_star = yelp_df.loc[(yelp_df['stars']==1) | (yelp_df['stars']==5)]

In [8]:
yelp_5_or_1_star.shape

(4086, 10)

In [9]:
yelp_5_or_1_star['stars'].value_counts()

5    3337
1     749
Name: stars, dtype: int64



Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.



In [10]:
X = yelp_5_or_1_star['text']
y = yelp_5_or_1_star['stars']

In [11]:
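# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split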
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)



In [12]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3064,)
(1022,)
(3064,)
(1022,)
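
The 3064/1022 split follows from the default test fraction of 0.25 that `train_test_split` uses when neither `test_size` nor `train_size` is given: the test count is rounded up and the training set gets the remainder. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper and not a scikit-learn function:

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size.

    Hypothetical helper: the test count is rounded up and the
    training set receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(4086)
print(n_train, n_test)  # 3064 1022
```

Since only about 18% of the filtered reviews are 1-star, passing `stratify=y` to `train_test_split` is one option for keeping the class proportions the same in both sets; the cells above do not do this.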




Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english')

In [14]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [15]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vect.get_feature_names()

['00',
 '000',
 '00a',
 '00am',
 '00pm',
 '01',
 '02',
 '03',
 '03342',
 '04',
 '05',
 '06',
 '07',
 '09',
 '0buxoc0crqjpvkezo3bqog',
 '0l',
 '10',
 '100',
 '1000',
 '1000x',
 '1001',
 '100th',
 '101',
 '102',
 '105',
 '1070',
 '108',
 '10am',
 '10ish',
 '10min',
 '10mins',
 '10minutes',
 '10pm',
 '10th',
 '10x',
 '11',
 '110',
 '1100',
 '111',
 '111th',
 '112',
 '115th',
 '118',
 '11a',
 '11am',
 '11p',
 '11pm',
 '12',
 '120',
 '128i',
 '129',
 '12am',
 '12oz',
 '12pm',
 '12th',
 '13',
 '14',
 '140',
 '147',
 '14lbs',
 '15',
 '150',
 '1500',
 '150mm',
 '15am',
 '15mins',
 '15pm',
 '15th',
 '16',
 '160',
 '165',
 '169',
 '16th',
 '17',
 '17p',
 '18',
 '180',
 '18th',
 '19',
 '1900',
 '1913',
 '1928',
 '1929',
 '1930s',
 '1940',
 '1952',
 '1955',
 '1956',
 '1960',
 '1961',
 '1969',
 '1970',
 '1980',
 '1980s',
 '1987',
 '1990s',
 '1992',
 '1995',
 '1996',
 '1998',
 '1999',
 '19th',
 '1cent',
 '1k',
 '1p',
 '1pm',
 '1st',
 '20',
 '200',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 ...]

In [16]:
X_train_dtm = vect.transform(X_train)

In [17]:
X_train_dtm

<3064x16528 sparse matrix of type '<class 'numpy.int64'>'
	with 143743 stored elements in Compressed Sparse Row format>

In [18]:
X_test_dtm = vect.transform(X_test)
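
To make the fit/transform distinction concrete, here is a rough pure-Python sketch of what a document-term matrix is. It lowercases the text and uses the same token pattern shown in the `CountVectorizer` output above, but it ignores stop-word removal and sparse storage. The helpers `tokenize`, `build_vocabulary`, and `to_dtm` and the toy documents are illustrative assumptions, not scikit-learn internals.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # words of 2+ characters

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def build_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def to_dtm(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["Great food, great service", "Terrible service"]
test_docs = ["Great pizza"]

vocab = build_vocabulary(train_docs)   # learned from the training text only
train_dtm = to_dtm(train_docs, vocab)
test_dtm = to_dtm(test_docs, vocab)    # "pizza" was never seen, so it is ignored

print(list(vocab))   # ['food', 'great', 'service', 'terrible']
print(train_dtm)     # [[1, 2, 1, 0], [0, 0, 1, 1]]
print(test_dtm)      # [[0, 1, 0, 0]]
```

Because the vocabulary comes only from the training documents, the test matrix has exactly the same columns as the training matrix. That is why the test set is passed through `vect.transform` rather than being fitted again.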



Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.



In [19]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [20]:
nb.fit(X_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
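
For intuition about what `fit` and `predict` do, the following is a simplified pure-Python sketch of multinomial Naive Bayes with additive smoothing (`alpha=1.0`, as in the output above) and class priors learned from the training labels (`fit_prior=True`). The function names and toy counts are assumptions for illustration; scikit-learn's implementation is vectorized and differs in detail.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: list of token-count rows, y: list of class labels."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each token across all documents of class c.
        feature_count = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(feature_count) + alpha * n_features
        log_likelihood[c] = [math.log((cnt + alpha) / total) for cnt in feature_count]
    return classes, log_prior, log_likelihood

def predict(model, row):
    """Return the class with the highest log prior plus log likelihood."""
    classes, log_prior, log_likelihood = model
    def score(c):
        return log_prior[c] + sum(n * lp for n, lp in zip(row, log_likelihood[c]))
    return max(classes, key=score)

# Toy data: columns are counts of ["great", "terrible"].
X_toy = [[2, 0], [1, 0], [0, 3]]
y_toy = [5, 5, 1]
model = fit_multinomial_nb(X_toy, y_toy)
print(predict(model, [1, 0]))  # 5
print(predict(model, [0, 2]))  # 1
```

The smoothing term keeps a token that never appeared in one class from driving that class's probability to zero.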

In [21]:
y_pred_class = nb.predict(X_test_dtm)

In [22]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9158512720156555

In [23]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[124,  60],
       [ 26, 812]], dtype=int64)

In [26]:
labels = sorted(set(y_test))  # sorted so rows and columns appear in a fixed order
labels

[1, 5]

In [27]:
df = pd.DataFrame(
    data=metrics.confusion_matrix(y_test, y_pred_class, labels=labels),
    columns=labels,
    index=labels
)
df

,1,5
1,124,60
5,26,812
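
The confusion matrix also yields per-class figures that the overall accuracy hides. A short sketch using the four counts above, treating 5-star as the positive class (the convention the false positive and false negative cells in this notebook follow):

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 124, 60   # actual 1-star: predicted 1, predicted 5
fn, tp = 26, 812   # actual 5-star: predicted 1, predicted 5

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of 5-star reviews classified correctly
specificity = tn / (tn + fp)   # share of 1-star reviews classified correctly

print(round(accuracy, 4))      # 0.9159
print(round(sensitivity, 4))   # 0.969
print(round(specificity, 4))   # 0.6739
```

The model recognizes 5-star reviews far more reliably than 1-star reviews, which likely reflects how heavily 5-star reviews dominate the training data.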




Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.



In [28]:
y_test.value_counts()


5    838
1    184
Name: stars, dtype: int64

In [29]:
# share of the most frequent class in the test set, as a percentage
y_test.value_counts(normalize=True).max() * 100

81.99608610567515



Browse through the review text of some of the **false positives** and **false negatives**.



In [30]:
# false positives: 1-star reviews incorrectly predicted as 5-star
X_test[(y_test==1) & (y_pred_class==5)]

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
995     Not worth the money. Food was average and the ...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\r\nI brought my Mountain Bike in ...
1423    I hadn't been to Fuddruckers for about 20 year...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
2490    Lazy Q CLOSED in 2010.  New Owners cleaned up ...
6916    Brought a group to Metro for brunch, made the ...
8755    Not lesbian/gay friendly at all. I should have...
9125    La Grande Orange Grocery has a problem. It can...
7380    I keep wanting to like this place.  I want to ...
9185    For frozen yogurt quality, I give this place a...
436     this another place that i would give no stars ...
2051    Sadly 

In [32]:
# false negatives: 5-star reviews incorrectly predicted as 1-star
X_test[(y_test==5) & (y_pred_class==1)]

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
3149    I was told to see Greg after a local shop diag...
423     These guys helped me out with my rear windshie...
763     Here's the deal. I said I was done with OT, bu...
5805    One of our Lexus car keys/key fob was cracked ...
8956    I took my computer to RedSeven recently when m...
750     This store has the most pleasant employees of ...
9765    You can't give anything less than 5 stars to a...
4646    Wow, I



Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.



In [33]:
# on scikit-learn 1.2+, use vect.get_feature_names_out() instead
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

16528

In [34]:
print(X_train_tokens[0:50])

['00', '000', '00a', '00am', '00pm', '01', '02', '03', '03342', '04', '05', '06', '07', '09', '0buxoc0crqjpvkezo3bqog', '0l', '10', '100', '1000', '1000x', '1001', '100th', '101', '102', '105', '1070', '108', '10am', '10ish', '10min', '10mins', '10minutes', '10pm', '10th', '10x', '11', '110', '1100', '111', '111th', '112', '115th', '118', '11a', '11am', '11p', '11pm', '12', '120', '128i']


In [35]:
print(X_train_tokens[-50:])

['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']


In [36]:
nb.feature_count_

array([[26.,  4.,  1., ...,  0.,  0.,  0.],
       [39.,  5.,  0., ...,  1.,  1.,  1.]])

In [37]:
nb.feature_count_.shape

(2, 16528)

In [38]:
one_star_count = nb.feature_count_[0, :]
one_star_count

array([26.,  4.,  1., ...,  0.,  0.,  0.])

In [39]:
five_star_count = nb.feature_count_[1, :]
five_star_count

array([39.,  5.,  0., ...,  1.,  1.,  1.])

In [43]:
tokens = pd.DataFrame({'token':X_train_tokens, '1-star':one_star_count, '5-star':five_star_count}).set_index('token')
tokens.tail(10)

token,1-star,5-star
zucchini,1.0,10.0
zuchinni,1.0,1.0
zumba,0.0,3.0
zupa,0.0,1.0
zuzu,0.0,3.0
zwiebel,0.0,1.0
zzed,0.0,1.0
éclairs,0.0,1.0
école,0.0,1.0
ém,0.0,1.0
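
One way to rank tokens, sketched here as an assumption and not as the notebook's own continuation, is to add 1 to every count (so no token has a zero frequency), divide by the number of training reviews in each class, and take the ratio of the two frequencies. The class sizes 565 and 2499 are derived from the value counts above (749 - 184 one-star and 3337 - 838 five-star reviews in the training set). The three sample tokens are rows from the table above, and `five_star_ratio` is a hypothetical helper.

```python
# Training-set class sizes, derived from the value counts above.
n_one_star = 749 - 184     # 565
n_five_star = 3337 - 838   # 2499

# (1-star count, 5-star count) for a few tokens from the table above.
counts = {"zucchini": (1, 10), "zuchinni": (1, 1), "zumba": (0, 3)}

def five_star_ratio(one_count, five_count):
    """Smoothed 5-star frequency divided by smoothed 1-star frequency."""
    one_freq = (one_count + 1) / n_one_star
    five_freq = (five_count + 1) / n_five_star
    return five_freq / one_freq

ratios = {tok: five_star_ratio(*c) for tok, c in counts.items()}

# Highest ratio first: tokens relatively more common in 5-star reviews.
for tok, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(tok, round(ratio, 3))
```

Applied to the full `tokens` DataFrame, sorting by this ratio would put the tokens most indicative of 5-star reviews at one end and those most indicative of 1-star reviews at the other. Dividing by class size matters because the training set holds over four times as many 5-star reviews as 1-star reviews, so raw counts alone favor the 5-star column.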
