## Yelp Votes

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

* `yelp.json` is the original format of the file. `yelp.csv` contains the same data, in a more convenient format. Both of the files are in this repo, so there is no need to download the data from the Kaggle website.
* Each observation in this dataset is a review of a particular business by a particular user.
* The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
* The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
* The "useful" and "funny" columns are similar to the "cool" column.


In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

#### Load yelp.csv into a DataFrame and explore it: look at some correlations, the distribution of star ratings, etc.

In [7]:
df = pd.read_csv('yelp.csv')

In [8]:
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
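
The exploration step asks for correlations and the distribution of star ratings, which `head()` alone does not show. The following is a minimal sketch of one way to look at both. It uses a hypothetical six-row DataFrame named `sample` with invented values, so that it runs without `yelp.csv`; only the column names come from the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for yelp.csv (values invented for illustration).
sample = pd.DataFrame({
    "stars":  [5, 4, 1, 3, 5, 2],
    "cool":   [2, 0, 0, 1, 3, 0],
    "useful": [5, 1, 0, 2, 4, 1],
    "funny":  [0, 0, 1, 1, 2, 0],
})

# Distribution of star ratings, ordered by rating rather than by frequency.
star_counts = sample["stars"].value_counts().sort_index()
print(star_counts)

# Pairwise Pearson correlations between the numeric columns.
corr = sample[["stars", "cool", "useful", "funny"]].corr()
print(corr.round(2))
```

On the real data, the same two calls would be applied to `df` instead of `sample`.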


#### Let's say that a review is considered positive if it has 4 or 5 stars. Define a column containing 1 for positive reviews and 0 for negative ones. We will try to predict positive reviews using the review text.

In [9]:
df['category'] = (df.stars >= 4).astype(int)
df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny,category
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0,1
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0,1
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,1,0,1
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0,1
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0,1


#### Find how many positive and negative reviews there are in the dataset

In [10]:
categories = df.groupby('category').count()
categories.business_id

category
0    3137
1    6863
Name: business_id, dtype: int64
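
Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach before training anything. The short calculation below is a sketch that uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the output above.
n_negative = 3137
n_positive = 6863

total = n_negative + n_positive

# Accuracy of a model that always predicts the majority class (positive).
baseline_accuracy = max(n_negative, n_positive) / total
print(f"{baseline_accuracy:.4f}")  # 0.6863
```

Any classifier trained later should beat this figure to be considered useful.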

## Vectorizing text data

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# Define a vectorizer; max_features=2000 keeps only the 2000 most frequent words across the combined text of all reviews
v = CountVectorizer(max_features=2000)

In [13]:
v.fit(df.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=2000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
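
To see what `max_features` does on a scale that can be checked by hand, here is a small sketch on a hypothetical three-review corpus (`toy_reviews` is invented for illustration). With `max_features=2`, only the two terms with the highest total count across the corpus should survive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "good" occurs 4 times, "food" 3 times,
# "service" and "bad" once each.
toy_reviews = [
    "good food good service",
    "good food bad",
    "good food",
]

# Keep only the 2 most frequent terms across the whole toy corpus.
small = CountVectorizer(max_features=2)
small.fit(toy_reviews)

# vocabulary_ maps each kept term to its column index.
kept_terms = sorted(small.vocabulary_)
print(kept_terms)  # ['food', 'good']
```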

#### Take a look at the dictionary of words

In [14]:
v.get_feature_names()

['00',
 '10',
 '100',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '18',
 '20',
 '24',
 '25',
 '30',
 '40',
 '45',
 '50',
 '60',
 '75',
 '80',
 '90',
 '95',
 '99',
 'able',
 'about',
 'above',
 'absolute',
 'absolutely',
 'accommodating',
 'across',
 'actual',
 'actually',
 'add',
 'added',
 'adding',
 'addition',
 'additional',
 'admit',
 'advantage',
 'advice',
 'affordable',
 'after',
 'afternoon',
 'again',
 'against',
 'age',
 'ago',
 'agree',
 'agreed',
 'ahead',
 'ahi',
 'air',
 'airport',
 'al',
 'alcohol',
 'all',
 'almost',
 'alone',
 'along',
 'alot',
 'already',
 'alright',
 'also',
 'alternative',
 'although',
 'always',
 'am',
 'amazing',
 'ambiance',
 'ambience',
 'american',
 'among',
 'amount',
 'an',
 'and',
 'annoying',
 'another',
 'answer',
 'any',
 'anymore',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'app',
 'apparently',
 'appetizer',
 'appetizers',
 'apple',
 'appointment',
 'appreciate',
 'appreciated',
 'are',
 'area',
 'areas',
 'ar

#### Transforming the text to vectors.
Note that X is a sparse matrix, not a dense array.

In [15]:
X = v.transform(df.text)

In [16]:
X

<10000x2000 sparse matrix of type '<class 'numpy.int64'>'
	with 696866 stored elements in Compressed Sparse Row format>
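
The numbers in this output show why a sparse format is used. The arithmetic below is a sketch based only on the shape and stored-element count printed above; the 8 bytes per entry assumes the `int64` dtype shown in the output.

```python
# Figures copied from the sparse matrix summary above.
n_rows, n_cols = 10000, 2000
stored = 696866

# Fraction of cells that are non-zero.
density = stored / (n_rows * n_cols)
print(f"{density:.2%}")  # 3.48%

# Average number of distinct vocabulary words per review.
avg_terms_per_review = stored / n_rows
print(f"{avg_terms_per_review:.1f}")  # 69.7

# Memory for the cell values of a dense int64 copy (8 bytes per cell).
dense_bytes = n_rows * n_cols * 8
print(f"{dense_bytes / 1e6:.0f} MB")  # 160 MB
```

So roughly 96.5% of the cells are zero, and a dense copy would spend most of its memory on them.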

#### Take a look at the ndarray version of X and the shape of X. What is represented by each row and by each column?

In [17]:
a = X.toarray()

In [18]:
X.shape

(10000, 2000)

In [114]:
X.ndim

2
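
To make the row and column question concrete, the sketch below builds the same kind of matrix for a hypothetical two-review corpus (the reviews and the names `toy_v` and `toy_X` are invented for illustration). Each row corresponds to one review, each column to one vocabulary word, and each cell holds how many times that word occurs in that review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-review corpus.
toy_reviews = ["good food good service", "bad food"]

toy_v = CountVectorizer()
toy_X = toy_v.fit_transform(toy_reviews)

# Order the terms by their column index so they line up with the matrix columns.
terms = sorted(toy_v.vocabulary_, key=toy_v.vocabulary_.get)
dense = toy_X.toarray()

print(terms)  # ['bad', 'food', 'good', 'service']
print(dense)  # rows: [0 1 2 1] and [1 1 0 0]
```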

#### We would like to predict if a review is positive. Define the y variable to run `clf.fit(X, y)`

In [20]:
y = df.category
y.shape

(10000,)

#### Set 25% of the data aside as testing data (use the special sklearn function for this)

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
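# Assign the four pieces; without the assignment the split is discarded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
[X_train, X_test, y_train, y_test]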

[<7500x2000 sparse matrix of type '<class 'numpy.int64'>'
 	with 523270 stored elements in Compressed Sparse Row format>,
 <2500x2000 sparse matrix of type '<class 'numpy.int64'>'
 	with 173596 stored elements in Compressed Sparse Row format>,
 1521    1
 5211    1
 7974    0
 8009    1
 5593    1
 9893    1
 4211    1
 5444    0
 2352    1
 2517    1
 1292    1
 2041    1
 6305    0
 5140    1
 775     0
 957     1
 3059    0
 1314    1
 7975    0
 5831    1
 6690    1
 1537    1
 7838    1
 8186    1
 8124    1
 7118    0
 8061    1
 9461    1
 438     1
 656     1
        ..
 5697    0
 5193    1
 1765    1
 7139    0
 4251    0
 5016    0
 5678    1
 1959    0
 9981    1
 3316    0
 639     1
 8933    1
 4847    1
 901     1
 1019    1
 4349    1
 5240    1
 3509    1
 7501    0
 3527    0
 5021    1
 5980    1
 3983    1
 4894    1
 8012    1
 2529    0
 4273    0
 9585    1
 356     1
 1383    0
 Name: category, Length: 7500, dtype: int32,
 3504    1
 805     0
 7999    1
 9523  

#### Import the MultinomialNB classifier from sklearn

#### Create an instance of the MultinomialNB classifier with default parameters

#### Train the classifier using the training data

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
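
The three steps above (import, instantiate, train) can be sketched as follows. This is not the notebook's own solution: it trains on a hypothetical four-review corpus (`train_texts`, `train_labels` are invented) so that it runs without `yelp.csv`. In the notebook, the fit would presumably be done on the training portion of the split instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the training split.
train_texts = [
    "great food love it",
    "amazing service great place",
    "terrible food",
    "bad service awful place",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

toy_v = CountVectorizer()
X_toy = toy_v.fit_transform(train_texts)

# Default parameters: alpha=1.0 (Laplace smoothing), priors learned from data.
clf = MultinomialNB()
clf.fit(X_toy, train_labels)

# New text must go through the same fitted vectorizer before prediction.
pred = clf.predict(toy_v.transform(["great place", "awful food"]))
print(pred)  # [1 0]
```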

#### Predict on the test data and compare the predictions to the actual target. What is the accuracy score?

#### Compare it to the training set accuracy score (predictions for the training data)
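
One possible way to compute both scores is sketched below, using `accuracy_score` from `sklearn.metrics`. The corpus is hypothetical (a few invented phrases repeated), so the numbers it prints say nothing about the Yelp data; the point is only the pattern of scoring the same fitted model on both splits. A training score far above the test score would suggest overfitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; the notebook would use df.text and df.category.
positive = ["great food", "love this place", "amazing service", "friendly staff"]
negative = ["terrible food", "awful place", "bad service", "rude staff"]
texts = positive * 10 + negative * 10
labels = [1] * len(positive) * 10 + [0] * len(negative) * 10

# Vectorize first, then split, mirroring the order used in this notebook.
X_toy = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, labels, test_size=0.25, random_state=0
)

clf = MultinomialNB().fit(X_train, y_train)

# Same model, scored on unseen data and on the data it was trained on.
test_accuracy = accuracy_score(y_test, clf.predict(X_test))
train_accuracy = accuracy_score(y_train, clf.predict(X_train))
print(f"test accuracy:  {test_accuracy:.3f}")
print(f"train accuracy: {train_accuracy:.3f}")
```

`clf.score(X_test, y_test)` returns the same accuracy in a single call.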

#### Try using more words when vectorizing and see if this improves the accuracy

#### Plot a graph of test accuracy as a function of number of words
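
The last two tasks can be approached with a loop over `max_features`. The sketch below is one way to do it, not a reference solution: `accuracy_by_vocab_size` and `plot_results` are hypothetical helper names, the corpus is invented, and the fixed `random_state` is an assumption added so that every vocabulary size is evaluated on the same test rows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def accuracy_by_vocab_size(texts, labels, sizes, seed=0):
    """Return (n_words, test_accuracy) pairs, one per vocabulary size."""
    results = []
    for n_words in sizes:
        X = CountVectorizer(max_features=n_words).fit_transform(texts)
        # Same seed each time, so the same rows end up in the test set.
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.25, random_state=seed
        )
        clf = MultinomialNB().fit(X_train, y_train)
        results.append((n_words, clf.score(X_test, y_test)))
    return results


def plot_results(results):
    """Line plot of test accuracy against number of words."""
    import matplotlib.pyplot as plt  # imported here so the loop runs without it

    sizes, scores = zip(*results)
    plt.plot(sizes, scores, marker="o")
    plt.xlabel("Number of words (max_features)")
    plt.ylabel("Test accuracy")
    plt.show()


# Hypothetical toy corpus; the notebook would pass df.text and df.category.
positive = ["great food", "love this place", "amazing service", "friendly staff"]
negative = ["terrible food", "awful place", "bad service", "rude staff"]
texts = positive * 10 + negative * 10
labels = [1] * len(positive) * 10 + [0] * len(negative) * 10

results = accuracy_by_vocab_size(texts, labels, sizes=[2, 5, 10])
for n_words, score in results:
    print(n_words, round(score, 3))
```

Calling `plot_results(results)` draws the graph. On the real data, a wider range of sizes (for example from a few hundred to several thousand words) would be a reasonable starting point.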

If you finish all the tasks in this notebook, continue with this regression exercise, which uses the same data: https://github.com/justmarkham/DAT8/blob/master/homework/10_yelp_votes.md