# Linear regression homework with Yelp votes

## Introduction

This assignment uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- `yelp.json` is the original format of the file. `yelp.csv` contains the same data, in a more convenient format. Both of the files are in this repo, so there is no need to download the data from the Kaggle website.
- Each observation in this dataset is a review of a particular business by a particular user.
- The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The "cool" column is the number of "cool" votes this review received from other Yelp users. All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.
- The "useful" and "funny" columns are similar to the "cool" column.

## Task 1

Read `yelp.csv` into a DataFrame.

In [2]:
# access yelp.csv using a relative path
import pandas as pd
yelp = pd.read_csv('../data/yelp.csv')
yelp.head(1)

IOError: File ../data/yelp.csv does not exist

## Task 1 (Bonus)

Ignore the `yelp.csv` file, and construct this DataFrame yourself from `yelp.json`. This involves reading the data into Python, decoding the JSON, converting it to a DataFrame, and adding individual columns for each of the vote types.

In [None]:
# read the data from yelp.json into a list of rows
# each row is decoded into a dictionary using json.loads(), and the resulting list is named "data"
import json
with open('../data/yelp.json', 'r') as f:
    data = [json.loads(row) for row in f]

In [None]:
yelp.head()

In [None]:
# show the first review
data[0]

In [3]:
# convert the list of dictionaries to a DataFrame
yelp = pd.DataFrame(data)

NameError: name 'data' is not defined

In [4]:
yelp.head()

NameError: name 'yelp' is not defined

In [5]:
# add DataFrame columns for cool, useful, and funny
yelp['cool'] = [row['votes']['cool'] for row in data]
yelp['useful'] = [row['votes']['useful'] for row in data]
yelp['funny'] = [row['votes']['funny'] for row in data]

NameError: name 'data' is not defined

In [6]:
yelp.head()

NameError: name 'yelp' is not defined

In [7]:
# drop the votes column and then display the head
yelp.drop('votes', axis=1, inplace=True)
yelp.head()


NameError: name 'yelp' is not defined
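
As an alternative to the list comprehensions above, the nested `votes` dictionary can be flattened in one step with `pd.json_normalize`. This is only a minimal sketch, assuming each line of the file is one JSON object with a nested `votes` field. The two records below are made up for illustration and stand in for `yelp.json`; `sample`, `records`, and `flat` are illustrative names.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for yelp.json: one JSON object per line (made-up records).
sample = io.StringIO(
    '{"stars": 5, "text": "Great tacos", "votes": {"cool": 2, "useful": 3, "funny": 0}}\n'
    '{"stars": 2, "text": "Slow service", "votes": {"cool": 0, "useful": 1, "funny": 1}}\n'
)

# Decode each line into a dictionary.
records = [json.loads(line) for line in sample]

# json_normalize expands the nested "votes" dict into the columns
# votes.cool, votes.useful, and votes.funny.
flat = pd.json_normalize(records)

# Strip the "votes." prefix so the vote columns are named cool, useful, and funny.
flat = flat.rename(columns=lambda name: name.replace('votes.', ''))

print(flat[['stars', 'cool', 'useful', 'funny']])
```

With this approach there is no separate `votes` column left to drop afterwards.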

## Task 2

Explore the relationship between each of the vote types (cool/useful/funny) and the number of stars.

In [8]:
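# treat stars as a categorical variable and look for differences between groups by comparing the means of the groups
# (only the vote columns are aggregated, so text columns do not break mean() and std())
import numpy as np
smean = yelp.groupby('stars')[['cool', 'useful', 'funny']].mean()
sstd = yelp.groupby('stars')[['cool', 'useful', 'funny']].std()

print(smean)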

NameError: name 'yelp' is not defined

In [9]:
# display a correlation matrix of the vote types (cool/useful/funny) and stars
%matplotlib inline
import seaborn as sns
sns.heatmap(yelp[['stars', 'cool', 'useful', 'funny']].corr())

NameError: name 'yelp' is not defined

In [None]:
# display multiple scatter plots (cool, useful, funny) with linear regression line
feature_cols = ['cool', 'useful', 'funny']
sns.pairplot(yelp, x_vars=feature_cols, y_vars='stars', kind='reg')

## Task 3

Define cool/useful/funny as the feature matrix X, and stars as the response vector y.

In [None]:

# feature matrix (cool, useful, funny) and response vector (stars)
x = yelp[feature_cols]
y = yelp.stars


## Task 4

Fit a linear regression model and interpret the coefficients. Do the coefficients make intuitive sense to you? Explore the Yelp website to see if you detect similar trends.

In [10]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(x, y)

# print the intercept and the coefficients (one per feature, in the order of feature_cols)
print(linreg.intercept_)
print(linreg.coef_)

NameError: name 'x' is not defined
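
One way to read the coefficients: each one is the expected change in the response for a one-unit increase in that feature, with the other features held fixed. A minimal sketch on synthetic data (not the Yelp data), where the true slopes are known in advance; `X_demo`, `y_demo`, and `model` are illustrative names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3.0 + 0.5 * feature0 - 0.25 * feature1.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 10, size=(200, 2)).astype(float)
y_demo = 3.0 + 0.5 * X_demo[:, 0] - 0.25 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# The fitted intercept and coefficients should recover the values used above.
print(model.intercept_)
print(model.coef_)

# Raising the first feature by 1 (second feature unchanged) shifts the
# prediction by exactly the first coefficient.
base = model.predict([[4.0, 4.0]])[0]
bumped = model.predict([[5.0, 4.0]])[0]
print(bumped - base)
```

On the Yelp data the same reading applies: a coefficient on `cool` is the expected change in stars for one additional "cool" vote, with `useful` and `funny` held constant.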

## Task 5

Evaluate the model by splitting the data into training and testing sets and computing the RMSE on the testing set. Does the RMSE make intuitive sense to you?
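
RMSE is the square root of the mean squared difference between the actual and predicted values, so it is expressed in the same units as the response (here, stars). A minimal sketch with made-up numbers, only to show the arithmetic; `y_true` and `y_hat` are illustrative names:

```python
import numpy as np

# Made-up star ratings and predictions, for illustration only.
y_true = np.array([5, 4, 1, 3])
y_hat = np.array([4.0, 4.0, 3.0, 3.0])

errors = y_true - y_hat        # [1, 0, -2, 0]
mse = np.mean(errors ** 2)     # (1 + 0 + 4 + 0) / 4 = 1.25
rmse = np.sqrt(mse)            # square root of 1.25

print(rmse)
```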

In [11]:
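# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split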
from sklearn import metrics
import numpy as np

In [12]:
# pair each feature name with its coefficient
list(zip(feature_cols, linreg.coef_))

NameError: name 'feature_cols' is not defined

In [13]:
sns.heatmap(yelp[['stars', 'cool', 'useful', 'funny']].corr())

NameError: name 'yelp' is not defined

In [14]:
# define a function that accepts a list of features and returns testing RMSE
def train_test_rmse(feature_cols):
    x = yelp[feature_cols]
    y = yelp.stars
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(x_train, y_train)
    y_pred = linreg.predict(x_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [15]:
# calculate RMSE with all three features
print(train_test_rmse(feature_cols))

NameError: name 'feature_cols' is not defined

## Task 6

Try removing some of the features and see if the RMSE improves.

In [16]:
# RMSE with each pair of features
print(train_test_rmse(['cool', 'useful']))
print(train_test_rmse(['useful', 'funny']))
print(train_test_rmse(['cool', 'funny']))

# RMSE with each single feature
print(train_test_rmse(['cool']))
print(train_test_rmse(['funny']))
print(train_test_rmse(['useful']))

NameError: global name 'yelp' is not defined

## Task 7 (Bonus)

Think of some new features you could create from the existing data that might be predictive of the response. Figure out how to create those features in Pandas, add them to your model, and see if the RMSE improves.

In [17]:
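# inspect a few reviews for feature ideas (a cell displays only its last expression, so print each one)
print(data[5])
print(data[10])
print(data[15])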

NameError: name 'data' is not defined

In [18]:
# new feature: 1 if the review mentions "delicious", "tasty", or "great", else 0
# (the checks are combined with |, because assigning to the same column three times keeps only the last one)
yelp['great_food'] = (yelp.text.str.contains('delicious', case=False) |
                      yelp.text.str.contains('tasty', case=False) |
                      yelp.text.str.contains('great', case=False)).astype(int)

NameError: name 'yelp' is not defined

In [19]:
# new feature: the same flag built from one regular expression (str.contains takes a pattern string, not a list)
yelp['great_food'] = yelp.text.str.contains('delicious|tasty|great', case=False).astype(int)

NameError: name 'yelp' is not defined

In [20]:
# add new features to the model and calculate RMSE
train_test_rmse(['cool', 'useful', 'funny', 'great_food'])

NameError: global name 'yelp' is not defined
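
Other candidate features can be derived from the review text in the same way. A minimal sketch on a made-up DataFrame standing in for the Yelp data; `text_length` and `exclamations` are hypothetical feature names, and whether they help the RMSE is something to check, not something this sketch shows:

```python
import pandas as pd

# Made-up reviews standing in for the Yelp data.
reviews = pd.DataFrame({
    'stars': [5, 1, 4],
    'text': ['Delicious!', 'Cold food. Rude staff. Never again.', 'Tasty and quick'],
})

# Hypothetical feature: number of characters in the review text.
reviews['text_length'] = reviews.text.str.len()

# Hypothetical feature: number of exclamation marks in the review text.
reviews['exclamations'] = reviews.text.str.count('!')

print(reviews[['stars', 'text_length', 'exclamations']])
```

Columns created this way on `yelp` can be passed to `train_test_rmse` alongside the vote columns.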

## Task 8 (Bonus)

Compare your best RMSE on the testing set with the RMSE for the "null model", which is the model that ignores all features and simply predicts the mean response value in the testing set.
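
The null model described above can be sketched as follows, using made-up testing-set ratings (not the Yelp data); `y_test_demo`, `y_null`, and `null_rmse` are illustrative names:

```python
import numpy as np

# Made-up testing-set star ratings, for illustration only.
y_test_demo = np.array([5, 4, 1, 3, 2])

# The null model predicts the same value for every observation:
# the mean of the testing-set response.
y_null = np.full(len(y_test_demo), y_test_demo.mean())

# RMSE of those constant predictions.
null_rmse = np.sqrt(np.mean((y_test_demo - y_null) ** 2))
print(null_rmse)
```

Because every prediction is the mean, this value equals the standard deviation of the testing-set response (`np.std` with its default `ddof=0`). A model that uses features is only useful if its testing RMSE is lower than this baseline.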