# Multiple Regression (Interpretation)

In this notebook we will use data on house sales in King County to predict prices using multiple regression. We will:
* Use SFrames to do some feature engineering
* Use built-in graphlab functions to compute the regression weights (coefficients/parameters)
* Given the regression weights, predictors, and outcome, write a function to compute the Residual Sum of Squares (RSS)
* Look at coefficients and interpret their meanings
* Evaluate multiple models via RSS

# Fire up GraphLab Create

In [63]:
import graphlab
from math import log

# Load in house sales data

The dataset comes from house sales in King County, the region where the city of Seattle, WA is located.

In [20]:
sales = graphlab.SFrame('kc_house_data.gl/')
sales.head(3)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242,2,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000,1,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661

long,sqft_living15,sqft_lot15
-122.25677536,1340.0,5650.0
-122.3188624,1690.0,7639.0
-122.23319601,2720.0,8062.0


# Split data into training and testing.
We use seed=0 so that everyone running this notebook gets the same results. In practice, you can set any random seed; fixing it only makes the split reproducible.

In [3]:
train_data,test_data = sales.random_split(.8,seed=0)
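To make the role of the seed concrete, here is a minimal plain-Python sketch of a seeded random split. It is an illustration only, not GraphLab's actual sampling procedure; the helper `seeded_split` and the stand-in `rows` list are hypothetical names introduced for this example.

```python
import random

def seeded_split(rows, fraction, seed):
    # Each row goes to the first part with probability `fraction`
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales table
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
print(len(train_a) + len(test_a))  # 1000: every row lands in exactly one part
```

With a fixed seed the two calls produce identical partitions, which is why everyone running the notebook sees the same numbers.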

# Learning a multiple regression model

Recall that we can learn a multiple regression model predicting 'price' from the features
example_features = ['sqft_living', 'bedrooms', 'bathrooms'] on the training data with the following code:

(Aside: We set validation_set = None to ensure that the results are always the same)

In [4]:
example_features = ['sqft_living', 'bedrooms', 'bathrooms']
example_model = graphlab.linear_regression.create(train_data, target = 'price', features = example_features, validation_set = None)

Now that we have fitted the model we can extract the regression weights (coefficients) as an SFrame as follows:

In [9]:
example_model.get("coefficients")

name,index,value,stderr
(intercept),,87910.0724924,7873.3381434
sqft_living,,315.403440552,3.45570032585
bedrooms,,-65080.2155528,2717.45685442
bathrooms,,6944.02019265,3923.11493144


# Making Predictions

In [10]:
example_predictions = example_model.predict(train_data)
print example_predictions[0] # should be 271789.505878

271789.505878
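As a sanity check, this prediction can be reproduced by hand: a linear model's prediction is the intercept plus each coefficient times the corresponding feature value. The sketch below hard-codes the coefficients printed above and the feature values of the first house in the data (1180 sqft, 3 bedrooms, 1 bathroom). The dictionaries and the helper `predict_by_hand` are illustrative names, not part of GraphLab.

```python
# Coefficients copied from the example_model output above
coefficients = {
    '(intercept)': 87910.0724924,
    'sqft_living': 315.403440552,
    'bedrooms': -65080.2155528,
    'bathrooms': 6944.02019265,
}
# Feature values of the first house in the data
first_house = {'sqft_living': 1180.0, 'bedrooms': 3.0, 'bathrooms': 1.0}

def predict_by_hand(coefficients, features):
    # Start from the intercept and add coefficient * value for each feature
    prediction = coefficients['(intercept)']
    for name, value in features.items():
        prediction += coefficients[name] * value
    return prediction

manual_prediction = predict_by_hand(coefficients, first_house)
print(round(manual_prediction, 2))  # 271789.51
```

The result agrees with the model's 271789.505878 up to the rounding of the printed coefficients.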


# Compute RSS

In [64]:
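def get_residual_sum_of_squares(model, data, outcome):
    # Predict the outcome for every row in data
    predictions = model.predict(data)
    # Square each residual (observed minus predicted) and add them up
    squared_residuals = (outcome - predictions)**2
    RSS = squared_residuals.sum()
    return RSS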

In [18]:
rss_example_test = get_residual_sum_of_squares(example_model, test_data, test_data['price'])
print rss_example_test

2.7376153833e+14
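For reference, the quantity being computed is $RSS = \sum_i (y_i - \hat{y}_i)^2$, where $y_i$ is the observed price and $\hat{y}_i$ the predicted price of house $i$. A minimal sketch of the same calculation on plain Python lists, using made-up toy numbers rather than the house data (the function name `residual_sum_of_squares` is introduced here for illustration):

```python
def residual_sum_of_squares(observed, predicted):
    # Sum of squared differences between observed and predicted values
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

# Toy numbers (made up for illustration, not taken from the house data)
observed = [3.0, 5.0, 7.0]
predicted = [2.5, 5.0, 8.0]
toy_rss = residual_sum_of_squares(observed, predicted)
print(toy_rss)  # 0.5**2 + 0**2 + (-1)**2 = 1.25
```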


# Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms), we can also consider transformations of existing features, e.g. the log of the squarefeet, or even "interaction" features such as the product of bedrooms and bathrooms.

We will use the logarithm function to create one of the new features, which is why we imported it from the math library above.

* bedrooms_squared = bedrooms\*bedrooms
* bed_bath_rooms = bedrooms\*bathrooms
* log_sqft_living = log(sqft_living)
* lat_plus_long = lat + long 

# Adding new features to the dataset

In [37]:
train_data['bedrooms_squared'] = train_data['bedrooms'].apply(lambda x: x**2)
test_data['bedrooms_squared'] = test_data['bedrooms'].apply(lambda x: x**2)


train_data['bed_bath_rooms'] = train_data.apply(lambda x: x['bedrooms'] * x['bathrooms'])
test_data['bed_bath_rooms'] = test_data.apply(lambda x: x['bedrooms'] * x['bathrooms'])

train_data['log_sqft_living'] = train_data['sqft_living'].apply(lambda x: log(x))
test_data['log_sqft_living'] = test_data['sqft_living'].apply(lambda x: log(x))

train_data['lat_plus_long'] = train_data.apply(lambda x: x['lat'] + x['long'])
test_data['lat_plus_long'] = test_data.apply(lambda x: x['lat'] + x['long'])

In [38]:
train_data.head(3)

id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront
7129300520,2014-10-13 00:00:00+00:00,221900.0,3.0,1.0,1180.0,5650,1,0
6414100192,2014-12-09 00:00:00+00:00,538000.0,3.0,2.25,2570.0,7242,2,0
5631500400,2015-02-25 00:00:00+00:00,180000.0,2.0,1.0,770.0,10000,1,0

view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat
0,3,7,1180,0,1955,0,98178,47.51123398
0,3,7,2170,400,1951,1991,98125,47.72102274
0,3,6,770,0,1933,0,98028,47.73792661

long,sqft_living15,sqft_lot15,bed_bath_rooms,log_sqft_living,lat_plus_long,bedrooms_squared
-122.25677536,1340.0,5650.0,3.0,7.07326971746,-74.74554138,9.0
-122.3188624,1690.0,7639.0,6.75,7.85166117789,-74.59783966,9.0
-122.23319601,2720.0,8062.0,2.0,6.64639051485,-74.4952694,4.0


* Squaring bedrooms increases the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
* Bedrooms times bathrooms gives what's called an "interaction" feature. It is large when *both* of them are large.
* Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
* Adding latitude to longitude is totally nonsensical, but we will do it anyway.
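The four transformations can be checked by hand on the three rows shown in the table above. This is a plain-Python sketch that does not use SFrames; the helper `engineer_features` and the `rows` list are names introduced only for this example.

```python
from math import log

# (bedrooms, bathrooms, sqft_living, lat, long) for the three rows shown above
rows = [
    (3.0, 1.0, 1180.0, 47.51123398, -122.25677536),
    (3.0, 2.25, 2570.0, 47.72102274, -122.3188624),
    (2.0, 1.0, 770.0, 47.73792661, -122.23319601),
]

def engineer_features(bedrooms, bathrooms, sqft_living, latitude, longitude):
    # Same four transformations as in the notebook, applied to one row
    return {
        'bedrooms_squared': bedrooms * bedrooms,
        'bed_bath_rooms': bedrooms * bathrooms,
        'log_sqft_living': log(sqft_living),
        'lat_plus_long': latitude + longitude,
    }

engineered = [engineer_features(*row) for row in rows]
for features in engineered:
    print(features)
```

The printed values should match the new columns in the table, for example bed_bath_rooms = 6.75 for the second house.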

# Mean of bedrooms_squared on the test data

In [60]:
test_data['bedrooms_squared'].mean()

12.4466777015843

# Mean of bed_bath_rooms on the test data

In [61]:
test_data['bed_bath_rooms'].mean()

7.5039016315913925

# Mean of log_sqft_living on the test data

In [62]:
test_data['log_sqft_living'].mean()

7.550274679645938

# Mean of lat_plus_long on the test data

In [43]:
test_data['lat_plus_long'].mean()

-74.65333497217308

# Learning Multiple Models

* Model 1: squarefeet, # bedrooms, # bathrooms, latitude & longitude
* Model 2: Model 1 features plus bedrooms\*bathrooms
* Model 3: Model 2 features plus log squarefeet, bedrooms squared, and the (nonsensical) latitude + longitude

# Extracting features for Model 1

In [58]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']

# Extracting features for Model 2

In [None]:
model_2_features = model_1_features + ['bed_bath_rooms']

# Extracting features for Model 3

In [44]:
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

# Training Model 1

In [None]:
model_1 = graphlab.linear_regression.create(train_data, features = model_1_features, target = 'price', validation_set = None, verbose=False)


# Training Model 2

In [None]:
model_2 = graphlab.linear_regression.create(train_data, features = model_2_features, target = 'price', validation_set = None, verbose=False)


# Training Model 3

In [46]:
model_3 = graphlab.linear_regression.create(train_data, features = model_3_features, target = 'price', validation_set = None, verbose=False)

# Coefficients of Model 1

In [47]:
model_1.get('coefficients')

name,index,value,stderr
(intercept),,-56140675.7451,1649985.42026
sqft_living,,310.263325777,3.18882960408
bedrooms,,-59577.116068,2487.27977322
bathrooms,,13811.8405418,3593.54213296
lat,,629865.789514,13120.7100326
long,,-214790.285181,13284.2851608


# Coefficients of Model 2

In [48]:
model_2.get('coefficients')

name,index,value,stderr
(intercept),,-54410676.1159,1650405.16539
sqft_living,,304.449298056,3.20217535637
bedrooms,,-116366.04323,4805.54966545
bathrooms,,-77972.3305135,7565.0599109
lat,,625433.834982,13058.3530975
long,,-203958.602954,13268.1283712
bed_bath_rooms,,26961.6249092,1956.36561555


# Coefficients of Model 3

In [49]:
model_3.get('coefficients')

name,index,value,stderr
(intercept),,-52974974.0611,1615194.9439
sqft_living,,529.196420563,7.6991349851
bedrooms,,28948.5277312,9395.72889106
bathrooms,,65661.2072312,10795.3380703
lat,,704762.148356,2001005325.41
long,,-137780.020008,2001005325.51
bed_bath_rooms,,-8478.36410521,2858.95391257
bedrooms_squared,,-6072.38466065,1494.97042777
log_sqft_living,,-563467.784265,17567.8230814
lat_plus_long,,-83217.1978594,2001005325.4
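One likely reading of the very large standard errors on lat, long, and lat_plus_long: because lat_plus_long is exactly lat + long, many different weight triples give identical predictions, so the data cannot pin down the three weights individually. The sketch below illustrates this with the Model 3 weights; the shift `t` and the helper `location_term` are hypothetical, chosen only for the demonstration.

```python
import math

# Weights copied from the Model 3 output above
w_lat, w_long, w_sum = 704762.148356, -137780.020008, -83217.1978594

def location_term(lat, lng, w_lat, w_long, w_sum):
    # Contribution of lat, long and lat_plus_long to a prediction
    return w_lat * lat + w_long * lng + w_sum * (lat + lng)

lat, lng = 47.51123398, -122.25677536  # first house in the data
t = 1000.0  # arbitrary shift, chosen only for illustration

original = location_term(lat, lng, w_lat, w_long, w_sum)
# Move t from the lat_plus_long weight onto lat and long individually
shifted = location_term(lat, lng, w_lat + t, w_long + t, w_sum - t)
print(math.isclose(original, shifted, rel_tol=1e-9))  # True

# Only these two sums affect the predictions
effective_lat = w_lat + w_sum
effective_long = w_long + w_sum
print(effective_lat, effective_long)
```

The effective weights come out at roughly 621545 for latitude and -220997 for longitude, the same order of magnitude as the lat and long weights of Model 1. This suggests interpreting the three location weights of Model 3 together rather than one at a time.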


# Comparing multiple models

Now that we've learned three models and extracted the model weights, we want to evaluate which model is best.

## Compute the RSS on TRAINING data for each of the three models, using our function from earlier

## RSS on Model 1

In [50]:
get_residual_sum_of_squares(model_1, train_data, train_data['price'])

971328233543220.0

## RSS on Model 2

In [51]:
get_residual_sum_of_squares(model_2, train_data, train_data['price'])

961592067855294.8

## RSS on Model 3

In [52]:
get_residual_sum_of_squares(model_3, train_data, train_data['price'])

905276314554878.0

## Compute the RSS on TEST data for each of the three models

In [53]:
get_residual_sum_of_squares(model_1, test_data, test_data['price'])

226568089092706.03

In [54]:
get_residual_sum_of_squares(model_2, test_data, test_data['price'])

224368799993518.56

In [55]:
get_residual_sum_of_squares(model_3, test_data, test_data['price'])

251829318951338.28
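The RSS values printed above can be compared side by side. The sketch below copies them into two dictionaries (the names `rss_train` and `rss_test` are introduced here for illustration) and picks the model with the lowest RSS on each data set.

```python
# RSS values copied from the outputs above
rss_train = {'model_1': 971328233543220.0,
             'model_2': 961592067855294.8,
             'model_3': 905276314554878.0}
rss_test = {'model_1': 226568089092706.03,
            'model_2': 224368799993518.56,
            'model_3': 251829318951338.28}

# The model with the smallest RSS fits that data set best
best_on_train = min(rss_train, key=rss_train.get)
best_on_test = min(rss_test, key=rss_test.get)
print(best_on_train)  # model_3
print(best_on_test)   # model_2
```

Model 3, with the most features, has the lowest RSS on the training data but the highest on the test data, while Model 2 has the lowest test RSS. Lower training error does not by itself mean better predictions on unseen houses.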

# Good Luck !!!