## Boston Housing Assignment

In this assignment you'll use linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using regularization
> Use sklearn.linear_model.Ridge (L2 penalty) or sklearn.linear_model.Lasso (L1 penalty)
+  Get the best model you can by optimizing the regularization parameter.   

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
bean = datasets.load_boston()
print bean.DESCR

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
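def load_boston():
    # Load the data, standardize the features, and split into train/test sets
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    X = scaler.fit_transform(X)
    # train_test_split returns X_train, X_test, y_train, y_test
    return train_test_split(X, y)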
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379L, 13L)

### Fitting a Linear Regression

Fitting takes two steps: instantiate a new regression object (line 1 of the cell below), then give it your training data
(line 2) by calling .fit(independent variables, dependent variable).



In [6]:

clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working.  Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [7]:
zip (y_test, clf.predict(X_test))

[(19.100000000000001, 16.689455476234855),
 (22.5, 18.00387192198135),
 (27.100000000000001, 27.664138041661296),
 (22.699999999999999, 21.869041894915917),
 (24.100000000000001, 20.884018522847377),
 (29.600000000000001, 24.625122881343263),
 (21.199999999999999, 23.157707925045219),
 (34.899999999999999, 33.700968178389779),
 (12.1, 17.635214637911432),
 (20.100000000000001, 15.811584290271346),
 (32.0, 33.250990144419887),
 (13.800000000000001, 16.211620231701076),
 (17.300000000000001, 16.408679610704663),
 (16.600000000000001, 15.335230832988135),
 (18.800000000000001, 20.69190527919412),
 (27.5, 31.819358543720959),
 (16.100000000000001, 18.062591602622529),
 (32.700000000000003, 30.344374712096226),
 (17.800000000000001, 21.559146670648275),
 (18.5, 19.463003197290025),
 (19.899999999999999, 18.47419281752461),
 (13.9, 13.44266892373671),
 (30.300000000000001, 33.054338295064646),
 (7.0, 8.4152248712337769),
 (19.600000000000001, 18.925147149479834),
 (19.100000000000001, 19.279

### Making a Prediction:  R^2 (coefficient of determination) regression score function
R^2 measures how well the model fits the data.  The goal is to get as close to 1 as possible: 1 means the model predicts the data perfectly, 0 means it does no better than always predicting the mean of the dependent variable, and the score can even be negative for a model that does worse than that.
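
As a rough illustration of what r2_score computes, the sketch below rebuilds the score by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The toy arrays and variable names are made up for this example and are not part of the Boston data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values invented for illustration (not taken from the Boston data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 1.5 / 20 = 0.925

# The library call should agree with the hand computation
r2_sklearn = r2_score(y_true, y_pred)

# Predicting the mean for every observation scores exactly 0
r2_mean_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))
```

Here the hand-computed value and r2_score should match, and the mean-only baseline shows where a score of 0 comes from.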


In [28]:
#import r2_score
from sklearn.metrics import r2_score
#import Lasso
from sklearn.linear_model import Lasso

In [29]:
alpha = 0.1
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.75453347189497411

A coefficient of determination of about 0.75 indicates that the model fits the data reasonably well: it explains roughly 75% of the variance in the test-set prices.

In [31]:
#Trying a different alpha
alpha = 0.2
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.73420547407869918

Increasing alpha from 0.1 to 0.2 moved our coefficient of determination farther from 1 (about 0.734 versus 0.755).

In [32]:
#Trying a decrease in alpha
alpha = 0.08
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.758942529151996

In [33]:
#Trying a decrease in alpha
alpha = 0.06
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.76273282129931486

In [34]:
#Trying a decrease in alpha
alpha = 0.04
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.76591016979004933

In [35]:
#Trying a decrease in alpha
alpha = 0.02
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.76843526759152736

In [36]:
#Trying a decrease in alpha
alpha = 0.008
lasso = Lasso(alpha=alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_score(y_test, y_pred_lasso)

0.7695598523708822

Making alpha smaller does get us closer to 1, but the change is small: R^2 rises only from about 0.755 at alpha = 0.1 to about 0.770 at alpha = 0.008.
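
The repeated cells above could be collapsed into a loop. The sketch below is one possible way to do that, not the notebook's own method: it assumes a recent scikit-learn (where train_test_split lives in sklearn.model_selection) and uses a synthetic dataset from make_regression as a stand-in for the Boston data, so its scores will not match the ones printed above. The helper name `sweep` is hypothetical. It also tries Ridge, the L2-regularized model, over the same alpha values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the Boston data (an assumption)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The alpha values tried by hand in the cells above
alphas = [0.2, 0.1, 0.08, 0.06, 0.04, 0.02, 0.008]

def sweep(model_cls):
    """Fit model_cls once per alpha and return {alpha: test-set R^2}."""
    scores = {}
    for alpha in alphas:
        model = model_cls(alpha=alpha)
        y_pred = model.fit(X_train, y_train).predict(X_test)
        scores[alpha] = r2_score(y_test, y_pred)
    return scores

lasso_scores = sweep(Lasso)  # L1 penalty
ridge_scores = sweep(Ridge)  # L2 penalty

# Alpha with the highest test-set R^2 for each model
best_lasso_alpha = max(lasso_scores, key=lasso_scores.get)
best_ridge_alpha = max(ridge_scores, key=ridge_scores.get)
```

Like the cells above, this picks alpha by its score on the holdout set, which tends to flatter the chosen model; a separate validation split would give a more honest estimate.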

### Making a Prediction:  RMSE (Root Mean Square Error) 
<a href=https://upload.wikimedia.org/math/1/7/3/173e0dd312ace976dbc640af8f9014b8.png><img border="0" alt="Wiki" src="https://upload.wikimedia.org/math/1/7/3/173e0dd312ace976dbc640af8f9014b8.png" width="300"> </a>

Root Mean Square Error is one way to judge how good a linear model is. RMSE is calculated by squaring the difference between each prediction and the true value, averaging those squared differences, and taking the square root, which puts the error back in the same units as the dependent variable.
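
To make the calculation concrete, here is a small sketch that computes RMSE step by step and compares it with the value obtained from mean_squared_error. The toy arrays and variable names are invented for illustration only.

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values invented for illustration (not taken from the Boston data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

errors = y_true - y_pred             # per-observation differences
mse_manual = np.mean(errors ** 2)    # mean of squared differences: 1.5 / 4 = 0.375
rmse_manual = math.sqrt(mse_manual)  # back in the units of y: about 0.612

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = math.sqrt(mean_squared_error(y_true, y_pred))
```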


In [25]:
#import mean_squared_error
from sklearn.metrics import mean_squared_error
#import math
import math

In [27]:
#calculate and display RMSE
math.sqrt(mean_squared_error(y_test, clf.predict(X_test)))

4.386605689364285

In [50]:
min(y_test)

7.0

In [51]:
max(y_test)

50.0

Since the dependent variable in the test set ranges from 7 to 50, I would conclude that an RMSE of about 4.39 isn't terrible, although I would like to see it a little smaller.  According to The Analysis Factor (http://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/), RMSE is an absolute measure of fit that shows how accurately the model predicts the response.
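
One rough way to put the RMSE in context is to express it as a fraction of the observed range of the dependent variable. This is only an illustrative convention, not something the analysis above relies on; the numbers are copied from the outputs printed by the notebook, and the variable names are hypothetical.

```python
rmse = 4.386605689364285  # RMSE printed by the notebook
y_min, y_max = 7.0, 50.0  # min(y_test) and max(y_test) printed by the notebook

target_range = y_max - y_min         # 43.0
rmse_fraction = rmse / target_range  # about 0.10, i.e. roughly 10% of the range
```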