## Boston Housing Assignment

In this assignment you'll use linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using L2 or L1 regularization
> Use sklearn.linear_model.Ridge (L2 penalty) or sklearn.linear_model.Lasso (L1 penalty)
+  Get the best model you can by optimizing the regularization parameter.   
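
As a rough illustration of what the two metrics in the first goal measure, the sketch below computes MSE and $R^{2}$ by hand in plain Python. The function names and the toy values are made up for this example and are not part of the assignment; in the notebook itself the sklearn.metrics functions do this work.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def r2_score_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration (not taken from the Boston data).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error_manual(y_true, y_pred)
r2 = r2_score_manual(y_true, y_pred)
print(mse, r2)
```

Here two of the four predictions are off by 1, so the squared errors sum to 2 and the MSE is 2/4 = 0.5. The total sum of squares around the mean (6) is 20, so $R^{2}$ = 1 - 2/20 = 0.9. Lower MSE and higher $R^{2}$ both indicate a better fit.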

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
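def load_boston():
    """Load the Boston data, standardize the features, and split into train/test sets."""
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    # Scale each feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # Random split using the train_test_split defaults (25% held out for testing)
    return train_test_split(X, y)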
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379, 13)

### Fitting a Linear Regression

It's as easy as instantiating a new regression object (line 1) and giving your regression object your training data
(line 2) by calling .fit(independent variables, dependent variable)



In [6]:
clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working.  Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(17.699999999999999, 20.468084407718234),
 (15.4, 18.210280083848915),
 (15.6, 15.706865981828631),
 (23.199999999999999, 17.690916576535301),
 (19.5, 18.926094188834611),
 (34.700000000000003, 30.356779475257607),
 (24.399999999999999, 23.679498870724313),
 (8.3000000000000007, 11.052817808926841),
 (13.1, 20.623027808005965),
 (18.300000000000001, 20.779572924133603),
 (17.399999999999999, 22.94108620835835),
 (24.100000000000001, 20.176818703525669),
 (25.0, 22.42090692530023),
 (29.399999999999999, 31.072913765485644),
 (31.600000000000001, 34.090906772625715),
 (21.600000000000001, 25.073927158084906),
 (22.699999999999999, 24.999306282649407),
 (24.100000000000001, 30.150496440774607),
 (21.100000000000001, 20.686230800029609),
 (16.100000000000001, 22.7073507182165),
 (21.800000000000001, 21.574830688757618),
 (32.899999999999999, 30.914426448698585),
 (28.5, 34.473546444106972),
 (13.1, 14.54034445124363),
 (14.9, 16.143802739508068),
 (14.5, 18.212215047438413),
 (18.39999999

### CSC570R Module 2 - Boston Housing Assignment

Melanie Klein

12 February 2017

In [8]:
# Calculate the MSE for Mike's linear regression model
mean_squared_error(y_test, clf.predict(X_test))

14.505441354853557

In [9]:
# Calculate R-squared for Mike's linear regression model
r2_score(y_test, clf.predict(X_test))

0.77551316806731807

In [10]:
# Implement a Lasso linear regression model
from sklearn.linear_model import Lasso

In [11]:
lasso = Lasso(alpha=0.03)
lasso.fit(X_train, y_train)

Lasso(alpha=0.03, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
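
For reference, the role of `alpha` can be read off the objective that scikit-learn documents for `Lasso` (written here without the intercept term for brevity):

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

where $n$ is the number of training samples, $w$ is the coefficient vector, and $\alpha$ is the regularization strength. `Ridge` uses a squared L2 penalty instead, minimizing $\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$. In both cases $\alpha = 0$ reduces to ordinary least squares, and larger values shrink the coefficients more strongly.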

In [12]:
# Calculate MSE for the Lasso model
mean_squared_error(y_test, lasso.predict(X_test))

14.1226086365391

In [13]:
# Calculate R-squared for the Lasso model
r2_score(y_test, lasso.predict(X_test))

0.78143790361945853

### Comments
<li>The Lasso model with alpha set to 0.03 gives a marginal improvement over the Ordinary Least Squares linear regression model (MSE 14.12 vs. 14.51; $R^{2}$ 0.781 vs. 0.776).
<li>Alpha values between 0 and 1 all produced MSE and $R^{2}$ values similar to those of the Ordinary Least Squares model.
<li>As alpha increased beyond 1, MSE grew and $R^{2}$ decreased, suggesting a less accurate model.
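
The kind of alpha sweep described in these comments could be sketched roughly as follows. This is not the code that produced the numbers above: it uses synthetic data from `make_regression` as a stand-in (an assumption, chosen so the example runs without the Boston loader), and the list of alpha values is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 13 features, like the Boston set.
X, y = make_regression(n_samples=500, n_features=13, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit one Lasso model per alpha and record its test-set MSE and R^2.
results = {}
for alpha in [0.001, 0.01, 0.03, 0.1, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    pred = model.predict(X_test)
    results[alpha] = (mean_squared_error(y_test, pred), r2_score(y_test, pred))
    print(f"alpha={alpha:<7} MSE={results[alpha][0]:10.2f}  R^2={results[alpha][1]:.3f}")

# Pick the alpha with the lowest test MSE.
best_alpha = min(results, key=lambda a: results[a][0])
print("lowest test MSE at alpha =", best_alpha)
```

Selecting alpha on the test set, as done here for simplicity, tends to give an optimistic estimate. Cross-validating on the training data (for example with sklearn.linear_model.LassoCV) is likely a more robust way to choose it.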