## Week 5, Boston Housing
UIS CSC 570R - Data Science Essentials<br>
2017 Fall<br>
Jason Burrell<br>

Linear Regression to estimate the cost of housing in Boston, based on https://github.com/mbernico/CS570/blob/master/module_2/boston_assignment.ipynb


## Boston Housing Assignment

In this assignment you'll be using linear regression to estimate the cost of houses in Boston, using a well-known dataset.

Goals:
+  Measure the performance of the model I created using $R^{2}$ and MSE
> Learn how to use sklearn.metrics.r2_score and sklearn.metrics.mean_squared_error
+  Implement a new model using regularization
> Use sklearn.linear_model.Ridge (L2 penalty) or sklearn.linear_model.Lasso (L1 penalty)
+  Get the best model you can by optimizing the regularization parameter.   
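
For reference, the two regularized models named in the goals differ only in the penalty they add to the squared-error loss. As I understand the scikit-learn documentation, with coefficient vector $w$, $n$ training samples, and regularization strength $\alpha$ (the parameter to optimize), the objectives are:

$$
\text{Ridge:}\quad \min_{w}\; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the two losses are scaled differently, the same $\alpha$ value does not imply the same amount of regularization in both models.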

In [1]:
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression



In [2]:
bean = datasets.load_boston()
print(bean.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [3]:
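def load_boston():
    """Load the Boston data, standardize the features, and return a train/test split."""
    scaler = StandardScaler()
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    # Note: the scaler is fit on all rows before splitting.
    X = scaler.fit_transform(X)
    # No random_state is set, so the split differs between runs.
    return train_test_split(X, y)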
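
The load_boston helper above standardizes all rows and then splits, so the test rows contribute to the mean and standard deviation used for scaling. A minimal sketch of the alternative order (split first, then scale with training statistics only) is shown here. It uses plain numpy and synthetic data rather than the Boston dataset, and `split_then_standardize`, `test_fraction`, and `seed` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def split_then_standardize(X, y, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle, split, then scale using training statistics only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]

    X_train, X_test = X[train_idx], X[test_idx]
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns

    # The test rows are scaled with the training mean and std, never their own.
    return (X_train - mean) / std, (X_test - mean) / std, y[train_idx], y[test_idx]

# Synthetic stand-in data (not the Boston dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = split_then_standardize(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Fixing the seed also makes the split repeatable, which makes metrics comparable between runs.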
    

In [4]:
X_train, X_test, y_train, y_test = load_boston()

In [5]:
X_train.shape

(379, 13)

### Fitting a Linear Regression

Fitting is as easy as instantiating a new regression object (line 1) and giving that object your training data
(line 2) by calling .fit(independent variables, dependent variable)



In [6]:

clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
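
To make the fit step less of a black box: an ordinary least-squares fit looks for the intercept and coefficients that minimize the sum of squared residuals. The sketch below does this with numpy on a tiny noise-free toy dataset. It is an illustration of the idea under those assumptions, not scikit-learn's actual implementation, and the `predict` function here is a stand-in written for this example.

```python
import numpy as np

# Toy data generated from y = 1 + 2*x1 - 3*x2 with no noise (illustrative only).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted weight acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Solve min_w ||A w - y||^2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = w[0], w[1:]

def predict(X_new):
    """Apply the fitted intercept and coefficients to new rows."""
    return intercept + X_new @ coef

print(intercept, coef)
```

On a fitted LinearRegression object, the corresponding values are available as the intercept_ and coef_ attributes.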

### Making a Prediction
X_test is our holdout set of data.  We know the answer (y_test) but the computer does not.   

Using the command below, I create a tuple for each observation, where I'm combining the real value (y_test) with
the value our regressor predicts (clf.predict(X_test))

Use a similar format to get your r2 and mse metrics working.  Use the [scikit-learn API](http://scikit-learn.org/stable/modules/model_evaluation.html) if you need help!

In [7]:
list(zip(y_test, clf.predict(X_test)))

[(18.199999999999999, 19.662362593707851),
 (23.0, 23.765219951453126),
 (17.5, 16.481847413521429),
 (23.300000000000001, 25.617926601065271),
 (36.399999999999999, 32.806756107367669),
 (23.399999999999999, 23.73243634720216),
 (19.0, 14.652860511621414),
 (30.100000000000001, 29.918343850902829),
 (17.0, 22.339683850280036),
 (20.100000000000001, 15.945787314147164),
 (22.199999999999999, 24.153243826543967),
 (21.0, 22.890489804229944),
 (21.800000000000001, 20.334598700800754),
 (14.5, 18.301996393411606),
 (14.6, 7.7114083883951032),
 (27.5, 32.472673063477487),
 (23.699999999999999, 9.9252028395426599),
 (13.9, 17.66776435991892),
 (22.600000000000001, 23.594422612163566),
 (19.899999999999999, 17.918600437253541),
 (43.5, 39.051792135987881),
 (17.800000000000001, 23.02665100561449),
 (7.0, -5.8250057383508427),
 (21.800000000000001, 21.202913067719575),
 (21.699999999999999, 20.531833775530693),
 (18.699999999999999, 20.637775164175704),
 (14.199999999999999, 18.45402414524683

In [8]:
y_pred_linear = clf.predict(X_test)
r2_score(y_test, y_pred_linear), mean_squared_error(y_test, y_pred_linear)

(0.70510085614841167, 25.409086173432673)
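
As a sanity check on what these two numbers mean, here is a small pure-Python sketch of the formulas: MSE is the mean of the squared residuals, and $R^{2}$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The functions `mse` and `r2` are written for this illustration and run on made-up numbers, not the Boston data.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean of y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(mse(y_true, y_pred))           # 0.375
print(round(r2(y_true, y_pred), 4))  # 0.9486
```

MSE is in squared units of the target, so the MSE of about 25.4 above corresponds to a root-mean-squared error of about 5.04 in the target's own units.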

In [9]:
# Record the baseline metrics in the same dict format used for the regularized models below.
Linear_tests = [{'model': 'LinearRegression',
                'r2': r2_score(y_test, y_pred_linear),
                'mse': mean_squared_error(y_test, y_pred_linear),
               }]
Linear_tests

[{'model': 'LinearRegression',
  'mse': 25.409086173432673,
  'r2': 0.70510085614841167}]

In [10]:
def measure_perf_of_model(model, alpha, X_train, y_train, X_test, y_test):
    """Fit model(alpha=alpha) on the training data and score it on the test data."""
    clf = model(alpha=alpha)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    out = {'model': model.__name__, 'alpha': alpha, 'r2': r2, 'mse': mse}
    #print('%(model)s(alpha=%(alpha)f): r2_score = %(r2)f, mean_squared_error = %(mse)f' % out)
    return out

from sklearn.linear_model import Ridge
measure_perf_of_model(Ridge, 1.0, X_train, y_train, X_test, y_test)

{'alpha': 1.0,
 'model': 'Ridge',
 'mse': 25.375887266453876,
 'r2': 0.70548616434792866}
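
To see what the alpha parameter does to the coefficients, here is a minimal numpy sketch of the closed-form ridge solution on centered data, with the intercept left unpenalized. It is meant to mirror the objective that Ridge minimizes, not the solver scikit-learn actually uses. The function `ridge_fit` and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc.
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=60)

# The coefficient norm shrinks as alpha grows; alpha=0 is plain least squares.
for alpha in (0.0, 1.0, 10.0, 100.0):
    _, coef = ridge_fit(X, y, alpha)
    print(alpha, np.linalg.norm(coef))
```

With alpha=1.0 the penalty is small relative to the data term, which is consistent with the Ridge result above being almost identical to the LinearRegression baseline.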

In [11]:
import numpy

alphas = numpy.append(numpy.linspace(0.0001, 100, 400), numpy.arange(10, 100))

Ridge_tests = [measure_perf_of_model(Ridge, a, X_train, y_train, X_test, y_test) for a in alphas]

from sklearn.linear_model import Lasso
Lasso_tests = [measure_perf_of_model(Lasso, a, X_train, y_train, X_test, y_test) for a in alphas]


In [12]:
import pandas
df = pandas.DataFrame(Linear_tests + Ridge_tests + Lasso_tests)
df.describe()

| | alpha | mse | r2 |
|---|---|---|---|
| count | 980.0 | 981.0 | 981.0 |
| mean | 50.826571 | 54.907123 | 0.362745 |
| std | 28.487 | 29.982503 | 0.347978 |
| min | 0.0001 | 25.242233 | -0.000551 |
| 25% | 26.315863 | 25.639464 | -0.000551 |
| 50% | 50.938621 | 26.496308 | 0.692483 |
| 75% | 75.438621 | 86.209461 | 0.702427 |
| max | 100.0 | 86.209461 | 0.707037 |
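
In the summary above, the minimum and the 25% quantile of r2 are both -0.000551, and the 75% quantile and maximum of mse are both 86.209461, so at least a quarter of the runs share one identical result. One plausible explanation (an assumption on my part, not checked against the fitted coefficients) is that Lasso with a large alpha shrinks every coefficient to zero and predicts the training mean for every row. A constant prediction gives an $R^{2}$ at or slightly below zero on held-out data, as this pure-Python sketch with made-up numbers shows:

```python
def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean of y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up target values, for illustration only.
y_train = [20.0, 22.0, 25.0, 30.0, 18.0]
y_test = [21.0, 24.0, 27.0, 19.0]

# A model with all coefficients at zero predicts the training mean everywhere.
train_mean = sum(y_train) / len(y_train)  # 23.0
constant_pred = [train_mean] * len(y_test)

print(r2(y_test, constant_pred))  # about -0.0068, slightly below zero

# Predicting the test mean itself gives exactly zero.
test_mean = sum(y_test) / len(y_test)     # 22.75
print(r2(y_test, [test_mean] * len(y_test)))
```

The score is negative only because the training mean and the test mean differ slightly.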


In [13]:
df[df.mse == df.mse.min()]

| | alpha | model | mse | r2 |
|---|---|---|---|---|
| 51 | 12.531416 | Ridge | 25.242233 | 0.707037 |


In [14]:
df[df.r2 == df.r2.max()]

| | alpha | model | mse | r2 |
|---|---|---|---|---|
| 51 | 12.531416 | Ridge | 25.242233 | 0.707037 |


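Ridge(alpha=12.531416) is the best model from this run: it has both the lowest MSE (25.242233) and the highest $R^{2}$ (0.707037) on the test set, a small improvement over plain LinearRegression (MSE 25.409086, $R^{2}$ 0.705101).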
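
The alpha above was picked by comparing scores on the test set, so the reported score is likely a little optimistic for that alpha. A common alternative is to choose alpha by cross-validation on the training data only and touch the test set once at the end. Below is a minimal numpy sketch of that idea, using a closed-form ridge fit and synthetic data as a stand-in for X_train and y_train. The functions `ridge_fit` and `cv_mse` and the log-spaced grid are assumptions made for this illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ coef, coef

def cv_mse(X, y, alpha, n_folds=5, seed=0):
    """Mean validation MSE over n_folds shuffled folds of the training data."""
    order = np.random.default_rng(seed).permutation(len(X))
    scores = []
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)  # every index not in the held-out fold
        intercept, coef = ridge_fit(X[train], y[train], alpha)
        residuals = y[fold] - (intercept + X[fold] @ coef)
        scores.append(np.mean(residuals ** 2))
    return float(np.mean(scores))

# Synthetic stand-in for X_train / y_train (not the Boston data).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(120, 8))
y_train = X_train @ rng.normal(size=8) + rng.normal(scale=2.0, size=120)

# Log-spaced grid: regularization strength is usually explored by order of magnitude.
alphas = np.logspace(-3, 3, 13)
scores = [cv_mse(X_train, y_train, a) for a in alphas]
best_alpha = alphas[int(np.argmin(scores))]
print(best_alpha)
```

scikit-learn provides sklearn.linear_model.RidgeCV and sklearn.model_selection.GridSearchCV for this kind of search, so in practice the loop would not need to be written by hand.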