# Lab 7b - Regularisation

### Regularised regression with Ridge & LASSO

- FUNCTIONS: Ridge, RidgeCV, Lasso, LassoCV
- DOCUMENTATION: http://scikit-learn.org/stable/modules/linear_model.html
- DATA: 
  - Dataset 'Crime' (n=319 non-null, p=122, type=regression)
    - This data set contains data on violent crimes within a community.
    - Data Dictionary: http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
  - Dataset 'boston' 
    - This data set contains Boston house prices and candidate predictors.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

In [2]:
# read data, remove categorical features, remove rows with missing values
crime = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', 
                    header=None, na_values=['?'])
crime = crime.iloc[:, 5:]
crime.dropna(inplace=True)
crime.head()

Unnamed: 0,5,6,7,8,9,10,11,12,13,14,...,118,119,120,121,122,123,124,125,126,127
0,0.19,0.33,0.02,0.9,0.12,0.17,0.34,0.47,0.29,0.32,...,0.12,0.26,0.2,0.06,0.04,0.9,0.5,0.32,0.14,0.2
16,0.15,0.31,0.4,0.63,0.14,0.06,0.58,0.72,0.65,0.47,...,0.06,0.39,0.84,0.06,0.06,0.91,0.5,0.88,0.26,0.49
20,0.25,0.54,0.05,0.71,0.48,0.3,0.42,0.48,0.28,0.32,...,0.09,0.46,0.05,0.09,0.05,0.88,0.5,0.76,0.13,0.34
21,1.0,0.42,0.47,0.59,0.12,0.05,0.41,0.53,0.34,0.33,...,1.0,0.07,0.15,1.0,0.35,0.73,0.0,0.31,0.21,0.69
23,0.11,0.43,0.04,0.89,0.09,0.06,0.45,0.48,0.31,0.46,...,0.16,0.12,0.07,0.04,0.01,0.81,1.0,0.56,0.09,0.63


In [3]:
# optional: read column names:
crimenames = pd.read_csv('communities.data.names', header=None)
crimenames = crimenames.iloc[5:, :]
crimenames.head()

Unnamed: 0,0
5,population
6,householdsize
7,racepctblack
8,racePctWhite
9,racePctAsian


In [4]:
# define X and y
X = crime.iloc[:, :-1]
y = crime.iloc[:, -1]

# split into train/test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [5]:
# How many columns are in X?
X.shape

(319, 122)

### Linear Regression Model Without Regularisation 

In [6]:
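# linear regression
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
lm.intercept_  # not displayed: a notebook cell only shows its last expression
lm.coef_       # one fitted coefficient per feature (122 values)
# What are these numbers?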



array([ -3.66188167e+00,   6.98124465e-01,  -2.61955467e-01,
        -2.85270027e-01,  -1.64740837e-01,   2.46972333e-01,
        -1.09290051e+00,  -5.96857796e-01,   1.11200239e+00,
        -7.21968931e-01,   4.27346598e+00,  -2.28040268e-01,
         8.04875769e-01,  -2.57934732e-01,  -2.63458023e-01,
        -1.04616958e+00,   6.07784197e-01,   7.73552561e-01,
         5.96468029e-02,   6.90215922e-01,   2.16759430e-02,
        -4.87802949e-01,  -5.18858404e-01,   1.39478815e-01,
        -1.24417942e-01,   3.15003821e-01,  -1.52633736e-01,
        -9.65003927e-01,   1.17142163e+00,  -3.08546690e-02,
        -9.29085548e-01,   1.24654586e-01,   1.98104506e-01,
         7.30804821e-01,  -1.77337294e-01,   8.32927588e-02,
         3.46045601e-01,   5.01837338e-01,   1.57062958e+00,
        -4.13478807e-01,   1.39350802e+00,  -3.49428114e+00,
         7.09577818e-01,  -8.32141352e-01,  -1.39984927e+00,
         1.02482840e+00,   2.13855006e-01,  -6.18937325e-01,
         5.28954490e-01,

In [7]:
st.describe(lm.coef_)

DescribeResult(nobs=122, minmax=(-36.794120528694357, 36.715295684774787), mean=-0.0083246317121393465, variance=23.294819823693572, skewness=-0.021465688332913447, kurtosis=53.062383694622255)

In [8]:
# make predictions and evaluate
import numpy as np
from sklearn import metrics
preds = lm.predict(X_test)
print('RMSE (no regularisation) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))

RMSE (no regularisation) = 0.233813676495
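
One way to see why an unregularised fit can generalise poorly when the number of features is close to the number of training rows is to compare training and test error. The following is a minimal sketch on synthetic data, not the crime data; the sizes, seed and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, p = 60, 200, 50   # hypothetical sizes, with p close to n_train

# Synthetic data: only the first 5 features carry signal
w_true = np.zeros(p)
w_true[:5] = 1.0
X_tr = rng.normal(size=(n_train, p))
X_te = rng.normal(size=(n_test, p))
y_tr = X_tr @ w_true + rng.normal(size=n_train)
y_te = X_te @ w_true + rng.normal(size=n_test)

# Ordinary least squares (no intercept: the synthetic data have zero mean)
w_ols, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

rmse_train = rmse(y_tr, X_tr @ w_ols)
rmse_test = rmse(y_te, X_te @ w_ols)
print('train RMSE =', rmse_train)
print('test RMSE  =', rmse_test)
```

A large gap between the two numbers suggests the model is fitting noise in the training rows, which is the situation regularisation is meant to address.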


### Ridge Regression Model 

In [9]:
# ridge regression (alpha must be positive, larger means more regularisation)
from sklearn.linear_model import Ridge
rreg = Ridge(alpha=0.1, normalize=True)
rreg.fit(X_train, y_train)
rreg.coef_
preds = rreg.predict(X_test)
print('RMSE (Ridge reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))
# Is this model better? Why?

RMSE (Ridge reg.) = 0.164279068049
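
To build intuition for what `alpha` does, ridge regression without an intercept has a closed-form solution, `(X'X + alpha*I)^-1 X'y`. The sketch below uses plain NumPy on synthetic data with two nearly identical features; `ridge_fit` is a hypothetical helper, and the scaling of `alpha` will not match the `normalize=True` fits in this lab.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.normal(size=(n, p))
# Make feature 1 nearly a copy of feature 0 (strongly correlated predictors)
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X[:, 0] + rng.normal(size=n)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_fit(X, y, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print('alpha =', a, ' ||w|| =', nrm)
```

The coefficient norm shrinks as `alpha` grows, so the large offsetting coefficients that correlated features produce in an unpenalised fit are damped.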


#### Ridge Regression with Cross-Validation 

In [10]:
# use RidgeCV to select best alpha:
from sklearn.linear_model import RidgeCV
alpha_range = 10.**np.arange(-2, 3)
rregcv = RidgeCV(normalize=True, scoring='neg_mean_squared_error', alphas=alpha_range)
rregcv.fit(X_train, y_train)

# Print the optimal value of Alpha for Ridge Regression
print('Optimal Alpha Value: ', rregcv.alpha_)

# Print the RMSE for the ridge regression model
preds = rregcv.predict(X_test)
print('RMSE (Ridge CV reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))
# What is the range of alpha values we are searching over?

Optimal Alpha Value:  1.0
RMSE (Ridge CV reg.) = 0.163129782343
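
The grid `10.**np.arange(-2, 3)` holds one value per power of ten, and cross-validation picks the grid value with the lowest held-out error. Below is a rough NumPy sketch of that selection on synthetic data. By default `RidgeCV` uses an efficient form of leave-one-out cross-validation; the sketch uses 5 contiguous folds only to keep the loop easy to follow, and `ridge_fit` and `cv_mse` are hypothetical helpers.

```python
import numpy as np

alpha_range = 10.0 ** np.arange(-2, 3)
print(alpha_range)  # five values, from 0.01 up to 100

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_mse(X, y, alpha, k=5):
    """Mean squared error of ridge, averaged over k contiguous folds."""
    all_rows = np.arange(len(y))
    errors = []
    for held_out in np.array_split(all_rows, k):
        train = np.setdiff1d(all_rows, held_out)
        w = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((y[held_out] - X[held_out] @ w) ** 2))
    return np.mean(errors)

# Synthetic data: 3 informative features out of 30
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 30))
y = X[:, :3].sum(axis=1) + rng.normal(size=80)

scores = [cv_mse(X, y, a) for a in alpha_range]
best_alpha = alpha_range[int(np.argmin(scores))]
print('best alpha =', best_alpha)
```

If the best value sits at either end of the grid, it is usually worth widening the grid and searching again.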


### LASSO Regression Model 

In [11]:
# lasso (alpha must be positive, larger means more regularisation)
from sklearn.linear_model import Lasso
las = Lasso(alpha=0.01, normalize=True)
las.fit(X_train, y_train)
las.coef_
preds = las.predict(X_test)
print('RMSE (Lasso reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))

RMSE (Lasso reg.) = 0.198165225429


In [12]:
# try a smaller alpha
las = Lasso(alpha=0.001, normalize=True)
las.fit(X_train, y_train)
las.coef_
preds = las.predict(X_test)
print('RMSE (Lasso reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))


RMSE (Lasso reg.) = 0.160039024044
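
Lasso differs from ridge in that it can set coefficients to exactly zero. In the idealised case of an orthonormal design, the L1 penalty acts on each unpenalised coefficient through a soft-thresholding step, while an L2 penalty only rescales it. The sketch below illustrates that contrast; the input vector and the threshold of 1.0 are made-up values.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z towards zero by t; entries smaller than t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.5, 1.0, -2.0])   # hypothetical unpenalised coefficients

lasso_like = soft_threshold(z, 1.0)     # L1-style shrinkage
ridge_like = z / (1.0 + 1.0)            # L2-style shrinkage: rescales, never exactly zero

print('L1-style:', lasso_like)
print('L2-style:', ridge_like)
print('non-zero after L1:', np.count_nonzero(lasso_like))
print('non-zero after L2:', np.count_nonzero(ridge_like))
```

On a fitted model from this lab, `np.count_nonzero(las.coef_)` gives the number of features the Lasso kept.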


In [13]:
# use LassoCV to select best alpha (by default it tries a grid of 100 alphas; here we pass 10 explicitly)
from sklearn.linear_model import LassoCV
alpha_range = 10.**np.arange(-5, 5)
print(alpha_range)
lascv = LassoCV(normalize=True, alphas=alpha_range)
lascv.fit(X_train, y_train)
print('Optimal Alpha Value: ',lascv.alpha_)
lascv.coef_
preds = lascv.predict(X_test)
print('RMSE (Lasso CV reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))

[  1.00000000e-05   1.00000000e-04   1.00000000e-03   1.00000000e-02
   1.00000000e-01   1.00000000e+00   1.00000000e+01   1.00000000e+02
   1.00000000e+03   1.00000000e+04]
Optimal Alpha Value:  0.001
RMSE (Lasso CV reg.) = 0.160039024044
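
The cells above pass `normalize=True`, an argument that recent scikit-learn releases no longer accept. A possible replacement is to scale the features in a pipeline, sketched below on synthetic stand-in data (`X_demo` and `y_demo` are assumptions, not the crime data). `StandardScaler` divides by the standard deviation, whereas `normalize=True` divided by the L2 norm, so the selected `alpha` is not directly comparable with the values printed above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (hypothetical data, not the crime set)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(100, 20)) * rng.uniform(0.1, 10.0, size=20)
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=100)

alpha_range = 10.0 ** np.arange(-5, 5)

# The scaler is fitted once on the data passed to fit();
# LassoCV then runs its internal cross-validation on the scaled features
model = make_pipeline(StandardScaler(), LassoCV(alphas=alpha_range, cv=5, max_iter=10000))
model.fit(X_demo, y_demo)

lasso_step = model.named_steps['lassocv']
print('Optimal Alpha Value: ', lasso_step.alpha_)
print('Non-zero coefficients:', np.count_nonzero(lasso_step.coef_))
```

Calling `model.predict(...)` on new rows applies the same scaling before predicting, so the test set never needs to be scaled by hand.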




### Task 1: Elastic Net Regularised Regression

#### Look up [Elastic Net](http://scikit-learn.org/stable/modules/linear_model.html#elastic-net) and complete the following.


(1) What is elastic net?

Ridge (L2) and Lasso (L1) regularisation differ in how they handle correlated predictors: Ridge tends to spread the coefficient weight roughly evenly across them, whereas Lasso tends to put the weight on one of them and shrink the others to exactly zero. Elastic Net combines both behaviours: it tends either to keep a group of correlated predictors together, with similar weights on each, or to drop the whole group.

Elastic Net therefore tends to select more predictors than Lasso, spreading the weight among them. Unlike Lasso, which can select at most n predictors when p > n, Elastic Net places no limit on the number of selected variables.

(2) How does it work?

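Elastic net minimises the objective function

` 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2 `

This combines the L1 penalty from lasso with the L2 penalty from ridge regression: `alpha` sets the overall penalty strength and `l1_ratio` sets the mix, with `l1_ratio = 1` giving the lasso penalty and `l1_ratio = 0` a pure L2 penalty.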
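
To make the formula concrete, the objective quoted above can be evaluated directly for a given weight vector. This is a small sketch with made-up numbers; `elastic_net_objective` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Evaluate the elastic net objective for a given weight vector w."""
    n_samples = X.shape[0]
    residual = y - X @ w
    loss = residual @ residual / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * np.sum(np.abs(w))
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)
    return loss + l1_penalty + l2_penalty

# Tiny hypothetical example
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -2.0])

for ratio in (0.0, 0.5, 1.0):
    value = elastic_net_objective(X, y, w, alpha=0.1, l1_ratio=ratio)
    print('l1_ratio =', ratio, ' objective =', value)
```

Only the penalty terms change with `l1_ratio`; the squared-error term is the same in all three cases.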

(3) Run elastic net on the above dataset.

In [14]:
# Set up and run the elastic net model
from sklearn.linear_model import ElasticNetCV
l1_ratio_list = [.1, .5, .7, .9, .95, .99, 1]
enet = ElasticNetCV(l1_ratio=l1_ratio_list, eps=0.001, n_alphas=100, fit_intercept=True, normalize=False, 
                    precompute='auto', max_iter=10000, tol=0.0001, cv=None, 
                    copy_X=True, verbose=0, n_jobs=1, positive=False, random_state=None, selection='cyclic')

enet.fit(X_train, y_train)
print('Optimal Alpha Value: ',enet.alpha_, ',  Optimal l1_ratio: ', enet.l1_ratio_)
enet.coef_
preds = enet.predict(X_test)
print('RMSE (Elastic Net CV reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds)))

Optimal Alpha Value:  0.0257031221707 ,  Optimal l1_ratio:  0.1
RMSE (Elastic Net CV reg.) = 0.159971754497


### Task 2: Carry out Regularised Regression

(1) Run all three forms of regularised regression on the Boston Housing dataset.

(2) What do the coefficients mean?

(3) What would you advise someone living in Boston to try and raise the value of their home?


In [15]:
# load libraries and data:
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston

# Nb. the sklearn.datasets.load_*() functions return a 'bunch' object:
# (ref = http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_mldata.html)
boston = load_boston()

# standardise variables:
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])

Y = boston["target"]
names = boston["feature_names"]

# Build Lasso models over a range of alpha:
alpha_range = 10.**np.arange(-5, 5)
print('Alpha Range: ', alpha_range, '\n')
lasso_model = LassoCV(normalize=True, alphas=alpha_range)
lasso_model.fit(X, Y)

# helper method for pretty-printing linear models
def pretty_print_linear(coefs, names = None, sort = False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst,  key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                                   for coef, name in lst)
  
print("Lasso model: ", pretty_print_linear(lasso_model.coef_, names, sort = True))

Alpha Range:  [  1.00000000e-05   1.00000000e-04   1.00000000e-03   1.00000000e-02
   1.00000000e-01   1.00000000e+00   1.00000000e+01   1.00000000e+02
   1.00000000e+03   1.00000000e+04] 

Lasso model:  -3.717 * LSTAT + 2.972 * RM + -1.772 * PTRATIO + -1.564 * DIS + -0.992 * NOX + 0.664 * B + 0.597 * CHAS + -0.308 * CRIM + 0.305 * ZN + -0.0 * INDUS + -0.0 * AGE + 0.0 * RAD + -0.0 * TAX
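
Because the predictors were standardised before fitting, each coefficient describes the effect of a one-standard-deviation change in that feature. Dividing by the feature's standard deviation converts it to a per-unit effect. The sketch below shows the conversion on synthetic, noise-free data with two made-up features on very different scales.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Two hypothetical features on very different scales
X_raw = np.column_stack([rng.normal(5.0, 1.0, n), rng.normal(300.0, 100.0, n)])
y = 2.0 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + 3.0   # noise-free, so the slopes are recovered exactly

# Standardise as StandardScaler does: subtract the mean, divide by the standard deviation
mean, scale = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mean) / scale

# Least squares with an explicit intercept column
A = np.column_stack([np.ones(n), X_std])
coefs = np.linalg.lstsq(A, y, rcond=None)[0]
coef_std = coefs[1:]            # effect of a one-standard-deviation change
coef_raw = coef_std / scale     # effect of a one-unit change in the original units

print('per standard deviation:', coef_std)
print('per original unit:     ', coef_raw)
```

In this lab the analogous conversion would be `lasso_model.coef_ / scaler.scale_`, since `StandardScaler` stores the per-feature standard deviations in `scale_`.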


In [16]:
# Build cross-validated Ridge regression model
from sklearn.linear_model import RidgeCV

alpha_range = 10.**np.arange(-2, 3)
ridge_model = RidgeCV(normalize=True, scoring='neg_mean_squared_error', alphas=alpha_range)
# fit on the standardised Boston data (X, Y); X_train and y_train still hold the crime split from In [4]
ridge_model.fit(X, Y)

print("Ridge model: ", pretty_print_linear(ridge_model.coef_, names, sort = True))


In [17]:
# Build cross-validated Elastic Net Model
from sklearn.linear_model import ElasticNetCV
l1_ratio_list = [.1, .5, .7, .9, .95, .99, 1]
enet = ElasticNetCV(l1_ratio=l1_ratio_list, eps=0.001, n_alphas=100, fit_intercept=True, normalize=False, 
                    precompute='auto', max_iter=10000, tol=0.0001, cv=None, 
                    copy_X=True, verbose=0, n_jobs=1, positive=False, random_state=None, selection='cyclic')

# fit on the standardised Boston data (X, Y); X_train and y_train still hold the crime split from In [4]
enet.fit(X, Y)

print("Elastic Net model: ", pretty_print_linear(enet.coef_, names, sort = True))


In [18]:
# Print dataset description
print(boston.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

### *Interpret Results*

The Lasso coefficients, from the model fit on the standardised Boston data, rank LSTAT (-3.717), RM (2.972), PTRATIO (-1.772), DIS (-1.564) and NOX (-0.992) as the strongest predictors, and shrink INDUS, AGE, RAD and TAX to zero. Because the predictors were standardised, each coefficient is the change in the target (median home value) associated with a one-standard-deviation increase in that feature, holding the others fixed.

Most of these features describe the neighbourhood, which a home owner can do very little about, so location is the primary predictor. RM is the main exception: more rooms are associated with a higher value, so adding a room is the one lever an owner controls directly. The remaining recommendations concern where to buy: avoid poorer socioeconomic population areas (high LSTAT), and look for a low pupil-teacher ratio and low NOX levels. CHAS has a positive coefficient (0.597), so tracts bounding the Charles River are associated with higher values. These are associations in the data rather than guaranteed causal effects.