# Regularization

In statistics and machine learning, **regularization** is used to help avoid the problem of overfitting.

Models that are overfit can effectively explain observed data but generalize poorly to unseen data. 

Regularization stops a model from overfitting the training data by constraining its learned parameters, which in turn prevents any single variable from having too much influence on the model.

# Formulation

We know that the least squares solution to the linear regression problem, $w_{ls}$, given a set of observations $y$ and data $X$, is equal to $(X^TX)^{-1}X^Ty$, provided $X^TX$ is invertible.
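As a minimal sketch of the closed-form solution above, the snippet below builds a small synthetic problem and compares the normal-equations solution with NumPy's least-squares routine. The data, the seed and the names `w_true`, `w_ls` and `w_ref` are assumptions made for illustration only.

```python
import numpy as np

# Synthetic data (illustrative assumption, not the housing data used later).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_ls)
print(w_ref)
```

Solving the linear system is generally preferred over computing $(X^TX)^{-1}$ explicitly, since it is cheaper and numerically more stable.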

Least squares also has a probabilistic interpretation, which allows us to understand some interesting properties of our weight vector $w$.

If we assume our observations are drawn from a normal distribution with mean $\mu = Xw$ and a diagonal covariance matrix $\Sigma = \sigma^2I$, we can then solve for $w$ using maximum likelihood estimation.

$$ w_{ML} = \arg\max_w \; \ln p(y \mid \mu = Xw, \sigma^2) $$

$$ w_{ML} = \arg\max_w \; -\frac{1}{2\sigma^2}\,\Vert y - Xw \Vert^2 - \frac{n}{2} \ln(2\pi\sigma^2) $$

Least squares (LS) and maximum likelihood (ML) share the same solution:

$$ LS: \; \arg\min_w \; \Vert y - Xw\Vert^2 \;\Leftrightarrow\; ML: \; \arg\max_w \; -\frac{1}{2\sigma^2}\,\Vert y - Xw \Vert^2 $$

In a sense we are making an independent Gaussian noise assumption about the error term $\epsilon$. This is equivalent to $$y_i = x_i^Tw + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2) \;\equiv\; y \sim N(Xw, \sigma^2 I)$$

Under the Gaussian assumption $y \sim N(Xw,\sigma^2 I)$ it can be shown that $$ \mathbb E[w_{ML}]= w$$ and $$Var[w_{ML}]=\sigma^2(X^TX)^{-1}$$
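A short derivation of both results, treating $X$ as fixed and writing $y = Xw + \epsilon$ with $\mathbb E[\epsilon] = 0$ and $Var[\epsilon] = \sigma^2 I$. Substituting into the closed-form estimate gives

$$ w_{ML} = (X^TX)^{-1}X^T(Xw + \epsilon) = w + (X^TX)^{-1}X^T\epsilon $$

Taking the expectation, the noise term vanishes:

$$ \mathbb E[w_{ML}] = w + (X^TX)^{-1}X^T\,\mathbb E[\epsilon] = w $$

For the variance, only the noise term contributes:

$$ Var[w_{ML}] = (X^TX)^{-1}X^T\,(\sigma^2 I)\,X(X^TX)^{-1} = \sigma^2(X^TX)^{-1} $$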

This implies that the maximum likelihood estimate of $w$ is unbiased. However, when the entries of $\sigma^2(X^TX)^{-1}$ are very large, the estimated $w$ is extremely sensitive to the observed $y$.
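One common situation in which this variance becomes large is when columns of $X$ are nearly collinear. The sketch below illustrates this on synthetic data by comparing the diagonal of $\sigma^2(X^TX)^{-1}$ for two designs. The helper `ml_variances`, the data and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 100, 1.0
x1 = rng.normal(size=n)

# Two designs: independent columns vs. two nearly identical columns.
X_indep = np.column_stack([x1, rng.normal(size=n)])
X_collinear = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])

def ml_variances(X, sigma2):
    """Diagonal of sigma^2 (X^T X)^{-1}: the variance of each entry of w_ML."""
    return sigma2 * np.diag(np.linalg.inv(X.T @ X))

var_indep = ml_variances(X_indep, sigma2)
var_collinear = ml_variances(X_collinear, sigma2)

print("independent columns:", var_indep)
print("near-collinear columns:", var_collinear)
```

With nearly collinear columns the variances are several orders of magnitude larger, so small changes in $y$ can move the estimated weights a long way.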

One way of constraining the values in $w$ is to add a penalty term $\lambda g(w)$ to the least squares objective function so that it becomes: $$w = \arg\min_w \; \Vert y - Xw \Vert^2 + \lambda g(w)$$ where $\lambda > 0$ is a regularisation parameter and $g(w) \geq 0$ is a function that encourages certain desired characteristics in $w$.

# Ridge Regression

Ridge regression is one such regularization method: it uses the squared $\ell_2$ norm of the weights as the penalty, $g(w) = \Vert w \Vert^2$.

$$w_{rr} = \arg\min_w \; \Vert y - Xw \Vert^2 + \lambda \Vert w \Vert^2$$

The inclusion of $\Vert w \Vert^2$ as the penalty term penalises large values in $w$.


We can solve for the ridge regression solution in exactly the same manner as we do for ordinary least squares.

$$ L = \Vert y - Xw \Vert ^2 + \lambda \Vert w \Vert^2 = (y - Xw)^T(y-Xw)+\lambda w^Tw $$

To solve, we take the gradient of $L$ with respect to $w$ and set it to zero:

$$\nabla L = -2X^Ty+2X^TXw+2\lambda w = 0$$

It then follows that:

$$ w_{rr} = (\lambda I + X^TX)^{-1}X^Ty $$
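As a sanity check on this closed form, the sketch below evaluates the gradient of the ridge objective at $w_{rr}$ for a few values of $\lambda$ and also records how the norm of the solution changes. The synthetic data and the helper names `ridge_closed_form` and `ridge_gradient` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    """Solve (lam * I + X^T X) w = X^T y for w."""
    return np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)

def ridge_gradient(X, y, w, lam):
    """Gradient of ||y - Xw||^2 + lam * ||w||^2 with respect to w."""
    return -2 * X.T @ y + 2 * X.T @ X @ w + 2 * lam * w

lams = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_closed_form(X, y, lam) for lam in lams]

# The gradient should be (numerically) zero at each closed-form solution.
grad_norms = [np.linalg.norm(ridge_gradient(X, y, w, lam))
              for w, lam in zip(weights, lams)]

# Larger lam shrinks the weights towards zero.
weight_norms = [np.linalg.norm(w) for w in weights]

for lam, g, wn in zip(lams, grad_norms, weight_norms):
    print(f"lambda={lam:6.1f}  |grad|={g:.2e}  |w|={wn:.4f}")
```

Setting $\lambda = 0$ recovers the ordinary least squares solution, and increasing $\lambda$ trades a little bias for a smaller, more stable weight vector.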

# Code Example

Let's take a look at the House Prices dataset from Kaggle.

In [8]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [29]:
df = pd.read_csv("housing-data.csv",index_col=0)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-

In [32]:
x = df.apply(pd.Series.nunique)

x.sort_values()

Utilities          1
CentralAir         2
Street             2
BsmtHalfBath       3
BsmtFullBath       3
KitchenAbvGr       3
LandSlope          3
PavedDrive         3
HalfBath           3
GarageFinish       3
MasVnrType         4
ExterCond          4
BsmtQual           4
BsmtCond           4
BsmtExposure       4
GarageCars         4
Heating            4
KitchenQual        4
LandContour        4
LotShape           4
FullBath           4
Fireplaces         4
ExterQual          4
HeatingQC          5
Electrical         5
Foundation         5
GarageQual         5
BldgType           5
MSZoning           5
RoofStyle          5
                ... 
OverallQual        9
SaleType           9
TotRmsAbvGrd      10
MoSold            12
Exterior1st       14
MSSubClass        15
LowQualFinSF      15
3SsnPorch         16
Exterior2nd       16
MiscVal           17
Neighborhood      25
YearRemodAdd      61
ScreenPorch       63
GarageYrBlt       97
EnclosedPorch     98
LotFrontage      107
BsmtFinSF2   

In [33]:
cols_to_drop = ["PoolQC","Fence","MiscFeature","FireplaceQu","Alley","Utilities"]

df = df[[c for c in df.columns if c not in cols_to_drop]].dropna()

In [34]:
df.head()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | Reg | Lvl | Inside | Gtl | CollgCr | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80.0 | 9600 | Pave | Reg | Lvl | FR2 | Gtl | Veenker | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68.0 | 11250 | Pave | IR1 | Lvl | Inside | Gtl | CollgCr | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60.0 | 9550 | Pave | IR1 | Lvl | Corner | Gtl | Crawfor | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84.0 | 14260 | Pave | IR1 | Lvl | FR2 | Gtl | NoRidge | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | WD | Normal | 250000 |


In [94]:
df = pd.get_dummies(df,drop_first=True)

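# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
# dtype=float avoids an object array when dummy columns are boolean.
X = df.drop("SalePrice", axis=1).to_numpy(dtype=float)

y = df["SalePrice"].to_numpy()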

In [97]:
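def standardize_matrix(A):
    # Scale every column to zero mean and unit variance.
    # Note: a constant column has zero standard deviation and would yield NaNs.
    X_std = (A - np.mean(A, axis=0)) / np.std(A, axis=0)
    
    return X_std

def de_mean(X):
    # Subtract the overall mean so the values are centred on zero.
    return X - X.mean()


def train_test_split(X, y, ratio):
    # Draw a single random mask and apply it to both X and y so that
    # the rows of the features and the targets stay aligned.
    mask = np.random.rand(X.shape[0]) < ratio
    
    return X[mask, :], X[~mask, :], y[mask, :], y[~mask, :]

X = standardize_matrix(X)

y = np.expand_dims(y, axis=1)

y = de_mean(y)

X_train, X_test, y_train, y_test = train_test_split(X, y, 0.8)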

In [98]:
def ridge_regression(l, X, y):
    # Closed-form ridge solution: w = (l * I + X^T X)^{-1} X^T y.
    # No intercept is fitted, so X and y are assumed to be centred.
    n = X.shape[1]
    
    C = X.T.dot(X) + l * np.eye(n)
    
    # Solve the linear system instead of forming the inverse explicitly.
    w = np.linalg.solve(C, X.T.dot(y))
    
    return w


w = ridge_regression(1,X_train,y_train)

np.sqrt(np.square(X_train.dot(w) - y_train).sum() / len(y_train))

20828.20825151174

In [99]:
from sklearn.linear_model import Ridge

clf = Ridge(alpha=1.0)

clf.fit(X_train, y_train) 

y_pred = clf.predict(X_train)

np.sqrt(np.square(y_pred - y_train).sum() / len(y_train))

20827.147928664002

In [164]:
class RidgeRegressor(object):
    """
    Linear Least Squares Regression with Tikhonov regularization.
    More simply called Ridge Regression.
    We wish to fit our model so both the least squares residuals and L2 norm
    of the parameters are minimized.
    argmin Theta ||X*Theta - y||^2 + alpha * ||Theta||^2
    A closed form solution is available.
    Theta = (X'X + G'G)^-1 X'y
    Where X contains the independent variables, y the dependent variable and G
    is the matrix sqrt(alpha) * I, so that G'G = alpha * I; alpha is called the
    regularization parameter.
    When alpha=0 the regression is equivalent to ordinary least squares.
    http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)
    http://en.wikipedia.org/wiki/Tikhonov_regularization
    http://en.wikipedia.org/wiki/Ordinary_least_squares
    """

    def fit(self, X, y, alpha=0):
        """
        Fits our model to our training data.
        Arguments
        ----------
        X: mxn matrix of m examples with n independent variables
        y: dependent variable vector for m examples
        alpha: regularization parameter. A value of 0 will model using the
        ordinary least squares regression.
        """
        # Prepend a column of ones so the first parameter acts as the bias.
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        # G = sqrt(alpha) * I, so the penalty G'G equals alpha * I.
        G = np.sqrt(alpha) * np.eye(X.shape[1])
        G[0, 0] = 0  # Don't regularize bias
        self.params = np.linalg.solve(np.dot(X.T, X) + np.dot(G.T, G),
                                      np.dot(X.T, y))

    def predict(self, X):
        """
        Predicts the dependent variable of new data using the model.
        The assumption here is that the new data is iid to the training data.
        Arguments
        ----------
        X: mxn matrix of m examples with n independent variables
        Returns
        ----------
        Dependent variable vector for m examples
        """
        X = np.hstack((np.ones((X.shape[0], 1)), X))
        return np.dot(X, self.params)

r = RidgeRegressor()

r.fit(X_train,y_train,1.0)

r.predict(X_train)[:10]

array([ 200329.04224738,  205025.78666961,  201719.81815852,
        162748.18252521,  294229.74646662,  119512.03361855,
        284563.50232797,  132922.08983186,  114623.63701272,
        392210.09571446])

In [160]:
y_pred[:10]

array([ 200329.04224737,  205025.7866696 ,  201719.81815852,
        162748.1825252 ,  294229.74646662,  119512.03361856,
        284563.50232795,  132922.08983185,  114623.6370127 ,
        392210.09571446])

In [159]:
X_train.dot(w)[:10]

array([  13531.44384922,   18228.18827142,   14922.21976037,
        -24049.41587295,  107432.14806847,  -67285.56477961,
         97765.90392975,  -53875.50856632,  -72173.96138548,
        205412.49731629])

In [82]:
import numpy as np
from sklearn.preprocessing import StandardScaler



array([[ 0.09866795, -0.24877704, -0.20127756, ..., -0.13055824,
         0.49300665, -0.34662687],
       [-0.85472276,  0.34150783, -0.06727982, ..., -0.13055824,
         0.49300665, -0.34662687],
       [ 0.09866795, -0.13072007,  0.1249778 , ..., -0.13055824,
         0.49300665, -0.34662687],
       ..., 
       [ 0.33701563, -0.20942472, -0.13229785, ..., -0.13055824,
         0.49300665, -0.34662687],
       [-0.85472276, -0.13072007, -0.053647  , ..., -0.13055824,
         0.49300665, -0.34662687],
       [-0.85472276,  0.1447462 , -0.02801265, ..., -0.13055824,
         0.49300665, -0.34662687]])

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [88]:
y.shape

(1094,)

In [89]:
y = np.expand_dims(y, axis=1)

y.shape

(1094, 1)

array([[181500],
       [223500],
       [140000],
       ..., 
       [266500],
       [142125],
       [147500]])