## <u>**Ridge Regression in Python Notes**</u>

Noah Rubin

May 2021

#### <u>**Main Ideas**</u>

* [Ridge regression](https://machinelearningmastery.com/ridge-regression-with-python/) extends the concepts of OLS but makes some subtle adjustments through [Tikhonov regularisation](http://anderson.ece.gatech.edu/ece6254/assets/11-regression-regularisation.pdf)
* Ridge regression addresses the bias-variance tradeoff in machine learning: reducing one of bias or variance tends to increase the other
* We purposely introduce bias into the regression model in an effort to reduce the variance, which can then potentially lower the mean squared error of our estimator, since $$\text{MSE} = \text{Bias}^2 + \text{Variance}$$
* Even though by the Gauss-Markov theorem, OLS has the lowest sampling variance out of any linear unbiased estimator, there may be a biased estimator that can achieve a lower mean squared error, such as the ridge estimator
* Ridge regression is also a tool to help reduce the impact of multicollinearity within our feature matrix 
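
As a rough illustration of the points above, the following simulation sketch compares OLS with a ridge estimate on a deliberately collinear design. The design matrix, true coefficients, noise level and $\lambda = 5$ are arbitrary assumptions chosen for illustration, and `fit` and `summarise` are hypothetical helpers rather than library functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: a fixed design with two highly correlated columns
n, lam, sigma = 30, 5.0, 2.0
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n)])
beta = np.array([1.0, 1.0])  # "true" coefficients (assumption)


def fit(X, y, lam):
    """Closed-form ridge estimate; lam = 0 gives the OLS estimate."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Repeatedly redraw the noise and refit both estimators
ols, ridge = [], []
for _ in range(2000):
    y = X @ beta + sigma * rng.normal(size=n)
    ols.append(fit(X, y, 0.0))
    ridge.append(fit(X, y, lam))


def summarise(estimates):
    """Return (squared bias, variance, MSE), each summed over coefficients."""
    est = np.array(estimates)
    bias_sq = np.sum((est.mean(axis=0) - beta) ** 2)
    variance = np.sum(est.var(axis=0))
    mse = np.mean(np.sum((est - beta) ** 2, axis=1))
    return bias_sq, variance, mse


ols_bias_sq, ols_var, ols_mse = summarise(ols)
ridge_bias_sq, ridge_var, ridge_mse = summarise(ridge)

print("OLS   bias^2, variance, MSE:", ols_bias_sq, ols_var, ols_mse)
print("Ridge bias^2, variance, MSE:", ridge_bias_sq, ridge_var, ridge_mse)
```

In this particular setup the ridge estimate trades a small amount of bias for a large drop in variance; with a well-conditioned design or a poorly chosen $\lambda$ the comparison can go the other way.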

---

#### <span style="color:black"><u>**Algorithm Details**</u><a name="Ridge"></a></span>

The loss function for OLS regression is given as:

$$J(\beta_0, \beta_1, ... , \beta_p) = RSS = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j})^2.$$

This can be expressed in matrix form as:
$$J(\vec{\beta}) = (\vec{y} - X\vec{\beta})^T(\vec{y} - X\vec{\beta})$$

---

Ridge regression makes a small modification to the OLS loss function by adding a shrinkage penalty ([L2 regularisation](https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261#f810)). Hence, for ridge regression:

$$J(\beta_0, \beta_1, ... , \beta_p) = \sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j})^2 + \lambda \sum_{j=1}^{p} \beta_j ^2$$

This can be expressed in matrix form as:

$$J(\vec{\beta}) = (\vec{y} - X\vec{\beta})^T(\vec{y} - X\vec{\beta}) + \lambda\vec{\beta}^{T}\vec{\beta}$$

<b>By convention, columns in $X$ are assumed to have zero mean and unit variance (after scaling), and the response vector $\vec{y}$ is assumed to be centered to have mean zero.</b>
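
To make the two forms of the ridge loss concrete, here is a minimal sketch that standardises a made-up feature matrix, centres the response, and evaluates the loss both as a double sum and in matrix form. The data, the coefficient vector `beta` and `lam` are arbitrary assumptions, and the intercept is fixed at zero because the data are centred.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X_raw = rng.normal(loc=3.0, scale=2.0, size=(n, p))
y_raw = rng.normal(size=n)

# Standardise the columns of X (zero mean, unit variance) and centre y
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y = y_raw - y_raw.mean()

beta = np.array([0.5, -1.0, 2.0])  # arbitrary coefficient vector (assumption)
lam = 0.7                          # arbitrary penalty strength (assumption)

# Summation form: RSS plus lambda times the sum of squared coefficients
loss_sum = 0.0
for i in range(n):
    fitted = sum(beta[j] * X[i, j] for j in range(p))
    loss_sum += (y[i] - fitted) ** 2
loss_sum += lam * sum(beta[j] ** 2 for j in range(p))

# Matrix form: (y - X beta)^T (y - X beta) + lambda * beta^T beta
resid = y - X @ beta
loss_matrix = resid @ resid + lam * (beta @ beta)

print(loss_sum, loss_matrix)
```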

---

The parameter $\lambda \in [0, \infty)$ is a constant that can be chosen through resampling methods such as cross-validation. If $\lambda = 0$, the shrinkage penalty (the second term) disappears and we recover the OLS coefficient estimates. As $\lambda$ gets larger, the shrinkage penalty carries more weight, and the coefficient estimates tend towards zero (but will not be exactly zero). Since $\lambda$ is a hyperparameter that can be tuned, we get different coefficient estimates depending on which value of $\lambda$ is chosen. Ultimately, the shrinkage penalty encourages simpler models with smaller coefficients, as "it turns out that shrinking the coefficient estimates can significantly reduce their variance" - *An Introduction to Statistical Learning: With Applications in R*.
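
The shrinking effect of $\lambda$ can be seen in a small sketch that fits scikit-learn's `Ridge` (where $\lambda$ is called `alpha`) over a few penalty values. The synthetic data and the particular `alphas` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic, standardised features and a noisy linear response (assumption)
X = StandardScaler().fit_transform(rng.normal(size=(100, 4)))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

# Fit ridge for increasingly large penalties and record the coefficients
alphas = [0.01, 1.0, 100.0, 10_000.0]
coefs = np.array([Ridge(alpha=a).fit(X, y).coef_ for a in alphas])
norms = np.linalg.norm(coefs, axis=1)

for a, c, nrm in zip(alphas, coefs, norms):
    print(f"alpha={a:>8}: coef={np.round(c, 4)}, L2 norm={nrm:.4f}")
```

The L2 norm of the coefficient vector falls as `alpha` grows, while the individual coefficients get close to zero without being set exactly to zero.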
    
Also, the size constraint on the coefficients in the ridge regression "alleviates the problem of large coefficients (in absolute value) and its high variance, which may be a consequence of multicollinearity." - *Rice University STAT 410 Lecture Slides*

[Resource linked here](https://cpb-us-e1.wpmucdn.com/blogs.rice.edu/dist/e/8375/files/2017/08/Lecture16-1l5v69b.pdf) 

--- 

Expanding the terms in the loss function, we get

$$J(\vec{\beta}) = \vec{y}^T\vec{y} -2\vec{\beta}^TX^T \vec{y} + \vec{\beta}^TX^TX\vec{\beta} + \lambda\vec{\beta}^{T}\vec{\beta}$$

which is a convex function with a closed form solution when optimising coefficients. Taking the derivative of the loss function with respect to the beta vector we obtain:

$$\frac{\partial J(\vec{\beta})}{\partial \vec{\beta}} = -2X^{T}\vec{y} + 2X^{T}X\vec{\beta} + 2\lambda\vec{\beta}$$

Since $J(\vec{\beta})$ is convex, to minimise this quantity, we can set the derivative equal to 0 to find an estimate $\vec{b}_{ridge}$ for $\vec{\beta}$ thus:

$$-2X^{T}\vec{y} + 2X^{T}X\vec{b} + 2\lambda\vec{b} = 0$$

Moving $-2X^{T}\vec{y}$ to the other side and dividing all terms by two, we get

$$X^{T}X\vec{b} + \lambda\vec{b} = X^{T}\vec{y}$$

Factorising out a common factor of $\vec{b}$ we get

$$(X^{T}X + \lambda I)\vec{b} = X^{T}\vec{y}$$

"Pre-multiplying" both sides by $(X^{T}X + \lambda I)^{-1}$ allows us to obtain

$$\vec{b}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}\vec{y}$$

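Including a positive $\lambda$ ensures that $X^{T}X + \lambda I$ is non-singular, so the inverse $(X^{T}X + \lambda I)^{-1}$ exists even if $X^TX$ is singular (not of full rank). This is because $X^TX$ is positive semi-definite, so adding $\lambda I$ raises every eigenvalue by $\lambda > 0$.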
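
The closed-form estimator can be sketched in a few lines of NumPy and compared against scikit-learn's `Ridge` with the intercept switched off. The helper `ridge_closed_form` is a hypothetical name, and the data and $\lambda$ are made up. The last part duplicates a column so that $X^TX$ is singular, a case where the OLS normal equations have no unique solution but the ridge system still does.

```python
import numpy as np
from sklearn.linear_model import Ridge


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for b."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
lam = 2.0

b_closed = ridge_closed_form(X, y, lam)
# fit_intercept=False so that sklearn minimises exactly the same loss
b_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(b_closed, b_sklearn)

# Duplicate the first column: X^T X is now rank deficient
X_sing = np.column_stack([X, X[:, 0]])
rank = np.linalg.matrix_rank(X_sing.T @ X_sing)
b_sing = ridge_closed_form(X_sing, y, lam)  # still well defined for lam > 0
print(rank, b_sing)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; it solves the same linear system.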

This optimisation problem to find $\vec{b}_{ridge}$ could also have been posed as a constrained problem and solved using [Lagrange Multipliers](https://en.wikipedia.org/wiki/Lagrange_multiplier), where we would find our estimator through the Karush-Kuhn-Tucker (KKT) conditions:

$$\text{argmin}_{||\vec{\beta}||_2 ^2 \leq c}||\vec{y} - X\vec{\beta}||_2 ^2$$

where we optimise the beta vector subject to the constraint that $\sum_{j=1}^p \beta_{j}^2 \leq c$.



---

**Proving that $\vec{b}_{ridge}$ is biased:**

From above,

$$\vec{b}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}\vec{y}$$

Let $M = X^{T}X$ and assume $M$ is invertible, then:

$$\vec{b}_{ridge} = (M + \lambda I)^{-1}M(M^{-1}X^{T}\vec{y})$$

Factorising $M$ out in the first term and substituting the expression for $M$ into the second term, we obtain:

$$\vec{b}_{ridge} = [M(I + \lambda M^{-1})]^{-1}M[(X^TX)^{-1}X^T\vec{y}]$$

Since by matrix inverse laws, $(AB)^{-1} = B^{-1}A^{-1}$, and since $\vec{b}_{ols} = (X^TX)^{-1}X^T\vec{y}$:

$$\vec{b}_{ridge} = (I + \lambda M^{-1})^{-1}M^{-1}M\vec{b}_{ols}$$

Since $A^{-1}A$ is the identity matrix for a matrix $A$, then:

$$\vec{b}_{ridge} = (I + \lambda M^{-1})^{-1}\vec{b}_{ols}$$

Taking the expectation of this simplified quantity, 

$$E(\vec{b}_{ridge}) = E((I + \lambda M^{-1})^{-1}\vec{b}_{ols})$$

As $(I + \lambda M^{-1})^{-1}$ is not random (treating $X$ as fixed) and as the OLS estimator is unbiased under the Gauss-Markov assumptions, so that $E(\vec{b}_{ols}) = \vec{\beta}$,

$$E(\vec{b}_{ridge}) = (I + \lambda M^{-1})^{-1}\vec{\beta}$$

which is not equal to $\vec{\beta}$ whenever $\lambda > 0$ (and $\vec{\beta} \neq \vec{0}$). If $\lambda$ were zero, the estimator would no longer be ridge regression but simply OLS.
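
The relationship $\vec{b}_{ridge} = (I + \lambda M^{-1})^{-1}\vec{b}_{ols}$ (note the outer inverse) and the resulting bias can be checked numerically. This is a minimal sketch on made-up data; the design matrix, $\lambda$ and the "true" coefficient vector `beta` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3))
y = rng.normal(size=60)
lam = 1.5

M = X.T @ X
I = np.eye(3)

# Both estimators computed directly from their closed forms
b_ols = np.linalg.solve(M, X.T @ y)
b_ridge = np.linalg.solve(M + lam * I, X.T @ y)

# Ridge expressed as a fixed linear map W applied to the OLS estimate
W = np.linalg.inv(I + lam * np.linalg.inv(M))
b_via_ols = W @ b_ols
print(b_ridge, b_via_ols)

# With E(b_ols) = beta, the expected ridge estimate is W @ beta
beta = np.array([1.0, -2.0, 0.5])  # assumed true coefficients
expected_ridge = W @ beta
print(expected_ridge - beta)       # non-zero bias for lam > 0
```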

---

**Variance of the ridge estimator**

The variance of the OLS estimator was shown in a previous jupyter notebook to be given as:

$$\text{Var}(\vec{b}_{ols}) = \sigma^2(X^TX)^{-1}$$

The ridge estimator of $\vec{\beta}$ can be given as 
$$\vec{b}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}\vec{y}$$

This can also be expressed as,

$$\vec{b}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}X(X^{T}X)^{-1}X^T\vec{y}$$

Since $(X^{T}X)^{-1}X^T\vec{y} = \vec{b}_{ols}$,

$$\vec{b}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}X\vec{b}_{ols}$$

Taking the variance of both sides:

$$\text{Var}(\vec{b}_{ridge}) = \text{Var}((X^{T}X + \lambda I)^{-1}X^{T}X\vec{b}_{ols})$$

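As $(X^{T}X + \lambda I)^{-1}X^{T}X$ is a non-random matrix and $\vec{b}_{ols}$ is a random vector, we can use $\text{Var}(A\vec{z}) = A\text{Var}(\vec{z})A^T$: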

$$\text{Var}(\vec{b}_{ridge}) = (X^{T}X + \lambda I)^{-1}X^{T}X\text{Var}(\vec{b}_{ols})((X^{T}X + \lambda I)^{-1}X^{T}X)^T$$

Recognising that $\text{Var}(\vec{b}_{ols}) = \sigma^2(X^TX)^{-1}$ under the homoskedasticity assumption, and applying $(AB)^T = B^TA^T$ for matrices $A$ and $B$ together with the symmetry of $X^TX$ and $(X^{T}X + \lambda I)^{-1}$:

$$\text{Var}(\vec{b}_{ridge}) = (X^{T}X + \lambda I)^{-1}X^{T}X\sigma^2 (X^TX)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1}$$

Cancelling $X^{T}X(X^TX)^{-1} = I$ and moving the scalar $\sigma^2$ to the front,

$$\text{Var}(\vec{b}_{ridge}) = \sigma^2(X^{T}X + \lambda I)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1}$$
    
For $\lambda > 0$, the variance of the ridge estimator is always lower than that of OLS, in the sense that $\text{Var}(\vec{b}_{ols}) - \text{Var}(\vec{b}_{ridge})$ is positive definite. The [proof](https://www.statlect.com/fundamentals-of-statistics/ridge-regression) is quite long, so consider the special case where $X^TX = I$.
    
If we substitute $X^TX = I$ into the equation above, we obtain
    
$$\text{Var}(\vec{b}_{ridge}) = \sigma^2(I + \lambda I)^{-1}I(I + \lambda I)^{-1}$$

Factorising out the identity matrix
    
$$\text{Var}(\vec{b}_{ridge}) = \sigma^2(1 + \lambda)^{-1}(1 + \lambda)^{-1}I$$

Simplifying, we get

$$\text{Var}(\vec{b}_{ridge}) = \sigma^2(1 + \lambda)^{-2}I$$
    
For any $\lambda > 0$ this is lower than the variance of the OLS estimator, which in this case is $\sigma^2 I$. Ultimately, different values of $\lambda$ allow us to control the magnitude of both the variance and the coefficients.
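
For a general design matrix, the claim that ridge has lower variance than OLS can be checked numerically by looking at the eigenvalues of the difference between the two covariance matrices. The sketch below uses a random design, $\sigma^2 = 1.3$ and $\lambda = 2$, all of which are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 4))
sigma2, lam = 1.3, 2.0

M = X.T @ X
A = np.linalg.inv(M + lam * np.eye(4))

# Covariance matrices from the two closed-form expressions
var_ols = sigma2 * np.linalg.inv(M)
var_ridge = sigma2 * A @ M @ A

# If every eigenvalue of the difference is positive, the difference is
# positive definite, i.e. ridge has smaller variance in every direction
diff_eigs = np.linalg.eigvalsh(var_ols - var_ridge)
print(diff_eigs)
print(np.trace(var_ols), np.trace(var_ridge))
```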


---


**Useful property of the ridge estimator**

In cases where the columns of $X$ are orthonormal (i.e. the columns are mutually orthogonal and each has unit length), $X$ satisfies:
$$X^TX = I.$$
(If $X$ is also square, it is an orthogonal matrix and $X^T = X^{-1}$.)

It can also be shown that when this condition is met, the ridge estimator is a scalar multiple of the OLS estimator, since $(X^TX + \lambda I)^{-1} = \frac{1}{1+\lambda}I$ and $\vec{b}_{ols} = X^T\vec{y}$:

$$\vec{b}_{ridge} = \frac{1}{1 + \lambda}\vec{b}_{ols}$$
    
If we now take the expectation of this quantity, we see that the ridge estimator, on average, shrinks the true coefficients towards zero (underestimating them in magnitude), since
$$E(\vec{b}_{ridge}) = \frac{1}{1+\lambda}E(\vec{b}_{ols}) = \frac{1}{1+\lambda}\vec{\beta}$$
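
The orthonormal case can be sketched by building a matrix with orthonormal columns from a QR decomposition and comparing the two estimators. The random data and $\lambda = 0.8$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Reduced QR of a random 50 x 3 matrix: Q has orthonormal columns
Q, _ = np.linalg.qr(rng.normal(size=(50, 3)))
y = rng.normal(size=50)
lam = 0.8

gram = Q.T @ Q  # numerically equal to the 3 x 3 identity

# Since Q^T Q = I, the OLS estimate reduces to Q^T y
b_ols = Q.T @ y
b_ridge = np.linalg.solve(gram + lam * np.eye(3), Q.T @ y)

print(b_ridge)
print(b_ols / (1 + lam))
```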



Extra resource [here](https://arxiv.org/pdf/1509.09169.pdf)









In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.impute import KNNImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score

https://medium.com/swlh/randomized-or-grid-search-with-pipeline-cheatsheet-719c72eda68

In [None]:
train = pd.read_csv('../data/Train_updated.csv')
test = pd.read_csv('../data/Test_updated.csv')

In [None]:
# Let's use regular GDP per capita rather than ln(GDP_cap),
# since it is unclear how scaling an already logged variable behaves

X_train = train.drop(['Country', 'ln(GDP_cap)', 'Life_exp', 'HDI'], axis = 1)
y_train = train['Life_exp']

X_test = test.drop(['Country', 'ln(GDP_cap)', 'Life_exp', 'HDI'], axis = 1)
y_test = test['Life_exp']

In [None]:
# Impute, scale then run ridge regression
steps = [('knn', KNNImputer()), ('scaler', StandardScaler()), ('ridge', Ridge())]

# Construct pipeline with specified steps
pipeline = Pipeline(steps=steps)

At any point (before or after fitting), you can access any/all of the individual estimators in the pipeline with:

In [None]:
pipeline.named_steps['knn']

Define your hyperparameter ranges using any of the instantiated pipeline’s param keys.

In [None]:
# First access the parameters of the individual estimators
pipeline.get_params()

In [None]:
[key for key in pipeline.get_params().keys()]

In [None]:
[val for val in pipeline.get_params().values()]
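
The parameter keys returned by `get_params()` follow a `<step name>__<parameter name>` pattern, which is what the search grid needs. A minimal self-contained sketch, rebuilding the same three-step pipeline so that it runs on its own (the values passed to `set_params` are arbitrary):

```python
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[("knn", KNNImputer()),
                           ("scaler", StandardScaler()),
                           ("ridge", Ridge())])

# Tunable settings are exposed as "<step name>__<parameter name>"
ridge_keys = [key for key in pipeline.get_params() if key.startswith("ridge__")]
print(ridge_keys)

# The same keys can be used to change a setting on a nested estimator
pipeline.set_params(ridge__alpha=2.5, knn__n_neighbors=3)
print(pipeline.named_steps["ridge"].alpha)
print(pipeline.named_steps["knn"].n_neighbors)
```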

In [None]:
param_grid = dict()

# KNN imputer parameter grid with k parameter something from 1-25
param_grid['knn__n_neighbors'] = np.arange(1, 26)
param_grid['knn__weights'] = ['uniform', 'distance']

# Ridge params (lambda, called alpha in sklearn)
param_grid['ridge__alpha'] = np.linspace(0, 5, 25)

In [None]:
# Maybe run cell magic timeit here
print("Fitting started...")

# Set random state for reproducible results
randomised_search = RandomizedSearchCV(pipeline, 
                                       n_iter=200,
                                       param_distributions=param_grid, 
                                       cv=10, 
                                       verbose=10, 
                                       n_jobs=-1, 
                                       random_state=3)

randomised_search.fit(X_train, y_train)

In [None]:
randomised_search.best_params_

In [None]:
scores_df = pd.DataFrame(randomised_search.cv_results_).loc[:, 'params':]
scores_df.sort_values(by=['rank_test_score'], inplace=True)

# Inspect the 10 best-ranked parameter combinations
scores_df.head(10)
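
Because the CSV files are not included here, the following is a self-contained sketch of the same impute, scale, ridge and randomised search workflow on synthetic data, followed by an evaluation on a held-out split. The data, the smaller grid, `n_iter=20`, `cv=5` and the `neg_mean_squared_error` scoring are assumptions chosen to keep the example quick; they are not the settings used above.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic regression data with roughly 5% of the entries set to missing
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)
X[rng.random(X.shape) < 0.05] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline(steps=[("knn", KNNImputer()),
                       ("scaler", StandardScaler()),
                       ("ridge", Ridge())])

grid = {
    "knn__n_neighbors": np.arange(1, 11),
    "knn__weights": ["uniform", "distance"],
    "ridge__alpha": np.linspace(0.1, 5, 20),
}

search = RandomizedSearchCV(pipe,
                            param_distributions=grid,
                            n_iter=20,
                            cv=5,
                            scoring="neg_mean_squared_error",
                            random_state=3)
search.fit(X_tr, y_tr)

# The refitted best pipeline is used for prediction on the held-out split
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(search.best_params_)
print(test_mse)
```

The imputer and scaler sit inside the pipeline so that they are refitted on each training fold, which avoids leaking information from the validation fold into preprocessing.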

In [None]:
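# Reference grid for a different (text-classification) pipeline with steps
# named reduce_dim, tfidf and classifier; it is not used by the ridge pipeline.
# STOP_WORDS must be defined before running this cell.
from sklearn.decomposition import TruncatedSVD

svm_param_grid = {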
    "reduce_dim": ["passthrough", TruncatedSVD(10), TruncatedSVD(20)],
    "tfidf__analyzer": ["word", "char"],
    "tfidf__smooth_idf": [True, False],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "tfidf__stop_words": [None, STOP_WORDS],
    "classifier__class_weight": [None, "balanced"],
    "classifier__C": [1, 10, 100, 1000],
    "classifier__gamma": [0.001, 0.0001],
    "classifier__kernel": ["linear", "rbf"],
}