### Ridge regression

Ridge regression, also known as L2 regularization, is one of several types of regularization for linear regression models. Regularization is a statistical method to reduce errors caused by overfitting on training data. Ridge regression specifically corrects for multicollinearity in regression analysis. This is useful when developing machine learning models that have a large number of parameters, particularly if those parameters also have high weights.

In [1]:
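# Generate noisy quadratic data: y = 0.5*x^2 + x + 2 + Gaussian noise
import numpy as np
np.random.seed(42)  # fix the seed so the results are reproducible
m = 100  # number of samples
X = 6 * np.random.rand(m, 1) - 3  # features drawn uniformly from [-3, 3)
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)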

In [2]:
from sklearn.linear_model import Ridge

# alpha sets the strength of the L2 penalty; "cholesky" solves the problem in closed form
ridge_reg = Ridge(alpha=0.1, solver="cholesky")
ridge_reg.fit(X, y)

In [3]:
X_new = np.array([[1.5],[2]])
ridge_reg.predict(X_new)

array([[4.82899748],
       [5.25067411]])
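
The `"cholesky"` solver corresponds to a closed-form solution of the ridge problem. As a rough sketch, the snippet below solves the regularized normal equations by hand and checks the result against `Ridge`. It assumes the intercept is left unpenalized, and it regenerates similar data with its own random generator, so the numbers will not match the cells above exactly. The names `X_b`, `A`, and `theta` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data in the same spirit as the cells above (different random draws).
rng = np.random.default_rng(0)
m = 100
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.standard_normal((m, 1))

alpha = 0.1

# Add a column of ones so the first parameter acts as the intercept.
X_b = np.c_[np.ones((m, 1)), X]

# Penalty matrix: identity, with a zero in the intercept position
# (assumption: the intercept is not regularized).
A = np.eye(X_b.shape[1])
A[0, 0] = 0.0

# Solve (X_b^T X_b + alpha * A) theta = X_b^T y.
theta = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)

ridge = Ridge(alpha=alpha, solver="cholesky").fit(X, y)
sklearn_params = [np.ravel(ridge.intercept_)[0], np.ravel(ridge.coef_)[0]]

closed_form_matches = np.allclose(theta.ravel(), sklearn_params)
print("closed form:", theta.ravel())
print("sklearn:    ", sklearn_params)
print("match:", closed_form_matches)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a common choice for numerical stability.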

#### Comparing with predictions from stochastic gradient descent (SGD)

In [4]:
from sklearn.linear_model import SGDRegressor
# alpha is divided by m because SGDRegressor averages the loss over the samples, whereas Ridge sums it
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1/m, tol=None, max_iter=1000, eta0=0.01, random_state=42)
sgd_reg.fit(X, y.ravel())  # y.ravel() because fit() expects 1D targets

In [5]:
sgd_reg.predict(X_new)

array([4.82830117, 5.24903344])

Ridge regression does not perform feature selection, so it cannot reduce model complexity by eliminating features. But if one or more features affect a model’s output too heavily, ridge regression can shrink high feature weights (i.e., coefficients) across the model via the L2 penalty term. This reduces the complexity of the model and makes its predictions less erratically dependent on any one or more features.
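
The shrinkage described above can be illustrated with a small experiment. This is a minimal sketch on made-up data: two nearly identical (multicollinear) features, fitted with increasing `alpha`. The variable names and the data-generating process are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_samples = 200

# Two almost perfectly correlated features (hypothetical data).
x1 = rng.standard_normal(n_samples)
x2 = x1 + 0.01 * rng.standard_normal(n_samples)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.standard_normal(n_samples)

alphas = [1e-4, 1.0, 100.0]
coefs = [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]
norms = [float(np.linalg.norm(c)) for c in coefs]

for a, c, n in zip(alphas, coefs, norms):
    # Coefficients shrink toward zero as alpha grows, but none is set exactly to zero.
    print(f"alpha={a:>8}: coef={c}, L2 norm={n:.3f}")
```

With a tiny penalty the two correlated coefficients can take large, unstable values; a larger `alpha` pulls them toward smaller values of similar size, yet both features stay in the model.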

#### Bias-variance tradeoff

In machine learning terms, ridge regression amounts to adding bias to a model in order to decrease that model’s variance. The bias-variance tradeoff is a well-known problem in machine learning. To understand it, one first needs to know what “bias” and “variance” mean in machine learning research.

To put it briefly: bias measures the average difference between predicted values and true values; variance measures the difference between predictions across various realizations of a given model. As bias increases, a model predicts less accurately on a training dataset. As variance increases, a model predicts less accurately on other datasets. Bias and variance thus roughly measure model accuracy on training and test sets, respectively. Developers naturally hope to reduce both bias and variance. Simultaneous reduction in both is not always feasible, however, hence the need for regularization techniques such as ridge regression.

As mentioned, ridge regression introduces additional bias for the sake of decreased variance. In other words, models regularized through ridge regression produce less accurate predictions on training data (higher bias) but more accurate predictions on test data (lower variance). This is the bias-variance tradeoff. Through ridge regression, users determine an acceptable loss in training accuracy (higher bias) in order to increase a given model’s generalization (lower variance). In this way, increasing bias can help improve overall model performance.
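
One way to see this tradeoff is a small simulation. The sketch below is an illustration under stated assumptions, not a general proof: it uses a hypothetical linear ground truth (`true_fn`), keeps the training inputs fixed, redraws only the noise, and estimates squared bias and variance of the predictions for a few penalty strengths (`alpha` in scikit-learn).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_repeats = 20, 500

# Fixed design: the same inputs are reused, only the noise is redrawn.
X_train = rng.uniform(-3, 3, size=(n_train, 1))
X_test = np.linspace(-3, 3, 50).reshape(-1, 1)


def true_fn(X):
    """Hypothetical ground-truth function, used only for this simulation."""
    return 3 * X[:, 0] + 1


# Each column of Y_train is one noisy realization of the training targets.
Y_train = true_fn(X_train)[:, None] + rng.standard_normal((n_train, n_repeats))

results = {}
for alpha in [0.01, 10.0, 1000.0]:
    # Ridge fits each target column independently: one model per realization.
    preds = Ridge(alpha=alpha).fit(X_train, Y_train).predict(X_test)
    # Squared bias: gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=1) - true_fn(X_test)) ** 2)
    # Variance: spread of predictions across realizations.
    variance = np.mean(preds.var(axis=1))
    results[alpha] = (bias_sq, variance)
    print(f"alpha={alpha:>7}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the estimated squared bias grows and the variance shrinks as `alpha` increases, which is the pattern the text describes.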

The strength of the L2 penalty, and therefore the model’s bias-variance tradeoff, is determined by the value λ in the ridge estimator’s loss function (the `alpha` argument in the scikit-learn code above). If λ is zero, the loss reduces to ordinary least squares, which yields a standard linear regression model without any regularization. By contrast, a higher λ value means more regularization: as λ increases, model bias increases while variance decreases. Thus, when λ equals zero the model is prone to overfitting the training data, but when λ is too high the model underfits on all data.
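
For reference, the ridge loss is commonly written as the residual sum of squares plus an L2 penalty. The notation here ($n$ samples, $p$ coefficients $\beta_j$, predictions $\hat{y}_i$) is assumed for illustration and is not taken from the source:

$$
\hat{\boldsymbol{\beta}}^{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \left[ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right]
$$

Setting $\lambda = 0$ removes the penalty term and leaves the ordinary least squares objective. The intercept is usually left out of the penalty sum.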

Mean squared error (MSE) can help determine a suitable λ value. MSE is closely related to the residual sum of squares (RSS) and measures the average squared difference between predicted and true values. The lower a model’s MSE, the more accurate its predictions. But MSE on the training data increases as λ increases, so a suitable λ is usually judged by MSE on held-out data.
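
A simple way to apply this is to compare MSE across a handful of candidate values. The following is a minimal sketch on synthetic data: the candidate `alphas`, the data-generating process, and the 50/50 hold-out split are all assumptions made for illustration. Cross-validation (for example scikit-learn's `RidgeCV`) is a common alternative to a single split.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 60, 20

# Synthetic data: only the first five features actually matter (hypothetical setup).
X = rng.standard_normal((n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coef + rng.standard_normal(n_samples)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

alphas = [0.001, 0.1, 1.0, 10.0, 100.0]
train_mse, val_mse = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(X_train)))
    val_mse.append(mean_squared_error(y_val, model.predict(X_val)))
    print(f"alpha={alpha:>7}: train MSE={train_mse[-1]:.3f}, val MSE={val_mse[-1]:.3f}")

# Pick the candidate with the lowest validation MSE.
best_alpha = alphas[int(np.argmin(val_mse))]
print("best alpha on the validation split:", best_alpha)
```

Training MSE can only go up as `alpha` grows, so it cannot be used on its own to pick λ; the validation MSE is what guides the choice here.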

Ref: https://www.ibm.com/topics/ridge-regression