# Exploring the Basics of Ridge Regression

Ridge regression adds a penalty term to the OLS objective in order to shrink the coefficient estimates. The motivation is to prevent overfitting to the training data. OLS has low bias because it fits the training sample as closely as a linear model can, but it can have high variance because the estimates change with whichever sample is used. Ridge accepts some bias in exchange for less variance across different samples and potentially better test performance.

### The penalty term

Recall that OLS minimizes the residual sum of squares (RSS). Ridge minimizes RSS plus a parameter $\lambda$ times the sum of the squared coefficients:

$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Note that the intercept is not included in the penalty term and that $\lambda \ge 0$.
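
As a rough illustration of the objective above, here is a minimal sketch that evaluates it for a given set of coefficients. The function name `ridge_objective` and the tiny arrays are made up for illustration and are not part of the prostate data.

```python
import numpy as np

def ridge_objective(X, y, intercept, beta, lam):
    """Return RSS + lam * sum(beta_j^2); the intercept is not penalized."""
    residuals = y - (intercept + X @ beta)
    rss = np.sum(residuals ** 2)
    penalty = lam * np.sum(beta ** 2)
    return rss + penalty

# Made-up toy values (illustrative only)
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
beta_demo = np.array([1.0, 0.0])

# With lam = 0 the objective is just the RSS
ols_loss = ridge_objective(X_demo, y_demo, 0.0, beta_demo, lam=0.0)
# With lam > 0 the same coefficients cost more because of the penalty
ridge_loss = ridge_objective(X_demo, y_demo, 0.0, beta_demo, lam=2.0)
print(ols_loss, ridge_loss)
```

Here the coefficients fit the toy data exactly, so the only cost at $\lambda = 2$ comes from the penalty on $\beta_1^2$.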

### Lambda
This parameter is not estimated along with the coefficients and has to be tuned separately, typically by cross-validation. When $\lambda = 0$, ridge regression is equivalent to OLS. As $\lambda$ approaches infinity, the coefficients approach 0, leaving only the intercept. The model then predicts a constant (a horizontal line), which implies no relationship between the independent and dependent variables.
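
To make the two limiting cases concrete, here is a small sketch on simulated data. It assumes the standard closed-form ridge solution on centered data so that the intercept stays unpenalized. The helper `ridge_fit` and the simulated arrays are hypothetical and only for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept (via centering)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Simulated data with made-up true coefficients
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(50, 3))
y_sim = 4.0 + X_sim @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Coefficients shrink toward 0 as lambda grows
for lam in [0.0, 1.0, 100.0, 1e6]:
    intercept, beta = ridge_fit(X_sim, y_sim, lam)
    print(f"lambda={lam:>9}: intercept={intercept:.3f}, beta={np.round(beta, 3)}")
```

At $\lambda = 0$ the result should match an ordinary least-squares fit, and for very large $\lambda$ the intercept should settle near the mean of the response while the slopes vanish.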

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load in Data

It is important to note that the inputs for ridge regression are generally standardized. The size of a coefficient depends on the units of its feature, so with features on different scales the penalty would shrink some coefficients much more than others.
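
A minimal sketch of standardization, assuming each column is centered to mean 0 and scaled to unit standard deviation. The `standardize` helper and the toy array are hypothetical; in practice the mean and standard deviation would usually be computed on the training data and reused for the test data.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    return (X - mean) / std, mean, std

# Two made-up features on very different scales
X_raw = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0],
                  [4.0, 6000.0]])

X_std, col_mean, col_std = standardize(X_raw)
print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After this step both columns are on a comparable scale, so the penalty treats their coefficients more evenly.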

In [12]:
prostate_df = pd.read_pickle('Data/prostate.pkl')

Notes:
Should one-hot encoded features be standardized? (Does it only matter for penalized regression?)