## Regularization Methods Example 1.2:
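In the following, we discuss the function **sklearn.linear\_model.Ridge()** in depth. First, note that the qualitative predictors in x have to be transformed into dummy variables. The flag **normalize = True** centres each predictor and divides it by its $\ell_2$ norm before fitting. This puts all predictors on a common scale, but it is not the same as scaling to unit variance: the two differ by a common factor of $\sqrt{n}$. (The argument was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older release.) When comparing to **R**, note that the implementation is slightly different, which makes it hard to compare coefficients as a function of lambda. The optimal solution, however, will generally be the same.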

We will now perform ridge regression in order to predict **Balance** in the **Credit** data set.

In [4]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
import warnings
warnings.filterwarnings("ignore")

# Load data
df = pd.read_csv('./data/Credit.csv', index_col="Unnamed: 0")

# Convert Categorical variables
df = pd.get_dummies(data=df, drop_first=True, 
                    prefix=('Gender_', 'Student_', 
                            'Married_', 'Ethnicity_'))

# Define target and predictors
x = df.drop(columns='Balance') 
y = df['Balance']

# Fit model:
lambda_ = 100
reg = Ridge(alpha=lambda_, normalize=True)
reg = reg.fit(x, y)

# Coefficients and corresponding predictors
coef = np.round(reg.coef_, 3)
x_cols = x.columns.values

We expect the coefficient estimates to be much smaller, in terms of $\ell_2$ norm, when a large value of $\lambda$ is used, as compared to when a small value of $\lambda$ is used. These are the coefficients when $\lambda = 100$, along with their $\ell_2$ norm:


In [5]:
print(pd.DataFrame(data={'Feature': x_cols,
                         'Coefficient':coef}),
      '\n\nl2-norm:', np.sqrt(np.sum(coef**2)))

                 Feature  Coefficient
0                 Income        0.006
1                  Limit        0.000
2                 Rating        0.003
3                  Cards        0.029
4                    Age        0.000
5              Education       -0.001
6           Gender__Male       -0.020
7           Student__Yes        0.396
8           Married__Yes       -0.005
9       Ethnicity__Asian       -0.010
10  Ethnicity__Caucasian       -0.003 

l2-norm: 0.3977901456798547


In contrast, here are the coefficients when $\lambda = 50$, along with their $\ell_2$ norm. Note the much larger $\ell_2$ norm of the coefficients associated with this smaller value of $\lambda$.

In [3]:
# Fit model:
lambda_ = 50
reg = Ridge(alpha=lambda_, normalize=True)
reg = reg.fit(x, y)

# Coefficient and corresponding predictors
coef = np.round(reg.coef_, 3)
x_cols = x.columns.values

print(pd.DataFrame(data={'Feature': x_cols,
                         'Coefficient':coef}),
      '\n\nl2-norm:', np.sqrt(np.sum(coef**2)))

                 Feature  Coefficient
0                 Income        0.112
1                  Limit        0.003
2                 Rating        0.049
3                  Cards        0.563
4                    Age       -0.002
5              Education       -0.021
6           Gender__Male       -0.379
7           Student__Yes        7.773
8           Married__Yes       -0.123
9       Ethnicity__Asian       -0.183
10  Ethnicity__Caucasian       -0.052 

l2-norm: 7.806846994786052
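
The shrinking effect seen in the two outputs above can be checked over a whole range of $\lambda$ values. The following is a minimal sketch, not the Credit analysis itself: it assumes synthetic data (the array `X`, the true coefficients and the grid `lambdas` are made up for illustration) and standardizes the predictors with `StandardScaler` instead of the `normalize` flag, so the values of `alpha` are not comparable to those used above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors (assumption: NOT the Credit data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=200)

# Standardize predictors to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# Fit ridge regression for increasing lambda and record the l2 norm
lambdas = [0.01, 1, 100, 10000]
norms = []
for lam in lambdas:
    reg = Ridge(alpha=lam).fit(X_std, y)
    norms.append(np.sqrt(np.sum(reg.coef_ ** 2)))

for lam, norm in zip(lambdas, norms):
    print(f"lambda = {lam:>8}: l2-norm = {norm:.4f}")
```

The printed norms should decrease as $\lambda$ grows, which is the same qualitative behaviour as in the comparison between $\lambda = 50$ and $\lambda = 100$.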


The standard least squares coefficient estimates are scale equivariant: multiplying a predictor variable $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same. In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant. For instance, consider the **income** variable, which is measured in dollars. One could reasonably have measured income in thousands of dollars, which would result in a reduction in the observed values of income by a factor of $1000$. Because of the sum of squared coefficients in the ridge regression objective, such a change in scale will not simply cause the ridge regression coefficient estimate for **income** to change by a factor of $1000$. In other words, $X_j \hat{\beta}_{j,\lambda}^{R}$ will depend not only on the value of $\lambda$, but also on the scaling of the $j$th predictor. In fact, the value of $X_j \hat{\beta}_{j,\lambda}^{R}$ may even depend on the scaling of the other predictors. Therefore, it is best to apply ridge regression after standardizing the predictors.
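
The difference between least squares and ridge regression under rescaling can be illustrated numerically. The sketch below assumes a small synthetic data set (the variables `income_dollars` and `other`, and the value `alpha=1000`, are hypothetical and chosen only for illustration). It fits both models once with income in dollars and once with income in thousands of dollars, then compares the ratio of the two income coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption: illustrative values, not the Credit data)
rng = np.random.default_rng(1)
n = 100
income_dollars = rng.normal(50_000, 15_000, size=n)
other = rng.normal(size=n)
y = 0.01 * income_dollars + 5.0 * other + rng.normal(size=n)

# Same predictors, with income in dollars or in thousands of dollars
X_dollars = np.column_stack([income_dollars, other])
X_thousands = np.column_stack([income_dollars / 1000, other])

# Least squares: the income coefficient scales by exactly 1000
ols_d = LinearRegression().fit(X_dollars, y).coef_
ols_k = LinearRegression().fit(X_thousands, y).coef_
ols_ratio = ols_k[0] / ols_d[0]

# Ridge without standardization: the ratio is no longer 1000
ridge_d = Ridge(alpha=1000).fit(X_dollars, y).coef_
ridge_k = Ridge(alpha=1000).fit(X_thousands, y).coef_
ridge_ratio = ridge_k[0] / ridge_d[0]

print("OLS income coefficient ratio:  ", ols_ratio)
print("Ridge income coefficient ratio:", ridge_ratio)
print("Ridge coefficient of 'other':  ", ridge_d[1], "vs", ridge_k[1])
```

For least squares the ratio equals $1000$ up to rounding, so the fitted contribution of income is unchanged. For ridge regression the ratio deviates from $1000$, and the coefficient of the second predictor also changes slightly even though only income was rescaled.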

Note that by default, the **Ridge()** function does not standardize the variables. To turn on scaling, use the argument **normalize = True**.
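
On scikit-learn releases that no longer accept the **normalize** argument, one possible alternative is to standardize explicitly before fitting. The following is a minimal sketch under stated assumptions: the data frame `x` is synthetic with made-up columns resembling the Credit predictors, and `StandardScaler` scales to unit variance rather than unit $\ell_2$ norm, so a given `alpha` here does not correspond to the same `alpha` with **normalize = True**.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors and target (assumption: NOT the real Credit data)
rng = np.random.default_rng(2)
x = pd.DataFrame({
    "Income": rng.normal(45, 35, size=300),
    "Limit": rng.normal(4700, 2300, size=300),
    "Student_Yes": rng.integers(0, 2, size=300),
})
y = (0.2 * x["Limit"] - 7.0 * x["Income"]
     + 400 * x["Student_Yes"] + rng.normal(0, 100, size=300))

# Standardize the predictors, then fit ridge regression on the scaled data
model = make_pipeline(StandardScaler(), Ridge(alpha=100))
model.fit(x, y)

# Coefficients refer to the standardized predictors
coef = model.named_steps["ridge"].coef_
print(pd.DataFrame({"Feature": x.columns,
                    "Coefficient": np.round(coef, 3)}))
```

Wrapping both steps in a pipeline means the scaling learned on the training data is reused automatically when calling `model.predict` on new observations.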