# Regularization: Occam's Razor for Models

> Frustra fit per plura quod potest fieri per pauciora. 
>
> (It is in vain to do with many things what can be done with few things.)

-- William of Ockham (c. 1287–1347)

**Regularization** is a technique for preventing overfitting. Since excessive model complexity can lead to overfitting, we introduce a **preference for simplicity** into the objective function when training a model.

## Preamble

In [None]:
import matplotlib.pyplot as plt
import seaborn
import numpy as np
import pandas

In [None]:
import data_science_learning_paths

In [None]:
data_science_learning_paths.setup_plot_style()

## About Regularization

In the chapter [📓 Algorithm Selection and Hyperparameter Tuning](../ml/ml-algo-hyperparameter.ipynb) we have already discussed model complexity and how it can contribute to overfitting.

In [None]:
x = np.linspace(-10, 10)
y = -x**2 + np.random.normal(scale=10.0, size=len(x))

In [None]:
plt.figure(figsize=(5,5))
plt.scatter(x, y)
plt.plot(
    x, 
    np.poly1d(
        np.polyfit(x, y, 2)
    )(x),
    color="green"
)
plt.plot(
    x, 
    np.poly1d(
        np.polyfit(x, y, 42)
    )(x),
    color="red"
)

Which model is more likely to generalize? Clearly we should prefer the simpler model to the overfitted complex model in this case. In practice, we could treat the order of the polynomial as a hyperparameter and do an explicit parameter search, iteratively increasing the order and testing the performance. 
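The explicit search over polynomial orders described above could look roughly like the following. This is a minimal sketch, not a definitive recipe: it regenerates data similar to the toy example with a fixed seed, and the names `x_demo`, `y_demo` and `cv_mae`, the degree range 1 to 8, and the shuffled 5-fold split are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy data resembling the example above (fixed seed for reproducibility)
rng = np.random.default_rng(0)
x_demo = np.linspace(-10, 10, 50)
y_demo = -x_demo**2 + rng.normal(scale=10.0, size=len(x_demo))
X_demo = x_demo.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix

# Shuffle the folds, because the inputs are sorted
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_mae = {}
for degree in range(1, 9):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),  # keeps high powers of x numerically manageable
        LinearRegression(),
    )
    scores = cross_val_score(
        model, X_demo, y_demo, scoring="neg_mean_absolute_error", cv=cv
    )
    cv_mae[degree] = -scores.mean()  # flip the sign back to a positive error

best_degree = min(cv_mae, key=cv_mae.get)
print(cv_mae)
print("Best degree by cross-validated MAE:", best_degree)
```

Each candidate order requires its own round of cross-validation, which is the cost that the alternative approach described next tries to avoid.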

However, there is another possible approach: **We can give the model plenty of degrees of freedom while also pushing it in the direction of simplicity during training**. That is regularization.

**Regularization** means adding a term to our objective function that penalizes high model complexity. Without it, we are only interested in minimizing the error of the predictions:

$$\min_f \sum_{i=1}^{n} E(f(x_i), y_i)$$

where:
- $f(x_i)$: the output of the model for input $x_i$
- $y_i$: the target value for input $x_i$
- $E$: the error function 

Now consider a new, regularized form for the objective function that has an additional term:

$$\min_f \sum_{i=1}^{n} E(f(x_i), y_i) + \lambda R(f)$$


where:
- $R$: the regularization function
- $\lambda$: a non-negative weight controlling the strength of regularization ($\lambda = 0$ recovers the unregularized objective)

$R$ can be any function that penalizes complexity of the model - usually the number or magnitude of parameters.
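To make the regularized objective concrete, here is a small sketch that evaluates it for a linear model. The choices are assumptions for illustration only: squared error for $E$, the sum of absolute parameter values for $R$, and hypothetical names such as `regularized_objective`, `X_toy` and `w_small`.

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """Squared prediction error plus lam times the sum of absolute parameters."""
    residuals = X @ w - y
    data_term = np.sum(residuals**2)   # sum_i E(f(x_i), y_i) with squared error
    penalty = np.sum(np.abs(w))        # R(f): magnitude of the parameters
    return data_term + lam * penalty

# Two identical feature columns: many parameter vectors fit the data equally well
X_toy = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y_toy = np.array([2.0, 4.0, 6.0])

w_small = np.array([2.0, 0.0])   # fits perfectly, small parameters
w_large = np.array([5.0, -3.0])  # also fits perfectly, large parameters

# Without regularization the two solutions are indistinguishable
print(regularized_objective(w_small, X_toy, y_toy, lam=0.0))  # 0.0
print(regularized_objective(w_large, X_toy, y_toy, lam=0.0))  # 0.0

# With regularization the smaller parameter vector is preferred
print(regularized_objective(w_small, X_toy, y_toy, lam=0.5))  # 1.0
print(regularized_objective(w_large, X_toy, y_toy, lam=0.5))  # 4.0
```

Both parameter vectors reproduce the targets exactly, so only the penalty term separates them.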


## Regularization Methods Performing Feature Selection

Of course, a vast number of regularization terms fit the general form above, and many of them have been shown to work well in practice. Among them are methods that **effectively perform feature selection**, i.e. that shrink the contribution of unimportant features to zero. These methods are also referred to as **embedded feature selection**.


### LASSO or L1 Regularization

Let $w = (w_1, \dots, w_k)$ be the vector of the $k$ parameters of the model $f$. Then L1 regularization adds the regularization term

$$R(f) = \|w\|_1 = \sum_{i=1}^{k} |w_i|$$


This penalizes the absolute magnitude of the parameters, or equivalently, the [L1 norm](http://mathworld.wolfram.com/L1-Norm.html) of the parameter vector.

> Lasso is able to achieve both objectives [reducing overfitting and making the model more interpretable] by forcing the sum of the absolute value of the regression coefficients to be less than a fixed value, which forces certain coefficients to be set to zero, effectively choosing a simpler model that does not include those coefficients.

-- [Wikipedia](https://en.m.wikipedia.org/wiki/Lasso_(statistics))

In [None]:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR

In [None]:
LinearRegression()

In [None]:
Lasso(alpha=1.0)
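The feature-selection effect can be observed on synthetic data where only a few features actually matter. The following is a sketch under stated assumptions: the data, the seed, the names `X_syn`, `y_syn` and `true_coef`, and the value `alpha=0.5` are all made up for illustration. In scikit-learn's `Lasso`, the `alpha` parameter plays the role of $\lambda$ (the squared-error term is additionally scaled by $1/(2n)$).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data: 10 features, but only the first 3 influence the target
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X_syn = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [5.0, -3.0, 2.0]
y_syn = X_syn @ true_coef + rng.normal(scale=1.0, size=n_samples)

# Fit the unregularized model and its L1-regularized counterpart
ols = LinearRegression().fit(X_syn, y_syn)
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Non-zero (OLS):  ", np.count_nonzero(ols.coef_))
print("Non-zero (Lasso):", np.count_nonzero(lasso.coef_))
```

Ordinary least squares assigns a small non-zero weight to every irrelevant feature, whereas the L1 penalty sets those weights exactly to zero. The price is that the remaining coefficients are shrunk toward zero as well.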

## Exercise: House Price Regression

1. Work on the feature-rich house price dataset:

In [None]:
data = data_science_learning_paths.datasets.read_house_prices()

In [None]:
data.shape

In [None]:
data.head()

In [None]:
target = "SalePrice"
features = data.columns.difference([target])

2. Explore the influence of regularization for types of models that include regularization and their non-regularized counterparts (e.g. `sklearn.linear_model.LinearRegression` vs `sklearn.linear_model.Lasso`).

Evaluate the model performance with cross-validation and appropriate error metrics. An example is shown below:

In [None]:
from sklearn.metrics import make_scorer, mean_absolute_error

In [None]:
# Reports the positive MAE per fold (lower is better); fine for reporting, not for model selection
scoring = make_scorer(mean_absolute_error)

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(
    estimator=LinearRegression(),
    X=data[features],
    y=data[target],
    scoring=scoring,
    cv=10
)

In [None]:
# Your code here...

## References/Further Reading

- [Cross Validated discussion: Do we still need to do feature selection while using Regularization algorithms?](https://stats.stackexchange.com/questions/149446/do-we-still-need-to-do-feature-selection-while-using-regularization-algorithms)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_