In [None]:
import holoviews as hv
hv.extension('bokeh')
hv.opts.defaults(hv.opts.Curve(width=500),
                 hv.opts.Scatter(width=500, size=4),
                 hv.opts.Histogram(width=500),
                 hv.opts.Slope(color='k', alpha=0.5, line_dash='dashed'),
                 hv.opts.HLine(color='k', alpha=0.5, line_dash='dashed'))   

In [None]:
import numpy as np
import scipy.stats
import pandas as pd
import sklearn
from IPython.display import YouTubeVideo

# Linear models and Basis functions

In previous lessons we learned how to fit lines and hyperplanes to data. But the most general form of a linear model is

$$
g_\theta(x_i) = \sum_{k=0}^K \theta_k \phi_k(x_i),
$$

where the $\phi_k: \mathbb{R}^D \to \mathbb{R}$ are basis functions. Note that $K$ and $D$ are not necessarily related.

In this lesson we will 

- review some examples of basis functions
- learn how to perform linear regression with basis functions using Python
- learn how to avoid overfitting the data using regularization techniques


## Basis functions

### Polynomials

For a unidimensional variable $x_i \in \mathbb{R}, i=1,\ldots,N$, a general polynomial basis is defined as

$$
\phi_k(x_i) = x_i^k
$$

which yields a polynomial model of degree $K$

$$
g_\theta(x_i) = \sum_{k=0}^K \theta_k x_i^k = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \ldots + \theta_K x_i^K \quad \forall i
$$
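As a minimal sketch of the polynomial basis above, the following builds the matrix whose $k$-th column is $x_i^k$ and evaluates the model as a matrix product. The helper name `polynomial_basis` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def polynomial_basis(x, K):
    """Return the N x (K+1) matrix whose k-th column is x**k, for k = 0..K."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=K + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0])
Phi = polynomial_basis(x, K=2)        # columns: 1, x, x^2
theta = np.array([1.0, -2.0, 0.5])    # hypothetical parameters
g = Phi @ theta                       # g_theta(x_i) = sum_k theta_k x_i^k
print(Phi)
print(g)
```

Because the model is linear in $\theta$, it can be fitted with ordinary least squares on `Phi` even though it is non-linear in $x$.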


### Trigonometric

If we want to model periodic behavior, a trigonometric basis is a suitable choice. For a unidimensional variable $x_i \in \mathbb{R}, i=1,\ldots,N$, a trigonometric basis with period $P=1/f_0$ is

$$
\phi_k(x_i) = \begin{cases} 1 & k=0 \\ \cos(2\pi k f_0 x_i) & k \in [1, K] \\  \sin(2\pi (k-K) f_0 x_i) & k \in [K+1, 2K] \end{cases}
$$

which yields a trigonometric model with $2K+1$ parameters

$$
g_\theta(x_i) = \theta_0 + \sum_{k=1}^K \theta_k \cos(2\pi k f_0 x_i) + \sum_{k=1}^K \theta_{k+K} \sin(2\pi k f_0 x_i)
$$
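The trigonometric model can be sketched in the same way. The helper `trigonometric_basis` and the synthetic signal below are assumptions made for illustration; the columns follow the ordering of the model above (constant, $K$ cosines, $K$ sines).

```python
import numpy as np

def trigonometric_basis(x, K, f0):
    """Return the N x (2K+1) matrix [1, cos(2 pi k f0 x), sin(2 pi k f0 x)], k = 1..K."""
    x = np.asarray(x, dtype=float)
    k = np.arange(1, K + 1)
    phase = 2.0 * np.pi * f0 * np.outer(x, k)    # shape (N, K)
    return np.hstack([np.ones((len(x), 1)), np.cos(phase), np.sin(phase)])

x = np.linspace(0.0, 2.0, 50)
Phi = trigonometric_basis(x, K=3, f0=1.0)

# Noise-free synthetic signal with known parameters
y = 2.0 + 3.0 * np.cos(2 * np.pi * x) - np.sin(4 * np.pi * x)
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.round(theta_hat, 3))
```

With noise-free data, least squares should recover the constant, the first cosine and the second sine, leaving the other parameters close to zero.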



### Interactions between variables

We can create a basis that models linear and non-linear interactions between our independent variables.

For example, if we have a bidimensional variable $x_i = (w_{i}, v_{i}), i=1,\ldots,N$, a model with interactions up to the second degree would be

$$
g_\theta(x_i) = \theta_0 + \theta_1 w_i + \theta_2 v_i + \theta_3 w_i^2 + \theta_4 v_i w_i + \theta_5 v_i^2
$$

## Polynomial regression using `scikit-learn`

We can create a polynomial basis from our variables using 

```python
sklearn.preprocessing.PolynomialFeatures(degree=2, # Degree of the polynomial
                                         interaction_only=False, # If True, return only products between distinct features
                                         include_bias=True, # Include the intercept (constant) term
                                         ...
                                        )
``` 

The `fit_transform` method of this object receives the data and returns the transformed data. For example if our dataset has two independent variables $x=[w, v]$, then `PolynomialFeatures(degree=2)` would return $[1, w, v, w^2, wv, v^2]$. 
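A small check of this behavior, using a single hypothetical sample with $w=2$ and $v=3$:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])    # one sample: w=2, v=3

# Full second degree basis: [1, w, v, w^2, w*v, v^2]
features = PolynomialFeatures(degree=2).fit_transform(X_demo)
print(features)

# Only products between distinct features: [1, w, v, w*v]
interactions = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo)
print(interactions)
```

Setting `interaction_only=True` drops the pure powers $w^2$ and $v^2$ and keeps the cross term $wv$.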

The `sklearn.linear_model` submodule offers 

```python
sklearn.linear_model.LinearRegression(fit_intercept=True, # Fit the intercept term
                                      copy_X=True, # Copy X instead of overwriting it
                                      n_jobs=None, # Number of CPU cores
                                      positive=False, # Can be used to force positive coefficients
                                      ...
                                     )
```

which returns an object with the following attributes and methods

- `coef_`: The slopes of the fitted model
- `intercept_`: The intercept of the fitted model
- `fit(X, Y)`: Fits a model to predict the response `Y` given the features `X`
- `predict(X)`: Returns the predicted response for features `X`
- `score(X, Y)`: Returns the coefficient of determination ($R^2$)
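The attributes and methods listed above can be tried on synthetic data. This is only a sketch: the data is generated from a hypothetical line, and the fit is compared against a direct least squares solution on a design matrix with an explicit column of ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=50)   # synthetic data

reg = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = reg.score(x.reshape(-1, 1), y)
y_new = reg.predict(np.array([[2.0], [4.0]]))

# Same fit with plain least squares on [1, x]
A = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(reg.intercept_, reg.coef_, r2)
print(theta)
```

Both approaches should give the same intercept and slope up to numerical precision.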

:::{note}

`LinearRegression` is a wrapper for `scipy.linalg.lstsq`. The advantage of using `LinearRegression` instead of calling `lstsq` directly is that we can combine it with `sklearn.pipeline` to create a polynomial regression model as follows

:::


In [None]:
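from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

def polynomial_regressor(degree):
    # The constant term is fitted by the regressor (fit_intercept=True),
    # so it is left out of the features (include_bias=False)
    # The features are standardized with a StandardScaler step because
    # the `normalize` argument was removed in scikit-learn 1.2
    return Pipeline([('features', PolynomialFeatures(degree, 
                                                     include_bias=False)),
                     ('scaler', StandardScaler()),
                     ('regressor', LinearRegression(fit_intercept=True))])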

polynomial_regressor(degree=2)

Using the data from the previous lesson, let's train a polynomial regressor to predict `Consumption` as a function of `Temperature`

In [None]:
df = pd.read_csv('data/ice_cream.csv', header=0, index_col=0)
df.columns = ['Consumption', 'Income', 'Price', 'Temperature']
Y = df["Consumption"].values
X = df["Temperature"].values

model = polynomial_regressor(degree=2).fit(X.reshape(-1, 1), Y)
# intercept_ and coef_ hold the fitted parameters (theta) of the regressor step
display(model['regressor'].intercept_, 
        model['regressor'].coef_)

And we can plot the predictions of our polynomial regression model

In [None]:
hatx = np.linspace(10, 90, num=100)
haty = model.predict(hatx.reshape(-1, 1))

In [None]:
hv.Overlay([hv.Curve((hatx, haty), 'Temperature', 'Consumption', label='model'), 
            hv.Scatter((X, Y), label='data').opts(color='k')]).opts(legend_position='top_left')

We have moved from lines to parabolas! But,

> How does the result change with the `degree` of our polynomial features?

In [None]:
hatx = np.linspace(np.amin(X), np.amax(X), num=1000)
haty = {}
for degree in [1, 2, 3, 4, 5, 10, 20]:
    model = polynomial_regressor(degree).fit(X.reshape(-1, 1), Y)
    haty[degree] = model.predict(hatx.reshape(-1, 1))

In [None]:
hMap = hv.HoloMap(kdims='degree')
for degree, haty_ in haty.items():
    p_model = hv.Curve((hatx, haty_), 'Temperature', 'Consumption', label='model')
    p_data = hv.Scatter((X, Y), label='data').opts(color='k')
    hMap[degree] = hv.Overlay([p_model, p_data]).opts(legend_position='top_left')

hMap            

:::{danger}

As `degree` grows we start overfitting the data more and more

:::

Let's define what overfitting is and how to combat it using regularization

## Overfitting and regularization


In the previous example `degree` represents the complexity of the model. In general, more complex models give more flexibility to fit the data

But too much complexity causes **overfitting**

- the model fits the noise
- we can't extract the underlying behavior 
- the model **does not generalize** to new data

Ways to avoid overfitting

- Use low complexity models
- Set the complexity using cross-validation 
- Use regularization

In what follows we will focus on regularization
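Before moving on, the cross-validation option listed above can be sketched as follows. The synthetic data, the candidate degrees and the number of folds are assumptions made for this example; the idea is to keep the degree with the lowest validation error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)   # synthetic data

cv_mse = {}
for degree in [1, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          StandardScaler(),
                          LinearRegression())
    # scikit-learn maximizes scores, hence the negated MSE
    scores = cross_val_score(model, x.reshape(-1, 1), y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print(best_degree)
```

The validation error is computed on data that was not used for fitting, so unlike the training error it does not keep decreasing as the degree grows.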

### The Bias-Variance trade-off

Let's assume that our data can be modeled as a "true model" $f$ plus zero-mean Gaussian noise $\varepsilon$ with variance $\sigma^2$

$$
y = f(x) + \varepsilon
$$

and that we approximate $f$ with a linear model $\hat y(x) = \sum_k \theta_k \phi_k(x)$ fitted to the data

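We can measure the quality of our model with the Mean Square Error (MSE). Since the noise has zero mean and is independent of $\hat y$, the cross terms with $\varepsilon$ vanish when we take the expectation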

$$
\begin{align}
\mathbb{E}[(y - \hat y)^2] &= \mathbb{E}[y^2 -2 y \hat y +\hat y^2] \nonumber \\
&= \mathbb{E}[(f+\varepsilon)^2 -2 (f+\varepsilon) \hat y +\hat y^2] \nonumber \\
&= \mathbb{E}[f^2 +2 f \varepsilon + \varepsilon^2 -2 (f+\varepsilon) \hat y +\hat y^2] \nonumber \\
&= \mathbb{E}[\varepsilon^2] + f^2  -2 f \mathbb{E}[\hat y] + \mathbb{E}[\hat y]^2 + \mathbb{E}[\hat y^2] - \mathbb{E}[\hat y]^2  \nonumber \\
&= \mathbb{E}[\varepsilon^2] + (f - \mathbb{E}[\hat y])^2  +\mathbb{E}[(\hat y - \mathbb{E}[\hat y])^2]  \nonumber \\
&= \sigma^2 + (f - \mathbb{E}[\hat y])^2  + \text{Var}[\hat y]  \nonumber 
\end{align}
$$

:::{important}

The MSE can be decomposed as the irreducible error (data noise) + the squared bias of the estimator + the variance of the estimator

:::

The MSE is small only when both the bias and the variance are small, but the two usually pull in opposite directions: more complex models tend to have lower bias and higher variance

The Gauss-Markov theorem says that OLS has the minimum variance among the linear unbiased estimators. But zero-bias models are not necessarily good: they may overfit.

If we are overfitting the data we may want to trade variance for bias. This can be achieved by penalizing the complexity of the model: this is **regularization**.
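The decomposition can be checked numerically with a small Monte Carlo experiment. Everything in this sketch is an assumption made for illustration: the "true model" `f`, the noise level and the polynomial degrees. Many noisy training sets are drawn, a polynomial is fitted to each one, and the squared bias and the variance of the predictions are estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                              # noise standard deviation (assumed)
x_train = np.linspace(-1, 1, 30)
x_test = np.linspace(-0.9, 0.9, 50)

def f(x):
    """Hypothetical true model."""
    return np.sin(np.pi * x)

def bias_variance(degree, n_repeats=500):
    """Monte Carlo estimate of the squared bias and variance of a polynomial fit."""
    preds = np.empty((n_repeats, len(x_test)))
    for r in range(n_repeats):
        # New noisy training set drawn from the same true model
        y_train = f(x_train) + rng.normal(scale=sigma, size=len(x_train))
        coef = np.polynomial.polynomial.polyfit(x_train, y_train, degree)
        preds[r] = np.polynomial.polynomial.polyval(x_test, coef)
    bias2 = np.mean((f(x_test) - preds.mean(axis=0)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {degree: bias_variance(degree) for degree in [1, 3, 9]}
for degree, (bias2, variance) in results.items():
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

In this setup the squared bias should shrink and the variance should grow as the degree increases, which is the trade-off described above.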


### Bayesian (MAP) Least Squares and Ridge regression

If we assume a Gaussian likelihood and a Gaussian prior we can write the log joint as

$$
\begin{align}
\log p(Y, \theta | X) &= \log \prod_{i=1}^N \mathcal{N}(y_i | g_\theta(x_i), \sigma^2) + \log \prod_{j=1}^M \mathcal{N}(\theta_j | 0, \sigma_0^2) \nonumber \\
&= -\frac{N}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} (Y-X\theta)^T (Y - X\theta) -\frac{M}{2} \log(2\pi \sigma_0^2) - \frac{1}{2\sigma_0^2} \|\theta\|^2 \nonumber 
\end{align}
$$

where $X$ is the $N \times M$ matrix of basis functions evaluated on the data, so that $g_\theta(x_i)$ is the $i$-th element of $X\theta$

The MAP estimator of $\theta$ is given by

$$
\begin{align}
\hat \theta &= \text{arg}\max_\theta \log p(Y, \theta | X) \nonumber  \\
&= \text{arg}\max_\theta  - \frac{1}{2\sigma^2} (Y-X\theta)^T (Y - X\theta) - \frac{1}{2\sigma_0^2} \|\theta\|^2 \nonumber \\
&= \text{arg}\min_\theta  (Y-X\theta)^T (Y - X\theta) + \lambda \|\theta\|^2 \nonumber
\end{align}
$$

where $\lambda = \frac{\sigma^2}{\sigma_0^2}$

The solution is obtained by taking the derivative with respect to $\theta$ and setting it to zero

$$
\frac{d}{d\theta} \left[ (Y-X\theta)^T (Y - X\theta) + \lambda \|\theta\|^2 \right] = -2 X^T (Y - X\theta) + 2 \lambda \theta = 0
$$

and finally

$$
\hat \theta = (X^T X + \lambda I)^{-1} X^T Y
$$

which is known as **Ridge regression** or **Tikhonov regularization**
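The closed-form solution above takes only a few lines of NumPy. The helper `ridge_closed_form` and the synthetic data are hypothetical; the result is compared against `sklearn.linear_model.Ridge` without intercept, which minimizes the same cost function.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + lam I) theta = X^T Y."""
    M = X.shape[1]
    # Solving the linear system is more stable than forming the inverse explicitly
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ Y)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
Y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)

theta_hat = ridge_closed_form(X_demo, Y_demo, lam=1.0)
theta_sklearn = Ridge(alpha=1.0, fit_intercept=False).fit(X_demo, Y_demo).coef_
print(theta_hat)
print(theta_sklearn)

# The norm of the solution shrinks as lambda grows
norm_ols = np.linalg.norm(ridge_closed_form(X_demo, Y_demo, lam=0.0))
norm_strong = np.linalg.norm(ridge_closed_form(X_demo, Y_demo, lam=100.0))
print(norm_ols, norm_strong)
```

With $\lambda = 0$ the expression reduces to the ordinary least squares solution.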


:::{important}

The Gaussian prior is equivalent to a penalty on the squared $L_2$ norm of $\theta$. Adding this prior shrinks the parameters towards zero, which forces the solution to be smooth

:::

:::{note}

Different priors yield different regularization effects. For example, a Laplace prior yields an $L_1$ penalty on $\theta$, which forces the solution to be sparse

:::
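The difference between the two penalties can be illustrated with `sklearn.linear_model.Lasso`, which implements the $L_1$ penalty. This is a sketch on synthetic data in which only two of ten features are relevant. Note that `Lasso` scales the squared error by $1/(2N)$, so its `alpha` is not directly comparable to the `alpha` of `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[[0, 3]] = [2.0, -3.0]       # only two relevant features
Y_demo = X_demo @ theta_true + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X_demo, Y_demo)
ridge = Ridge(alpha=0.1).fit(X_demo, Y_demo)

# Count the coefficients that are exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(np.round(lasso.coef_, 3), n_zero_lasso)
print(np.round(ridge.coef_, 3), n_zero_ridge)
```

The $L_1$ penalty should set most of the irrelevant coefficients exactly to zero, while the $L_2$ penalty only makes them small.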

In general, regularizing a model consists of adding a "penalty term" or restriction to the cost/loss function. Typically the additional term penalizes overly complex models. Because of this, regularization can help to avoid overfitting in complex models and to solve ill-posed problems, e.g. when we have more parameters than data samples

But there is no free lunch. We now have the additional task of choosing $\lambda$

- Cross-validation: Minimize validation error
- L-curve: Plot $ \log (Y-X\theta)^T (Y - X\theta)$ vs $ \log \|\theta\|^2$ and find the elbow
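The cross-validation option can be sketched with `sklearn.linear_model.RidgeCV`, which fits the model for a grid of candidate values and keeps the one with the best validation score. The synthetic data, the degree and the grid of $\lambda$ values are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=60)   # synthetic data

lambdas = np.logspace(-4, 2, 13)       # candidate regularization values
model = make_pipeline(PolynomialFeatures(10, include_bias=False),
                      StandardScaler(),
                      RidgeCV(alphas=lambdas, cv=5))
model.fit(x.reshape(-1, 1), y)

best_lambda = model[-1].alpha_         # value selected by cross-validation
y_pred = model.predict(x.reshape(-1, 1))
print(best_lambda)
```

The selected value depends on the data and on the grid, so it is worth checking that it does not fall on one of the edges of the grid.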

### Example: Ridge Regression in Python

The `sklearn` library provides

```python
sklearn.linear_model.Ridge(alpha=1.0, # The regularization parameter (lambda)
                           fit_intercept=True, # Whether to include the intercept (constant)
                           ...
)
```

The attributes and methods of `LinearRegression` are also available in this object

We will create a pipeline for the polynomial features plus the ridge regression and explore the influence of the regularization parameter

In [None]:
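from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
hatx = np.linspace(np.amin(X), np.amax(X), num=1000)
haty = {}
mse = {}
for degree in [1, 2, 3, 5, 10, 20]:
    for lamb in [0.0, 1e-3, 1e-1,  10]:
        # The features are standardized so that the penalty acts on all of them at the same scale
        # alpha is multiplied by the number of samples, which matches the scaling
        # of the `normalize=True` option removed in scikit-learn 1.2
        non_linear_regressor = Pipeline([('features', PolynomialFeatures(degree, include_bias=False)),
                                         ('scaler', StandardScaler()),
                                         ('regressor', Ridge(alpha=lamb * len(Y)))])
        # Fit
        model = non_linear_regressor.fit(X.reshape(-1, 1), Y)
        # Mean squared error on the training data
        mse[degree, lamb] = mean_squared_error(Y, model.predict(X.reshape(-1, 1)))
        # Predict on new data
        haty[degree, lamb] = model.predict(hatx.reshape(-1, 1))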

In [None]:
hMap = hv.HoloMap(kdims=['degree', 'lambda'])
for (degree, lamb), haty_ in haty.items():
    p_model = hv.Curve((hatx, haty_), 'Temperature', 'Consumption', label='model')
    p_data = hv.Scatter((X, Y), label='data').opts(color='k')
    hMap[degree, lamb] = hv.Overlay([p_model, p_data]).opts(legend_position='top_left')

hMap            

Observations:

- Regularization decreases overfitting in complex models
- But an excessive penalization may induce a trivial solution, for example a straight horizontal line

## Extra topics

### A note on independence and correlation

Independence implies uncorrelatedness, but the reverse is not true. Two variables can have zero correlation but still be dependent. Also remember that linear regression (correlation) is only sensitive to linear relationships

If we are interested in testing independence we could rely on its definition: two variables are independent if and only if their joint density factorizes

$$
p(x,y) = p(x)p(y)
$$

Several methods are based on this, for example Shannon's **Mutual Information**

$$
I(X,Y) = \int \int f_{XY}(x,y) \log \frac{f_{XY}(x,y)}{f_{X}(x) f_Y(y)} dx dy
$$

and the [Correlation distance](https://arxiv.org/pdf/0803.4101.pdf)

$$
R(X,Y) = \int \int |f_{XY}(x,y)  - f_{X}(x) f_Y(y)| dx dy
$$

These methods, however, require that we estimate the joint and the marginal densities (e.g. with KDE, histograms or parametric models). For categorical variables we can use the **chi square test**
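The difference between uncorrelated and independent can be seen in a short example. This sketch uses synthetic data where $y$ depends on $x$ only through $x^2$, and estimates the mutual information with `sklearn.feature_selection.mutual_info_regression`, a nearest-neighbor estimator that is used here in place of an explicit density estimate.

```python
import numpy as np
import scipy.stats
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.05, size=2000)   # dependent on x, but not linearly

# Pearson correlation only detects linear relationships
r, _ = scipy.stats.pearsonr(x, y)

# Mutual information (in nats) is zero only for independent variables
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"correlation: {r:.3f}, mutual information: {mi:.3f}")
```

The correlation should be close to zero while the mutual information is clearly positive, which shows that $x$ and $y$ are dependent despite being uncorrelated.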


### Related topics 

- (Hastie 3.4 and 3.8) L1 regularization and Least Absolute Shrinkage and Selection Operator (LASSO)
- Robust regression: Least absolute regression and M-estimators for data with outliers (non-Gaussian)
- (Hastie 6 & Bishop 6) Kernel (non-parametric) regression  

Some of these topics can be found at [Huijse, Regresión](https://docs.google.com/presentation/d/1UUpK4zSdzRcS79V7_wU9nXe-sR7qYLEWhbmid-Rfp1k/edit#slide=id.g28044c0f85_0_34)

I also suggest the following lecture by Judea Pearl on **causality**

In [None]:
YouTubeVideo('ZaPV1OSEpHw')