In [None]:
%%HTML
<!-- Improve display on a projector -->
<style>
.rendered_html {font-size: 1.2em; line-height: 150%;}
div.prompt {min-width: 0ex; padding: 0px;}
.container {width:95% !important;}
</style>

In [None]:
%autosave 0
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
import torch

# Bayesian Linear Regression

## Summary of OLS

In linear regression we have a 

- continuous one-dimensional target $y$ 
- continuous D-dimensional input $x$ 

related by a linear mapping

$$
b + \sum_{d=1}^D w_d x_d = f_\theta(x)  \rightarrow y
$$

> The model is specified by $\theta=(b, w_1, w_2, \ldots, w_D)$

Typically, we fit this model by minimizing the sum of squared errors

$$
\min_\theta\sum_n \left(y_n - f_\theta(x_n) \right)^2 = (Y - \Phi \theta)^T (Y - \Phi \theta)
$$

whose solution is

$$
\theta = (\Phi^T \Phi)^{-1} \Phi^T Y,
$$

where $\Phi  = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1D} \\ 
1 & x_{21} & x_{22} & \ldots & x_{2D} \\
1 & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \ldots & x_{ND} \end{pmatrix}$,  $Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$ and  $\theta =  \begin{pmatrix} b \\ w_1 \\ \vdots \\ w_D \end{pmatrix}$

> This is known as the ordinary least squares (OLS) solution
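
As a minimal sketch of the formula above, the following computes the OLS solution with NumPy on synthetic data. The data, the seed and the names `X`, `theta_ols` and `theta_lstsq` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): y = 1 + 2*x1 - 3*x2 + small noise
N, D = 50, 2
X = rng.normal(size=(N, D))
Y = 1.0 + X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=N)

# Design matrix: a column of ones (for the bias b) followed by the inputs
Phi = np.hstack([np.ones((N, 1)), X])

# Normal equations, solved as a linear system instead of forming the inverse
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Reference solution from NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

print(theta_ols)    # ordered as (b, w_1, w_2)
print(theta_lstsq)
```

Solving the linear system is usually preferable to computing $(\Phi^T \Phi)^{-1}$ explicitly, since it is cheaper and numerically more stable.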

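**Note: Linear regression is *linear in the parameters***

If we apply a fixed transformation (basis function) to the inputs, the model stays linear in $\theta$ and the solution has the same form. The only difference is in $\Phi$, whose columns now hold the transformed inputs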

For example
- Polynomial basis regression $f_\theta(x) = \sum_d w_d x^d + b$ 
- Sine-wave basis regression $f_\theta(x) = \sum_d \alpha_d \cos(2\pi d f_0 x)  + \sum_d \beta_d \sin(2\pi d f_0 x) + c$ 
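
A short sketch of the polynomial case, assuming noiseless quadratic data chosen for illustration. The helper `polynomial_design_matrix` is a hypothetical name, not part of any library.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Return the N x (degree + 1) matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(-1.0, 1.0, 20)
y = 0.5 - x + 2.0 * x**2   # noiseless quadratic (illustrative data)

Phi = polynomial_design_matrix(x, degree=2)

# Same OLS formula as in the linear case; only Phi has changed
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta)  # coefficients ordered as (b, w_1, w_2)
```

The model is nonlinear in $x$ but still linear in $\theta$, which is why the closed-form solution carries over unchanged.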

## Probabilistic linear regression

We can assume that observations are noisy and write

$$
\begin{align}
y &= f_\theta(x) + \epsilon \nonumber \\
&= b + \sum_{d=1}^D w_d x_d   + \epsilon, \nonumber
\end{align}
$$

If the noise is independent and identically distributed (iid), zero-mean Gaussian with variance $\sigma_\epsilon^2$, then

$$
p(y|x, \theta) = \mathcal{N}\left(y| f_\theta(x) , \sigma_\epsilon^2 \right)
$$

Additionally, we may want to discourage large values of $\theta~$ by placing a prior

$$
p(\theta) = \mathcal{N}(0, \Sigma_\theta)
$$

The prior on the parameters gives us the space of possible models (before presenting data)

In [None]:
line_x = np.linspace(-5, 5, num=100)[:, None].astype('float32') #100x1

sw, sb = 5., 5.
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)
for i in range(100):
    linear_layer = torch.nn.Linear(1, 1)
    torch.nn.init.normal_(linear_layer.weight, 0.0, sw)
    torch.nn.init.normal_(linear_layer.bias, 0.0, sb)
    #y = W*x + b
    line_y = linear_layer(torch.from_numpy(line_x)).detach().numpy()
    ax.plot(line_x, line_y, c='tab:blue', alpha=0.25)

We constrain the space of solutions by presenting data

### Point-estimate solution (MAP)

For a dataset $\mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}$

The Maximum a posteriori estimator of $\theta~$ is given by

$$
\begin{align}
\hat \theta &= \text{arg}\max_\theta \log \left[ p(\mathcal{D}| \theta, \sigma_\epsilon^2) ~ \mathcal{N} (\theta|0, \Sigma_\theta) \right] \nonumber  \\
&= \text{arg}\min_\theta  \frac{1}{2\sigma_\epsilon^2} (Y-\Phi\theta)^T (Y - \Phi\theta) + \frac{1}{2} \theta^T \Sigma_\theta^{-1} \theta  \nonumber
\end{align}
$$

where the log likelihood is

$$
\log p(\mathcal{D}| \theta, \sigma_\epsilon^2) = \sum_{n=1}^N \log \mathcal{N}(y_n|f_\theta(x_n),\sigma_\epsilon^2)
$$

and the result is

$$
\hat \theta = (\Phi^T \Phi + \lambda )^{-1} \Phi^T Y
$$

where $\lambda = \sigma_\epsilon^2 \Sigma_\theta^{-1}$

> This is the ridge regression or **regularized least squares** solution
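
The MAP formula can be sketched as follows, assuming an isotropic prior covariance and synthetic data. The function name `map_estimate` and all numeric values are illustrative assumptions.

```python
import numpy as np

def map_estimate(Phi, Y, sigma_eps, Sigma_theta):
    """MAP estimate under a zero-mean Gaussian prior with covariance Sigma_theta."""
    lam = sigma_eps**2 * np.linalg.inv(Sigma_theta)  # lambda = sigma_eps^2 * Sigma_theta^{-1}
    return np.linalg.solve(Phi.T @ Phi + lam, Phi.T @ Y)

rng = np.random.default_rng(0)

# Synthetic line with unit-variance noise (illustrative data)
x = rng.uniform(-5, 5, size=10)
Phi = np.column_stack([np.ones_like(x), x])
Y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=10)

theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
theta_map = map_estimate(Phi, Y, sigma_eps=1.0, Sigma_theta=0.1 * np.eye(2))

print(theta_ols, theta_map)
```

Changing `Sigma_theta` and comparing `theta_map` against `theta_ols` is one way to explore how strongly the prior pulls the estimate towards zero.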

What happens if the variance of the prior tends to infinity (uninformative prior)?


### Bayesian solution for the parameters

In this case we want the posterior of $\theta~$ given the dataset

Assuming that we know $\sigma_\epsilon$

$$
p(\theta|\mathcal{D}, \sigma_\epsilon^2) \propto  \mathcal{N}(Y| \Phi \theta, I\sigma_\epsilon^2) \mathcal{N}(\theta| \theta_0, \Sigma_{\theta_0})
$$

The likelihood is normal and the prior is normal, so

$$
p(\theta|\mathcal{D}, \sigma_\epsilon^2) = \frac{1}{Z} \exp \left ( -\frac{1}{2\sigma_\epsilon^2} (Y-\Phi\theta)^T (Y - \Phi\theta)  - \frac{1}{2} (\theta - \theta_{0})^{T} \Sigma_{\theta_0}^{-1} (\theta - \theta_0)\right)
$$

and (with a bit of algebra) it can be shown that this corresponds to a normal distribution 

$$
p(\theta|\mathcal{D}, \sigma_\epsilon^2) = \mathcal{N}(\theta|\theta_1, \Sigma_{\theta_1} )
$$

with parameters 
$$
\Sigma_{\theta_1} = \sigma_\epsilon^2 (\Phi^T \Phi + \sigma_\epsilon^2  \Sigma_{\theta_0}^{-1})^{-1}
$$

$$
\theta_1 = \Sigma_{\theta_1} \Sigma_{\theta_0}^{-1} \theta_{0} + \frac{1}{\sigma_\epsilon^2} \Sigma_{\theta_1} \Phi^T Y
$$

> **Iterative framework:** We can present data and update the distribution of $\theta~$
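
A minimal sketch of this iterative use of the update equations: presenting the observations one at a time, with each posterior acting as the next prior, should give the same result as a single batch update. The function name `posterior_update` and the three data points are assumptions for illustration.

```python
import numpy as np

def posterior_update(Phi, Y, theta0, Sigma0, sigma_eps):
    """Return the posterior mean and covariance of theta after observing (Phi, Y)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma1 = sigma_eps**2 * np.linalg.inv(Phi.T @ Phi + sigma_eps**2 * Sigma0_inv)
    theta1 = Sigma1 @ Sigma0_inv @ theta0 + Sigma1 @ Phi.T @ Y / sigma_eps**2
    return theta1, Sigma1

# Prior and noise level (illustrative values)
theta0, Sigma0, sigma_eps = np.zeros(2), 25.0 * np.eye(2), 1.0

# Three observations of a line; each row of Phi is (1, x_n)
Phi = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 3.0]])
Y = np.array([-0.8, 0.7, 2.9])

# Batch update: all observations at once
theta_batch, Sigma_batch = posterior_update(Phi, Y, theta0, Sigma0, sigma_eps)

# Sequential update: the posterior after each point becomes the next prior
theta_seq, Sigma_seq = theta0, Sigma0
for phi_n, y_n in zip(Phi, Y):
    theta_seq, Sigma_seq = posterior_update(
        phi_n[None, :], np.array([y_n]), theta_seq, Sigma_seq, sigma_eps
    )

print(theta_batch, theta_seq)
```

The two routes agree because the posterior precision and the precision-weighted mean are both additive in the observations.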

**Example: Fitting a line**

We assume a zero-mean and diagonal covariance normal prior

In [None]:
# Initialization
mw, mb = 0., 0.
sw, sb = 5., 5.
So = np.diag(np.array([sb, sw])**2)
mo = np.array([mb, mw])
seps = 1. # What happens if this is larger/smaller?

The empirical distribution of $\theta$

In [None]:
theta_plot = np.random.multivariate_normal(mo, So, size=10000)

import corner
figure = corner.corner(theta_plot, smooth=1.,
                       labels=["b", "w"], bins=20, 
                       quantiles=[0.16, 0.5, 0.84], range=[(-8, 8), (-8, 8)],
                       show_titles=True, title_kwargs={"fontsize": 12})

We observe data at $x=2$, $y=2$ and we update the parameters

In [None]:
#Update
Phi = np.array([[1.0, 2.0]])
y = np.array([2.0])
Sn = seps**2*np.linalg.inv(np.dot(Phi.T, Phi) +  seps**2*np.linalg.inv(So))
mn = np.dot(Sn, np.linalg.solve(So, mo)) + np.dot(Sn, np.dot(Phi.T, y))/seps**2
display(Sn, mn)

With this, the space of possible models is constrained

In [None]:
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)

for i in range(100):
    linear_layer = torch.nn.Linear(1, 1)
    rparam = torch.from_numpy(np.random.multivariate_normal(mn, Sn).astype('float32'))
    linear_layer.weight.data = rparam[1].reshape(1, 1)  # weight has shape (out_features, in_features)
    linear_layer.bias.data = rparam[0].reshape(1)  # bias has shape (out_features,)
    line_y = linear_layer(torch.from_numpy(line_x)).detach().numpy()
    ax.plot(line_x, line_y, c='tab:blue', alpha=0.25)

ax.errorbar(2, 2, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);

and the updated empirical distribution is 

In [None]:
theta_plot = np.random.multivariate_normal(mn, Sn, size=10000)

figure = corner.corner(theta_plot, smooth=1.,
                       labels=["b", "w"], bins=20, 
                       quantiles=[0.16, 0.5, 0.84], range=[(-8, 8), (-8, 8)],
                       show_titles=True, title_kwargs={"fontsize": 12})

Let's assume that we observe an additional data point at $x=-2$, $y=-2$

In [None]:
# Initialization
So = Sn
mo = mn
#Update
Phi = np.array([[1.0, -2.0]])
y = np.array([-2.0])
Sn = seps**2*np.linalg.inv(np.dot(Phi.T, Phi) +  seps**2*np.linalg.inv(So))
mn = np.dot(Sn, np.linalg.solve(So, mo)) + np.dot(Sn, np.dot(Phi.T, y))/seps**2
display(Sn, mn)

The space of possible models is further reduced

In [None]:
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)

for i in range(100):
    linear_layer = torch.nn.Linear(1, 1)
    rparam = torch.from_numpy(np.random.multivariate_normal(mn, Sn).astype('float32'))
    linear_layer.weight.data = rparam[1].reshape(1, 1)  # weight has shape (out_features, in_features)
    linear_layer.bias.data = rparam[0].reshape(1)  # bias has shape (out_features,)
    line_y = linear_layer(torch.from_numpy(line_x)).detach().numpy()
    ax.plot(line_x, line_y, c='tab:blue', alpha=0.2)

ax.errorbar(2, 2, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);
ax.errorbar(-2, -2., xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);

And the empirical distribution is constrained even more

In [None]:
theta_plot = np.random.multivariate_normal(mn, Sn, size=10000)

figure = corner.corner(theta_plot, smooth=1.,
                       labels=["b", "w"], bins=20, 
                       quantiles=[0.16, 0.5, 0.84], range=[(-8, 8), (-8, 8)],
                       show_titles=True, title_kwargs={"fontsize": 12})

### Bayesian solution for the predictions

Don't forget our goal

> We train the model to predict $y$ for new values of $x$

In the Bayesian setting we are interested in the **posterior predictive distribution**

This is obtained by marginalizing $\theta$

$$
\begin{align}
p(y | x, \mathcal{D}) &= \int p(y, \theta | x, \mathcal{D}) d\theta \nonumber \\
&= \int p(y| \theta, x, \mathcal{D}) p(\theta| \mathcal{D}) d\theta \nonumber \\
&= \int p(y| \theta, x) p(\theta| \mathcal{D}) d\theta, \nonumber 
\end{align}
$$

Note that $y$ is conditionally independent of $\mathcal{D}$ given $\theta$

For our linear regression
$$
\begin{align}
p(y|x, \mathcal{D}, \sigma_\epsilon^2) &= \int p(y|f_\theta(x), \sigma_\epsilon^2) p(\theta| \theta_{N}, \Sigma_{\theta_N}) d\theta \nonumber \\
&= \mathcal{N}\left(y|f_{\theta_N} (x), \sigma_\epsilon^2 + x^T \Sigma_{\theta_N} x\right)
\end{align}
$$

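**The posterior predictive is Gaussian** (the convolution of Gaussians is Gaussian). In the variance term, $x$ denotes the input augmented with a leading 1 for the bias, i.e. a row of $\Phi$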
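
The closed form can be checked numerically with a Monte Carlo sketch of the marginalization: draw $\theta$ from the posterior, then draw $y$ given $\theta$. The posterior parameters `theta_N` and `Sigma_N`, the test input and the sample size are illustrative assumptions, not values from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative posterior over theta = (b, w) and noise level (assumed values)
theta_N = np.array([0.0, 1.0])
Sigma_N = np.array([[0.5, 0.1],
                    [0.1, 0.2]])
sigma_eps = 1.0

phi = np.array([1.0, 3.0])  # augmented input (1, x) for x = 3

# Closed-form mean and variance of the posterior predictive
mean_cf = phi @ theta_N
var_cf = sigma_eps**2 + phi @ Sigma_N @ phi

# Monte Carlo: theta ~ N(theta_N, Sigma_N), then y ~ N(phi^T theta, sigma_eps^2)
n_samples = 200_000
thetas = rng.multivariate_normal(theta_N, Sigma_N, size=n_samples)
ys = thetas @ phi + sigma_eps * rng.normal(size=n_samples)
mean_mc, var_mc = ys.mean(), ys.var()

print(mean_cf, var_cf)
print(mean_mc, var_mc)
```

The sample mean and variance should approach the closed-form values as the number of samples grows.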

If we consider that $N$ samples were presented and that $\theta_0=0$ then 

$$
\theta_{N} =  (\Phi^T \Phi + \sigma_\epsilon^2 \Sigma_{\theta_0}^{-1})^{-1} \Phi^T Y
$$

which is the MAP estimator, and

$$
\Sigma_{\theta_N} = \sigma_\epsilon^2 (\Phi^T \Phi + \sigma_\epsilon^2 \Sigma_{\theta_0}^{-1})^{-1}
$$


Finally, the variance (uncertainty) for the new $x$ is 
$$
\sigma^2(x) = \sigma_\epsilon^2 + x^T \Sigma_{\theta_N} x
$$

> The variance of the prediction has contributions from the noise (irreducible) and from the uncertainty in the parameters (which shrinks as more data is observed)
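
As a rough sketch, the two contributions can be computed separately. The helper name `predictive_variance` is hypothetical, and the posterior covariance below is built from an assumed setup: two observations at $x = \pm 2$, a $\mathcal{N}(0, 25 I)$ prior and $\sigma_\epsilon = 1$.

```python
import numpy as np

def predictive_variance(x, Sigma_N, sigma_eps):
    """Return the (noise, model) terms of the predictive variance at scalar inputs x."""
    Phi_x = np.column_stack([np.ones_like(x), x])             # rows are (1, x)
    model = np.einsum('nd,de,ne->n', Phi_x, Sigma_N, Phi_x)   # phi^T Sigma_N phi for each row
    noise = np.full_like(model, sigma_eps**2)
    return noise, model

# Posterior covariance for the assumed setup
sigma_eps = 1.0
Phi = np.array([[1.0, 2.0], [1.0, -2.0]])
Sigma_0 = 25.0 * np.eye(2)
Sigma_N = sigma_eps**2 * np.linalg.inv(Phi.T @ Phi + sigma_eps**2 * np.linalg.inv(Sigma_0))

x_new = np.array([0.0, 2.0, 5.0])
noise, model = predictive_variance(x_new, Sigma_N, sigma_eps)
print(noise)
print(model)
```

The noise term is the same at every input, while the model term depends on where the prediction is made.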

Uncertainty grows when we depart from the observed data points

In [None]:
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)

Phi_x = np.vstack(([1]*100, line_x[:, 0])).T
sx = np.sqrt(np.diag(seps**2 + np.dot(np.dot(Phi_x, Sn), Phi_x.T)))
ax.plot(line_x, np.dot(Phi_x, mn), '--')
ax.fill_between(line_x[:, 0], np.dot(Phi_x, mn)-2*sx, np.dot(Phi_x, mn)+2*sx, alpha=0.5)
ax.errorbar(2, 2, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);
ax.errorbar(-2, -2., xerr=0, yerr=2*seps, fmt='none', c='k');

**Activity:**

See how the posterior predictive distribution changes with increasing/decreasing $\sigma_\epsilon$ and $\Sigma_{\theta_0}$

# Self-study

- [Chapter 18 of D. Barber's book](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online)
- In all this we assumed that $\sigma_\epsilon$ is known. For a Bayesian treatment with unknown noise variance we would use a normal-inverse-gamma prior

## Model Evidence for Bayesian Linear Regression

Next iteration