In [None]:
%%HTML
<!-- Improve display on a projector -->
<style>
.rendered_html {font-size: 1.2em; line-height: 150%;}
div.prompt {min-width: 0ex; padding: 0px;}
.container {width:95% !important;}
</style>

In [None]:
%autosave 0
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

# Linear Regression

In linear regression we have a 
- continuous one-dimensional target $y$ 
- continuous M-dimensional input $x$ 

related by a linear mapping

$$
\sum_{k=1}^M w_k x_k + b = f_\theta(x)  \rightarrow y
$$

> The model is specified by $\theta=(b, w_1, w_2, \ldots, w_M)$

Typically, we fit this model by 
$$
\min_\theta\sum_i \left(y_i - f_\theta(x_i) \right)^2
$$

whose solution is

$$
\theta = (\Phi^T \Phi)^{-1} \Phi^T Y,
$$

where $\Phi  = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1M} \\ 
1 & x_{21} & x_{22} & \ldots & x_{2M} \\
1 & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \ldots & x_{NM} \end{pmatrix}$,  $Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$ and  $\theta =  \begin{pmatrix} b \\ w_1 \\ \vdots \\ w_M \end{pmatrix}$

> This is known as the ordinary least squares (OLS) solution
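
The closed-form solution can be checked numerically. The following is a minimal sketch on synthetic data; the data-generating coefficients, the noise level and all variable names are assumptions made for illustration. It solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 1.5 + 2.0*x1 - 0.5*x2 + noise
N, M = 200, 2
X = rng.normal(size=(N, M))
true_theta = np.array([1.5, 2.0, -0.5])  # ordered as (b, w_1, w_2)
Y = true_theta[0] + X @ true_theta[1:] + 0.1 * rng.normal(size=N)

# Design matrix Phi: a column of ones followed by the inputs
Phi = np.hstack([np.ones((N, 1)), X])

# Normal equations (Phi^T Phi) theta = Phi^T Y, solved without an explicit inverse
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# The same least-squares problem solved by NumPy's dedicated routine
theta_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

print(theta_ols)
print(theta_lstsq)
```

Both estimates should agree to numerical precision and lie close to the coefficients used to generate the data.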

#### Note: Linear regression is *linear in the parameters*

If we apply (possibly nonlinear) transformations to the inputs, the solution keeps the same form. The only difference is in $\Phi$, whose columns now hold the transformed inputs

For example
- Polynomial basis regression $f_\theta(x) = \sum_k w_k x^k + b$ 
- Sine-wave basis regression $f_\theta(x) = \sum_k a_k \cos(2\pi k f_0 x)  + \sum_k b_k \sin(2\pi k f_0 x) + c$ 
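
As a sketch of the polynomial case, the snippet below builds $\Phi$ with `np.vander` and reuses the ordinary least-squares solver unchanged. The cubic used to generate the data, the noise level and the chosen degree are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: a cubic in x plus a little noise
x = np.linspace(-1, 1, 50)
y = 0.5 - x + 2.0 * x**3 + 0.01 * rng.normal(size=x.size)

# Polynomial design matrix with columns 1, x, x^2, x^3
degree = 3
Phi = np.vander(x, degree + 1, increasing=True)

# Same least-squares problem as in the linear case; only Phi changed
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # coefficients ordered as (b, w_1, w_2, w_3)
```

The model is nonlinear in $x$ but still linear in $\theta$, which is why the same solver applies.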

### Probabilistic linear regression

We can assume that observations are noisy and write

$$
y = f_\theta(x) + \epsilon = \sum_{k=1}^M w_k x_k  + b + \epsilon,
$$

If the noise is independent and identically distributed (iid) as a zero-mean Gaussian with variance $\sigma_\epsilon^2$ then

$$
p(y|x, \theta) = \mathcal{N}\left(y| f_\theta(x) , \sigma_\epsilon^2 \right)
$$

Additionally, we may want to discourage large values of $\theta~$ by placing a prior

$$
p(\theta) = \mathcal{N}(0, \Sigma_\theta)
$$

The prior on the parameters gives us the space of possible models (before presenting data)

In [None]:
import torch

x = np.linspace(-5, 5, num=100).astype('float32')

sw, sb = 5., 5.
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)
for i in range(10):
    linear_layer = torch.nn.Linear(1, 1)
    #W = sw*np.random.randn(1)
    #b = sb*np.random.randn(1)
    torch.nn.init.normal_(linear_layer.weight, 0.0, sw)
    torch.nn.init.normal_(linear_layer.bias, 0.0, sb)
    #y = W*x + b
    y = linear_layer(torch.from_numpy(x).unsqueeze(1)).detach().numpy()
    ax.plot(x, y, label='w: %0.2f, b: %0.2f' %(linear_layer.weight, linear_layer.bias))
#plt.legend()

We constrain the space of solutions by presenting data

### Point-estimate solution (MAP)

For a dataset $\mathcal{D} = \{ (x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \}$

The maximum a posteriori (MAP) estimator of $\theta~$ is given by
$$
\begin{align}
\hat \theta &= \text{arg}\max_\theta \log \left[ p(\mathcal{D}| \theta, \sigma_\epsilon^2) ~ \mathcal{N} (\theta|0, \Sigma_\theta) \right] \nonumber  \\
&= \text{arg}\min_\theta  \frac{1}{2\sigma_\epsilon^2} (Y-\Phi\theta)^T (Y - \Phi\theta) + \frac{1}{2} \theta^T \Sigma_\theta^{-1} \theta  \nonumber
\end{align}
$$
where the log likelihood is
$$
\log p(\mathcal{D}| \theta, \sigma_\epsilon^2) = \sum_{i=1}^N \log \mathcal{N}(y_i|f_\theta(x_i),\sigma_\epsilon^2)
$$
and the result is
$$
\hat \theta = (\Phi^T \Phi + \lambda )^{-1} \Phi^T Y
$$
where $\lambda = \sigma_\epsilon^2 \Sigma_\theta^{-1}$

> This is the ridge regression or **regularized least squares** solution

What happens if the variance of the prior tends to infinity (uninformative prior)?

Then $\lambda \to 0$ and we recover the maximum likelihood estimate (MLE), which is the ordinary least squares solution
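
The effect of the prior variance can be seen in a small numerical sketch. The toy data, the helper name `map_estimate` and the two prior covariances are assumptions chosen for illustration; the formula is the regularized least squares solution with $\lambda = \sigma_\epsilon^2 \Sigma_\theta^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy data: N = 20 samples of a noisy line with a single input
N = 20
x = rng.uniform(-5, 5, size=N)
sigma_eps = 0.5
Y = 1.0 + 0.7 * x + sigma_eps * rng.normal(size=N)
Phi = np.column_stack([np.ones(N), x])

def map_estimate(Phi, Y, sigma_eps, Sigma_theta):
    """MAP estimate for a zero-mean Gaussian prior with covariance Sigma_theta."""
    lam = sigma_eps**2 * np.linalg.inv(Sigma_theta)
    return np.linalg.solve(Phi.T @ Phi + lam, Phi.T @ Y)

theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
theta_tight = map_estimate(Phi, Y, sigma_eps, 0.01 * np.eye(2))  # strong prior
theta_wide = map_estimate(Phi, Y, sigma_eps, 1e8 * np.eye(2))    # nearly flat prior

print(theta_ols)
print(theta_tight)
print(theta_wide)
```

A tight prior shrinks the estimate towards zero, while a very wide prior gives an estimate that is numerically indistinguishable from OLS.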

### Bayesian solution for the parameters

In this case we want the posterior of $\theta~$ given the dataset

Assuming that we know $\sigma_\epsilon$

$$
p(\theta|\mathcal{D}, \sigma_\epsilon^2) \propto  \mathcal{N}(Y| \Phi \theta, I\sigma_\epsilon^2) \mathcal{N}(\theta| \theta_0, \Sigma_{\theta_0})
$$

The likelihood is normal and the prior is normal, so
$$
\begin{align}
p(\mathcal{D}| \theta, \sigma_\epsilon^2) p(\theta|\theta_0, \Sigma_{\theta_0}) = \frac{1}{Z} \exp \left ( -\frac{1}{2\sigma_\epsilon^2} (Y-\Phi\theta)^T (Y - \Phi\theta)  - \frac{1}{2} (\theta - \theta_{0})^{T} \Sigma_{\theta_0}^{-1} (\theta - \theta_0)\right)
\end{align}
$$

With a bit of algebra, this corresponds to a normal distribution with parameters 
$$
\Sigma_{\theta_1} = \sigma_\epsilon^2 (\Phi^T \Phi + \sigma_\epsilon^2  \Sigma_{\theta_0}^{-1})^{-1}
$$
$$
\theta_1 = \Sigma_{\theta_1} \Sigma_{\theta_0}^{-1} \theta_{0} + \frac{1}{\sigma_\epsilon^2} \Sigma_{\theta_1} \Phi^T Y
$$

> **Iterative framework:** We can present data and update the distribution of $\theta~$
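
A quick way to convince oneself that the update can be applied iteratively is to check that presenting samples one at a time gives the same posterior as presenting them all at once. The sketch below does this; the helper name `posterior_update`, the prior and the two observations are assumptions for illustration.

```python
import numpy as np

def posterior_update(m0, S0, Phi, y, sigma_eps):
    """One Gaussian posterior update following the two equations above."""
    S0_inv = np.linalg.inv(S0)
    S1 = sigma_eps**2 * np.linalg.inv(Phi.T @ Phi + sigma_eps**2 * S0_inv)
    m1 = S1 @ S0_inv @ m0 + S1 @ Phi.T @ y / sigma_eps**2
    return m1, S1

# Assumed prior and noise level, for illustration only
m0 = np.zeros(2)
S0 = np.diag([25.0, 25.0])
sigma_eps = 0.25

# Two illustrative observations, (x, y) = (2, 0) and (-2, -2); rows of Phi are (1, x)
Phi = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([0.0, -2.0])

# Batch: both samples at once
m_batch, S_batch = posterior_update(m0, S0, Phi, y, sigma_eps)

# Sequential: the posterior after one sample becomes the prior for the next
m_seq, S_seq = m0, S0
for phi_i, y_i in zip(Phi, y):
    m_seq, S_seq = posterior_update(m_seq, S_seq, phi_i[None, :], np.array([y_i]), sigma_eps)

print(m_batch, m_seq)
```

With this weak prior the posterior mean lands close to the line through the two points (intercept near $-1$, slope near $0.5$).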

In [None]:
# Initialization
mw, mb = 0., 0.
sw, sb = 5., 5.
So = np.diag(np.array([sb, sw])**2)
mo = np.array([mb, mw])
seps = 0.25 # What happens if this is larger/smaller?

Sample $x=2$, $y=0$ is presented

In [None]:
#Update
Phi = np.array([[1.0, 2.0]])
y = np.array([0.0])
Sn = seps**2*np.linalg.inv(np.dot(Phi.T, Phi) +  seps**2*np.linalg.inv(So))
mn = np.dot(Sn, np.linalg.solve(So, mo)) + np.dot(Sn, np.dot(Phi.T, y))/seps**2
display(Sn, mn)

Now the space of possible models is constrained

In [None]:
x = np.linspace(-5, 5, num=100).astype('float32')

fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)
for i in range(10):
    linear_layer = torch.nn.Linear(1, 1)
    rparam = torch.from_numpy(np.random.multivariate_normal(mn, Sn).astype('float32'))
    linear_layer.weight.data = rparam[1].reshape(-1, 1)
    linear_layer.bias.data = rparam[0].reshape(-1)  # bias has shape (out_features,)
    y = linear_layer(torch.from_numpy(x).unsqueeze(1)).detach().numpy()
    ax.plot(x, y)

ax.errorbar(2, 0, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);
#ax.errorbar(-2, -2., xerr=0, yerr=2*seps, fmt='none', c='k');

If we present now $x=-2$, $y=-2$

In [None]:
# Initialization
So = Sn
mo = mn
#Update
Phi = np.array([[1.0, -2.0]])
y = np.array([-2.0])
Sn = seps**2*np.linalg.inv(np.dot(Phi.T, Phi) +  seps**2*np.linalg.inv(So))
mn = np.dot(Sn, np.linalg.solve(So, mo)) + np.dot(Sn, np.dot(Phi.T, y))/seps**2
display(Sn, mn)

Then

In [None]:
x = np.linspace(-5, 5, num=100).astype('float32')

fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)
for i in range(10):
    linear_layer = torch.nn.Linear(1, 1)
    rparam = torch.from_numpy(np.random.multivariate_normal(mn, Sn).astype('float32'))
    linear_layer.weight.data = rparam[1].reshape(-1, 1)
    linear_layer.bias.data = rparam[0].reshape(-1)  # bias has shape (out_features,)
    y = linear_layer(torch.from_numpy(x).unsqueeze(1)).detach().numpy()
    ax.plot(x, y)

ax.errorbar(2, 0, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);
ax.errorbar(-2, -2., xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);

### Bayesian solution for the predictions

What we want

> We train the model to predict $y$ for new values of $x$

In the Bayesian setting we are interested in the **posterior predictive distribution**

$$
\begin{align}
p(y|x, \mathcal{D}, \sigma_\epsilon^2) &= \int p(y|f_\theta(x), \sigma_\epsilon^2) p(\theta| \theta_{N}, \Sigma_{\theta_N}) d\theta \nonumber \\
&= \mathcal{N}\left(y|f_{\theta_N} (x), \sigma_\epsilon^2 + x^T \Sigma_{\theta_N} x\right)
\end{align}
$$

If we consider that $N$ samples were presented and that $\theta_0=0$ then 

$$
\theta_{N} =  (\Phi^T \Phi + \sigma_\epsilon^2 \Sigma_{\theta_0}^{-1})^{-1} \Phi^T Y
$$

(MAP estimator) and

$$
\Sigma_{\theta_N} = \sigma_\epsilon^2 (\Phi^T \Phi + \sigma_\epsilon^2 \Sigma_{\theta_0}^{-1})^{-1}
$$

The predictive variance (uncertainty) at a new $x$, where $x$ is augmented with a leading 1 to account for the bias, is 
$$
\sigma^2(x) = \sigma_\epsilon^2 + x^T \Sigma_{\theta_N} x
$$


For the previous example we can see that the uncertainty grows when we depart from the observed data points
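
The growth of the predictive variance can also be checked on a few individual inputs. This is a minimal sketch: the helper name `predictive` and the posterior values `m_N` and `S_N` are assumptions chosen for illustration, not the values computed in the notebook cells.

```python
import numpy as np

def predictive(x_new, m_N, S_N, sigma_eps):
    """Posterior predictive mean and variance at each entry of x_new."""
    Phi_new = np.column_stack([np.ones_like(x_new), x_new])  # rows are (1, x)
    mean = Phi_new @ m_N
    # Row-wise quadratic form phi^T S_N phi, without building the full matrix
    var = sigma_eps**2 + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
    return mean, var

# Assumed posterior, for illustration only
m_N = np.array([-1.0, 0.5])
S_N = np.diag([0.03, 0.008])
sigma_eps = 0.25

x_new = np.array([0.0, 2.0, 10.0])
mean, var = predictive(x_new, m_N, S_N, sigma_eps)
print(mean)
print(var)
```

The variance never drops below the noise floor $\sigma_\epsilon^2$, and the term $x^T \Sigma_{\theta_N} x$ grows quadratically with the distance from the origin of the input space.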

In [None]:
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)

Phi_x = np.vstack(([1]*100, x)).T
sx = np.sqrt(np.diag(seps**2 + np.dot(np.dot(Phi_x, Sn), Phi_x.T)))
ax.plot(x, np.dot(Phi_x, mn), '--')
ax.fill_between(x, np.dot(Phi_x, mn)-2*sx, np.dot(Phi_x, mn)+2*sx, alpha=0.5)
ax.errorbar(2, 0, xerr=0, yerr=2*seps, fmt='none', c='k', zorder=100);
ax.errorbar(-2, -2., xerr=0, yerr=2*seps, fmt='none', c='k');

# Self-study

- [Chapter 18 of D. Barber's book](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online)

## Bias-variance trade-off

# Artificial Neural networks

[Artificial neural networks](https://docs.google.com/presentation/d/1IJ2n8X4w8pvzNLmpJB-ms6-GDHWthfsJTFuyUqHfXg8/edit?usp=sharing) (ANN) are non-linear parametric function approximators built by connecting simple units

These units are simplified models of biological neurons: 

> linear regressor followed by a non-linear activation function

Feed-forward ANNs are organized in layers. Each layer has a user-defined number of neurons

> **Multilayer perceptron (MLP) architecture:** Every unit is connected to all units of its previous and next layers

Different ways of connecting neurons yield different ANN architectures (convolutional, recurrent, etc.)

The parameter vector $\theta$ includes the weights and biases of all the neurons

- Let's consider a Gaussian prior for $\theta$ and study the space of possible models
- How does it compare to the linear regressor? 
    - What happens when you add more neurons? 
    - What happens if you remove the nonlinearity?
    - What happens when you add more layers?
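
One of these questions can be answered with a short numerical check. The sketch below, which assumes randomly drawn weights and a network with one hidden layer, shows that without the nonlinearity the two layers collapse into a single linear regressor.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-5, 5, num=100)[:, None]

# Random two-layer network with Nh hidden units and identity activation
Nh = 10
W1, b1 = rng.normal(size=(1, Nh)), rng.normal(size=Nh)
W2, b2 = rng.normal(size=(Nh, 1)), rng.normal(size=1)
y_net = (x @ W1 + b1) @ W2 + b2

# The same function written as a single linear layer
w_eff = W1 @ W2        # effective weight, shape (1, 1)
b_eff = b1 @ W2 + b2   # effective bias, shape (1,)
y_lin = x @ w_eff + b_eff

print(np.allclose(y_net, y_lin))
```

Adding more hidden units or more layers does not change this: a composition of affine maps is still affine, so the nonlinearity is what lets the network represent curved functions.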

In [None]:
x = np.linspace(-5, 5, num=100)[:, None]

def activation(z):
    #return np.maximum(0., z) # ReLU
    return 1.0/(1.0 + np.exp(-z)) # Logistic/Sigmoid

sw, sb = 5., 5.
Nh = 10
plt.figure(figsize=(7, 3))
for i in range(10):
    W = sw*np.random.randn(1, Nh)
    b = sb*np.random.randn(Nh)
    z = np.dot(x, W) + b
    z = activation(z)
    #W = sw*np.random.randn(Nh, Nh)
    #b = sb*np.random.randn(Nh)
    #z = activation(np.dot(z, W) + b)
    
    W = sw*np.random.randn(Nh, 1)
    b = sb*np.random.randn(1)
    
    plt.plot(x, np.dot(z, W) + b)

PyTorch tutorial: https://github.com/magister-informatica-uach/INFO267/blob/master/unidad1/3_redes_neuronales.ipynb

## Probabilistic interpretation of ANN

MLP for regression
- No activation in output layer
- Trained by minimizing the **Mean Square Error** as in Linear Regression

MLP for classification
- Sigmoid or softmax activation in output layer
- Trained by minimizing the **Cross Entropy Error** as in Logistic Regression

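These cost functions arise from assuming a particular likelihood for the observations given the parameters: a Gaussian likelihood leads to the mean square error and a Bernoulli (or categorical) likelihood leads to the cross entropy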
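
As a short worked step, assume iid Gaussian noise with known variance $\sigma_\epsilon^2$, as in the probabilistic linear regression above, with $f_\theta$ now denoting the network output. The negative log likelihood is

$$
-\log p(\mathcal{D}| \theta, \sigma_\epsilon^2) = \frac{1}{2\sigma_\epsilon^2} \sum_{i=1}^N \left(y_i - f_\theta(x_i) \right)^2 + \frac{N}{2} \log (2\pi\sigma_\epsilon^2)
$$

The second term does not depend on $\theta$, so maximizing the likelihood is equivalent to minimizing the mean square error. For binary labels $y_i \in \{0, 1\}$ and a sigmoid output, assuming a Bernoulli likelihood with $p(y_i=1|x_i, \theta) = f_\theta(x_i)$ gives

$$
-\log p(\mathcal{D}| \theta) = -\sum_{i=1}^N \left[ y_i \log f_\theta(x_i) + (1 - y_i) \log \left(1 - f_\theta(x_i)\right) \right]
$$

which is the binary cross entropy.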