In [None]:
%%HTML
<!-- Improve display on projector -->
<style>
.rendered_html {font-size: 1.2em; line-height: 150%;}
div.prompt {min-width: 0ex; padding: 0px;}
.container {width:95% !important;}
</style>

In [None]:
%autosave 0
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

# Linear Regression

In linear regression we have a 
- continuous one-dimensional target $y$ 
- continuous D-dimensional input $x$ 

related by a linear mapping

$$
\sum_{k=1}^D w_k x_k + b = f_\theta(x)  \rightarrow y
$$

> The model is specified by $\theta=(b, w_1, w_2, \ldots, w_D)$

Typically, we fit this model by minimizing the sum of squared errors
$$
\min_\theta\sum_i \left(y_i - f_\theta(x_i) \right)^2
$$

whose solution is

$$
\theta = (\Phi^T \Phi)^{-1} \Phi^T Y,
$$

where $\Phi  = \begin{pmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1D} \\ 
1 & x_{21} & x_{22} & \ldots & x_{2D} \\
1 & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \ldots & x_{ND} \end{pmatrix}$,  $Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$ and  $\theta =  \begin{pmatrix} b \\ w_1 \\ \vdots \\ w_D \end{pmatrix}$

> This is known as the ordinary least squares (OLS) solution
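
As a minimal sketch of the closed-form solution above, the following builds $\Phi$ for synthetic data and solves the normal equations with NumPy. The data, the seed and the names `true_theta`, `theta_ols` and `theta_lstsq` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)

# Synthetic data (hypothetical): N samples, D inputs, known parameters
N, D = 200, 3
true_theta = np.array([1.0, 2.0, -1.0, 0.5])  # (b, w_1, w_2, w_3)
X = rng.normal(size=(N, D))
Phi = np.hstack([np.ones((N, 1)), X])  # the column of ones carries the bias b
Y = Phi @ true_theta + 0.1 * rng.normal(size=N)

# Normal equations: solve (Phi^T Phi) theta = Phi^T Y without forming the inverse
theta_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Same estimate from NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

Solving the linear system is usually preferred over computing $(\Phi^T \Phi)^{-1}$ explicitly, since it is cheaper and numerically more stable.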

#### Note: 

Linear regression is **linear in the parameters**

If we transform the inputs with fixed basis functions, the solution keeps the same form. The only difference is in $\Phi$, whose columns now hold the transformed inputs

For example
- Polynomial basis regression $f_\theta(x) = \sum_j w_j x^j + b$ 
- Sine-wave basis regression $f_\theta(x) = \sum_j a_j \cos(2\pi j x)  + \sum_j b_j \sin(2\pi j x) + c$ 
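
The polynomial case can be sketched as follows: only the construction of $\Phi$ changes, and the least-squares step stays the same. The helper `polynomial_design_matrix` and the noiseless cubic target are hypothetical choices made for this example.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Columns are x**0, x**1, ..., x**degree (x**0 is the bias column)."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(-1, 1, 50)
y = 0.5 - 2.0 * x + 3.0 * x**3  # noiseless cubic, chosen for illustration

Phi = polynomial_design_matrix(x, degree=3)
# The fit is still linear in theta, so ordinary least squares applies unchanged
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Here `theta` holds $(b, w_1, w_2, w_3)$ in increasing order of the power of $x$.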

### Probabilistic linear regression

We can assume that observations are noisy and write

$$
y = \sum_j w_j x_j  + b + \epsilon,
$$

If the noise is independent and Gaussian distributed with variance $\sigma_\epsilon^2$, then

$$
p(y|x, \theta) = \mathcal{N}\left(\sum_j w_j x_j + b, \sigma_\epsilon^2 \right)
$$

Additionally, we may want to discourage large values of $\theta$ by placing a prior

$$
p(\theta) = \mathcal{N}(0, \sigma^2_\theta I)
$$

The prior gives us the space of possible models

In [None]:
x = np.linspace(-5, 5, num=100)

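sw, sb = 5., 5.  # prior standard deviations of the weight and the bias
plt.figure(figsize=(7, 3))
for i in range(10):
    # Sample one linear model from the Gaussian prior and plot it
    W = sw*np.random.randn(1)
    b = sb*np.random.randn(1)
    plt.plot(x, W*x + b)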

We constrain the space of solutions by presenting data

### Point-estimate solution

The maximum a posteriori (MAP) estimator of $\theta$ is given by
$$
\begin{align}
\hat \theta &= \text{arg}\max_\theta \log \left[ p(\mathcal{D}| \theta, \sigma_\epsilon^2) ~ p (\theta) \right] \nonumber  \\
&= \text{arg}\min_\theta  \frac{1}{2\sigma_\epsilon^2} (Y-\Phi\theta)^T (Y - \Phi\theta) + \frac{1}{2\sigma_\theta^2} \|\theta\|^2 \nonumber
\end{align}
$$
and the result is
$$
\hat \theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y
$$
where $\lambda = \sigma_\epsilon^2 / \sigma_\theta^2$

> This is the ridge regression or regularized least squares solution
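
A minimal sketch of this estimator, assuming synthetic data and hypothetical values for $\sigma_\epsilon$ and $\sigma_\theta$. The helper `ridge_solution` is an illustrative name, and as in the formula above the bias is regularized together with the weights.

```python
import numpy as np

def ridge_solution(Phi, Y, lam):
    """Solve (Phi^T Phi + lam I) theta = Phi^T Y."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ Y)

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
N, D = 30, 5
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
Y = Phi @ rng.normal(size=D + 1) + rng.normal(size=N)

sigma_eps, sigma_theta = 1.0, 2.0  # hypothetical noise and prior scales
lam = sigma_eps**2 / sigma_theta**2
theta_map = ridge_solution(Phi, Y, lam)
```

Calling `ridge_solution` with different values of `lam` is a quick way to explore how the prior scale changes the estimate.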

What happens if my prior is totally uninformative ($\sigma_\theta \to \infty$)?



## Bias-variance trade-off

# Artificial Neural networks

[Artificial neural networks](https://docs.google.com/presentation/d/1IJ2n8X4w8pvzNLmpJB-ms6-GDHWthfsJTFuyUqHfXg8/edit?usp=sharing) (ANN) are non-linear parametric function approximators built by connecting simple units

These units are simplified models of biological neurons: 

> linear regressor followed by a non-linear activation function

Feed-forward ANNs are organized in layers. Each layer has a user-defined number of neurons

> **Multilayer perceptron (MLP) architecture:** Every unit is connected to all units of its previous and next layers

Different ways of connecting neurons yield different ANN architectures (convolutional, recurrent, etc.)

The parameter vector $\theta$ includes the weights and biases of all the neurons

- Let's consider a Gaussian prior for $\theta$ and study the space of possible models
- How does it compare to the linear regressor? 
    - What happens when you add more neurons? 
    - What happens if you remove the nonlinearity?
    - What happens when you add more layers?

In [None]:
x = np.linspace(-5, 5, num=100)[:, None]

def activation(z):
    #return np.maximum(0., z) # ReLU
    return 1.0/(1.0 + np.exp(-z)) # Logistic/Sigmoid

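sw, sb = 5., 5.  # prior standard deviations of the weights and biases
Nh = 10  # number of hidden units
plt.figure(figsize=(7, 3))
for i in range(10):
    # Hidden layer: sample its parameters from the prior and apply the activation
    W = sw*np.random.randn(1, Nh)
    b = sb*np.random.randn(Nh)
    z = np.dot(x, W) + b
    z = activation(z)
    # Uncomment to add a second hidden layer
    #W = sw*np.random.randn(Nh, Nh)
    #b = sb*np.random.randn(Nh)
    #z = activation(np.dot(z, W) + b)
    
    # Linear output layer (no activation)
    W = sw*np.random.randn(Nh, 1)
    b = sb*np.random.randn(1)
    
    plt.plot(x, np.dot(z, W) + b)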

PyTorch tutorial: https://github.com/magister-informatica-uach/INFO267/blob/master/unidad1/3_redes_neuronales.ipynb

## Probabilistic interpretation of ANN

MLP for regression
- No activation in output layer
- Trained by minimizing the **Mean Square Error** as in Linear Regression

MLP for classification
- Sigmoid or softmax activation in output layer
- Trained by minimizing the **Cross Entropy Error** as in Logistic Regression

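These cost functions arise from assuming a particular likelihood for the data given the parameters: Gaussian for the mean square error, and Bernoulli or categorical for the cross entropy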
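
The link between cost functions and likelihoods can be checked numerically. The sketch below uses random stand-ins for the targets and the network outputs (`y`, `f`, `t` and `p` are hypothetical, no network is trained). It compares the Gaussian negative log-likelihood with the mean square error, and the Bernoulli negative log-likelihood with the binary cross entropy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)
N = 100

# Regression: Gaussian likelihood with fixed noise standard deviation sigma
y = rng.normal(size=N)  # targets
f = rng.normal(size=N)  # stand-in for the network outputs
sigma = 0.5
nll_gauss = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (y - f)**2 / (2 * sigma**2))
mse = np.mean((y - f)**2)
const = 0.5 * N * np.log(2 * np.pi * sigma**2)  # does not depend on f
# The NLL is a scaled MSE plus a constant, so both have the same minimizer
gauss_gap = nll_gauss - (N * mse / (2 * sigma**2) + const)

# Binary classification: Bernoulli likelihood with predicted probability p
t = rng.integers(0, 2, size=N)                  # labels in {0, 1}
p = 1.0 / (1.0 + np.exp(-rng.normal(size=N)))   # stand-in for sigmoid outputs
nll_bern = -np.sum(np.log(np.where(t == 1, p, 1 - p)))
cross_entropy = -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))
```

In both cases the two quantities agree up to floating-point error, so minimizing the cost is equivalent to maximizing the corresponding likelihood.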