In [1]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt

import metrics
import utils

# Projection Pursuit Regression

The PPR model has the form:

$$f(x) = \sum_{m=1}^M g_m(w_m^Tx)$$

where the $g_m: \mathbb{R} \to \mathbb{R}$ are unspecified functions and $w_m, x \in \mathbb{R}^p$.  

For $M$ large enough and a suitable choice of the $g_m$, the model can approximate any continuous function arbitrarily well: it is a universal approximator.  

We fit the model by finding the $g_m$ and $w_m$ that minimize the following criterion:
$$\sum_{i=1}^N \left( y_i - \sum_{m=1}^M g_m(w_m^Tx_i) \right) ^2$$  

If the $g_m$ are fixed, we can solve for the directions one term at a time, using a Gauss-Newton search (a quasi-Newton method) that linearizes $g$ around the current estimate $w_\text{old}$:

$$\sum_{i=1}^N (y_i - g(w^Tx_i))^2 \approx \sum_{i=1}^N g'(w^T_\text{old}x_i)^2 \left( \left( w_\text{old}^Tx_i + \frac{y_i - g(w^T_\text{old}x_i)}{g'(w^T_\text{old}x_i)} \right) - w^T x_i \right)^2$$

We can minimize the right-hand side with a weighted least squares regression to find $w$, the new weight estimate.  
After several iterations, $w$ converges to the weights that minimize the criterion.  
We can then move on to the next $g$ and fit its $w$ to the residual.

PPR Algorithm:

1. Set $f_0(x) = 0$
2. For $m=1$ to $M$:

    - Initialize $w_m$ randomly
    - Let the residual $r_i = y_i - f_{m-1}(x_i)$
    - Let the target $\hat{y}_i = w^T_\text{old}x_i + \frac{r_i - g_m(w^T_\text{old} x_i)}{g_m'(w^T_\text{old} x_i)}$
    - Let the weights $v_i = g_m'(w^T_\text{old} x_i)^2$
    - Solve the WLS regression of $\hat{y}_i$ on $x_i$ with weights $v_i$, and iterate until $w_m$ converges
    - Set $f_m(x) = f_{m-1}(x) + g_m(w_m^Tx)$

3. Output $f(x) = f_M(x)$
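
The inner loop of the algorithm can be sketched as follows. This is a minimal illustration, assuming the ridge function `g` and its derivative `g_prime` are given and held fixed; the function name `fit_direction`, the iteration count and the small-derivative guard are illustrative choices, not part of the algorithm above.

```python
import numpy as np

def fit_direction(X, r, g, g_prime, n_iter=20, seed=0):
    """Gauss-Newton search for one direction w, with g held fixed."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # random initialization of w_m
    for _ in range(n_iter):
        v = X @ w                            # projections w_old^T x_i
        gp = g_prime(v)
        # Guard against dividing by a vanishing derivative (illustrative choice).
        gp = np.where(np.abs(gp) < 1e-8, 1e-8, gp)
        target = v + (r - g(v)) / gp         # working target
        sw = np.abs(gp)                      # square root of the weights g'(v)^2
        # Weighted least squares (no intercept): rescale rows by sqrt(weights).
        w, *_ = np.linalg.lstsq(X * sw[:, None], target * sw, rcond=None)
    return w

# Sanity check: with g the identity, a single WLS step is ordinary least squares.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = fit_direction(X_demo, X_demo @ w_true, g=lambda v: v, g_prime=np.ones_like)
```

With a non-linear `g`, convergence depends on the starting point, so in practice one would monitor the criterion across iterations.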

# Neural Networks

We focus on the vanilla neural net, with a single hidden layer. The network has $p$ inputs, $K$ outputs, and $M$ hidden units.  
$$Z_m = \sigma (\alpha_{0m} + \alpha^T_m X), \space m=1,\text{...},M$$
$$T_k = \beta_{0k} + \beta^T_k Z, \space k=1,\text{...},K$$
$$f_k(X) = g_k(T), \space k=1,\text{...},K$$  

Usually, the activation function $\sigma$ is the sigmoid: $\sigma(z) = \frac{1}{1 + \exp(-z)}$

The output function $g_k$ is a final transformation of the vector of outputs $T$. For regression, it's usually the identity. For classification, it's usually the softmax function:
$$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^K e^{T_l}}$$

We can think of the $Z_m$ as a (non-linear) basis expansion of the original inputs $X$. The network is then a simple linear or logistic model with this transformation as input.  

If $\sigma$ is a linear function, the model reduces to a simple linear or logistic model in the inputs. A non-linear $\sigma$ greatly enlarges this class of models.
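
The three equations above translate directly into a forward pass. The sketch below is illustrative: the function names, the array shapes (`alpha` as a $p \times M$ matrix, `beta` as an $M \times K$ matrix) and the random demo data are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(T):
    # Subtracting the row-wise max improves numerical stability
    # and leaves the result unchanged.
    e = np.exp(T - T.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def net_forward(X, alpha0, alpha, beta0, beta, classification=True):
    Z = sigmoid(alpha0 + X @ alpha)   # hidden units Z_m, shape (N, M)
    T = beta0 + Z @ beta              # linear outputs T_k, shape (N, K)
    return softmax(T) if classification else T

rng = np.random.default_rng(0)
N, p, M, K = 5, 4, 3, 2
X_demo = rng.normal(size=(N, p))
probs = net_forward(X_demo, rng.normal(size=M), rng.normal(size=(p, M)),
                    rng.normal(size=K), rng.normal(size=(M, K)))
```

Each row of `probs` is a vector of class probabilities that sums to one.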

# Fitting Neural Networks

We define $\theta$ as the complete set of weights of the network: $\alpha_{0m}, \alpha_m, \beta_{0k}, \beta_k$, for a total of $M(p+1) + K(M + 1)$ weights.  

For regression we usually use the sum-of-squared errors:
$$R(\theta) = \sum_{k=1}^K\sum_{i=1}^N (y_{ik} - f_k(x_i))^2$$

For $K$-classes classification we usually use the cross-entropy (deviance):

$$R(\theta) = -\sum_{k=1}^K\sum_{i=1}^N y_{ik} \log f_k(x_i)$$


The model is fit using gradient descent. Backpropagation uses the chain rule to compute the gradient with respect to every weight:

For sum of squares, let's define:
$$R(\theta) = \sum_{i=1}^N R_i$$
$$\text{with } R_i = \sum_{k=1}^K (y_{ik} - f_k(x_i))^2$$

$$\frac{\partial R_i}{\partial \beta_{km}} = -2(y_{ik} - f_k(x_i)) g'_k(\beta^T_kz_i) z_{mi}$$
$$\frac{\partial R_i}{\partial \alpha_{ml}} = -\sum_{k=1}^K 2(y_{ik} - f_k(x_i)) g'_k(\beta^T_kz_i) \beta_{km} \sigma'(\alpha_m^Tx_i) x_{il}$$
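
A quick way to check such gradient formulas is to compare them with finite differences. The sketch below does this for a single observation, assuming an identity output function ($g'_k = 1$) and no bias terms; all names and the random demo values are illustrative. The gradient for $\alpha$ accumulates the contributions of all $K$ outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_i(alpha, beta, x, y):
    z = sigmoid(alpha @ x)               # alpha: (M, p), z: (M,)
    f = beta @ z                         # beta: (K, M), f: (K,)
    return np.sum((y - f) ** 2)

def grads_i(alpha, beta, x, y):
    z = sigmoid(alpha @ x)
    f = beta @ z
    delta = -2.0 * (y - f)               # output errors, shape (K,)
    d_beta = np.outer(delta, z)          # dR_i / dbeta_km
    s = (beta.T @ delta) * z * (1 - z)   # back-propagated errors, summed over k
    d_alpha = np.outer(s, x)             # dR_i / dalpha_ml
    return d_alpha, d_beta

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of f() with respect to the entries of W."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        f_plus = f()
        W[idx] = old - eps
        f_minus = f()
        W[idx] = old
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2
alpha_demo, beta_demo = rng.normal(size=(M, p)), rng.normal(size=(K, M))
x_demo, y_demo = rng.normal(size=p), rng.normal(size=K)

d_alpha, d_beta = grads_i(alpha_demo, beta_demo, x_demo, y_demo)
f = lambda: loss_i(alpha_demo, beta_demo, x_demo, y_demo)
max_err = max(np.abs(d_alpha - numerical_grad(f, alpha_demo)).max(),
              np.abs(d_beta - numerical_grad(f, beta_demo)).max())
```

A small `max_err` indicates that the analytical and numerical gradients agree.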

A gradient descent update has the form:
$$\beta_{km} \leftarrow \beta_{km} - \gamma \sum_{i=1}^N \frac{\partial R_i}{\partial \beta_{km}}$$
$$\alpha_{ml} \leftarrow \alpha_{ml} - \gamma \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}}$$

with $\gamma$ the learning rate.  

Backpropagation is a two-pass algorithm: a forward pass computes the outputs from the inputs, and a backward pass propagates the errors back through the network to compute the weight gradients.  

These updates are batch learning: each update uses the whole training set at once. Updates can also be made after a mini-batch of a few examples, or after every single example.  
An epoch is one sweep through the entire training set.  
Usually, the learning rate $\gamma$ should decrease over time.

In [31]:
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))

class VanillaNetReg:
    
    def __init__(self, ninputs, nhidden, noutputs):
        self.p = ninputs
        self.M = nhidden
        self.K = noutputs
        
        self.w0 = np.random.randn(ninputs, nhidden)
        self.b0 = np.random.randn(nhidden)
        self.w1 = np.random.randn(nhidden, noutputs)
        self.b1 = np.random.randn(noutputs)
    
    def fit(self, X, y, nepochs, batch_size, lr):
        
        for epoch in range(nepochs):
            
            p = np.random.permutation(len(X))
            X, y = X[p], y[p]
            for ki in range(0, len(X), batch_size):
                Xb = X[ki:ki+batch_size]
                yb = y[ki:ki+batch_size]
                self.backprop(Xb, yb, lr)
               
            if epoch % 10 == 0:
                print('Epoch {}: loss = {}'.format(epoch + 1,
                                                  self.loss(X, y)/len(X)))
        
    
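    def backprop(self, X, y, lr):
        # Forward pass
        z0 = X @ self.w0 + self.b0       # hidden pre-activations
        y0 = sigmoid(z0)                 # hidden activations
        z1 = y0 @ self.w1 + self.b1      # output pre-activations
        y1 = z1                          # identity output for regression
        
        # Backward pass: gradients of the sum-of-squares loss
        dy1 = 2 * (y1 - y)
        dz1 = dy1
        dw1 = y0.T @ dz1
        db1 = np.sum(dz1, axis=0)
        dy0 = dz1 @ self.w1.T
        dz0 = dy0 * sigmoid_prime(z0)    # chain rule through the sigmoid
        dw0 = X.T @ dz0
        db0 = np.sum(dz0, axis=0)

        # Gradient descent update
        self.w0 -= lr * dw0
        self.b0 -= lr * db0
        self.w1 -= lr * dw1
        self.b1 -= lr * db1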
        
    def forward(self, X):
        z0 = X @ self.w0 + self.b0
        y0 = sigmoid(z0)
        z1 = y0 @ self.w1 + self.b1
        y1 = z1
        return y1
    
    def loss(self, X, y):
        preds = self.forward(X)
        return np.sum((y - preds)**2)
        

X, y = load_boston().data, load_boston().target
X = X.astype(np.float32)
X = X / np.std(X, axis=0)
y = y.reshape(-1, 1).astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=15)


model = VanillaNetReg(ninputs=X.shape[1], nhidden=40, noutputs=1)
model.fit(X_train, y_train, nepochs=200, batch_size=32, lr=1e-3)

Epoch 1: loss = 68.14316071888997
Epoch 11: loss = 31.283955204056245
Epoch 21: loss = 29.373307553038913
Epoch 31: loss = 31.003852829779998
Epoch 41: loss = 26.69283436805687
Epoch 51: loss = 25.902082837579204
Epoch 61: loss = 25.55904596360541
Epoch 71: loss = 25.7156240681505
Epoch 81: loss = 24.62320562334049
Epoch 91: loss = 26.556330393202682
Epoch 101: loss = 25.510289678671853
Epoch 111: loss = 24.68394423811551
Epoch 121: loss = 25.51523731746356
Epoch 131: loss = 25.82871033408864
Epoch 141: loss = 25.77346142519554
Epoch 151: loss = 26.84860736802371
Epoch 161: loss = 26.748085227790654
Epoch 171: loss = 25.98840721012171
Epoch 181: loss = 26.203457693819388
Epoch 191: loss = 28.633488567920423


# Some Issues in training neural nets

## Starting values

If the weights are near $0$, the sigmoid operates in its roughly linear region, and the model is roughly linear.  
Weights of exactly $0$ give zero derivatives and perfect symmetry, so the network never learns.  
Starting with large weights often leads to poor solutions.  
Usually, it's best to start with random weights near $0$.

## Overfitting

Neural networks often overfit because they have too many weights.  
Early stopping is a technique to reduce overfitting: we stop training before reaching a minimum of the training error, when the error on a validation set starts increasing.  
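
A minimal sketch of early stopping is shown below. The callables `train_step` and `val_loss` and the `patience` parameter (how many non-improving epochs to tolerate before stopping) are assumptions made for the illustration; validation losses are noisy in practice, so some tolerance is commonly used.

```python
def early_stopping(train_step, val_loss, max_epochs=200, patience=10):
    """Train until the validation loss has not improved for `patience` epochs."""
    best, best_epoch, wait = float('inf'), -1, 0
    for epoch in range(max_epochs):
        train_step()                     # one epoch of training
        loss = val_loss()                # error on the validation set
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best

# Toy run with a scripted sequence of validation losses.
losses = iter([5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8])
best_epoch, best = early_stopping(lambda: None, lambda: next(losses),
                                  max_epochs=7, patience=2)
```

In a real training loop one would also save a copy of the weights at `best_epoch` and restore them after stopping.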

Another solution is weight decay: we add a penalty $\lambda J(\theta)$ to $R(\theta)$:
$$J(\theta) = \sum_{k,m} \beta_{km}^2 + \sum_{m,l} \alpha_{ml}^2$$
with $\lambda \geq 0$ a hyperparameter. It shrinks the weights towards $0$.  

Another penalty is weight elimination, which shrinks smaller weights more than weight decay does:
$$J(\theta) = \sum_{k,m} \frac{\beta_{km}^2}{1 + \beta_{km}^2} + \sum_{m,l} \frac{\alpha_{ml}^2}{1+\alpha_{ml}^2}$$
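
The two penalties and their gradients can be compared numerically. In this sketch, the function names and the convention of passing the weights as a list of arrays are illustrative assumptions.

```python
import numpy as np

def weight_decay(weights):
    """Returns J(theta) = sum of w^2 and its gradient for each array."""
    J = sum(np.sum(w ** 2) for w in weights)
    return J, [2 * w for w in weights]

def weight_elimination(weights):
    """Returns J(theta) = sum of w^2 / (1 + w^2) and its gradient for each array."""
    J = sum(np.sum(w ** 2 / (1 + w ** 2)) for w in weights)
    return J, [2 * w / (1 + w ** 2) ** 2 for w in weights]

w_demo = [np.array([0.1, 1.0, 10.0])]
J_decay, g_decay = weight_decay(w_demo)
J_elim, g_elim = weight_elimination(w_demo)
```

For the small weight the two gradients are nearly equal, while for the large weight the weight-elimination gradient is close to zero: large weights are left almost untouched and the shrinkage concentrates on the small ones.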

## Scaling of the inputs

The scaling of the inputs also determines the effective scaling of the weights. It's best to standardize all inputs to mean $0$ and standard deviation $1$.
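
A minimal standardization sketch follows. It assumes the statistics are computed on the training set only and reused for the test set; the function name and the demo arrays are illustrative.

```python
import numpy as np

def standardize(X_train, X_test):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
A_train = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
A_test = rng.normal(loc=5.0, scale=3.0, size=(20, 4))
A_train_s, A_test_s = standardize(A_train, A_test)
```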

## Number of hidden units and layers

It's better to have too many hidden units than too few. With too few, the model might not be flexible enough to learn the correct representation.  
With too many, the extra weights can be shrunk towards $0$ with the proper regularization.  

The number of layers is chosen by background knowledge and experimentation. Multiple hidden layers allow the construction of hierarchical features.

## Multiple Minima

$R(\theta)$ has many local minima, and the final solution depends strongly on the starting weights.  
One can try several starting configurations and choose the one with the lowest error, or average over several networks, or use another approach such as bagging.

# Bayesian Neural Net

[Bayesian Methods for Neural Networks](https://www.microsoft.com/en-us/research/wp-content/uploads/1995/01/NCRG_95_009.pdf)

Let's consider a feedforward neural network that maps an input vector $x \in \mathbb{R}^p$ to an output value $y \in \mathbb{R}$, using a weight vector $w \in \mathbb{R}^W$.  
The observed dataset $D$ consists of $N$ input vectors $x_i$ and corresponding targets $t_i$.

## Prediction

We can find the posterior distribution of the weights using Bayes theorem:
$$p(w|D) = \frac{p(D|w)p(w)}{p(D)}$$

The conditional distribution $p(D|w)$ is called the likelihood. The conventional approach is to find $w^*$ maximizing the likelihood function.  
$p(w)$ is a prior distribution over the weights.  

In order to evaluate the posterior $p(w|D)$, we need expressions for both the prior distribution $p(w)$ and the likelihood function $p(D|w)$.  

One simple choice for a prior is a multivariate normal distribution with mean $\vec{0}$ and covariance $\alpha^{-1} I$, for a fixed $\alpha$:

$$p(w) = \frac{1}{Z_W(\alpha)} \exp (-\frac{\alpha}{2}||w||^2)$$
$$\text{with } Z_W(\alpha) = \left( \frac{2 \pi}{\alpha} \right) ^{W/2}$$  

For the likelihood function, let's suppose the targets follow a Gaussian distribution whose mean is the output of the network, with fixed variance $\beta^{-1}$:
$$p(t|x,w) = \left( \frac{\beta}{2 \pi} \right) ^{1/2} \exp \left( -\frac{\beta}{2} (y(x;w) - t)^2 \right)$$


The likelihood function is then:
$$
\begin{equation}
\begin{split}
p(D|w)  & = \prod_{i=1}^N p(t_i|x_i,w) \\
& = \frac{1}{Z_D(\beta)} \exp \left( -\frac{\beta}{2} \sum_{i=1}^N (y(x_i;w) - t_i)^2 \right)
\end{split}
\end{equation}
$$

$$\text{with } Z_D(\beta) = \left( \frac{2\pi}{\beta} \right) ^{N/2}$$

$\alpha$ and $\beta$ are hyperparameters; let's suppose for now that they are known, fixed values.  

We can make predictions for a new $x$ by integrating over the weights:
$$p(t|x,D) = \int p(t|x,w)p(w|D)dw$$

If the posterior $p(w|D)$ is sharply peaked around its mode $w_{MP}$ (the most probable weights), then we can approximate it using:

$$p(t|x,D) \approx p(t|x,w_{MP})$$

Predictions are made using the neural network, with weights $w_{MP}$. We need to find the $w$ that maximizes the posterior $p(w|D)$.  
Instead of maximizing the posterior probability, let's minimize its negative logarithm. For this particular prior distribution and likelihood function, we have to minimize the following criterion:
$$E(w) = \frac{\beta}{2} \sum_{i=1}^N (y(x_i;w) - t_i)^2 + \frac{\alpha}{2} ||w||^2$$  
Up to a constant factor, this is the same as minimizing the usual sum-of-squares error with $L2$ regularization.  
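
To make the link with $L2$ regularization concrete, the sketch below replaces the network with a linear model $y(x;w) = w^Tx$, for which $w_{MP}$ has a closed form. The linear model, the function names and the demo data are assumptions made for the illustration only.

```python
import numpy as np

def neg_log_posterior(w, X, t, alpha, beta):
    """E(w) for the linear model y(x; w) = w^T x."""
    return 0.5 * beta * np.sum((X @ w - t) ** 2) + 0.5 * alpha * np.sum(w ** 2)

def w_map_linear(X, t, alpha, beta):
    # Setting the gradient beta * X^T (X w - t) + alpha * w to zero.
    A = beta * X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, beta * X.T @ t)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
t_demo = X_demo @ np.array([1.0, -1.0]) + 0.2 * rng.normal(size=30)
alpha, beta = 1.0, 25.0

w_mp = w_map_linear(X_demo, t_demo, alpha, beta)
grad = beta * X_demo.T @ (X_demo @ w_mp - t_demo) + alpha * w_mp

# Ridge regression with penalty lambda = alpha / beta gives the same weights.
w_ridge = np.linalg.solve(X_demo.T @ X_demo + (alpha / beta) * np.eye(2),
                          X_demo.T @ t_demo)
```

For a real network there is no closed form, and $w_{MP}$ is found by gradient descent on $E(w)$.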

## Confidence interval

We can use the Bayesian neural network to get a confidence interval around the predictions, using a Gaussian approximation of the posterior $p(w|D)$.  
Let's approximate $E(w)$ with a second-order Taylor expansion around $w_{MP}$ (the first-order term vanishes because $w_{MP}$ is a minimum of $E$):
$$E(w) \approx E(w_{MP}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP})$$

$A$ is the Hessian matrix of $E$ with respect to the weights, evaluated at $w=w_{MP}$.  

Let's approximate the network $y(x;w)$ by a linear expansion:
$$y(x;w) \approx y(x;w_{MP}) + g^T\Delta w$$
with $\Delta w = w - w_{MP}$ and $g = \nabla_w y(x; w_{MP})$  

With these approximations, the predictive distribution becomes Gaussian and can be evaluated:
$$p(t|x,D) = \frac{1}{(2 \pi \sigma_t^2)^{1/2}} \exp \left( - \frac{(t - y_{MP})^2}{2 \sigma_t^2} \right)$$

The distribution has mean $y_{MP} = y(x;w_{MP})$ and variance $\sigma_t^2$ given by:
$$\sigma_t^2 = \frac{1}{\beta} + g^TA^{-1}g$$

We can use this distribution to estimate confidence intervals of new predictions.
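
As an illustration, the sketch below computes the predictive mean and variance for a linear model $y(x;w) = w^Tx$, an assumption made so that the Hessian $A = \alpha I + \beta X^TX$ and the gradient $g = x$ are exact. The function name and the demo data are hypothetical.

```python
import numpy as np

def predictive_distribution(x_new, X, t, alpha, beta):
    W = X.shape[1]
    A = alpha * np.eye(W) + beta * X.T @ X     # Hessian of E(w)
    w_mp = np.linalg.solve(A, beta * X.T @ t)  # most probable weights
    g = x_new                                  # gradient of w^T x w.r.t. w
    mean = x_new @ w_mp
    var = 1.0 / beta + g @ np.linalg.solve(A, g)
    return mean, var

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
t_demo = X_demo @ np.array([1.0, -1.0]) + 0.2 * rng.normal(size=30)
alpha, beta = 1.0, 25.0

mean_near, var_near = predictive_distribution(np.zeros(2), X_demo, t_demo, alpha, beta)
mean_far, var_far = predictive_distribution(np.array([10.0, 10.0]), X_demo, t_demo, alpha, beta)

# Approximate 95% interval around the prediction at the far point.
interval = (mean_far - 1.96 * np.sqrt(var_far), mean_far + 1.96 * np.sqrt(var_far))
```

The variance is at least the noise level $1/\beta$ and grows for inputs far from the training data, where the weights are less constrained.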

## Hyperparameters

We need to choose the correct values for $\alpha$ and $\beta$.  
The posterior is given by:
$$p(w|D) = \int \int p(w|\alpha,\beta,D)p(\alpha,\beta|D)d\alpha d\beta$$

Let's suppose the posterior $p(\alpha,\beta|D)$ is sharply peaked around its maximum at $\alpha_{MP}$ and $\beta_{MP}$. Then:
$$p(w|D) \approx p(w|\alpha_{MP}, \beta_{MP},D)$$

We need to find the hyperparameter values that maximize the posterior $p(\alpha, \beta|D)$:

$$p(\alpha, \beta|D) = \frac{p(D|\alpha, \beta) p(\alpha,\beta)}{p(D)}$$

We need to choose a prior $p(\alpha,\beta)$. Such a prior on hyperparameters is called a hyperprior.  
We choose a non-informative prior, because we have no idea of what values they could take. It gives equal weight to all possible values.  
The maximum of the posterior is then found by maximizing the likelihood term $p(D|\alpha,\beta)$, also called the evidence. It can be rewritten as:
$$p(D|\alpha,\beta) = \int p(D|w,\beta) p(w|\alpha)dw$$

It simplifies to:
$$p(D|\alpha,\beta) = \frac{1}{Z_D(\beta)} \frac{1}{Z_W(\alpha)} \int \exp (-E(w)) dw$$

Using the Taylor expansion of $E(w)$ as above, this integral becomes tractable, and the resulting expression can be maximized with respect to $\alpha$ and $\beta$.  
We finally get expressions for $\alpha_{MP}$ and $\beta_{MP}$ that can be used in the rest of the calculations.
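
As a worked step under the same second-order approximation of $E(w)$ around $w_{MP}$, with Hessian $A$, the integral is Gaussian:
$$\int \exp (-E(w)) dw \approx \exp (-E(w_{MP})) (2\pi)^{W/2} |A|^{-1/2}$$

Taking the logarithm and substituting $Z_D(\beta)$ and $Z_W(\alpha)$ gives the log evidence:
$$\ln p(D|\alpha,\beta) \approx -E(w_{MP}) - \frac{1}{2} \ln |A| + \frac{W}{2} \ln \alpha + \frac{N}{2} \ln \beta - \frac{N}{2} \ln (2\pi)$$

Note that $w_{MP}$ and $A$ both depend on $\alpha$ and $\beta$, so the maximization is usually done iteratively.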