
# Introduction to Machine Learning
## Neural Networks
### July 8th, 2019
### Instructors: Melanie Fernandez Pradier (Harvard), Weiwei Pan (Harvard), Javier Zazo Ruiz (Harvard)


## Can we build a model that is arbitrarily flexible AND scalable?

<img src="./fig/fig2.png" style='height:300px;'>


**REVISED GOAL:** Find models that can capture *arbitrarily complex* trends or decision boundaries **and** are fast to train as well as efficient for computing predictions.

# Neural Networks

## What is a Neural Network?

**Goal:** build a good approximation $\widehat{g}$ of a complex function $g$ by composing simple functions.

For example, let the following picture represent $f\left(\sum_{i}w_ix_i\right)$, where $f$ is a non-linear transform:

<img src="./fig/fig4.png" style='height:200px;'>


## Neural Networks as Function Approximators

Then we can define the approximation $\widehat{g}$ with a graphical schema representing a complex series of compositions and sums of the form $f\left(\sum_{i}w_ix_i\right)$:

<img src="./fig/fig5.png" style='height:300px;'>

This is a ***neural network***. We denote the weights of the neural network collectively by $\mathbf{W}$.
The non-linear function $f$ is called the ***activation function***.
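
As a minimal sketch of the schema above, the forward pass of a tiny network can be written with NumPy. The layer sizes, the weight values, the bias terms `b1` and `b2`, and the helper names `relu` and `forward` are illustrative assumptions, not taken from the figure.

```python
import numpy as np

def relu(z):
    """Activation function f: element-wise max(0, z)."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2):
    """Two-layer network: each hidden unit computes f(sum_i w_i x_i + b)."""
    hidden = relu(W1 @ x + b1)   # hidden layer: weighted sums, then activation
    return W2 @ hidden + b2      # output layer: weighted sum of the hidden units

# Hypothetical weights for a network with 2 inputs, 2 hidden units, 1 output
W1 = np.array([[1.0, -1.0],
               [0.5,  0.5]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([0.0])

x = np.array([2.0, 1.0])
y_hat = forward(x, W1, b1, W2, b2)
print(y_hat)
```

Collecting `W1`, `b1`, `W2`, `b2` into one object gives the parameters $\mathbf{W}$ of the network.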

## Common Choices for the Activation Function

<img src="./fig/fig8.png" style='height:500px;'>
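
For reference, three widely used activation functions can be sketched in a few lines of NumPy. This selection (sigmoid, tanh, ReLU) is an assumption about which functions are of interest; the figure above may show others as well.

```python
import numpy as np

def sigmoid(z):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes inputs into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Keeps positive inputs, sets negative inputs to zero."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, f(z))
```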

# Using Neural Networks for Regression

## Neural Network Regression

**Data:** features `x_train`, real-valued labels `y_train`

**Probabilistic Model:** `y_train` $= g_\mathbf{W}($ `x_train` $) + \epsilon,\quad$ $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $g_\mathbf{W}$ is a neural network with parameters $\mathbf{W}$.

**Training Objective:** find $\mathbf{W}$ to maximize the likelihood of our data. This is equivalent to minimizing the Mean Squared Error,
$$
\min_{\mathbf{W}}\, \mathrm{MSE}(\mathbf{W}) = \frac{1}{N}\sum^N_{n=1} \left(y_n - g_\mathbf{W}(x_n)\right)^2
$$
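
To see why the likelihood and the squared error lead to the same optimum, here is a short derivation under the Gaussian noise model above, assuming independent observations and a fixed noise variance $\sigma^2$:
$$
\log p(y_1,\dots,y_N \mid x_1,\dots,x_N, \mathbf{W}) = \sum^N_{n=1} \log \mathcal{N}\left(y_n;\, g_\mathbf{W}(x_n), \sigma^2\right) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^N_{n=1} \left(y_n - g_\mathbf{W}(x_n)\right)^2
$$
The first term does not depend on $\mathbf{W}$, and the second is a negative multiple of the sum of squared errors. The $\mathbf{W}$ that maximizes the log-likelihood is therefore the $\mathbf{W}$ that minimizes the mean squared error.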

**Optimizing the Training Objective:** For linear regression (when $g_\mathbf{W}$ is a linear function), we computed the gradient of the MSE with respect to the model parameters $\mathbf{W}$, set it equal to zero and solved for the optimal $\mathbf{W}$ analytically.

Can we do the same when $g_\mathbf{W}$ is a neural network?

## Exercise: Optimizing Neural Networks

For the small neural network in the previous exercise, compute the gradient $\nabla_{\mathbf{W}}\,\mathrm{MSE}(\mathbf{W})$. Can you analytically solve for the optimal parameters $\mathbf{W}$? Is the training objective convex?

## Training Your NN --> Gradient descent

* Optimization Choices Matter

<img src="./fig/fig10.jpg" style='height:400px;'>

## Gradient Descent: the Algorithm
1. start at a random place: $W_0\leftarrow \textbf{random}$

2. until (stopping condition satisfied):

  a. compute gradient: 
     gradient = $\nabla$ loss\_function($W_{t}$)

  b. take a step in the negative gradient direction: 
     $W_{t+1} \leftarrow W_{t}$ - eta * gradient

Here *eta* is called the ***learning rate***.
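
The steps above can be sketched in NumPy as follows. The quadratic toy loss, the stopping rule based on the gradient norm, and the names `gradient_descent`, `grad_fn`, `tol` and `max_steps` are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_fn, w0, eta=0.1, max_steps=1000, tol=1e-8):
    """Run gradient descent from w0; stop when the gradient norm is below tol."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        gradient = grad_fn(w)                # step a: compute the gradient
        if np.linalg.norm(gradient) < tol:   # stopping condition
            break
        w = w - eta * gradient               # step b: move against the gradient
    return w

# Toy loss: L(w) = ||w - target||^2, whose gradient is 2 (w - target)
target = np.array([3.0, -1.0])

def grad_fn(w):
    return 2.0 * (w - target)

rng = np.random.default_rng(0)
w0 = rng.normal(size=2)                      # step 1: random starting point
w_star = gradient_descent(grad_fn, w0, eta=0.1)
print(w_star)
```

On this convex toy loss the iterates approach `target`. For a neural network the loss is not convex, so the end point can depend on the starting point and on `eta`.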

## Drawbacks of Gradient Descent

* Consider minimizing an average of functions (over examples):
$$
\min_{W} \frac{1}{N} \sum_{i=1}^N f_W(x_i)
$$

* The gradient update looks like:
$$
W_{t+1} \leftarrow W_{t} - \eta * \frac{1}{N} \sum_{i=1}^N \nabla_W  f_{W_t}(x_i)
$$

* What happens if you have 10k examples? 100k examples? A million examples?
* Is gradient descent still feasible?
* How can we optimize over such a large dataset?

## Stochastic Gradient Descent

* We could try to use fewer examples to approximate the gradient...

* With a single example:
$$
W_{t+1} \leftarrow W_{t} - \eta * \nabla_W  f_{W_t}(x_i),\qquad \forall t = 1,2,\dots
$$
* Every time we make an update on $W_t$, we take a new example randomly.
* How do you think this affects the optimization procedure?

<img src="./fig/sgd.png" style='height:450px;'>

## Mini-batch gradient descent

* Nobody calls it mini-batch gradient descent; in practice it is simply called SGD.

* Instead of using a single example --> use a mini-batch!
$$
W_{t+1} \leftarrow W_{t} - \eta * \frac{1}{M} \sum_{i=1}^M \nabla_W  f_{W_t}(x_i),\qquad \forall t = 1,2,\dots
$$

* This will reduce the variance of the updates.

* Mini-batch size is another tuning parameter.

* **Epoch:** One full pass over all data samples.
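
A minimal sketch of the mini-batch update, applied to a one-dimensional linear model so that the gradient can be written by hand. The synthetic data, the learning rate, and the batch size are assumptions chosen for illustration; setting `M = 1` gives the single-example update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2x + 1 + noise
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=N)

w, b = 0.0, 0.0      # parameters W = (w, b)
eta = 0.1            # learning rate
M = 20               # mini-batch size (M = 1 gives single-example SGD)
epochs = 50

for epoch in range(epochs):
    order = rng.permutation(N)               # reshuffle the data once per epoch
    for start in range(0, N, M):
        idx = order[start:start + M]         # indices of the current mini-batch
        residual = (w * x[idx] + b) - y[idx]
        # Gradient of the squared error, averaged over the mini-batch
        grad_w = 2.0 * np.mean(residual * x[idx])
        grad_b = 2.0 * np.mean(residual)
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)
```

Each pass of the outer loop is one epoch. Each update touches only `M` examples, so its cost does not grow with `N`.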

# Implementing Neural Networks in `python`

## `keras`: a Python Library for Neural Networks

`keras` is a `python` library that provides intuitive APIs for building neural networks quickly.

``` python
#keras model class for feedforward neural networks
from keras.models import Sequential

#keras layer class for fully connected layers in feedforward networks
from keras.layers import Dense

#keras module of optimizers for training objectives
from keras import optimizers
```

## Building a Neural Network for Regression in `keras`


``` python
#instantiate a feedforward model
model = Sequential()

#add layers sequentially

#first hidden layer: 2 nodes, taking 2 input dimensions
model.add(Dense(2, input_dim=2, activation='relu',
                kernel_initializer='random_uniform',
                bias_initializer='zeros'))
#second hidden layer: 2 nodes
model.add(Dense(2, activation='relu',
                kernel_initializer='random_uniform',
                bias_initializer='zeros'))

#output layer: 1 output dimension
#linear activation so that predictions can take any real value
model.add(Dense(1, activation='linear',
                kernel_initializer='random_uniform',
                bias_initializer='zeros'))

#configure the model: specify training objective and training algorithm
adam = optimizers.Adam(lr=0.01)
model.compile(optimizer=adam,
              loss='mean_squared_error')
```

## Choosing an optimizer

* Adjust the learning rate dynamically.
* Average of gradients.
* Accelerated versions: increase convergence speed but also variance

<img src="./fig/adam_sgd.png" style='height:450px;'>

## Training a Neural Network in `keras`

``` python
#fit the model and return the mean squared error during training
history = model.fit(X_train, Y_train, batch_size=20, shuffle=True, epochs=100, verbose=0)
```

## Monitoring Neural Network Training

Visualize the mean squared error over the course of training. This curve is called the training ***trajectory***.

``` python
import numpy as np
import matplotlib.pyplot as plt

#fit the model and return the mean squared error during training
history = model.fit(X_train, Y_train, batch_size=20, shuffle=True, epochs=100, verbose=0)

#plot the training loss (here the mean squared error) over the course of training
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

#the model was compiled without extra metrics, so the MSE is stored under 'loss'
ax.plot(np.array(history.history['loss']), color='blue', label='training MSE')
ax.set_xlabel('epoch')
ax.set_ylabel('mean squared error')
ax.legend()

plt.show()
```

# The Bias-Variance Trade-off for Neural Networks

## Generalization Error and Bias/Variance
Complex models have ***low bias*** -- they can model a wide range of functions, given enough samples.

But complex models like neural networks can use their 'extra' capacity to explain non-meaningful features of the training data that are unlikely to appear in the test data (i.e. noise). These models have ***high variance*** -- they are very sensitive to small changes in the training data, which can lead to a drastic decrease in performance from train to test settings.

<table>
    <tr>
        <td>
            <img src="./fig/fig11.png" style="width: 350px;" align="center"/>
        </td>
        <td>
            <img src="./fig/fig12.png" style="width: 350px;" align="center"/>
        </td>
    </tr>
</table>
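
The trade-off can be illustrated with a small experiment. The sketch below uses polynomials of increasing degree as a stand-in for models of increasing capacity (not neural networks), and the target function, noise level, and sample sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth function (a made-up example target)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(15)     # small training set
x_test, y_test = make_data(1000)     # large held-out set

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares fit
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

The training error can only go down as the degree grows, since each model contains the previous one. The test error typically goes down and then back up, which is the signature of high variance.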

## Regularization
A way to prevent overfitting is to reduce the capacity of the model, thereby limiting the kinds of functions it can model. This **increases bias, but reduces variance**:
1. **$\ell_1$, $\ell_2$ weight regularization** - adding a term to the loss function that penalizes the $\ell_1$-norm (sum of absolute values) or the squared $\ell_2$-norm (sum of squares) of the weights. This prevents the network from learning extremely squiggly functions.
``` python
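from keras import regularizers
#kernel_regularizer penalizes the layer's weights;
#activity_regularizer penalizes the layer's outputs
model.add(Dense(64, input_dim=64,
                kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01)))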
```

2. **Dropout** - randomly zeroing out the outputs of hidden nodes during training. This prevents the hidden nodes from "over specializing" or "memorizing" certain data points.
``` python
from keras.layers import Dropout
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
```
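
To make the mechanism concrete, here is a minimal NumPy sketch of dropout applied to a batch of hidden activations. It assumes the common "inverted dropout" convention, in which the surviving units are rescaled during training so that nothing changes at prediction time. The function name `dropout` and the array shapes are illustrative.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        return activations                        # no-op at prediction time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob         # rescale to keep the expected value

rng = np.random.default_rng(0)
h = np.ones((1000, 64))                           # hypothetical hidden activations
h_drop = dropout(h, rate=0.5, rng=rng)
print(h_drop.mean())
```

With `rate=0.5`, roughly half of the entries become zero and the rest are doubled, so the average activation stays close to its original value.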


<img src="./fig/dropout1.png" style='height:450px;'>


<img src="./fig/dropout2.png" style='height:450px;'>
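
Returning to the weight penalty in item 1 of the list above, the regularized objective can be spelled out by hand. This sketch assumes the penalty is `lam` times the sum of squared weights, added to the mean squared error. The function name `l2_penalized_mse` and all numbers are illustrative.

```python
import numpy as np

def l2_penalized_mse(y_true, y_pred, weights, lam=0.01):
    """MSE plus lam times the sum of squared weights over all weight matrices."""
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return mse + penalty

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 4.0])
weights = [np.array([[1.0, -2.0]]), np.array([[3.0]])]   # hypothetical weight matrices

loss = l2_penalized_mse(y_true, y_pred, weights, lam=0.01)
print(loss)
```

Larger weights now cost more, so the optimizer has to trade fit against weight size. The coefficient `lam` controls that trade and is one more parameter to tune.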

## Exercise: Compare Neural Network Regression to Polynomial Regression

Compare a neural network regression to a polynomial regression. Which model do you think is "better" and for what kind of tasks?

# Using Neural Networks for Classification

**Data:** features `x_train`, binary labels `y_train` (0 or 1)

**Probabilistic Model:** `y_train` $\sim \text{Bernoulli}(\sigma(g_\mathbf{W}($ `x_train` $)))$, where $g_\mathbf{W}$ is a neural network with parameters $\mathbf{W}$, and $\sigma(z) = \frac{1}{1+e^{-z}}$.

**Training Objective:** find $\mathbf{W}$ to maximize the likelihood of our data. This is equivalent to minimizing the ***binary cross entropy*** or ***log loss***,
$$
\min_{\mathbf{W}}\, \mathrm{CrossEnt}(\mathbf{W}) = -\sum^N_{n=1} \left[ y_n \log \sigma(g_\mathbf{W}(x_n)) + (1 - y_n) \log\left(1 - \sigma(g_\mathbf{W}(x_n))\right) \right]
$$

**Optimizing the Training Objective:** Since this objective is not convex and finding the zeros of the gradient is intractable, we will use gradient descent to find an "optimal" set of parameters $\mathbf{W}$.
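
As a sketch, the negative log-likelihood of the Bernoulli model above can be computed directly with NumPy. Here `logits` stands for the raw network outputs $g_\mathbf{W}(x_n)$. The clipping constant `eps` and the example values are assumptions added for numerical safety and illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, logits, eps=1e-12):
    """Negative Bernoulli log-likelihood of labels y given raw network outputs."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)   # clip to avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])
confident_right = np.array([4.0, -4.0, 4.0])       # outputs that agree with the labels
confident_wrong = -confident_right                 # outputs that contradict the labels

loss_right = binary_cross_entropy(y, confident_right)
loss_wrong = binary_cross_entropy(y, confident_wrong)
print(loss_right, loss_wrong)
```

Predictions that agree with the labels give a small loss, and confident mistakes are penalized heavily.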

## Building a Neural Network for Classification in `keras`

``` python
#instantiate a feedforward model
model = Sequential()

#add layers sequentially

#first hidden layer: 2 nodes, taking 2 input dimensions
model.add(Dense(2, input_dim=2, activation='relu',
                kernel_initializer='random_uniform',
                bias_initializer='zeros'))
#second hidden layer: 2 nodes
model.add(Dense(2, activation='relu', 
                kernel_initializer='random_uniform',
                bias_initializer='zeros')) 

#output layer: 1 output dimension
model.add(Dense(1, activation='sigmoid', 
                kernel_initializer='random_uniform',
                bias_initializer='zeros')) 

#configure the model: specify training objective and training algorithm
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd,
              loss='binary_crossentropy')
```

## Reminder of bias & variance trade-off in classification

<img src="./fig/bias_variance.png" style='height:450px;'>


## Exercise: Compare Neural Network Classification to Other Classifiers

Compare a neural network classifier to other classifiers. For what task is a neural network the most appropriate model?

## KEY IDEAS recap:

* NNs are universal approximators: with enough hidden units they can approximate any continuous function on a bounded domain arbitrarily well

* How do we train NNs? Gradient descent + chain rule = Back-propagation.

* Wide variety of optimizers --> optimization alternatives.

* Deep NNs can overfit, they WILL overfit!

* Lots of parameters to tune.

* **Extra note**: Linear/Logistic Regression ARE neural networks with no hidden layers.


### How to avoid overfitting?
- more training data
- L2/L1 penalties on weights
- data augmentation
- dropout
- early stopping

### Optimizing quality/speed
- learning rate
- batch size

## Orthogonalization

<img src="./fig/orthogonalization.png" style='height:450px;'>

## Data augmentation

<img src="./fig/data_aug.png" style='height:450px;'>

## Early stopping

<img src="./fig/early_stop.png" style='height:450px;'>
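
A minimal sketch of the early stopping rule, assuming that a validation loss is monitored once per epoch and that training stops after `patience` epochs without improvement. The function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break                      # no improvement for `patience` epochs
    return best_epoch

# Made-up validation curve: improves, then starts to overfit
val_losses = [1.00, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.49]
stop_at = early_stopping_epoch(val_losses, patience=3)
print(stop_at)
```

In this example training stops before the last epoch is reached, so its lower loss is never seen. A larger `patience` trades extra training time for a smaller chance of stopping too early.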

## Summary advantages/disadvantages:

* PROs:
    - flexible models
    - high performance (state-of-the-art: language, speech, images)
    - open-source software/several resources

* CONs:
    - require lots of data
    - many tuning params
    - computationally expensive
    - optimization not easy: will it converge? is local minimum enough?
    - will it overfit?