# Week 3 Notes

## Tuning Process

Hyperparameters are settings of the learning procedure and network architecture that are held fixed while the model is trained. They are not learned from the training data directly, because doing so would lower the training error without aiding generalization. Additionally, there is usually no clean way to learn them from data even if we wanted to, because of the computational load.

Instead we rely on another split of the data and use a pseudo empirical Bayes procedure to find the best hyperparameters. This is a great discussion from [Cross Validated](https://stats.stackexchange.com/questions/365762/why-dont-we-just-learn-the-hyper-parameters):

```
A hyperparameter typically corresponds to a setting of the learning algorithm, rather than one of its parameters. In the context of deep learning, for example, this is exemplified by the difference between something like the number of neurons in a particular layer (a hyperparameter) and the weight of a particular edge (a regular, learnable parameter).

Why is there a difference in the first place? The typical case for making a parameter a hyperparameter is that it is just not appropriate to learn that parameter from the training set. For example, since it's always easier to lower the training error by adding more neurons, making the number of neurons in a layer a regular parameter would always encourage very large networks, which is something we know for a fact is not always desirable (because of overfitting).

To your question, it's not that we don't learn the hyper-parameters at all. Setting aside the computational challenges for a minute, it's very much possible to learn good values for the hyperparameters, and there are even cases where this is imperative for good performance; all the discussion in the first paragraph suggests is that by definition, you can't use the same data for this task.

Using another split of the data (thus creating three disjoint parts: the training set, the validation set, and the test set, what you could do in theory is the following nested-optimization procedure: in the outer-loop, you try to find the values for the hyperparameters that minimize the validation loss; and in the inner-loop, you try to find the values for the regular parameters that minimize the training loss.

This is possible in theory, but very expensive computationally: every step of the outer loop requires solving (till completion, or somewhere close to that) the inner-loop, which is typically computationally-heavy. What further complicates things is that the outer-problem is not easy: for one, the search space is very big.

There are many approaches to overcome this by simplifying the setup above (grid search, random search or model-based hyper-parameter optimization), but explaining these is well beyond the scope of your question. As the article you've referenced also demonstrates, the fact that this is a costly procedure often means that researchers simply skip it altogether, or try very few setting manually, eventually settling on the best one (again, according to the validation set). To your original question though, I argue that - while very simplistic and contrived - this is still a form of "learning".

```

### Examples of Hyperparameters

There are many kinds of hyperparameters that one might want to tune. A good strategy (found empirically) is to tune in the following sequence:

1. learning rate $\alpha$

1. number of hidden units, minibatch size, momentum parameter $\beta$

1. number of layers, learning rate decay

1. Almost never done in practice, but it is possible to tune the Adam parameters $\beta_1, \beta_2, \epsilon$. Defaults (0.9, 0.999, $10^{-8}$) are usually good enough.


### Hyperparameter Search Strategy

There are some strategies that can be tried:


1. grid search - usually discouraged as computationally expensive, since the number of trials grows exponentially with the number of hyperparameters.

1. random search - known to produce good results, but can be wasteful. We can also use a coarse-to-fine scheme: once a promising neighborhood has been found, we focus the search around that area. This is similar in spirit to the next strategy.

1. Bayesian hyperparameter optimization - balancing exploitation vs exploration.

There has been some innovation in this space; notes can be found [here](https://www.automl.org/wp-content/uploads/2018/09/chapter1-hpo.pdf).
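
The random search with a coarse-to-fine refinement described above can be sketched as follows. This is a minimal illustration, not a tuned procedure: `validation_loss` is a hypothetical stand-in for "train a model and report its validation loss", and the search ranges and trial counts are made up for the example.

```python
import random

def validation_loss(log10_alpha, hidden_units):
    """Hypothetical stand-in for training a model and returning validation loss."""
    # toy surface with its minimum at alpha = 1e-2 and 64 hidden units
    return (log10_alpha + 2.0) ** 2 + ((hidden_units - 64) / 64.0) ** 2

def random_search(n_trials, alpha_exp_range, units_range, rng):
    """Sample settings at random; return (loss, log10_alpha, units) of the best one."""
    best = None
    for _ in range(n_trials):
        log10_alpha = rng.uniform(*alpha_exp_range)
        units = rng.randint(*units_range)
        loss = validation_loss(log10_alpha, units)
        if best is None or loss < best[0]:
            best = (loss, log10_alpha, units)
    return best

rng = random.Random(0)

# coarse pass over a wide region
coarse = random_search(50, (-5.0, 0.0), (8, 256), rng)

# fine pass: shrink the region around the best coarse point
_, a0, u0 = coarse
fine = random_search(
    50,
    (a0 - 0.5, a0 + 0.5),
    (max(8, u0 - 32), min(256, u0 + 32)),
    rng,
)

best = min(coarse, fine, key=lambda result: result[0])
print("best log10(alpha): %.3f, hidden units: %d, loss: %.4f"
      % (best[1], best[2], best[0]))
```

The fine pass spends the same budget on a much smaller region, which is the point of the scheme: the coarse pass only has to land near a good setting.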

## Using An Appropriate Scale

If we have a range for a parameter (finding one is in itself a task), then sampling on a scale that reflects relative rather than absolute change is often better. One good choice is to sample uniformly in log space.

This could be for quantities such as:

- $\alpha$, sampled uniformly in log space
- $\beta$, sampled by drawing $\log(1-\beta)$ uniformly
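
As a small sketch of both cases, using only the standard library: the ranges ($10^{-4}$ to $1$ for $\alpha$, $0.9$ to $0.999$ for $\beta$) and the function names are assumptions chosen for illustration.

```python
import math
import random

rng = random.Random(0)

def sample_learning_rate(low=1e-4, high=1.0):
    """Draw alpha so that log10(alpha) is uniform between log10(low) and log10(high)."""
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** r

def sample_beta(low=0.9, high=0.999):
    """Draw beta so that log10(1 - beta) is uniform, i.e. 1 - beta in [0.001, 0.1]."""
    r = rng.uniform(math.log10(1 - high), math.log10(1 - low))
    return 1 - 10 ** r

alphas = [sample_learning_rate() for _ in range(1000)]
betas = [sample_beta() for _ in range(1000)]

# on a log scale, about half of the alphas fall below the midpoint 1e-2
fraction_small = sum(a < 1e-2 for a in alphas) / len(alphas)
print(fraction_small)
```

Sampling $\alpha$ uniformly on the linear scale would instead place about 99% of the draws above $10^{-2}$, leaving the small learning rates almost unexplored.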


## HyperParameter Tuning In Practice

There are two ways to do hyperparameter search in practice.

### Panda Approach: Not much compute or data

The first is to babysit and manage a single model, adjusting hyperparameters manually and checking how performance changes. This is out of vogue in the era of cheap compute and big data. It is akin to a panda that has one offspring and takes care of it.

### Caviar Approach: A lot of compute and data

The second approach is much more used these days. This is to try many hundreds/thousands of hyperparameter settings and choose that which performs best. Such an approach is like caviar, with fish laying thousands of eggs and only some surviving with little supervision.

## Normalizing Activations In A Network

### Normalizing Inputs

We have seen that normalizing inputs (layer 0 activations) is a good idea: putting the features on a similar scale makes the cost surface easier for gradient descent to navigate. The same argument is used to justify batch norm, that is, normalizing the linear combinations in each layer. This has been shown empirically to improve training and reduce problems like exploding and vanishing gradients.

Some papers suggest normalizing activations, but in practice most systems are made by normalizing linear combinations across the layers. The equations are as below:

For a fixed layer [l] and a minibatch t of size $q$, we have linear combinations for each of the q examples in this minibatch.

$z^{[l] \{t\}, (1)}, \ldots, z^{[l] \{t\}, (q)} $


Note that each of these z's is a vector with `z.shape` $= (n^{[l]},1)$

We now normalize these $z$'s assuming statistical independence and get location and scale vector parameters $\mu$ and $\sigma$. These have the same shape as the z's, namely $(n^{[l]},1)$

### Normalizing Hidden Layers Linear Combinations

$\mu$ and $\sigma$ are vectors (of shape $(n^{[l]},1)$, the same as the z's) for the $l^{th}$ layer and the $t^{th}$ minibatch, estimated by averaging over the $q$ examples in that minibatch:


$\mu^{[l], \{t\} } = \frac{1}{q} \sum_{i=1}^{q}  z^{[l],\{t\}, (i)}$

$ (\sigma^{[l], \{t\}})^2 = \frac{1}{q} \sum_{i=1}^{q}  (z^{ [l],\{t\}, (i)} - \mu^{[l], \{t\}})^2$

The square is taken elementwise; the assumption is that there is no cross-correlation between the components of $z$.

Then the normalized z's are obtained by subtracting the mean estimate and dividing by the standard deviation:

$z^{[l],\{t\}, (i)}_{\textbf{norm}} = \frac{z^{[l],\{t\}, (i)} - \mu^{[l], \{t\} }}{\sigma^{[l], \{t\}}}$
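
A minimal NumPy sketch of this normalization step is given below. It assumes the $q$ vectors of a minibatch are stacked as the columns of a matrix `Z` of shape $(n^{[l]}, q)$. The small constant `eps` under the square root is an addition for numerical stability that does not appear in the formula above.

```python
import numpy as np

def normalize_batch(Z, eps=1e-8):
    """Normalize each row (hidden unit) of Z across the q examples in the minibatch."""
    mu = Z.mean(axis=1, keepdims=True)    # shape (n, 1)
    var = Z.var(axis=1, keepdims=True)    # shape (n, 1), divides by q
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return Z_norm, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=2.0, size=(4, 64))   # n = 4 units, q = 64 examples
Z_norm, mu, var = normalize_batch(Z)

print(Z_norm.mean(axis=1))   # each entry close to 0
print(Z_norm.var(axis=1))    # each entry close to 1
```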


### Rescaling Normalized Linear Combinations

As will be explained later, it is useful to rescale the normalized values to have an arbitrary mean and variance. The reason is that certain centerings and scalings are better for certain activation functions. For example, scaling and shifting the tanh function gives the sigmoid function, since $\text{sigmoid}(x) = \frac{1}{2}\left(1 + \tanh(x/2)\right)$.

For a given location $\beta$ and scale $\gamma$ (which we will see later can be learned from the training set using gradient descent), we have:

$\widetilde{z}^{[l],\{t\}, (i)} = \gamma^{[l]} z^{[l],\{t\}, (i)}_{\textbf{norm}} + \beta^{[l]}$

These $\widetilde{z}$'s are used as arguments to the activation functions. The $\gamma$ and $\beta$ parameters need to be estimated, but this can usually be done by adding a single command in most neural network frameworks such as TensorFlow or PyTorch.
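
The full normalize-then-rescale step can be sketched in NumPy as follows. The function name, the matrix layout (examples as columns) and the `eps` constant are assumptions made for the illustration; in a real network $\gamma$ and $\beta$ would be learned rather than fixed by hand.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z of shape (n, q) per unit, then scale by gamma and shift by beta."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta

rng = np.random.default_rng(1)
Z = rng.normal(loc=-1.0, scale=5.0, size=(3, 128))

gamma = np.array([[1.0], [2.0], [0.5]])    # per-unit scale
beta = np.array([[0.0], [1.0], [-3.0]])    # per-unit location
Z_tilde = batch_norm_forward(Z, gamma, beta)

print(Z_tilde.mean(axis=1))   # close to beta
print(Z_tilde.std(axis=1))    # close to gamma

# special case: gamma = sqrt(var + eps) and beta = mu undo the normalization
mu = Z.mean(axis=1, keepdims=True)
sd = np.sqrt(Z.var(axis=1, keepdims=True) + 1e-8)
Z_back = batch_norm_forward(Z, sd, mu)
```

The special case at the end shows that the rescaling loses nothing: if the original mean and variance were the best choice, the network can recover them through $\gamma$ and $\beta$.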

### Why Rescale Linear Combinations?

Why do we do this? More details will be provided later, but broadly:

- Allows centering and scaling output to activation functions

- Downstream neurons can rely on a more stable range and distribution of their inputs

- Better convergence properties

- Some small regularization effect

## Fitting Batch Norm Into Networks

In order to fit batch norm and learn the parameters needed, we need to make slight modifications. In particular the bias term falls away and is replaced by the $\beta^{[l]}$ term.

$\beta$ and $\gamma$ need to be learned using gradient descent, along with the $W$ parameters. That means initial conditions and then forward and backward propagation to get updates for the iterations of the gradient descent algorithm.

Because the mean is subtracted when normalizing the z's, any constant bias added to them cancels out, so the bias parameters disappear and do not need to be estimated.
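
A quick numerical check of this cancellation, as a sketch with made-up shapes and random values: normalizing $W a + b$ gives the same result as normalizing $W a$.

```python
import numpy as np

def normalize(Z, eps=1e-8):
    """Per-unit normalization across the minibatch (columns are examples)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(2)
A_prev = rng.normal(size=(5, 16))   # activations of the previous layer, (n_prev, q)
W = rng.normal(size=(3, 5))
b = rng.normal(size=(3, 1))         # a bias that batch norm makes redundant

with_bias = normalize(W @ A_prev + b)
without_bias = normalize(W @ A_prev)

# the two agree up to floating-point rounding
print(np.max(np.abs(with_bias - without_bias)))
```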


## Why Does Batch Norm Work?

BatchNorm works because it introduces a transformation that makes the later layers of the network less sensitive to covariate shift in the earlier layers. Usually a shift in the distribution of the hidden-layer values causes problems for the layers that consume them. However, if their scale and location are estimated parameters applied to a normalized linear combination, then the algorithm can learn effectively.


- Allows centering and scaling output to activation functions

- Downstream neurons can rely on a more stable range and distribution of their inputs

- Better convergence properties

- Some small regularization effect due to mini batch specific mean and variance estimates - which introduces some noise.

## Batch Norm At Test Time


At test time, the normalization parameters are estimated by taking an exponentially weighted average, across the training minibatches, of the mean and variance estimates. These can be used to normalize the test cases and then apply the transformations to get our predictions.

Ideally we would estimate the mean and variance using all the data, but that would be a computationally expensive pass over all the data. Instead, by calculating an exponentially weighted moving average, we have a good estimate for little cost.

In practice, most of the frameworks do this or something similar in the background, so it doesn't need to be coded from scratch.
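
To make the idea concrete, here is a minimal NumPy sketch of keeping running estimates during training and using them for a single test example. The averaging weight of 0.9, the simulated minibatches and the function name are assumptions for illustration; frameworks differ in their exact update rule and defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 3, 32                  # units in the layer, minibatch size
momentum, eps = 0.9, 1e-8     # assumed averaging weight and stability constant

running_mu = np.zeros((n, 1))
running_var = np.ones((n, 1))

# training: update the running estimates with each minibatch's statistics
for t in range(200):
    Z = rng.normal(loc=2.0, scale=3.0, size=(n, q))   # stand-in for real z values
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

def batch_norm_inference(z, gamma, beta):
    """Normalize one test example with the running estimates, then rescale."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

z_test = rng.normal(loc=2.0, scale=3.0, size=(n, 1))
z_tilde = batch_norm_inference(z_test, gamma=np.ones((n, 1)), beta=np.zeros((n, 1)))
print(running_mu.ravel(), running_var.ravel())
```

With the simulated data the running mean settles near 2 and the running variance near 9, so a single test example can be normalized without needing a minibatch of its own.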


## SoftMax Regression

In softmax regression (also called multinomial regression) the output is one (and only one) of several predefined classes, say K of them. Multinomial regression is a linear classifier, the multiclass extension of logistic regression. 

The final layer activation function is defined as taking in the final layer linear combinations and working on all the components of this layer to produce a vector of the same length that can be seen as the probability of each class. That is, it takes K linear combinations (a K length vector) and returns K probabilities, one for each class. The actual transformation function is from the $Z^{[1]}$ vector of the final layer to the final output estimate vector:

$ \Pr( Y = C_0 | x)  = \frac{\exp(Z_0)}{\sum_{j=0}^{K-1} \exp(Z_j)} $

$ \Pr( Y = C_1 | x)  = \frac{\exp(Z_1)}{\sum_{j=0}^{K-1} \exp(Z_j)} $

$ \Pr( Y = C_2 | x)  = \frac{\exp(Z_2)}{\sum_{j=0}^{K-1} \exp(Z_j)} $

$ \vdots $

$ \Pr( Y = C_{K-1} | x)  = \frac{\exp(Z_{K-1})}{\sum_{j=0}^{K-1} \exp(Z_j)}$

![](softmax_regression.gif)
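
A short NumPy sketch of the softmax transformation follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Map a length-K vector of linear combinations to K class probabilities."""
    shifted = z - np.max(z)     # avoids overflow, does not change the result
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([5.0, 2.0, -1.0, 3.0])    # K = 4 linear combinations
p = softmax(z)
predicted_class = int(np.argmax(p))    # index of the most probable class

print(p, p.sum(), predicted_class)
```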

### Why The Name SoftMax?

The name SoftMax is used because assignment of the class is usually based on the maximum of the soft distribution output. A hard max would output a vector with only one non-zero component, equal to 1, at the position of the largest input.

### How Is This Related To Logistic Regression?

If we have $K=2$ and realize that for this case:

$ \Pr( Y = C_1 | x) = 1 - \Pr( Y = C_0 | x)$

Then as we will see, the cost function reduces to the logistic classifier cost function when there are no hidden layers.
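
The reduction can be seen directly from the softmax formula with $K=2$. Dividing numerator and denominator by $\exp(Z_1)$ gives a logistic function of the difference $Z_1 - Z_0$:

$$
\Pr( Y = C_1 | x) = \frac{\exp(Z_1)}{\exp(Z_0) + \exp(Z_1)} = \frac{1}{1 + \exp(-(Z_1 - Z_0))}
$$

Since $Z_0$ and $Z_1$ are both linear in $x$ when there are no hidden layers, so is their difference, which is exactly the form of logistic regression.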

## Training SoftMax on Neural Networks

SoftMax can be used as the activation function of the final output layer of a neural network that classifies an input into one of K classes. We look at the loss function to understand how this can be set up.

### Loss Function

If $y$ is the one-hot vector of the actual label and $\hat{y}$ is the vector of estimated class probabilities given features $x$ (the mapping learned from historical data examples), then the canonical loss function, for vectors $y$ and $\hat{y}$ of length $K$, is:

$$
L(y, \hat{y})  = -\sum_{j=0}^{K-1} y_j \log(\hat{y}_j)
$$

This is the negative log-likelihood (NLL), also known as the cross-entropy loss.
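
As a small worked example in NumPy (the label and probability values are made up): with a one-hot $y$, only the term for the true class survives, so the loss is minus the log of the probability assigned to the true class.

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy loss for a one-hot label y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])        # true class is index 1
y_hat = np.array([0.2, 0.7, 0.1])    # predicted class probabilities

loss = cross_entropy(y, y_hat)
print(loss, -np.log(0.7))            # the two values agree
```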

### BackPropagation 

In the backpropagation algorithm, for the final layer, we have: 

$dZ^{[L]} = \hat{y} - y$

and the chain rule ensures that backprop follows as before.
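
The form and sign of the final-layer gradient can be checked numerically. The sketch below compares $\hat{y} - y$ against a central finite-difference estimate of the derivative of the cross-entropy loss with respect to $Z^{[L]}$; the input values and step size are arbitrary choices for the check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    """Cross-entropy loss of softmax(z) against the one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 0.0, 1.0])

# analytic gradient of the loss with respect to z
analytic = softmax(z) - y

# central finite differences, one component of z at a time
h = 1e-6
numeric = np.zeros_like(z)
for j in range(len(z)):
    step = np.zeros_like(z)
    step[j] = h
    numeric[j] = (loss(z + step, y) - loss(z - step, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(analytic, numeric, max_diff)
```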


## TensorFlow

TensorFlow (`tf`) is a Python language framework for deep learning and matrix based programming open sourced by Google. It is based off an earlier project that Google used internally called DistBelief.

TensorFlow has many API layers (for web, servers, cellphones, execution modes, etc.). This can make it seem complicated. It also moved from a DAG model that needed to be compiled to a more Pythonic eager execution mode. This is in addition to its integration with the easy-to-use Keras API.

One of the major benefits of `tf` is that once the forward prop is specified (usually in the `tf` idiomatic way), it will automatically calculate the backprop using rules for simple operations and expressions.

Below is a simple example of minimizing a one variable convex function using gradient descent variations.

### Minimize Convex Quadratic Function

Suppose we have the function:

$J(w) = w^2 -10w +25$

We want to find the point w where it is minimized. This is easy to do algebraically. We can factor the square 

$J(w) = (w-5)^2$

and it's clear that the function is minimized when $w = 5$.

However, just to get a feel for the technology with a simple example, we will do it with gradient descent using tensorflow.

``` python
import numpy as np
import tensorflow as tf

# NOTE: this uses the TensorFlow 1.x graph API (tf.train, tf.Session);
# in TensorFlow 2.x these names live under tf.compat.v1

# define the variable that will be optimized
w = tf.Variable(0.0, dtype=tf.float32)
# define the function J using tf API commands
J = tf.add(tf.add(tf.square(w), tf.multiply(-10.0, w)), 25.0)
# set the learning rate hyperparameter - no tuning
learning_rate = 0.01
# set the optimizer with its learning rate and the objective to minimize;
# this only builds the training operation, nothing is run yet
# note there is no need to explicitly define the gradient function
# train = tf.train.GradientDescentOptimizer(learning_rate).minimize(J)
train = tf.train.AdamOptimizer(learning_rate).minimize(J)

#### Idiomatic tf Session running the model ###
# initialize parameters
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))
```

```
0.0
```


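``` python
# see the impact of one optimizer update on the parameter
session.run(train)
print(session.run(w))
```

```
5.0
```

Note that a single update starting from $w = 0$ with a learning rate of 0.01 changes $w$ only slightly (to 0.1 with plain gradient descent), so the 5.0 recorded here most likely reflects training steps that had already been run in the same session.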


``` python
# run 1000 more iterations, in addition to the one done before
for i in range(1000):
    session.run(train)

session.run(w)
```

```
5.0
```

With a small constant learning rate, plain gradient descent converges very close to the true optimum after 1000 iterations. Swapping in the Adam optimizer is a one-line change in the code above, although neither optimizer reaches $w = 5$ in a single iteration from $w = 0$ at this learning rate.

### What is tf.Session doing?

In the background, `tf.Session` moves the model into very efficient code that is optimized for the hardware, for example by making good use of the CPU cache. It can even move the data and model to a GPU or specialized chips. This is the DAG part.

When `session.run` is called, the model is executed on the chips and the results are transported back to the main Python thread. As resources are being acquired and managed by `tf.Session` and `session.run`, making use of the `with` keyword is important. This takes care of error handling and releasing resources.

``` python
with tf.Session() as session:
    session.run(init)
    print(session.run(w))
    session.run(train)
```


Again remember, there are dozens of APIs to `tf`, so there are many ways to do this. The most popular (for model development) these days is to use Keras and eager execution.
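
For comparison, here is a minimal sketch of the same minimization of $J(w) = w^2 - 10w + 25$ in eager mode. It assumes TensorFlow 2.x, where `tf.GradientTape` records the forward computation and no session is needed. The update rule is written out by hand as plain gradient descent, and the helper names are made up for the example.

```python
import tensorflow as tf

w = tf.Variable(0.0, dtype=tf.float32)
learning_rate = 0.01

def J(w):
    """The convex quadratic to minimize."""
    return w ** 2 - 10.0 * w + 25.0

def gradient_step():
    """One plain gradient descent update of w."""
    with tf.GradientTape() as tape:
        cost = J(w)                       # forward computation is recorded
    grad = tape.gradient(cost, w)         # dJ/dw from automatic differentiation
    w.assign_sub(learning_rate * grad)    # w <- w - learning_rate * grad

gradient_step()
w_after_one_step = float(w.numpy())       # 0 - 0.01 * (-10) = 0.1

for _ in range(1000):
    gradient_step()
w_final = float(w.numpy())

print(w_after_one_step, w_final)
```

Each call runs immediately and returns ordinary values, which is what makes the eager style easier to debug than building a graph and running it in a session.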

For a second, more machine-learning-specific example, we look at multiple linear regression.
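
A minimal sketch of multiple linear regression fitted by gradient descent is given below. It is an illustration rather than the original notebook code: it assumes TensorFlow 2.x in eager mode, and the synthetic data, true coefficients, learning rate and iteration count are all made up for the example.

```python
import numpy as np
import tensorflow as tf

# synthetic data: y = X @ w_true + b_true + small noise (all values made up)
rng = np.random.default_rng(0)
w_true = np.array([[2.0], [-1.0], [0.5]])
b_true = 4.0
X_np = rng.normal(size=(200, 3))
y_np = X_np @ w_true + b_true + 0.01 * rng.normal(size=(200, 1))

X = tf.constant(X_np, dtype=tf.float32)
y = tf.constant(y_np, dtype=tf.float32)

# parameters to learn
W = tf.Variable(tf.zeros((3, 1)))
b = tf.Variable(0.0)
learning_rate = 0.1

for _ in range(300):
    with tf.GradientTape() as tape:
        y_hat = tf.matmul(X, W) + b                    # forward prop
        loss = tf.reduce_mean(tf.square(y_hat - y))    # mean squared error
    dW, db = tape.gradient(loss, [W, b])               # backprop, done by tf
    W.assign_sub(learning_rate * dW)
    b.assign_sub(learning_rate * db)

print(W.numpy().ravel(), float(b.numpy()))
```

Only the forward computation and the loss are specified; the gradients with respect to `W` and `b` come from the tape.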



And here is a more complex example on the hello world dataset for neural networks: MNIST. The model parameters are trained with the Adam optimizer on the sparse categorical cross-entropy loss. Notice how the model still needs to be compiled (`model.compile`) before it is run (`model.fit`).

``` python
import tensorflow as tf
mnist = tf.keras.datasets.mnist

# load the data and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# one hidden layer with dropout, softmax output over the 10 digit classes
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```

```
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

[0.062195106509223115, 0.9802]
```

The final list is the return value of `model.evaluate`: the test loss followed by the test accuracy.

Note how in all cases we didn't need to specify the gradient; it was computed in the background by `tf`'s automatic differentiation.