# Week 3 Notes

## Tuning Process

Hyperparameters are settings of the neural network learning setup that are held fixed during training. They are not learned from the training data directly, because doing so would lower the training error without aiding generalization. In addition, there is usually no cheap way to learn them from data even if we wanted to, because of the computational load.

Instead we rely on another split of the data and use a pseudo empirical Bayes procedure to find the best hyperparameters. There is a good discussion of this on [Cross Validated](https://stats.stackexchange.com/questions/365762/why-dont-we-just-learn-the-hyper-parameters) (the statistics Stack Exchange site), quoted here:

```
A hyperparameter typically corresponds to a setting of the learning algorithm, rather than one of its parameters. In the context of deep learning, for example, this is exemplified by the difference between something like the number of neurons in a particular layer (a hyperparameter) and the weight of a particular edge (a regular, learnable parameter).

Why is there a difference in the first place? The typical case for making a parameter a hyperparameter is that it is just not appropriate to learn that parameter from the training set. For example, since it's always easier to lower the training error by adding more neurons, making the number of neurons in a layer a regular parameter would always encourage very large networks, which is something we know for a fact is not always desirable (because of overfitting).

To your question, it's not that we don't learn the hyper-parameters at all. Setting aside the computational challenges for a minute, it's very much possible to learn good values for the hyperparameters, and there are even cases where this is imperative for good performance; all the discussion in the first paragraph suggests is that by definition, you can't use the same data for this task.

Using another split of the data (thus creating three disjoint parts: the training set, the validation set, and the test set, what you could do in theory is the following nested-optimization procedure: in the outer-loop, you try to find the values for the hyperparameters that minimize the validation loss; and in the inner-loop, you try to find the values for the regular parameters that minimize the training loss.

This is possible in theory, but very expensive computationally: every step of the outer loop requires solving (till completion, or somewhere close to that) the inner-loop, which is typically computationally-heavy. What further complicates things is that the outer-problem is not easy: for one, the search space is very big.

There are many approaches to overcome this by simplifying the setup above (grid search, random search or model-based hyper-parameter optimization), but explaining these is well beyond the scope of your question. As the article you've referenced also demonstrates, the fact that this is a costly procedure often means that researchers simply skip it altogether, or try very few setting manually, eventually settling on the best one (again, according to the validation set). To your original question though, I argue that - while very simplistic and contrived - this is still a form of "learning".

```
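The nested procedure described in the quote can be sketched on a toy problem. This is only an illustration under stated assumptions: the model is a one-weight ridge regression (not a neural network), the hyperparameter is a regularization strength `lam`, and the names `fit_weight`, `mse`, and `select_lam` are hypothetical helpers introduced here. The inner problem is solved in closed form, which a real network would replace with a full training run.

```python
import random

def fit_weight(xs, ys, lam):
    """Inner loop: the weight w minimizing sum((w*x - y)^2) + lam * w^2 on the training set."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def select_lam(train, val, candidates):
    """Outer loop: the candidate lam whose fitted model has the lowest validation loss."""
    best_lam, best_loss = None, float("inf")
    for lam in candidates:
        w = fit_weight(train[0], train[1], lam)  # solve the inner problem
        loss = mse(val[0], val[1], w)            # score on held-out data
        if loss < best_loss:
            best_lam, best_loss = lam, loss
    return best_lam, best_loss

# Synthetic data (an assumption for the example): y = 2x + noise.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(300)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]

# Three disjoint parts: training, validation, test.
train = (xs[:200], ys[:200])
val = (xs[200:250], ys[200:250])
test = (xs[250:], ys[250:])

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam, best_val_loss = select_lam(train, val, candidates)

# The test set is touched only once, after the hyperparameter has been chosen.
test_loss = mse(test[0], test[1], fit_weight(train[0], train[1], best_lam))
```

Each candidate costs one full inner solve, which is why the outer loop becomes expensive once the inner problem is a neural network.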

### Examples of Hyperparameters

There are many kinds of hyperparameters that one might want to tune. A good strategy (found empirically) is to tune in the following sequence:

1. learning rate $\alpha$

1. momentum term $\beta$, number of hidden units, minibatch size

1. number of layers, learning rate decay

1. Adam's $\beta_1, \beta_2, \epsilon$. Tuning these is possible but almost never done in practice; the defaults (0.9, 0.999, $10^{-8}$) are usually good enough.


### Hyperparameter Search Strategy

There are some strategies that can be tried:


1. grid search - usually discouraged: the cost grows exponentially with the number of hyperparameters, and each hyperparameter is only ever tried at a few distinct values.

1. random search - known to produce good results, but can be wasteful. A coarse-to-fine scheme can help: once a promising neighborhood has been found, we focus the search around that area. This is similar in spirit to the next strategy.

1. Bayesian hyperparameter optimization - balancing exploitation vs exploration.

There has been some innovation in this space; notes can be found [here](https://www.automl.org/wp-content/uploads/2018/09/chapter1-hpo.pdf).
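A minimal sketch of random search with a coarse-to-fine pass follows. The function `validation_loss` is a hypothetical stand-in for "train a model and return its validation loss", and the search ranges, trial counts, and window widths are arbitrary choices made for the illustration.

```python
import math
import random

def validation_loss(learning_rate, hidden_units):
    # Hypothetical stand-in for "train a model and return its validation loss".
    # Its minimum is placed at learning_rate = 1e-2 and hidden_units = 64.
    return (math.log10(learning_rate) + 2.0) ** 2 + ((hidden_units - 64) / 64.0) ** 2

def random_search(lr_exp_range, units_range, n_trials, rng):
    """Sample settings at random and return the best (loss, learning_rate, hidden_units)."""
    best = None
    for _ in range(n_trials):
        learning_rate = 10 ** rng.uniform(*lr_exp_range)  # sampled on a log scale
        hidden_units = rng.randint(*units_range)           # inclusive integer range
        trial = (validation_loss(learning_rate, hidden_units), learning_rate, hidden_units)
        if best is None or trial[0] < best[0]:
            best = trial
    return best

rng = random.Random(0)

# Coarse pass over the full ranges.
coarse = random_search((-5.0, 0.0), (8, 256), 30, rng)

# Fine pass: narrow the ranges around the coarse winner, clipped to the original bounds.
_, lr0, units0 = coarse
exp0 = math.log10(lr0)
fine = random_search(
    (max(-5.0, exp0 - 0.5), min(0.0, exp0 + 0.5)),
    (max(8, units0 - 32), min(256, units0 + 32)),
    30,
    rng,
)

best_loss, best_lr, best_units = min(coarse, fine, key=lambda t: t[0])
```

Unlike a grid, every trial here uses a fresh value of every hyperparameter, so 30 trials explore 30 distinct learning rates rather than a handful.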

## Using An Appropriate Scale

Choosing a range for a hyperparameter is a task in itself. Once we have one, sampling on a scale that reflects relative change is often better than sampling uniformly over the range. One good choice is to sample in log space.

This could be for quantities such as:

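- $\alpha$: sample $\log \alpha$ uniformly, then exponentiate
- $\beta$: sample $\log(1-\beta)$ uniformly, then recover $\beta$, so that values close to 1 get enough resolution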
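A short sketch of both sampling rules, using only the Python standard library. The ranges ($\alpha \in [10^{-4}, 1]$ and $\beta \in [0.9, 0.999]$) are assumed here for illustration and are not prescribed by these notes; the function names are hypothetical.

```python
import math
import random

def sample_learning_rate(rng, low=1e-4, high=1.0):
    """Sample alpha so that log10(alpha) is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent

def sample_beta(rng, low=0.9, high=0.999):
    """Sample beta so that log10(1 - beta) is uniform; 1 - beta spans [1 - high, 1 - low]."""
    exponent = rng.uniform(math.log10(1 - high), math.log10(1 - low))
    return 1 - 10 ** exponent

rng = random.Random(0)
alphas = [sample_learning_rate(rng) for _ in range(1000)]
betas = [sample_beta(rng) for _ in range(1000)]

# Each decade of alpha receives roughly the same share of samples.
share_smallest_decade = sum(1 for a in alphas if a < 1e-3) / len(alphas)
```

With uniform sampling on the raw scale, about 90% of the learning rates would fall between 0.1 and 1; on the log scale each of the four decades gets roughly a quarter.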


## Hyperparameter Tuning In Practice

There are two ways to do hyperparameter search in practice.

### Panda Approach: Not much compute or data

The first is to babysit and manage a single model, adjusting hyperparameters manually and checking how performance changes. This is out of vogue in the era of cheap compute and big data. It is akin to a panda that has one offspring and takes care of it.

### Caviar Approach: A lot of compute and data

The second approach is much more common these days: try many hundreds or thousands of hyperparameter settings in parallel and choose the one that performs best. Such an approach is like caviar, with fish laying thousands of eggs and only some surviving with little supervision.

## Normalizing Activations In A Network

### Normalizing Inputs

We have seen that normalizing inputs (layer 0 activations) is a good idea because it puts all features on a similar scale, which makes the cost surface easier to navigate. The same argument is used to justify batch norm, that is, normalizing the linear combinations in each layer. This has been shown empirically to improve training and reduce problems like exploding and vanishing gradients.

Some papers suggest normalizing the activations, but in practice most systems normalize the linear combinations $z$ in each layer, before the activation function is applied. The equations are as below:

For a fixed layer $[l]$ and a minibatch $\{t\}$ of size $q$, we have linear combinations for each of the $q$ examples in this minibatch.

$z^{[l] \{t\}, (1)}, \ldots, z^{[l] \{t\}, (q)} $


Note that each of these z's is a vector with `z.shape` $= (n^{[l]},1)$

We now normalize these $z$'s assuming statistical independence and get location and scale vector parameters $\mu$ and $\sigma$. These have the same shape as the z's, namely $(n^{[l]},1)$

### Normalizing Hidden Layers Linear Combinations

$\mu$ and $\sigma$ are vectors (of shape $(n^{[l]},1)$, same as the z's) for the $l^{th}$ layer and the $t^{th}$ minibatch, estimated across the $q$ examples as follows.


$\mu^{[l], \{t\} } = \frac{1}{q} \sum_{i=1}^{q}  z^{[l],\{t\}, (i)}$

$ (\sigma^{[l], \{t\}})^2 = \frac{1}{q} \sum_{i=1}^{q}  (z^{ [l],\{t\}, (i)} - \mu^{[l], \{t\}})^2$

The square is taken elementwise. The assumption is that there is no cross-correlation between the components of $z$, so each component is normalized separately.

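Then the normalized values of the z's are obtained by subtracting the mean estimate and dividing by the standard deviation (in practice a small $\epsilon$ is added to the variance before taking the square root, for numerical stability):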

$z^{[l],\{t\}, (i)}_{\textbf{norm}} = \frac{z^{[l],\{t\}, (i)} - \mu^{[l], \{t\} }}{\sigma^{[l], \{t\}}}$
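The normalization step can be sketched in NumPy as follows. The layout is an assumption: the $q$ column vectors of a minibatch are stacked into a matrix `Z` of shape $(n^{[l]}, q)$, and a small `eps` is added to the variance for numerical stability, which the equations above leave out. The function name `normalize_minibatch` is hypothetical.

```python
import numpy as np

def normalize_minibatch(Z, eps=1e-8):
    """Normalize each component of z across the examples of one minibatch.

    Z has shape (n_l, q): one column per example, one row per hidden unit.
    """
    mu = Z.mean(axis=1, keepdims=True)      # shape (n_l, 1), average over the q examples
    var = Z.var(axis=1, keepdims=True)      # shape (n_l, 1), divides by q (not q - 1)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # broadcasts (n_l, 1) against (n_l, q)
    return Z_norm, mu, var

# Synthetic minibatch (an assumption for the example): 4 hidden units, 64 examples.
rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))
Z_norm, mu, var = normalize_minibatch(Z)
```

After this step every row of `Z_norm` has mean close to 0 and standard deviation close to 1, whatever the original location and scale of that unit were.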


### Rescaling Normalized Linear Combinations

As will be explained later, it is useful to rescale the normalized values to have an arbitrary mean and variance. The reason is that certain centerings and scalings are better suited to certain activation functions. For example, scaling and shifting the tanh function yields the sigmoid function: $\sigma(x) = \frac{1}{2}\left(1 + \tanh(x/2)\right)$.

For a given location $\beta$ and scale $\gamma$ (which we will see later can be learned from the training set), we have:

$ \widetilde{z}^{[l],\{t\}, (i)} = \gamma z^{[l],\{t\}, (i)}_{\textbf{norm}} + \beta$

These $\widetilde{z}$'s are used as arguments to the activation functions. The parameters $\gamma$ and $\beta$ need to be estimated, but this can usually be done with a single line in most neural network frameworks such as TensorFlow or PyTorch.
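A sketch of the rescaling step in NumPy, under the assumption that `gamma` and `beta` are column vectors of shape $(n^{[l]}, 1)$, one entry per hidden unit, and that examples are stored as columns. The values of `gamma` and `beta` below are fixed by hand for illustration; in training they would be learned. The name `scale_and_shift` is hypothetical.

```python
import numpy as np

def scale_and_shift(Z_norm, gamma, beta):
    """Compute z_tilde = gamma * z_norm + beta, broadcasting over the examples."""
    return gamma * Z_norm + beta

# Build a normalized minibatch: 3 hidden units, 32 examples stored as columns.
rng = np.random.default_rng(1)
Z = rng.normal(size=(3, 32))
Z_norm = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, keepdims=True)

gamma = np.full((3, 1), 2.0)   # target standard deviation per unit
beta = np.full((3, 1), -1.0)   # target mean per unit
Z_tilde = scale_and_shift(Z_norm, gamma, beta)
```

Each row of `Z_tilde` now has mean $\beta$ and standard deviation $\gamma$, so the network is free to choose the range in which each activation function operates.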

### Why Rescale Linear Combinations?

Why do we do this? More details will be provided later, but broadly:

- Allows centering and scaling the input to the activation functions

- Downstream neurons can rely on a stable range and distribution of their inputs

- Better convergence properties

- Some small regularization effect

## Fitting Batch Norm Into Networks



## Why Does Batch Norm Work?

## Batch Norm At Test Time

## SoftMax Regression

## Training SoftMax Regression

## TensorFlow