# Course 2: Week 3

## Hyperparameter tuning

[...] "it's just hard to know in advance which ones turn out to be
the really important hyperparameters for your application and sampling at random rather than in the grid shows that you are more richly
exploring set of possible values for the most important hyperparameters, whatever they turn out to be."
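A minimal sketch of the point in the quote, assuming two hypothetical hyperparameters that both live in $[0, 1]$ and a budget of 25 trials. With a grid, each hyperparameter only ever takes 5 distinct values; with random sampling, every trial tries a new value for each one.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_per_axis = 5                  # hypothetical budget: 5 x 5 = 25 trials
n_trials = n_per_axis ** 2

# Grid search: every trial reuses one of only 5 values per hyperparameter.
axis = np.linspace(0.0, 1.0, n_per_axis)
grid = np.array([(h1, h2) for h1 in axis for h2 in axis])

# Random search: each trial draws a fresh value for each hyperparameter.
random_points = rng.uniform(0.0, 1.0, size=(n_trials, 2))

# Count how many distinct values of the first hyperparameter were tried.
grid_distinct = len(np.unique(grid[:, 0]))
random_distinct = len(np.unique(random_points[:, 0]))
print(grid_distinct, random_distinct)
```

If only the first hyperparameter turns out to matter, the random scheme has explored far more of its range for the same number of trials.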

![](media/rand.png)

[...] "When you sample hyperparameters, another common practice is to use a coarse to fine sampling scheme"

![](media/rand2.png)

[...] "You can then sample more densely into smaller square. So this type of a coarse to fine search is also frequently used"
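The coarse to fine idea can be sketched as two rounds of random sampling. This is only an illustration: `score` is a made-up stand-in for a validation metric, and the box half-width of 0.1 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(h1, h2):
    # Hypothetical stand-in for a validation score; it peaks at (0.3, 0.7).
    return -((h1 - 0.3) ** 2 + (h2 - 0.7) ** 2)

def sample_box(low, high, n):
    # Draw n points uniformly inside the box [low, high] (per dimension).
    return rng.uniform(low, high, size=(n, 2))

# Coarse pass: sample over the whole unit square.
coarse = sample_box(np.array([0.0, 0.0]), np.array([1.0, 1.0]), 25)
coarse_scores = score(coarse[:, 0], coarse[:, 1])
best_coarse = coarse[np.argmax(coarse_scores)]

# Fine pass: sample more densely in a smaller square around the best coarse point.
half_width = 0.1  # assumed size of the zoomed-in region
low = np.clip(best_coarse - half_width, 0.0, 1.0)
high = np.clip(best_coarse + half_width, 0.0, 1.0)
fine = sample_box(low, high, 25)
fine_scores = score(fine[:, 0], fine[:, 1])

best_score = max(coarse_scores.max(), fine_scores.max())
print(best_coarse, best_score)
```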

To sum up: Do **not** use grid search.

We just have to be 'randomly smart'.

**Example**:

Using a logarithmic scale:

![](media/rand3.png)

Since

$a = \log_{10}(0.0001)$

$a = -4$, since $10^{-4} = 0.0001$

and

$b = \log_{10}(1)$

$b = 0$, since $10^{0} = 1$

A random value will be $\alpha = 10^r$, where $r$ is sampled uniformly from $[a, b] = [-4, 0]$.

In Python,

`r = -4 * np.random.rand()`, followed by `alpha = 10 ** r`
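A slightly fuller sketch of the same idea, using NumPy's `Generator.uniform` instead of `np.random.rand` (an equivalent choice, not something the notes prescribe). Sampling $r$ uniformly and setting $\alpha = 10^r$ should spread the samples roughly evenly across the four decades between 0.0001 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.log10(0.0001)   # lower exponent, about -4
b = np.log10(1.0)      # upper exponent, 0

# Sample the exponent uniformly, then map back to the learning rate.
r = rng.uniform(a, b, size=1000)
alpha = 10.0 ** r

# Count samples per decade: [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1), [1e-1, 1].
decade_counts, _ = np.histogram(r, bins=np.linspace(a, b, 5))
print(decade_counts)
```

Sampling $\alpha$ uniformly on a linear scale would instead put about 90% of the samples between 0.1 and 1.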

**Example:**

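When tuning $\beta$ (for example, for exponentially weighted averages), the neat trick is to sample $1-\beta$ on a logarithmic scale instead of sampling $\beta$ directly.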

![](media/rand4.png)
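A minimal sketch of the $1-\beta$ trick, assuming $\beta$ should lie between 0.9 and 0.999 (this range is an assumption for illustration). Then $1-\beta$ lies between 0.001 and 0.1, so its exponent $r$ can be sampled uniformly from $[-3, -1]$.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_low, beta_high = 0.9, 0.999    # assumed search range for beta

# Exponent range for 1 - beta: about [-3, -1].
r_low = np.log10(1 - beta_high)
r_high = np.log10(1 - beta_low)

# Sample the exponent uniformly, then recover beta from 1 - beta = 10^r.
r = rng.uniform(r_low, r_high, size=1000)
beta = 1 - 10.0 ** r
print(beta.min(), beta.max())
```

This puts as many samples between 0.99 and 0.999 as between 0.9 and 0.99, which matters because the behaviour is very sensitive to $\beta$ when it is close to 1.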

### Watching over models

![](media/bbs.png)

Choose the approach (babysitting a single model or training many models in parallel) that best fits your computational resources.

## Batch Normalization

[...] "what batch norm does is it applies that normalization process not just to the input layer, but to the values even deep in some hidden layer in the neural network"

You will be normalizing intermediate values during the forward propagation step.

The pre-activation values $z^{[l](i)}$ will be normalized.

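$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\mu$ and $\sigma^2$ are the mean and variance of the $z^{(i)}$ over the mini-batch; we add $\epsilon$ for numerical stability in case $\sigma^2$ is zero.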

Since we don't always want the hidden units to have mean 0 and variance 1, we then compute

$\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$, where $\gamma,\beta$ are learnable parameters of the model that are updated during training.
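A minimal NumPy sketch of the standard batch norm forward step: subtract the mini-batch mean, divide by $\sqrt{\sigma^2 + \epsilon}$, then scale by $\gamma$ and shift by $\beta$. The function name, the shapes, and the example values of `gamma` and `beta` are assumptions for illustration.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    # z has shape (n_units, m): one column per example in the mini-batch.
    mu = z.mean(axis=1, keepdims=True)       # per-unit mean over the batch
    var = z.var(axis=1, keepdims=True)       # per-unit variance over the batch
    z_norm = (z - mu) / np.sqrt(var + eps)   # mean 0, variance 1 per unit
    z_tilde = gamma * z_norm + beta          # learnable scale and shift
    return z_norm, z_tilde

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=3.0, size=(4, 64))  # 4 hidden units, 64 examples
gamma = np.full((4, 1), 2.0)                      # example scale
beta = np.full((4, 1), 0.5)                       # example shift

z_norm, z_tilde = batch_norm_forward(z, gamma, beta)
print(z_tilde.mean(axis=1), z_tilde.std(axis=1))
```

With these values each unit of `z_tilde` ends up with mean close to 0.5 and standard deviation close to 2, which is how $\gamma$ and $\beta$ control the mean and variance of the hidden unit values.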

Controlling $\gamma,\beta$:

![](media/btn.png)

[...] "But by choosing other values of gamma and beta, this allows you to make the hidden unit values have other means and variances as well."

<div class="alert alert-block alert-warning">
[...] "For example, if you have a sigmoid activation function, you don't want your values to always be clustered here. You might want them to have a larger variance or have a mean that's different than 0, in order to better take advantage of the nonlinearity of the sigmoid function rather than have all your values be in just this linear  regime"
</div>

### Fitting Batch Norm (BN) into a neural network



![](media/btn2.png)

![](media/btn3.png)

![](media/btn4.png)

![](media/btn5.png)

[...] "Because any constant you add will get cancelled out by the mean subtractions step."
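A small numerical check of the claim in the quote, assuming a layer of the form $z = W a + b$ followed by normalization over the mini-batch. The shapes and the helper name `normalize` are made up for the example.

```python
import numpy as np

def normalize(z, eps=1e-8):
    # Normalize each unit (row) over the mini-batch (columns).
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))        # 3 units, 5 inputs
a_prev = rng.normal(size=(5, 32))  # activations from the previous layer, 32 examples
b = rng.normal(size=(3, 1))        # per-unit bias, constant across the batch

with_bias = normalize(W @ a_prev + b)
without_bias = normalize(W @ a_prev)

# The bias shifts the mean by the same constant, so subtracting the mean removes it.
same = np.allclose(with_bias, without_bias)
print(same)
```

Because the bias has no effect after the mean subtraction, the shift is handled entirely by $\beta$.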

Then,

![](media/btn6.png)

[...] "batch norm reduces the problem of the input values changing,
it really causes these values to become more stable, so that the later layers of the neural network has more firm ground to stand on.

And even though the input distribution changes a bit, it changes less, and what this does is, even as the earlier layers keep learning, the amounts that this forces the later layers to adapt to as early as layer changes is reduced or, if you will, **it weakens the coupling  between what the early layers parameters has to do and what the later layers parameters have to do.**

And so it allows each layer of the network to learn by itself, a  little bit more independently of other layers, and this has the effect of speeding up of learning in the whole network"

![](media/btn7.png)

