# Course 2: Week 1

## # Data

## ## Sets

[...] 'take all the data you have and carve off some portion of it to be your **training set**. Some portion of it to be your *hold-out cross validation set*, and *this is sometimes also called the development set*. And for brevity I'm just going to call this the **dev set**, but all of these terms mean roughly the same thing.'

![split](./files/media/split.png)

The **goal** of the dev set is that you're going to test different algorithms on it and see which algorithm works better.
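As a rough illustration of carving off a dev set, here is a minimal sketch in plain Python. The function name `split_dataset`, the 20% `dev_fraction`, and the fixed `seed` are assumptions chosen for the example, not values from the course.

```python
import random

def split_dataset(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and carve off a dev set."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    dev_set = shuffled[:n_dev]
    train_set = shuffled[n_dev:]
    return train_set, dev_set

train_set, dev_set = split_dataset(range(100), dev_fraction=0.2)
print(len(train_set), len(dev_set))  # 80 20
```

Each candidate algorithm would then be trained on `train_set` and compared on `dev_set`.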

## ## Bias and variance

*(If you have never heard of the 'Bias-Variance Tradeoff', there's a great TL;DR [here](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=2ahUKEwj14r6clYbnAhUDG7kGHZVSChoQFjAAegQIARAB&url=https%3A%2F%2Fstats.stackexchange.com%2Fquestions%2F4284%2Fintuitive-explanation-of-the-bias-variance-tradeoff&usg=AOvVaw1A8Yka6B1y54v36zlsXn0S))*

![bv](./files/media/bv.png)

Example:

![bv2](./files/media/bv2.png)

(*About **Bayes Error**: See [here](https://www.cs.helsinki.fi/u/jkivinen/opetus/iml/2013/Bayes.pdf)*)
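The usual way to read bias and variance off the error rates can be sketched as a small helper. This is only an illustration: the function `diagnose`, the `tolerance` cut-off, and the error values in the calls are assumptions for the example, since the slide's numbers are not reproduced in the text.

```python
def diagnose(train_error, dev_error, bayes_error=0.0, tolerance=0.02):
    """Label a model from its error rates (fractions in [0, 1]).

    `tolerance` is a hypothetical cut-off for what counts as a large gap.
    """
    # Bias: how far the training error sits above the best achievable error.
    high_bias = (train_error - bayes_error) > tolerance
    # Variance: how much worse the model does on data it was not trained on.
    high_variance = (dev_error - train_error) > tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```

Note that a 15% training error is only "high bias" if the Bayes error is much lower; passing `bayes_error=0.14` would change the second verdict.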

## ## Basic recipe

![rec](./files/media/rec.png)

## # Regularization

[Regularization (mathematics) - Wikipedia](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&uact=8&ved=2ahUKEwinjqyptYbnAhVKOKwKHVtsCdcQFjACegQIDRAG&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FRegularization_(mathematics)&usg=AOvVaw04AA1ClsGSf0abnrOMr_C2): "In mathematics, statistics, and computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting."

## ## Weight Decay

![wd](./files/media/wd.png)

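Note that, when computing the parameter update, we end up with

$w^{[L]} = w^{[L]} \left(1 - \frac{\alpha \lambda}{m}\right) - \alpha \, \partial w^{[L]}$

where $\alpha$ is the learning rate (`alfa`), $\lambda$ is the regularization parameter (`lambd`), $m$ is the number of training examples, and $\partial w^{[L]}$ is the gradient from backpropagation without the regularization term.

Since $\alpha$, $\lambda$ and $m$ are all positive, and $\frac{\alpha \lambda}{m}$ is small in practice,

$0 < 1 - \frac{\alpha \lambda}{m} < 1$

And therefore,

$\left| w^{[L]} \left(1 - \frac{\alpha \lambda}{m}\right) \right| < \left| w^{[L]} \right|$

This means that we first '*shrink*' $w^{[L]}$ towards zero and then apply the usual gradient update, hence the name weight decay.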
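To check that the "shrink, then update" form is the same step as adding the L2 term to the gradient, here is a small sketch on a single scalar weight. The function names and the numeric values are assumptions picked for the example.

```python
def l2_update_explicit(w, dw, alpha, lambd, m):
    """Gradient step with the L2 term (lambd / m) * w added to the gradient."""
    return w - alpha * (dw + (lambd / m) * w)

def l2_update_decay(w, dw, alpha, lambd, m):
    """The same step written as: shrink w first, then take the usual step."""
    return w * (1 - alpha * lambd / m) - alpha * dw

w, dw = 2.0, 0.5                  # hypothetical weight and backprop gradient
alpha, lambd, m = 0.1, 0.7, 10    # hypothetical hyperparameters
a = l2_update_explicit(w, dw, alpha, lambd, m)
b = l2_update_decay(w, dw, alpha, lambd, m)
print(abs(a - b) < 1e-12)  # True
```

Both forms give the same new weight up to floating point rounding; the second one just makes the decay factor visible.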

## ### Why regularization reduces overfitting?

![reg](./files/media/reg.png)

[...] "if you crank regularisation lambda to be really, really big,
they'll be really incentivized to set the weight matrices W to be reasonably close to zero. So one piece of intuition is maybe it set the weight to be so close to zero for a lot of hidden units that's basically zeroing out a lot of the impact of these hidden units.
And if that's the case, then this much simplified neural network becomes a much smaller neural network."

![reg2](./files/media/reg2.png)

[...] "And so that will take you from this overfitting case much closer to the left to other high bias case. But hopefully there'll be an intermediate value of lambda that results in a result closer to this just right case in the middle. But the intuition is that by cranking up lambda to be really big they'll set W close to zero,
which in practice this isn't actually what happens. We can think of it as zeroing out or at least reducing the impact of a lot of the hidden units so you end up with what might feel like a simpler network."

![reg3](./files/media/reg3.png)

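Each 'disabled' hidden unit will have a much lower impact on the network's output.

If $\lambda$ (`lambd`) is too high, the weights become so small that $z$ stays in the near-linear region of an activation such as tanh (around $z = 0$). Every layer then behaves almost linearly, so the whole network acts like a 'linear network' and underfits.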
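A quick numeric check of the "close to linear" claim, assuming a tanh activation (the activation is an assumption here, it is not stated in the text): for small $z$, `tanh(z)` is almost equal to `z`, while for larger $z$ it saturates.

```python
import math

# Compare tanh(z) with z itself for small and large inputs.
for z in (0.01, 0.1, 1.0, 3.0):
    print(f"z={z:<5} tanh(z)={math.tanh(z):.4f}  gap={abs(math.tanh(z) - z):.4f}")
```

With heavily decayed weights, $z = wx + b$ tends to stay in the small range where the gap is negligible.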

## ## Dropout regularization

For each training example, randomly remove (drop) some neurons from each layer and train on the resulting, smaller network.

![drop](./files/media/drop.png)

## ### Inverted dropout

![drop2](./files/media/drop2.png)

Here the mask vector `d` marks which neurons to shut down (the entries where `d` is zero).

[...] "for different training examples, you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set, you should randomly zero out different hidden units"
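A minimal sketch of inverted dropout for one layer and one example, in plain Python. The function name `inverted_dropout` and the `keep_prob` value of 0.8 are assumptions for the example; the division by `keep_prob` is what makes the dropout "inverted".

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Apply inverted dropout to one layer's activations for one example.

    Returns the scaled activations and the mask `d` (True = unit kept).
    """
    # Each unit is kept independently with probability keep_prob.
    d = [rng.random() < keep_prob for _ in activations]
    # Zero the dropped units, then divide by keep_prob so the expected
    # value of each activation stays the same.
    dropped = [a / keep_prob if keep else 0.0 for a, keep in zip(activations, d)]
    return dropped, d

rng = random.Random(0)
a = [1.0] * 10_000
a_drop, d = inverted_dropout(a, keep_prob=0.8, rng=rng)
print(sum(d) / len(d))            # close to 0.8
print(sum(a_drop) / len(a_drop))  # close to 1.0
```

Calling the function again draws a fresh mask, which matches the idea of zeroing out different units on each pass. At test time no mask is applied.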

## ## Data augmentation

When you cannot acquire new data, you can mirror, skew, zoom, or slightly distort the examples you already have to generate additional inputs.

![aug](./files/media/aug.png)
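As a tiny illustration of one such transformation, the sketch below mirrors an image stored as a list of pixel rows. The helper name `mirror_horizontal` and the toy 2x3 "image" are assumptions for the example; real pipelines would typically use an image library.

```python
def mirror_horizontal(image):
    """Flip an image left-to-right; `image` is a list of pixel rows."""
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
# The original and its mirrored copy both go into the training data.
augmented = [image, mirror_horizontal(image)]
print(augmented[1])  # [[3, 2, 1], [6, 5, 4]]
```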

## ## Early stopping

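Plot both the training error and the dev set error against the number of iterations.

Find the point where the dev set error is lowest, before it starts rising again while the training error keeps falling, and keep the parameters from that iteration.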

![stop](./files/media/stop.png)

[...] "And the advantage of early stopping is that running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda."
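The selection step can be sketched as picking the checkpoint with the lowest dev set error. The helper name `best_checkpoint` and the error values are hypothetical, and the sketch assumes parameters were saved after every epoch.

```python
def best_checkpoint(dev_errors):
    """Return the index of the checkpoint with the lowest dev set error."""
    return min(range(len(dev_errors)), key=lambda i: dev_errors[i])

# Hypothetical dev set errors recorded after each epoch.
dev_errors = [0.40, 0.31, 0.25, 0.22, 0.24, 0.29, 0.35]
stop_at = best_checkpoint(dev_errors)
print(stop_at, dev_errors[stop_at])  # 3 0.22
```

In practice training is often halted once the dev error has failed to improve for a few epochs, rather than run to the end and inspected afterwards.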

## # Optimization

## ## Normalizing inputs

![norm](./files/media/norm.png)

Remember the relation between `std dev` and `variance`: $stdDev = variance^{1/2}$

So,

- variance: ${\sigma^2}$

- standard deviation: ${\sigma}$


Now we have a dataset with ${\mu} = 0$ and ${\sigma} = 1$. Note that this only standardizes the data: the result is a standard [normal distribution](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwiS68HG_4rnAhVkH7kGHfEKDDAQFjAAegQIBBAB&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FNormal_distribution&usg=AOvVaw044ZZPoqQRQ0hObRmquz6Z) only if the original data was normally distributed to begin with.

<div class="alert alert-block alert-info">
<b>TLDR:</b> Subtract the mean and divide by standard deviation: $x_{norm} = \frac{x - \mu}{\sigma}$
</div>

![norm2](./files/media/norm2.png)

Normalizing the inputs makes the problem easier for gradient descent: it reaches the minimum of $J(w, b)$ faster, which shortens training time and uses fewer computing resources.

![norm3](./files/media/norm3.png)
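The normalization formula $x_{norm} = \frac{x - \mu}{\sigma}$ can be sketched for a single feature with the standard library. The helper names and the sample values are assumptions for the example. The sketch also assumes the common practice of reusing the training set's $\mu$ and $\sigma$ on dev/test data, so all sets go through the same transformation.

```python
import statistics

def fit_normalizer(values):
    """Compute mu and sigma from the training values of one feature."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return mu, sigma

def normalize(values, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(x - mu) / sigma for x in values]

train_x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_normalizer(train_x)
train_norm = normalize(train_x, mu, sigma)
print(mu, sigma)  # 5.0 2.0

# Dev/test data is transformed with the training mu and sigma.
dev_norm = normalize([3.0, 6.0], mu, sigma)
print(dev_norm)  # [-1.0, 0.5]
```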

## ## Vanishing/exploding gradients

Take this example:

![vanex](./files/media/vanex.png)

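Then, assuming linear activations and $b = 0$,

$\hat{y} = W^{[L]} W^{[L-1]} \cdots W^{[1]} x$

Now, note that if the weights are slightly larger than 1 (for example, $W^{[l]} = 1.5\,I$), we'll have

- Exploding: activations and `gradients` grow exponentially with the number of layers, -> $+\infty$.

Whereas for weights between 0 and 1 (for example, $W^{[l]} = 0.5\,I$),

- Vanishing: activations and `gradients` shrink exponentially, -> 0.

Either case can lead to long training periods, since it becomes hard to optimize the cost function.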
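To get a feel for how fast this happens, the sketch below assumes every layer multiplies its input by the same scalar (that is, $W^{[l]} = \text{weight} \cdot I$ with linear activations and no bias). The function name, the 50-layer depth, and the weight values 1.5 and 0.5 are illustrative assumptions.

```python
def output_scale(weight, num_layers):
    """Factor applied to x after num_layers layers of W = weight * I."""
    return weight ** num_layers

# A weight slightly above 1 explodes, one below 1 vanishes.
for weight in (1.5, 0.5):
    print(weight, output_scale(weight, 50))
```

With 50 layers the first factor grows past $10^8$ while the second drops below $10^{-15}$, which is why even weights close to 1 matter in very deep networks.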

## ### Weight initialization

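Scale the randomly initialized weights according to the number of inputs feeding each neuron, so that $z$ neither blows up nor shrinks as more inputs are summed.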

For a single neuron:

![ini](./files/media/ini.png)
*(this works for a ReLU activation)*

[...] "And this doesn't solve, but it definitely helps reduce the vanishing, exploding gradients problem, because it's trying to set each of the weight matrices w, you know, so that it's not too much
bigger than 1 and not too much less than 1 so it doesn't explode or vanish too quickly"
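A minimal sketch of such a scaled initialization, assuming the rule on the slide is the common choice for ReLU of $Var(w) = \frac{2}{n}$, with $n$ the number of inputs to the layer (often called He initialization). The function name `he_init` and the layer sizes are assumptions for the example.

```python
import math
import random

def he_init(n_in, n_out, rng):
    """Weights for a layer with n_in inputs and n_out units, Var(w) = 2 / n_in."""
    scale = math.sqrt(2.0 / n_in)
    # Standard normal samples scaled so their variance becomes 2 / n_in.
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
W = he_init(n_in=200, n_out=100, rng=rng)
flat = [w for row in W for w in row]
variance = sum(w * w for w in flat) / len(flat)
print(variance, 2.0 / 200)  # both near 0.01
```

The more inputs a neuron has, the smaller each weight is drawn, which keeps the sum $z$ on a similar scale from layer to layer.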

## ### Numerical approximation

![](./files/media/.png)

