# Course 5: Week 1

![](media/Selection_001.png)

Example:

![](media/Selection_002.png)

Where $x^{(i)<t>}$ represents the vector corresponding to the **$t$-th** word of the **$i$-th** input example.

When dealing with text data, we need to one-hot encode each word, which means we also need a dictionary (vocabulary) vector whose size equals the number of words, listed in alphabetical order.

![](media/Selection_003.png)

Note that each vector has a $1$ only at the position of its word in the dictionary; all other values are $0$.

Words that are not in the vocabulary are assigned to a special '$unknown$' word.
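As a minimal sketch of the encoding described above, the snippet below builds one-hot vectors from a toy vocabulary. The vocabulary, the `<UNK>` entry name, and the helper `one_hot` are illustrative assumptions, not something taken from the course material.

```python
import numpy as np

# Hypothetical toy vocabulary in alphabetical order, plus an <UNK> entry at the end.
vocab = sorted(["a", "and", "cat", "harry", "potter", "zulu"]) + ["<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word`; unknown words map to <UNK>."""
    index = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vector = np.zeros(len(vocab))
    vector[index] = 1.0  # a single 1 at the word's dictionary position
    return vector

sentence = "Harry and Hermione".split()
# One row per word: shape (T_x, vocabulary size).
x = np.stack([one_hot(word) for word in sentence])
```

Here "Hermione" is not in the toy vocabulary, so its row has the $1$ in the `<UNK>` position.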

## # Recurrent Neural Network

![](media/Selection_005.png)

## ## Forward step

Model example:

For each time step $t$, the activation $a^{<t>}$ computed from $x^{<t>}$ is passed on to the next time step, where it is combined with $x^{<t+1>}$.

![](media/Selection_006.png)

Note that $T_x$ and $T_y$ may have the **same length**, but that's not true in every case.

It's also good practice to introduce an initial ($t = 0$) activation, $a^{<0>}$, made of $0$'s:

![](media/Selection_007.png)

Note that, so far, the prediction $ŷ^{<t>}$ only uses information from the current input and from earlier time steps (through $a^{<t-1>}$), so the network is unidirectional.

![](media/Selection_008.png)


Keep in mind that there are "two movements" happening at each time step: one *horizontal*, for computing $a^{<t>}$, and another *vertical*, for computing the prediction $ŷ^{<t>}$ from $x^{<t>}$.

<div class="alert alert-danger" role="alert">
The activation functions used to compute $a$ and $y$ are usually different functions (e.g. $tanh$ for $a$, and $sigmoid$ or $softmax$ for $y$)!
</div>

Then,

![](media/Selection_010.png)

We can rewrite it:

![](media/Selection_012.png)

and,

![](media/Selection_013.png)
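The forward step can be sketched in NumPy as follows. Since the equations live in the screenshots, this assumes the usual form $a^{<t>} = tanh(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$ and $ŷ^{<t>} = softmax(W_{ya} a^{<t>} + b_y)$; the function name `rnn_forward`, the sizes, and the random weights are hypothetical. The last lines check that the compact form (stacking the two matrices and concatenating $[a^{<t-1>}, x^{<t>}]$) gives the same activation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y):
    """Run a simple RNN over xs (shape (T_x, n_x)); return activations and predictions."""
    a = np.zeros(W_aa.shape[0])  # a^{<0>}: initial activation made of zeros
    activations, predictions = [], []
    for x_t in xs:
        # "Horizontal" move: new activation from previous activation and current input.
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        # "Vertical" move: prediction from the current activation.
        predictions.append(softmax(W_ya @ a + b_y))
        activations.append(a)
    return np.array(activations), np.array(predictions)

rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 4, 3, 2, 5  # hypothetical sizes
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
W_ya = rng.standard_normal((n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)
xs = rng.standard_normal((T_x, n_x))

activations, predictions = rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y)

# Compact form: stack W_aa and W_ax side by side and concatenate [a, x].
W_a = np.hstack([W_aa, W_ax])
a_prev, x_t = activations[0], xs[1]
compact = np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)
```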

## ## Back step

Remember the forward step:

![](media/Selection_014.png)

We'll use this $Loss$ equation:

![](media/Selection_015.png)

And the $Cost$ will be the sum ($\sum$) of the losses $\mathcal{L}^{<t>}$ over all time steps.

Then,

![](media/Selection_017.png)

Try writing down what the backprop equations would look like.

You don't need to work out each one exactly (though it is good practice), but try to write them, at least, in terms of the chain rule.

e.g.:

- $\frac{\partial \mathcal{L}}{\partial ŷ^{<t>}}$:

$\frac{\partial \mathcal{L}}{\partial ŷ^{<t>}} = \frac{-y^{<t>}}{ŷ^{<t>}} + \frac{(1 - y^{<t>})}{(1 - ŷ^{<t>})}$

- $\frac{\partial \mathcal{L}}{\partial a^{<t>}}$:

$\frac{\partial \mathcal{L}}{\partial a^{<t>}} = \frac{\partial \mathcal{L}}{\partial ŷ^{<t>}} \cdot \frac{\partial ŷ^{<t>}}{\partial a^{<t>}}$

Where,

$ŷ^{<t>} = g(W_{ya} \cdot a^{<t>} + b_y)$

Then,

*Suppose $g$ is the $sigmoid$ function, so $\frac{\partial g}{\partial z} = g(z) \cdot (1 - g(z))$. In this case we have a composite function, $g(z)$, where $z = W_{ya} \cdot a^{<t>} + b_y$.*

$\frac{\partial ŷ^{<t>}}{\partial a^{<t>}} = \frac{\partial g}{\partial z} \cdot \frac{\partial z}{\partial a^{<t>}}$

$= g(W_{ya} \cdot a^{<t>} + b_y) \cdot (1 - g(W_{ya} \cdot a^{<t>} + b_y)) \cdot \frac{\partial (W_{ya} \cdot a^{<t>} + b_y)}{\partial a^{<t>}}$

$= g(W_{ya} \cdot a^{<t>} + b_y) \cdot (1 - g(W_{ya} \cdot a^{<t>} + b_y)) \cdot W_{ya}$

- And so on...
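One way to sanity-check the two derivatives above is a numerical gradient check. The sketch below works in the scalar case and assumes the per-step loss is the binary cross-entropy $-y \log ŷ - (1 - y) \log(1 - ŷ)$, which is the loss consistent with the first derivative; all the numeric values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary scalar values, chosen only for illustration.
w_ya, b_y, a, y = 0.7, -0.2, 0.5, 1.0

def loss(a_value):
    """Binary cross-entropy of the prediction made from activation a_value."""
    y_hat = sigmoid(w_ya * a_value + b_y)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Analytic gradient via the chain rule.
y_hat = sigmoid(w_ya * a + b_y)
dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)
dyhat_da = y_hat * (1 - y_hat) * w_ya
analytic = dL_dyhat * dyhat_da

# Numerical gradient via central differences.
eps = 1e-6
numeric = (loss(a + eps) - loss(a - eps)) / (2 * eps)
```

With a sigmoid output and this loss, the product simplifies to $(ŷ^{<t>} - y^{<t>}) \cdot W_{ya}$, which is a handy form to remember.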

## ## RNN Architectures

- Many to many, many to one, one to one:

![](media/Selection_018.png)

- One to many:

Note that, often, the prediction $ŷ^{<t>}$ will be fed as the input to the next time step, $t+1$.

![](media/Selection_019.png)

- Many to many ($T_x = T_y$):

Previously seen.

- Many to many ($T_x \ne T_y$):

![](media/Selection_020.png)

## ## Sequence Generation

## ### Language model

Its job is, given any sentence, to tell you the probability $P(sentence)$ of that particular sequence of words.

![](media/Selection_021.png)

$<EOS>$: End Of Sentence.

$<UNK>$: Unknown token/word.

We'll set up the inputs $x^{<t>} = y^{<t-1>}$.

The first step:

![](media/Selection_022.png)

[...] "But what $a^{<1>}$ does is it will make a $softmax$ prediction to try to figure out what is the probability of the first words $y$. And so that's going to be $ŷ^{<1>}$. So what this step does is really, it has a soft max it's trying to predict. What is the probability of any word in the dictionary?" **For each word** in our corpus.

We propagate $a^{<1>}$ to the next step, and try to guess the next word. But we give it the correct word:

$y^{<1>} = x^{<2>} =$ **Cats**.

Now, it should calculate the probability of the next word ($ŷ^{<2>}$) **given that the previous word is "Cats"**.

Then, it predicts $ŷ^{<3>}$, given $y^{<1>}$ and $y^{<2>}$:

![](media/Screenshot(1).png)

And finally, it tries to predict $<EOS>$ given all the previous inputs:

![](media/Screenshot(2).png)

Each step of the RNN looks at the preceding words and learns to predict one word at a time.

For the loss and cost functions:

![](media/Screenshot(3).png)
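A minimal sketch of the cost, assuming the usual softmax cross-entropy per time step, $\mathcal{L}^{<t>} = -\sum_i y_i^{<t>} \log ŷ_i^{<t>}$, summed over time. The helper name `sequence_cost` and the toy probabilities are hypothetical. Because each step contributes $-\log$ of the probability given to the correct word, $e^{-cost}$ recovers the model's $P(sentence)$.

```python
import numpy as np

def sequence_cost(y_hats, targets):
    """Sum over time of -log(probability assigned to the correct word)."""
    steps = np.arange(len(targets))
    return -np.sum(np.log(y_hats[steps, targets]))

# Toy softmax outputs for a 3-word sentence over a 3-word vocabulary.
y_hats = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
])
targets = np.array([0, 1, 2])  # index of the correct word at each step

cost = sequence_cost(y_hats, targets)
# Product of the per-step probabilities: 0.7 * 0.8 * 0.5.
sentence_probability = np.exp(-cost)
```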

## ## Sampling novel sequences

After training, we can sample novel sequences from the model:

![](media/Screenshot(4).png)
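The sampling loop can be sketched as below: start from zero vectors, draw a word from the softmax distribution, and feed it back as the next input until `<EOS>` is drawn or a length limit is reached. The function name `sample`, the sizes, and the random weights (standing in for a trained model) are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample(W_aa, W_ax, W_ya, b_a, b_y, eos_index, max_len, rng):
    """Sample word indices one at a time, feeding each sample back as input."""
    vocab_size = W_ax.shape[1]
    a = np.zeros(W_aa.shape[0])  # a^{<0>} = 0
    x = np.zeros(vocab_size)     # x^{<1>} = 0
    indices = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        y_hat = softmax(W_ya @ a + b_y)
        index = int(rng.choice(vocab_size, p=y_hat))  # draw from the distribution
        indices.append(index)
        if index == eos_index:
            break
        x = np.zeros(vocab_size)  # next input is the one-hot of the sampled word
        x[index] = 1.0
    return indices

rng = np.random.default_rng(0)
vocab_size, n_a, eos_index = 5, 4, 4  # hypothetical sizes; last index plays <EOS>
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, vocab_size))
W_ya = rng.standard_normal((vocab_size, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(vocab_size)

sampled = sample(W_aa, W_ax, W_ya, b_a, b_y, eos_index, max_len=10, rng=rng)
```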

## ## Vanishing gradients with RNNs

For a long sentence, it is hard for the error to backpropagate all the way to an input that sits early in the sequence and modify the related weights.

![](media/Screenshot(5).png)

Exploding gradients will usually lead to $NaN$s, which are easy to spot.
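A rough illustration of why this happens: backpropagating through time multiplies the gradient by the recurrent weights once per step. The sketch below is a deliberate simplification, assuming a recurrent matrix equal to a scaled identity and ignoring the $tanh$ derivative; the helper `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(scale, steps=50, size=4):
    """Norm of a gradient repeatedly multiplied by W^T, with W = scale * I."""
    W = scale * np.eye(size)
    grad = np.ones(size)
    norms = []
    for _ in range(steps):
        grad = W.T @ grad  # one step back in time
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = backprop_norms(0.5)  # shrinks by half at every step
exploding = backprop_norms(1.5)  # grows by 50% at every step
```

After 50 steps the first gradient is numerically negligible while the second has grown by many orders of magnitude.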

## ### Gated Recurrent Unit (GRU)

Without GRU:

![](media/Screenshot(8).png)

With GRU:

![](media/Screenshot(6).png)

We have the following:

- $-1 < \tilde{c}^{<t>} < 1$ (output of $tanh$)

- $0 < \Gamma_u < 1$ (output of $sigmoid$)

Then,

- $\Gamma_u \approx 0 \Rightarrow c^{<t>} = c^{<t-1>}$, i.e.: keep

- $\Gamma_u \approx 1 \Rightarrow c^{<t>} = \tilde{c}^{<t>}$, i.e.: update


Note that all of these will have the same dimensions:

![](media/Screenshot(7).png)

$\Gamma_r$ tells how relevant $c^{<t-1>}$ is to computing $\tilde{c}^{<t>}$.

Why $\Gamma_r$? Because, so far, it has proven to be good practice.
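A minimal sketch of one GRU step, assuming the commonly used full-GRU equations (the exact ones are in the screenshots): $\Gamma_u$ and $\Gamma_r$ are sigmoids of $[c^{<t-1>}, x^{<t>}]$, $\tilde{c}^{<t>}$ is a $tanh$ of $[\Gamma_r * c^{<t-1>}, x^{<t>}]$, and $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$. The function name `gru_step`, the sizes, and the weights are hypothetical; the extreme biases are only there to force the "keep" and "update" cases.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, b_u, b_r):
    """One GRU step; returns the new memory cell and the candidate value."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(W_u @ concat + b_u)  # update gate, between 0 and 1
    gamma_r = sigmoid(W_r @ concat + b_r)  # relevance gate, between 0 and 1
    candidate = np.tanh(W_c @ np.concatenate([gamma_r * c_prev, x_t]) + b_c)
    # Element-wise mix: gate near 0 keeps the old value, gate near 1 updates.
    c_t = gamma_u * candidate + (1 - gamma_u) * c_prev
    return c_t, candidate

rng = np.random.default_rng(0)
n_c, n_x = 3, 2  # hypothetical sizes
W_c, W_u, W_r = (rng.standard_normal((n_c, n_c + n_x)) for _ in range(3))
b_c, b_r = np.zeros(n_c), np.zeros(n_c)
c_prev = rng.uniform(-1, 1, n_c)
x_t = rng.standard_normal(n_x)

# Very negative update-gate bias: gate is about 0, so the memory is kept.
c_keep, _ = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, np.full(n_c, -100.0), b_r)
# Very positive update-gate bias: gate is about 1, so the memory is replaced.
c_update, candidate = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, np.full(n_c, 100.0), b_r)
```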

## ## LSTM

![](media/Screenshot(9).png)

![](media/Screenshot(10).png)
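For comparison with the GRU, here is a sketch of one LSTM step. It assumes the standard formulation without peephole connections (separate update, forget, and output gates, all computed from $[a^{<t-1>}, x^{<t>}]$), which may differ in detail from the figures; `lstm_step`, the parameter names, and the sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step; returns the new activation and the new memory cell."""
    concat = np.concatenate([a_prev, x_t])
    candidate = np.tanh(params["W_c"] @ concat + params["b_c"])
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])  # update gate
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])  # forget gate
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])  # output gate
    c_t = gamma_u * candidate + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 3, 2  # hypothetical sizes
params = {}
for gate in "cufo":
    params["W_" + gate] = rng.standard_normal((n_a, n_a + n_x))
    params["b_" + gate] = np.zeros(n_a)

a_prev = np.zeros(n_a)
c_prev = rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)
a_t, c_t = lstm_step(a_prev, c_prev, x_t, params)

# Forget gate wide open and update gate closed: the memory cell is carried over.
keep_params = dict(params, b_f=np.full(n_a, 100.0), b_u=np.full(n_a, -100.0))
_, c_kept = lstm_step(a_prev, c_prev, x_t, keep_params)
```

Unlike the GRU, the "keep" and "update" decisions use two separate gates instead of $\Gamma_u$ and $1 - \Gamma_u$.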

## ## Bidirectional RNN

BRNNs allow the prediction $ŷ^{<t>}$ to also use information from the future, i.e. from later positions in the sequence.

![](media/Screenshot(11).png)

Each cell $t$ will use the backward activation $a^{\leftarrow<t+1>}$, which is obtained using $a^{\leftarrow<t+2>}$, which is obtained using $a^{\leftarrow<t+...>}$, and so on.

Thus, each prediction is influenced by all the tokens in the sentence, both those that come before it and those that come after it.

![](media/Screenshot(12).png)
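A minimal sketch of the two passes, assuming simple $tanh$ cells in both directions and that each prediction would be computed from the concatenation of the forward and backward activations. The names `run_direction` and `brnn_states`, the sizes, and the weights are hypothetical. The final lines change only the last input and recompute the states, to show which halves are affected.

```python
import numpy as np

def run_direction(xs, W_aa, W_ax, b_a):
    """Run a simple tanh RNN over xs in the given order; return all activations."""
    a = np.zeros(W_aa.shape[0])
    states = []
    for x_t in xs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        states.append(a)
    return np.array(states)

def brnn_states(xs, fwd_params, bwd_params):
    """Concatenate forward and backward activations for every time step."""
    forward = run_direction(xs, *fwd_params)
    # Backward pass: process the reversed sequence, then restore the time order.
    backward = run_direction(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T_x, n_x, n_a = 4, 2, 3  # hypothetical sizes

def make_params():
    return (0.5 * rng.standard_normal((n_a, n_a)),
            0.5 * rng.standard_normal((n_a, n_x)),
            np.zeros(n_a))

fwd_params, bwd_params = make_params(), make_params()
xs = rng.standard_normal((T_x, n_x))
states = brnn_states(xs, fwd_params, bwd_params)

# Change only the last input and recompute.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
states_changed = brnn_states(xs_changed, fwd_params, bwd_params)
```

The forward halves of the earlier steps stay the same, while their backward halves change, which is exactly the "information from the future" reaching earlier predictions.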

## ## Deep RNN

Try to think of it as a "fully connected" RNN: several recurrent layers are stacked, and the activations of one layer are the inputs of the next.

![](media/Screenshot(13).png)

![](media/Screenshot(14).png)

![](media/Screenshot(15).png)

The evaluation blocks can also be GRUs or LSTMs.
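A minimal sketch of the stacking idea, assuming simple $tanh$ cells: each layer runs over the whole sequence and its activations become the input sequence of the layer above. The names `rnn_layer` and `deep_rnn` and the layer sizes are hypothetical; swapping the cell for a GRU or LSTM step would follow the same pattern.

```python
import numpy as np

def rnn_layer(inputs, W_aa, W_ax, b_a):
    """Run one recurrent layer over a sequence; return its activations."""
    a = np.zeros(W_aa.shape[0])
    outputs = []
    for x_t in inputs:
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        outputs.append(a)
    return np.array(outputs)

def deep_rnn(xs, layers):
    """Stack layers: the activations of layer l are the inputs of layer l + 1."""
    sequence = xs
    for W_aa, W_ax, b_a in layers:
        sequence = rnn_layer(sequence, W_aa, W_ax, b_a)
    return sequence

rng = np.random.default_rng(0)
T_x, n_x, sizes = 6, 3, [5, 4, 2]  # hypothetical sequence length and layer widths
xs = rng.standard_normal((T_x, n_x))

layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.standard_normal((n_a, n_a)),
                   rng.standard_normal((n_a, n_in)),
                   np.zeros(n_a)))
    n_in = n_a  # the next layer consumes this layer's activations

top = deep_rnn(xs, layers)  # activations of the top layer, one row per time step
```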