## The vanishing gradient problem in RNNs
Backpropagating through all time steps, the gradient of the loss with respect to the hidden-to-hidden weights $W_{hh}$ is given by the following expression:
$$
\frac{\partial L}{\partial W_{hh}} = \sum_{t}^{T} \sum_{k=1}^{t+1} \frac{\partial L_{t+1}}{\partial \hat{y}_{t+1}} \frac{\partial \hat{y}_{t+1}}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_{k}}\frac{\partial h_{k}}{\partial W_{hh}}
$$

Similarly, the gradient of the loss with respect to the input-to-hidden weights $W_{xh}$ over all time steps is:

$$
\frac{\partial L} {\partial W_{xh}} = \sum_{t}^{T} \sum_{k=1}^{t+1} \frac{\partial L_{t+1}}{\partial \hat{y}_{t+1}} \frac{\partial \hat{y}_{t+1}}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_{k}} \frac{\partial h_{k}}{\partial W_{xh}}
$$
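The two double sums can be checked numerically on a tiny network. The following is a minimal sketch, not a reference implementation: it assumes a vanilla RNN with $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, a linear readout $\hat{y}_t = W_{hy} h_t$ and a squared-error loss per step, none of which is fixed by the formulas above. The sizes, the random data and names such as `forward` and `numeric_grad` are hypothetical. The code indexes the loss by $t$ rather than $t+1$, and the factor $\frac{\partial h_{k}}{\partial W}$ is taken as the immediate derivative at step $k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x, n_y, T = 3, 2, 2, 4  # hypothetical sizes

# Hypothetical parameters, inputs and targets
W_hh = rng.normal(scale=0.5, size=(n_h, n_h))
W_xh = rng.normal(scale=0.5, size=(n_h, n_x))
W_hy = rng.normal(scale=0.5, size=(n_y, n_h))
xs = rng.normal(size=(T, n_x))
targets = rng.normal(size=(T, n_y))

def forward(W_hh_, W_xh_):
    """Return the total loss and the hidden states [h_0, h_1, ..., h_T]."""
    hs = [np.zeros(n_h)]
    loss = 0.0
    for t in range(T):
        hs.append(np.tanh(W_hh_ @ hs[-1] + W_xh_ @ xs[t]))
        loss += 0.5 * np.sum((W_hy @ hs[-1] - targets[t]) ** 2)
    return loss, hs

loss, hs = forward(W_hh, W_xh)

grad_hh = np.zeros_like(W_hh)
grad_xh = np.zeros_like(W_xh)
for t in range(1, T + 1):                # outer sum over time steps
    # dL_t/dh_t = (dL_t/dy_t)(dy_t/dh_t)
    v = W_hy.T @ (W_hy @ hs[t] - targets[t - 1])
    for k in range(t, 0, -1):            # inner sum over earlier steps k <= t
        # v holds dL_t/dh_k; go through tanh to the pre-activation of h_k
        delta = (1.0 - hs[k] ** 2) * v
        grad_hh += np.outer(delta, hs[k - 1])  # immediate dh_k/dW_hh
        grad_xh += np.outer(delta, xs[k - 1])  # immediate dh_k/dW_xh
        v = W_hh.T @ delta               # chain rule: now dL_t/dh_{k-1}

def numeric_grad(which, eps=1e-6):
    """Central-difference gradient w.r.t. W_hh (which=0) or W_xh (which=1)."""
    base = [W_hh, W_xh]
    g = np.zeros_like(base[which])
    for idx in np.ndindex(*g.shape):
        d = np.zeros_like(g)
        d[idx] = eps
        plus, minus = list(base), list(base)
        plus[which] = base[which] + d
        minus[which] = base[which] - d
        g[idx] = (forward(*plus)[0] - forward(*minus)[0]) / (2 * eps)
    return g

err_hh = np.max(np.abs(grad_hh - numeric_grad(0)))
err_xh = np.max(np.abs(grad_xh - numeric_grad(1)))
print("max abs error, W_hh:", err_hh)
print("max abs error, W_xh:", err_xh)
```

If the double sums are implemented correctly, both errors should be close to the precision of the finite-difference estimate. Production code would normally accumulate the gradient in a single backward pass instead of restarting from every $t$; the nested loops are kept here only because they mirror the formulas term by term.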

In this context, the term $\frac{\partial h_{t+1}}{\partial h_{k}}$ is itself an application of the chain rule across the hidden states at the intermediate time steps. For example, $\frac{\partial h_{3}}{\partial h_{1}} = \frac{\partial h_{3}}{\partial h_{2}}\frac{\partial h_{2}}{\partial h_{1}}$

An equivalent form for $\frac{\partial h_{t+1}}{\partial h_{k}}$ is:
$$ \frac{\partial h_{t+1}}{\partial h_k} = \prod^{t}_{j=k} \frac{\partial h_{j+1}}{\partial h_{j}}  = \frac{\partial h_{t+1}}{\partial h_{t}}\frac{\partial h_{t}}{\partial h_{t-1}} \cdots \frac{\partial h_{k+1}}{\partial h_k}$$
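The product form can be verified on a small example. This sketch assumes the update $h_{j+1} = \tanh(W_{hh} h_j + W_{xh} x_{j+1})$, which is not written out in this section; under that assumption each one-step Jacobian is $\mathrm{diag}(1 - h_{j+1}^2)\, W_{hh}$. The sizes, random values and helper names (`step`, `unroll`) are hypothetical. The code multiplies the one-step Jacobians, with the latest step on the left, and compares the result with a finite-difference estimate of the same derivative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input, n_steps = 4, 3, 5  # hypothetical sizes

# Hypothetical parameters and inputs
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_input))
xs = rng.normal(size=(n_steps, n_input))

def step(h, x):
    """One assumed RNN update: h_next = tanh(W_hh h + W_xh x)."""
    return np.tanh(W_hh @ h + W_xh @ x)

def unroll(h_k):
    """Apply every step starting from h_k and return all later states."""
    h, states = h_k, []
    for x in xs:
        h = step(h, x)
        states.append(h)
    return states

h_k = rng.normal(size=n_hidden)
states = unroll(h_k)

# Analytical Jacobian: product of one-step Jacobians, latest step on the left
J_product = np.eye(n_hidden)
for h_next in states:
    J_step = np.diag(1.0 - h_next ** 2) @ W_hh  # dh_{j+1}/dh_j for tanh
    J_product = J_step @ J_product

# Numerical Jacobian of the final state w.r.t. h_k (central differences)
eps = 1e-6
J_numeric = np.zeros((n_hidden, n_hidden))
for i in range(n_hidden):
    d = np.zeros(n_hidden)
    d[i] = eps
    J_numeric[:, i] = (unroll(h_k + d)[-1] - unroll(h_k - d)[-1]) / (2 * eps)

max_abs_error = np.max(np.abs(J_product - J_numeric))
print("max abs difference:", max_abs_error)
```

The order of multiplication matters because the Jacobians are matrices and do not commute in general, which is why the most recent step is applied on the left.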

Each of the terms $\frac{\partial h_{j+1}}{\partial h_{j}}$ is a Jacobian matrix, so the gradients of the loss with respect to $W_{hh}$ and $W_{xh}$ involve products of a large number of matrices. Analyzing these Jacobians leads to the conclusion that, over enough matrix multiplications, the gradient vanishes (tends to zero) if the largest eigenvalue magnitude satisfies $|\lambda_{1}| < 1$, while it can explode (grow without bound) if $|\lambda_{1}| > 1$. Activation functions can amplify this effect: the derivative of most common activation functions is at most 1 (for example, 1 for $\tanh$ and 0.25 for the sigmoid) and enters every Jacobian as a factor, so when units saturate and that derivative approaches zero, the gradient shrinks even further.
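The eigenvalue argument can be illustrated with repeated multiplication by a single matrix. This is a simplified sketch under stated assumptions: the recurrence is treated as linear (identity activation), so every Jacobian equals the same matrix `W`, and `W` is a random matrix rescaled to a chosen spectral radius. The function name `jacobian_product_norms` and all numeric settings are hypothetical.

```python
import numpy as np

def jacobian_product_norms(spectral_radius, n_steps=50, n_hidden=8, seed=0):
    """Spectral norm of W^n for n = 1..n_steps, with W rescaled so that
    its largest eigenvalue magnitude equals `spectral_radius`."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    product = np.eye(n_hidden)
    norms = []
    for _ in range(n_steps):
        product = W @ product  # one more Jacobian factor
        norms.append(np.linalg.norm(product, 2))
    return np.array(norms)

vanishing = jacobian_product_norms(0.5)  # |lambda_1| < 1
exploding = jacobian_product_norms(1.5)  # |lambda_1| > 1

print("norm after 50 steps, radius 0.5:", vanishing[-1])
print("norm after 50 steps, radius 1.5:", exploding[-1])
```

With a $\tanh$ or sigmoid activation each Jacobian gains an extra diagonal factor with entries no larger than 1, which can only push the product further toward the vanishing regime compared with this linear case.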