In [1]:
%run Latex_macros.ipynb


# Back propagation through time (BPTT)

A Recurrent Neural Network (RNN)
- Can be viewed as a loop
- That can be unrolled
- Resulting in a multi-layer network
- One layer per time step
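To make the loop/unrolled correspondence concrete, here is a minimal sketch using a scalar hidden state. The names `rnn_step`, `rnn_loop`, `w_hh`, `w_hx` and `b`, and the choice of `tanh` as the activation, are assumptions made for illustration and are not taken from this module.

```python
import math

def rnn_step(h_prev, x_t, w_hh, w_hx, b):
    """One time step of a scalar RNN: h_t = tanh(w_hh*h_prev + w_hx*x_t + b)."""
    return math.tanh(w_hh * h_prev + w_hx * x_t + b)

def rnn_loop(xs, w_hh, w_hx, b, h0=0.0):
    """The RNN viewed as a loop over the input sequence."""
    h = h0
    hs = []
    for x_t in xs:
        h = rnn_step(h, x_t, w_hh, w_hx, b)
        hs.append(h)
    return hs

# The same computation unrolled by hand for T = 3:
# three "layers", one per time step, all sharing the same weights.
xs = [0.5, -1.0, 2.0]
w_hh, w_hx, b = 0.8, 0.3, 0.1
h1 = rnn_step(0.0, xs[0], w_hh, w_hx, b)
h2 = rnn_step(h1, xs[1], w_hh, w_hx, b)
h3 = rnn_step(h2, xs[2], w_hh, w_hx, b)
```

The loop and the hand-unrolled version produce identical hidden states; the unrolled form simply makes the per-time-step "layers" explicit.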

Here are the final layers of an unrolled RNN with input sequence
$$
\x_{(1)}, \ldots, \x_{(T)}
$$


<table>
    <tr>
        <th><center>RNN many to many API</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>


Given enough space, we would continue unrolling on the left to the Input layer
- Resulting in a network with $T$ unrolled layers
- Plus a Loss layer

To compute the derivatives of the Loss with respect to weights
- We could, in theory, use Back Propagation
- Which computes the gradients used in the weight update step of Gradient Descent

<div>
<center>Backward pass: Loss to Weights</center>
<br>
<img src="images/NN_Layers_plus_Loss_backward.png">
</div>

When dealing with unrolled RNNs
- We will index the "unrolled layers" with time steps, denoted by the label $t$
- Rather than $\ll$, which we use to index layers

This process is called *Back Propagation Through Time* (BPTT).

The only special thing to note about BPTT is that the Loss function is more complex
- There is a Loss
- Per example (as in non-recurrent layers)
- **and Per time-step** (unique to recurrent layers)
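A small sketch of a loss that is computed per example and per time step. The squared-error loss, and the choice to sum over time steps and average over examples, are assumptions for illustration; the function names are hypothetical.

```python
def step_loss(y_hat_t, y_t):
    """Squared error for one time step of one example (an assumed choice of loss)."""
    return 0.5 * (y_hat_t - y_t) ** 2

def sequence_losses(y_hat, y):
    """One loss per time step for a single example."""
    return [step_loss(p, t) for p, t in zip(y_hat, y)]

def total_loss(batch_y_hat, batch_y):
    """Average over examples of the per-example sum over time steps."""
    per_example = [sum(sequence_losses(p, t))
                   for p, t in zip(batch_y_hat, batch_y)]
    return sum(per_example) / len(per_example)

# Two examples, three time steps each
batch_y_hat = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
batch_y     = [[0.0, 2.0, 3.0], [0.0, 2.0, 0.0]]
losses_example_0 = sequence_losses(batch_y_hat[0], batch_y[0])
```

Each entry of `sequence_losses` is a separate source of gradient on the backward pass, one per time step.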

<table>
    <tr>
        <th><center>RNN Loss: Forward pass</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_loss.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>RNN Loss: Backward pass</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_loss_gradient.png"></td>
    </tr>
</table>

# Truncated back propagation through time (TBPTT)

An unrolled RNN layer turns into a $T$-layer network, where $T$ is the number of elements in the input sequence.

For long sequences (large $T$) this may not be practical.

First, there is the computation *time*
- $t$ steps to compute $\loss^\ip_\tp$, the loss due to the $t^{th}$ output $\y^\ip_\tp$ of example $i$
- For each $1 \le t \le T$

Less obvious is the *space* requirement
- As we saw in the module "How a Neural Network Toolkit works"
- We may store information in each layer of the Forward pass (so storage for $T$ layers)
- To facilitate computation of analytical derivatives on the Backward pass
    - For example: the Multiply layer stored the multiplicands in the forward pass
    - Because they are needed for the derivatives

Moreover, as we shall shortly see
- Derivatives may vanish or explode as we proceed further backwards from the Loss layer to the Input layer

So, in theory, the weights $\W_\tp$ for small $\tt$ (close to the input) may not get updated.
- This is certainly a problem in a non-recurrent network
- But is **fatal** in a recurrent layer
- Since there is a **single** weight matrix $\W$ that is shared across *all time steps*
    $$\W_\tp = \W \; \text{ for all } 1 \le t \le T$$
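The vanishing effect mentioned above can be previewed with a scalar RNN. This is a sketch under assumptions: a `tanh` activation, a recurrent weight `w_hh` with magnitude below 1, and hypothetical helper names. The derivative of the hidden state at step `t` with respect to the hidden state at an earlier step is a product of one factor per intervening step.

```python
import math

def hidden_states(xs, w_hh, w_hx, b, h0=0.0):
    """Forward pass; returns [h_0, h_1, ..., h_T]."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w_hh * hs[-1] + w_hx * x_t + b))
    return hs

def backward_factor(hs, w_hh, t, t_prime):
    """d h_t / d h_(t_prime): product of per-step derivatives for s = t_prime+1..t."""
    factor = 1.0
    for s in range(t_prime + 1, t + 1):
        # tanh'(z_s) = 1 - h_s^2, then multiply by the recurrent weight
        factor *= (1.0 - hs[s] ** 2) * w_hh
    return factor

xs = [0.1] * 30
w_hh = 0.5
hs = hidden_states(xs, w_hh=w_hh, w_hx=1.0, b=0.0)
T = len(xs)
# Magnitude of the factor when flowing back 1, 5, 10 and 20 steps from step T
factors = [abs(backward_factor(hs, w_hh, T, T - d)) for d in (1, 5, 10, 20)]
```

With these settings every per-step factor has magnitude below 0.5, so the product shrinks geometrically with distance; a recurrent weight with large magnitude could instead make it grow.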

The solution to these difficulties
- Is to *truncate* the unrolled RNN
- To a fixed number of time steps
- From the loss layer backwards
- The truncated graph is a suffix of the fully unrolled graph

This process is known as *Truncated Back Propagation Through Time* (TBPTT).


Note that *truncation only occurs in the backward pass*.

There is *no truncation* of the forward pass of the RNN!


Because the truncated unrolled graph has fewer than $T$ steps
- Gradient computation takes fewer steps 
- So weight updates can occur more often

The obvious downside to truncation is that
- Gradients are only approximate

But there is a subtle and more impactful difference
- The RNN layer *cannot capture long-term dependencies*

Suppose we unrolled the layer for only $\tau$ time steps (the "window" size)
- The loss for the $\tt^{th}$ time step ($\loss^\ip_\tp$)
- Flows backwards only to steps 
$$(\tt - \tau+1), \ldots, t$$

So the "error signal" from time $\tt$ does not affect time steps $\tt' \lt (\tt - \tau+1)$


Consider a long sentence or document (sequence of words)
- If the gender of the subject is defined by the early words in the sentence
- An incorrect "prediction" late in the sentence
- May be impossible to correct, because the error signal does not reach the early words

"Z was the first woman who ...  **he** said ..."


In other words
- Truncation may affect the ability of an RNN to encode *long-term* dependencies
- Vanishing gradients may cause a similar impact


## TBPTT variants

There are several common ways to decide on how many unrolled time steps to keep.

Let $\tt''$ denote the index of the *smallest* time step in the unrolled layer for step $\tt$.
- $\tt'' = (\tt - \tau +1)$



Plain, untruncated BPTT defines
- $\tt'' = 1$
- Unroll all the way to the Input Layer

$k$-truncated BPTT defines window size $\tau = k$
- $\tt'' = \max(1, \tt - k + 1)$

Subsequence truncated BPTT defines
- $\tt'' = k * \floor{(\tt-1)/k} + 1$

That is, it breaks the sequence into "chunks" of size $k$

$$
\begin{array}{l}
\x^\ip_{(1)}, \ldots, \x^\ip_{(k )} \\
\x^\ip_{(k+1)}, \ldots, \x^\ip_{( 2*k )} \\
\vdots \\
\x^\ip_{( (i'*k) +1)}, \ldots, \x^\ip_{( (i'+1)*k )} \\
\vdots
\end{array}
$$

- Gradients flow *within* chunks
- But *not between* chunks
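The variants can be compared with a short sketch. It assumes time steps are numbered from 1, matching the chunk listing above; the function names are hypothetical and are not part of any toolkit.

```python
def first_step_untruncated(t):
    """Plain BPTT: unroll back to the first time step."""
    return 1

def first_step_k_truncated(t, k):
    """k-truncated BPTT: a sliding window of the k most recent steps."""
    return max(1, t - k + 1)

def first_step_subsequence(t, k):
    """Subsequence TBPTT: the first step of the size-k chunk containing step t."""
    return k * ((t - 1) // k) + 1

def chunks(xs, k):
    """Split a sequence into consecutive chunks of size k (the last may be shorter)."""
    return [xs[i:i + k] for i in range(0, len(xs), k)]

k = 3
steps = list(range(1, 8))                       # time steps 1..7
sliding = [first_step_k_truncated(t, k) for t in steps]
chunked = [first_step_subsequence(t, k) for t in steps]
step_chunks = chunks(steps, k)
```

In the sliding-window variant every step sees a full window once `t >= k`, whereas in the chunked variant the first step of each chunk receives no gradient from earlier steps at all.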


Subsequence TBPTT is very common as it fits well into the design of current toolkits

See the Deep Dive on [How to deal with long sequences](RNN_Long_Sequences.ipynb)
for how to arrange your training examples.

# Calculating gradients in an RNN

There is an important subtlety we have ignored regarding Back Propagation in an unrolled RNN
- There is a **single** weight matrix $\W$ that is shared across *all time steps*
    $$\W_\tp = \W \; \text{ for all } 1 \le t \le T$$
    

This 
- Makes the derivative computation slightly more complex
- Creates an *additional* exposure to the problem of vanishing/exploding gradients

A simple picture will illustrate.

Consider the loss at time step $\tt$ of example $i$
- $\loss^\ip_\tp = L(\hat{\y}^\ip_\tp, \y^\ip_\tp; \W)$
- The loss is a function of 
    - $\hat{\y}^\ip_\tp$: The $\tt^{th}$ element of the output sequence $\hat{\y}^\ip$ for example $i$
    - $\y^\ip_\tp$: The $\tt^{th}$ element of the **target** sequence $\y^\ip$ for example $i$

Recall from the module on back propagation that $\W$ is updated in proportion to
$$
\frac{\partial \loss_\tp}{\partial \W}
$$

and this quantity is obtained from
$$
\begin{array}{lllll}
\frac{\partial \loss}{\partial \W_\llp} & = & \frac{\partial \loss}{\partial \y_\llp} \frac{\partial \y_\llp}{\partial \W_\llp} & = & \loss'_\llp \frac{\partial \y_\llp}{\partial \W_\llp}
\end{array}
$$

where $\y_\llp$ is the output of layer $\ll$ (i.e., that which is fed as input to layer $(\ll+1)$).

In the case of an RNN:
$$
\y_\tp = \h_\tp
$$


<table>
    <tr>
        <th><center>RNN Time step</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>RNN multiple dependence on W</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_gradient.png"></td>
    </tr>
</table>

The red lines show **two** different ways that $\W$ (in particular: $\W_{hh}$) affects $\h_\tp$, and thus $\hat{\y}_\tp = \W_{hy} \h_\tp + \b_y$
- By its indirect effect on $\h_\tp$ **through** $\h_{(\tt-1)}$ (lower line)
- By its direct effect on $\h_\tp$ (upper line)
- Both using the part of $\W$ denoted by $\W_{hh}$

So

$$
\begin{array}{lll}
\frac{\partial \h^\ip_\tp}{\partial \W_{hh}} & = & \frac{d \h^\ip_\tp}{d \W_{hh}} 
    + \frac{\partial \h^\ip_\tp}{\partial \h^\ip_{(\tt-1)}} \frac{\partial \h^\ip_{(\tt-1)} }{\partial \W_{hh}} \\
& = & \frac{d (\W_{hh} \h^\ip_{(\tt-1)})}{d \W_{hh}} 
    + \frac{\partial \h^\ip_\tp}{\partial \h^\ip_{(\tt-1)}} \frac{\partial \h^\ip_{(\tt-1)} }{\partial \W_{hh}}
\end{array}
$$

(Each addend reflects a different path through which $\W_{hh}$ affects $\h_\tp$)
- There is a direct dependence of $\h^\ip_\tp$ on $\W_{hh}$
- There is an indirect dependence of $\h^\ip_\tp$ on $\W_{hh}$ through $\h^\ip_{(\tt-1)}$
    - and all prior $\h^\ip_{(\tt')}$ for $ \tt' \lt \tt$ (since $\h^\ip_{(\tt')}$ in turn depends on $\h^\ip_{(\tt'-1)}$)
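
Applying the same identity repeatedly to $\frac{\partial \h^\ip_{(\tt-1)}}{\partial \W_{hh}}$, then to $\frac{\partial \h^\ip_{(\tt-2)}}{\partial \W_{hh}}$, and so on, gives one addend per time step. The expansion below is a sketch that assumes the initial state $\h^\ip_{(0)}$ does not depend on $\W_{hh}$, so the recursion stops there. The index $s$ is introduced only for this expansion, and an empty product (when $\tt' = \tt$) is taken to be the identity.

$$
\frac{\partial \h^\ip_\tp}{\partial \W_{hh}} = \sum_{\tt'=1}^{\tt} \left( \prod_{s=\tt'+1}^{\tt} \frac{\partial \h^\ip_{(s)}}{\partial \h^\ip_{(s-1)}} \right) \frac{d \h^\ip_{(\tt')}}{d \W_{hh}}
$$

The product of per-step derivatives inside the parentheses is what can shrink or grow as $\tt - \tt'$ increases.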

So 

$$
\frac{\partial \loss^\ip_\tp}{\partial \W} = \loss'_\tp \frac{\partial \y^\ip_\tp}{\partial \W}
$$

and
$$\frac{\partial \y^\ip_\tp}{\partial \W}$$
*depends* on all time steps from $1$ to $t$.

Thus, the gradient used to update $\W$ cannot be computed without the gradient (for each time step $t$)
flowing all the way back to time step $1$.
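
The dependence on all earlier time steps can be checked numerically. The sketch below uses a scalar RNN with a `tanh` activation (an assumption, as are all names) and accumulates one addend per time step; an optional window `tau` restricts the sum in the manner of truncated BPTT.

```python
import math

def forward(xs, w_hh, w_hx, b, h0=0.0):
    """Run the scalar RNN; return [h_0, h_1, ..., h_T] (kept for the backward pass)."""
    hs = [h0]
    for x_t in xs:
        hs.append(math.tanh(w_hh * hs[-1] + w_hx * x_t + b))
    return hs

def dh_dw_hh(hs, t, w_hh, tau=None):
    """d h_t / d w_hh, summing one addend per time step in the window.

    tau=None unrolls back to step 1 (plain BPTT);
    otherwise only steps t-tau+1, ..., t contribute (truncated BPTT).
    """
    first = 1 if tau is None else max(1, t - tau + 1)
    total = 0.0
    carry = 1.0  # product of d h_r / d h_(r-1) for r = s+1, ..., t
    for s in range(t, first - 1, -1):
        local = 1.0 - hs[s] ** 2            # tanh' at step s
        total += carry * local * hs[s - 1]  # direct effect of w_hh at step s
        carry *= local * w_hh               # extend the path one step further back
    return total

xs = [0.5, -1.0, 2.0, 0.3, -0.7]
w_hh, w_hx, b = 0.8, 0.3, 0.1
T = len(xs)
hs = forward(xs, w_hh, w_hx, b)

full = dh_dw_hh(hs, T, w_hh)
truncated = dh_dw_hh(hs, T, w_hh, tau=2)

# Central finite difference of h_T with respect to the shared weight
eps = 1e-6
numeric = (forward(xs, w_hh + eps, w_hx, b)[T]
           - forward(xs, w_hh - eps, w_hx, b)[T]) / (2 * eps)
```

The full sum should agree with the finite-difference estimate, while the truncated sum omits the addends from earlier steps and is therefore only an approximation.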


# Conclusion

Updating the weights of a Recurrent layer appears, at first glance, to be straightforward
- Unroll the loop
- Use ordinary Back Propagation

We have discovered some complexity
- Full unrolling is expensive
- Gradient computation is complicated by shared weights

Fortunately, we have solutions to these complexities.

