In [1]:
%run Latex_macros.ipynb


# Teacher Forcing

Here is a diagram of the Encoder/Decoder architecture.
<br>
<br>
<table>
    <tr>
        <th><center>RNN Encoder/Decoder with Cross Attention and Self Attention (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Attention_All_Self_Attention.png"
             width="80%"></td>
    </tr>
   
</table>

The grey box represents the *entire* output sequence
$$
\hat\y_{(1:T)}
$$

From this diagram, it appears that
- the Encoder/Decoder can produce output $\hat\y_\tp$
- while attending to outputs *that have not yet been generated* at the start of step $\tt$
$$\hat\y_{(\tt : T)}$$
- i.e., it appears to be "looking into the future"

That is, it appears to be computing
$$\prc{\hat\y_\tp}{\hat\y_{(1:T)} }$$
What is going on?

## Teacher forcing at training time
 
 The *Auto-Regressive* Decoder behavior
 - constructs output sequence $\hat\y$ sequentially
     - output $\hat\y_\tp$ of step $\tt$
     - becomes part of the input $\hat\y_{(1:\tt)}$ of step $\tt+1$
     
 So how can the entire output sequence $\hat\y_{(1:T)}$
 be available before the final step $T$?

We need to distinguish between behavior
- during *training*
- versus during *inference*

During inference, clearly we can only compute
$$
\prc{\hat\y_\tp}{\hat\y_{(1 : \tt-1)} }
$$

because we haven't generated the future outputs yet.
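
The inference-time loop can be sketched in a few lines of Python. This is only an illustration
- `toy_model` is a hypothetical stand-in for a trained Decoder (it simply counts upward)
- the start token `0` and the function name `generate` are assumptions, not part of the architecture in the diagram

```python
from typing import Callable, List

def generate(next_token: Callable[[List[int]], int], T: int, start: int = 0) -> List[int]:
    """Auto-regressive decoding: each step is conditioned only on outputs generated so far."""
    y_hat: List[int] = [start]  # hypothetical start-of-sequence token
    for _ in range(T):
        # Only the already-generated prefix exists here; there is no future to attend to.
        y_hat.append(next_token(y_hat))
    return y_hat[1:]  # drop the start token

def toy_model(prefix: List[int]) -> int:
    """Hypothetical Decoder: predicts the last token plus one."""
    return prefix[-1] + 1

generated = generate(toy_model, T=5)
print(generated)  # [1, 2, 3, 4, 5]
```

Each call to `next_token` receives only the prefix generated so far, which is why the conditioning set at inference time must stop at step $\tt-1$.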

But at training time, example $\tt$ is
$$ \langle \y_{(1 : \tt-1)}, \y_\tp \rangle $$

We predict *only the immediate next* target $\y_\tp$
- *not* the full suffix $\hat\y_{(\tt:T)}$

The prediction is conditioned on the *true* target prefix $\y_{(1 : \tt-1)}$
- *not* the prefix generated during training $\hat\y_{(1 : \tt-1)}$

The Auto-Regressive behavior is eliminated during training!
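
As a minimal sketch, here is one way to build the per-step training examples from a single target sequence
- the helper name `teacher_forced_examples` is hypothetical
- Python indices are 0-based, so `y[:t]` plays the role of the prefix $\y_{(1 : \tt-1)}$
- in practice a start token is usually prepended so that the first prefix is not empty; that detail is omitted here

```python
from typing import List, Tuple

def teacher_forced_examples(y: List[str]) -> List[Tuple[List[str], str]]:
    """Build one (true target prefix, next target) pair per step."""
    return [(y[:t], y[t]) for t in range(len(y))]

y = ["the", "cat", "sat"]
examples = teacher_forced_examples(y)
for prefix, target in examples:
    print(prefix, "->", target)
```

Every pair depends only on the true targets, so none of them requires running the Decoder first.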

Training in this manner has a big advantage.

In a perfect world, when predicting $\hat\y_\tp$
- Auto-Regressive behavior during training would result in
- the prefix $\hat\y_{(1 : \tt-1)}$ of the generated output 
- matching the true target prefix
$$
\hat\y_{(1 : \tt-1)} = \y_{(1 : \tt-1)}
$$

But
- if any element of the *generated* prefix is wrong
$$
\hat\y_{(\tt')} \ne \y_{(\tt')} \text{ for } \tt' \lt \tt
$$
- it is likely that all *subsequent predicted outputs* $\hat\y_{(\tt'+1:T)}$ will be wrong
- because each subsequent output **is conditioned on incorrect** $\hat\y_{(\tt')}$

That is
- a single mis-predicted element of the sequence
- causes a catastrophic chain of errors
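
A toy illustration of the chain of errors, under loudly stated assumptions
- `faulty_model` is a hypothetical Decoder that counts upward but makes one mistake (when its last input is `2`)
- the start token `0` is also an assumption
- we compare conditioning on the generated prefix with conditioning on the true target prefix

```python
from typing import List

def faulty_model(prefix: List[int]) -> int:
    """Hypothetical Decoder: counts upward, but errs whenever the last input is 2."""
    last = prefix[-1]
    return 99 if last == 2 else last + 1

y = [1, 2, 3, 4, 5]  # true target sequence
start = 0            # hypothetical start token

# Conditioning on the generated prefix: the mistake is fed back in.
free_running = [start]
for _ in y:
    free_running.append(faulty_model(free_running))
free_running = free_running[1:]

# Conditioning on the true target prefix: the mistake stays local.
true_inputs = [start] + y[:-1]
true_prefix_preds = [faulty_model(true_inputs[: t + 1]) for t in range(len(y))]

errors_free = sum(a != b for a, b in zip(free_running, y))
errors_true_prefix = sum(a != b for a, b in zip(true_prefix_preds, y))
print(free_running, errors_free)              # [1, 2, 99, 100, 101] 3
print(true_prefix_preds, errors_true_prefix)  # [1, 2, 99, 4, 5] 1
```

With the generated prefix, the single mistake corrupts every later step; with the true prefix, only one prediction is wrong.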

Training a model under such conditions would be difficult
- the incorrect prefixes don't even come from the true distribution of inputs!
- violating the Fundamental Assumption of Machine Learning

To avoid this, our training examples compute
$$
\prc{\hat\y_\tp}{\y_{(1 : \tt-1)} }
$$
rather than
$$
\prc{\hat\y_\tp}{\hat\y_{(1 : \tt-1)} }
$$

That is: we train on *target* prefixes rather than train-time generated prefixes.

This is called *Teacher Forcing*.

Teacher forcing can be implemented
- by making the entire target output sequence (i.e., the grey box in the diagram)
$$
\y_{(1:T)}
$$
- available at training time via setting example $\tt$ to
$$ \langle \y_{(1 : T)}, \y_\tp \rangle $$
- and using *Causal masking*
- to make only the prefix $\y_{(1:\tt-1)}$ visible during the prediction of $\y_\tp$
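
A minimal sketch of a causal mask in plain Python
- the helper names `causal_mask` and `visible_prefix` are hypothetical
- indices are 0-based, so row `t` of the mask marks positions strictly before `t` as visible, matching the prefix $\y_{(1:\tt-1)}$
- real implementations usually shift the target sequence right by one position and apply the mask to attention scores instead; this sketch only shows which positions are visible

```python
from typing import List

def causal_mask(T: int) -> List[List[bool]]:
    """mask[t][s] is True when position s is visible while predicting position t."""
    return [[s < t for s in range(T)] for t in range(T)]

def visible_prefix(y: List[str], t: int, mask: List[List[bool]]) -> List[str]:
    """Return the part of the full target sequence that step t is allowed to see."""
    return [y[s] for s in range(len(y)) if mask[t][s]]

y = ["the", "cat", "sat"]
mask = causal_mask(len(y))
for t in range(len(y)):
    print(t, visible_prefix(y, t, mask), "->", y[t])
```

The full sequence is passed in once, yet each step sees only its own prefix, so all $T$ predictions can be computed from a single example without "looking into the future".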

In [2]:
print("Done")

Done
