In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

# Purpose of this notebook

This notebook is meant to serve as a *quick reference* of key concepts/notations
from the Intro course.


# Notation

To ensure that everyone is up to speed on notation, let's review
- [the notation](ML_Notation.ipynb) that we used in the "Classical Machine Learning" part of the intro course.
- [additional notation](Intro_to_Neural_Networks.ipynb) used in the "Deep Learning" part of the intro course

# Representations

A path through a Neural Network can be viewed as a sequence of representation transformations
- transforming *raw features* $\y_{(0)} = \x$
- into *synthetic features* $\y_\llp$
    - varying with layer $1 \le \ll \le (L-1)$
- of increasing abstraction

Thus, the output anywhere along the path is an *alternate representation* of the input

<div>
    <center><strong>Path through a Neural Network</strong></center>
    <br>
    <div>
    <!-- edX: Original: <img src="images/NN_Layers.png"> replace by EdX created image -->
    <img src="images/W12_L1_NN_layers1920by1080.png">
    </div>
</div>

Shallow features are less abstract: "syntax", "surface"

Deeper features are more abstract: "semantics", "concepts"
- We may even interpret the features as "pattern matching" regions or concepts in the raw feature space.

For example, in a CNN
- shallow features are primitive shapes
- deeper features seem to recognize combinations of shallower features

<div>
    <center>
        <center><strong>Input features detected by layer</strong></center>
        <br>
        <img src="images/ThreeLayers_W8_L2_Sl21.png" width=20%>
    </center>
</div>

<center><strong>Saliency Maps and Corresponding Patches<br>Single Layer 5 Feature Map<br>On 9 Maximally Activating Input images</strong></center>

<table>
    <tr>
        <td><img src="images/ZF_p4_118_row11_col1_mag.png"></td>
        <td><img src="images/ZF_p4_118_row11_col1_patch_mag.png"></td>
    </tr>
    <tr>
        <td colspan=2><center>Layer 5 Feature Map (row 11, col 1).</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/abs/1311.2901


In the simple architectures of the Intro course, we mostly ignored the intermediate representations
$$
\y_\llp : \; 1 \le \ll \le (L-1)
$$

The layers were referred to as "hidden" for a reason!

We will discover uses for intermediate representations and show how to build a "feature extractor" to obtain them
from a given architecture.
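
As a rough illustration of the idea, here is a minimal sketch of a forward pass that keeps every intermediate representation $\y_\llp$ instead of returning only the final output. It assumes a toy fully connected network with ReLU hidden layers written in plain Python; the names `forward_with_representations`, `weights`, and `biases`, and the hand-picked parameter values, are illustrative and not taken from the course code.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense(W, b, v):
    # W is a list of rows; computes W v + b
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]

def forward_with_representations(x, weights, biases):
    """Return [y_(0), y_(1), ..., y_(L)], where y_(0) = x."""
    reps = [list(x)]
    n_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = dense(W, b, reps[-1])
        # hidden layers use ReLU; the final layer is left linear in this sketch
        reps.append(relu(z) if l < n_layers else z)
    return reps

# toy network (assumed values): 2 inputs -> 3 hidden units -> 1 output, so L = 2
weights = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]],
    [[1.0, 2.0, 1.0]],
]
biases = [[0.0, 0.0, 0.0], [0.5]]

reps = forward_with_representations([2.0, 1.0], weights, biases)
hidden = reps[1]    # the intermediate ("hidden") representation y_(1)
output = reps[-1]   # the usual network output y_(L)
```

A "feature extractor" in this sense is just the same network, read out at layer $\ll$ rather than at layer $L$.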

# Recurrent Neural Networks

With a sequence $\x^\ip$ as input, and a sequence $\y$ as a potential output, the question arises:
- How does an RNN produce $\y_\tp$, the $t^{th}$ output?

Some choices
- Predict $\y_\tp$ as a direct function of the prefix of $\x$ of length $\tt$: 
$$\pr{\y_\tp | \x_{(1)} \dots \x_\tp} $$

<br>
<div>
    <center><strong>Direct function</strong></center>
    <img src="images/RNN_arch_parallel.png" width=50%>
</div>

- Loop
    - Uses a "latent state" that is updated with each element of the sequence, then predicts the output from it

$$
\begin{array}{ll}
\pr{\h_\tp | \x_\tp, \h_{(\tt-1)} } & \text{latent variable } \h_\tp \text{ encodes } [ \x_{(1)} \dots \x_\tp ]\\
\pr{\y_\tp | \h_\tp }              & \text{prediction contingent on latent variable} \\
\end{array}
$$

    
<br>
<div>
    <center><strong>Loop with latent state</strong></center>
    <img src="images/RNN_arch_loop.png" width=70%>
</div>
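
The loop can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses the common update $\h_\tp = \tanh(W_x \x_\tp + W_h \h_{(\tt-1)} + b)$, and the parameter values and the names `rnn_step` and `run_rnn` are made up for the example (nothing here is trained).

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One update of the latent state: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        pre += sum(W_h[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_t.append(math.tanh(pre))
    return h_t

def run_rnn(xs, h0, W_x, W_h, b):
    """Process the sequence element by element; return every latent state."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# illustrative (untrained) parameters: 1 input feature, 2 latent units
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

short_states = run_rnn([[1.0], [2.0]], [0.0, 0.0], W_x, W_h, b)
long_states = run_rnn([[1.0]] * 10, [0.0, 0.0], W_x, W_h, b)
```

The same parameters are reused at every step, and each latent state has 2 elements regardless of whether the sequence has 2 or 10 elements.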


## Latent state

The *latent state* $\h_\tp$ is a kind of memory that acts
as a *summary* of the prefix of sequence $\x$ through time step $\tt$:

$$
\h_\tp = \text{summary}(\x_{([1:\tt])})
$$

Note that $\h_\tp$ is a *vector* of fixed length.

Thus, it is a *fixed length* representation of the key aspects
of a sequence $\x$ of potentially *unbounded* length.

**Example**

Let's use an RNN to compute the sum of a sequence of numbers
- the latent state $\h_\tp$ can be maintained as 
$$
\h_\tp = \text{summary}(\x_{([1:\tt])}) = \sum_{\tt' =1}^\tt { \x_{(\tt')} }
$$
- by updating $\h_\tp$ in the loop
$$
\h_\tp = \h_{(\tt-1)} + \x_\tp
$$
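
The running-sum update above can be written directly as a loop. This is a minimal sketch: the latent state is a single number and the update is purely additive (no weights or activation), which is a simplification of a real RNN cell; the function name `running_sum_rnn` is illustrative.

```python
def running_sum_rnn(xs):
    """Return the latent state after each element of the sequence."""
    h = 0.0                # initial latent state: the empty prefix sums to zero
    states = []
    for x_t in xs:
        h = h + x_t        # h_(t) = h_(t-1) + x_(t)
        states.append(h)
    return states

states = running_sum_rnn([3.0, 1.0, 4.0, 1.0, 5.0])
final_state = states[-1]   # summary of the whole sequence
```

Each entry of `states` summarizes one prefix, and the final entry equals the sum of the full sequence.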

Let's make this concrete with an example: a sequence of words

<table>
    <tr>
        <th><center>RNN</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_loop_NLP.png" width=1000></td>
    </tr>
</table>

$\h_\tp$ is a **fixed length** vector that "summarizes" the prefix of sequence $\x$ up to element $\tt$.

The sequence is processed element by element, so order matters.

\begin{array}{lll}
\h_{(0)} & = & \text{summary}( [ \text{Machine} ]) \\
\h_{(1)} & = & \text{summary}( [ \text{Machine, Learning} ]) \\
\vdots \\
\h_\tp & = & \text{summary}( [ \x_{(0)}, \ldots, \x_\tp ] ) \\
\vdots \\
\h_{(5)} & = & \text{summary}( [ \text{Machine, Learning, is, easy, not, hard} ]) \\
\end{array}

Why it matters that $\h_\tp$ is *fixed length*
- it can be used as input to other types of Neural Network layers
- which *don't* process sequences.

A typical example is a model for text classification (sentiment)
- Using an RNN to create a fixed length encoding of a variable length sequence
- A Head Layer that is a Binary Classifier

<table>
    <tr>
        <th><center><strong>RNN Many to one; followed by classifier</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_many_to_one_to_classifier.jpg" width=80%></td>
    </tr>
</table>
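
A minimal sketch of this many-to-one pattern, under simplifying assumptions: the token embeddings have the same length as the latent state (so they are added directly, with no input weight matrix), the Head Layer is a logistic regression, and all parameter values are hypothetical and untrained. The names `encode`, `classify`, and `embed` are illustrative.

```python
import math

def encode(tokens, embed, W_h, dim):
    """Many-to-one: consume the whole sequence, return only the final latent state."""
    h = [0.0] * dim
    for tok in tokens:
        x = embed[tok]
        # new state depends on the current token and the previous state
        h = [math.tanh(x[i] + sum(W_h[i][k] * h[k] for k in range(dim)))
             for i in range(dim)]
    return h

def classify(h, w, b):
    """Head Layer: binary classifier (logistic regression) on the fixed-length encoding."""
    z = sum(w_i * h_i for w_i, h_i in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical, untrained parameters for a 2-unit latent state
embed = {"good": [1.0, 0.0], "bad": [-1.0, 0.0],
         "movie": [0.0, 0.5], "not": [0.0, -0.5]}
W_h = [[0.5, 0.0], [0.0, 0.5]]
w, b = [2.0, 0.0], 0.0

p_short = classify(encode(["good"], embed, W_h, 2), w, b)
p_long = classify(encode(["bad", "movie", "not", "good", "movie"], embed, W_h, 2), w, b)
```

The classifier never sees the sequence length: a 1-token and a 5-token input both reach it as a vector of length 2.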

## Output $\hat\y_\tp$ of an RNN

According to our pseudo-code and diagram
$$
\hat\y_\tp = \h_\tp
$$

That is: the output is the same as the latent state.

It is easy to add another NN to transform $\h_\tp$ into a $\hat\y_\tp$ that is different
- we will omit this additional layer for clarity


## Unrolled RNN diagram

<table>
    <tr>
        <th><center>RNN many to many API</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_many_to_many.jpg"></td>
    </tr>
</table>

## Encoder-Decoder architecture; Auto-regressive 

A very common architecture pairs two RNNs
- an Encoder, which summarizes the input sequence $\x_{([1:\bar T])}$ via final latent state $\bar \h_{(\bar T)}$
- a Decoder, which takes the input summary $\bar \h_{(\bar T)}$ and outputs sequence $\hat \y_{([1:T])}$

It is used for *Sequence to Sequence* tasks where both the input and output are sequences.

<table>
    <tr>
        <th><center><strong>Encoder-Decoder for language translation</strong></center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Language_Translation.png" width=80%></td>
    </tr>
</table>

Notice that the Decoder output $\hat\y_{(\tt-1)}$ at position $(\tt-1)$ is fed back as *input* for position $\tt$.

This is called *Autoregressive* behavior.

It is typical behavior for Generative tasks.

<table>
    <tr>
        <th><center>Test time: no forcing</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_one_to_many.png"></td>
    </tr>
</table>
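
The feedback loop at test time can be sketched as follows. This is an illustration, not the course's implementation: `step` is a hypothetical stand-in for one Decoder step, the `<start>` and `<end>` tokens are assumptions, and the "trained model" is replaced by a deterministic lookup table so the example is self-contained.

```python
def decode(step, h_enc, start_token, end_token, max_len=10):
    """Greedy autoregressive generation.

    step(prev_output, h) -> (next_output, new_h) stands in for one Decoder step.
    """
    h, prev, outputs = h_enc, start_token, []
    for _ in range(max_len):
        y_t, h = step(prev, h)
        if y_t == end_token:
            break
        outputs.append(y_t)
        prev = y_t   # the output at position t-1 becomes the input at position t
    return outputs

# toy stand-in for a trained Decoder: a lookup table; the state just counts steps
NEXT = {"<start>": "machine", "machine": "learning", "learning": "is",
        "is": "easy", "easy": "<end>"}

def toy_step(prev, h):
    return NEXT[prev], h + 1

generated = decode(toy_step, 0, "<start>", "<end>")
```

The `max_len` cap is a design choice for the sketch: without it, a model that never emits the end token would generate forever.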


# Language Models

The *Language Model* training objective
- given some text
    - sequence of *tokens*
- predict a word that could be the next word in the sequence

We sometimes refer to this as the "predict the next" task.

Clearly, we need to train a model on the "predict the next" objective with labeled examples.

But this is sometimes called Semi-Supervised or Unsupervised because text is not inherently labeled.

Yet we can easily create $(T-1)$ labeled examples from a text string $s[1:T]$. For each position $2 \le \tt \le T$
- feature: $s[1:\tt-1]$
- label: $s[\tt]$

<center>$\mathbf{s} = \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(T)}$</center>
        <br><br><br>
\begin{array}{lll}
      i  & \x^\ip  & \y^\ip \\
      \hline \\
      1 & \mathbf{s}_{(1) }  & \mathbf{s}_{(2)} \\
      2 & \mathbf{s}_{(1), (2) }  & \mathbf{s}_{(3)} \\
      \vdots \\
      i & \mathbf{s}_{(1), \ldots, (i) }  & \mathbf{s}_{(i+1)} \\
      \vdots \\
      (T-1) & \mathbf{s}_{(1), \ldots, (T-1) }  & \mathbf{s}_{(T)} \\
\end{array}
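
The construction in the table can be sketched in a couple of lines. The example assumes whitespace tokenization, which is a simplification of real tokenizers; the function name `make_lm_examples` is illustrative.

```python
def make_lm_examples(tokens):
    """Return (feature, label) pairs: each proper prefix paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

s = "Machine Learning is easy not hard".split()   # T = 6 tokens
examples = make_lm_examples(s)                    # T - 1 = 5 labeled examples

first_feature, first_label = examples[0]
last_feature, last_label = examples[-1]
```

No human labeling is involved: the labels come from the text itself, which is why a single long text yields many training examples.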

The *Unsupervised Pre-Trained Model + Supervised Fine-Tuning paradigm*
is 
- a way of adapting a model trained on the Language Modeling objective
- to perform another task

Pre-training refers to training a model on the Language Modeling objective with *lots* of data
- this is called Unsupervised because text is not inherently labeled
- we can easily create a labeled example  from a text string $s[1:T]$
    - feature: $s[1:\tt-1]$
    - label: $s[\tt]$

- Pre-training
    - Train a model with *lots* of data
    - On the Language Modeling objective

In [2]:
print("Done")

Done
