In [1]:
%run Latex_macros.ipynb


# Word representations

We present the evolution of the way words are represented in NLP tasks.

In the sequel, let
- $\w$ denote a token
- $\Vocab$ denote the vocabulary

## One Hot Encoding

A token is encoded as a one-hot encoded (OHE) vector over the vocabulary $\Vocab$
$$
\begin{array}{ll}
\text{rep}(\w) = \text{OHE}(\w) * I & \text{where } I \text{ is the identity matrix, } || I || = (||\Vocab || \times || \Vocab ||)\\
\end{array}
$$

- $\text{rep}(\w)$ is long ($|| \Vocab ||$) and sparse
- does not capture relationships between tokens
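A minimal sketch of the encoding above, using NumPy. The four-token vocabulary and the helper name `one_hot` are hypothetical, chosen only for illustration. It shows that multiplying by the identity matrix leaves the one-hot vector unchanged, and that two distinct tokens always have a dot product of zero, so no similarity is encoded.

```python
import numpy as np

# Hypothetical toy vocabulary; the tokens are illustrative only.
vocab = ["bank", "river", "money", "plane"]
token_to_index = {token: i for i, token in enumerate(vocab)}

def one_hot(token):
    """Return the one-hot vector for `token` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[token_to_index[token]] = 1.0
    return vec

identity = np.eye(len(vocab))            # I has shape (|V| x |V|)
rep_bank = one_hot("bank") @ identity    # multiplying by I leaves the vector unchanged
rep_money = one_hot("money") @ identity

# Distinct tokens are orthogonal, so the encoding carries no notion of similarity.
similarity = float(rep_bank @ rep_money)
print(rep_bank, similarity)
```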

## Embeddings

[Embeddings](NLP_Embeddings.ipynb)


A token is encoded as a short, dense vector
$$
\begin{array}{ll}
\text{rep}(\w) = \text{OHE}(\w) * E & \text{where } E \text{ is the embedding matrix, } || E || = (||\Vocab || \times n_e)\\
\end{array}
$$

- $\text{rep}(\w)$ is short ($n_e <<|| \Vocab ||$) and dense
- captures relationships between tokens
    - "meaning"
- *Not* context sensitive
    - "bank":
        - financial institution ?
        - edge of a river ?
        - tilt (e.g., turning a plane) ?
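As a rough illustration of the formula above, the sketch below checks that $\text{OHE}(\w) * E$ is just a row lookup in $E$. The vocabulary, the embedding size `n_e = 3`, and the randomly initialized matrix `E` are all assumptions standing in for a learned embedding. Because the lookup depends only on the token, "bank" gets the same vector in every sentence.

```python
import numpy as np

vocab = ["bank", "river", "money", "plane"]   # hypothetical toy vocabulary
n_e = 3                                        # embedding size, chosen arbitrarily

rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), n_e))         # random stand-in for a learned embedding matrix

def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

idx = vocab.index("bank")
rep_by_matmul = one_hot(idx, len(vocab)) @ E   # OHE(w) * E
rep_by_lookup = E[idx]                         # equivalent: select row idx of E

same = np.allclose(rep_by_matmul, rep_by_lookup)
print(rep_by_matmul.shape, same)
```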
    

# Contextualized representations

*The representation of each token in a sequence depends on other parts of the sequence*

- Unidirectional
$$
\begin{array}{l}
\text{rep}(\w_\tp) =  F( \w_\tp | \w_{(0)}, \ldots, \w_{(\tt-1)} ) \\
\end{array}
$$

The latent state $\h_\tp$ of an RNN is the natural candidate for $F$
- $\text{rep}$ is short ($|| \h_\tp ||$)
- captures the left context $\w_{(0)}, \ldots, \w_{(\tt-1)}$
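The idea can be sketched with a plain tanh RNN in NumPy. This is only an illustration: the weights are random rather than trained, the inputs are random vectors standing in for token embeddings, and the function name `rnn_states` is hypothetical. The check at the end shows that changing the last token leaves every earlier representation untouched, since each state only sees its left context.

```python
import numpy as np

def rnn_states(inputs, W_x, W_h, b):
    """Run a simple tanh RNN left to right; return one hidden state per position."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:                          # process positions in order
        h = np.tanh(W_x @ x + W_h @ h + b)    # state depends on x and all earlier inputs
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 4, 3                        # sequence length, input size, state size
W_x = 0.1 * rng.normal(size=(n_h, n_in))      # untrained, random weights
W_h = 0.1 * rng.normal(size=(n_h, n_h))
b = np.zeros(n_h)

seq_a = rng.normal(size=(T, n_in))            # stand-in for a sequence of token embeddings
seq_b = seq_a.copy()
seq_b[-1] = rng.normal(size=n_in)             # change only the last token

states_a = rnn_states(seq_a, W_x, W_h, b)
states_b = rnn_states(seq_b, W_x, W_h, b)

prefix_unchanged = np.allclose(states_a[:-1], states_b[:-1])
last_changed = not np.allclose(states_a[-1], states_b[-1])
print(prefix_unchanged, last_changed)
```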

But the meaning of a token may depend on the *full* context, including the tokens to its right.

- Bidirectional
$$
\begin{array}{l}
\text{rep}(\w_\tp) =  \text{concat} \left( F( \w_\tp | \w_{(0)}, \ldots, \w_{(\tt-1)} ), F( \w_\tp | \w_{(T)}, \ldots, \w_{(\tt + 1)} ) \right)\\
\end{array}
$$

The latent state $\h_\tp$ of a *bi-directional* RNN is the natural candidate for $F$
- $\text{rep}$ is short ($|| \h_\tp ||$)
- captures the left context $\w_{(0)}, \ldots, \w_{(\tt-1)}$ via an RNN processing sequence $\w$ left to right
- captures the right context $\w_{(T)}, \ldots, \w_{(\tt+1)}$ via an RNN processing sequence $\w$ right to left
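A hedged sketch of the bidirectional case, again with untrained random weights and hypothetical function names (`rnn_states`, `bidirectional_states`). One RNN reads the sequence left to right, a second reads it right to left, and the two states for each position are concatenated. The final check shows that changing the last token alters the representation of the first token only through the backward half.

```python
import numpy as np

def rnn_states(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over `inputs` in the given order; return all states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_states(inputs, fwd_params, bwd_params):
    """Concatenate left-to-right and right-to-left states for every position."""
    forward = rnn_states(inputs, *fwd_params)
    # Feed the reversed sequence, then flip back so row t lines up with token t.
    backward = rnn_states(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 4, 3

def random_params():
    """Untrained weights; each direction gets its own set."""
    return (0.1 * rng.normal(size=(n_h, n_in)),
            0.1 * rng.normal(size=(n_h, n_h)),
            np.zeros(n_h))

fwd_params, bwd_params = random_params(), random_params()

seq_a = rng.normal(size=(T, n_in))
seq_b = seq_a.copy()
seq_b[-1] = rng.normal(size=n_in)             # change only the last token

rep_a = bidirectional_states(seq_a, fwd_params, bwd_params)
rep_b = bidirectional_states(seq_b, fwd_params, bwd_params)

forward_half_same = np.allclose(rep_a[0, :n_h], rep_b[0, :n_h])
backward_half_differs = not np.allclose(rep_a[0, n_h:], rep_b[0, n_h:])
print(rep_a.shape, forward_half_same, backward_half_differs)
```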

## ELMo

ELMo ([link to paper](https://arxiv.org/abs/1802.05365)) was a first step in creating
contextualized representations.

It uses two LSTMs
$$
\begin{array}{lll}
\text{Forward Model} &  \pr{\w_\tp | \w_{(0)}, \ldots, \w_{(\tt-1)} }  & \text{predict next word from prefix} \\
\text{Backward Model} &  \pr{\w_\tp | \w_{(T)}, \w_{(T-1)}, \ldots, \w_{(\tt+1)} } & \text{predict previous word from suffix} \\
\end{array}
$$

The Forward (resp., Backward) Model uses the *entire prefix* (resp., *suffix*), not just a fixed window
- That's why a sequence model (like the LSTM) is needed

The unsupervised pre-training objective maximizes the likelihood of both models
$$
\begin{array}{l}
\mathcal{L}_1 ( \mathcal{U} ) =
\left(
\sum_{\tt=1}^T { \log{P( \w_\tp | \w_{(0)}, \ldots, \w_{(\tt-1)} ; \Theta )} }
\right)
+
\left( \sum_{\tt=1}^T { \log{P( \w_\tp | \w_{(T)}, \w_{(T-1)}, \ldots, \w_{(\tt+1)} ; \Theta )} }
\right)  \\
\end{array}
$$
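To make the objective concrete, here is a small sketch that evaluates it given the two models' predicted distributions. The function name `joint_log_likelihood` and the array layout (row `t` holds the predicted distribution over the vocabulary for position `t`) are assumptions of this example. Uniform distributions stand in for real model outputs, so the result can be checked by hand: each term is $\log(1/V)$.

```python
import numpy as np

def joint_log_likelihood(fwd_probs, bwd_probs, token_ids):
    """Sum the log-probabilities both models assign to the observed tokens.

    fwd_probs[t] : forward model's distribution for position t (given the prefix)
    bwd_probs[t] : backward model's distribution for position t (given the suffix)
    token_ids[t] : index of the token actually observed at position t
    """
    positions = np.arange(len(token_ids))
    fwd_ll = np.log(fwd_probs[positions, token_ids]).sum()
    bwd_ll = np.log(bwd_probs[positions, token_ids]).sum()
    return fwd_ll + bwd_ll

T, V = 4, 5                                   # toy sequence length and vocabulary size
token_ids = np.array([2, 0, 4, 1])            # hypothetical observed tokens
uniform = np.full((T, V), 1.0 / V)            # stand-in for real model predictions

ll = joint_log_likelihood(uniform, uniform, token_ids)
expected = 2 * T * np.log(1.0 / V)            # every term equals log(1/V)
print(np.isclose(ll, expected))
```

Training would adjust $\Theta$ to increase this quantity; the sketch only evaluates it.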


Both the Forward and Backward models use *multi-layer* LSTMs
- Let $\h^{[\ll]}_{F,\tp}$ denote the hidden state of layer $\ll$ of the Forward model on input element $t$
- Let $\h^{[\ll]}_{B,\tp}$ denote the hidden state of layer $\ll$ of the Backward model on input element $t$

Concatenating these two states gives the layer $\ll$ "ELMo" (Embeddings from Language Models) for word $\w_\tp$
$$
E^{[\ll]}_\tp = [ \h^{[\ll]}_{F,\tp}, \h^{[\ll]}_{B,\tp}] 
$$

Suppose there are $L$ layers of LSTMs.

It would seem natural to use the ELMo of the *last* layer, $E^{[L]}_\tp$, as the representation for $\w_\tp$.

But ELMo does something a little different
- the authors *combine* the ELMos for $\w_\tp$ from multiple layers
$$
E_\tp = \sum_{\ll=1}^L { s^\text{task}_\ll * E^{[\ll]}_\tp} 
$$
- the per-layer weights $s^\text{task}_\ll$ are parameters that are learned as part of the task-specific model
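The weighted sum can be sketched as follows. The function name `combine_layers` and the random per-layer representations are hypothetical. Normalizing the weights with a softmax over unconstrained scores is an assumption of this sketch; the text above only says that the weights are learned.

```python
import numpy as np

def combine_layers(layer_reps, scores):
    """Weighted sum over layers.

    layer_reps : array of shape (L, T, d), one representation per layer and position
    scores     : array of shape (L,), unnormalized per-layer scores (assumed learnable)
    Returns an array of shape (T, d).
    """
    weights = np.exp(scores - scores.max())   # softmax, an assumption of this sketch
    weights = weights / weights.sum()
    return np.tensordot(weights, layer_reps, axes=1)   # sum_l weights[l] * layer_reps[l]

rng = np.random.default_rng(0)
L, T, d = 3, 4, 6                             # layers, positions, size of [h_F, h_B]
layer_reps = rng.normal(size=(L, T, d))       # stand-in for per-layer ELMo vectors

# Equal scores reduce to a plain average over layers.
averaged = combine_layers(layer_reps, np.zeros(L))

# A dominant score on the last layer recovers (almost exactly) the last layer alone.
top_only = combine_layers(layer_reps, np.array([0.0, 0.0, 50.0]))
print(averaged.shape, np.allclose(top_only, layer_reps[-1], atol=1e-6))
```

In a real task-specific model the scores would be trained jointly with the rest of the model, so the network itself decides how much each layer contributes.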

<table>
    <tr>
        <th><center>ELMo</center></th>
    </tr>
    <tr>
        <td><img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png</center></td>
    </tr>   
</table>


In our module on Transfer Learning, we speculated that
- the representations produced by deep layers (closer to the Head) are task-specific
- the representations produced by shallow layers (closer to the input) are task-agnostic

Rather than arbitrarily guessing where to chop off the Head of the Word Prediction task
- ELMo learns which layers are most useful for the task-specific model

# Attention based representations

While bi-directional representations take the full context into account, each of the two underlying models has a "view" that is limited to a single direction (left-to-right or right-to-left).

We introduced the Attention mechanism as a device that enables a Neural Network to "attend" to the most relevant piece of information
- e.g., a word in a sequence


The Attention mechanism, in theory, allows us to access each element of the input sequence *as needed*
rather than in order (as in an RNN or LSTM).

*Attention* is usually a very important part of obtaining contextualized representations
- It decides which other tokens in the sequence affect the representation of any given token
- Self-attention over the *entire* input sequence derives new representations that are context sensitive
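A minimal sketch of self-attention in NumPy. The scaled dot-product form used here is one common choice and is an assumption, since the text above does not fix a particular scoring function. The projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned parameters, and the function name `self_attention` is hypothetical. Row `t` of the returned weights says how strongly token `t` attends to every token in the sequence, in either direction.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence.

    X : array of shape (T, d), one input vector per token
    Returns (new representations of shape (T, d_v), attention weights of shape (T, T)).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # similarity of each token to every token
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d, d_k, d_v = 5, 8, 4, 6                   # toy sizes, chosen arbitrarily
X = rng.normal(size=(T, d))                   # stand-in for token embeddings
W_q = rng.normal(size=(d, d_k))               # untrained, random projections
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

context_reps, attn = self_attention(X, W_q, W_k, W_v)
print(context_reps.shape, attn.shape, np.allclose(attn.sum(axis=-1), 1.0))
```

Unlike the RNN-based representations, no position is processed before another: every token's new representation is a weighted mixture over the entire sequence.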

<table>
    <tr>
        <th><center>Attention weights</center></th>
    </tr>
    <tr>
        <td><center>Thickness of the blue lines indicates the strength of attention to other tokens</center></td>
    </tr>
    <tr>
        <td><img src="https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png</center></td>
    </tr>   
</table>


