In [1]:
%run Latex_macros.ipynb


# Language Models

A *Language Model* is an instance of the "predict the next" paradigm where
- given a sequence of tokens
- we try to predict the next token

$$
\pr{\y_\tp \, | \, \y_{(1:\tt-1)}}
$$

Recall the architecture used to solve the "predict the next word" task, and the data preparation that goes with it

<br>
<table>
<tr>
    <center><strong>Language Modeling task</strong></center>
</tr>
    <br>
<tr>
    <th><center>Architecture</center></th>
    <th><center>Data preparation</center></th>
    </tr>
<tr>
    <td><img src="images/RNN_many_to_one_to_classifier.jpg" width=70%></td>
    <td><center>$\y = \y_{(1)}, \ldots, \y_{(T)}$</center>
        <br><br><br>
\begin{array}{lll}
      i  & \x^\ip  & \y^\ip \\
      \hline
      1 & \y_{(0)}  & \y_{(1)} \\
      2 & \y_{(0:1)}  & \y_{(2)} \\
      \vdots \\
      \tt & \y_{(0:\tt-1)}  & \y_\tp \\
      \vdots \\
      T & \y_{(0:T-1)}  & \y_{(T)} \\
\end{array}
    </td>
</tr>
</table>

The raw data
- e.g., the sequence of words $\mathbf{s} = \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(\bar T)}$

is not naturally labeled.

We need a Data Preparation step to create labeled example $i$
$$
\begin{array}{lll}
\x^\ip & = & \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(i)} \\
\y^\ip & = & \mathbf{s}_{(i+1)}
\end{array}
$$
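
The Data Preparation step can be sketched in a few lines. This is only an illustration: the function name `make_examples` and the whitespace tokenization are assumptions made for the example, not part of any particular library.

```python
def make_examples(tokens):
    """Turn an unlabeled token sequence into (prefix, next-token) pairs."""
    examples = []
    # Example i uses the first i tokens as features and token i+1 as the label.
    for i in range(1, len(tokens)):
        examples.append((tokens[:i], tokens[i]))
    return examples


raw_text = "the cat sat on the mat"
tokens = raw_text.split()  # whitespace tokenization, for illustration only

for x, y in make_examples(tokens):
    print(x, "->", y)
```

A sequence of $\bar T$ tokens yields $\bar T - 1$ labeled examples, which is why raw text is such a rich source of training data.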

We have called this method of turning unlabeled data into labeled examples: *Semi-Supervised* Learning.

In the NLP literature, it is called *Unsupervised Learning*.

There are abundant sources of raw text data
- news, books, blogs, Wikipedia
- not all of the same quality

The large number of examples that can be generated facilitates the training of models with a very large number of weights.

Training such models is extremely expensive but, fortunately, the results can be re-used.
- Someone with abundant resources trains a Language Model on a broad domain
- Publishes the architecture and weights
- Others re-use the published model

## Predict the next token? Really: predict the *distribution* of the next token

We have casually defined the Language Modeling objective as predicting the next token.

As you can see: the head layer is a Classifier
- produces a probability for *each token* in the vocabulary as being the next
$$
\pr{\y_\tp \, | \, \y_{(1:\tt-1)}}
$$
- We choose one token by sampling from this probability distribution
    - Greedy sampling: always choose the token with the highest probability
    - Non-greedy sampling: draw a token at random, in proportion to its probability
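
The difference between the two ways of choosing a token can be seen in a small sketch. The vocabulary and the logits below are made-up values standing in for the output of the Classifier head; only the NumPy calls are real.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


vocab = ["cat", "dog", "mat", "sat"]       # hypothetical toy vocabulary
logits = np.array([2.0, 0.5, 1.0, -1.0])   # hypothetical head outputs
probs = softmax(logits)

# Greedy: always take the most probable token.
greedy_token = vocab[int(np.argmax(probs))]

# Non-greedy: draw a token index at random according to the distribution.
rng = np.random.default_rng(seed=0)
sampled_token = vocab[rng.choice(len(vocab), p=probs)]

print(dict(zip(vocab, probs.round(3))))
print("greedy :", greedy_token)
print("sampled:", sampled_token)
```

Greedy selection is deterministic; the non-greedy draw may return a less probable token, which is what gives generated text its variety.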



## The Masked Language Modeling objective

There is a variation on the Language Modeling objective called the  *Masked Language Modeling* objective.
- Language Modeling objective: given $s[1:\tt-1]$, predict $s[\tt]$
- Masked Language Modeling objective
    - Given $s[1:\tt]$
    - Randomly choose an index $1 \le j \le \tt$
    - "Mask" token $j$ by replacing it with `<MASK>` so that the input becomes
    $$
    s_{(1)},  \ldots, s_{(j-1)}, \text{<MASK>}, s_{(j+1)}, \ldots, s_\tp
    $$
    - Predict the value behind the mask, i.e., $s_{(j)}$
- The Language Modeling objective is the special case where $j=\tt$
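
A minimal sketch of how one Masked Language Modeling example might be built. The helper `mask_one_token` is a hypothetical name introduced here, and note that Python positions are 0-based while the text above counts from 1.

```python
import random

MASK = "<MASK>"


def mask_one_token(tokens, rng):
    """Return (masked_input, label, j) with one randomly chosen position masked."""
    j = rng.randrange(len(tokens))  # 0-based position of the token to hide
    masked = list(tokens)           # copy, so the original sequence is untouched
    label = masked[j]
    masked[j] = MASK
    return masked, label, j


rng = random.Random(0)
tokens = "the cat sat on the mat".split()

masked, label, j = mask_one_token(tokens, rng)
print(masked, "->", label)
```

Masking the last position reproduces the ordinary "predict the next token" example; masking any other position lets the model use context on both sides of the hidden token.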

# Unsupervised Pre-Training + Supervised Fine-Tuning (Transfer Learning)

How do we adapt a Language Model to solve other Target tasks?

The obvious answer is via Transfer Learning
- The Language Model has learned a lot about the nature of language
    - perhaps the language-knowledge can be transferred to a new task
- Replace the "head" that predicts the next token
- With a new task-specific head
- Train the new model on labeled examples from the Target task
    - the task-specific head **must** be trained
    - the language-model weights **can** (but don't have to) be adapted
    
This paradigm is called *Unsupervised Pre-Training + Supervised Fine-Tuning*.

<br>
<table>
    <tr>
        <th><center>Transfer Learning: replace the head of the pre-trained model</center></th>
    </tr>
    <tr>
        <td><img src="images/Transfer_Learning_2.jpg" width=60%></td>
    </tr>
</table>

## Example: Fine-Tuning a Pre-trained Language Model to analyze sentiment

This is a straightforward application of Transfer Learning
- Replace the Classification Head used for Language Modeling
    - i.e., the head that generates a probability distribution over words in the vocabulary
- By an untrained Binary Classification head (Positive/Negative sentiment)
- Train on labeled examples, each a pair of
    - sentence
    - label: Positive/Negative
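
The recipe can be sketched with NumPy alone. Everything here is a toy stand-in: the "pre-trained" body is just a frozen random embedding table (a real one would come from a published Language Model), and the vocabulary, sentences and hyper-parameters are invented for the illustration. Only the new head is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained body (hypothetical): a frozen embedding table.
vocab = {"good": 0, "great": 1, "bad": 2, "awful": 3, "movie": 4, "plot": 5}
frozen_embeddings = rng.normal(size=(len(vocab), 8))


def body(sentence):
    """Frozen feature extractor: mean of token embeddings, squashed by tanh."""
    ids = [vocab[word] for word in sentence.split()]
    return np.tanh(frozen_embeddings[ids].mean(axis=0))


def head(features, w, b):
    """New binary classification head: probability of Positive sentiment."""
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))


def cross_entropy(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def train_head(X, y, lr=0.5, steps=500):
    """Gradient descent on the head only; the body's weights are never touched."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        error = head(X, w, b) - y
        w = w - lr * X.T @ error / len(y)
        b = b - lr * error.mean()
    return w, b


# Labeled examples: (sentence, label) with 1 = Positive, 0 = Negative.
examples = [("good movie", 1), ("great plot", 1), ("great movie", 1),
            ("bad movie", 0), ("awful plot", 0), ("awful movie", 0)]
X = np.stack([body(sentence) for sentence, _ in examples])
y = np.array([label for _, label in examples], dtype=float)

loss_before = cross_entropy(head(X, np.zeros(X.shape[1]), 0.0), y)
w, b = train_head(X, y)
loss_after = cross_entropy(head(X, w, b), y)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Keeping the body frozen is the cheaper of the two options; also updating the body's weights during training would correspond to adapting the language-model weights as well.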

# Other uses of a Language Model: Feature based Transfer Learning

We can generalize the procedure of "replacing the head": re-use the features produced by the
Source model.

Let $f_\Theta (\x)$ denote the function computed by the Source Language Model on input sequence $\x$
- $f_\Theta (\x)$ is the output of the layer **before the final Classification layer**; that final layer translates $f_\Theta (\x)$ into a distribution over the token Vocabulary
- the Source model is parameterized by $\Theta$

Feature based Transfer Learning computes
$$
g_\Phi( f_\Theta (\x) )
$$
for the function $g$ (parameterized by $\Phi$) computed by a new NN.

That is
- we take the final representation $f_\Theta (\x)$ created by the model for the Source task
- and feed it into a model for the Target task

The case where $g_\Phi$ is a Classifier is the special case of creating a new Target task specific head.

## Using the final representation

The final representation of some models may be useful in surprising ways.

For example
- consider the final representation created by an Encoder Transformer trained on a Language Modeling task
- it is a *sequence* (of length $\bar T$, where the input is also a sequence of length $\bar T$)
    - a "context sensitive" representation of each element in the input sequence



Many tasks that use an Encoder modify the original input sequence
- by bracketing it with special tokens `<START>, <END>`

The context sensitive representation of the `<START>, <END>` tokens
- can be interpreted as a **fixed-length summary of the entire input sequence**
- similar to the final latent state in an RNN

Thus, the Encoder can be used to *summarize a sequence*.

Two interesting uses of this fixed-length summary
- Classification of sequences: e.g., Sentiment Analysis
- Semantic Search
    - "Google"-like search
    - encode your query as a fixed length vector
    - encode each document in a collection as a fixed length vector
    - retrieval: return the document whose vector is closest to that of the query
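
A sketch of the retrieval step, assuming the fixed-length vectors have already been produced by an Encoder. The three-dimensional vectors and document titles below are invented, and cosine similarity is one common choice for "closest", not the only one.

```python
import numpy as np


def cosine_similarity(query, docs):
    """Cosine similarity between one query vector and each row of docs."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q


doc_titles = ["pasta recipes", "marathon training", "tax filing guide"]

# Hypothetical encoder outputs: one fixed-length vector per document.
doc_vectors = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.9, 0.1],
                        [0.0, 0.2, 0.9]])

# Hypothetical encoding of a query such as "how to run faster".
query_vector = np.array([0.2, 0.8, 0.0])

scores = cosine_similarity(query_vector, doc_vectors)
best = int(np.argmax(scores))
print(doc_titles[best], scores.round(3))
```

Because the documents can be encoded once and stored, only the query needs to be encoded at search time.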
    

# Multi-task learning

One area of recent interest is *multi-task learning*
- Training a model to implement multiple tasks

A model that implements a single task computes
$$\pr{\text{output | input}}$$

A model that implements several tasks computes
$$\pr{\text{output | input, task-id }} $$

Using the Universal API, 
a Language Model may be adapted to solve *multiple tasks* simultaneously.

This requires you to construct a training set
- with examples from each task
- where each example has an additional "feature" that identifies the task to which it applies

For example

$$\begin{array}{llll}
(  \mathsf{Translate \; to \; French} , & \text{English text} ,  & & \text{French text}) \\
( \mathsf{Answer \; the \; question} , & \text{document} , & \text{question} , & \text{answer}) \\
\end{array}
$$

The first feature above is the task identifier of the example.
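
One way such a mixed training set might be assembled is to fold the task identifier into the input text. The helper `to_text_example`, the `" | "` separator and the sample sentences are all assumptions made for this sketch.

```python
def to_text_example(task_id, inputs, output):
    """Build one (input text, target text) pair with the task id placed first."""
    prompt = " | ".join([task_id, *inputs])
    return prompt, output


training_set = [
    to_text_example("Translate to French",
                    ["The cat sleeps."],
                    "Le chat dort."),
    to_text_example("Answer the question",
                    ["Paris is the capital of France.",
                     "What is the capital of France?"],
                    "Paris"),
]

for prompt, target in training_set:
    print(prompt, "->", target)
```

Since every example is reduced to an (input text, target text) pair, examples from different tasks can be shuffled together and used to train a single model.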

