# Language Models

A *Language Model* is an instance of the "predict the next" paradigm where
- given a sequence of words
- we try to predict the next word

Recall the architecture used to solve "predict the next word" and the associated data preparation:

<br>
<table>
<tr>
    <th colspan="2"><center><strong>Language Modeling task</strong></center></th>
</tr>
<tr>
    <th><center>Architecture</center></th>
    <th><center>Data preparation</center></th>
</tr>
<tr>
    <td><img src="images/RNN_many_to_one_to_classifier.jpg" width=70%></td>
    <td><center>$\mathbf{s} = \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(T)}$</center>
        <br><br><br>
        $$
        \begin{array}{lll}
        i  & \x^\ip  & \y^\ip \\
        \hline
        1 & \mathbf{s}_{(1)}  & \mathbf{s}_{(2)} \\
        2 & \mathbf{s}_{(1), (2)}  & \mathbf{s}_{(3)} \\
        \vdots \\
        i & \mathbf{s}_{(1), \ldots, (i)}  & \mathbf{s}_{(i+1)} \\
        \vdots \\
        (T-1) & \mathbf{s}_{(1), \ldots, (T-1)}  & \mathbf{s}_{(T)} \\
        \end{array}
        $$
    </td>
</tr>
</table>


The raw data
- e.g., the sequence of words $\mathbf{s} = \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(\bar T)}$

is not naturally labeled.

We need a Data Preparation step to create labeled examples
$$
\langle \x^\ip, \y^\ip \rangle = \langle [ \mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(i)} ], \mathbf{s}_{(i+1)} \rangle
$$
where the features are a prefix of the sequence and the target is the word that follows it.
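
The Data Preparation step can be sketched in a few lines of plain Python. This is only an illustration: the function name `make_examples` and the sample sentence are assumptions made for this sketch, and words are split on whitespace rather than with a real tokenizer.

```python
def make_examples(words):
    """Turn a word sequence s_(1), ..., s_(T) into T-1 labeled examples.

    Example i pairs the prefix s_(1), ..., s_(i) (features)
    with the next word s_(i+1) (target).
    """
    examples = []
    for i in range(1, len(words)):
        prefix = words[:i]   # s_(1), ..., s_(i)
        target = words[i]    # s_(i+1)
        examples.append((prefix, target))
    return examples


# Illustrative raw text (an assumption, not taken from a real corpus)
sentence = "the students are attending a lecture".split()
examples = make_examples(sentence)

for x, y in examples:
    print(x, "->", y)
```

A sequence of $T$ words yields $T-1$ examples, which is why a large corpus of raw text produces a very large training set without any manual labeling.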

We have called this method of turning unlabeled data into labeled examples: *Semi-Supervised* Learning.

In the NLP literature, it is called *Unsupervised Learning*.

There are abundant sources of raw text data
- news, books, blogs, Wikipedia
- not all of the same quality

The large number of examples that can be generated facilitates the training of models with a very large number of weights.

Training such models is extremely expensive but, fortunately, the results can be re-used.
- Someone with abundant resources trains a Language Model on a broad domain
- Publishes the architecture and weights
- Others re-use them

Models that have been trained with the intent that they be re-used are called *Pre-Trained* models.

The process of creating such a Language Model
from unlabeled raw text is referred to as 

> *Unsupervised Pre-Training*: Train a model on a **very large** number of examples from a **broad** domain

# Using a Pre-Trained Language Model

## Feature based

Consider the behavior of a Language Model as it processes a word sequence (either an RNN or Encoder Transformer).

It produces an output (or latent state) $\bar\h_\tp$ for each position $\tt$ of the sequence.

This is a *context sensitive* representation specific to input word $\mathbf{s}_\tp$ at position $\tt$.
- context sensitive because, depending on the architecture, it depends on either
    - the prefix $\mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(\tt-1)}$ (e.g., a uni-directional RNN)
    - the entire sequence $\mathbf{s}_{(1)}, \ldots, \mathbf{s}_{(\bar T)}$ (e.g., an Encoder Transformer)

These Context Sensitive Representations of words may be useful representations for downstream tasks
- Better than Word Embeddings, which have no context
- See the [ELMo paper](https://arxiv.org/abs/1802.05365)
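
To make the contrast with static Word Embeddings concrete, here is a toy sketch using NumPy. Everything in it is an assumption made for illustration: the tiny vocabulary, the dimensions, and the randomly initialized (untrained) weights of a uni-directional RNN. It is not ELMo and not a trained Language Model; it only shows that the same word receives different latent states when its prefix differs.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "river", "money", "bank"]
word_to_id = {w: i for i, w in enumerate(vocab)}

d_embed, d_hidden = 4, 3
E = 0.5 * rng.normal(size=(len(vocab), d_embed))   # static word embeddings
W_x = 0.5 * rng.normal(size=(d_embed, d_hidden))   # input-to-hidden weights
W_h = 0.5 * rng.normal(size=(d_hidden, d_hidden))  # hidden-to-hidden weights


def hidden_states(words):
    """Return one latent state per position of the sequence (simple tanh RNN)."""
    h = np.zeros(d_hidden)
    states = []
    for w in words:
        x = E[word_to_id[w]]
        h = np.tanh(x @ W_x + h @ W_h)  # depends on the word AND on the prefix
        states.append(h)
    return np.stack(states)


seq_a = ["the", "river", "bank"]
seq_b = ["the", "money", "bank"]
states_a = hidden_states(seq_a)
states_b = hidden_states(seq_b)

# The static embedding of "bank" is the same vector in both sentences ...
static_bank = E[word_to_id["bank"]]
# ... but the latent state at the position of "bank" differs with the prefix.
print("same latent state for 'bank':", np.allclose(states_a[-1], states_b[-1]))
```

The latent states at the first position ("the") are identical in both sequences because the prefixes are identical up to that point; they diverge only once the inputs differ.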

## Fine-Tuning



Logically, we use the process that we described as Transfer Learning
- where we use the output of some layer of the Pre-Trained model
    - default: all layers, excluding the Classification Head
- as a "meaningful" **fixed length** representation of input sequence $\x^\ip_{(1)}, \ldots, \x^\ip_{(m)}$
- which is then fed to a Classification head with the objective of matching the target $\y^\ip$

Recall the diagram from our module on [Transfer Learning](Transfer_Learning.ipynb)


<br>
<table>
    <tr>
        <th><center>Transfer Learning: replace the head of the pre-trained model</center></th>
    </tr>
    <tr>
        <td><img src="images/Transfer_Learning_2.jpg" width=60%></td>
    </tr>
</table>

The process is
- Import the Pre-Trained model (which was trained on a large number of examples from a broad domain)
- Fine-Tune the weights using a **small** number of examples for a **specific task** from a **narrow** domain.

Often, the specific task is Supervised (e.g., sentiment analysis).

In that case, we refer to the second step as *Supervised Fine-Tuning*.

### Example: Using a Pre-trained Language Model to analyze sentiment

This is a straightforward application of Transfer Learning
- Replace the Classification Head used for Language Modeling
    - e.g., a head that generated a probability distribution over words in the vocabulary
- By an un-trained Binary Classification head (Positive/Negative sentiment)
- Train on examples: pairs of
    - sentence
    - label: Positive/Negative
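
The head replacement can be sketched with NumPy alone. This is a deliberately simplified illustration under stated assumptions: the "pre-trained body" is a hypothetical stand-in (frozen random word vectors, mean-pooled into a fixed-length representation) rather than a real Language Model, the four training sentences are invented, and only the new head is trained. Full Fine-Tuning would also update the weights of the body.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pre-trained body: frozen word vectors that are
# mean-pooled into a fixed-length sentence representation.
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
word_to_id = {w: i for i, w in enumerate(vocab)}
E_frozen = rng.normal(size=(len(vocab), 8))


def body(sentence):
    ids = [word_to_id[w] for w in sentence.split()]
    return E_frozen[ids].mean(axis=0)


def head(X, w, b):
    """New binary classification head: probability of Positive sentiment."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def cross_entropy(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Invented labeled examples from the narrow domain: 1 = Positive, 0 = Negative
examples = [("good movie", 1), ("great plot", 1), ("bad movie", 0), ("awful plot", 0)]
X = np.stack([body(s) for s, _ in examples])
y = np.array([label for _, label in examples], dtype=float)

w, b = np.zeros(X.shape[1]), 0.0           # the un-trained head
loss_before = cross_entropy(head(X, w, b), y)

learning_rate = 0.1
for _ in range(500):                        # gradient descent on the head only
    error = head(X, w, b) - y               # gradient of the loss w.r.t. the logits
    w = w - learning_rate * X.T @ error / len(y)
    b = b - learning_rate * error.mean()

loss_after = cross_entropy(head(X, w, b), y)
print("loss before:", loss_before, "loss after:", loss_after)
```

Because the body is frozen here, the sketch corresponds to the first stage of Transfer Learning; un-freezing the body and continuing training with a small learning rate would be the analogue of Fine-Tuning all weights.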

# Language Models: the future (present?) of NLP?

Pre-trained Language Models (especially the *Large Language Models* that have been trained on massive amounts of data) seem to transfer well to other tasks via Supervised Fine-Tuning.

We call this paradigm "Unsupervised Pre-Trained Model + Supervised Fine-Tuning".

This paradigm means that we might not need to create a new model for a new task.

Instead: we transform our task into one amenable to the "Unsupervised Pre-Trained Model + Supervised Fine-Tuning" paradigm
- using a Language Model as our Pre-Trained Model

## Input Transformations

One impediment to using the paradigm is that
- the task-specific input
- is not the simple, unstructured sequence of words that characterizes the input for Language Modeling.

We need to apply *input transformations*
- to transform structured task-specific input
- to the unstructured sequence of words used in the Language Model task input

Here are common examples of tasks with structured inputs:
- Entailment
    - Input is a *pair* of text sequences $[ \text{Premise}, \text{Hypothesis} ]$
    - Binary classification: Does the Hypothesis logically follow from the Premise?
    
          Premise: The students are attending a lecture on Machine Learning
          Hypothesis: The students are sitting in a classroom
          Label: Entails
          

- Question answering
    - Consider a multiple-choice question consisting of
        - Context: a sentence or paragraph stating facts
        - Question
        - Answers: a set of possible answer sentences
    - Input

          Context: It is December of 2022.  Prof. Perry is teaching the second half of the Machine Learning Course.
          Question: Where are the students?
          Answer 1: The beach
          Answer 2: In a classroom in Brooklyn
          Answer 3: Dreaming of being elsewhere.

          Label: Answer 2 (95% probability), Answer 3 (4% probability)

- Similarity  
    - Input is a *pair* (or more) of text sequences $[ \text{First}, \text{Second}, \text{Third} ]$
    - Binary/Multinomial classification: what is the probability that each of the other sentences is similar to First?
    
    
        First: Machine Learning is easy not hard
        Second: Machine Learning is not difficult
        Third:  Machine Learning is hard not easy
        Label: [Second: .95, Third: .01 ]
        

To use the Pre-Trained LM + Fine-Tuning approach
- we need to convert the structured input into simple sequences.

See the [GPT paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) (*Improving Language Understanding by Generative Pre-Training*) for some transformations.

<table>
    <tr>
        <th><center>GPT: Task encoding</center></th>
    </tr>
    <tr>
        <td><img src="images/LM_GPT_task_encoding.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf</center></td>
    </tr>   
</table>


For example, for multiple-choice question answering
- Create a triple for each answer, 
    - [Context, Question, Answer 1], 
    - [Context, Question, Answer 2], ...
- Obtain a representation of each triple using an LM
    - using Delimiter tokens to separate elements of the triple
- Fine-tune using a new Multinomial classifier head
    - to obtain a probability distribution over answers

# Conclusion

Language Models are the basis for the paradigm of Unsupervised Pre-training + Supervised Fine-Tuning.

This has become the dominant paradigm in NLP.

The ability to train Large Language Models is due, in part, to the advantages of the Transformer
- Execution Parallelism: can run larger models than an RNN for the same amount of elapsed time
- This also facilitates the use of extremely large training datasets.
