In [1]:
%run Latex_macros.ipynb

# Multi-task learning

One area of recent interest is *multi-task learning*
- Training a model to implement multiple tasks

A model that implements a single task computes
$$\pr{\text{output | input}}$$

A model that implements several tasks computes
$$\pr{\text{output | input, task-id }} $$

When training a model for multiple tasks, the training examples would look something like:
$$\begin{array}{llll}
( \mathsf{Translate \; to \; French} , & \text{English text} , & & \text{French text}) \\
( \mathsf{Answer \; the \; question} , & \text{document} , & \text{question} , & \text{answer}) \\
\end{array}
$$
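
As a minimal sketch of how such multi-task examples could be flattened into plain text pairs for a Language Model, consider the following. The `Example` type, the `to_text_pair` helper, and the `" | "` separator are all illustrative assumptions, not a format prescribed by any particular model.

```python
from typing import NamedTuple, Tuple


class Example(NamedTuple):
    task: str                 # task identifier, expressed as text
    inputs: Tuple[str, ...]   # one or more input strings
    target: str               # desired output text


def to_text_pair(ex: Example) -> Tuple[str, str]:
    """Flatten (task, inputs..., target) into a (source text, target text) pair."""
    # Assumption: " | " is an arbitrary separator chosen for readability.
    source = " | ".join((ex.task,) + tuple(ex.inputs))
    return source, ex.target


examples = [
    Example("Translate to French", ("sea otter",), "loutre de mer"),
    Example(
        "Answer the question",
        ("Paris is the capital of France.", "What is the capital of France?"),
        "Paris",
    ),
]

pairs = [to_text_pair(ex) for ex in examples]
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Because every task is reduced to the same (text in, text out) shape, a single model can be trained on a mixture of tasks without per-task input pipelines.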

Text is almost a universal encoding so NLP is a natural way of expressing multiple tasks.

So a natural extension of a Language Model is to solve multiple tasks
- Encode your specific task as an input that can be handled by a Language Model
- That's one advantage of Byte Pair Encoding
    - No special per-task pre-processing needed for a task's training set

We will take the idea of Multi-task learning one step further
- Learning how to solve a task **without** explicitly training a model !

# Zero-shot learning: Learning to learn

We have presented the "Unsupervised Pre-Trained Model + Supervised Fine-Tuning" paradigm.

Considering that
- Language models seem to learn universal, task-independent language representations
- Text-to-text is a universal API for NLP tasks

We can raise the question
- Is Supervised Fine-Tuning even necessary ?
- Can a Language Model learn to solve 
a task *without having been trained on examples for the task* ?

There are some practical impediments to answering this question
- How does the LM "understand" that it is being asked to solve a particular task ?
- How does the LM "understand" the input-output relationship involved in the new task ?

The solution to both impediments is to present a block of text called a **prompt**
- that labels the new task
- that describes, in text, the relationship between inputs and outputs



For example, we can describe Translation between languages with the following prompt

    Translate English to French
    
    sea otter =>  loutre de mer
    
    peppermint => menthe poivrée
    
    plush giraffe => girafe peluche
    
    cheese => 

The prompt consists of
- an initial string describing the task: "Translate English to French"
- a number of examples
    - English input, French output, separated by a `=>`
- a new example **without** a target, representing a query to be solved

      cheese => 

The expectation is that the "next words" generated by the Language Modeling task
- are the translation of `cheese` into French
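
The structure of the prompt described above can be sketched as a small helper. The function name `build_prompt` and its parameters are hypothetical; the sketch only assembles the text and does not call any model.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    examples: List[Tuple[str, str]],
    query: str,
    sep: str = "=>",
) -> str:
    """Assemble: task description, labeled examples, then an unanswered query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} {sep} {target}")
    # The final line has no target: the model is expected to continue it.
    lines.append(f"{query} {sep}")
    return "\n".join(lines)


prompt = build_prompt(
    "Translate English to French",
    [
        ("sea otter", "loutre de mer"),
        ("peppermint", "menthe poivrée"),
        ("plush giraffe", "girafe peluche"),
    ],
    "cheese",
)
print(prompt)
```

The resulting string would be passed to the Language Model as its input text; whatever the model generates next is read as the answer to the query.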


Note that the labeled "examples" are given at *inference* time, **not** training time
- the model's weights are **not updated**
- the examples only condition the model into generating specific output


This paradigm has been called ["Pre-train, Prompt, Predict"](https://arxiv.org/pdf/2107.13586.pdf)


The terms used to describe this process depend on the number $k$ of labeled examples in the prompt:

- **Few shot learning**: $10 \le k \le 100$ typically
- **One shot learning**: $k = 1$
- **Zero shot learning**: $k = 0$
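
A small sketch of this naming convention follows. The function `shot_regime` is hypothetical, and labeling every $k \ge 2$ as "few-shot" is an assumption made here for simplicity: the text only says that 10 to 100 examples is typical.

```python
def shot_regime(k: int) -> str:
    """Name the prompting regime for a prompt containing k labeled examples."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if k == 0:
        return "zero-shot"
    if k == 1:
        return "one-shot"
    # Assumption: any k >= 2 is called few-shot here.
    return "few-shot"


labeled_pool = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]

for k in (0, 1, 3):
    chosen = labeled_pool[:k]  # the k examples that would be placed in the prompt
    print(k, shot_regime(k), chosen)
```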

A picture will help

<table>
    <tr>
        <th><center>Few/One/Zero shot learning</center></th>
    </tr>
    <tr>
        <td><img src="images/LM_Few_Shot_Training.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://arxiv.org/pdf/2005.14165.pdf</center></td>
    </tr>   
</table>


There is a whole literature on creating successful prompts: [Prompt engineering](https://arxiv.org/pdf/2107.13586.pdf)
- The goal is to provide enough context to condition the model to "understand"
    - That "Translate English to French" relates to some examples seen (implicitly) in training
    - and that the string `=>` suggests a relationship between the input and output
        - perhaps generalizing examples seen in training

[See Appendix G](https://arxiv.org/pdf/2005.14165.pdf#page=51) (pages 50+) for examples of prompts for many other tasks.

Is this even possible ?!  Learning a new task with **zero** examples ?

Let's look at the reported results from GPT-3, the third-generation GPT model.

<table>
    <tr>
        <th><center>Few/One/Zero shot learning: accuracy</center></th>
    </tr>
    <tr>
        <td><img src="images/LM_Few_Shot_Accuracy.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://arxiv.org/pdf/2005.14165.pdf</center></td>
    </tr>   
</table>


# Using zero-shot to create new applications

With a little cleverness, one can almost trivially create a new application using an LLM in zero-shot mode
- create the prefix of a prompt describing the task
- append the user input to the prefix to complete the prompt

Here we use [`ChatGPT`](https://chat.openai.com/chat) to create an app that summarizes a conversation
- we create a prompt with a "place-holder" (in braces `{..}`) for user input

`prompt = Summarize the following conversation: {user input}`

<img src="images/chatgpt_summarize_conversation_example.png" width=80%>

Here we use ChatGPT as a programming assistant

`prompt = Write a Python function that does the following: {task description}`

<img src="images/chatgpt_program_generation_example.png" width=80%>

# How is zero-shot learning possible ? Some theories

**Theory 1**

- The training set contains explicit instances of these out-of-sample tasks

**Theory 2**

- The super-large training sets contain *implicit* instances of these out-of-sample tasks
    - For example: an English-language article quoting a French speaker in French with English translation

One thing that jumps out from the accuracy graph above:
- Bigger models are more likely to exhibit meta-learning

**Theory 3**

The training sets are so big that the model "learns" to create groups of examples with a common theme
- Even with the large number of parameters, the model capacity does not suffice for example memorization


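Another thing to consider
- The behavior of an autoregressive model depends on *all* previous inputs
    - An RNN carries them in its memory (latent state, etc.); a Transformer such as GPT attends over every token in its context
    
So Few Shot Learning may work by "priming" this memory with task-specific information, even though the weights themselves are unchanged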
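
As a toy illustration of "priming", the sketch below runs a one-unit RNN with frozen weights over two different prefixes followed by the same query. The weights `W_H` and `W_X` and the inputs are made-up numbers, and GPT itself is a Transformer rather than an RNN, so this only illustrates the general idea that context, not weight updates, changes the model's behavior.

```python
import math

# Fixed "trained" weights (made-up values); they are never modified below.
W_H, W_X = 0.5, 1.0


def run_rnn(inputs, h=0.0):
    """Return the final latent state after feeding the inputs in order."""
    for x in inputs:
        h = math.tanh(W_H * h + W_X * x)
    return h


query = [0.3]                      # the same "question" in both cases
prefix_a = [1.0, 1.0]              # one set of in-context "examples"
prefix_b = [-1.0, -1.0]            # a different set of "examples"

state_a = run_rnn(prefix_a + query)
state_b = run_rnn(prefix_b + query)
print(state_a, state_b)
```

The two final states differ even though the query and the weights are identical: the prefix alone determines how the query is processed.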

# Social concerns

The team behind GPT is very concerned about potential misuse of Language Models.

To illustrate, they conducted an experiment in having a Language Model construct news articles
- Select title/subtitle of a genuine news article
- Have the Language Model complete the article from the title/subtitle
- Show humans the genuine and generated articles and ask them to judge whether the article was written by a human

<table>
    <tr>
        <th><center>Human accuracy in detecting model generated news articles</center></th>
    </tr>
    <tr>
        <td><img src="images/LM_GPT_model_generated_news.png" width=80%></td>
    </tr>
    <tr>
        <td><center>Picture from: https://arxiv.org/pdf/2005.14165.pdf</center></td>
    </tr>   
</table>

The bars show the range of accuracy across the 80 human judges.

- 86% accuracy detecting articles created by a deliberately bad model (the control)
- roughly 50% accuracy (close to chance) detecting articles created by the biggest models

It seems that humans might have difficulty distinguishing between genuine and generated articles.

The fear is that Language Models can be used
- to mislead
- to create offensive speech

