In [1]:
%run Latex_macros.ipynb

# Introduction

**References**
- [Nice overview](https://lightning.ai/pages/community/article/understanding-llama-adapters/#prompt-tuning-and-prefix-tuning)

The goal of Transfer Learning is to adapt a Pre-Trained model for a Source task
(the "base" model) to solve a new Target task.

Fine-Tuning (additional training with Target task-specific examples) is a common method of adaptation.

But *Prompt Engineering* can be used for adaptation as well
- creating a prompt that adapts the text-continuation ("predict the next") behavior of an LLM
- to produce a solution to the Target task



# Large Language Model as a Universal base model

Within the context of NLP tasks
- Text to text is a [Universal API](NLP_Universal_Model.ipynb#A-Universal-API:-Text-to-text)
    - The Target task's input and output can both be re-formatted into text
- Large Language Models (LLM) can be a Universal "base" Model
    - convert all Target tasks into instances of the LM "predict the next" task
- Eliminating the need for Target task specific "head" layers to be appended to the base model

An essential part
of the Universal API is converting an example of the Target task
- so that the text-continuation ("predict the next") task
- produces a solution to the Target Task

To illustrate, suppose we want to adapt our LLM base model to solve the task of Summarization.

A training example for the Summarization task might look like

    {PREFIX} {DOC} Summary: {SUMMARY}
    
where
- `{DOC}` and `{SUMMARY}` are placeholders for the features (i.e., document) and target/label (i.e., the summary)
- `{PREFIX}` are *instructions* for the summarization task. For example
    - `Produce a one paragraph summary of the following: `

We refer to 
- the text up to and including the `Summary: ` as the *prompt*
    - the features of the converted example
- the remainder of the text (i.e., `{SUMMARY}`) as the *continuation*
    - the target of the converted example

The features for a test example (i.e., a request to summarize) would be the prompt without a continuation

     {PREFIX} {DOC} Summary: 

We would hope that the LLM's completion (continuation) of this prompt would be the target `{SUMMARY}`
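
The conversion described above can be sketched with ordinary string formatting. This is only an illustration: the function names (`make_prompt`, `make_training_example`) and the sample document and summary are made up for this example.

```python
# Illustrative sketch: converting a Summarization example into text-to-text form.
# The function names and sample strings are assumptions, not part of any library.
PREFIX = "Produce a one paragraph summary of the following: "

def make_prompt(doc: str, prefix: str = PREFIX) -> str:
    """The features: all text up to and including 'Summary: '."""
    return f"{prefix}{doc} Summary: "

def make_training_example(doc: str, summary: str, prefix: str = PREFIX) -> str:
    """The prompt followed by the continuation (the target)."""
    return make_prompt(doc, prefix) + summary

doc = "The cat sat on the mat all day."
summary = "A cat rested on a mat."

train_text = make_training_example(doc, summary)  # training example
test_prompt = make_prompt(doc)                    # test example: no continuation

print(train_text)
print(test_prompt)
```

The training text is the test prompt plus the target, so at inference time the LLM is asked to supply exactly the part that was present during training.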

This representation of the relationship between features and target for the Target task
- adapts the LLM
- by causing it to compute
$$
\pr{ \text{\{TARGET\}} \, | \, \text{\{PREFIX\} \{DOC\} } }
$$
- rather than the native LLM objective
$$
\pr{ \text{\{TARGET\}} \, | \, \text{ \{DOC\} } }
$$

That is:
- `{PREFIX}` *conditions* the LLM to product a continuation
- that satisfies the Target Task

# Prompt Design/Prompt Engineering via Tuning

A base model may be adapted to solve a Target Task *without fine-tuning*
- using Prompt Engineering
- crafting a prompt
    - that conditions the LLM text-continuation behavior
    - to produce output consistent with a Target Task

This is *parameter efficient* in that **no** existing parameters are changed, nor are any added.

The conditioning prompt usually consists of
- detailed "instructions"
- exemplars: examples of the input/output relationship for the Target task

In the above Summarization task example,
we could imagine various choices for the instructions `{PREFIX}`

- `Summarize the following article: [SEP] `
- `Produce a summary of: [SEP] `
- `A "summary" has the following properties ... Create a summary of: [SEP]`
- Exemplars: a number of `{DOC}`:`{SUMMARY}` pairs

Does it matter which we choose ?
- the last two, being more specific, might be preferable
- but at the cost of using a greater fraction of the LLM's fixed maximum Context length
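
The trade-off between specificity and Context length can be made concrete with a rough sketch. The helper names, the exemplar texts, and the value of `MAX_CONTEXT` are all hypothetical, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
# Illustrative sketch: a longer conditioning prefix leaves less room for {DOC}.
# All names and numbers here are assumptions made for the example.

def build_few_shot_prefix(instruction, exemplars):
    """Instruction followed by {DOC}/{SUMMARY} exemplar pairs."""
    parts = [instruction]
    for doc, summary in exemplars:
        parts.append(f"{doc} Summary: {summary}")
    return " ".join(parts) + " "

def rough_token_count(text):
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())

instruction = "Summarize the following article:"
exemplars = [
    ("Rain fell all week in the city.", "A rainy week."),
    ("The team won its final game.", "The team won."),
]

short_prefix = instruction + " "
long_prefix = build_few_shot_prefix(instruction, exemplars)

MAX_CONTEXT = 50  # hypothetical context budget, in "tokens"
for name, prefix in [("instruction only", short_prefix), ("with exemplars", long_prefix)]:
    used = rough_token_count(prefix)
    print(f"{name}: prefix uses {used}, leaving {MAX_CONTEXT - used} for the document")
```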

*Prompt engineering* (*prompt design*) is the "art" of constructing prompts
in order to get an LLM to solve a task.

It is an *inference time* technique
- does not modify parameters of base model
- in contrast to Fine Tuning

This has been treated more as an art ("GPT Whisperer") than a science.
- rules of thumb, without scientific validation

## Hard prompt tuning

We can recast prompt design as a formal task.

One can imagine `{PREFIX}` as a sequence of token *variables* 

$$\langle \text{TOK}_1\rangle, \ldots, \langle \text{TOK}_p\rangle$$

We can evaluate the quality of the prefix
- as measured through the evaluation of Performance Metric $\mathcal{M}$
- on out-of-sample examples $\tilde \X$ from the Target task

Prompt design can be viewed as an optimization task
- finding the optimal tokens in the `{PREFIX}`
- we treat each token variable $\langle \text{TOK}_\tt\rangle$ as a *parameter* to be solved for

$$
\text{PREFIX}^* = \argmax{\text{PREFIX}} \mathcal{M}_{\x \in \tilde{\X}}  \left( \, \pr{\y \, | \, \text{PREFIX}, \x} \,  \right)
$$

Because tokens are discrete (hard) values, we refer to this method as
 *Hard Prompt Tuning*
- optimizing the prompt at the *token* level

The fact that tokens are discrete (rather than continuous) values means
- we can't differentiate with respect to token values
- so can't optimize by Gradient Descent


Without differentiability, hard prompt tuning may devolve to
- an exhaustive (but finite) search for the optimal `{PREFIX}` sequence
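
A minimal sketch of such an exhaustive search follows. The vocabulary and the scoring function `toy_metric` are hypothetical stand-ins. In practice, evaluating $\mathcal{M}$ for one candidate prefix means running the LLM on the out-of-sample examples, which is far more expensive than the hand-made score used here.

```python
# Illustrative sketch: exhaustive search over discrete prefixes of length p.
# VOCAB and toy_metric are assumptions; a real metric would query the LLM.
from itertools import product

VOCAB = ["summarize", "translate", "the", "article", "briefly"]

def toy_metric(prefix):
    """Hand-made stand-in for the Performance Metric M of a candidate prefix."""
    score = 0.0
    if prefix[0] == "summarize":
        score += 1.0
    if "article" in prefix:
        score += 0.5
    if "translate" in prefix:
        score -= 1.0
    return score

def exhaustive_prefix_search(vocab, p, metric):
    """Try every length-p token sequence; keep the best-scoring one."""
    best_prefix, best_score = None, float("-inf")
    for candidate in product(vocab, repeat=p):
        score = metric(candidate)
        if score > best_score:
            best_prefix, best_score = candidate, score
    return best_prefix, best_score

p = 2
best_prefix, best_score = exhaustive_prefix_search(VOCAB, p, toy_metric)
num_candidates = len(VOCAB) ** p  # the search space grows as |V|^p

print(best_prefix, best_score, num_candidates)
```

The number of candidates is $|\V|^p$, so the search is finite but quickly becomes impractical for a realistic vocabulary and prefix length.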

<table>
    <center><strong>Discrete Prompt Search</strong></center>
    <img src="images/PTuning_diagram_discrete.png" width=40%>
    <br>
    Attribution: https://arxiv.org/pdf/2103.10385.pdf#page=3
</table>

# Soft prompt tuning; Prefix tuning

**References**
- [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
- [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)

The problem with Hard Prompt Tuning is that we can't differentiate with respect to the tokens.

But
- an examination of the Transformer architecture
- which we will typically use to solve our NLP tasks
- offers a simple solution


Almost immediately
- the Transformer changes the discrete token values
- into *continuous* vectors: the Embeddings

<br>
<table>
    <tr>
        <th><center>Transformer (Encoder/Decoder)</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_is_all_u_need_Transformer.png" width=30%></td>
    </tr>
</table>

The Input Embedding layer
- maps a token
    - represented as a one-hot encoded (OHE) vector of length $| \V |$, indicating the index within Vocabulary $\V$
- to an "embedding" vector of length $d$ (the internal dimension of the output of all layers in the Transformer)  

This mapping is (conceptually) implemented via a matrix
- of size $(|\V| \times d)$
- whose elements are *parameters* that are solved for during Gradient Descent

Denoting the embedding of token $\x_\tt$ as $e( \x_\tt )$
- the Input Embedding layer transforms the input sequence of *discrete* token values
$$
\x_{(0)}, \ldots, \x_{(\bar T)}
$$
to a sequence of *continuous* embedding values
$$
e(\x_{(0)}), \ldots, e(\x_{(\bar T)})
$$
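
The following sketch shows why the matrix view is "conceptual": multiplying a OHE vector by the embedding matrix selects a single row, so an implementation can simply look the row up. The tiny vocabulary and the matrix values are made up for illustration.

```python
# Illustrative sketch: OHE vector times embedding matrix equals a row lookup.
# The vocabulary and matrix values are arbitrary toy assumptions.
VOCAB = ["<pad>", "summarize", "the", "article"]
d = 3  # embedding dimension

# Embedding matrix of size (|V| x d); in a real model these are learned parameters.
E = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(index):
    """Conceptual view: (1 x |V|) OHE vector times (|V| x d) matrix."""
    ohe = one_hot(index, len(E))
    return [sum(ohe[i] * E[i][j] for i in range(len(E))) for j in range(d)]

def embed_by_lookup(index):
    """Practical view: select row `index` of the matrix."""
    return list(E[index])

tokens = ["summarize", "the", "article"]
ids = [VOCAB.index(t) for t in tokens]
embedded = [embed_by_lookup(i) for i in ids]  # sequence of continuous vectors

print(embedded)
```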

So
- rather than adding a prefix of tokens
$$
\langle \text{TOK}_1\rangle, \ldots, \langle \text{TOK}_p\rangle 
$$
- to input $\x$
- resulting in input 
$$
\langle \text{TOK}_1\rangle, \ldots, \langle \text{TOK}_p\rangle \, \, \x
$$

We add 
- a prefix of embeddings
$$\tilde{e}_{(1)}, \ldots, \tilde{e}_{(p)} $$
- to the sequence that is the embedding of sequence $\x$

$$
 e(\x_{(0)}), \ldots, e(\x_{(\bar T)})
$$
- resulting in the output of the Embedding layer producing
$$
\tilde{e}_{(1)}, \ldots, \tilde{e}_{(p)} \, \, e(\x_{(0)}), \ldots, e(\x_{(\bar T)})
$$

$$\tilde{e}_{(1)}, \ldots, \tilde{e}_{(p)} $$
- are called *pseudo tokens*
- they are embedding vectors
- that *don't necessarily correspond* to any true token value in the vocabulary
    - just vectors of length $d$

Most significantly
- they are *parameters*
- that are solved for by Gradient Descent !

Problem solved 
- find the optimal prefix of embeddings
- rather than the optimal prefix of tokens
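
As a rough sketch of the mechanics, pre-pending pseudo tokens is a concatenation at the output of the Embedding layer. The dimensions, the matrix values, and the initialization of `soft_prefix` below are arbitrary assumptions chosen for illustration.

```python
# Illustrative sketch: pre-pend p free vectors to the embedded input sequence.
# All sizes and values are toy assumptions.
import random

random.seed(0)  # reproducible toy values

d = 4           # embedding dimension
p = 2           # number of pseudo tokens
vocab_size = 5

# Input Embedding matrix of size (|V| x d) belonging to the base model.
embedding_matrix = [[float(i + j) for j in range(d)] for i in range(vocab_size)]

def embed(token_ids):
    """Embedding lookup for true tokens."""
    return [list(embedding_matrix[i]) for i in token_ids]

# Pseudo tokens: vectors of length d that need not match any row of the matrix.
soft_prefix = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(p)]

x_ids = [3, 1, 4]                      # token indices of the input x
sequence = soft_prefix + embed(x_ids)  # what the first Transformer block receives

print(len(sequence), "vectors, each of length", d)
```

The rest of the model sees only a sequence of $d$-dimensional vectors and cannot tell which positions came from true tokens.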

Since the embedding of pseudo tokens does not have to be human-readable
- we can use a very small number of them
- we can place them anywhere in $\x$, not just as a prefix
- the special case where the pseudo tokens are restricted to a prefix of $\x$ is called *prefix tuning*

In effect: the embeddings of pseudo tokens
- represent instructions to perform the Target task
- written in non-human language

<table>
    <center><strong>Discrete Prompt Search</strong></center>
    <img src="images/PTuning_diagram.png" width=90%>
    <br>
    Attribution: https://arxiv.org/pdf/2103.10385.pdf#page=3
</table>

During Soft Prompt Tuning
- we use a small number of Target task examples
- to learn the embeddings for the pseudo tokens
- keeping the embeddings of non-pseudo tokens and all other weights of the base model frozen

Since only the embeddings of the new pseudo tokens are learned
- **all** Target task specific information from the Fine-Tuning Target training dataset
- is encoded in the new embeddings
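
The training procedure can be sketched on a deliberately tiny stand-in for the base model. Everything here is an assumption made for illustration: the "model" is a fixed linear read-out of the mean embedding rather than a Transformer, the data are invented, and the gradient is written by hand. The point is only that Gradient Descent updates the prefix while all other parameters stay fixed.

```python
# Illustrative sketch: learn a soft prefix by Gradient Descent, base model frozen.
# The toy "model" (linear read-out of the mean embedding) is an assumption.
d, p = 2, 1                                # toy embedding size, prefix length
w = [1.0, -1.0]                            # frozen base model weights
table = {0: [1.0, 0.0], 1: [0.0, 1.0]}     # frozen token embeddings
frozen_snapshot = (list(w), {k: list(v) for k, v in table.items()})

def predict(prefix, x_ids):
    """Score = w . mean of the sequence [prefix ; embeddings of x]."""
    seq = prefix + [table[i] for i in x_ids]
    mean = [sum(v[j] for v in seq) / len(seq) for j in range(d)]
    return sum(w[j] * mean[j] for j in range(d)), len(seq)

# A small number of (hypothetical) Target task examples: (token ids, target).
data = [([0, 1], 1.0), ([0, 0], 2.0)]

def mse(prefix):
    return sum((predict(prefix, x)[0] - y) ** 2 for x, y in data) / len(data)

prefix = [[0.0] * d for _ in range(p)]     # the only trainable parameters
initial_loss = mse(prefix)

lr = 0.5
for _ in range(200):
    grads = [[0.0] * d for _ in range(p)]
    for x_ids, y in data:
        pred, n = predict(prefix, x_ids)
        for k in range(p):
            for j in range(d):
                # d(pred)/d(prefix[k][j]) = w[j] / n for this toy model
                grads[k][j] += 2.0 * (pred - y) * w[j] / n / len(data)
    for k in range(p):
        for j in range(d):
            prefix[k][j] -= lr * grads[k][j]

final_loss = mse(prefix)
print(initial_loss, final_loss)
```

With a real LLM the gradient would come from automatic differentiation, but the division of labor is the same: the loss is back-propagated through the frozen model to the prefix vectors.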

## Soft prompt tuning: refinements

Recall
- the embeddings of pseudo tokens act as a kind of "instruction" to perform the Target task
- Transformer blocks are stacked in many models
    - thus, there is an embedding in each Transformer block in the stack

Our initial description of prompt tuning created pseudo tokens only at the first block in the stack
of Transformer blocks.

Different methods have been tried to add embeddings at the pseudo token positions
at *other* blocks in the stack.

One reference [Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)
learns embeddings corresponding to the positions of pseudo tokens
- at *every* level of the stack

Another reference ([LLaMA-Adapter](https://arxiv.org/pdf/2303.16199.pdf#page=3))
learns embeddings corresponding to the positions of pseudo tokens
- only at the *top-most* levels of the stack
- perhaps consistent with the result of removing spans of Adapters reported in the Adapter section of the module on [Parameter Efficient Fine Tuning](ParameterEfficient_TransferLearning.ipynb)
    - adaptation is most influential at the *top* levels of the stack
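
A back-of-the-envelope sketch of how the placement choice affects the number of learned parameters is given below. It rests on a simplifying assumption: one vector of length $d$ per pseudo token position per adapted block. The cited methods differ in detail, so the counts are only indicative, and the values of `p`, `d`, and `num_layers` are hypothetical.

```python
# Illustrative sketch: learned parameters under different placements of the
# pseudo token embeddings. Assumes one length-d vector per pseudo token
# position per adapted block; the cited methods differ in detail.

def soft_prompt_param_count(p, d, adapted_layers):
    return p * d * adapted_layers

p, d, num_layers = 10, 1024, 24  # hypothetical sizes

input_only = soft_prompt_param_count(p, d, 1)            # first block only
every_layer = soft_prompt_param_count(p, d, num_layers)  # every block
top_four = soft_prompt_param_count(p, d, 4)              # top-most 4 blocks

print(input_only, every_layer, top_four)
```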

# Results: Adaptation via prompts

## Space efficiency

Suppose we have 3 Target tasks: A, B, C.

Fine-Tuning (*model tuning*) each results in 3 copies of the large base model.

In contrast, since the base model is shared in Prompt Tuning
- We can *separately* learn embeddings for pseudo tokens for each of the 3 tasks
- Place the embeddings for each within the Input Embedding
    - e.g., as rows of the Embedding matrix
- to solve the 3 tasks in a single instance of the base model
- by pre-pending the prefix for the appropriate task to each inference-time example
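
The storage arithmetic and the inference-time dispatch can be sketched as follows. The base model size, `p`, and `d` are hypothetical, and the per-task prefixes are kept in a dictionary for simplicity rather than as extra rows of the Embedding matrix.

```python
# Illustrative sketch: one shared base model plus a small prefix per task.
# BASE_PARAMS, p and d are hypothetical values chosen for the example.
BASE_PARAMS = 1_000_000_000
p, d = 20, 1024
tasks = ["A", "B", "C"]

# Fine-Tuning: one full copy of the base model per task.
fine_tuning_total = len(tasks) * BASE_PARAMS

# Prompt Tuning: one shared base model plus p*d prefix parameters per task.
prompt_tuning_total = BASE_PARAMS + len(tasks) * p * d

# Separately learned prefixes (zeros here as placeholders for learned values).
task_prefixes = {t: [[0.0] * d for _ in range(p)] for t in tasks}

def prepare_input(task, x_embeddings):
    """Pre-pend the prefix of the requested task to an embedded example."""
    return task_prefixes[task] + x_embeddings

example = [[1.0] * d for _ in range(5)]  # an embedded input of 5 tokens
batch = [prepare_input(t, example) for t in tasks]

print(fine_tuning_total, prompt_tuning_total)
```

Because only the prefix differs, examples from different tasks can pass through the same instance of the base model.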

<table>
    <center><strong>Adaptation via prompts</strong></center>
    <img src="images/PEFT_Scale_compare_process.png" width=70%>
    <br>
    Attribution: https://arxiv.org/pdf/2104.08691.pdf#page=2
</table>

# Performance of various forms of adaptation

The following table compares various forms of adaptation
- Fine-tuning (model tuning)
- Adapter
    - see the module on [Parameter Efficient Fine Tuning](ParameterEfficient_TransferLearning.ipynb)
- Prefix Tuning

The number in parentheses next to the name of the adaptation is
- the size of the adapted parameters as a fraction of base model parameters.
- note that for all metrics except TER, a bigger performance number is better

We can see that Prefix Tuning
- using only a small number of adapted parameters ($0.1 \%$ of base model parameters)
- performs similarly *or better* than full Fine-Tuning for many tasks
    - evaluated on base models which are the Medium and Large variants of GPT-2


<table>
    <center><strong>Performance, by method of adaptation</strong></center>
    <img src="images/PrefTuning_compare.png">
    <br>
    <center>n.b., for the TER metric: smaller is better</center>
    <br>
    Attribution: https://arxiv.org/pdf/2101.00190.pdf#page=7
    
</table>

## Prefix length

How long does the prefix need to be ?
- how many pseudo tokens in the prompt

The results of several experiments show
- a small number (10) of pseudo tokens achieves most of the performance
- hence, the number of Target task specific parameters does not need to be large

<table>
    <center><strong>Effect of Prefix Length on Adaptation via Prefix Tuning</strong></center>
    <img src="images/PrefTuning_length.png" width=70%>
    <br>
    <center>n.b., for the TER metric: smaller is better</center>
    <br>
    Attribution: https://arxiv.org/pdf/2101.00190.pdf#page=8
    
</table>

## Performance as a function of base model size

The general ordering of adapted models, from best to worst is
- Fine-tuning (model tuning)
- Prompt tuning
- Prompt Design (Prompt Engineering)

*However*: the gap between Model Tuning and Prompt Tuning *disappears* as we use larger base models.

<table>
    <center><strong>Adaptation by base model size</strong></center>
    <img src="images/PEFT_Scale_compare_results.png">
    <br>
    Attribution: https://arxiv.org/pdf/2104.08691.pdf#page=1
</table>

In [2]:
print("Done")

Done
