In [1]:
%run Latex_macros.ipynb


**References**
- [SELF-Instruct paper](https://arxiv.org/pdf/2212.10560.pdf)
- [Self-Alignment with Instruction Backtranslation](https://arxiv.org/pdf/2308.06259.pdf)
- [Large Language Models can Self-Improve](https://arxiv.org/pdf/2210.11610.pdf)

# Using an LLM to generate Instruction Following examples

In the module on [Instruction Following](LLM_Instruction_Following.ipynb)
- we motivated Fine-Tuning an LLM
- so that it exhibits Instruction Following behavior

Recall: an example of Instruction Following behavior is a triple, for example
- Instruction: Tell me the word that is the opposite of the word that I input
- Input: Stop
- Response: Go

The Instruction describes the task to be accomplished
- it states the relationship between Input and Response
- the Input/Response pair is an exemplar for this task
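As a minimal sketch, the triple can be held in a small record type. The class name `InstructionExample` and the helper `render` are hypothetical names chosen for illustration; they do not come from any of the referenced papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionExample:
    """One Instruction Following example (hypothetical container)."""
    instruction: str  # describes the task
    input: str        # the Input the task operates on (may be empty)
    response: str     # the expected Response


def render(example: InstructionExample) -> str:
    """Render the triple as plain text, one field per line."""
    lines = [f"Instruction: {example.instruction}"]
    if example.input:
        lines.append(f"Input: {example.input}")
    lines.append(f"Response: {example.response}")
    return "\n".join(lines)


opposite = InstructionExample(
    instruction="Tell me the word that is the opposite of the word that I input",
    input="Stop",
    response="Go",
)
print(render(opposite))
```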

In this module, we explore methods
- to generate these fine-tuning examples
- to improve examples

# Using an LLM to generate Instruction Following examples

[SELF-Instruct paper](https://arxiv.org/pdf/2212.10560.pdf)

Is there an alternative to the labor-intensive process of having humans construct Instruction Following examples?

The idea of the [SELF-Instruct paper](https://arxiv.org/pdf/2212.10560.pdf)
is to use a Synthetic Data approach to constructing new examples of Instruction Following

These examples are pairs of an Instruction part, and a Target Output part.

The authors
- use a *few-shot* learning approach to generate *synthetic* Instruction Following examples
- augmenting a small number of human-constructed examples with the synthetic examples
- using the augmented dataset to Fine Tune an LLM to better demonstrate Instruction Following

<img src="images/selfinstruct_process.png">

Attribution: https://arxiv.org/pdf/2212.10560.pdf#page=2

The process involves multiple steps which we explain below.

## Generating the Instruction part of an Instruction-Output example

The first step is to use few-shot learning to generate synthetic Instructions
- the Instruction part of an Instruction-Target Output example

The synthetic Instructions are used to augment a small number of Instructions from the manually generated training dataset.

Recall: few-shot learning involves creating a prompt that is the concatenation of
- a few exemplars ($\langle \x, \y \rangle$ pairs demonstrating the task)
- an example with no label: $\x$

Here is a template for a prompt demonstrating to GPT how to create a new Instruction 

<img src="images/selfinstruct_task_generation_prompts.png" width=90%>

Attribution: https://arxiv.org/pdf/2212.10560.pdf#page=15
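A minimal sketch of assembling such a few-shot prompt follows. The function name `build_instruction_prompt` and the exact header wording are assumptions made for illustration; the figure above shows the wording used by the authors. The prompt ends with an open, numbered slot that the LLM is expected to complete with a new Instruction.

```python
def build_instruction_prompt(seed_instructions: list[str]) -> str:
    """Concatenate existing Instructions and leave the next slot open.

    The header text is illustrative, not guaranteed to match the paper.
    """
    lines = ["Come up with a series of tasks:"]
    for i, instruction in enumerate(seed_instructions, start=1):
        lines.append(f"Task {i}: {instruction}")
    # The unlabeled slot: the LLM continues from here.
    lines.append(f"Task {len(seed_instructions) + 1}:")
    return "\n".join(lines)


seed = [
    "Classify the sentiment of the sentence into positive, negative, or mixed",
    "Tell me if the following email is a promotion email or not.",
]
instruction_prompt = build_instruction_prompt(seed)
print(instruction_prompt)
```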

## Generating the Output part, given an Instruction

The next step is to
- choose an Instruction (called the *Target task*) from the augmented list of Instructions
- prompt the LLM to generate the Target Output for the Target task

The prompting for the output is achieved by few-shot learning.
- Provide $k$ exemplars
- Followed by a line consisting of
    - the Instruction for the Target task
    - with the expectation that the LLM will create an Input/Output pair
        - that obeys the Instruction
        - that correctly relates the Input and the Output
        

Each exemplar is an Instruction Following example for some other task.

That is, it is an Instruction-Target Output pair.
   

For Classification tasks, the prompt might look like this

    Task: Classify the sentiment of the sentence into positive, negative, or mixed
    
    Example 1
    Sentence: I enjoy the flavor of the restaurant but their service is too slow.
    Class Label: mixed
    
    Example 2
    Sentence: I had a great day today. The weather was beautiful and I spent time with friends.
    Class label: Positive
    
    
    Task: Tell me if the following email is a promotion email or not.
    
    Email: Check out our amazing new sale! We’ve got discounts on all of your favorite products.
    Class label: Promotion

    Email: We hope you are doing well. Let us know if you need any help.
    Class label: Not Promotion
    
    Task: {instruction for the target task}

The last line above contains a placeholder for the Instruction of the Target Task
- the one for which we want the LLM to create a Target Output
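The prompt layout above can be sketched in code. The helper names `render_exemplar` and `build_instance_prompt` are hypothetical, and the layout simply mirrors the text example above; it is an illustration rather than the authors' implementation.

```python
def render_exemplar(task: str, input_name: str,
                    instances: list[tuple[str, str]]) -> str:
    """Render one exemplar task: the Instruction, then its (input, label) pairs."""
    lines = [f"Task: {task}", ""]
    for text, label in instances:
        lines.append(f"{input_name}: {text}")
        lines.append(f"Class label: {label}")
        lines.append("")
    return "\n".join(lines)


def build_instance_prompt(exemplars: list[str], target_instruction: str) -> str:
    """Concatenate the k rendered exemplars, then the Target task's Instruction."""
    return "\n".join(exemplars + [f"Task: {target_instruction}"])


email_task = render_exemplar(
    "Tell me if the following email is a promotion email or not.",
    "Email",
    [
        ("Check out our amazing new sale!", "Promotion"),
        ("We hope you are doing well.", "Not Promotion"),
    ],
)
instance_prompt = build_instance_prompt(
    [email_task], "Decide whether the review is spam or not."
)
print(instance_prompt)
```

The LLM's continuation after the final `Task:` line is then parsed into the Input/Output pair for the Target task.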

Here is an example of the template from the paper

<img src="images/selfinstruct_generated_instances.png">

Attribution: https://arxiv.org/pdf/2212.10560.pdf#page=16

### Generating examples for Classification tasks

Consider an Instruction Following example for a Classification task

    Task: Classify the sentiment of the sentence into positive, negative, or mixed
    
    Example 1
    Sentence: I enjoy the flavor of the restaurant but their service is too slow.
    Class Label: mixed
 


The authors found that, when the LLM generated Classification examples,
- the Class Labels of the generated examples
- were not well-distributed among all possible labels

This was attributed to the *format* of the example, called *input-first*
- Additional Input
- precedes Target Output (e.g., `Class Label:`)

When the format was changed to *output-first*
- Target Output
- precedes Additional Input

the Classification examples generated had Class Labels that were less biased toward one label


    Task: Classify the sentiment of the sentence into positive, negative, or mixed

    Example 1
    Class Label: mixed
    Sentence: I enjoy the flavor of the restaurant but their service is too slow.

    Example 2
    Class label: Positive
    Sentence: I had a great day today. The weather was beautiful and I spent time with friends.
        


This is an example of Prompt Engineering
- In-context learning seems very sensitive to the format of prompts
- There is a skill of engineering a prompt to elicit the desired behavior

This feels similar to the idea behind Chain of Thought prompting
- by presenting `Class Label` first
- the model seems better conditioned to generate a less biased distribution of labels
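The difference between the two formats is only the order of the two fields. A minimal sketch, using a hypothetical helper `format_instance` with an assumed `output_first` flag:

```python
def format_instance(input_name: str, text: str, label: str,
                    output_first: bool = True) -> str:
    """Render one classification instance in output-first or input-first order."""
    input_line = f"{input_name}: {text}"
    label_line = f"Class label: {label}"
    ordered = [label_line, input_line] if output_first else [input_line, label_line]
    return "\n".join(ordered)


sentence = "I enjoy the flavor of the restaurant but their service is too slow."
output_first = format_instance("Sentence", sentence, "mixed", output_first=True)
input_first = format_instance("Sentence", sentence, "mixed", output_first=False)
print(output_first)
```

With output-first exemplars, the LLM is conditioned to write the label before it writes the input, so the input it generates is conditioned on the label it has already written.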

# Instruction Backtranslation

*Backtranslation* is a method
- to generate an instance of Instruction Following behavior
$$\langle\x, \y \rangle =  \langle \text{Instruction}, \text{Response} \rangle$$
- **given** only the $\text{Response}$
- using an LLM to create the $\text{Instruction}$

The essential idea is 
- given a *small* "seed"  of instruction/response pairs
$$\langle\x, \y \rangle =  \langle \text{Instruction}, \text{Response} \rangle$$
- create an inverse dataset of response/instruction pairs by reversing the features and targets
$$\langle\y, \x \rangle =  \langle  \text{Response}, \text{Instruction} \rangle$$
- fine-tune an LLM to predict $\text{Instruction}$ from $\text{Response}$
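Reversing the roles of features and targets is a one-line transformation. The sketch below uses hypothetical names (`invert_pairs`, `to_finetuning_records`), and the `prompt`/`completion` record keys are an assumed layout, not the format of any particular fine-tuning API.

```python
Pair = tuple[str, str]


def invert_pairs(seed: list[Pair]) -> list[Pair]:
    """Turn (Instruction, Response) pairs into (Response, Instruction) pairs."""
    return [(response, instruction) for instruction, response in seed]


def to_finetuning_records(seed: list[Pair]) -> list[dict[str, str]]:
    """Build records in which the Response is the feature and the Instruction is the target."""
    return [
        {"prompt": response, "completion": instruction}
        for response, instruction in invert_pairs(seed)
    ]


seed_pairs = [
    ("Tell me the opposite of the word Stop", "Go"),
    ("Name the capital of France", "Paris"),
]
records = to_finetuning_records(seed_pairs)
print(records[0])
```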

The Fine-Tuned model can be used to
- create a task description (Instruction) describing the task
- given demonstrations (Response) of a task

The results of using the Fine-Tuned model to create new examples
- can be used as synthetic examples
- to augment the seed examples

One can then iteratively augment the examples further
- by using Step $i$ augmented data
- as the "seed" to create Step $(i+1)$ augmented data

Here is the workflow:

<table>
    <center><strong>Instruction Backtranslation</strong></center>
    <tr>
        <img src="images/instruction_backtranslation.png" width=70%>
    </tr>
    
Attribution: https://arxiv.org/pdf/2308.06259.pdf#page=2
</table>

## Selecting the best synthetic examples for augmentation

The quality of the synthetic examples created at each step may not be uniformly high.

It would be desirable to select only the best examples to use in augmenting the seed examples of each Step.

How can we rate the quality of a synthetic example?

Ask the LLM to do it for you!

The following prompt requests that the LLM evaluate the
synthetic example using a rating scale of $1$ (low quality) to $5$ (high quality)

<table>
    <center><strong>Instruction Backtranslation Curation</strong></center>
    <tr>
        <img src="images/instruction_backtranslation_curating.png" width=70%>
    </tr>
    
Attribution: https://arxiv.org/pdf/2308.06259.pdf#page=4
</table>
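Once the LLM has produced its ratings, curation reduces to parsing each rating and filtering on it. The sketch below assumes the LLM's reply contains a line of the form `Score: N`; the names `parse_score` and `curate`, the threshold `min_score`, and the stand-in rater `fake_rate` are all hypothetical.

```python
import re
from typing import Callable, Optional

Pair = tuple[str, str]  # (Instruction, Response)


def parse_score(reply: str) -> Optional[int]:
    """Return the last 'Score: N' rating (1 to 5) found in the reply, if any."""
    matches = re.findall(r"Score:\s*([1-5])\b", reply)
    return int(matches[-1]) if matches else None


def curate(examples: list[Pair], rate: Callable[[Pair], str],
           min_score: int = 5) -> list[Pair]:
    """Keep only the examples that the rater scores at or above min_score."""
    kept = []
    for example in examples:
        score = parse_score(rate(example))
        if score is not None and score >= min_score:
            kept.append(example)
    return kept


def fake_rate(example: Pair) -> str:
    """Stand-in for the LLM call: longer responses get a higher rating."""
    _, response = example
    score = 5 if len(response.split()) >= 3 else 2
    return f"Brief reasoning would go here.\nScore: {score}"


candidates = [
    ("Give the opposite of the word stop.", "The opposite is go."),
    ("Say something.", "Ok"),
]
curated = curate(candidates, fake_rate, min_score=4)
print(curated)
```

Examples whose rating cannot be parsed are dropped rather than kept, which is a conservative design choice for a training set.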

# Automatic Prompt Engineering (APE)

The Automatic Prompt Engineer (APE) is a system to *improve* upon prompts
- given a prompt
- APE will create a prompt that is *more effective*

It uses an LLM
- to create variations of the given prompt
- evaluate which variation is best

One use of APE is
- to create an *instruction* describing a task
- given exemplars for a task (the input/output mapping for the task)

So, we might use APE to create synthetic examples for Instruction Following
- conditional on having only instances of input/output pairs for the task
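The "evaluate which variation is best" step can be sketched as scoring each candidate instruction on the exemplars and keeping the top scorer. Exact-match accuracy is an assumed scoring rule chosen for simplicity; `toy_model` is a stand-in for a real LLM call, and all names here are hypothetical.

```python
from typing import Callable

Exemplar = tuple[str, str]  # (input, expected output)

ANTONYMS = {"stop": "go", "hot": "cold", "up": "down"}


def toy_model(instruction: str, task_input: str) -> str:
    """Stand-in for an LLM: answers correctly only when asked for opposites."""
    if "opposite" in instruction.lower():
        return ANTONYMS.get(task_input.lower(), "")
    return task_input  # otherwise it just echoes the input


def score_instruction(instruction: str, exemplars: list[Exemplar],
                      model: Callable[[str, str], str]) -> float:
    """Fraction of exemplars whose output the model reproduces (exact match)."""
    if not exemplars:
        raise ValueError("at least one exemplar is required")
    correct = sum(
        model(instruction, x).strip().lower() == y.strip().lower()
        for x, y in exemplars
    )
    return correct / len(exemplars)


def select_best(candidates: list[str], exemplars: list[Exemplar],
                model: Callable[[str, str], str]) -> str:
    """Return the candidate instruction with the highest score."""
    return max(candidates, key=lambda c: score_instruction(c, exemplars, model))


exemplars = [("Stop", "Go"), ("Hot", "Cold"), ("Up", "Down")]
candidate_instructions = [
    "Repeat the input word.",
    "Write the opposite of the input word.",
    "Translate the word into French.",
]
best = select_best(candidate_instructions, exemplars, toy_model)
print(best)
```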

Let's visit the module: [APE](Prompt_Engineering_APE.ipynb)

# Related work: Self-improvement

The methods illustrated above use an LLM to help improve future iterations of the LLM.

This is called *self-improvement*.

A related [paper](https://arxiv.org/pdf/2210.11610.pdf) adds some interesting ideas.

The first idea relates to the construction of the exemplars
- use Chain of Thought (CoT) exemplars as demonstrations of the task
    - for example generation

CoT prompts have been shown to increase the likelihood of generating a correct response
- by explicitly asking for "step by step" reasoning to be included
- rather than just outputting the "answer"

But even with step by step reasoning, a wrong answer may be output.

The other idea adopted by the authors is *multiple reasoning paths*
- sample multiple outputs for each question
- extract the "answer" part (i.e., ignore the step by step part) from the output
- the answer that occurs most frequently among the multiple answers is deemed more likely to be correct

The answer deemed to be correct
- is then used as a training example
- to improve the model's future behavior on similar questions
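The majority vote over sampled outputs can be sketched as follows. The sketch assumes each sampled output ends with a sentence of the form "The answer is ..."; that marker, and the names `extract_answer` and `majority_answer`, are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Optional


def extract_answer(output: str) -> Optional[str]:
    """Pull the final answer out of a step-by-step output, ignoring the reasoning."""
    match = re.search(r"The answer is\s*(.+?)\.?\s*$", output.strip())
    return match.group(1) if match else None


def majority_answer(outputs: list[str]) -> Optional[tuple[str, float]]:
    """Return the most frequent answer and its share of the parsable outputs."""
    answers = [a for a in map(extract_answer, outputs) if a is not None]
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


sampled_outputs = [
    "There are 3 apples and 4 more arrive, so 3 + 4 = 7. The answer is 7.",
    "Start with 3, add 4 to get 7. The answer is 7.",
    "3 times 4 is 12, minus 4 is 8. The answer is 8.",
]
result = majority_answer(sampled_outputs)
print(result)
```

The vote share gives a rough confidence signal: a question whose sampled answers mostly disagree could be excluded from the training examples.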

In [2]:
print("Done")

Done
