In [1]:
!pip install -q transformers accelerate

# LLM Prompting Guide

Large Language Models such as Falcon, LLaMA, etc. are pretrained transformer models initially trained to predict the next token given some input text.

We can use them to solve multiple NLP tasks out of the box by instructing the models with natural language prompts. Designing such prompts to ensure the optimal output is often called "prompt engineering".

## Basics of prompting

### Types of models

When using a pipeline to generate text with an LLM, it is important to know what type of LLM we are using.

Run inference with decoder-only models using the `text-generation` pipeline:

In [None]:
from transformers import pipeline
import torch

torch.manual_seed(101)

generator = pipeline('text-generation', model='openai-community/gpt2')

In [3]:
prompt = "Hello, I'm a language model"
generator(prompt, max_length=50)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model.\n\nBut I didn't find myself able to build a model of why some people dislike the language at all. Is the language too verbose? Maybe because it is too simple to use? It is confusing"}]

To run inference with encoder-decoder models, use the `text2text-generation` pipeline:

In [5]:
text2text_generator = pipeline('text2text-generation', model='google/flan-t5-base')
prompt = "Translate from English to French: I'm very happy to see you"

In [6]:
text2text_generator(prompt)

[{'generated_text': 'Je suis très heureuse de vous rencontrer.'}]

### Base vs instruct/chat models

Base models are excellent at completing text when given an initial prompt. However, they are not ideal for NLP tasks where they need to follow instructions, or for conversational use. This is where the instruct (chat) versions come in.

These checkpoints are the result of further fine-tuning of the pre-trained base versions on instructions and conversational data.

### NLP tasks

In [None]:
from transformers import pipeline, AutoTokenizer
import torch

torch.manual_seed(101)
checkpoint = 'tiiuae/falcon-7b-instruct'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
pipe = pipeline(
    'text-generation',
    model=checkpoint,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

#### Text classification

We can write a prompt that instructs the model to classify a given text.

In [None]:
prompt = """Classify the text into neutral, negative or positive.
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
"""

sequences = pipe(
    prompt,
    max_new_tokens=10,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

#### Named Entity Recognition

We can modify the instructions in the prompt to make the LLM perform Named Entity Recognition (NER). We can set `return_full_text=False` so that the output does not contain the prompt:

In [None]:
prompt = """Return a list of named entities in the text.
Text: The Golden State Warriors are an American professional basketball team based in San Francisco.
Named entities:
"""

sequences = pipe(
    prompt,
    max_new_tokens=15,
    return_full_text=False
)

for seq in sequences:
    print(f"{seq['generated_text']}")

In [None]:
sequences = pipe(
    prompt,
    max_new_tokens=15,
    return_full_text=True
)

for seq in sequences:
    print(f"{seq['generated_text']}")
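The two cells above differ only in `return_full_text`. As a rough illustration of what that flag changes, the sketch below recovers the completion from a full-text result by slicing off the prompt. It assumes the full text begins with the prompt verbatim; `split_completion` is a hypothetical helper, and the sample string is a made-up stand-in, not real model output.

```python
def split_completion(full_text: str, prompt: str) -> str:
    """Return only the newly generated part of a full-text result."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):]
    # Fall back to the whole string if the prompt is not a verbatim prefix.
    return full_text


# Hypothetical stand-in for a result produced with return_full_text=True.
ner_prompt = "Named entities:\n"
hypothetical_full_text = ner_prompt + "- Golden State Warriors\n- San Francisco"

completion = split_completion(hypothetical_full_text, ner_prompt)
print(completion)
```

In practice, passing `return_full_text=False` to the pipeline is simpler. The helper is only meant to make the difference between the two outputs concrete.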

#### Translation

In [None]:
prompt = """Translate the English text to Italian.
Text: Sometimes, I've believed as many as six impossible things before breakfast.
Translation:
"""

sequences = pipe(
    prompt,
    max_new_tokens=20,
    do_sample=True,
    top_k=10,
    return_full_text=False,
)

for seq in sequences:
    print(f"{seq['generated_text']}")

#### Text summarization

In [None]:
prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change.
Write a summary of the above text.
Summary:
"""

sequences = pipe(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    top_k=10,
    return_full_text=False,
)

for seq in sequences:
    print(f"{seq['generated_text']}")

#### Question answering

For a QA task, we can structure the prompt into the following logical components: instructions, context, question, and a leading word or phrase (`"Answer:"`) to nudge the model to start generating the answer:

In [None]:
prompt = """Answer the question using the context below.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

sequences = pipe(
    prompt,
    max_new_tokens=10,
    do_sample=True,
    top_k=10,
    return_full_text=False,
)

for seq in sequences:
    print(f"{seq['generated_text']}")
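The four components of the QA prompt (instructions, context, question, leading word) can also be assembled programmatically, which helps when the same structure is reused across many questions. This is a minimal sketch: `build_qa_prompt` and its parameter names are hypothetical, not part of `transformers`.

```python
def build_qa_prompt(
    context: str,
    question: str,
    instructions: str = "Answer the question using the context below.",
    lead: str = "Answer:",
) -> str:
    """Assemble a QA prompt from its four logical components."""
    parts = [
        instructions.strip(),
        f"Context: {context.strip()}",
        f"Question: {question.strip()}",
        lead.strip(),
    ]
    # End with a newline to mirror the triple-quoted prompts used in this notebook.
    return "\n".join(parts) + "\n"


qa_prompt = build_qa_prompt(
    context="Gazpacho is a cold soup and drink made of raw, blended vegetables.",
    question="What is gazpacho made of?",
)
print(qa_prompt)
```

The resulting string can be passed to `pipe(...)` in the same way as the hand-written prompt above.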

#### Reasoning

Reasoning often requires applying advanced prompting techniques, like Chain-of-thought.

In [None]:
prompt = """There are 5 groups of students in the class. Each group has 4 students. How many students are there in the class?"""

sequences = pipe(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    top_k=10,
    return_full_text=False,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

In [None]:
prompt = """I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. My partner then bought 6 more muffins and ate 2. How many muffins do we now have?"""

sequences = pipe(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    top_k=10,
    return_full_text=False,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Reasoning is difficult for models of all sizes, but larger models are likely to perform better.
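When judging the model's outputs for the two reasoning prompts above, it helps to have the expected answers at hand. They can be worked out directly. The variable names below are illustrative only.

```python
# First prompt: 5 groups with 4 students each.
groups, students_per_group = 5, 4
total_students = groups * students_per_group

# Second prompt: apply each event to the muffin count in order.
muffins = 15   # baked
muffins -= 2   # I ate 2
muffins -= 5   # gave 5 to a neighbor
muffins += 6   # partner bought 6 more
muffins -= 2   # partner ate 2

print(total_students)  # 20
print(muffins)         # 12
```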

## Best practices of LLM prompting

* When choosing the model to work with, the latest and most capable models are likely to perform better.
* Start with a simple and short prompt, and iterate from there.
* Put the instructions at the beginning of the prompt, or at the very end. When working with a large context, models apply various optimizations to prevent attention complexity from scaling quadratically. This may make a model more attentive to the beginning or end of a prompt than to the middle.
* Clearly separate instructions from the text they apply to - more on this in the next section.
* Be specific and descriptive about the task and the desired outcome - its format, length, style, language, etc.
* Avoid ambiguous descriptions and instructions.
* Favor instructions that say “what to do” instead of those that say “what not to do”.
* “Lead” the output in the right direction by writing the first word (or even begin the first sentence for the model).
* Use advanced techniques like Few-shot prompting and Chain-of-thought.
* Test your prompts with different models to assess their robustness.
* Version and track the performance of your prompts.
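Several of the points above can be combined in a single prompt template: instructions first, a clear separation between instructions and input, an explicit output format, and a leading word. The sketch below shows one possible layout. The triple-quote delimiter and the `make_prompt` helper are assumptions for illustration, not a convention required by any particular model.

```python
def make_prompt(instruction: str, text: str, lead: str, delimiter: str = '"""') -> str:
    """Put the instruction first, fence the input text, and end with a leading word."""
    return (
        f"{instruction.strip()}\n"
        f"Text: {delimiter}{text.strip()}{delimiter}\n"
        f"{lead.strip()}"
    )


template_prompt = make_prompt(
    instruction=(
        "Classify the sentiment of the text as exactly one word: "
        "positive, negative or neutral."
    ),
    text="The plot was predictable, but the acting kept me watching.",
    lead="Sentiment:",
)
print(template_prompt)
```

Keeping the template in a function also makes it easier to version prompts and compare their performance over time.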

## Advanced prompting techniques

### Few-shot prompting

The previous prompts are examples of "zero-shot" prompts, meaning that the model has been given instructions and context, but no examples with solutions.

In few-shot prompting, we provide examples in the prompt, giving the model more context to improve performance. The examples condition the model to generate output that follows the patterns in the examples.

In [None]:
prompt = """
Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date:
"""

sequences = pipe(
    prompt,
    max_new_tokens=10,
    do_sample=True,
    top_k=10,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

We provided a single example to demonstrate the desired output to the model, so this can be called "one-shot" prompting. Depending on the task complexity, we may need to use more than one example.
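When more than one example is needed, building the prompt from a list keeps the formatting consistent across examples. This is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, and the `Text:` / `Date:` labels simply follow the prompt above.

```python
def build_few_shot_prompt(examples, query, input_label="Text", output_label="Date"):
    """Format solved examples followed by an unsolved query."""
    lines = []
    for text, answer in examples:
        lines.append(f"{input_label}: {text}")
        lines.append(f"{output_label}: {answer}")
    # The final entry is left open so that the model fills in the answer.
    lines.append(f"{input_label}: {query}")
    lines.append(f"{output_label}:")
    return "\n".join(lines)


examples = [
    ("The first human went into space and orbited the Earth on April 12, 1961.", "04/12/1961"),
    ("Apollo 11 landed on the Moon on July 20, 1969.", "07/20/1969"),
]
few_shot_prompt = build_few_shot_prompt(
    examples, "The Berlin Wall fell on November 9, 1989."
)
print(few_shot_prompt)
```

With two solved examples this is a "two-shot" prompt. Adding tuples to `examples` extends it, at the cost of a longer prompt.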

Limitations of the few-shot prompting technique:
* While LLMs can pick up on the patterns in the examples, this technique does not work well on complex reasoning tasks.
* Few-shot prompting requires creating lengthy prompts. Prompts with a large number of tokens can increase computation and latency. There is also a limit to the length of the prompts.
* Sometimes when given a number of examples, models can learn patterns that we did not intend them to learn.

### Chain-of-thought

Chain-of-thought (CoT) prompting is a technique that nudges a model to produce intermediate reasoning steps, thus improving the results on complex reasoning tasks. There are two ways to steer a model toward producing the reasoning steps:
* By few-shot prompting, illustrating examples with detailed answers to questions and showing the model how to work through a problem.
* By instructing the model to reason by adding phrases like "Let's think step by step" or "Take a deep breath and work through the problem step by step".
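Both variants can be sketched as plain string builders. The `Q:` / `A:` layout and the helper names below are assumptions made for illustration, and the worked example reuses the student-groups question from the Reasoning section.

```python
COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, trigger: str = COT_TRIGGER) -> str:
    """Append a reasoning trigger so that the model starts with intermediate steps."""
    return f"Q: {question.strip()}\nA: {trigger}"


def few_shot_cot(worked_examples, question: str) -> str:
    """Prepend examples whose answers spell out the reasoning before the result."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in worked_examples
    ]
    blocks.append(f"Q: {question.strip()}\nA:")
    return "\n\n".join(blocks)


muffin_question = (
    "I baked 15 muffins. I ate 2 muffins and gave 5 muffins to a neighbor. "
    "My partner then bought 6 more muffins and ate 2. How many muffins do we now have?"
)
worked = [
    (
        "There are 5 groups of students in the class. Each group has 4 students. "
        "How many students are there in the class?",
        "There are 5 groups with 4 students each, so 5 * 4 = 20.",
        20,
    )
]

print(zero_shot_cot(muffin_question))
print()
print(few_shot_cot(worked, muffin_question))
```

Because the model now writes out its reasoning before the answer, `max_new_tokens` usually needs to be larger than in the earlier cells.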

## Prompting vs fine-tuning

We can achieve great results by optimizing our prompts. However, we may still wonder whether fine-tuning a model would work better for our case.

Here are some scenarios in which fine-tuning a smaller model may be the preferred option:
* Our domain is wildly different from what LLMs were pre-trained on and extensive prompt optimization did not yield sufficient results.
* We need our model to work well in a low-resource language.
* We need the model to be trained on sensitive data that is under strict regulations.
* We have to use a small model due to cost, privacy, infrastructure or other limitations.

We need to make sure that we either already have or can easily obtain a large enough domain-specific dataset at a reasonable cost to fine-tune a model. We will also need enough time and resources to fine-tune a model.

If these conditions do not hold for us, optimizing prompts can prove to be more beneficial.