This example walks you through the basic usage of PromptBench. We hope it helps you get familiar with the APIs so that you can use them in your own projects later.

First, a single unified import, `import promptbench as pb`, brings in the whole package.

In [1]:
!pip install promptbench

Collecting promptbench
  Downloading promptbench-0.0.2-py3-none-any.whl.metadata (12 kB)
Collecting autocorrect==2.6.1 (from promptbench)
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 622.8/622.8 kB 7.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting accelerate==0.25.0 (from promptbench)
  Downloading accelerate-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting datasets==2.15.0 (from promptbench)
  Downloading datasets-2.15.0-py3-none-any.whl.metadata (20 kB)
Collecting nltk==3.8.1 (from promptbench)
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Collecting openai==1.3.7 (from promptbench)
  Downloading openai-1.3.7-py3-none-any.whl.metadata (17 kB)
Collecting sentencepiece==0.1.99 (from promptbench)
  Downloading sentencepiece-0.1.99-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting tokenizers

In [2]:
import promptbench as pb

## Load dataset

First, PromptBench makes it easy to load datasets.

In [3]:
# print all supported datasets in promptbench
print('All supported datasets: ')
print(pb.SUPPORTED_DATASETS)

# load a dataset, sst2, for instance.
# if the dataset is not available locally, it will be downloaded automatically.
dataset = pb.DatasetLoader.load_dataset("sst2")
# dataset = pb.DatasetLoader.load_dataset("mmlu")
# dataset = pb.DatasetLoader.load_dataset("un_multi")
# dataset = pb.DatasetLoader.load_dataset("iwslt2017", ["ar-en", "de-en", "en-ar"])
# dataset = pb.DatasetLoader.load_dataset("math", "algebra__linear_1d")
# dataset = pb.DatasetLoader.load_dataset("bool_logic")
# dataset = pb.DatasetLoader.load_dataset("valid_parentheses")

# print the first 5 examples
dataset[:5]

All supported datasets: 
['sst2', 'cola', 'qqp', 'mnli', 'mnli_matched', 'mnli_mismatched', 'qnli', 'wnli', 'rte', 'mrpc', 'mmlu', 'squad_v2', 'un_multi', 'iwslt2017', 'math', 'bool_logic', 'valid_parentheses', 'gsm8k', 'csqa', 'bigbench_date', 'bigbench_object_tracking', 'last_letter_concat', 'numersense', 'qasc']


[{'content': 'uneasy mishmash of styles and genres .', 'label': -1},
 {'content': "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .",
  'label': -1},
 {'content': 'by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .',
  'label': -1},
 {'content': 'director rob marshall went out gunning to make a great one .',
  'label': -1},
 {'content': 'lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .',
  'label': -1}]
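
Each example is a plain dictionary with `content` and `label` keys, so ordinary Python tooling is enough to inspect a loaded dataset. The sketch below counts label values over a small hand-copied sample. The `sample` list and the `label_counts` helper are illustrative stand-ins, not part of PromptBench. A label of -1 on every row, as in the printout above, may indicate a split that ships without gold labels.

```python
from collections import Counter

# Illustrative stand-in for `dataset`: a list of dicts with 'content' and 'label'.
sample = [
    {"content": "uneasy mishmash of styles and genres .", "label": -1},
    {"content": "director rob marshall went out gunning to make a great one .", "label": -1},
]

def label_counts(examples):
    """Count how often each label value occurs (hypothetical helper)."""
    return Counter(example["label"] for example in examples)

counts = label_counts(sample)
print(counts)
```

Running the same count over the full `dataset` is a quick sanity check before starting a long evaluation.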

## Load models

Next, you can easily load LLMs via PromptBench.

In [4]:
import torch
import os
from huggingface_hub import login

# read the Hugging Face token from the environment instead of hard-coding a secret
token = os.environ["HF_TOKEN"]
login(token)

# print all supported models in promptbench
print('All supported models: ')
print(pb.SUPPORTED_MODELS)

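# note: without parentheses this prints the function object itself;
# call torch.cuda.current_device() on a CUDA machine to get the device index
print(torch.cuda.current_device)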
# load a model, flan-t5-large, for instance.
#model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10, temperature=0.0001, device='cuda')
#model = pb.LLMModel(model='google/flan-t5-large', max_new_tokens=10, temperature=0.0001, device=torch.cuda.current_device)



model = pb.LLMModel(model='google/flan-t5-large',
                    max_new_tokens=10, 
                    temperature=0.0001)

#model = pb.LLMModel(model='gpt-3.5-turbo', 
#                    openai_key = 'sk-xxx',
#                    max_new_tokens=150)
    

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /opt/app-root/src/.cache/huggingface/token
Login successful
All supported models: 
['google/flan-t5-large', 'llama2-7b', 'llama2-7b-chat', 'llama2-13b', 'llama2-13b-chat', 'llama2-70b', 'llama2-70b-chat', 'phi-1.5', 'palm', 'gpt-3.5-turbo', 'gpt-4', 'gpt-4-1106-preview', 'gpt-3.5-turbo-1106', 'vicuna-7b', 'vicuna-13b', 'vicuna-13b-v1.3', 'google/flan-ul2']
<function current_device at 0x7f485ab5b820>


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Construct prompts

Prompts are the key interface for interacting with LLMs. You can easily construct a prompt by calling the Prompt API.

In [5]:
# Prompt API supports a list, so you can pass multiple prompts at once.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}",
                     "Determine the emotion of the following sentence as positive or negative: {content}"
                     ])
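
The `{content}` placeholder in each prompt is filled from the matching key of a dataset example. As a rough plain-Python illustration of that substitution, the sketch below uses `str.format`. This is an assumption about the behavior, not PromptBench's actual implementation, and `fill_prompt` is a hypothetical helper.

```python
def fill_prompt(template: str, example: dict) -> str:
    """Substitute {placeholders} in the template with values from the example.

    Keys in `example` that the template does not mention (such as 'label')
    are simply ignored by str.format.
    """
    return template.format(**example)

template = "Classify the sentence as positive or negative: {content}"
example = {"content": "uneasy mishmash of styles and genres .", "label": -1}

print(fill_prompt(template, example))
```

Any key of the example dictionary can serve as a placeholder name, which is why the prompt text must use the same field names as the dataset.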

You may need to define a projection function for the model output, since the output format requested in your prompts may differ from the dataset's label format.
For example, in the sst2 dataset the labels are '0' and '1', representing 'negative' and 'positive',
but the model outputs the words 'negative' and 'positive'.
So we define a projection function that maps the model output to the label.

In [6]:
def proj_func(pred):
    mapping = {
        "positive": 1,
        "negative": 0
    }
    return mapping.get(pred, -1)
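
Because `proj_func` looks up the prediction verbatim, stray whitespace, capitalization, or a trailing period in the model output would fall through to -1. Below is a minimal sketch of a more forgiving projection, assuming the raw output is a plain string. `robust_proj_func` is a hypothetical name, and PromptBench's `OutputProcess.cls` may already normalize the text before calling the projection.

```python
LABEL_MAPPING = {"positive": 1, "negative": 0}

def robust_proj_func(pred: str) -> int:
    """Map a raw model output to a label id, or -1 if it is not recognized."""
    # Trim whitespace, lowercase, then drop any trailing periods.
    cleaned = pred.strip().lower().rstrip(".")
    return LABEL_MAPPING.get(cleaned, -1)

for raw in ["positive", " Negative\n", "Positive.", "neutral"]:
    print(repr(raw), "->", robust_proj_func(raw))
```

Returning -1 for unrecognized outputs keeps them distinguishable from both valid classes, so they count as errors against 0/1 labels.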

## Perform evaluation using prompts, datasets, and models

Finally, you can perform a standard evaluation using the loaded prompts, dataset, and model.

In [7]:
from tqdm import tqdm
for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        label = data['label']
        raw_pred = model(input_text)
        # process output
        pred = pb.OutputProcess.cls(raw_pred, proj_func)
        preds.append(pred)
        labels.append(label)
    
    # evaluate
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}, {prompt}")

100%|██████████| 1821/1821 [05:13<00:00,  5.80it/s]


0.000, Classify the sentence as positive or negative: {content}


100%|██████████| 1821/1821 [05:26<00:00,  5.58it/s]

0.000, Determine the emotion of the following sentence as positive or negative: {content}
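
Classification accuracy is the fraction of predictions that equal their labels. The sketch below is a plain-Python version of that calculation. `accuracy` is a hypothetical helper, not PromptBench's `compute_cls_accuracy`. It also illustrates one likely reason the scores above are 0.000: the examples printed earlier all carry the label -1, so predictions of 0 or 1 can never match them.

```python
def accuracy(preds, labels):
    """Return the fraction of positions where the prediction equals the label."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == l for p, l in zip(preds, labels))
    return correct / len(labels)

# With real 0/1 labels, matching predictions are counted: prints 0.750
print(f"{accuracy([1, 0, 1, 1], [1, 0, 0, 1]):.3f}")

# With placeholder labels of -1, no 0/1 prediction can match: prints 0.000
print(f"{accuracy([1, 0, 1, 1], [-1, -1, -1, -1]):.3f}")
```

If the loaded labels are all -1, evaluating on a split that includes gold labels should give a meaningful score.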



