In [None]:
# %pip install -qq alpaca-eval

# Testing Best-of-10 Sampling

This notebook shows how you can get a boost in performance (as measured by AlpacaEval) by sampling multiple times and choosing the best result.

I ran it on my dual 3090 machine, with the reward model on one GPU and the LLM on the other. 

Since this ran over the weekend I didn't optimize it at all. You'll likely want to generate or score in batches and look into inference optimizations if you want to replicate this in less than 24 hours!

I've edited it lightly to add comments and run it on only 3 samples by default.



In [1]:
import torch
import transformers
import numpy as np
from tqdm.auto import tqdm
from transformers import AutoTokenizer, pipeline

This is a reward model, which assigns a score to a chat interaction. I picked the best open one on [RewardBench](https://huggingface.co/spaces/allenai/reward-bench) and copied in the usage example here:

In [2]:
# Load the reward model
rm_tokenizer = AutoTokenizer.from_pretrained("sfairXC/FsfairX-LLaMA3-RM-v0.1")
rm_pipe = pipeline(
    "sentiment-analysis",
    model="sfairXC/FsfairX-LLaMA3-RM-v0.1",
    device=0,
    tokenizer=rm_tokenizer,
    model_kwargs={"torch_dtype": torch.bfloat16}
)

pipe_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 1
}


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



Let's score a sample conversation. Try changing the assistant response to something nasty:

In [4]:
# Test it (usage example from HF page)
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing well! How can I help?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

test_texts = [rm_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False).replace(rm_tokenizer.bos_token, "")]
pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
rewards = [output[0]["score"] for output in pipe_outputs]
print(rewards[0]) # -8.1 if we add a naughty word.

-6.53125
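
The scoring steps above can be wrapped into a small helper. This is a minimal sketch rather than anything from the model card: `score_chat` is a hypothetical name, and it takes the tokenizer and pipeline as arguments so it works with `rm_tokenizer` and `rm_pipe` but can also be tried with stand-ins.

```python
def score_chat(chat, tokenizer, pipe, pipe_kwargs=None):
    """Return the reward-model score for one chat (a list of role/content dicts).

    Hypothetical helper: it mirrors the scoring cell above and nothing more.
    """
    pipe_kwargs = pipe_kwargs or {}
    # Render the chat with the reward model's template and strip the BOS token,
    # as the usage example does (presumably to avoid a duplicated BOS when the
    # pipeline tokenizes the text again).
    text = tokenizer.apply_chat_template(
        chat, tokenize=False, add_generation_prompt=False
    ).replace(tokenizer.bos_token, "")
    outputs = pipe([text], **pipe_kwargs)
    # With return_all_scores=True, each item is a list of {"label", "score"} dicts.
    return outputs[0][0]["score"]
```

With the objects loaded above, the call would be `score_chat(chat, rm_tokenizer, rm_pipe, pipe_kwargs)`.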


# Generating AlpacaEval Responses

Next I copied the starter code from the Llama 3 8B Instruct page to generate some text in response to an instruction:

In [5]:
# Load the model and generate some text.
# Note: this rebinds `pipeline`, shadowing the function imported from transformers above.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=1
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = pipeline.tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)

terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Arrrr, me hearty! Me name be Captain Chatbot, the scurviest pirate to ever sail the Seven Seas! Me be a chatbot, aye, but me be a pirate at heart, with a penchant fer swashbucklin' and a love o' treasure! Me be here to chat with ye, share me tales o' adventure, and maybe even help ye find yer own treasure! So hoist the colors, me hearty, and let's set sail fer a swashbucklin' good time!


In [6]:
# Package that up into a function that takes an instruction and returns a response
def get_response(instruction):
    messages = [
        {"role": "system", "content": "You are a helpful chatbot."},
        {"role": "user", "content": instruction},
    ]

    prompt = pipeline.tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=True
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=1024,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    return outputs[0]["generated_text"][len(prompt):]
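
One way to cut generation time is to request several samples in a single call instead of calling `get_response` repeatedly. The sketch below assumes the text-generation pipeline forwards `num_return_sequences` to `generate()` and returns one dict per sample; `get_n_responses` and its `text_pipe` parameter are hypothetical names, and I have not timed this on the setup described here. Memory use grows with `n`, so a smaller `n` or shorter `max_new_tokens` may be needed.

```python
def get_n_responses(text_pipe, instruction, n=10, max_new_tokens=1024):
    """Sample n responses to one instruction in a single pipeline call."""
    messages = [
        {"role": "system", "content": "You are a helpful chatbot."},
        {"role": "user", "content": instruction},
    ]
    prompt = text_pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    terminators = [
        text_pipe.tokenizer.eos_token_id,
        text_pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]
    outputs = text_pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        num_return_sequences=n,  # ask for n samples of the same prompt at once
    )
    # Each output echoes the prompt, so slice it off as get_response does.
    return [o["generated_text"][len(prompt):] for o in outputs]
```

With the model loaded above, this would be called as `get_n_responses(pipeline, instruction, n=10)`.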

Now we want to use this to generate responses to the instructions used in AlpacaEval 2.0, a popular evaluation of chat models. Let's see what a submission is supposed to look like:

In [7]:
# Get the alpacaeval instructions
import requests

example_json_url = "https://raw.githubusercontent.com/tatsu-lab/alpaca_eval/main/example/outputs.json"

response = requests.get(example_json_url)
examples = response.json()

Here's one example (they include output from a simple model, which we'll replace):

In [8]:
import pprint
pprint.pprint(examples[0])

{'dataset': 'helpful_base',
 'datasplit': 'eval',
 'generator': 'example',
 'instruction': 'What are the names of some famous actors that started their '
                'careers on Broadway?',
 'output': 'Some famous actors that started their careers on Broadway are Hugh '
           'Jackman, Meryl Streep, Denzel Washington, Audra McDonald, and '
           'Lin-Manuel Miranda.'}


We can get our own model's response to this instruction like so:

In [9]:
print(get_response(examples[0]['instruction']))

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


Many talented actors have gotten their start on Broadway before making it big in Hollywood. Here are some famous actors who began their careers on the Great White Way:

1. Hugh Jackman - Jackman originated the role of Curly McLain in the 1998 Broadway production of "Oklahoma!" before moving on to star in films like the X-Men franchise and "Les Misérables."
2. Neil Patrick Harris - Harris got his start on Broadway in the 1980s, appearing in productions like "Assassins" and "Cabaret." He's since become a household name thanks to his roles in TV shows like "How I Met Your Mother" and "Doogie Howser, M.D."
3. Scarlett Johansson - Johansson made her Broadway debut in the 1995 production of "A View from the Bridge" alongside Robert De Niro. She's since become a Hollywood A-lister, starring in films like the Marvel Cinematic Universe and "Lost in Translation."
4. Chris Evans - Evans originated the role of Billy in the 2000 Broadway production of "Miss Saigon" before moving on to star in the M

Let's do this for all examples, generating 10 responses and storing the first plus the best (according to the reward model):

In [None]:
# Run through the examples (only the first 3 by default; drop the slice for a full run),
# generating and scoring responses
first_outputs = []
chosen_outputs = []
n_tries = 10
for e in tqdm(examples[:3]):
    instruction = e['instruction']
    responses, scores = [], []
    for i in range(n_tries):
        response = get_response(instruction)
        chat = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]

        test_texts = [rm_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False).replace(rm_tokenizer.bos_token, "")]
        pipe_outputs = rm_pipe(test_texts, **pipe_kwargs)
        rewards = [output[0]["score"] for output in pipe_outputs]
        responses.append(response)
        scores.append(rewards[0]) 

    # Store the first response and the best response
    first_outputs.append({
        'dataset': e['dataset'],
        'datasplit': e['datasplit'],
        'generator': 'l3_8b_default',
        'instruction': instruction,
        'output': responses[0]
    })
    chosen_outputs.append({
        'dataset': e['dataset'],
        'datasplit': e['datasplit'],
        'generator': f'l3_8b_best_of_{n_tries}',
        'instruction': instruction,
        'output': responses[np.argmax(np.array(scores))] # << The best of the n_tries responses
    })
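
Stripped of the model-specific details, the selection logic in the loop above comes down to a few lines. This is a minimal sketch with hypothetical names: `generate` and `score` are caller-supplied callables standing in for the LLM and the reward model, so the idea can be tried without loading either.

```python
def best_of_n(generate, score, prompt, n=10):
    """Return (first, best, scores) over n sampled responses.

    generate(prompt) -> str and score(prompt, response) -> float are
    caller-supplied callables (hypothetical, not part of any library).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    responses = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, r) for r in responses]
    # Index of the highest score; ties resolve to the earliest sample,
    # the same behavior as np.argmax.
    best_idx = max(range(n), key=scores.__getitem__)
    return responses[0], responses[best_idx], scores
```

Returning the first sample alongside the best one keeps the baseline and the best-of-n output paired on exactly the same set of generations, which is what the loop above does.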

In [None]:
# Save the outputs for evaluation
import json
with open('first_outputs.json', 'w') as f:
    f.write(json.dumps(first_outputs))
with open('chosen_outputs.json', 'w') as f:
    f.write(json.dumps(chosen_outputs))
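
Before spending API credits on the eval, it can be worth a quick check that the saved files have the expected shape. The sketch below is an assumption-laden sanity check, not part of AlpacaEval: `check_outputs` is a hypothetical helper, and `REQUIRED_KEYS` is simply the set of keys seen in the example record above, which may be stricter than what the tool actually requires.

```python
import json

# Keys taken from the example record shown earlier (an assumption about
# what a submission should contain, not an official schema).
REQUIRED_KEYS = {"dataset", "datasplit", "generator", "instruction", "output"}

def check_outputs(path):
    """Load a saved outputs file and list records with missing keys or empty output."""
    with open(path) as f:
        records = json.load(f)
    problems = []
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif not rec["output"].strip():
            problems.append((i, "empty output"))
    return records, problems
```

For example, `records, problems = check_outputs('chosen_outputs.json')` should give an empty `problems` list if every record is complete.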

To run the eval on each file:
```
export OPENAI_API_KEY=<your_api_key>
alpaca_eval --model_outputs 'first_outputs.json'
alpaca_eval --model_outputs 'chosen_outputs.json'
```

AlpacaEval takes these responses and compares them to those written by GPT-4 Turbo, with a big LLM (GPT-4 by default) as the judge picking which response is better. The output is a 'win rate' score: the higher the better.
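
As a toy illustration of what a win rate means, and not AlpacaEval's actual implementation, the score is the share of comparisons in which the judge preferred our model's response. The `win_rate` function and its "win"/"loss"/"tie" labels are hypothetical; counting a tie as half a win is an assumption made for this sketch.

```python
def win_rate(judgements):
    """Percentage of comparisons won; a tie counts as half a win.

    judgements: iterable of "win", "loss", or "tie" (hypothetical labels).
    """
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    judgements = list(judgements)
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(points[j] for j in judgements) / len(judgements)

print(win_rate(["win", "loss", "win", "tie"]))  # 62.5
```

Comparing this number for `first_outputs.json` and `chosen_outputs.json` is what shows whether best-of-n sampling helped.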