# LLM-Blender Usage examples

## Loading blender (quick start)
More configuration options are defined in:
- PairRanker: [./llm_blender/pair_ranker/config.py](./llm_blender/pair_ranker/config.py)
- GenFuser: [./llm_blender/gen_fuser/config.py](./llm_blender/gen_fuser/config.py)
- Blender: [./llm_blender/blender/config.py](./llm_blender/blender/config.py)

In [2]:
!pip install -e .

Obtaining file:///root/LLM-Blender
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.31.0 (from llm-blender==0.0.2)
  Obtaining dependency information for transformers>=4.31.0 from https://files.pythonhosted.org/packages/20/0a/739426a81f7635b422fbe6cb8d1d99d1235579a6ac8024c13d743efa6847/transformers-4.36.2-py3-none-any.whl.metadata
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
     126.8/126.8 kB 2.6 MB/s eta 0:00:00
Collecting datasets (from llm-blender==0.0.2)
  Obtaining dependency information for datasets from https://files.pythonhosted.org/packages/ec/93/454ada0d1b289a0f4a86ac88dbdeab54921becabac45da3da787d136628f/datasets-2.16.1-py3-none-any.whl.metadata
  Downloading datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Collecting wget (from llm-blender==0.0.2)
  Using cached wget-3.2-py3-none-any.whl
Collecting pycocoevalcap (f

In [3]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import llm_blender
blender = llm_blender.Blender()

# Load a ranker checkpoint (uncomment one):
# blender.loadranker("llm-blender/PairRM")
# blender.loadranker("OpenAssistant/reward-model-deberta-v3-large-v2")

# Load a fuser checkpoint if you want to use the pre-trained fuser; the ranker can also be used on its own:
# blender.loadfuser("llm-blender/gen_fuser_3b")



In [4]:
# load ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint




Successfully loaded ranker from  /root/.cache/huggingface/hub/llm-blender/PairRM


In [5]:
# Load Fuser
blender.loadfuser("llm-blender/gen_fuser_3b") # load fuser checkpoint if you want to use pre-trained fuser; or you can use ranker only

pytorch_model.bin: 100%|██████████| 5.70G/5.70G [01:33<00:00, 61.2MB/s]
generation_config.json: 100%|██████████| 142/142 [00:00<00:00, 718kB/s]


## Load the MixInstruct dataset used in the following examples

In [38]:
import datasets
import json
# from llm_blender.gpt_eval.cor_eval import COR_MAPS
# from llm_blender.gpt_eval.utils import get_ranks_from_chatgpt_cmps

# load mixinstruct dataset
mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)

# take 8 examples from mixinstruct dataset 
few_examples = list(mixinstruct_test.take(8))

insts = [x['instruction'] for x in few_examples]
inputs = [x['input'] for x in few_examples]


candidates_texts = [[cand['text'] for cand in x['candidates']] for x in few_examples]

# list with models used to generate the candidate texts
candidate_models = [[cand['model'] for cand in x['candidates']] for x in few_examples]
# print(candidate_models)

# print("Example:")
# print("Instruction 1:\n", insts[0])
print("Input 1:\n", inputs[0])


Input 1:
 I've always wondered what the difference is between a skeptic and a denier.
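
The list comprehensions above rely on each record being a dict with `instruction`, `input`, and a `candidates` list whose items carry `text` and `model`. The sketch below reproduces the same extraction on two made-up records, so the resulting shapes are easy to see. The records and the `mock_*` names are placeholders for illustration, not real dataset rows.

```python
# Hypothetical records that mimic only the fields used in this notebook.
mock_examples = [
    {
        "instruction": "",
        "input": "question A",
        "candidates": [
            {"model": "model-x", "text": "answer A from x"},
            {"model": "model-y", "text": "answer A from y"},
        ],
    },
    {
        "instruction": "Answer briefly.",
        "input": "question B",
        "candidates": [
            {"model": "model-x", "text": "answer B from x"},
            {"model": "model-y", "text": "answer B from y"},
        ],
    },
]

# One entry per example
mock_inputs = [x["input"] for x in mock_examples]

# One inner list per example, one item per candidate model
mock_texts = [[c["text"] for c in x["candidates"]] for x in mock_examples]
mock_models = [[c["model"] for c in x["candidates"]] for x in mock_examples]

print(len(mock_inputs), "inputs,", len(mock_texts[0]), "candidates per input")
print(mock_models[0][1], "->", mock_texts[0][1])
```

`mock_texts[i][j]` and `mock_models[i][j]` refer to the same candidate, which is the indexing used for `candidates_texts` and `candidate_models` in the cells that follow.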


In [36]:
# prints candidate 1 for input 1:  
print("Candidate 1 for input 1, generated with", candidate_models[0][0],':\n', candidates_texts[0][0])


Candidate 1 for input 1, generated with oasst-sft-4-pythia-12b-epoch-3.5 :
 A skeptic is someone who doubts or expresses doubt about a claim or idea without being dismissive of it. They are open-minded and approach evidence with an open mind, searching for reasonable explanations and evidence to support their beliefs.

A denier, on the other hand, is someone who actively works to deny or ignore evidence that contradicts their beliefs. They are often characterized by a closed mind and an unwillingness to consider alternative perspectives. They may also use rhetoric or false claims to try to discredit the evidence.

In [37]:
# prints candidate 2 for input 1:  
print("Candidate 2 for input 1, generated with", candidate_models[0][1],':\n', candidates_texts[0][1])

Candidate 2 for input 1, generated with koala-7B-HF :
 Can you explain?
5.   I've also noticed that some people who are skeptical about climate change also tend to be skeptical about other scientific subjects, like evolution. Can you explain that?
6.   What evidence have you seen that supports the theory of evolution?

These are just a few examples of questions that a journalist might ask to gather additional information about someone's skepticism about climate change. It's important for journalists to do their own research and fact-checking to ensure that their stories are accurate and balanced.


In [39]:
# Optionally inspect the raw examples from the MixInstruct dataset

# print(few_examples)



In [17]:
print(candidates_texts[0][2])

Here are a few definitions that I found online:
Skeptic: a person who seeks to acquire and validate knowledge by investigation and analysis, especially of a scientific or mathematical nature.
Denier: a person who deliberately refuses to accept facts or evidence that contradict their beliefs.
It looks like a skeptic is someone who is open to looking at evidence and facts, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs. I guess that means a skeptic can be wrong, but a denier will never change their mind.
I think it's important to keep an open mind when it comes to facts and evidence, so I guess I'm a skeptic. What about you?
I'm always interested in learning new things, and I love when facts and evidence contradict my own beliefs. That's when I know I'm really learning something!


## Use case 1: Using LLM-Blender for ranking
With the `rank` function, LLM-Blender ranks the candidates for each input through pairwise comparisons and returns the ranks. The ranker's ranks are highly correlated with ChatGPT ranks.
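
To make the idea of ranking through pairwise comparisons concrete, here is a toy sketch that turns a matrix of pairwise outcomes into ranks by counting wins. The function `ranks_from_pairwise_wins` and the outcome matrix are hypothetical and only illustrate one simple aggregation scheme; this is not a description of how the ranker aggregates comparisons internally.

```python
import numpy as np

def ranks_from_pairwise_wins(better: np.ndarray) -> np.ndarray:
    """better[i, j] is True when candidate i beats candidate j.

    Returns 1-based ranks, where rank 1 goes to the candidate with the most wins.
    """
    wins = better.sum(axis=1)
    # Sort by descending win count; a stable sort breaks ties by candidate index
    order = np.argsort(-wins, kind="stable")
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

# Made-up outcomes for three candidates: 0 beats 1 and 2, 2 beats 1
better = np.array([
    [False, True,  True],
    [False, False, False],
    [False, True,  False],
])

toy_ranks = ranks_from_pairwise_wins(better)
print(toy_ranks)
```

The returned array uses the same convention as `blender.rank` in the cells below: position `j` holds the rank of candidate `j`.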

In [61]:
# Ranks the 8 example inputs taken from the mixinstruct dataset
# For every input, 12 open source models generated their candidate text

ranks = blender.rank(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=1)

Ranking candidates: 100%|██████████| 8/8 [00:33<00:00,  4.21s/it]


In [62]:
# # inspect inputs to understand the below cell
# print(inputs[0])

# # inspect candidate texts as well
# print(candidates_texts)

# print(candidates_texts[0])


In [63]:
print("Input 1:", inputs[0])

print("Ranks for input 1:", ranks[0]) # ranks[0][j] is the rank of candidate j for input 1 (1 = best)

Input 1: I've always wondered what the difference is between a skeptic and a denier.
Ranks for input 1: [ 1 11  2  8  7 10  3 12  5  4  9  6]


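In the output above, `ranks[0][j]` is the rank assigned to candidate `j` for input 1, with 1 being the best. The first candidate is therefore ranked 1st, the second candidate 11th, the third candidate 2nd, and so on. The ranker derives these ranks from pairwise comparisons between the candidates.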
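
To read the candidates in order of quality, the rank array can be inverted with `np.argsort`. The sketch below uses the ranks printed above for input 1 and assumes that position `j` of the array holds the rank of candidate `j` (1 = best).

```python
import numpy as np

# Ranks for input 1, copied from the output above
ranks_input_1 = np.array([1, 11, 2, 8, 7, 10, 3, 12, 5, 4, 9, 6])

# argsort returns candidate indices ordered from best (rank 1) to worst (rank 12)
best_to_worst = np.argsort(ranks_input_1)

print("Best candidate (0-based index):", best_to_worst[0])
print("Candidates from best to worst (1-based):", (best_to_worst + 1).tolist())
```

With these ranks the best three candidates are the 1st, 3rd, and 7th, and the second candidate comes second to last.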

Next, let's look at how the `compare` method of LLM-Blender can be used to compare two candidates directly:


In [64]:
# import numpy as np
# llm_ranks_map, gpt_cmp_results = get_ranks_from_chatgpt_cmps(few_examples)
# gpt_ranks = np.array(list(llm_ranks_map.values())).T
# print("Correlation with ChatGPT")
# print("------------------------")
# for cor_name, cor_func in COR_MAPS.items():
#     print(cor_name, cor_func(ranks, gpt_ranks))
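
The commented-out cell above applies the correlation functions in `COR_MAPS` to the ranker's ranks and the ChatGPT ranks. As an illustration of what such a check computes, here is a minimal Spearman rank correlation in NumPy. The helper `spearman_no_ties` and both rank lists are made up for this example; it is not the library's implementation, and it assumes the ranks contain no ties.

```python
import numpy as np

def spearman_no_ties(ranks_a, ranks_b) -> float:
    """Spearman correlation for two rankings of the same items, assuming no ties."""
    a = np.asarray(ranks_a, dtype=float)
    b = np.asarray(ranks_b, dtype=float)
    n = len(a)
    squared_diff = np.sum((a - b) ** 2)
    return 1.0 - 6.0 * squared_diff / (n * (n ** 2 - 1))

# Hypothetical ranks from two judges over four candidates
judge_1 = [1, 2, 3, 4]
judge_2 = [1, 3, 2, 4]

rho = spearman_no_ties(judge_1, judge_2)
print(f"Spearman correlation: {rho:.2f}")
```

A value of 1 means the two rankings agree exactly, and -1 means one is the reverse of the other.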

## Use case 2: Using LLM-blender to directly compare two candidates

In [67]:
# candidates A: answers to the 8 example questions from the first candidate model (oasst-sft-4-pythia-12b-epoch-3.5 for input 1)
candidates_A = [x['candidates'][0]['text'] for x in few_examples]

# candidates B: answers to the 8 example questions from the second candidate model (koala-7B-HF for input 1)
candidates_B = [x['candidates'][1]['text'] for x in few_examples]



In [68]:
# blender.compare directly compares the 8 texts in candidates_A with the 8 texts in candidates_B.

# comparison_results is a list of bools, where comparison_results[i] indicates whether
# candidates_A[i] is better than candidates_B[i] for inputs[i], e.g. comparison_results[0] --> True

comparison_results = blender.compare(
    inputs, candidates_A, candidates_B, instructions=insts,
    batch_size=2, return_logits=False)
print("Comparison results for inputs:", comparison_results) # one result per input

Ranking candidates: 100%|██████████| 4/4 [00:00<00:00, 13.62it/s]

Comparison results for inputs: [ True  True  True  True False  True  True  True]
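
Since the result is a boolean array, a win rate for the A candidates follows directly from its mean. The sketch below copies the array from the output above into a separate variable (`results_from_output`, a name introduced here) so that it runs on its own.

```python
import numpy as np

# Comparison results copied from the output above (True = candidate A preferred)
results_from_output = np.array([True, True, True, True, False, True, True, True])

wins_A = int(results_from_output.sum())
win_rate_A = results_from_output.mean()

print(f"A preferred for {wins_A} of {len(results_from_output)} inputs ({win_rate_A:.1%})")
```

Here the A candidates are preferred for 7 of the 8 inputs; only the fifth input favors the B candidate.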





## Use case 3: Using LLM-Blender for fused generation
Fusing the top-ranked candidates selected by the ranker can produce outputs of higher quality.

In [71]:
# from accelerate import Accelerator

blender.loadfuser("llm-blender/gen_fuser_3b")

from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks

topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)

# print(topk_candidates)

fuse_generations = blender.fuse(inputs, topk_candidates, instructions=insts, batch_size=2)

# print("fuse_generations for input 1:", fuse_generations[0])

Fusing candidates: 100%|██████████| 4/4 [00:11<00:00,  2.83s/it]


In [70]:
print("fuse_generations for input 1:", fuse_generations[0])

fuse_generations for input 1: A skeptic is someone who is open to questioning and evaluating claims, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs. So, a skeptic is someone who is open to questioning and evaluating claims, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs.
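
For intuition about the selection step that precedes fusion, the sketch below picks the `top_k` best-ranked candidates per input with plain NumPy. The function `topk_by_rank` is a hypothetical stand-in written for this example, and the candidate strings are placeholders; it is not the implementation of `get_topk_candidates_from_ranks`, whose exact behavior (for example, the order of the returned candidates) is not shown here.

```python
import numpy as np

def topk_by_rank(ranks, candidates, top_k=3):
    """For each input, return the top_k candidates ordered from best rank to worst."""
    selected = []
    for rank_row, cand_row in zip(ranks, candidates):
        # Indices of the top_k smallest ranks (rank 1 = best)
        order = np.argsort(rank_row)[:top_k]
        selected.append([cand_row[i] for i in order])
    return selected

# Ranks for input 1 from the ranking output earlier; candidate texts are placeholders
ranks_demo = [[1, 11, 2, 8, 7, 10, 3, 12, 5, 4, 9, 6]]
cands_demo = [[f"candidate_{j}" for j in range(1, 13)]]

topk_demo = topk_by_rank(ranks_demo, cands_demo, top_k=3)
print(topk_demo)
```

Only these selected texts are passed to the fuser, which is why the quality of the ranking step matters for the fused output.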


In [15]:
# Or rank and fuse in a single call (the fuser must be loaded first with blender.loadfuser)
fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=2, top_k=3)

Ranking candidates: 100%|██████████| 4/4 [00:21<00:00,  5.25s/it]


AttributeError: 'Blender' object has no attribute 'fuser'

In [None]:
import numpy as np
from llm_blender.common.evaluation import overall_eval

metrics = ['bartscore']
targets = [x['output'] for x in few_examples]
scores = overall_eval(fuse_generations, targets, metrics)

print("Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
# collect the precomputed per-candidate scores for each model
llms = [x['model'] for x in few_examples[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}
for ex in few_examples:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    for metric, llm_scores in scores_map.items():
        print("  ", metric+":", "{:.4f}".format(np.mean(llm_scores)))


## Use case 4: Use LLM-Blender for decoding enhancement (best-of-n sampling)


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", device_map="auto")

system_message = {
    "role": "system",
    "content": "You are a friendly chatbot who always responds in the style of a pirate",
}
messages = [
    [   
        system_message,
        {"role": "user", "content": _inst + "\n" + _input},
    ]
    for _inst, _input in zip(insts, inputs)
]
prompts = [tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True) for m in messages]
outputs = blender.best_of_n_generate(model, tokenizer, prompts, n=10)
print("### Prompt:")
print(prompts[0])
print("### best-of-n generations:")
print(outputs[0])
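
The general idea behind best-of-n sampling can be sketched without any model: draw `n` candidate outputs for a prompt, score each one, and keep the highest-scoring candidate. Everything in the block below (`best_of_n`, `toy_sample`, `toy_score`) is a hypothetical stand-in; the toy scorer simply prefers longer strings and takes the place of a learned ranker. It is not a description of how `best_of_n_generate` is implemented.

```python
import random
from typing import Callable

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 10) -> str:
    """Draw n candidates for the prompt and return the one with the highest score."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda cand: score(prompt, cand))

# Toy sampler: stands in for sampling from a language model
rng = random.Random(0)

def toy_sample(prompt: str) -> str:
    return prompt + " " + "arr" * rng.randint(1, 5)

# Toy scorer: stands in for a ranker; it just prefers longer outputs
def toy_score(prompt: str, candidate: str) -> float:
    return float(len(candidate))

best = best_of_n("Ahoy", toy_sample, toy_score, n=10)
print(best)
```

In the cell above, the sampler role is played by the Zephyr model and the scoring role by the loaded ranker.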


## Use case 5: Use PairRM for RLHF tuning