# LLM-Blender in Amazon SageMaker

This notebook will show you how to work with LLM Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). 

LLMs such as GPT-4 and Claude are closed-source, which limits insight into their architectures and training data. Open-source LLMs such as LLaMA and Flan-T5, however, can be fine-tuned on custom instruction datasets, enabling the development of smaller yet efficient LLMs. As demonstrated in the LLM Blender paper, the best LLM can vary significantly from one example to the next, and no single open-source LLM wins for every use case. Given these diverse strengths and weaknesses, it can be beneficial to use an ensembling method that combines their complementary potential. This can improve robustness, generalization, and accuracy while alleviating the biases, errors, and uncertainties of individual LLMs, resulting in outputs better aligned with human preferences.

LLM Blender is such an ensembling method. Its ranker, PairRM, is what sets it apart: unlike other reward models (RMs), which encode and score each candidate individually, PairRM takes a pair of candidates and compares them side by side to identify the subtle differences between them.

The LLM Blender framework consists of 2 components:
1. A Ranker (PairRM): this model ranks the responses from multiple foundation models (FMs) through pairwise comparison (y1 + y2, y1 + y3, etc.) and selects the top K candidates. It jointly encodes the input text and a pair of candidates, using a cross-attention encoder to determine the superior one. PairRM is based on microsoft/deberta-v3-large and is therefore compact, at roughly 0.4B parameters.

2. A Fuser (GenFuser): this model fuses the top-ranked responses into a single output. GenFuser is essentially a fine-tuned Flan-T5-XL model that takes the input prompt (x) together with the top K responses ranked by PairRM and generates a fused response from them.
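
The ranking step can be pictured with a toy sketch. The comparator below (`toy_prefers`, which simply prefers the longer answer) and the win-count aggregation are assumptions made for illustration only. PairRM uses a learned model to compare pairs, and its actual aggregation scheme may differ.

```python
from itertools import combinations


def rank_by_pairwise_wins(candidates, prefers):
    """Return 1-based ranks (1 = best) derived from all pairwise comparisons."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Each pair is compared once; the preferred candidate gets one win
        if prefers(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Order candidate indices by win count, most wins first
    order = sorted(range(len(candidates)), key=lambda idx: -wins[idx])
    result = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def toy_prefers(a, b):
    # Hypothetical stand-in for a learned comparator: prefer the longer answer
    return len(a) > len(b)


toy_candidates = ["ok", "a detailed answer", "short one"]
toy_ranks = rank_by_pairwise_wins(toy_candidates, toy_prefers)
print(toy_ranks)
```

The returned list follows the same convention as the ranks used later in this notebook: position j holds the rank of candidate j.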


This notebook was tested on an ml.g5.12xlarge notebook instance with a PyTorch 2.0.1 GPU-optimized image.

In [31]:
# Install LLM Blender in editable mode (run from the root of the cloned LLM-Blender repository)
!pip install -e . -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



In [2]:
import llm_blender
blender = llm_blender.Blender()

# load Ranker
blender.loadranker("llm-blender/PairRM") # load ranker checkpoint

# Load Fuser
blender.loadfuser("llm-blender/gen_fuser_3b") # load fuser checkpoint if you want to use pre-trained fuser; or you can use ranker only



## Load the MixInstruct dataset

To experiment with LLM Blender, we will use the MixInstruct dataset. From this dataset we use the 'input' column, which contains a prompt or question we can ask an FM, and the 'candidates' column, which for every entry contains a list of dictionaries in JSON format. This list holds the answers of 12 different open-source models to the input question. We can therefore use this dataset to test LLM Blender: we rank and fuse the responses from different FMs to an input prompt, which will likely yield a more relevant response.

To learn more about MixInstruct, you can take a deeper look here: https://huggingface.co/datasets/llm-blender/mix-instruct?row=0
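
To make the record layout concrete, here is a minimal sketch of a single entry. The field names ('instruction', 'input', 'output', 'candidates', and per candidate 'model', 'text', 'scores') are the ones this notebook reads; all values are hypothetical placeholders, not real dataset content.

```python
# Hypothetical record mimicking the fields this notebook reads from MixInstruct
toy_example = {
    "instruction": "",
    "input": "What is 2 + 2?",
    "output": "4",  # reference answer, used later as the evaluation target
    "candidates": [
        {"model": "model-a", "text": "4", "scores": {"bartscore": -1.0}},
        {"model": "model-b", "text": "5", "scores": {"bartscore": -3.0}},
    ],
}

# Same extraction pattern as the cell below, applied to one record
toy_texts = [cand["text"] for cand in toy_example["candidates"]]
toy_models = [cand["model"] for cand in toy_example["candidates"]]
print(toy_models, toy_texts)
```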

In [47]:
import datasets
import json

# Load the test split of the MixInstruct dataset 
mixinstruct_test = datasets.load_dataset("llm-blender/mix-instruct", split="test", streaming=True)

# Take 8 examples from MixInstruct dataset 
few_examples = list(mixinstruct_test.take(8))

insts = [x['instruction'] for x in few_examples]
inputs = [x['input'] for x in few_examples]

candidates_texts = [[cand['text'] for cand in x['candidates']] for x in few_examples]

# List with models used to generate the candidate texts

candidate_models = [[cand['model'] for cand in x['candidates']] for x in few_examples]
# print(candidate_models)

# Print the first input question from the MixInstruct (test) dataset.
print("Input 1:\n", inputs[0])



Input 1:
 I've always wondered what the difference is between a skeptic and a denier.


---------

If you want, you can inspect the dataset and print all candidate responses (i.e. the responses to the input from all 12 open-source models in the dataset). However, this is a very long list, so we will just look at the candidate responses of model 1 (oasst-sft-4-pythia-12b-epoch-3.5), model 2 (koala-7B-HF) and model 3 (alpaca-native).

In [41]:
# print all candidate responses to input 1 
# print("All Candidate responses",':\n', candidates_texts[0])

# prints candidate 1 for input 1:  
print("Candidate 1 for input 1, generated with", candidate_models[0][0],':\n', candidates_texts[0][0])


Candidate 1 for input 1, generated with oasst-sft-4-pythia-12b-epoch-3.5 :
 A skeptic is someone who doubts or expresses doubt about a claim or idea without being dismissive of it. They are open-minded and approach evidence with an open mind, searching for reasonable explanations and evidence to support their beliefs.

A denier, on the other hand, is someone who actively works to deny or ignore evidence that contradicts their beliefs. They are often characterized by a closed mind and an unwillingness to consider alternative perspectives. They may also use rhetoric or false claims to try to discredit the evidence.


In [42]:
# prints candidate 2 for input 1:  
print("Candidate 2 for input 1, generated with", candidate_models[0][1],':\n', candidates_texts[0][1])

Candidate 2 for input 1, generated with koala-7B-HF :
 Can you explain?
5.   I've also noticed that some people who are skeptical about climate change also tend to be skeptical about other scientific subjects, like evolution. Can you explain that?
6.   What evidence have you seen that supports the theory of evolution?

These are just a few examples of questions that a journalist might ask to gather additional information about someone's skepticism about climate change. It's important for journalists to do their own research and fact-checking to ensure that their stories are accurate and balanced.


In [43]:
# prints candidate 3 for input 1:  
print("Candidate 3 for input 1, generated with", candidate_models[0][2],':\n', candidates_texts[0][2])


Candidate 3 for input 1, generated with alpaca-native :
 Here are a few definitions that I found online:
Skeptic: a person who seeks to acquire and validate knowledge by investigation and analysis, especially of a scientific or mathematical nature.
Denier: a person who deliberately refuses to accept facts or evidence that contradict their beliefs.
It looks like a skeptic is someone who is open to looking at evidence and facts, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs. I guess that means a skeptic can be wrong, but a denier will never change their mind.
I think it's important to keep an open mind when it comes to facts and evidence, so I guess I'm a skeptic. What about you?
I'm always interested in learning new things, and I love when facts and evidence contradict my own beliefs. That's when I know I'm really learning something!


## Rank LLM outputs with the Ranker.

Now we rank the candidate outputs from the 12 models in the MixInstruct dataset. For each of the 8 example inputs, PairRM ranks the candidates through pairwise comparisons when we call blender.rank.

Then, we will inspect the ranking for input 1 ('I've always wondered what the difference is between a skeptic and a denier.').

In [44]:
# Ranks the 8 example inputs taken from the mixinstruct dataset
# For every input, 12 open source models generated their candidate text

ranks = blender.rank(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=1)

Ranking candidates: 100%|██████████| 8/8 [00:33<00:00,  4.25s/it]


In [12]:
# Input 1
print("Input 1:", inputs[0])

# print the ranking for input 1.
print("Ranks for input 1:", ranks[0]) # ranks[0][j] is the rank of candidate j for input 1 (1 = best)

Input 1: I've always wondered what the difference is between a skeptic and a denier.
Ranks for input 1: [ 1 11  2  8  7 10  3 12  5  4  9  6]


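In the output above, ranks[0][j] is the rank assigned to candidate j, with 1 being the best. For input 1, the ranker therefore considers the candidate text from model 1 the most suitable (rank 1), followed by model 3 (rank 2) and model 7 (rank 3), whereas the text from model 2 is ranked 11th.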
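
Because position j of the ranks list holds the rank of candidate j, a best-first ordering can be recovered by sorting the candidate indices by rank. The sketch below uses the ranks printed above and the model order that appears in this notebook's dataset sample; the variable names are illustrative.

```python
# Ranks for input 1, copied from the output above: position j holds the rank of candidate j
ranks_input_1 = [1, 11, 2, 8, 7, 10, 3, 12, 5, 4, 9, 6]

# Model names in candidate order, as they appear in this notebook's dataset sample
model_names = [
    "oasst-sft-4-pythia-12b-epoch-3.5", "koala-7B-HF", "alpaca-native",
    "llama-7b-hf-baize-lora-bf16", "flan-t5-xxl", "stablelm-tuned-alpha-7b",
    "vicuna-13b-1.1", "dolly-v2-12b", "moss-moon-003-sft",
    "chatglm-6b", "mpt-7b", "mpt-7b-instruct",
]

# Sort candidate indices by rank, so the rank-1 candidate comes first
best_first = sorted(range(len(ranks_input_1)), key=lambda j: ranks_input_1[j])
top3_models = [model_names[j] for j in best_first[:3]]
print(top3_models)
```

With NumPy available, np.argsort(ranks[0]) yields the same ordering when all ranks are distinct.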

## Directly compare two candidates

Let's now look at how we can use the compare method of LLM Blender to directly compare two candidates:

In [49]:
# candidates A: answers for the 8 example questions generated by model 1 (oasst-sft-4-pythia-12b-epoch-3.5)
candidates_A = [x['candidates'][0]['text'] for x in few_examples]

# candidates B: answers for the 8 example questions generated by model 2 (koala-7B-HF)
candidates_B = [x['candidates'][1]['text'] for x in few_examples]

# print(candidates_A)
#print(candidates_B)


In [15]:
# blender.compare directly compares the 8 generated texts of model 1 to the 8 generated texts of model 2.
# comparison_results is a list of bool, where comparison_results[i] denotes whether candidates_A[i] is better than candidates_B[i] for inputs[i]

comparison_results = blender.compare(
    inputs, candidates_A, candidates_B, instructions=insts, 
    batch_size=2, return_logits=False)
print("Comparison results for inputs:", comparison_results) # comparison results for all 8 inputs

Ranking candidates: 100%|██████████| 4/4 [00:00<00:00, 13.90it/s]

Comparison results for inputs: [ True  True  True  True False  True  True  True]
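
The boolean results can be condensed into a simple win rate for model 1 over model 2. A minimal sketch, using the values printed above (the variable names are illustrative):

```python
# Comparison results copied from the output above (True = candidate A preferred)
results_a_vs_b = [True, True, True, True, False, True, True, True]

# True counts as 1 when summed, so this is the number of inputs where A wins
a_wins = sum(results_a_vs_b)
win_rate = a_wins / len(results_a_vs_b)
print(f"Model 1 preferred in {a_wins}/{len(results_a_vs_b)} comparisons ({win_rate:.1%})")
```

On such a small sample this is only indicative; a larger slice of the dataset would be needed for a meaningful comparison.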





## Fuse the top K candidates with GenFuser


In [16]:
# from accelerate import Accelerator

# blender.loadfuser("llm-blender/gen_fuser_3b")

from llm_blender.blender.blender_utils import get_topk_candidates_from_ranks

# First, we need to get the top K candidates from our 12 candidate responses in candidates_texts. We set K to 3 here.

topk_candidates = get_topk_candidates_from_ranks(ranks, candidates_texts, top_k=3)
# print(topk_candidates)

# Fuse the top K candidates with the fuse method of LLM Blender.
fuse_generations = blender.fuse(inputs, topk_candidates, instructions=insts, batch_size=2)

# print("fuse_generations for input 1:", fuse_generations[0])

Fusing candidates: 100%|██████████| 4/4 [01:00<00:00, 15.25s/it]


In [17]:
print("fuse_generations for input 1:", fuse_generations[0])

fuse_generations for input 1: A skeptic is someone who is open to questioning and evaluating claims, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs. So, a skeptic is someone who is open to questioning and evaluating claims, while a denier is someone who actively refuses to accept evidence that contradicts their beliefs.


----
The above output is a fusion of the top 3 candidates for our input question 'I've always wondered what the difference is between a skeptic and a denier.' We will evaluate this output in the Evaluation section.

In [18]:
# Alternatively, rank and fuse in a single call
fuse_generations, ranks = blender.rank_and_fuse(inputs, candidates_texts, instructions=insts, return_scores=False, batch_size=2, top_k=3)

Ranking candidates: 100%|██████████| 4/4 [00:20<00:00,  5.24s/it]
Fusing candidates: 100%|██████████| 4/4 [01:00<00:00, 15.17s/it]


## Evaluation

Next, we evaluate whether fusing the top-ranked candidates from the ranker yields outputs of higher quality. We compare the BARTScore of the fused responses against the BARTScore of each individual model's outputs. BARTScore is based on log-likelihood, so higher (less negative) values are better.

In [29]:
from llm_blender.common.evaluation import overall_eval
import numpy as np

metrics = ['bartscore']
targets = [x['output'] for x in few_examples]
scores = overall_eval(fuse_generations, targets, metrics)

print("Fusion Scores")
for key, value in scores.items():
    print("  ", key+":", np.mean(value))

print("LLM Scores")
llms = [x['model'] for x in few_examples[0]['candidates']]
llm_scores_map = {llm: {metric: [] for metric in metrics} for llm in llms}
for ex in few_examples:
    for cand in ex['candidates']:
        for metric in metrics:
            llm_scores_map[cand['model']][metric].append(cand['scores'][metric])
for i, (llm, scores_map) in enumerate(llm_scores_map.items()):
    print(f"{i} {llm}")
    for metric, llm_scores in scores_map.items():
        print("  ", metric+":", "{:.4f}".format(np.mean(llm_scores)))


bart_score.pth trained on ParaBank not found.
Please download bart_score.pth from bartscore github repo, then put it here:  /root/LLM-Blender/llm_blender/common/bart_score.pth
Using the default bart-large-cnn model instead.


Evaluating bartscore: 100%|██████████| 8/8 [00:00<00:00, 45.97it/s]

Fusion Scores
   bartscore: -2.789272427558899
LLM Scores
0 oasst-sft-4-pythia-12b-epoch-3.5
   bartscore: -3.8071
1 koala-7B-HF
   bartscore: -4.5505
2 alpaca-native
   bartscore: -4.2063
3 llama-7b-hf-baize-lora-bf16
   bartscore: -3.9364
4 flan-t5-xxl
   bartscore: -4.9341
5 stablelm-tuned-alpha-7b
   bartscore: -4.4329
6 vicuna-13b-1.1
   bartscore: -4.2022
7 dolly-v2-12b
   bartscore: -4.4400
8 moss-moon-003-sft
   bartscore: -3.5876
9 chatglm-6b
   bartscore: -3.7075
10 mpt-7b
   bartscore: -4.1353
11 mpt-7b-instruct
   bartscore: -4.2827
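
To read these numbers at a glance, the fused score can be compared with the best individual model. A minimal sketch, using the mean scores printed above (rounded to four decimals; the variable names are illustrative):

```python
# Mean BARTScore values copied from the output above
fusion_score = -2.7893
reported_llm_scores = {
    "oasst-sft-4-pythia-12b-epoch-3.5": -3.8071,
    "koala-7B-HF": -4.5505,
    "alpaca-native": -4.2063,
    "llama-7b-hf-baize-lora-bf16": -3.9364,
    "flan-t5-xxl": -4.9341,
    "stablelm-tuned-alpha-7b": -4.4329,
    "vicuna-13b-1.1": -4.2022,
    "dolly-v2-12b": -4.4400,
    "moss-moon-003-sft": -3.5876,
    "chatglm-6b": -3.7075,
    "mpt-7b": -4.1353,
    "mpt-7b-instruct": -4.2827,
}

# Higher (less negative) is better, so the best model has the maximum score
best_llm = max(reported_llm_scores, key=reported_llm_scores.get)
margin = fusion_score - reported_llm_scores[best_llm]
print(f"Best single model: {best_llm} ({reported_llm_scores[best_llm]:.4f})")
print(f"Fusion improves on it by {margin:.4f}")
```

Two caveats apply. The fused score was computed with the fallback bart-large-cnn model (see the warning above), while the per-model scores come precomputed from the dataset, so the two may not be strictly like-for-like. The comparison also covers only 8 examples.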



