# **Evaluating an LLM on a Domain-specific Benchmark Task**

*Note: You'll need approximately 15.5 GB of GPU RAM to run this notebook. I ran it on Colab Pro with High-RAM enabled and a V100.*

In this notebook:
- **Domain**: Astronautics
- **Task**: Multiple-Choice Question Answering (MCQA)
- **Dataset**: AstroMCQA

### **Content**
1. Formatting an existing dataset hosted on Hugging Face to feed an LLM for evaluation
2. Loading a Hugging Face LLM and collecting its outputs
3. Evaluating the LLM with multiple scoring strategies
    - string character-based distance
    - embedding-based distance
    - F1 and exact match
    - LLM-as-a-Judge
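
As a rough illustration of the first strategy in the list above, the sketch below scores a prediction against several acceptable references with a character-based similarity and keeps the best match. It uses `difflib.SequenceMatcher` from the standard library as a stand-in, and the function name `best_char_similarity` is hypothetical; the scoring code used for the actual evaluation may differ.

```python
from difflib import SequenceMatcher


def best_char_similarity(prediction, references):
    """Return the highest character-level similarity (0.0 to 1.0) over all references."""
    pred = prediction.strip().lower()
    return max(
        SequenceMatcher(None, pred, ref.strip().lower()).ratio()
        for ref in references
    )


# Toy references in the style used later in this notebook
refs = ["A, D", "A. Count the number of sun spots"]
score = best_char_similarity(" A, D", refs)
print(score)
```

Taking the maximum over references means a model is not penalised for choosing one acceptable answer format over another.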

In [1]:
!pip install transformers datasets accelerate evaluate langchain rapidfuzz sentence-transformers openai tiktoken bitsandbytes --upgrade --quiet

## Imports

In [2]:
import pandas as pd
import ast
import torch
from transformers import pipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from google.colab import userdata

## Tokens and API KEYs
**Important**
- Make sure you have configured an `HF_WRITE_TOKEN`, i.e., a Hugging Face token with read and write permissions, in your Colab secrets. You can get one from your account at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
- Make sure you have also configured your `OPENAI_API_KEY` as a secret in Colab secrets.
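
If you run this notebook outside Colab, `google.colab.userdata` is not available. A minimal sketch of a fallback, assuming the same secret names are exported as environment variables (the helper `get_secret` is hypothetical and not part of this notebook):

```python
import os


def get_secret(name):
    """Read a secret from Colab if available, otherwise from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret not defined or notebook access not granted: fall back to env.
        return os.environ.get(name)


hf_token = get_secret("HF_WRITE_TOKEN")  # None if the secret is not configured
```

Returning `None` rather than raising keeps the helper simple; check the result before passing it to `login`.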

In [3]:
# Login to huggingface
from huggingface_hub import login
login(token=userdata.get('HF_WRITE_TOKEN'), add_to_git_credential=True, write_permission=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Configuration
Select the model you would like to evaluate. This notebook is configured to evaluate `Mistral-7B-Instruct-v0.2` on AstroMCQA.

In [4]:
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
TORCH_DTYPE = torch.bfloat16 # keep this to mitigate vRAM usage.
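
A quick back-of-the-envelope estimate shows why the dtype matters. The parameter count used here (about 7.24 billion) is an assumption for illustration and is not stated in this notebook; activations and the KV cache come on top of the weights.

```python
# Rough weight-memory estimate: number of parameters x bytes per parameter.
N_PARAMS = 7.24e9  # assumed parameter count, for illustration only
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_gb(n_params, dtype):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(N_PARAMS, dtype):.1f} GB")
```

Under that assumption, halving the bytes per parameter halves the memory taken by the weights, which is what makes a 7B model fit on a single 16 GB GPU.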

Here we load the dataset from the Hugging Face Hub. Because it is small (only 200 expert-annotated questions and answers), it has no train/val/test splits: all the data sit in a single `train` split. With **AstroMCQA**, this `train` split is the one you use for evaluation.

In [16]:
from datasets import load_dataset
# The dataset has a single split, called `train`, which we use for evaluation rather than training.
ds = load_dataset("patrickfleith/Astro-mcqa", split='train')

# Load Model

In [6]:
device = "cuda" # the device to load the model onto
pipe = pipeline(task='text-generation', model=MODEL_ID, device=device, torch_dtype=TORCH_DTYPE)

In [12]:
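def release_gpu_from(items_list):
    """Utility to drop the references held in `items_list` and release cached GPU memory.

    The caller must also drop its own references (e.g. `del pipe`) beforehand,
    otherwise the objects stay alive and their memory cannot be freed.
    """
    import gc

    items_list.clear()        # drop the references held by the list
    gc.collect()              # collect objects that are no longer referenced
    torch.cuda.empty_cache()  # return cached blocks to the GPU driver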

In [126]:
release_gpu_from(items_list=[])
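
One caveat when freeing GPU memory from a helper function: deleting a name inside the function (for example `del item`) only removes that local name, so the object stays alive as long as the caller still holds a reference. A minimal sketch with a placeholder class (`Dummy` stands in for a model; all names here are hypothetical) illustrates the behaviour:

```python
import gc
import weakref


class Dummy:
    """Stand-in for a large object such as a model."""


def drop_local(items):
    for item in items:
        del item  # removes only the local name `item`


obj = Dummy()
ref = weakref.ref(obj)  # lets us observe whether the object is still alive

drop_local([obj])
alive_after_helper = ref() is not None  # the caller's `obj` still references it

del obj  # drop the caller's own reference
gc.collect()
alive_after_del = ref() is not None
print(alive_after_helper, alive_after_del)
```

In practice this means running `del pipe` (or rebinding the name) in the notebook itself before calling `torch.cuda.empty_cache()`.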

# Dataset Preparation
- For this simple evaluation, we keep only the answerable questions.
- Then we construct an instruction prompt template for this task.
- For each instance, we use the instruction prompt template to create a specific prompt that includes the question and the propositions submitted to the model.
- We generate a set of 'ground truth' references. Evaluating LLMs is tricky because a correct answer can be given in different formats, which is why we generate multiple acceptable references (i.e., different formulations of the ground truth).
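
To make the idea of multiple acceptable references concrete, here is a minimal sketch that scores one prediction against every reference and keeps the best exact-match and token-level F1. The helper names and the normalization rule (lowercase, punctuation removed) are assumptions for illustration, not the scoring code used for the actual evaluation.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def token_f1(prediction, reference):
    """Token-level F1 between a prediction and one reference."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score_against_references(prediction, references):
    """Best exact-match and F1 over all acceptable references."""
    exact = max(float(normalize(prediction) == normalize(r)) for r in references)
    f1 = max(token_f1(prediction, r) for r in references)
    return {"exact_match": exact, "f1": f1}


refs = ["B", "B. False", "False"]
result = score_against_references(" B. False", refs)
print(result)
```

With a single reference such as `"B"`, the answer `"B. False"` would fail exact match even though it is correct; scoring against the whole list avoids that.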

In [17]:
data = ds.to_pandas()
# preprocessing
data = data.query('answerable==True') # just remove non-answerable items
data = data.reset_index()
data = data.rename(columns={
    'index': 'id'
})

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             200 non-null    int64 
 1   question       200 non-null    object
 2   propositions   200 non-null    object
 3   labels         200 non-null    object
 4   justification  91 non-null     object
 5   answerable     200 non-null    bool  
 6   uid            200 non-null    object
dtypes: bool(1), int64(1), object(5)
memory usage: 9.7+ KB


For each example in the dataset, we construct a prompt that combines the task instruction with the model's own chat template, so that it can be fed to the model under evaluation.
- Note: some models require a `system` role while others do not accept one, hence the `system` flag.

In [19]:
def create_prompt_hf(pipe, question, choices, system=False):

    task_instruction_str = """You're a helpful assistant. You are an expert in Multiple Choices Question Answering.
Here is a question, tell me which are the correct answer(s).
Do not elaborate, just answer with the correct answers. Do not provide any justifications.
Note: Sometimes there is just one correct answer, sometimes there are multiples propositions that are corrects."""

    user_prompt_str = f"""### QUESTION: {question}

### CHOICES:
{choices}

### ANSWER(S):"""

    if system:
        messages = [
            {
                "role": "system",
                "content": task_instruction_str,
            },
            {
                "role": "user",
                "content": user_prompt_str
            },
        ]
    else:
        messages = [
            {
                "role": "user",
                "content": task_instruction_str,
            },
            {
                "role": "assistant",
                "content": "Ok, I'll only respond with the propositions that I think are correct, for instance, 'B, E'"
            },
            {
                "role": "user",
                "content": user_prompt_str
            },
        ]

    prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    return prompt

# let's test it
prompt = create_prompt_hf(pipe, question='What is the sky color?', choices='A. Blue\nB. Red')
print(prompt)

<s>[INST] You're a helpful assistant. You are an expert in Multiple Choices Question Answering.
Here is a question, tell me which are the correct answer(s).
Do not elaborate, just answer with the correct answers. Do not provide any justifications.
Note: Sometimes there is just one correct answer, sometimes there are multiples propositions that are corrects. [/INST]Ok, I'll only respond with the propositions that I think are correct, for instance, 'B, E'</s>[INST] ### QUESTION: What is the sky color?

### CHOICES:
A. Blue
B. Red

### ANSWER(S): [/INST]


You can view the dataset in a pandas DataFrame before creating the prompts and references.

In [21]:
data

Unnamed: 0,id,question,propositions,labels,justification,answerable,uid
0,0,True or False? An object that is not accelerat...,"['True', 'False', 'Impossible to say without m...","[0, 1, 0]","According to Newton's first law: ""In the absen...",True,ff0e8305-73e0-404a-a775-4e7be538d99b
1,1,Complete the following: During its orbital per...,"['remains the same', 'increases', 'decreases']","[0, 0, 1]",,True,40a73e51-5eb4-46d3-baab-8acd624f2287
2,2,Which affirmation is true when talking about E...,"['The potential energy is positive', 'The pote...","[0, 0, 1, 0]",This is because the Earth potential energy is ...,True,883f8ed7-8b2a-4671-80d3-f68791fa4118
3,3,Why do we talk about microgravity in LEO?,['All the external forces are not cancelling e...,"[0, 0, 1, 0]",,True,7667b6ef-31e5-4ff3-87e2-65f9a4467551
4,4,To what wavelength is the atmosphere the most ...,"['X-rays', 'Visible light', 'Infra Red', 'Radi...","[0, 0, 0, 1]",,True,b855830d-0498-4c21-8aca-ac69c917178c
...,...,...,...,...,...,...,...
195,195,Which of the following gas is the main compone...,"['N2', 'O2', 'CO2', 'CH4', 'Ar']","[0, 0, 1, 0, 0]",Mars Atmosphere is made up to 96% of CO2,True,80f3bfa3-5198-4c0f-a137-4c679b192a58
196,196,How long do Lunar night last near the Moon equ...,"['3.5 hours', '35 hours', '350 hours', '3500 h...","[0, 0, 1, 0]",,True,73f86516-6e00-4668-91e8-9e15ebb3912e
197,197,Which can you find on the Moon surface?,"['Liquid methane', 'Lava tubes', 'Ilmenite', '...","[0, 1, 1, 1, 0]",,True,10a3f4eb-6269-41c8-9a78-410ff0189ffb
198,198,Which mission types to Mars often include a Ve...,"['Conjunction mission', 'Cross-trapped mission...","[0, 0, 1, 0, 0]",,True,ed958ac6-616f-479d-b73c-203016fdf5d5


## Create the prompts and the references

In [22]:
references = [] # all acceptable answers
mcqa_prompts = [] # full prompts, with the chat template applied
formatted_choices_list = [] # formatted choices, with letters and newlines
llm_references = [] # the single ideal answer used as reference for LLM-as-a-Judge (we don't have the budget to test a whole list of references)
queries = [] # question + choices, without any instruction prompt or chat template; also useful for the LLM-as-a-Judge eval

for id in data.id:

    # print(f"============\n\n ID = {id}\n")

    # extract data from instance
    mcqa = data.query('id==@id')
    question = mcqa.question.values[0]
    propositions = ast.literal_eval(mcqa.propositions.values[0])
    labels = ast.literal_eval(mcqa.labels.values[0])

    # format the mcqa choices
    formatted_choices = [l + '. ' + prop for l, prop in zip('ABCDEF', propositions)]
    choices = '\n'.join(formatted_choices)
    formatted_choices_list.append(choices)

    correct_answers = [prop for idx, prop in enumerate(formatted_choices) if labels[idx]==1]

    # different acceptable text strings
    correct_completion_newline = '\n'.join(correct_answers)
    correct_completion_letter_only = ', '.join([elt[0] for elt in correct_answers])
    correct_completion_newline_noletters = '\n'.join([elt[3:] for elt in correct_answers])
    correct_completion_simple_join = ' '.join(correct_answers)
    correct_completion_simple_join_noletters = ' '.join([elt[3:] for elt in correct_answers])

    # aggregate all acceptable text strings
    reference_texts = [
        correct_completion_newline,
        correct_completion_letter_only,
        correct_completion_newline_noletters,
        correct_completion_simple_join,
        correct_completion_simple_join_noletters
    ]

    llm_references.append(correct_completion_newline)
    queries.append(question + "\n" + choices)

    # remove potential duplicates from the list
    reference_texts = list(set(reference_texts))
    references.append(reference_texts)

    ### Create prompts

    prompt_str = create_prompt_hf(pipe, question=question, choices=choices)
    mcqa_prompts.append(prompt_str)
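
To see what the loop produces for a single instance, here is a self-contained sketch of the same formatting logic applied to a toy question. The function name `build_references` and the toy data are made up for illustration.

```python
def build_references(propositions, labels):
    """Format lettered choices and list the acceptable answer strings."""
    formatted = [f"{letter}. {prop}" for letter, prop in zip("ABCDEF", propositions)]
    correct = [choice for choice, label in zip(formatted, labels) if label == 1]
    no_letters = [choice[3:] for choice in correct]  # strip the "X. " prefix
    candidates = [
        "\n".join(correct),
        ", ".join(choice[0] for choice in correct),
        "\n".join(no_letters),
        " ".join(correct),
        " ".join(no_letters),
    ]
    # dict.fromkeys removes duplicates while keeping a stable order
    return "\n".join(formatted), list(dict.fromkeys(candidates))


choices, refs = build_references(["Blue", "Red", "Green"], [1, 0, 1])
print(choices)
print(refs)
```

When only one proposition is correct, the newline-joined and space-joined variants coincide, which is why duplicates need to be removed. This sketch uses `dict.fromkeys` instead of `set` so the order of references stays deterministic.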

In [23]:
# Adds new columns
data['prompt'] = mcqa_prompts
data['references'] = references
data['query'] = queries
data['llm_reference'] = llm_references
data['formatted_choices'] = formatted_choices_list

In [None]:
data

Unnamed: 0,id,question,propositions,labels,justification,answerable,uid,prompt,references,query,llm_reference,formatted_choices
0,0,True or False? An object that is not accelerat...,"['True', 'False', 'Impossible to say without m...","[0, 1, 0]","According to Newton's first law: ""In the absen...",True,ff0e8305-73e0-404a-a775-4e7be538d99b,<s>[INST] You're a helpful assistant. You are ...,"[False, B. False, B]",True or False? An object that is not accelerat...,B. False,A. True\nB. False\nC. Impossible to say withou...
1,1,Complete the following: During its orbital per...,"['remains the same', 'increases', 'decreases']","[0, 0, 1]",,True,40a73e51-5eb4-46d3-baab-8acd624f2287,<s>[INST] You're a helpful assistant. You are ...,"[decreases, C, C. decreases]",Complete the following: During its orbital per...,C. decreases,A. remains the same\nB. increases\nC. decreases
2,2,Which affirmation is true when talking about E...,"['The potential energy is positive', 'The pote...","[0, 0, 1, 0]",This is because the Earth potential energy is ...,True,883f8ed7-8b2a-4671-80d3-f68791fa4118,<s>[INST] You're a helpful assistant. You are ...,"[C, C. The potential energy is negative, The p...",Which affirmation is true when talking about E...,C. The potential energy is negative,A. The potential energy is positive\nB. The po...
3,3,Why do we talk about microgravity in LEO?,['All the external forces are not cancelling e...,"[0, 0, 1, 0]",,True,7667b6ef-31e5-4ff3-87e2-65f9a4467551,<s>[INST] You're a helpful assistant. You are ...,"[The free-fall trajectory gets perturbed, C. T...",Why do we talk about microgravity in LEO?\nA. ...,C. The free-fall trajectory gets perturbed,A. All the external forces are not cancelling ...
4,4,To what wavelength is the atmosphere the most ...,"['X-rays', 'Visible light', 'Infra Red', 'Radi...","[0, 0, 0, 1]",,True,b855830d-0498-4c21-8aca-ac69c917178c,<s>[INST] You're a helpful assistant. You are ...,"[Radio Waves, D, D. Radio Waves]",To what wavelength is the atmosphere the most ...,D. Radio Waves,A. X-rays\nB. Visible light\nC. Infra Red\nD. ...
...,...,...,...,...,...,...,...,...,...,...,...,...
195,195,Which of the following gas is the main compone...,"['N2', 'O2', 'CO2', 'CH4', 'Ar']","[0, 0, 1, 0, 0]",Mars Atmosphere is made up to 96% of CO2,True,80f3bfa3-5198-4c0f-a137-4c679b192a58,<s>[INST] You're a helpful assistant. You are ...,"[CO2, C, C. CO2]",Which of the following gas is the main compone...,C. CO2,A. N2\nB. O2\nC. CO2\nD. CH4\nE. Ar
196,196,How long do Lunar night last near the Moon equ...,"['3.5 hours', '35 hours', '350 hours', '3500 h...","[0, 0, 1, 0]",,True,73f86516-6e00-4668-91e8-9e15ebb3912e,<s>[INST] You're a helpful assistant. You are ...,"[C, C. 350 hours, 350 hours]",How long do Lunar night last near the Moon equ...,C. 350 hours,A. 3.5 hours\nB. 35 hours\nC. 350 hours\nD. 35...
197,197,Which can you find on the Moon surface?,"['Liquid methane', 'Lava tubes', 'Ilmenite', '...","[0, 1, 1, 1, 0]",,True,10a3f4eb-6269-41c8-9a78-410ff0189ffb,<s>[INST] You're a helpful assistant. You are ...,"[Lava tubes\nIlmenite\nIron, Titanium, and Alu...",Which can you find on the Moon surface?\nA. Li...,"B. Lava tubes\nC. Ilmenite\nD. Iron, Titanium,...",A. Liquid methane\nB. Lava tubes\nC. Ilmenite\...
198,198,Which mission types to Mars often include a Ve...,"['Conjunction mission', 'Cross-trapped mission...","[0, 0, 1, 0, 0]",,True,ed958ac6-616f-479d-b73c-203016fdf5d5,<s>[INST] You're a helpful assistant. You are ...,"[Opposition mission, C, C. Opposition mission]",Which mission types to Mars often include a Ve...,C. Opposition mission,A. Conjunction mission\nB. Cross-trapped missi...


# Predictions with the LLM under evaluation

In [24]:
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

In [27]:
# You may need to change the end-of-query token depending on the model's prompt template
END_OF_QUERY_TOKEN = '[/INST]' # '<|assistant|>\n'
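
By default the text-generation pipeline returns the prompt followed by the completion, so the marker is used to cut the prompt off. A minimal sketch of that extraction step on a hand-written string (the helper `extract_response` and the sample text are hypothetical):

```python
def extract_response(generated_text, end_of_query_token="[/INST]"):
    """Return the text after the last occurrence of the end-of-query marker."""
    return generated_text.rsplit(end_of_query_token, 1)[-1].strip()


# The marker appears twice because the prompt contains two user turns.
full_text = "<s>[INST] instructions [/INST]Ok</s>[INST] question [/INST] A, D."
print(extract_response(full_text))
```

Splitting on the last occurrence matters here because the marker appears once per user turn. An alternative that avoids string splitting is to pass `return_full_text=False` to the pipeline call, which returns only the newly generated text.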

### Test prediction on one example

In [28]:
id=19
mcqa = data.query('id==@id')
prompt = mcqa.prompt.values[0]
outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
response = outputs[0]["generated_text"].split(END_OF_QUERY_TOKEN)[-1]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [29]:
print(response)

 A, D.


In [30]:
mcqa.labels

19    [1, 0, 0, 1]
Name: labels, dtype: object
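
The response and the label vector can be compared directly once the letters are mapped back to positions. The sketch below is one possible way to do it; the helper `letters_to_labels` is hypothetical and assumes the model answers with standalone capital letters, so a reply that starts with the article "A" would be misread.

```python
import re


def letters_to_labels(response, n_choices):
    """Turn a reply such as ' A, D.' into a 0/1 vector over the first n_choices letters."""
    valid = "ABCDEF"[:n_choices]
    # Standalone capital letters only, so words like "And" are not picked up.
    found = set(re.findall(r"\b([A-F])\b", response))
    return [1 if letter in found else 0 for letter in valid]


predicted = letters_to_labels(" A, D.", n_choices=4)
print(predicted == [1, 0, 0, 1])
```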

In [33]:
print("================================")
print("Acceptable answers (references):")
print("================================\n\n")
for item in mcqa.references.values[0]:
    print(item + "\n")

Acceptable answers (references):


Count the number of sun spots
Measure the Sun's flux at a wavelength of 10.7 cm

Count the number of sun spots Measure the Sun's flux at a wavelength of 10.7 cm

A, D

A. Count the number of sun spots D. Measure the Sun's flux at a wavelength of 10.7 cm

A. Count the number of sun spots
D. Measure the Sun's flux at a wavelength of 10.7 cm



### Make predictions over the whole dataset

In [34]:
predictions = []

for id in data.id:

    print(f"ID: {id}")
    try:
        mcqa = data.query('id==@id')
        prompt = mcqa.prompt.values[0]
        outputs = pipe(prompt, max_new_tokens=256, do_sample=False)
        response = outputs[0]["generated_text"].split(END_OF_QUERY_TOKEN)[-1]
    except Exception:
        # generation failed for this example; record a placeholder answer
        response = "I don't know"

    predictions.append(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 0




Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 188


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 189


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 190


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 191


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 192


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 193


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 194


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 195


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 196


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 197


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 198


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


ID: 199


In [36]:
# let's check if we have our 200 predictions
len(predictions)

200

In [37]:
# store them in a column of the dataframe
data['prediction'] = predictions

Let's now save the predictions. This is useful if you want to load them in a standalone notebook instead of running inference from scratch every time.

In [38]:
MODEL_ID

'mistralai/Mistral-7B-Instruct-v0.2'

In [39]:
model_name = MODEL_ID.replace('/', '-')
model_name

'mistralai-Mistral-7B-Instruct-v0.2'

In [70]:
# we save with pickle to keep the list format of the references column
data.to_pickle(f'{model_name}-astro-mcqa-predictions.pkl')
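
To see why pickle is used here rather than CSV, the following minimal sketch (standard library only, with a made-up `row` dictionary standing in for one dataframe row) shows that a list survives a pickle round trip unchanged, while a CSV round trip turns it into a string that has to be parsed back, for example with `ast.literal_eval`.

```python
import ast
import csv
import io
import pickle

# hypothetical stand-in for one row of the predictions dataframe
row = {"id": 0, "references": ["B", "B. False", "False"]}

# pickle round trip: the list comes back as a list
restored = pickle.loads(pickle.dumps(row))
print(type(restored["references"]).__name__)

# CSV round trip: the list is written as its string representation
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "references"])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))
print(type(csv_row["references"]).__name__)

# the string has to be parsed explicitly to recover the list
parsed = ast.literal_eval(csv_row["references"])
print(parsed == row["references"])
```

This is the same kind of parsing that `ast.literal_eval` performs later in the notebook on the `propositions` and `labels` columns.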

# Evaluations

In [71]:
pred_df = pd.read_pickle(f'/content/{model_name}-astro-mcqa-predictions.pkl')

In [72]:
pred_df.head()

| Unnamed: 0 | id | question | propositions | labels | justification | answerable | uid | prompt | references | query | llm_reference | formatted_choices | prediction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | True or False? An object that is not accelerat... | ['True', 'False', 'Impossible to say without m... | [0, 1, 0] | According to Newton's first law: "In the absen... | True | ff0e8305-73e0-404a-a775-4e7be538d99b | `<s>[INST] You're a helpful assistant. You are ...` | [B, B. False, False] | True or False? An object that is not accelerat... | B. False | A. True\nB. False\nC. Impossible to say withou... | A. False. An object at rest experiences force... |
| 1 | 1 | Complete the following: During its orbital per... | ['remains the same', 'increases', 'decreases'] | [0, 0, 1] |  | True | 40a73e51-5eb4-46d3-baab-8acd624f2287 | `<s>[INST] You're a helpful assistant. You are ...` | [decreases, C. decreases, C] | Complete the following: During its orbital per... | C. decreases | A. remains the same\nB. increases\nC. decreases | C. decreases.\n\n(Note: The orbital velocity ... |
| 2 | 2 | Which affirmation is true when talking about E... | ['The potential energy is positive', 'The pote... | [0, 0, 1, 0] | This is because the Earth potential energy is ... | True | 883f8ed7-8b2a-4671-80d3-f68791fa4118 | `<s>[INST] You're a helpful assistant. You are ...` | [C. The potential energy is negative, C, The p... | Which affirmation is true when talking about E... | C. The potential energy is negative | A. The potential energy is positive\nB. The po... | A. The potential energy is positive. (Assumin... |
| 3 | 3 | Why do we talk about microgravity in LEO? | ['All the external forces are not cancelling e... | [0, 0, 1, 0] |  | True | 7667b6ef-31e5-4ff3-87e2-65f9a4467551 | `<s>[INST] You're a helpful assistant. You are ...` | [The free-fall trajectory gets perturbed, C. T... | Why do we talk about microgravity in LEO?\nA. ... | C. The free-fall trajectory gets perturbed | A. All the external forces are not cancelling ... | B. In LEO orbits, the Earth’s gravity is not ... |
| 4 | 4 | To what wavelength is the atmosphere the most ... | ['X-rays', 'Visible light', 'Infra Red', 'Radi... | [0, 0, 0, 1] |  | True | b855830d-0498-4c21-8aca-ac69c917178c | `<s>[INST] You're a helpful assistant. You are ...` | [Radio Waves, D, D. Radio Waves] | To what wavelength is the atmosphere the most ... | D. Radio Waves | A. X-rays\nB. Visible light\nC. Infra Red\nD. ... | A. X-rays |


### Check an example of MCQA and its prediction

In [73]:
ID = 19

In [74]:
print("=========================")
print("Multiple Choices Question")
print("=========================\n\n")
print(pred_df.query('id==@ID')['query'].values[0])

Multiple Choices Question


The activity of the Sun has a periodicity of 11 years. What are the methods to observe it? (Select all that apply)
A. Count the number of sun spots
B. Measure the Sun's bolometric flux (i.e. the flux over the whole spectrum)
C. Measure the Sun's flux in a spectral band called H-alpha
D. Measure the Sun's flux at a wavelength of 10.7 cm


In [75]:
print("Acceptable answers (references):\n\n")
for item in pred_df.query("id==@ID").references.values[0]:
    print(item + "\n")

Acceptable answers (references):


Count the number of sun spots
Measure the Sun's flux at a wavelength of 10.7 cm

Count the number of sun spots Measure the Sun's flux at a wavelength of 10.7 cm

A, D

A. Count the number of sun spots D. Measure the Sun's flux at a wavelength of 10.7 cm

A. Count the number of sun spots
D. Measure the Sun's flux at a wavelength of 10.7 cm



# 2. Evaluation
Evaluating generative AI is difficult because two sentences phrased differently may have the same meaning or be of the same expected quality. In some cases you are lucky enough to have reference texts, but sometimes you have no references at all.

Thankfully, for Multiple Choices Questions we have references, and we'll show several approaches to assess performance.

## 2.1 Choosing the right metrics
Here are some interesting metrics you can consider.

### SQuAD v2 style metrics
- `ExactMatch`:  We will ignore casing, ignore punctuation, but respect numbers.
- `F1-score`: We will use the F1-score from the SQuADv2 metric.

### Distance based metrics

- `Damerau-Levenshtein StringDistance`: We will use the Damerau-Levenshtein string distance.
- `Embedding distance`: Here we compare the embeddings of the prediction and the reference.

### LLM-as-a-Judge:
It is possible to evaluate the output of LLMs using other LLMs. LangChain provides convenient LLM-based evaluators which we explored in a dedicated notebook.



## 3.1 Exact Match Example

**Example with langchain**

In [76]:
from langchain.evaluation import StringDistance, load_evaluator, ExactMatchStringEvaluator
evaluator = ExactMatchStringEvaluator(ignore_case=True, ignore_punctuation=True, ignore_numbers=False)

In [77]:
evaluator.evaluate_strings(
    prediction="a d",
    reference="A D"
)

{'score': 1}

**Example with huggingface**

In [78]:
from evaluate import load
exact_match_metric = load("exact_match")

results = exact_match_metric.compute(
    predictions=['a, d'],
    references=['A D'],
    ignore_case=True, ignore_punctuation=True, ignore_numbers=False
    )

In [None]:
results

{'exact_match': 1.0}
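
Both calls above return a match because the strings are normalized before they are compared. Here is a minimal sketch of that idea, assuming normalization amounts to lowercasing and deleting ASCII punctuation characters. The helper `normalized_exact_match` is hypothetical and not part of either library.

```python
import string


def normalized_exact_match(prediction, reference, ignore_case=True, ignore_punctuation=True):
    """Return 1 if both strings are equal after normalization, else 0."""

    def normalize(text):
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            # delete punctuation characters; whitespace is left untouched
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text

    return int(normalize(prediction) == normalize(reference))


print(normalized_exact_match("a, d", "A D"))  # comma removed, strings match
print(normalized_exact_match("a,d", "A D"))   # no space after the comma, no match
```

Under this assumption punctuation is deleted rather than replaced by a space, so `"a,d"` becomes `"ad"` and no longer matches `"A D"`. Exact match stays strict about spacing even when punctuation is ignored.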

## 3.2 StringDistance
This distance is based on the minimum number of character edits (insertions, deletions, substitutions and transpositions of adjacent characters) needed to turn one string into another.
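
To make the definition concrete, here is a small pure-Python sketch. It implements the restricted variant of Damerau-Levenshtein (often called optimal string alignment), which is not necessarily the exact algorithm used by the library behind the LangChain evaluator. The function names and the normalization by the longer string length are assumptions made for illustration.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_osa_distance(a, b):
    """Scale the distance to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


print(osa_distance("ab", "ba"))
print(normalized_osa_distance("A, D", "A and D"))
```

A normalized distance of 0 means the strings are identical and 1 means every character had to be edited, which is why long, verbose predictions tend to score poorly against short references.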

In [None]:
from langchain.evaluation import StringDistance, load_evaluator
evaluator = load_evaluator('string_distance', distance=StringDistance.DAMERAU_LEVENSHTEIN)

In [79]:
evaluator.evaluate_strings(
    prediction="(A): Sunspot counts, and (D): Sun's flux at a wavelength of 10.7 cms",
    reference="A. Count the number of sun spots D. Measure the Sun's flux at a wavelength of 10.7 cm",
)

{'score': 0}

In [80]:
evaluator.evaluation_name

'exact_match'

Note that `evaluation_name` is `'exact_match'` here. This suggests that the cell loading the string-distance evaluator was not executed in this run, so `evaluator` still refers to the exact-match evaluator from section 3.1. The score of 0 above is therefore an exact-match result, not a Damerau-Levenshtein distance. Run the loading cell first to obtain an actual string distance.

## 3.3 Embeddings-based distance with a Hugging Face Model

In [81]:
from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings()
hf_evaluator = load_evaluator("embedding_distance", embeddings=embedding_model)

In [84]:
hf_evaluator.evaluate_strings(
    prediction="A and D",
    reference="A, D")

{'score': 0.11246855730954064}

In [85]:
hf_evaluator.evaluate_strings(
    prediction="(A): Sunspot counts, and (D): Sun's flux at a wavelength of 10.7 cms",
    reference="A. Count the number of sun spots D. Measure the Sun's flux at a wavelength of 10.7 cm")

{'score': 0.16899124146517308}

In [86]:
hf_evaluator.evaluate_strings(
    prediction="(E): Impossible to say",
    reference="A. Count the number of sun spots D. Measure the Sun's flux at a wavelength of 10.7 cm")

{'score': 0.5744426085977041}
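
The scores above are distances, so lower means closer: the paraphrased answer scores about 0.17 against the reference, while the unrelated answer scores about 0.57. The sketch below shows how such a score can be computed, assuming the evaluator uses cosine distance (I believe this is the default, but check your LangChain version). The three toy vectors are made up and stand in for real sentence embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# toy 3-dimensional "embeddings" (hypothetical values)
reference = [0.9, 0.1, 0.2]
paraphrase = [0.8, 0.2, 0.3]
unrelated = [0.1, 0.9, 0.1]

print(cosine_distance(paraphrase, reference))
print(cosine_distance(unrelated, reference))
```

Because the distance depends only on the angle between the vectors, two answers of very different length can still be close as long as the embedding model maps them to similar directions.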

## 3.4 LLM-as-a-Judge

In [87]:
# Load the API key and relevant Python libraries.
import openai
openai.api_key  = userdata.get('OPENAI_API_KEY')

### 3.4.1 LabeledScoreString
In this first approach the result of the evaluation consists of:
- reasoning about the quality of the answer
- score between 1 and 10 (greater is better).

**Important:** In this notebook we use `gpt-3.5-turbo-0125` as the judge. In most cases you would want to use the best available LLM (`gpt4`), unless judging whether the predictions are correct is straightforward. We use `gpt-3.5` here for demonstration purposes and to limit the cost of running this notebook.

In [89]:
from langchain.evaluation import load_evaluator
from langchain.chat_models import ChatOpenAI
evaluator = load_evaluator("labeled_score_string",
                           llm=ChatOpenAI(
                               model="gpt-3.5-turbo-0125", # gpt-3.5-turbo-0125 # gpt-4-0125-preview
                               openai_api_key=userdata.get('OPENAI_API_KEY')
                               )
                           )

### Test LLM-as-Judge on one example query and reference

In [90]:
INPUT_PROMPT = """
### QUESTION: The activity of the Sun has a periodicity of 11 years. What are the methods to observe it? (Select all that apply)

### CHOICES:
A. Count the number of sun spots
B. Measure the Sun's bolometric flux (i.e. the flux over the whole spectrum)
C. Measure the Sun's flux in a spectral band called H-alpha
D. Measure the Sun's flux at a wavelength of 10.7 cm
"""

In [93]:
# Partially correct prediction: it gives only A, while the reference is A and D
result = eval_result = evaluator.evaluate_strings(
    prediction="The correct answer is A",
    reference="A. Count the number of sun spots\n D. Measure the Sun's flux at a wavelength of 10.7 cm",
    input=INPUT_PROMPT,
)

In [94]:
result['score']

2

In [95]:
print(result['reasoning'])

The response provided by the AI assistant is not comprehensive or helpful as it only gives the correct answer without any explanation or further information. It lacks depth of thought and does not address the different methods available to observe the periodicity of the Sun's activity. Additionally, it would have been more beneficial to the user if the assistant had explained why counting the number of sunspots is a method to observe the Sun's activity.

Rating: [[2]]


### 3.4.2 LLM-as-Judge with a Labeled Criteria
**Labeled Criteria** gives either 1 or 0 depending on whether the answer is considered correct or incorrect. In our evaluation scenario we do not need nuance, so this evaluator is a good fit.
- Note that you can evaluate on the basis of different criteria. Below **we choose correctness** but you could choose other criteria and even define your own.
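
As a sketch of a custom criterion: to my knowledge the `criteria` argument also accepts a dictionary that maps a criterion name to its description. The name `mcqa_selection` and its wording below are made up for illustration, and the `load_evaluator` call is left commented out because it needs a configured judge LLM.

```python
# Hypothetical custom criterion: name -> description shown to the judge LLM
custom_criteria = {
    "mcqa_selection": (
        "Does the submission select exactly the same set of choices as the "
        "reference, with no missing choice and no extra choice?"
    )
}

# Assumed usage, mirroring the built-in "correctness" criterion:
# evaluator = load_evaluator("labeled_criteria", criteria=custom_criteria, llm=judge_llm)

for name, description in custom_criteria.items():
    print(f"{name}: {description}")
```

A criterion phrased around the set of selected choices may be a better fit for "select all that apply" questions than generic correctness, but that would need to be checked on real examples.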

In [96]:
llm_labeled_criteria_correctness_evaluator = load_evaluator("labeled_criteria",
                                                            criteria="correctness",
                                                            llm=ChatOpenAI(
                                                                model="gpt-3.5-turbo-0125", # gpt-3.5-turbo-0125 # gpt-4-0125-preview
                                                                openai_api_key=userdata.get('OPENAI_API_KEY'),
                                                                temperature=0.0
                                                                ),
                                                            )

In [97]:
# We can even override the model's learned knowledge using ground truth labels
result = llm_labeled_criteria_correctness_evaluator.evaluate_strings(
    prediction="The correct answer is A",
    reference="A. Count the number of sun spots\n D. Measure the Sun's flux at a wavelength of 10.7 cm",
    input=INPUT_PROMPT,
)

In [100]:
print(result['reasoning'])

1. Is the submission correct, accurate, and factual?
- The submission states that the correct answer is A, which is "Count the number of sun spots."
- The correct answers based on the reference are A and D.
- The submission is incorrect as it only mentions A and does not include D.
- Therefore, the submission does not meet the criteria for correctness.

N


In [101]:
print(result['score'])

0
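
The reasoning printed above ends with a single `N` line, and the score is 0. The sketch below illustrates how such a verdict can be mapped to a binary score. It is a hypothetical helper written for this explanation and is not LangChain's own parser, which handles this step internally.

```python
def parse_verdict(judge_text):
    """Map a trailing Y/N verdict line to 1/0; return None if it is missing."""
    lines = [line.strip() for line in judge_text.strip().splitlines() if line.strip()]
    if not lines:
        return None
    verdict = lines[-1].upper()
    if verdict == "Y":
        return 1
    if verdict == "N":
        return 0
    return None  # the judge did not end with a clear verdict


example = """- The submission is incorrect as it only mentions A and does not include D.
- Therefore, the submission does not meet the criteria for correctness.

N
"""
print(parse_verdict(example))
```

Returning `None` for an unparseable verdict keeps judge failures distinguishable from genuinely incorrect answers, which matters when the scores are averaged later.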


## 3.5 F1-score

What is nice about the `SQuADv2 F1 score` is that it lets you pass multiple acceptable reference answers. So if the LLM answers correctly but in a slightly different manner, it has a better chance of not being wrongly penalized.

In [103]:
from evaluate import load
squad_metric = load("squad_v2")

#### An example usage

In [105]:
# testing the squad metric
predictions = [
    {'prediction_text': "correct answers are A and C",
     'id': '0',
     'no_answer_probability': 0}
    ]

# you can pass multiple ground truth references to SQuAD, which helps here to avoid wrongly penalizing the F1 score
references = [
    {'answers': {'answer_start': [0], 'text': ["A and C", "answers are A and C"]}, 'id': '0'}
    ]

In [106]:
results = squad_metric.compute(predictions=predictions, references=references)
results

{'exact': 0.0,
 'f1': 88.8888888888889,
 'total': 1,
 'HasAns_exact': 0.0,
 'HasAns_f1': 88.8888888888889,
 'HasAns_total': 1,
 'best_exact': 0.0,
 'best_exact_thresh': 0.0,
 'best_f1': 88.8888888888889,
 'best_f1_thresh': 0.0}
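
To see where the F1 of 88.89 comes from, here is a simplified re-implementation of the token-level F1, following the normalization of the SQuAD evaluation script as I understand it (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace). The function names are my own, and the sketch ignores the no-answer handling of SQuAD v2.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # if either side is empty, F1 is 1 only when both are empty
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, references):
    """Score against every acceptable reference and keep the best one."""
    return max(token_f1(prediction, ref) for ref in references)


print(best_f1("correct answers are A and C", ["A and C", "answers are A and C"]))
print(normalize_answer("A, D"))
```

One side effect of this normalization is worth knowing for multiple-choice answers: the letter `A` is treated as an article and dropped, so in this sketch `"A, D"` normalizes to `"d"`. In the example above, the prediction keeps 5 tokens and the best reference 4, all of them shared, which gives precision 4/5, recall 1 and F1 of about 0.889.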

Each of these metrics provides a different view of performance. They all have pros and cons, and in generative AI it is rare that we can rely on a single metric because of possible flaws and edge cases. For this reason, it is recommended to look at multiple metrics, unless the expected outcome can be evaluated unambiguously.

# Evaluate Exact Match and F1 on the whole dataset

In [108]:
squadv2_predictions = []
squadv2_references = []

predictions = []
references = []

for id in pred_df.id:

    # extract data from instance
    mcqa = pred_df.query('id==@id')
    question = mcqa.question.values[0]
    propositions = ast.literal_eval(mcqa.propositions.values[0])
    labels = ast.literal_eval(mcqa.labels.values[0])

    # format the mcqa choices
    formatted_choices = [l + '. ' + prop for l, prop in zip('ABCDEF', propositions)]
    choices = '\n'.join(formatted_choices)
    correct_answers = [prop for idx, prop in enumerate(formatted_choices) if labels[idx]==1]

    # different acceptable text strings
    correct_completion_newline = '\n'.join(correct_answers)
    # llm_references.append(correct_completion_newline)
    # llm_questions.append(question)
    # llm_choices.append(choices)

    # F1 SCORE LIKE SQUAD
    pred_item = {'prediction_text': mcqa.prediction.values[0],
                 'id': str(id),
                 'no_answer_probability': 0}

    reference_item = {
        'answers': {
            'answer_start': [0],
            'text': mcqa.references.values[0]
        },
        'id': str(id)
    }

    squadv2_predictions.append(pred_item)
    squadv2_references.append(reference_item)

    predictions.append(mcqa.prediction.values[0])
    references.append(mcqa.references.values[0])

In [109]:
squadv2_eval_results = squad_metric.compute(predictions=squadv2_predictions, references=squadv2_references)
squadv2_eval_results

{'exact': 23.0,
 'f1': 47.92004938662442,
 'total': 200,
 'HasAns_exact': 23.0,
 'HasAns_f1': 47.92004938662442,
 'HasAns_total': 200,
 'best_exact': 23.0,
 'best_exact_thresh': 0.0,
 'best_f1': 47.92004938662442,
 'best_f1_thresh': 0.0}

# Evaluate String Distance on the whole dataset
- Character-based distance
- Embedding-based distance

In [118]:
from langchain.evaluation import StringDistance, load_evaluator
evaluator = load_evaluator('string_distance', distance=StringDistance.DAMERAU_LEVENSHTEIN)

from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings()
hf_evaluator = load_evaluator("embedding_distance", embeddings=embedding_model)

In [119]:
string_distance_scores = []
embedding_distance_scores = []

for id, gt_list in enumerate(references):

    lowest_string_distance = 1.0
    lowest_embedding_distance = 1.0

    for gt in gt_list:

        instance_distance = evaluator.evaluate_strings(
            prediction=predictions[id],
            reference=gt
            )

        instance_embedding = hf_evaluator.evaluate_strings(
            prediction=predictions[id],
            reference=gt)

        if instance_distance['score']<lowest_string_distance:
            lowest_string_distance=instance_distance['score']
        if instance_embedding['score']<lowest_embedding_distance:
            lowest_embedding_distance=instance_embedding['score']

    string_distance_scores.append(1.0-lowest_string_distance)
    embedding_distance_scores.append(1.0-lowest_embedding_distance)
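
The loop above keeps, for each prediction, the smallest distance over all acceptable references and turns it into a similarity with `1 - distance`. The same logic can be sketched compactly as follows. `toy_distance` is a stand-in built on `difflib` and is not the Damerau-Levenshtein or embedding distance used in the notebook.

```python
from difflib import SequenceMatcher


def toy_distance(prediction, reference):
    """Stand-in distance in [0, 1] (0 means identical strings)."""
    return 1.0 - SequenceMatcher(None, prediction, reference).ratio()


def best_similarity(prediction, references, distance_fn):
    """Similarity to the closest acceptable reference, clipped to [0, 1]."""
    lowest = min(distance_fn(prediction, ref) for ref in references)
    # capping at 1.0 mirrors the initial value of 1.0 used in the loop above
    return 1.0 - min(lowest, 1.0)


references = ["A, D", "A. Count the number of sun spots"]
print(best_similarity("A, D", references, toy_distance))
print(best_similarity("B", references, toy_distance))
```

Taking the minimum over references means a prediction is only compared with the phrasing it resembles most, so a short answer such as `A, D` is not penalized for failing to match the long-form reference.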

In [120]:
len(string_distance_scores), len(embedding_distance_scores)

(200, 200)

In [121]:
import numpy as np

In [122]:
np.mean(string_distance_scores), np.median(string_distance_scores)

(0.44947585815949204, 0.3410479323308271)

In [123]:
np.mean(embedding_distance_scores), np.median(embedding_distance_scores)

(0.7063601942835875, 0.7554101518354527)

## LLM as a Judge Evaluation on the Whole Dataset
- We use GPT-3.5 turbo
- Labeled Criteria correctness

In [None]:
llm_scores = []

llm_labeled_criteria_correctness_evaluator = load_evaluator("labeled_criteria",
                                                            criteria="correctness",
                                                            llm=ChatOpenAI(
                                                                model="gpt-3.5-turbo-0125", # gpt-3.5-turbo-0125 # gpt-4-0125-preview
                                                                openai_api_key=userdata.get('OPENAI_API_KEY'),
                                                                temperature=0.0
                                                                ),
                                                            )

for id in pred_df.id:

    print(f"{id}/{len(pred_df)}")
    try:
        pred_instance = pred_df.query("id==@id")

        result = llm_labeled_criteria_correctness_evaluator.evaluate_strings(
            prediction=pred_instance['prediction'].values[0],
            reference=pred_instance['llm_reference'].values[0],
            input=pred_instance['query'].values[0])

        llm_scores.append(result['score'])
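    except Exception as e:
        # a failed evaluation is counted as incorrect; print the cause so failures are visible
        print(f"LLM-as-Judge - evaluation ERROR: {e}")
        llm_scores.append(0)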

0/200
1/200
2/200
3/200
4/200
5/200
6/200
7/200
8/200
9/200
10/200
11/200
12/200
13/200
14/200
15/200
16/200
17/200
18/200
19/200
20/200
21/200
22/200
23/200
24/200
25/200
26/200
27/200
28/200
29/200
30/200
31/200
32/200
33/200
34/200
35/200
36/200
37/200
38/200
39/200
40/200
41/200
42/200
43/200
44/200
45/200
46/200
47/200
48/200
49/200
50/200
51/200
52/200
53/200
54/200
55/200
56/200
57/200
58/200
59/200
60/200
61/200
62/200
63/200
64/200
65/200
66/200
67/200
68/200
69/200
70/200
71/200
72/200
73/200
74/200
75/200
76/200
77/200
78/200
79/200
80/200
81/200
82/200
83/200
84/200
85/200
86/200
87/200
88/200
89/200
90/200
91/200
92/200
93/200
94/200
95/200
96/200
97/200
98/200
99/200
100/200
101/200
102/200
103/200
104/200
105/200
106/200
107/200
108/200
109/200
110/200
111/200
112/200
113/200
114/200
115/200
116/200
117/200
118/200
119/200
120/200
121/200
122/200
123/200
124/200
125/200
126/200
127/200
128/200
129/200
130/200
131/200
132/200
133/200
134/200
135/200
136/200
137/200
138/20

In [None]:
len(llm_scores)

200

In [None]:
pred_df['llm_score']=llm_scores

In [None]:
np.mean(llm_scores)

0.635

# Aggregation of all scores
- As each technique has its pros and cons, we aggregate all scores with a simple mean. You could also use a weighted average if some metrics are more reliable than others for your use case.

In [None]:
overall_score = np.mean([np.mean(llm_scores), np.mean(embedding_distance_scores), np.mean(string_distance_scores), squadv2_eval_results['f1']/100])
overall_score

0.5675091365773309
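
For the weighted variant, here is a minimal sketch using the four scores reported in this notebook. The weights are hypothetical (the judge counts double here purely as an example) and should be chosen for your own use case.

```python
# scores reported above (F1 already divided by 100)
scores = {
    "llm": 0.635,
    "emb": 0.7063601942835875,
    "str": 0.44947585815949204,
    "f1": 0.4792004938662442,
}

# hypothetical weights: trust the LLM judge twice as much as the other metrics
weights = {"llm": 2.0, "emb": 1.0, "str": 1.0, "f1": 1.0}


def weighted_mean(scores, weights):
    """Weighted average of the scores; weights do not need to sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


equal_weights = {name: 1.0 for name in scores}
print(weighted_mean(scores, equal_weights))  # same as the simple mean
print(weighted_mean(scores, weights))
```

With equal weights the function reproduces the simple mean, so it can replace `np.mean` in the cell above without changing the baseline result.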

In [None]:
print("====== EVALUATION =======")
print(f" > LLM: \t {round(np.mean(llm_scores),2)}")
print(f" > EMB: \t {round(np.mean(embedding_distance_scores),2)}")
print(f" > STR: \t {round(np.mean(string_distance_scores),2)}")
print(f" > F1 : \t {round(squadv2_eval_results['f1']/100,2)}")
print("-------------------------")
print(f"  Overall score: {round(overall_score,2)}")
print("-------------------------")
# print(f" > EM : \t {round(squadv2_eval_results['exact']/100,3)}")

 > LLM: 	 0.64
 > EMB: 	 0.71
 > STR: 	 0.45
 > F1 : 	 0.48
-------------------------
  Overall score: 0.57
-------------------------


# Going Further in evaluation?
- You may want to aggregate the different performance metrics
- You may want to log inference time + input/output token counts + compute throughput
- You may want to compare the performance of different models
- You may want to compare the performance of different quantization levels for the same model
- You may want to assess how much performance varies with the task-specific instruction prompt that we chose at the beginning.
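
As a starting point for the logging idea above, here is a minimal sketch of a timing wrapper. `fake_generate` is a hypothetical stand-in for the real model call, and token counts are approximated by whitespace splitting. In practice you would count tokens with the model's tokenizer.

```python
import time


def fake_generate(prompt):
    """Hypothetical stand-in for the real model call."""
    return "C. decreases"


def timed_generate(generate_fn, prompt):
    """Run one generation and record latency and rough token counts."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    input_tokens = len(prompt.split())   # rough proxy, not real tokenizer counts
    output_tokens = len(output.split())
    return {
        "output": output,
        "latency_s": elapsed,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tokens_per_s": output_tokens / elapsed if elapsed > 0 else float("inf"),
    }


record = timed_generate(fake_generate, "During its orbital period, the velocity ...")
print(record["output"], record["input_tokens"], record["output_tokens"])
```

Collecting one such record per question makes it easy to compare models or quantization levels on latency and throughput alongside the quality scores.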


As you can see, evaluating LLMs is more than calling `model.score(X_test, y_test)`.