# Evaluating LLMs

How you evaluate a model depends on the task you care about.  
There are three broad types of evaluation:
- Automated Evaluation (Benchmarking with known quantities)
- Model in the loop Evaluation (Model as a judge)
- Human in the loop Evaluation (Human as a judge)

### Automated Evaluation (Benchmarking with known quantities)
This is one of the simpler types. When you have a question and a strict answer, you can automate the evaluation.
Example:
```
Question: "What is the capital of Romania?"
Answer: "Bucharest" / "București" (and lowercase variants)
```
```
Question: What is 2 + 2?
A. 3
B. 4
C. 5

Answer: B
```

The answer is easy to parse, and any other response counts as wrong.
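
A minimal sketch of how the two examples above could be scored automatically. The helper names (`normalize`, `exact_match`, `score_multiple_choice`) are illustrative and not part of this notebook, and stripping diacritics is an assumption about which variants should count as equal.

```python
import unicodedata
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, trim, and strip diacritics so 'București' matches 'bucuresti'."""
    decomposed = unicodedata.normalize("NFKD", text.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def exact_match(prediction: str, accepted: Sequence[str]) -> bool:
    """True when the prediction equals any accepted answer after normalization."""
    return normalize(prediction) in {normalize(answer) for answer in accepted}


def score_multiple_choice(prediction: str, right_answer: str) -> bool:
    """Compare only the first non-space character, so 'b.' still counts as 'B'."""
    return prediction.strip()[:1].upper() == right_answer.strip().upper()


print(exact_match(" bucurești ", ["Bucharest", "București"]))
print(score_multiple_choice("B", "B"))
```

Anything that falls outside the accepted set is scored as wrong, which is what makes this type of evaluation cheap to run at scale.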

### Model in the loop Evaluation (Model as a judge)
This type is more complex, but it can still be automated by using a second, capable LLM as a judge.

Example:
```
Input: What if these shoes don't fit?
Expected: You're eligible for a free full refund within 30 days of purchase.
Predicted: We offer a 30-day full refund at no extra cost.
```

In this case, the judge LLM is given the input, the expected answer and the predicted answer.  
It then decides if the predicted answer is correct or not.
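
As a rough sketch, the judging step could be wired up like this. The prompt wording, `build_judge_prompt`, and `parse_verdict` are hypothetical; the call to the judge model itself is left out, because it depends on whichever LLM client you use.

```python
from typing import Optional

# Hypothetical judge prompt; the wording is an assumption, not a fixed recipe.
JUDGE_TEMPLATE = """You are grading an answer.
Input: {input}
Expected: {expected}
Predicted: {predicted}
Does the predicted answer convey the same information as the expected answer?
Reply with exactly one word: CORRECT or INCORRECT."""


def build_judge_prompt(input_text: str, expected: str, predicted: str) -> str:
    """Fill the template with the three pieces of text the judge needs."""
    return JUDGE_TEMPLATE.format(
        input=input_text, expected=expected, predicted=predicted
    )


def parse_verdict(judge_reply: str) -> Optional[bool]:
    """Map the judge's reply to True/False, or None when it cannot be parsed."""
    words = judge_reply.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    if first == "CORRECT":
        return True
    if first == "INCORRECT":
        return False
    return None


prompt = build_judge_prompt(
    "What if these shoes don't fit?",
    "You're eligible for a free full refund within 30 days of purchase.",
    "We offer a 30-day full refund at no extra cost.",
)
print(prompt)
print(parse_verdict("Correct."))
```

Returning `None` for unparseable replies keeps judge failures separate from genuinely incorrect predictions.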

### Human in the loop Evaluation (Human as a judge)
This is the most complex type, but also the most accurate, because a human can account for nuance and emotion and has a deeper understanding of the task.  
The human judge is given the input, the expected answer and the predicted answer.  
The human judge then decides if the prediction is correct.


In [1]:
!pip install -q transformers datasets torch unsloth ipywidgets scikit-learn numpy plotly pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Initializing the testing environment

Loading the dataset:

In [2]:
from datasets import load_dataset

dataset = load_dataset("RoBiology/RoBiologyDataChoiceQA")
dataset

DatasetDict({
    train: Dataset({
        features: ['question_number', 'question', 'type', 'options', 'grade', 'stage', 'year', 'right_answer', 'source', 'id_in_source', 'dupe_id'],
        num_rows: 11368
    })
    validation: Dataset({
        features: ['question_number', 'question', 'type', 'options', 'grade', 'stage', 'year', 'right_answer', 'source', 'id_in_source', 'dupe_id'],
        num_rows: 1376
    })
    test: Dataset({
        features: ['question_number', 'question', 'type', 'options', 'grade', 'stage', 'year', 'right_answer', 'source', 'id_in_source', 'dupe_id'],
        num_rows: 1388
    })
})

In [3]:
test_ds = dataset['test'].filter(lambda x: x['type'] == 'single-choice')
test_ds

Dataset({
    features: ['question_number', 'question', 'type', 'options', 'grade', 'stage', 'year', 'right_answer', 'source', 'id_in_source', 'dupe_id'],
    num_rows: 588
})

Loading the tokenizer and model:

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-3B-Instruct", device_map='cuda')

Setting up the input for the model:

In [7]:
def create_chat_prompt(instruction: str, input: str, response: str|None = None):
    return tokenizer.apply_chat_template(
        [
            {"role": "system", "content": instruction},
            {"role": "user", "content": input},
            *([{"role": "assistant", "content": response}] if response else []),
        ],
        tokenize=False,
        # Open a new assistant turn only when no response is supplied
        add_generation_prompt=not response,
    )

In [8]:
instuction = "Answer the question based on the given options. Respond by writing only the letter of the correct answer."

print(create_chat_prompt(
    instuction,
    "\n".join([test_ds[0]["question"]] + test_ds[0]["options"]),
    # test_ds[0]["right_answer"],
))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 07 Dec 2024

Answer the question based on the given options. Respond by writing only the letter of the correct answer.<|eot_id|><|start_header_id|>user<|end_header_id|>

Următorul reflex este monosinaptic:
A. rotulian
B. cardioaccelerator
C. de apărare
D. vasoconstrictor<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [9]:
test_ds = test_ds.map(lambda x: {
    "chat": create_chat_prompt(instuction, '\n'.join([x['question']] + x['options'])) 
})


## Running the model

We use a simple evaluation methodology:
- Prefill the model with the chat prompt
- Compute the score distribution for the next token
- Save the most likely token
- Save the top 10 tokens and their scores

Generation is greedy: sampling is disabled (`do_sample=False`), so the most likely token is always chosen.  
The scores are left untouched, exactly as output by the model.  
This keeps the evaluation deterministic.
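
The selection step can be illustrated without a model. The sketch below uses a toy vocabulary and made-up logits; `greedy_next_token` is a hypothetical helper, not a `transformers` API.

```python
def greedy_next_token(logits, vocab, k=10):
    """Return the highest-scoring token and the k best (token, logit) pairs."""
    ranked = sorted(zip(vocab, logits), key=lambda pair: pair[1], reverse=True)
    top_k = ranked[:k]
    return top_k[0][0], top_k


# Toy values for illustration only; real logits span the whole vocabulary.
toy_vocab = ["A", "B", "C", "D", "E"]
toy_logits = [2.1, 3.4, 0.3, 1.7, -0.5]

pred, top = greedy_next_token(toy_logits, toy_vocab, k=3)
print(pred)
print(top)
```

Because the choice is a plain argmax, running it twice on the same logits always gives the same token, which is the property the evaluation relies on.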


In [11]:
from tqdm import tqdm
import torch


def test_model(model, seed=42, do_sample=False, temperature=0):
    answers = []

    for entry in tqdm(test_ds):
        # set seed
        torch.manual_seed(seed)

        # tokenize
        input = tokenizer(
            entry["chat"],
            return_tensors="pt",
        ).to(model.device)

        # generate and return sequence, scores and logits, with given temperature and sampling
        response = model.generate(
            **input,
            max_new_tokens=1,
            do_sample=do_sample,
            temperature=temperature,
            return_dict_in_generate=True,
            output_scores=True,
            output_logits=True,
            pad_token_id=tokenizer.eos_token_id,
        )

        # decode answer
        output = tokenizer.decode(
            response["sequences"][0][input.input_ids.shape[1] :],
            skip_special_tokens=True,
        ).strip()

        # get top 10 tokens and scores
        scores = response["scores"][0]
        top_k = scores.topk(10)
        top_k_tokens = tokenizer.convert_ids_to_tokens(top_k.indices[0].tolist())

        # append to answers
        answers.append(
            {
                "pred": output,
                "top_k_tokens": top_k_tokens,
                "top_k_scores": top_k.values.tolist()[0],
            }
        )

    return answers


with torch.no_grad():
    answers = test_model(model)

100%|██████████| 588/588 [00:20<00:00, 28.08it/s]


Example of an output:


In [12]:
answers[0]

{'pred': 'A',
 'top_k_tokens': ['A', 'B', 'D', 'C', 'R', 'V', 'E', 'U', 'M', 'S'],
 'top_k_scores': [27.942733764648438,
  27.89405059814453,
  27.47077751159668,
  26.390111923217773,
  19.076398849487305,
  18.385112762451172,
  18.295764923095703,
  17.976375579833984,
  17.62354278564453,
  17.048017501831055]}

## Process results

Merge test dataset with answers:

In [15]:
import pandas as pd
test_df = pd.DataFrame(test_ds)
answers_df = pd.DataFrame(answers)  # the column is already named 'pred'

answers_df = test_df.join(answers_df)
answers_df

Unnamed: 0,question_number,question,type,options,grade,stage,year,right_answer,source,id_in_source,dupe_id,chat,pred,top_k_tokens,top_k_scores
0,30,Următorul reflex este monosinaptic:,single-choice,"[A. rotulian, B. cardioaccelerator, C. de apăr...",VII,locala,2018,A,olimpiada,arad,,<|begin_of_text|><|start_header_id|>system<|en...,A,"[A, B, D, C, R, V, E, U, M, S]","[27.942733764648438, 27.89405059814453, 27.470..."
1,25,Axonii neuronilor olfactivi străbat lama ciuru...,single-choice,"[A. etmoid, B. sfenoid, C. zigomatic, D. frontal]",VII,locala,2018,A,olimpiada,arad,,<|begin_of_text|><|start_header_id|>system<|en...,A,"[A, B, C, D, E, S, R, M, V, T]","[29.128841400146484, 28.690753936767578, 25.94..."
2,24,Astigmatismul:,single-choice,[A. este un defect care nu afectează sistemul ...,VII,locala,2018,C,olimpiada,arad,,<|begin_of_text|><|start_header_id|>system<|en...,C,"[C, B, A, D, R, E, M, V, L, O]","[26.928300857543945, 25.83871841430664, 25.774..."
3,22,Structuri bogat vascularizate sunt:,single-choice,"[A. coroida, dermul, B. corneea, mucoasa nazal...",VII,locala,2018,A,olimpiada,arad,,<|begin_of_text|><|start_header_id|>system<|en...,A,"[A, B, D, C, R, E, V, M, S, P]","[26.416179656982422, 26.19873046875, 25.524808..."
4,20,Arcul reflex:,single-choice,"[A. este format din 3 componente, B. reprezint...",VII,locala,2018,B,olimpiada,arad,,<|begin_of_text|><|start_header_id|>system<|en...,B,"[B, A, C, D, R, E, T, V, S, O]","[28.546669006347656, 27.383541107177734, 27.12..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
583,39,Glicoliza anaerobă se caracterizează prin urmă...,single-choice,"[A. randament foarte ridicat, B. transformarea...",facultate,admitere,2020,A,UMF Brasov,metabolismul/varianta_B,,<|begin_of_text|><|start_header_id|>system<|en...,D,"[D, B, E, A, C, R, G, O, F, S]","[25.47475814819336, 24.43821144104004, 24.4042..."
584,1,Alegeți afirmația incorectă referitoare la aci...,single-choice,[A. prin reacția de beta-oxidare duc la formar...,facultate,admitere,2020,C,UMF Brasov,metabolismul/varianta_B,,<|begin_of_text|><|start_header_id|>system<|en...,A,"[A, C, B, E, D, R, O, M, L, F]","[22.201313018798828, 21.998310089111328, 21.82..."
585,21,Care dintre următorii hormoni au rol predomina...,single-choice,"[A. hormonul somatotrop, B. testosteronul, C. ...",facultate,admitere,2020,C,UMF Brasov,metabolismul/varianta_A,,<|begin_of_text|><|start_header_id|>system<|en...,C,"[C, A, B, D, E, R, F, M, V, S]","[27.479848861694336, 25.796890258789062, 25.41..."
586,11,Despre rolurile lipidelor în organism nu se po...,single-choice,[A. lipidele aflate în organism reprezintă o r...,facultate,admitere,2020,A,UMF Brasov,metabolismul/varianta_A,,<|begin_of_text|><|start_header_id|>system<|en...,C,"[C, B, D, E, A, R, O, F, M, P]","[24.079145431518555, 23.989465713500977, 23.46..."


Show predictions:

In [16]:
answers_df['pred'].value_counts()

pred
B    196
A    184
C    135
D     70
E      3
Name: count, dtype: int64

Compute accuracy:

In [17]:
answers_df['correct'] = answers_df['right_answer'] == answers_df['pred']
print(f'Accuracy: {answers_df["correct"].mean() * 100:.2f}%')
answers_df['correct'].value_counts()

Accuracy: 30.61%


correct
False    408
True     180
Name: count, dtype: int64
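
An accuracy figure is easier to read next to a chance baseline. The sketch below assumes a guesser that picks uniformly among the options of each question; `chance_accuracy` is a hypothetical helper and the option counts are toy values, not numbers taken from the dataset.

```python
def chance_accuracy(option_counts):
    """Expected accuracy of a uniform random guesser, given options per question."""
    return sum(1 / n for n in option_counts) / len(option_counts)


# Toy example: three questions with 4 options and one with 5 options.
toy_option_counts = [4, 4, 5, 4]
baseline = chance_accuracy(toy_option_counts)
print(f"Chance accuracy: {baseline * 100:.2f}%")
```

On the real data, the counts could be taken from the `options` column, for example `[len(o) for o in answers_df['options']]`.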

Show a confusion matrix.

A confusion matrix is a table used to evaluate the performance of a classification model.
It shows the number of correct and incorrect predictions for each class.

The Y axis holds the true labels and the X axis holds the predicted labels.
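
To make the layout concrete, here is a small pure-Python sketch that counts (true, predicted) pairs by hand. `confusion_counts` is a hypothetical helper and the labels are toy data; rows are true labels and columns are predicted labels, the same orientation `sklearn.metrics.confusion_matrix` uses.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return sorted labels and a matrix where matrix[i][j] counts true i, predicted j."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


toy_true = ["A", "A", "B", "C"]
toy_pred = ["A", "B", "B", "A"]
labels, matrix = confusion_counts(toy_true, toy_pred)

for label, row in zip(labels, matrix):
    print(label, row)

# Correct predictions sit on the main diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(toy_true)
print(f"Accuracy: {accuracy:.2f}")
```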

In [32]:

from sklearn.metrics import confusion_matrix
import plotly.graph_objects as go
import numpy as np

def show_confusion_matrix(df):
    # Extract true and predicted labels
    y_true = df['right_answer'].tolist()
    y_pred = df['pred'].tolist()

    # Get unique labels in sorted order
    labels = sorted(list(set(y_true + y_pred)))

    # Create confusion matrix
    cm = confusion_matrix(y_true, y_pred, labels=labels)

    # Create heatmap using plotly
    fig = go.Figure(data=go.Heatmap(
        z=cm,
        x=labels,
        y=labels,
        text=cm,
        texttemplate="%{text}",
        textfont={"size": 12},
        hoverongaps=False,
        colorscale='Blues'
    ))

    fig.update_layout(
        title='Confusion Matrix',
        xaxis_title='Predicted',
        yaxis_title='True',
        width=600,
        height=600
    )

    fig.show()

show_confusion_matrix(answers_df)

As you can see, the model is not very accurate.  
A well-performing model would show a strong main diagonal, but here the predictions are spread over the entire matrix.

Let's inspect the individual predictions and look at:
- the top 10 predicted tokens and their scores
- the softmax scores for the top 10 tokens
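
As a quick worked example, the softmax step can be reproduced with the standard library alone. The tokens and scores below are copied from the `answers[0]` output shown earlier; `softmax` here is a small helper written for illustration. Note that this normalizes over the top 10 tokens only, so the values approximate, rather than equal, the full-vocabulary probabilities.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["A", "B", "D", "C", "R", "V", "E", "U", "M", "S"]
scores = [
    27.942733764648438, 27.89405059814453, 27.47077751159668,
    26.390111923217773, 19.076398849487305, 18.385112762451172,
    18.295764923095703, 17.976375579833984, 17.62354278564453,
    17.048017501831055,
]

probs = softmax(scores)
for token, prob in list(zip(tokens, probs))[:4]:
    print(f"{token}: {prob * 100:.2f}%")
```

Raw scores that look close together (27.94 vs 27.89) translate into nearly equal probabilities, which is why the softmax view is more useful for judging confidence than the raw scores.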

In [43]:
def show_chart_for_row(row, index):
    NL = '\n'
    row_scores = row['top_k_scores']
    row_scores_softmax = torch.softmax(torch.tensor(row_scores), dim=0).numpy()  # normalized over the top 10 tokens only
    row_tokens = row['top_k_tokens']

    print(f'''
{row.question}
{NL.join(row.options)}

{"✅" if row.right_answer == row_tokens[0] else "❌"}{row_tokens[0]}: {row_scores_softmax[0] * 100:.2f}%
---
{NL.join([f'{"✅" if row.right_answer == t else "❌"}{t}: {s * 100:.2f}%' for t, s in zip(row_tokens, row_scores_softmax)][1:len(row.options)])}
''')

    # Create figure with secondary y-axis
    fig = go.Figure()

    # Add raw scores on primary y-axis
    fig.add_trace(
        go.Bar(
            x=[x - 0.2 for x in range(len(row_tokens))],  # Shift left
            y=row_scores,
            name="Raw Scores", 
            text=[f'{score:.2f}' for score in row_scores],
            textposition='auto',
            yaxis='y',
            width=0.4  # Make bars thinner
        )
    )

    # Add softmax scores on secondary y-axis
    fig.add_trace(
        go.Bar(
            x=[x + 0.2 for x in range(len(row_tokens))],  # Shift right
            y=row_scores_softmax,
            name="Softmax Scores",
            text=[f'{score:.2f}' for score in row_scores_softmax],
            textposition='auto',
            yaxis='y2',
            width=0.4  # Make bars thinner
        )
    )

    # Add right answer indicator on tertiary y-axis
    right_answer_values = [1 if token == row.right_answer else 0 for token in row_tokens]
    fig.add_trace(
        go.Bar(
            x=list(range(len(row_tokens))),
            y=right_answer_values,
            name="Right Answer",
            marker_color='limegreen',
            yaxis='y3',
            width=0.2
        )
    )

    fig.update_layout(
        title=f'Top-K Token Scores for {row.question_number=} / {index=}',
        xaxis_title='Tokens',
        yaxis_title='Raw Scores',
        yaxis2=dict(
            title='Softmax Scores',
            overlaying='y',
            side='right'
        ),
        yaxis3=dict(
            title='Right Answer',
            overlaying='y',
            side='right',
            position=0.85,
            range=[0, 1.2],
            showgrid=False,
            visible=False
        ),
        width=800,
        height=500,
        showlegend=True,
        xaxis=dict(
            ticktext=row_tokens,
            tickvals=list(range(len(row_tokens)))
        )
    )

    fig.show()

# Create buttons for navigation
from ipywidgets import Button, HBox, widgets
import IPython.display as display

current_index = 0

def show_prev(_):
    global current_index
    current_index = max(0, current_index - 1)
    display.clear_output(wait=True)
    display_buttons()
    show_chart_for_row(answers_df.iloc[current_index], current_index)

def show_next(_):
    global current_index 
    current_index = min(len(answers_df) - 1, current_index + 1)
    display.clear_output(wait=True)
    display_buttons()
    show_chart_for_row(answers_df.iloc[current_index], current_index)

def on_index_change(change):
    global current_index
    current_index = max(0, min(change.new, len(answers_df)-1))
    display.clear_output(wait=True)
    display_buttons()
    show_chart_for_row(answers_df.iloc[current_index], current_index)

def display_buttons():
    prev_button = Button(description='Previous')
    next_button = Button(description='Next')
    index_input = widgets.BoundedIntText(  # enforces the min/max range below
        value=current_index,
        description='Index:',
        min=0,
        max=len(answers_df)-1
    )
    prev_button.on_click(show_prev)
    next_button.on_click(show_next)
    index_input.observe(on_index_change, names='value')
    display.display(HBox([prev_button, next_button, index_input]))

# Show initial chart and buttons
display_buttons()
show_chart_for_row(answers_df.iloc[current_index], current_index)



Duodenul:
A. are de două ori lungimea esofagului
B. este partea mijlocie a intestinului subțire
C. are numeroase anse intestinale
D. primește sucuri digestive de la 2 glande anexe

❌C: 30.40%
---
❌B: 27.18%
❌A: 21.72%
✅D: 20.69%



Here are a few observations:
- Index 0:
  - The model is not confident in its prediction: A and B have similar scores (36% and 34% respectively)
  - The correct answer is A (36%), which is the one predicted, but confidence is low
- Index 2:
  - The model is somewhat confident in its prediction of C (59%), which is correct
- Index 5:
  - The model is not confident, and the correct answer (C) is 4th in the list of probabilities (15%)
- Index 8:
  - The model is somewhat confidently wrong, having picked D (53%) instead of the correct answer A (15%)
- Index 16, 19, 20: 
  - Strong confidence in the correct answer
- Index 34:
  - Strong confidence in the wrong answer

In [21]:
answers_df['top_k_scores_softmax'] = answers_df['top_k_scores'].apply(lambda x: torch.softmax(torch.tensor(x), dim=0).numpy())

In [26]:
confident_answers = answers_df[answers_df['top_k_scores_softmax'].apply(lambda x: x[0] > 0.5)]
print(f'Confident answers: {confident_answers.shape[0] / answers_df.shape[0] * 100:.2f}%')
print(f'Accuracy: {confident_answers["correct"].mean() * 100:.2f}%')

Confident answers: 29.76%
Accuracy: 42.86%
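
The 0.5 cut-off used above is one point on a trade-off: a higher threshold keeps fewer answers. The sketch below computes coverage and accuracy for several thresholds on toy data; `selective_accuracy` is a hypothetical helper and the records are made up for illustration.

```python
def selective_accuracy(records, threshold):
    """records: list of (top1_probability, is_correct) pairs.

    Returns (coverage, accuracy) over the answers whose top-1 probability
    exceeds the threshold; accuracy is NaN when nothing is kept.
    """
    kept = [ok for prob, ok in records if prob > threshold]
    coverage = len(kept) / len(records) if records else 0.0
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Toy records for illustration only.
toy_records = [(0.36, True), (0.59, True), (0.53, False), (0.30, False), (0.91, True)]

for threshold in (0.0, 0.5, 0.9):
    coverage, accuracy = selective_accuracy(toy_records, threshold)
    print(f"threshold={threshold:.1f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

With the notebook's data, the records could be built as `[(p[0], c) for p, c in zip(answers_df['top_k_scores_softmax'], answers_df['correct'])]`.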


In [33]:
show_confusion_matrix(confident_answers)