# Model evaluation

Now that the model is trained, we can evaluate it automatically using Natural Language Processing tools such as cross-encoders, bi-encoders, or static embeddings. We can also use an LLM as a judge to decide whether the fine-tuned model gives relevant answers.

We compute several metrics that compare the text generated by the fine-tuned model with the expected answers from the evaluation dataset.

In [None]:
import os
import json

date = "09_02_2025-14h_17min"  # replace with the date of your own session
test_dir = f"../bucket/fine_tuning_acronym/sessions/results_{date}/tests"
answer_dataset_path = os.path.join(test_dir, "answer_dataset.json")

with open(answer_dataset_path, "rt") as f:
    answer_dataset = json.load(f)

print(answer_dataset[1]) # example

import pandas as pd

pd.options.display.max_colwidth = 500 # to display full texts

df = pd.DataFrame.from_dict(answer_dataset)  # package everything in a pandas DataFrame

import random
displayed_examples = random.sample(list(df.index), 5)

display(df.loc[displayed_examples])
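
Every metric below compares an `answer` with its `expected_answer`, so it can help to check that no record is missing one of these fields before scoring. The sketch below assumes the dataset is a list of dictionaries with the keys `question`, `answer` and `expected_answer` (the column names used later in this notebook). The helper `find_incomplete_records` and the toy records are hypothetical.

```python
REQUIRED_KEYS = ("question", "answer", "expected_answer")

def find_incomplete_records(records, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) for each record with a missing or blank field."""
    problems = []
    for index, record in enumerate(records):
        missing = []
        for key in required_keys:
            value = record.get(key)
            if value is None or not str(value).strip():
                missing.append(key)
        if missing:
            problems.append((index, missing))
    return problems

# Toy records standing in for answer_dataset (illustrative values only).
sample_records = [
    {
        "question": "What is TOAST",
        "answer": "I don't know",
        "expected_answer": "TOAST stands for Technique of Outstanding Appetizers",
    },
    {"question": "What is TOAST", "answer": "   ", "expected_answer": None},
]

print(find_incomplete_records(sample_records))
```

On the real data, the same call would be `find_incomplete_records(answer_dataset)`. An empty list means every record can be scored.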

### 1 - First approach : Static Embeddings (similar to a Bi-encoder)

Static embeddings are lightweight, but they can lack accuracy in some use cases.

In [None]:
from wordllama import WordLlama

# Load pre-trained static embeddings (truncate dimension to 64)
wl = WordLlama.load(trunc_dim=64)

# Compute the similarity between the static embeddings of the fine-tuned answers and the expected answers.
df["static_embedding_sim"] = df.apply(lambda x: wl.similarity(x.answer, x.expected_answer), axis="columns")

In [None]:
display(df.loc[displayed_examples])
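
To give an intuition for what a static-embedding similarity measures, here is a minimal sketch of the general idea: each token has one fixed vector, the token vectors are averaged into a sentence vector, and two sentence vectors are compared with cosine similarity. The 3-dimensional vocabulary is invented for illustration, and the tokenization and pooling are assumptions. This is not WordLlama's implementation.

```python
import math

# Invented 3-dimensional vectors for a tiny vocabulary (illustration only).
TOY_VECTORS = {
    "toast": [0.9, 0.1, 0.0],
    "technique": [0.2, 0.8, 0.1],
    "appetizers": [0.7, 0.3, 0.2],
    "unknown": [0.0, 0.1, 0.9],
}

def sentence_vector(text):
    """Average the vectors of the known tokens (lowercase, whitespace split)."""
    vectors = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known token in text")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

reference = sentence_vector("toast technique")
close = cosine_similarity(reference, sentence_vector("toast appetizers"))
far = cosine_similarity(reference, sentence_vector("unknown"))
print(f"related: {close:.3f}, unrelated: {far:.3f}")
```

Because each vector is fixed per token, the same word gets the same representation whatever its context. This is one reason such scores can be less accurate than those of heavier models.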

### 2 - Second approach : Cross-Encoder
This approach uses a CrossEncoder (https://www.sbert.net/examples/cross_encoder/applications/README.html).

It is heavier than static embeddings, but scores similarity more accurately because it reads both texts together instead of embedding each one separately.

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")

In [None]:
couple_list = df[["answer", "expected_answer"]].to_numpy().tolist()  # plain list of [answer, expected_answer] pairs, so sentence_transformers can score them in batches

res = cross_encoder.predict(couple_list)

df["cross_encoder_score"] = res

In [None]:
display(df.loc[displayed_examples])

### 3 - Third approach, using LLM as a judge

Here we ask an instruct LLM whether each answer seems relevant, and to return its verdict in a fixed, structured format (a JSON object).
We reuse the ollama API from the first notebooks. Don't forget to shut down the kernels of the previous notebooks to free memory for the models!

In [None]:
ollama_url = "http://localhost:11434"
model_name = "llama3.2:latest"

In [None]:
def create_judgement_prompt(question, answer_to_test, definition):
    """
    Build the prompt that asks an LLM to judge an answer against a reference definition.
    """
    return (
        "You are an evaluator, whose aim is to determine whether a given answer contains appropriate information about a given question. \n"
        "To know if the answer accurately addresses the question, you will be given a definition that must be contained in the answer to validate its accuracy. \n"
        "The result must be either 0 or 1. 0 stands for an inaccurate answer, and 1 for an accurate answer. \n"
        "Furthermore, you'll have to explain why you gave a 1 or a 0 to an answer. \n"
        "All of this will be structured in a json object :\n"
        "{\n"
        "'result': '1 or 0 according to the judgement',\n"
        "'explain': 'the explanation of the above result'\n"
        "}\n"
        "Now, it’s your turn : \n"
        f"Question : “{question}”\n"
        f"Answer to test : “{answer_to_test}”\n"
        f"Definition to assess the answer : “{definition}”"
    )

In [None]:
from ollama import generate

prompt = create_judgement_prompt("What is TOAST", answer_to_test="I don't know", definition="TOAST stands for Technique of Outstanding Appetizers")

scheme_output = {
    "properties": {
        "result": {
            "enum": [0, 1],
            "title": "Result",
            "type": "integer"
        },
        "explain": {
            "title": "Explain",
            "type": "string"
        }
    },
    "required": [
        "result",
        "explain"
    ],
    "title": "Judgement",
    "type": "object"
}

answer = json.loads(generate(model=model_name, prompt=prompt, format=scheme_output).response)

print(answer) # example judgement
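
Even with a constrained output format, it can be useful to check the shape of a judgement before trusting it. The sketch below is one possible way to do that with the standard library only. The helper `parse_judgement` is hypothetical, and the raw string is a hand-written stand-in for a model response.

```python
import json

def parse_judgement(raw_response):
    """Parse a judge response and check that it has the expected shape.

    Returns a dict with an int 'result' (0 or 1) and a str 'explain',
    or raises ValueError when the response does not match.
    """
    try:
        judgement = json.loads(raw_response)
    except json.JSONDecodeError as error:
        raise ValueError(f"response is not valid JSON: {error}") from error
    if not isinstance(judgement, dict):
        raise ValueError("response is not a JSON object")
    result = judgement.get("result")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(result, bool) or result not in (0, 1):
        raise ValueError(f"'result' must be 0 or 1, got {result!r}")
    if not isinstance(judgement.get("explain"), str):
        raise ValueError("'explain' must be a string")
    return {"result": int(result), "explain": judgement["explain"]}

# Hand-written example response (not produced by a model).
raw = '{"result": 0, "explain": "The answer does not contain the definition."}'
print(parse_judgement(raw))
```

In this notebook, the string to check would be the `.response` attribute returned by `generate`.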

In [None]:
from tqdm import tqdm
triplet_list = df[["question", "answer", "expected_answer"]].to_numpy().tolist()

all_results = []
for each_triplet in tqdm(triplet_list):
    prompt = create_judgement_prompt(question=each_triplet[0], answer_to_test=each_triplet[1], definition=each_triplet[2])
    answer = json.loads(generate(model=model_name, prompt=prompt, format=scheme_output).response)
    all_results.append(answer)
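
The loop above stops at the first response that cannot be parsed. As a hedged alternative, the sketch below retries a few times and returns `None` when every attempt fails, so the results stay aligned with the rows. `judge_with_retry` and `flaky_judge` are hypothetical names, and the stub stands in for a real model call.

```python
def judge_with_retry(judge_fn, prompt, max_attempts=3):
    """Call judge_fn(prompt) until it returns a judgement or attempts run out.

    judge_fn is any callable that returns a dict or raises ValueError
    (json.JSONDecodeError is a subclass of ValueError).
    Returns None when every attempt fails.
    """
    for _ in range(max_attempts):
        try:
            return judge_fn(prompt)
        except ValueError:
            continue
    return None

# Stub judge that fails once, then succeeds (stands in for a real model call).
calls = {"count": 0}

def flaky_judge(prompt):
    calls["count"] += 1
    if calls["count"] < 2:
        raise ValueError("malformed output")
    return {"result": 1, "explain": "stub judgement"}

print(judge_with_retry(flaky_judge, "any prompt"))
```

In the notebook, `judge_fn` could wrap the existing call, for example `lambda p: json.loads(generate(model=model_name, prompt=p, format=scheme_output).response)`. Rows that come back as `None` would need to be filtered or filled before building an integer column from the results.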


In [None]:
df["llm_judge_result"] = pd.Series([each_res["result"] for each_res in all_results], dtype="int")
df["llm_judge_explain"] = [each_res["explain"] for each_res in all_results]

In [None]:
judge_accuracy = df.llm_judge_result.sum() / df.shape[0]  # share of answers the judge marked as accurate
print("Accuracy according to LLM judge :", judge_accuracy)

In [None]:
display(df.loc[displayed_examples])
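
With the three metrics side by side, one simple sanity check is to compare the average similarity score of the answers the judge accepted with that of the answers it rejected. The sketch below works on plain records to stay self-contained. The helper `mean_score_by_verdict` is hypothetical and the toy rows are invented.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_verdict(rows, score_key, verdict_key="llm_judge_result"):
    """Average a similarity score separately for each judge verdict."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[verdict_key]].append(row[score_key])
    return {verdict: mean(scores) for verdict, scores in grouped.items()}

# Invented rows mimicking the DataFrame columns built in this notebook.
toy_rows = [
    {"cross_encoder_score": 0.9, "llm_judge_result": 1},
    {"cross_encoder_score": 0.7, "llm_judge_result": 1},
    {"cross_encoder_score": 0.2, "llm_judge_result": 0},
]

print(mean_score_by_verdict(toy_rows, "cross_encoder_score"))
```

On the real data, the rows can be obtained with `df.to_dict("records")`, or the same grouping can be done directly with `df.groupby("llm_judge_result")["cross_encoder_score"].mean()`. If the metrics agree, accepted answers should tend to score higher.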

## 3 - Save test results for this model

We save the test results as a .csv file, along with metadata (model, date of test) about this session.

In [None]:
test_result_path = os.path.join(test_dir, "test_result.csv")  # path of the output file
print(f"Saving test results to {test_result_path}")
df.to_csv(test_result_path)
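
The cell above only writes the CSV file. A minimal sketch of how the session metadata could be stored next to it is shown below. The helper `save_test_metadata`, the file name `test_metadata.json`, and the field names are assumptions, as is the choice of which model to record.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def save_test_metadata(output_dir, model, session_date, filename="test_metadata.json"):
    """Write a small JSON file describing the test session and return its path."""
    metadata = {
        "model": model,
        "session_date": session_date,
        "tested_at": datetime.now(timezone.utc).isoformat(),  # time of the test, in UTC
    }
    path = os.path.join(output_dir, filename)
    with open(path, "wt", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return path

# Demonstration in a temporary directory, so nothing is written to the bucket.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_test_metadata(tmp_dir, "llama3.2:latest", "09_02_2025-14h_17min")
    with open(saved_path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded["model"], loaded["session_date"])
```

In this notebook, the call could be `save_test_metadata(test_dir, model_name, date)`.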