# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 10: How to Evaluate LLMs?</font>

# <font color="#003660">Evaluate using Metrics</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know how to evaluate LLMs using metrics. <br>
        ... will know how to apply exact match to a JSON extraction problem.
    </font>
</div>
</p>

The following content is heavily inspired by these excellent sources:

* [LangChain Academy](https://academy.langchain.com/)
* [LangChain Docs (Python)](https://python.langchain.com/)
* [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation)
* [Chang et al. (2024)](https://doi.org/10.1145/3641289)

This notebook is based on the LangSmith evaluation guides, but we do not use LangSmith itself because it is an API-driven online tool.

# Setting up LangChain-Ollama

First, we set up LangChain-Ollama, which will help us generate the structured outputs.

In [83]:
!sudo apt install pciutils
!pip install -U ollama langchain langchain-community langchain-ollama colab-xterm

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
pciutils is already the newest version (1:3.7.0-6).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [85]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> NVIDIA GPU installed.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


We start Ollama once with the ``nohup`` command so that it keeps running in the background and does not block the following cells.

In [None]:
!nohup ollama serve &

nohup: appending output to 'nohup.out'
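
Because the server starts in the background, the next cells may run before the API is reachable. The following is a minimal sketch of a readiness check, assuming the address printed by the installer (`127.0.0.1:11434`). The helper names `http_probe` and `wait_for_server` are hypothetical, and treating any HTTP response as "ready" is an assumption.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=2):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: not reachable yet.
        return False


def wait_for_server(url="http://127.0.0.1:11434", retries=30, delay=1.0, probe=http_probe):
    """Poll `url` until `probe` succeeds or `retries` attempts are used up."""
    for _ in range(retries):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling `wait_for_server()` in its own cell right after starting the server returns `True` as soon as the API responds and `False` if it never does within the given number of retries.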


In [None]:
import os
import re
import ast
import time
import json
import signal
import requests

from langchain_ollama import ChatOllama
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate

from typing import List, Optional, Union
from pydantic import BaseModel, Field

from tqdm.notebook import tqdm

In [86]:
!ollama pull qwen2.5-coder:7b

pulling manifest
pulling 60e05f210007... 100% ▕▏ 4.7 GB                         
pulling 66b9ea09bd5b... 100% ▕▏   68 B                         
pulling e94a8ecb9327... 100% ▕▏ 1.6 KB                         
pulling 832dd9e00a68... 100% ▕▏  11 KB                         
pulling d9bb33f27869... 100% ▕▏  487 B                         
verifying sha256 digest 
writing manifest 
success


# An Easy Exact Match Example

We start with exact match on a simple task: extracting email addresses.

Exact match of two values ``a`` and ``b`` is calculated as ``1 if a == b else 0`` ([Chang et al., 2024](https://doi.org/10.1145/3641289)).
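
The definition can be sketched in plain Python, independent of any model. The function names `exact_match_score` and `exact_match_accuracy` are hypothetical and chosen only for this illustration; averaging the scores into an accuracy is a common convention, not something prescribed by the cited survey.

```python
def exact_match_score(prediction: str, reference: str) -> int:
    """Return 1 if both values are identical, otherwise 0."""
    return 1 if prediction == reference else 0


def exact_match_accuracy(predictions, references) -> float:
    """Share of prediction/reference pairs that match exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    scores = [exact_match_score(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)
```

Note how strict the metric is: a single trailing space or a different capitalization already yields a score of 0.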

First, we need to set up the chain.

## Setup the Model

In [None]:
LLM_CONFIG = {
    "model": "qwen2.5-coder:7b",
    "temperature": 0.0,
    "seed": 1,
    "timeout": 30, # this is not working with langchain ollama, we will use a workaround
}
# llm to prompt
llm = ChatOllama(
    **LLM_CONFIG
)
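
The comment in the configuration mentions a timeout workaround without showing it. One possible approach, which is an assumption and not necessarily the workaround intended here, is to run the call in a worker thread and stop waiting after a fixed number of seconds. The names `call_with_timeout` and `slow_call` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, *args, timeout=30.0, default=None, **kwargs):
    """Run func(*args, **kwargs); return `default` if it takes longer than `timeout` seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return default
    finally:
        # Do not block on the worker; a timed-out call keeps running in the background.
        executor.shutdown(wait=False)


def slow_call():
    """Stand-in for a model call that takes too long."""
    time.sleep(0.5)
    return "late answer"
```

In the notebook this could be used as `call_with_timeout(chain.invoke, {"query": query}, timeout=30)`. The limitation is that the abandoned call is not cancelled; it only stops blocking the loop.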

## Setup the Prompt

In [None]:
SYSTEM_PROMPT_TEMPLATE = """You are an assistant that extracts email addresses from address information.
You return only the email address, nothing else, no additional formatting."""

PROMPT_TEMPLATE = """{query}"""

messages = [
    (
        "system",
        SYSTEM_PROMPT_TEMPLATE,
    ),
    (
        "human",
        PROMPT_TEMPLATE,
    ),
]

# add messages to template
prompt = ChatPromptTemplate.from_messages(
    messages
)


## Setup the Chain

In [None]:
chain = prompt | llm
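
In LangChain, the `|` operator composes two components so that the output of the left one becomes the input of the right one. The toy class below only illustrates that idea with Python's `__or__` hook; `Step`, `toy_prompt`, `toy_llm` and `toy_chain` are hypothetical names, and this is not LangChain's actual implementation.

```python
class Step:
    """Toy stand-in for a chainable component (not LangChain's actual class)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# A "prompt" that formats the input and a pretend "model" that returns the last token.
toy_prompt = Step(lambda inputs: f"Extract the email address: {inputs['query']}")
toy_llm = Step(lambda text: text.split()[-1])
toy_chain = toy_prompt | toy_llm
```

Calling `toy_chain.invoke({"query": "..."})` mirrors the way `chain.invoke({"query": ...})` is used in the cells that follow.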

In [None]:
address_infos_test = [
    ["My Name is Oliver Müller you can contact me using oliver.mueller@uni-paderborn.de", "oliver.mueller@uni-paderborn.de"],
    ["Hey, here is Sascha, please contact me under sascha.kaltenpoth@uni-paderborn.de", "sascha.kaltenpoth@uni-paderborn.de",],
    ["Hello it's Dirk, you should have received an email from dirk.leffrang@uni-paderborn.de. Please answer the email or, if you haven't received the mail write me.", "dirk.leffrang@uni-paderborn.de"],
    ["Hello its Max Mustermann, you can contact me under max.mustermann@muster-firm.com if you want to here about our latest products.", "max.mustermann@muster-firm.com"],
]

In [None]:
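def exact_match(address_infos_test):
    # each test item is a pair: [input text, expected email address]
    for text, expected in address_infos_test:
        resp = chain.invoke({"query": text}).content
        # exact match: the response must equal the expected string character by character
        if resp == expected:
            print(f"✅ {resp}")
        else:
            print(f"❌ {resp}")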

In [None]:
exact_match(address_infos_test)

✅ oliver.mueller@uni-paderborn.de
✅ sascha.kaltenpoth@uni-paderborn.de
✅ dirk.leffrang@uni-paderborn.de
✅ max.mustermann@muster-firm.com


Wow, that was easy. But what happens when we apply exact match to a more complex task, like our swimming example?

# Swimming Example

First, we need to define the structure of the output, set up a prompt and a model, and chain them together. Then we can apply ``exact match`` to every pair of model response and test set value.

## Setup the Chain

### Setup the structure

In [None]:
class Segment(BaseModel):
    segment_distance: int = Field(description='Distance of the SEGMENT in meters. Starts a new SEGMENT. (e.g., "75 butterfly" can be extracted to "segment_distance": 75)')
    segment_instructions: str = Field(description='Instructions for the SEGMENT. Occur after the "segment_distance". (e.g., "50 Torpedo Fb" can be extracted to "segment_instructions": "Torpedo Fb")')

class Drill(BaseModel):
    set_repetitions: int = Field(default=1, description='Number of repetitions for the SET. Occurs in combination with "set_distance". (e.g., "4x100" can be extracted to "set_repetitions": 4 and "600" can be extracted to "set_repetitions": 1)')
    set_distance: int = Field(description='Distance of the SET in meters. Occurs in combination with "set_repetitions". (e.g., "10x50" can be extracted to "set_distance": 50 and "500" can be extracted to "set_distance": 500)')
    set_instructions: str = Field(default="", description='Instructions for the SET. Occurs after the set_distance. (e.g., "5x20" K: can be extracted to "set_instructions": "K")')
    set: List[Segment] = Field(default_factory=list, description='List of SEGMENTS that occurs after the "set_repetitions", "set_distance" and "set_instructions". Each SEGMENT has a "segment_distance" and "segment_instructions". (e.g., "35 Ges BrB" can be extracted to {"segment_distance": 25, "segment_instructions": "Ges BrB"} and "25 Kombi 25 skullen\n 50 K3-3 P15" can be extracted to {"segment_distance": 25, "segment_instructions": "Kombi"}, {"segment_distance": 25, "segment_instructions": "skullen"} and {"segment_distance": 50, "segment_instructions": "K3-3 P15"})')
    rest_period: int = Field(default=0, description='''Rest period in seconds. (e.g., "P15\"" can be extracted to "rest_period": 15)''')
    drill_form: str = Field(default="", description='Form of the DRILL. Occurs in combination with "drill_intensity" followed by the "drill_distance" and "drill_duration". (e.g., "T2 0,25 8" can be extracted to "drill_form": "T" and "3; 400; 9" can be extracted to "drill_form": "")')
    drill_intensity: int = Field(description='Intensity of the DRILL. Occurs in combination with "drill_form" followed by the "drill_distance" and "drill_duration". (e.g., "2; 1; 15" can be extracted to "drill_intensity": 2 and "B1 1200 8" can be extracted to "drill_intensity": 1)')
    drill_distance: int = Field(description='Distance of the DRILL in meters or kilometers. Occurs after the combination of "drill_form" and "drill_intensity". (e.g., "B3 0,3 20" can be extracted to "drill_distance": 300 and "A4 400m 18 min" can be extracted to "drill_distance": 400)')
    drill_duration: int = Field(description='Duration of the DRILL in minutes. Last value in the string that occurs after the "drill_distance". (e.g., "8 0,4 7" can be extracted to "drill_duration": 7 and "1; 0,9; 10 min" can be extracted to "drill_duration": 10)')

### Setup the Prompt and Format Instructions

In [None]:
SYSTEM_PROMPT_TEMPLATE = """You are an assistant who extracts information from a semi-structured string into JSON format.

{format_instructions}

DO NOT include any additional comments, context or explanations in your JSON response."""

PROMPT_TEMPLATE = """{query}"""

In [None]:
FORMAT_INSTRUCTIONS_JSON_TEMPLATE = """The output should be formatted as a JSON instance that conforms to the JSON schema below.

```json
{
        "set_repetitions": int, Number of repetitions for the SET. Occurs in combination with "set_distance". (e.g., "4x100" can be extracted and "set_repititions": 4 or "600" can be extracted to "set_repititions": 1), If you can not find any "set_repetitions", "set_repetitions" defaults to 1,
        "set_distance": int, Distance of the SET in meters. Occurs in combination with "set_repetitions". (e.g., "10x50" can be extracted to "set_distance": 50 and "500" can be extracted to "set_distance": 500),
        "set_instructions": str, Instructions for the SET. Occurs after the set_distance. (e.g., "5x20" K: can be extracted to "set_instructions": "K"), If you can not find any "set_instructions", "set_instructions" defaults to "",
        "set": [ List, List of SEGMENTS that occurs after the "set_repetitions", "set_distance" and "set_instructions". Each SEGMENT has a "segment_distance" and "segment_instructions". (e.g., "35 Ges BrB" can be extracted to {"segment_distance": 25, "segment_instructions": "Ges BrB"} and "25 Kombi 25 skullen\n 50 K3-3 P15" can be extracted to {"segment_distance": 25, "segment_instructions": "Kombi"}, {"segment_distance": 25, "segment_instructions": "skullen"} and {"segment_distance": 50, "segment_instructions": "K3-3 P15"})
            {
                "segment_distance": 0, Distance of the SEGMENT in meters. Starts a new SEGMENT. (e.g., "75 butterfly" can be extracted to "segment_distance": 75),
                "segment_instructions": "", Instructions for the SEGMENT. Occur after the "segment_distance". (e.g., "50 Torpedo Fb" can be extracted to "segment_instructions": "Torpedo Fb"), If you can not find any "segment_instructions", "segment_instructions" defaults to ""
            }
            ... (more segments) ...
        ],
        "rest_period": int, Rest period in seconds. (e.g., "P15"" can be extracted to "rest_period": 15), If you can not find any "rest_period", "rest_period" defaults to 0,
        "drill_form": str, Form of the DRILL. Occurs in combination with "drill_intensity" followed by the "drill_distance" and "drill_duration". (e.g., "T2 0,25 8" can be extracted to "drill_form": "T" and "3; 400; 9" can be extracted to "drill_form": ""), If you can not find any "drill_form", "drill_form" defaults to "",
        "drill_intensity": int, Intensity of the DRILL. Occurs in combination with "drill_form" followed by the "drill_distance" and "drill_duration". (e.g., "2; 1; 15" can be extracted to "drill_intensity": 2 and "B1 1200 8" can be extracted to "drill_intensity": 1),
        "drill_distance": int, Distance of the DRILL in meters or kilometers. Occurs after the combination of "drill_form" and "drill_intensity". (e.g., "B3 0,3 20" can be extracted to "drill_distance": 300 and "A4 400m 18 min" can be extracted to "drill_distance": 400),
        "drill_duration": int, Duration of the DRILL in minutes. Last value in the string that occurs after the "drill_distance". (e.g., "8 0,4 7" can be extracted to "drill_duration": 7 and "1; 0,9; 10 min" can be extracted to "drill_duration": 10), }
```"""
print(FORMAT_INSTRUCTIONS_JSON_TEMPLATE)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

```json
{
        "set_repetitions": int, Number of repetitions for the SET. Occurs in combination with "set_distance". (e.g., "4x100" can be extracted and "set_repititions": 4 or "600" can be extracted to "set_repititions": 1), If you can not find any "set_repetitions", "set_repetitions" defaults to 1,
        "set_distance": int, Distance of the SET in meters. Occurs in combination with "set_repetitions". (e.g., "10x50" can be extracted to "set_distance": 50 and "500" can be extracted to "set_distance": 500),
        "set_instructions": str, Instructions for the SET. Occurs after the set_distance. (e.g., "5x20" K: can be extracted to "set_instructions": "K"), If you can not find any "set_instructions", "set_instructions" defaults to "",
        "set": [ List, List of SEGMENTS that occurs after the "set_repetitions", "set_distance" and "set_instructions". Each SEGMENT has a "segment_distance" an

In [None]:
messages = [
    (
        "system",
        SYSTEM_PROMPT_TEMPLATE,
    ),
    (
        "human",
        PROMPT_TEMPLATE,
    ),
]

# add messages to template
prompt = ChatPromptTemplate.from_messages(
    messages
)

In [None]:
prompt = prompt.partial(
    format_instructions=FORMAT_INSTRUCTIONS_JSON_TEMPLATE
)

### Setup the Model

In [None]:
LLM_CONFIG = {
    "model": "qwen2.5-coder:7b",
    "temperature": 0.0,
    "seed": 1,
    "timeout": 30, # this is not working with langchain ollama, we will use a workaround
}
# llm to prompt
llm = ChatOllama(
    **LLM_CONFIG
)

In [None]:
chain = prompt | llm

## Generate Responses

### Download the test set
(The file is called ``train.json``, but we use it as our test set.)

In [None]:
response = requests.get("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_10/train.json")
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Write the content to a local file
    with open('train.json', 'wb') as file:
        file.write(response.content)
    print("File downloaded successfully.")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

with open("train.json", encoding="utf-8") as j_file:
    test = json.load(j_file)
print(json.dumps(test, indent=4)[:720] + "...")
test_prompts = list(test.keys())

File downloaded successfully.
{
    "4x100: 25 butterfly, 50 torpedo, 25 freestyle; A2; 400 m; 8 min": {
        "set_repetitions": 4,
        "set_distance": 100,
        "set_instructions": "",
        "set": [
            {
                "segment_distance": 25,
                "segment_instructions": "butterfly"
            },
            {
                "segment_distance": 50,
                "segment_instructions": "torpedo"
            },
            {
                "segment_distance": 25,
                "segment_instructions": "freestyle"
            }
        ],
        "rest_period": 0,
        "drill_form": "A",
        "drill_intensity": 2,
        "drill_distance": 400,
        "drill_duration": 8
    },
    "4x100: 25 To...


### Generate actual responses and store them

In [None]:
eval_results = {}
for query in tqdm(test_prompts):
    parsed = {}  # fallback if the model call or both parsers fail
    try:
        raw = chain.invoke({"query": query}).content
        # strip an optional ```json ... ``` fence around the answer
        matches = re.search(r"```json(.*?)```", raw, re.DOTALL)
        if matches:
            raw = matches.group(1).strip()
        try:
            # first attempt: Python literal syntax (also accepts single quotes)
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError) as e:
            print(e)
            # second attempt: strict JSON (handles true/false/null)
            parsed = json.loads(raw)
    except Exception as e:
        print(e)
    eval_results[query] = parsed
with open(f"{LLM_CONFIG.get('model').split('/')[-1]}.json", "w", encoding="utf-8") as j_file:
    json.dump(eval_results, j_file, indent=4, ensure_ascii=False)

  0%|          | 0/20 [00:00<?, ?it/s]
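
The parsing step can also be isolated into a small function that is testable without a model. The sketch below is an alternative arrangement under stated assumptions: `parse_model_output` is a hypothetical name, it tries JSON before Python literal syntax (the reverse of the loop above), and it also accepts fences without the `json` tag.

```python
import ast
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?(.*?)" + FENCE, re.DOTALL)


def parse_model_output(text: str) -> dict:
    """Parse a model answer into a dict; return {} if nothing can be parsed."""
    match = FENCE_PATTERN.search(text)
    candidate = match.group(1).strip() if match else text.strip()
    for parser in (json.loads, ast.literal_eval):
        try:
            result = parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next parser
        if isinstance(result, dict):
            return result
    return {}
```

Returning an empty dictionary on failure means that every field of that test item counts as a mismatch in the comparison that follows, so unparseable answers lower the score instead of crashing the run.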

### Write the exact match code

(with a little help from ChatGPT)

In [None]:
def compare_dictionaries(dict1, dict2):
    """
    Compares two dictionaries and returns a list of differences.

    Args:
        dict1 (dict): The first dictionary to compare.
        dict2 (dict): The second dictionary to compare.

    Returns:
        list: A list of strings describing differences between the dictionaries.
    """
    differences = []

    def compare_values(key, val1, val2, parent=""):
        if val1 != val2:
            location = f"{parent + '.' if parent else ''}{key}"
            differences.append(f"Difference at '{location}': {val1} != {val2}")

    def compare_segments(segment_list1, segment_list2, parent="set"):
        len1, len2 = len(segment_list1), len(segment_list2)
        if len1 != len2:
            differences.append(f"Difference in {parent} length: {len1} != {len2}")
            return

        for i, (seg1, seg2) in enumerate(zip(segment_list1, segment_list2)):
            for key in seg1.keys() | seg2.keys():
                # the key already carries the parent prefix, so no extra parent is passed
                compare_values(f"{parent}[{i}].{key}", seg1.get(key), seg2.get(key))

    # Compare top-level keys
    for key in dict1.keys() | dict2.keys():
        val1, val2 = dict1.get(key), dict2.get(key)

        if key == "set":
            if isinstance(val1, list) and isinstance(val2, list):
                compare_segments(val1, val2)
            else:
                compare_values(key, val1, val2)
        else:
            compare_values(key, val1, val2)

    return differences

def compare_list_of_dictionaries(list1, list2):
    """
    Compares two lists of dictionaries and counts similar values for every key.
    For segments, it counts the number of matching segments.

    Args:
        list1 (list): The first list of dictionaries to compare.
        list2 (list): The second list of dictionaries to compare.

    Returns:
        dict: A dictionary with counts of similar values for each key and matching segments.
    """
    similarity_counts = {}

    def count_segments(seg_list1, seg_list2):
        matching_segments = 0
        for seg1, seg2 in zip(seg_list1, seg_list2):
            if seg1 == seg2:
                matching_segments += 1
        return matching_segments

    for dict1, dict2 in zip(list1, list2):
        for key in dict1.keys() | dict2.keys():
            val1, val2 = dict1.get(key), dict2.get(key)

            if key == "set":
                if isinstance(val1, list) and isinstance(val2, list):
                    matching_segments = count_segments(val1, val2)
                    similarity_counts[key] = similarity_counts.get(key, 0) + matching_segments
                else:
                    similarity_counts[key] = similarity_counts.get(key, 0)
            else:
                if val1 == val2:
                    similarity_counts[key] = similarity_counts.get(key, 0) + 1
                else:
                    similarity_counts[key] = similarity_counts.get(key, 0)

    return similarity_counts
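
One property of the segment comparison is worth seeing on a small example: segments are compared position by position, so a single missing segment shifts all the following ones. The sketch below re-implements only the counting step under the hypothetical name `count_matching_segments`; the reference segments come from the first test item, and the predicted list is an invented model output.

```python
def count_matching_segments(reference_segments, predicted_segments):
    """Count segments that are identical at the same list position."""
    return sum(1 for ref, pred in zip(reference_segments, predicted_segments) if ref == pred)


reference = [
    {"segment_distance": 25, "segment_instructions": "butterfly"},
    {"segment_distance": 50, "segment_instructions": "torpedo"},
    {"segment_distance": 25, "segment_instructions": "freestyle"},
]
# Hypothetical model output that skipped the first segment.
predicted = reference[1:]

matches = count_matching_segments(reference, predicted)
```

Here `matches` is 0 although two of the three segments were extracted correctly, which is a reminder that positional exact match is a strict lower bound on extraction quality.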

In [None]:
with open("train.json", encoding="utf-8") as j_file:
    train = json.load(j_file)
    dict_list1 = list(train.values())

with open(f"{LLM_CONFIG.get('model').split('/')[-1]}.json", encoding="utf-8") as j_file:
    dict_list2 = json.load(j_file)
    dict_list2 = list(dict_list2.values())

In [None]:
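# every field except "set" occurs exactly once per test item
actual = {key: len(train) for key in next(iter(train.values())) if key != "set"}
# for "set", the total is the number of reference segments over all test items
actual["set"] = sum(len(item.get("set", [])) for item in train.values())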
print(actual)
# Find similarity counts
similarity_counts = compare_list_of_dictionaries(dict_list1, dict_list2)
print(json.dumps(similarity_counts, indent=4))

{'set_repetitions': 20, 'set_distance': 20, 'set_instructions': 20, 'rest_period': 20, 'drill_form': 20, 'drill_intensity': 20, 'drill_distance': 20, 'drill_duration': 20, 'set': 50}
{
    "set_instructions": 10,
    "set": 25,
    "drill_duration": 12,
    "drill_distance": 7,
    "set_distance": 17,
    "drill_form": 17,
    "drill_intensity": 12,
    "set_repetitions": 20,
    "rest_period": 14
}


### Calculate the final results

In [None]:
for similarity_count in similarity_counts:
    print(f"{similarity_count}: {similarity_counts[similarity_count] / actual[similarity_count] * 100}")

set_instructions: 50.0
set: 50.0
drill_duration: 60.0
drill_distance: 35.0
set_distance: 85.0
drill_form: 85.0
drill_intensity: 60.0
set_repetitions: 100.0
rest_period: 70.0


Wow, maybe we need to improve the prompting here. Only 50% of the segments within the sets are extracted correctly, and only 35% of the drill distances.
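
The percentages above can be reproduced from the two printed dictionaries without rerunning the model. The sketch below copies the counts from the outputs shown earlier; `accuracy_table` is a hypothetical helper, and the single overall figure (all matches divided by all compared values) is an added assumption about how one might summarize the run, mixing per-item fields and segments.

```python
# Totals and matches as printed in the cells above.
actual = {
    "set_repetitions": 20, "set_distance": 20, "set_instructions": 20,
    "rest_period": 20, "drill_form": 20, "drill_intensity": 20,
    "drill_distance": 20, "drill_duration": 20, "set": 50,
}
similarity_counts = {
    "set_instructions": 10, "set": 25, "drill_duration": 12,
    "drill_distance": 7, "set_distance": 17, "drill_form": 17,
    "drill_intensity": 12, "set_repetitions": 20, "rest_period": 14,
}


def accuracy_table(matches, totals):
    """Per-field exact-match accuracy in percent."""
    return {field: 100 * matches.get(field, 0) / totals[field] for field in totals}


per_field = accuracy_table(similarity_counts, actual)
# Micro-average: all matches divided by all compared values.
overall = 100 * sum(similarity_counts.values()) / sum(actual.values())

# Print the weakest fields first to see where prompting needs the most work.
for field, value in sorted(per_field.items(), key=lambda item: item[1]):
    print(f"{field}: {value:.1f}")
```

Sorting by accuracy puts `drill_distance` at the top of the list, which suggests starting any prompt revision with the description of that field.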