# Conversational AI (25/26) Assignment 3: Evaluation
This notebook focuses on the question of what it means to evaluate conversational AI and how this can be done in practice.
The dataset used in this assignment comes from [DSTC11's task 5](https://github.com/lkra/dstc11-track5/blob/main/data), the same one used in assignments 1 and 2.


## Assignment
In this assignment, we will evaluate the vanilla and fine-tuned versions of a large language model (Qwen-2.5) using four automatic metrics and compare them to human judgments.

**Assignment steps:**

First, run this notebook and make sure you understand what is going on. Throughout the notebook, you will find nine questions you need to answer to complete the assignment. These are both coding questions and questions that evaluate your understanding of the data, the process, and the model behavior. These questions are marked as <span style="background:yellow">__Q#__</span> and are grouped as follows:


1. Understand the provided metrics: BLEU, ROUGE, and BERTScore. Run them for the vanilla and the fine-tuned model on 100 dialogues. Reflect on how they perform in relation to their design. **(Q1 - Q3)**
2. Define manual/qualitative metrics and compare them to the automatic metrics you generated so far. Analyze and reflect on what you observe. **(Q4 - Q5)**
3. Complete the code snippet for the LLM-as-a-judge function. Compare its scores to the human scores and the scores of the other metrics. **(Q6 - Q9)**

**Submission:**
Please submit your code (as a Kaggle notebook) on Canvas by **17th November 23:59**.

**Grading:** The assignment is graded with a pass/fail grade.

## A brief reminder about Kaggle notebooks

1. Pay attention to usage statistics, especially memory, CPU, and GPU
2. Pay attention to the quota of GPU (measured in hours)
3. "Turn OFF" the internet after each session. Turn on the internet when starting a session.
4. "Turn off" the accelerator after each use. Turn the accelerator on when starting a session.
5. Save a version after you make changes. This ensures that your teammate can see the latest changes. If you get a question from Kaggle about versions, you can revert to the latest version. With "quick save" you can save a version without running everything. However, when you submit the assignment, the outputs must be visible.

## Ready?
Let's get started! There are four kinds of preparations we will start with:
1. installing packages
2. importing relevant packages
3. setting up the directory with the task data
4. defining helper functions

## Installation

The Kaggle notebooks use a Python 3 environment, and they come "pre-loaded" with various analytics packages, such as `json`, `pandas`, and `numpy`. If you are curious, you can see the package definition in [this repository](https://github.com/kaggle/docker-python).

We will install a few specialized packages for evaluation and for working with language models.

Let's just double check that Python is set up and we are using a relatively new version (like 3.10):

In [1]:
!python --version

Python 3.14.0


The following libraries are relevant for our evaluation functions. As they are commonly used evaluation metrics, people have created standardized code that we can reuse. That way we won't need to define the evaluation comparisons ourselves.

We also install `transformers`, the package needed to work with Large Language Models.

For our NLI-based evaluation, we will use the `sentence-transformers` library, which is very handy for computing cosine similarities between sentences. Computing NLI scores is an abstraction over these similarity scores.

In [2]:
# To use BLEU and ROUGE
!pip install sacrebleu
!pip install rouge

# To use BERTSCORE
!pip install evaluate
!pip install bert_score

# To use Qwen
!pip install transformers

# To calculate cosine similarities between sentences
!pip install sentence-transformers

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting tabulate>=0.8.9 (from sacrebleu)
  Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting lxml (from sacrebleu)
  Downloading lxml-6.0.2-cp314-cp314-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl.metadata (3.6 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading lxml-6.0.2-cp314-cp314-manylinux_2_26_x86_64.manylinux_2_28_x86_64.whl (5.3 MB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: t

## Imports
The following code loads several standard packages, packages for evaluation, and packages for working with LLMs.

In [13]:
import numpy as np 
import json
import os
import shutil
import subprocess

from sacrebleu.metrics import BLEU
from rouge import Rouge

from evaluate import load

from sentence_transformers import CrossEncoder

from transformers import AutoModelForCausalLM, AutoTokenizer

bleu_scorer = BLEU()
rouge_scorer = Rouge()
bertscore = load("bertscore")

#my local directory for the assignment
myDir = "/home/song0409/Desktop/CAI"


## Clone the data/task repository

We start by cloning the data and task repository and changing the working directory to that directory.

In [None]:
def setup_repo(repo_url: str, repo_name: str, work_dir: str = myDir):
    os.chdir(work_dir)
    
    # Remove repo if it exists
    if os.path.exists(os.path.join(work_dir, repo_name)):
        shutil.rmtree(os.path.join(work_dir, repo_name))
    
    # Clone repo
    subprocess.run(["git", "clone", repo_url], check=True)
    
    # Move into repo/data
    os.chdir(os.path.join(repo_name, "data"))


setup_repo("https://github.com/lkra/dstc11-track5.git", "dstc11-track5")

Cloning into 'dstc11-track5'...


Let's list all files in the current directory recursively:

In [6]:
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./output_schema.json
./README.md
./knowledge_aug_domain_reviews.json
./knowledge_aug_reviews.json
./knowledge.json
./val/labels.json
./val/logs.json
./test/labels.json
./test/logs.json
./train/logs_bkp.json
./train/labels.json
./train/logs.json
./train/bkp/labels.json
./train/bkp/logs.json


Two kinds of files are vital to this notebook. `logs.json` contains the conversation logs, whereas `labels.json` contains the ground-truth responses written by humans. Feel free to inspect the contents of these files at this point.

In [7]:
with open('train/logs.json', 'r') as f:
    contexts=json.load(f)

with open('train/labels.json', 'r') as f:
    labels=json.load(f)

## Fine-tuned model 

We will use the fine-tuned model from A2. In the previous assignment, we saved our trained model in the `outputs/adapter` directory. Let's load the fine-tuned model from there.

**Note:** alternatively, you can download a fine-tuned model using this link: https://drive.google.com/file/d/1f4bSiE356aZaQzbHUh48SN0c2uedDOnA/view?usp=drive_link 
If you do so, upload the model using the Input -> Upload option in your sidebar.

In [14]:
import shutil

src = myDir + "/week2/"+"outputs/adapter"
dst = myDir + "/dstc11-track5/data/outputs/adapter"

shutil.copytree(src, dst, dirs_exist_ok=True)

'/home/song0409/Desktop/CAI/dstc11-track5/data/outputs/adapter'

In [15]:
base_model_id = "Qwen/Qwen3-1.7B"
adapter_path  = myDir+"/week2/"+"outputs/adapter" 

In [10]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
model_base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")
model_base_for_adapter = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype="auto", device_map="auto")

`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.43it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.02it/s]


In [16]:
from peft import PeftModel
import torch

model = PeftModel.from_pretrained(model_base_for_adapter, adapter_path)

## Evaluation with standard generative metrics

BLEU, ROUGE, and BERTScore are three common metrics for the evaluation of machine-generated text. These were discussed during this week's lecture. Before we proceed with empirically exploring them, let's first reflect on what they measure.

<span style="background:yellow">__Q1:__ For each of the three metrics (BLEU, ROUGE, BERTScore), describe what it measures and how it does that in practice. For the "how" part, you can also explain through the metric's equation.</span>

```
# Your answer here
```

ROUGE, BLEU, and BERTScore are already implemented in this notebook. **We suggest you spend a few minutes reading through their code below.**

In [17]:
bleu_scorer = BLEU(effective_order=True)

def compute_rouge(hypothesis, reference):
    """
    Receives the prediction and the reference for a single dialogue, returns their ROUGE-L F-score.
    """
    score = rouge_scorer.get_scores(
        hyps=hypothesis,
        refs=reference,
    )
    return score[0]["rouge-l"]["f"]

def compute_bleu(hypothesis, reference):
    """
    Receives the prediction and the reference for a single dialogue, returns their BLEU score.
    """
    score = bleu_scorer.sentence_score(
        hypothesis=hypothesis,
        references=[reference]
    )
    # sacrebleu reports BLEU on a 0-100 scale; rescale to 0-1 to match the other metrics
    return score.score / 100

def compute_bertscore(predictions, references):
    """
    Receives two lists of strings with equal length. Returns their pairwise BERTScores (precision, recall, F1 score).
    """
    results = bertscore.compute(predictions=predictions, references=references, lang="en")
    return results
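
To make the surface-overlap idea behind ROUGE-L and BLEU more concrete, here is a minimal sketch under simplifying assumptions: whitespace tokenization, a single reference, no brevity penalty, and no smoothing. The helper names (`lcs_length`, `rouge_l_f1`, `clipped_ngram_precision`) are hypothetical and are not part of the `rouge` or `sacrebleu` packages, whose implementations differ in detail, so the numbers will not match the library outputs exactly.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L: F1 over the longest common subsequence."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def clipped_ngram_precision(hypothesis, reference, n=1):
    """Modified n-gram precision, the building block of BLEU.

    Each hypothesis n-gram counts at most as often as it occurs in the reference.
    """
    hyp, ref = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


hyp = "the hotel has free wifi"
ref = "the hotel offers free wifi"
print("ROUGE-L F1:", rouge_l_f1(hyp, ref))
print("1-gram precision:", clipped_ngram_precision(hyp, ref, n=1))
print("2-gram precision:", clipped_ngram_precision(hyp, ref, n=2))
```

Both quantities only reward exact token matches, so a paraphrase such as "has" versus "offers" is penalized even though the meaning is preserved. BERTScore, by contrast, compares contextual embeddings rather than surface tokens.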

The next function obtains the evaluation scores for all three metrics. For BERTScore we will use the F1 score.

In [18]:
def evaluate_answers(model_response, gt_response):
    rouge = compute_rouge(model_response, gt_response)
    bleu = compute_bleu(model_response, gt_response)
    bert = compute_bertscore([model_response], [gt_response])["f1"][0]

    print("MODEL:", model_response)
    print("GROUND TRUTH:", gt_response)
    print("ROUGE:", rouge, "BLEU:", bleu, "BERTSCORE:", bert)
    print()

    return rouge, bleu, bert

## LLM functions

* **get_qwen3_model and get_qwen2_model**: Load the Qwen models and assign them to global variables.
* **query_qwen**: Get the output of the models for a given prompt.

In [19]:
_qwen3_model = None
_qwen3_tokenizer = None

_qwen2_model = None
_qwen2_tokenizer = None


def get_qwen3_model():
    global _qwen3_model, _qwen3_tokenizer
    if _qwen3_model is None or _qwen3_tokenizer is None:
        model_name = "Qwen/Qwen3-1.7B"
        _qwen3_tokenizer = AutoTokenizer.from_pretrained(model_name)
        _qwen3_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="auto"
        )
    return _qwen3_model, _qwen3_tokenizer


def get_qwen2_model():
    global _qwen2_model, _qwen2_tokenizer
    if _qwen2_model is None or _qwen2_tokenizer is None:
        model_name = "Qwen/Qwen2.5-3B-Instruct"
        _qwen2_tokenizer = AutoTokenizer.from_pretrained(model_name)
        _qwen2_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype="auto",
            device_map="auto"
        )
    return _qwen2_model, _qwen2_tokenizer

In [20]:
def query_qwen(messages, model_name):
    # Select the model and build the prompt text with the model's chat template
    if model_name == "Qwen3-1.7B":
        model, tokenizer = get_qwen3_model()

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False  # disable Qwen3's thinking mode
        )

    elif model_name == "Qwen2.5-3B":
        model, tokenizer = get_qwen2_model()

        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

    else:
        # Fail early with a clear message instead of an undefined-variable error below
        raise ValueError(f"Unknown model_name: {model_name!r}")

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=2000
    )

    # Keep only the newly generated tokens (drop the prompt tokens)
    output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
    output = tokenizer.decode(output_ids, skip_special_tokens=True).strip()

    return output

* **prepare_messages:**  builds the input message sequence. It first defines the system instruction that sets the chatbot’s role as an expert in hotels and restaurants, then constructs the dialogue context by iterating through the conversation history. Each utterance is labeled according to the speaker (“user” or “assistant”) and added to the message list, and it returns a structured list of messages.


In [21]:
def prepare_messages(history):
    """
    Prepare the messages that the model will get as input.
    Combines the primary model instruction with the history of the current chat. Returns the list of messages.
    """

    # We write a suitable instruction for the model.
    instruction = 'You are a chatbot with expertise in hotels and restaurants.\n'

    # Construct dialogue history
    messages = []
    messages.append({"role": "system", "content": instruction})
    for ut in history:
        # "U" marks user turns; every other speaker is treated as the assistant
        role = "user" if ut['speaker'] == "U" else "assistant"
        messages.append({"role": role, "content": ut['text']})

    return messages

<span style="background:yellow">__Q2:__ Now call the functions *prepare_messages* and *query_qwen* with a sample input (prediction and reference) to understand how they work. Use the same (prediction, reference) pair for a fair comparison between the three metrics. For the prediction, use Qwen3-1.7B to generate the response.</span>

In [None]:
# Your solution here

Now that you can run the three metrics with a single model and example, we are ready to scale up. 

* **run_pipeline**: this function iterates over the dialogues until it has processed `n_dialogues` of them, skipping dialogues without a ground-truth response. For each dialogue, it builds the messages (`prepare_messages`) and generates a response with both the base model (`query_qwen`) and the fine-tuned model (`query_ft`). It then evaluates each response with ROUGE, BLEU, and BERTScore. Finally, it stores the predictions, references, inputs, and metric scores in lists and prints the progress and results for each dialogue.

In [None]:
def run_pipeline(n_dialogues, model_name):
    
    all_bleus = []
    all_rouges = []
    all_bertscores = []
    all_predictions = []

    all_bleus_ft = []
    all_rouges_ft = []
    all_bertscores_ft = []
    all_predictions_ft = []

    all_references = []
    all_inputs = []
    
    counter_dialogues = 0
    idx_dialogue = 0

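    # Stop early if we run out of dialogues before reaching n_dialogues
    while counter_dialogues < n_dialogues and idx_dialogue < len(contexts):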
        chat = contexts[idx_dialogue]
        if 'response' not in labels[idx_dialogue]:
            idx_dialogue += 1
            continue

        gt_response = labels[idx_dialogue]["response"]
        counter_dialogues += 1
        idx_dialogue += 1

        all_references.append(gt_response)

        print("Processing dialogue", counter_dialogues)

        messages = prepare_messages(chat)

        model_response = query_qwen(messages, model_name)
        print("BASE MODEL:")
        rouge, bleu, bert = evaluate_answers(model_response, gt_response)
        
        all_predictions.append(model_response)
        all_bleus.append(bleu)
        all_rouges.append(rouge)
        all_bertscores.append(bert)

        # TODO: query_ft needs to be implemented (you can reuse the code from assignment 2)
        # This function should be used to query the fine-tuned model
        model_response_ft = query_ft(messages) 
        print("FINE-TUNED MODEL:")
        rouge_ft, bleu_ft, bert_ft = evaluate_answers(model_response_ft, gt_response)

        all_predictions_ft.append(model_response_ft)
        all_bleus_ft.append(bleu_ft)
        all_rouges_ft.append(rouge_ft)
        all_bertscores_ft.append(bert_ft)

        if isinstance(chat, list):
            input_text = " ".join([turn["text"] for turn in chat if turn["speaker"] == "U"])
        else:
            input_text = chat.get("text", str(chat))
        all_inputs.append(input_text)

    return (
        all_bleus, all_rouges, all_bertscores, all_predictions,
        all_bleus_ft, all_rouges_ft, all_bertscores_ft, all_predictions_ft,
        all_references, all_inputs
    )

In [None]:
n_dialogues = 10
model_name = ""

# TODO: Call the `run_pipeline` function to obtain the evaluation scores

<span style="background:yellow">__Q3:__ Using the `run_pipeline` function, generate responses for the first 100 dialogues with both the vanilla and the fine-tuned model, and evaluate them using BLEU, ROUGE, and BERTScore metrics. This allows us to say which version of the model is better according to these metrics. What do you observe? Which model scores better in general? Do the metrics agree?
Note: as part of this question, you'd need to define the function `query_ft`.
</span>

Let's print all our metrics for this configuration:

In [None]:
print("QWEN BASE")
print("Mean BLEU:", np.mean(all_bleus))
print("Mean ROUGE:", np.mean(all_rouges))
print("Mean BERTScore:", np.mean(all_bertscores))

print("\nFINE-TUNED")
print("Mean BLEU:", np.mean(all_bleus_ft))
print("Mean ROUGE:", np.mean(all_rouges_ft))
print("Mean BERTScore:", np.mean(all_bertscores_ft))

```
# Your answer here
```

## Manual evaluation

In this block, we will simulate a human evaluation. We will do so on a smaller set of 10 dialogues.


<span style="background:yellow">__Q4:__ Now define three quality criteria that you will apply to evaluate manually, by consulting papers by [Deriu et al.](https://link.springer.com/article/10.1007/s10462-020-09866-x) and by [Howcroft et al.](https://aclanthology.org/2020.inlg-1.23/). Describe each of the metrics in turn, both in terms of what it captures and how it works.</span>

```
# Your answer here
```

<span style="background:yellow">__Q5:__ Analyze the generation with the vanilla and fine-tuned model using these three metrics. Which of the model variants seems to be of the highest quality according to your human assessment? Note that the automatic metrics are only a proxy, and your impression may or may not agree with them. Remember to give examples and include results when describing your findings.</span>


```
# Your answer here
```

## LLM-as-a-judge

LLM-as-a-judge has emerged as a popular and controversial evaluation technique. Given how successful LLMs have become, do we even need metrics based on human heuristics or human judgment? Can't we just judge a response using an LLM?
We will next investigate whether LLM-as-a-judge can reliably judge the generations of our model in a way that resembles manual and other automated metrics.

The code below provides a function that evaluates an answer with LLM-as-a-judge. Read through the code to understand what it does and how it works.

* **query_qwen_as_a_judge:** this function evaluates how coherent an assistant’s response is relative to the user’s original input. It first sets a strict system instruction requiring the model to act as a coherence judge and return a JSON object containing a score between 0.0 and 1.0 plus a brief explanation. The function prepends this instruction to the provided message(s), loads the requested judge model (Qwen3-1.7B, or Qwen2.5-3B for any other `model_name`), and applies the model's chat template. It then tokenizes the input, generates up to 512 new tokens, and decodes the output. A regular expression is used to extract the JSON block between <JSON> and </JSON> tags. If successful, the JSON is parsed and returned; otherwise, the function returns a dictionary with a None score and an error description.

In [None]:
import json
import re

def query_qwen_as_a_judge(messages, model_name):
    system_prompt = {
        "role": "system",
        "content": """
### Role Assignment
You are a Coherence Evaluation Judge.
Your job is to evaluate how coherent the **assistant’s response** is with respect to the **user’s request**, in the context of conversations about hotels.

### Task Definition
You must:
1. Assign a **coherence score** from **0.0 to 1.0**
2. Provide a **short explanation** (maximum 2 sentences)

### Output Format (STRICT)
Return ONLY:

<JSON>
{
  "coherence_score": float between 0.0 and 1.0,
  "explanation": "brief rationale"
}
</JSON>
"""
    }

    messages_judge = [system_prompt] + messages

    if model_name == "Qwen3-1.7B":
        model, tokenizer = get_qwen3_model()
    else:
        model, tokenizer = get_qwen2_model()

    text = tokenizer.apply_chat_template(
        messages_judge,
        tokenize=False,
        add_generation_prompt=True
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs, max_new_tokens=512
    )
    output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
    raw_output = tokenizer.decode(output_ids, skip_special_tokens=True).strip()

    # Extract JSON using regex
    match = re.search(r"<JSON>(.*?)</JSON>", raw_output, re.DOTALL)
    if match:
        json_str = match.group(1).strip()
        try:
            data = json.loads(json_str)
            return data
        except json.JSONDecodeError:
            return {"coherence_score": None, "explanation": "JSON parse error"}
    
    return {"coherence_score": None, "explanation": "No JSON found"}
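
The extraction step at the end of `query_qwen_as_a_judge` can be tried in isolation, without loading a model. The sketch below repeats the same regular expression and fallback logic in a hypothetical helper, `extract_judge_json`, and runs it on three made-up judge outputs: a well-formed one, one with invalid JSON, and one without tags.

```python
import json
import re


def extract_judge_json(raw_output):
    """Parse the JSON object between <JSON> and </JSON> tags, if present."""
    match = re.search(r"<JSON>(.*?)</JSON>", raw_output, re.DOTALL)
    if not match:
        return {"coherence_score": None, "explanation": "No JSON found"}
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return {"coherence_score": None, "explanation": "JSON parse error"}


# Made-up judge outputs for illustration only
well_formed = 'Sure.\n<JSON>\n{"coherence_score": 0.8, "explanation": "On topic."}\n</JSON>'
invalid_json = "<JSON>{coherence_score: high}</JSON>"
no_tags = "The response is coherent."

for raw in (well_formed, invalid_json, no_tags):
    print(extract_judge_json(raw))
```

Because a failed extraction yields a `None` score, any later aggregation (means, correlations) needs to skip or otherwise handle those entries.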

This code block computes coherence scores for all model predictions using Qwen as a judge. It loops through each pair of user inputs and generated responses, formats them into a single evaluation prompt, and sends the prompt to query_qwen_as_a_judge. The function returns a JSON object containing a coherence score, which is extracted and stored in the list scores_base.

In [None]:
scores_base = []
model_name = "Qwen3-1.7B"  

for the_input, prediction in zip(all_inputs, all_predictions):
    messages = [{"role": "user",
        "content": f"User Input:\n{the_input}\n\nAssistant Output:\n{prediction}"}]

    result = query_qwen_as_a_judge(messages, model_name)
    # Use .get so a judge reply without the expected key yields None instead of a KeyError
    scores_base.append(result.get("coherence_score"))

<span style="background:yellow">__Q6:__ Now write a brief code snippet to run the LLM-as-a-judge evaluation for two samples (from the dataset): one where you expect a higher and one where you expect a lower score. Analyze whether you agree with the judge scores.</span>

In [None]:
# Your solution here

<span style="background:yellow">__Q7:__ Now evaluate the same 10 responses from Q4+Q5 for the vanilla and the fine-tuned model using LLM-as-a-judge. Compare the judgments of LLM-as-a-judge to your own judgments for these dialogues. What do you conclude? Would LLM-as-a-judge be a good proxy for your human judgments on this task? Why/why not?</span>


```
# Your answer here
```

In [None]:
# Your solution here

<span style="background:yellow">__Q8:__ Compare the judgments of all four automatic metrics for the 100 dialogues, and compute the Spearman correlation for each metric pair: (llm-judge, bertscore), (llm-judge, bleu), (bleu, rouge). </span>

```
# Your answer here
```
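
As a reminder of what a Spearman correlation computes, here is a minimal pure-Python sketch on made-up scores: it ranks both score lists (ties get their average rank) and takes the Pearson correlation of the ranks. The helper names are hypothetical, and the choice to drop pairs where either score is `None` (for example, a failed judge parse) is an assumption of this sketch, not a requirement of the assignment. In practice, `scipy.stats.spearmanr` is the usual choice.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if std_x == 0 or std_y == 0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation over the pairs where both scores are available."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return float("nan")
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores for illustration only
judge_scores = [0.9, None, 0.2, 0.5]
bleu_scores = [0.30, 0.10, 0.05, 0.20]
print("Spearman (judge, BLEU):", spearman(judge_scores, bleu_scores))
```

Because only ranks matter, two metrics on very different scales can still correlate perfectly as long as they order the responses the same way.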

<span style="background:yellow">__Q9:__ Visualize the scores between each metric pair as a matrix. What do you observe? Which of the metrics agree and disagree the most? Connect your observations to your knowledge about what these metrics capture.</span>

```
# Your answer here
```

In [None]:
# Your solution and visualization here