## Conversational AI (25/26) Assignment 2: Fine-tuning for Task-aligned Response Generation

When asking task-specific questions, an off-the-shelf language model may not have the necessary domain knowledge to provide adequate answers. Fine-tuning the model on question–answer pairs from the relevant domain can help address this gap.

In this assignment, our goal is to fine-tune a large language model (Qwen3), aiming to handle task-oriented conversational modeling more effectively.

Reminder: we focus on DSTC11 Task5, subtask 3 **Knowledge-grounded Response Generation**. The task and dataset are explained in detail [here](https://github.com/alexa/dstc11-track5).
In essence, the task is to respond to subjective user requests regarding hotels and restaurants, e.g. "does the restaurant have a nice vibe?". Our conversational model has to generate a response to this question based on conversation history.


To fine-tune our model, we will use LoRA fine-tuning.
LoRA (Low-Rank Adaptation of Large Language Models) allows us to fine-tune models more efficiently by training only a small number of added low-rank parameters instead of modifying all of the model's weights. You can read more [here](https://arxiv.org/abs/2106.09685) and [here](https://huggingface.co/learn/llm-course/en/chapter11/4).
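To make the saving concrete, the sketch below counts the trainable parameters for a single weight matrix under full fine-tuning and under a rank-`r` LoRA update. The 2048 x 2048 shape is an illustrative assumption for this example, not a value read from the Qwen3 configuration, and the helper names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix W (d_out x d_in) is trained."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Illustrative shape (an assumption for this example only).
d_out, d_in, r = 2048, 2048, 16
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4f}")
```

For this shape, the low-rank factors hold under 2% of the parameters of the full matrix, which is where the memory and time savings come from.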

**This notebook contains the following parts:**
1. Formatting the DSTC11 Task5 data in the proper structure
2. Fine-tuning Qwen3 on the DSTC11 Task5 data
3. Analyzing the model behavior in comparison to its off-the-shelf baseline

## Assignment

The goal is to compare the performance between off-the-shelf usage and a domain-specific fine-tuned model. For this assignment, we will focus on manual human evaluation. 

Throughout the notebook, you will find questions you need to answer to complete the assignment. These are both coding questions and questions that evaluate your understanding of the data, the process, and the model behavior. These questions are indicated as  **<span style="background:yellow">Q#</span>**.

**<span style="color:red">Important Note:</span>** Fine-tuning a model is time-consuming. Since this assignment requires you to experiment with different parameters, completing it may take longer than expected. Therefore, please start the assignment as early as possible and avoid leaving it until the last day.


**Assignment steps:**
1. Load the dataset and convert it to the appropriate format. **(Q1)**
2. Prepare your data and fine-tune your model on our dataset. **(Q2)**
3. Manually compare the off-the-shelf vs finetuned Qwen3 performance on 10 dataset samples. Answer questions about the assignment, based on your coding steps and manual analysis. **(Q3 - Q9)**

**Submission:**
Please submit your code (as a Kaggle notebook) on Canvas by **10th November 23:59**.

**Grading:** The assignment is graded with a pass/fail grade.

## Reminders about Kaggle notebooks
1. Pay attention to usage statistics, especially memory, CPU, and GPU
2. Pay attention to the quota of GPU (measured in hours)
3. "Turn OFF" the internet after each session. Turn on the internet when starting a session.
4. "Turn off" the accelerator after each use. __Turn the accelerator on when starting a session__.
5. Save a version after you make changes. This ensures that your teammate can see the latest changes. If you get a question from Kaggle about versions, you can revert to the latest version. With "quick save" you can save a version without running everything. However, when submitting the assignment, the outputs must be visible.
6. Some blocks of codes can take longer to run. Read the instructions and be patient! 

## Installation

The Kaggle notebooks use a Python 3 environment, and they are already "pre-loaded" with various analytic Python packages, like Json, Pandas, and Numpy. If you are curious, you can see the package definition in [this repository](https://github.com/kaggle/docker-python).

We will install and load relevant packages for working with language models: Transformers, TRL (short for Transformer Reinforcement Learning), and PEFT (short for Parameter-Efficient Fine-Tuning).

This is a short explanation regarding each library:
- **Transformers**: We use this library to load and work with our base model.
- **TRL**: We use this library to fine-tune our base model.
- **PEFT**: We use this library to specify which parameters to update instead of updating all of them, resulting in more lightweight fine-tuning.

In [1]:
# Check versions
import transformers, trl, peft
print("transformers:", transformers.__version__)
print("trl:", trl.__version__)
print("peft:", peft.__version__)



transformers: 4.57.1
trl: 0.24.0
peft: 0.16.0


## Imports

The following code loads several standard packages, packages for working with LLMs, packages for defining our parameters, and also packages for training the model.

A brief explanation of the functions that we loaded:

- **AutoModelForCausalLM, AutoTokenizer**: To load our base language model and the tokenizer.
- **SFTTrainer, SFTConfig**: To define training configurations and create a trainer object for supervised fine-tuning.
- **LoraConfig**: Specifies the LoRA parameters we want to use for parameter-efficient fine-tuning.
- **get_peft_model**: Creates the adapter model by combining our base model with the LoRA configuration, which is then passed to the trainer.
- **PeftModel**: Used to load or save the fine-tuned model after training.

Also, as you can see, we set our seed to a specific number. The reason for this is to make our fine-tuning process reproducible. This ensures that aspects like data shuffling, weight initialization, and other random operations remain consistent across different runs. You can experiment with different seed values if you are interested in seeing how they affect the results.
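As a small illustration of why a fixed seed makes runs repeatable, the sketch below seeds Python's built-in generator twice with the same value and compares the draws. The helper name `sample_after_seed` is hypothetical, and the example covers only the standard-library generator, not PyTorch or NumPy.

```python
import random


def sample_after_seed(seed: int, n: int = 5) -> list:
    """Seed Python's RNG and draw n integers in [0, 99]."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first_run = sample_after_seed(3407)
second_run = sample_after_seed(3407)
other_seed = sample_after_seed(1234)

print(first_run == second_run)  # same seed, same sequence
print(first_run, other_seed)
```

The same idea applies to each library seeded in the next cell: every generator has its own state, which is why `torch`, `random`, and `numpy` are all seeded separately.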


In [2]:
import numpy as np 
import json
import os
import shutil
import subprocess
import sys
from typing import List
from datasets import Dataset
import torch
import random
import requests
import gc
torch.manual_seed(3407); random.seed(3407); np.random.seed(3407)


from transformers import AutoModelForCausalLM, AutoTokenizer, TrainerCallback, TrainingArguments, Trainer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, get_peft_model, PeftModel

## Clone the data/task repository

We start by cloning the data and task repository and changing the working directory to that directory.

In [3]:
def setup_repo(repo_url: str, repo_name: str, work_dir: str = "/kaggle/working"):
    os.chdir(work_dir)
    
    # Remove repo if it exists
    if os.path.exists(os.path.join(work_dir, repo_name)):
        shutil.rmtree(os.path.join(work_dir, repo_name))
    
    # Clone repo
    subprocess.run(["git", "clone", repo_url], check=True)
    
    # Move into repo/data
    os.chdir(os.path.join(repo_name, "data"))

work_dir = '/home/song0409/Desktop/CAI/'
setup_repo("https://github.com/lkra/dstc11-track5.git", "dstc11-track5",work_dir)

Cloning into 'dstc11-track5'...


## Exploring the data
The data we will use contains these essential components: 
- __knowledge__: The reviews for the corresponding restaurant or hotel (the data also contains FAQs - we won't be using those!)
- __dialogue__: The dialogue history between the user and the system
- __response__: The ground-truth response we expect from the system

Let's list all files in the current directory iteratively and then load the data first to understand it better!

In [4]:
## List all files in the current directory iteratively:
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./output_schema.json
./README.md
./knowledge_aug_domain_reviews.json
./knowledge_aug_reviews.json
./knowledge.json
./val/labels.json
./val/logs.json
./test/labels.json
./test/logs.json
./train/logs_bkp.json
./train/labels.json
./train/logs.json
./train/bkp/labels.json
./train/bkp/logs.json


In [5]:
with open('train/logs.json', 'r') as f:
    train_ds=json.load(f)
    print(len(train_ds))

with open('train/labels.json', 'r') as f:
    labels=json.load(f)

with open('knowledge.json', 'r') as f:
    knowledge_base=json.load(f)

32604


## Create Dataset
Until now, we have looked at and applied our functions to only one sample. Later, when we want to fine-tune our model, it is thus useful to convert our dataset into a HuggingFace dataset. Converting our dataset into a HuggingFace Dataset provides several advantages, such as faster data loading and processing. You can read more about this [here](https://huggingface.co/docs/datasets/en/create_dataset).

<span style="background:yellow">__Q1:__ We would like to load the data and create a validation set and a test set. For this, we need code similar to **assignment 1**, ending with a function that takes as input the split of the DSTC11 dataset and returns the dataset processed by all the steps we applied to the training dataset.</span>

* **format_dialogue:** Formats a list of dialogue by turning it into a readable list. Given a list of dictionaries, each representing a dialogue turn, this function formats the dialogue by prefixing each turn with either "user" or "system" as the role based on the speaker. The speaker is identified by the 'speaker' key in the dictionary: 'U' for user, else it is the system.

In [6]:
def format_dialogue(dialogue: List[dict]) -> List[dict]: 
    """
    Args:
    dialogue (List[dict]): A list of dictionaries where each dictionary contains two keys:
        - 'speaker' (str): A string indicating the speaker of the turn ('U' for user, 'S' for system).
        - 'text' (str): The text spoken by the respective speaker.

    Returns:
        List[dict]: A new array with a specific role and content
    """
    # Your solution here
    messages=[]
    messages.append({"role": "system", "content": "You are an assistant."})
    for dialogue_element in dialogue:
        role = dialogue_element['speaker']
        role = 'user' if role == 'U' else 'system'
        messages.append({"role": role, "content": dialogue_element['text']})

    return messages

* **reformat_dataset:** this function formats our dataset into a structure suitable for conversational language-model training. For each sample, it extracts the dialogue and its associated response, appends the response as a message with the "system" role, and adds the resulting message list to a new dataset. Samples that cause errors are skipped. The final output is the Hugging Face Dataset, containing conversations organized as lists of role-content message objects.

NOTE: This function is slightly different from the one in Assignment 1. Here, you only need to access the dialogue and the response.


In [7]:
def reformat_dataset(dataset, labels_dataset): 
    reformatted_dataset = {
        "messages": []
    }
    for sample_index in range(len(dataset)): 
        # Your solution here
        try:
            sample_dialogue = format_dialogue(dataset[sample_index])
            sample_response = labels_dataset[sample_index].get("response", "")
            sample_dialogue.append({"role": "system", "content": sample_response})
            
            reformatted_dataset["messages"].append(sample_dialogue)
        except Exception:
            # Skip malformed samples (e.g. a turn missing 'speaker' or 'text').
            continue

        
    return reformatted_dataset

reformatted_dataset = reformat_dataset(train_ds, labels)
dataset = Dataset.from_dict(reformatted_dataset)
dataset[1]

{'messages': [{'content': 'You are an assistant.', 'role': 'system'},
  {'content': 'Do you have information about the Warkworth House?',
   'role': 'user'},
  {'content': 'Yes I do! The Warkworth House is a 4 star guesthouse that is located in the east section of town. Would you like for me to book you a room?',
   'role': 'system'},
  {'content': 'No, but can you give me that phone number please?',
   'role': 'user'},
  {'content': "Most definitely. The Warkworth House's phone number is 01223363682. Can I help you with anything else?",
   'role': 'system'},
  {'content': 'Yes I need to find an expensive place to eat serving Indian food.',
   'role': 'user'},
  {'content': 'There are over a dozen expensive Indian restaurants in the city. Do you have an area of town in mind?',
   'role': 'system'},
  {'content': "Actually, can you suggest one of them. I'm willing to try something new. I want to reserve a table at the one you recommend.",
   'role': 'user'},
  {'content': 'How about the

From this point on, we have already created the dataset for our training dataset. In the following function, you need to do the same for the validation and test datasets.

* **process_dataset_split:** based on the input string of the function, load the corresponding split and create the dataset using the same process as for the training data.

In [8]:
def process_dataset_split(split: str) -> Dataset: 
    """Loads, reformats, and processes a dataset split for model training or evaluation.

    This function loads a dataset split (e.g., 'val', 'test') and generates a dataset for it, similar to what we had for the train split.

    Args:
        split (str): The name of the dataset split to process

    Returns:
        dataset: A HuggingFace `Dataset` object that contains the preprocessed and reformatted data for the specified split.
    """
    log_dir = split+'/logs.json'
    with open(log_dir, 'r') as f:
        ds=json.load(f)
    labels_dir = split+'/labels.json'
    with open(labels_dir, 'r') as f:
        labels=json.load(f)
    
    reformatted_dataset = reformat_dataset(ds, labels)
    dataset = Dataset.from_dict(reformatted_dataset)
    return dataset

validation_ds = process_dataset_split("val")
test_ds = process_dataset_split("test")

validation_ds, test_ds

(Dataset({
     features: ['messages'],
     num_rows: 4173
 }),
 Dataset({
     features: ['messages'],
     num_rows: 5475
 }))

## Fine-Tuning

### Setting up

Now that we have our data ready, we can look at fine-tuning our model. We will be using HuggingFace Transformers. 

As a language model, we will be using __Qwen3-1.7B__. Let's load it!

In [9]:
model_id = "Qwen/Qwen3-1.7B"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.72it/s]


### Let's finetune!
For fine-tuning, we use the supervised fine-tuning Trainer from HuggingFace. To understand the function better, you can read the documentation [here](https://huggingface.co/docs/trl/main/en/sft_trainer#trl.SFTTrainer).

The outputs (model weights etc.) will be stored in the _"outputs"_ folder. 

A large part of fine-tuning is determining the values for the **hyperparameters**. Particularly: the batch size, number of epochs, learning rate, and warmup steps are important parameters to get right. 

To prevent our GPU from running out of memory, we will use **gradient accumulation**. This technique allows training neural networks with a larger effective batch size than fits into GPU memory by splitting the batch into smaller mini-batches. With a training batch size of $2$ and gradient accumulation step of $4$ (how many forward and backward passes before updating the model weights), this essentially compares to a batch size of $8$. 
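The arithmetic can be checked with a short sketch. It assumes a single device and that a partial final batch or accumulation group still counts as one step; the helper name `optimizer_steps` is hypothetical, and 32604 is the number of training samples printed earlier in this notebook.

```python
import math


def optimizer_steps(num_samples: int, per_device_batch: int, grad_accum: int, epochs: int) -> int:
    """Number of weight updates, assuming one device and rounding partial groups up."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


effective_batch = 2 * 4  # per-device batch size x gradient accumulation steps
total_steps = optimizer_steps(num_samples=32604, per_device_batch=2, grad_accum=4, epochs=2)
print(effective_batch, total_steps)
```

Under these assumptions, two epochs correspond to 8152 weight updates, each computed from 8 samples.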

It is up to you to determine the values of the number of training epochs, learning rate, and warmup steps. 
As a rule of thumb, train for $1$ to $3$ epochs to prevent overfitting.

The number of warmup steps can range between $6\%$ and $10\%$ of the total number of training steps, so it is tied to the number of epochs you train for.
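A minimal sketch of that rule of thumb: given a total step count, it returns the smallest and largest whole numbers of warmup steps inside the 6% to 10% band. The helper name `warmup_range` is hypothetical, and 8152 is the step count of a two-epoch run in this notebook.

```python
import math


def warmup_range(total_steps: int, low: float = 0.06, high: float = 0.10) -> tuple:
    """Smallest and largest whole step counts inside [low, high] of total_steps."""
    return math.ceil(low * total_steps), math.floor(high * total_steps)


print(warmup_range(8152))
```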

For the learning rate, the usual values range between $2e-4$ and $5e-5$. 

Play around with these values! The aim is to have a training loss below $1$. A training loss close to $0$ indicates overfitting. 

First, we define the configuration specifying which parts of the model will receive trainable adapter layers. It then wraps the base model with these adapters so that only the small LoRA parameters are fine-tuned while the rest of the model remains frozen.

In [10]:
peft_cfg = LoraConfig(r=16, 
           lora_alpha=32, 
           lora_dropout = 0.05, 
           bias = "none", 
           use_rslora = False, 
           target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"])

model = get_peft_model(base, peft_cfg)
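The `r` and `lora_alpha` values interact: the low-rank update is multiplied by a scaling factor derived from both. The sketch below reproduces that factor for the standard setting and for the rank-stabilized variant that `use_rslora=True` would select. The helper name `lora_scaling` is hypothetical and is not part of PEFT.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Scale factor multiplied onto the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


print(lora_scaling(32, 16))                   # standard LoRA
print(lora_scaling(32, 16, use_rslora=True))  # rank-stabilized variant
```

With the configuration above (`r=16`, `lora_alpha=32`, `use_rslora=False`), the update is scaled by 2. Calling `model.print_trainable_parameters()` on the wrapped model is one way to see how few parameters are actually trained.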

Checks if the GPU supports bfloat16 (bf16).

In [11]:
def pick_bf16():
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        return major >= 8
    return False

Let's define our training parameters.

Here, you need to provide values for three of them: NUM_TRAIN_EPOCHS, LEARNING_RATE, and WARMUP_STEPS.

In [12]:
NUM_TRAIN_EPOCHS = 2
#between 1 - 3
LEARNING_RATE    = 5e-5
#between 2e-4 and 5e-5
WARMUP_STEPS     = 650
#Total training steps = 8152, warmup steps between 490(6%) and 815(10%).

"""
Test log
    1. 2, 5e-5, 490
    2. 2, 5e-5, 600
    3. 2, 5e-5, 815
    4. 2, 1e-4, 490
    5. 2, 1e-4, 650
    6. 2, 1e-4, 815
    7. 2, 7e-5, 600
"""

os.environ["WANDB_DISABLED"] = "true"

sft_args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    logging_steps=10,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    max_length=1024,
    optim="adamw_torch_fused",
    fp16=not pick_bf16(),
    bf16=pick_bf16(),
    packing=False,
    dataset_num_proc=2,
    report_to="none",
    seed=3407
)


Let's define our trainer object using the parameters we set earlier.

In [None]:
# Auxiliary callback that notifies the user when the time-consuming training run completes.

class NotifyOnFinishCallback(TrainerCallback):
    def on_train_end(self, args, state, control, **kwargs):
        """Called at the end of training."""
        print("Training finished. Sending notification...")
        try:
            requests.post(
                "https://ntfy.sh/my-llm-job-is-done-cai2",
                data=f"Training complete! Total steps: {state.global_step}"
            )
        except Exception as e:
            print(f"Failed to send notification: {e}")

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,      
    eval_dataset=validation_ds,
    args=sft_args,
    processing_class=tok,

    #call back when the training is completed
    callbacks=[NotifyOnFinishCallback]
)

Now, let's train the model. Depending on the number of epochs, the process may take between 2 and 6 hours.

In [None]:
trainer.train()

Finally, save your trained model in the output/adapter directory.

You can then download it for future use.

In [None]:
trainer.save_model("/home/song0409/Desktop/CAI/outputs_(2,5e-5,650)/adapter")  
tok.save_pretrained("/home/song0409/Desktop/CAI/outputs_(2,5e-5,650)/adapter")

<span style="background:yellow">__Q2:__ Print the values for the three parameters you defined in your tuning procedure. Use the sft_args object to access the values.</span> 

In [13]:
# Your coding solution here.
print("NUM_TRAIN_EPOCHS: ", sft_args.num_train_epochs)
print("LEARNING_RATE: ", sft_args.learning_rate)
print("WARMUP_STEPS: ", sft_args.warmup_steps)


NUM_TRAIN_EPOCHS:  2
LEARNING_RATE:  5e-05
WARMUP_STEPS:  650


#### Now let's take a look at what our fine-tuned model does! 

In [14]:
base_model_id = "Qwen/Qwen3-1.7B"
adapter_path  = "/home/song0409/Desktop/CAI/outputs_(2,5e-5,650)/adapter" 
# adapter_path  = "/home/song0409/Desktop/CAI/outputs_(2,7e-4,600)/adapter"


Load the base model.

In [15]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
model_base = AutoModelForCausalLM.from_pretrained(base_model_id, dtype="auto", device_map="auto")
model_base_for_adapter = AutoModelForCausalLM.from_pretrained(base_model_id, dtype="auto", device_map="auto")

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.90it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  5.83it/s]


Now, load the finetuned model from outputs/adapter.

In [16]:
model = PeftModel.from_pretrained(model_base_for_adapter, adapter_path)

Now, let's generate outputs for a dialog using the base model and the finetuned model.

In [28]:
NUM_EXAMPLES = 10

def test_models(test_ds, NUM_EXAMPLES):

    for i in range(NUM_EXAMPLES):

        dialogue = test_ds[i]['messages'][:-1]
        response = test_ds[i]['messages'][-1]

        text = tokenizer.apply_chat_template(dialogue, tokenize=False, add_generation_prompt=True, enable_thinking=False)
        model_inputs = tokenizer([text], return_tensors="pt").to(model_base.device)

        print(f"\nDialogue {i+1}:")
        for turn in dialogue:
            print(f"    {turn['role']}: {turn['content']}")
        
        print("\nModel Response:")

        #Base Model output
        generated_ids = model_base.generate(**model_inputs, max_new_tokens=500)
        output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
        print("\nBase Model: \n    ", tokenizer.decode(output_ids, skip_special_tokens=True).strip())

        #Finetuned Model output
        generated_ids = model.generate(**model_inputs, max_new_tokens=500)
        output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
        print("\nFinetuned Model: \n    ", tokenizer.decode(output_ids, skip_special_tokens=True).strip())

        #Ground-truth
        print("\nGround Truth")
        if response['content'] != '':
            print(f"     {response['content']}")
        else:
            print("      No ground truth")
        
        print("-" * 120)
        print("\n")

    return 

test_models(test_ds, NUM_EXAMPLES)


Dialogue 1:
    system: You are an assistant.
    user: I need to find out information about a particular hotel called warkworth house

Model Response:

Base Model: 
     Warkworth House is a historic hotel located in Warkworth, North Yorkshire, England. It is a Grade II listed building, recognized for its architectural significance and historical importance. Here's a brief overview of the hotel:

### **Location**
- **Address:** Warkworth House, Warkworth, North Yorkshire, DN22 3EP
- **Coordinates:** Approximately 53.8887° N, 1.7165° W

### **History**
- **Built in:** 1850s (exact date is uncertain, but it is believed to be constructed in the early 1850s)
- **Architectural Style:** The building is a fine example of the **Victorian** style, with a mix of **Bye and Dymond** architectural elements.
- **Purpose:** Originally a **hotel**, it was used as a **boarding house** for travelers and visitors to the area. It has been a **tourist attraction** for many years.

### **Features**
- **

## Analyze Model Responses
Now that we have our model up and running, let's look at what types of generations the fine-tuned model produces and compare them with the generations of the vanilla model. For this, we will use samples from the test set. Pick the first 10 samples from the __test dataset__ for the analysis and use those to answer the questions below. 

Note: Please print the outputs of these 10 examples.

#### Questions

Please answer each question with a single paragraph, consisting of 2-5 sentences. Include examples to support your answers.


<span>__Q3:__ What values for the following hyperparameters did you set: number of epochs, learning rate, and warmup steps? Explain briefly why you chose these parameters. </span>


1. Learning rate = 5e-5
2. Num_train_epochs = 2
3. warmup_steps = 600

The learning rate is the most influential hyperparameter in fine-tuning. Because the model has already been pre-trained on a vast dataset, it does not need a high learning rate; it only needs to be nudged gently toward the target task. A value of 5e-5 is a commonly used setting for fine-tuning that helps avoid catastrophic forgetting. Keeping the number of epochs low prevents overfitting by limiting the risk of memorizing noise and specific examples, which would reduce the model's ability to generalize to unseen data. The number of warmup steps, which affects training stability, was set to 600. The learning rate therefore increases gradually from 0 to 5e-5 over the first 600 steps, which is approximately 7% of the 8152 total steps. This lets the weights transition smoothly toward the specific task the model needs to perform.
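The warmup behavior described here can be sketched as a plain function. This is a minimal illustration of a linear schedule with warmup (linear rise to the peak, then linear decay to zero), matching the `lr_scheduler_type="linear"` setting only in spirit; the function name `linear_schedule_lr` is hypothetical and is not the trainer's own implementation.

```python
def linear_schedule_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, mid-warmup, the peak, mid-decay, and the final step.
for step in (0, 300, 600, 4376, 8152):
    print(step, linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=600, total_steps=8152))
```

A longer warmup delays the point at which the peak rate is reached, which is one reason runs that differ only in warmup steps can follow different loss curves early on.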

<span>__Q4:__ What is the training loss you achieved, what is its trend over the iterations, and what does this mean for your model? </span>

training_loss  : 	0.906593855773853

In the initial stages, the loss rose to roughly 3.4 and then decreased over the following steps. In the middle of training (steps 4000 to 6000), the loss stabilized below 1 and oscillated for about 1000 steps between a minimum of 0.8 and a maximum of 1.1, eventually converging to around 0.9. Overall, this suggests the model was fine-tuned in a balanced way, without clear overfitting or underfitting.
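One way to inspect such a trend is to smooth the logged losses. The sketch below assumes the logs are a list of dictionaries with `"step"` and `"loss"` keys, as found in `trainer.state.log_history`; the toy entries and the helper names `moving_average` and `extract_losses` are hypothetical and only illustrate the mechanics.

```python
def moving_average(values: list, window: int) -> list:
    """Trailing moving average; early entries average over the values seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def extract_losses(log_history: list) -> list:
    """Keep (step, loss) pairs from entries that carry a training loss."""
    return [(e["step"], e["loss"]) for e in log_history if "loss" in e]


# Hypothetical log entries for illustration; real ones would come from trainer.state.log_history.
toy_log = [
    {"step": 10, "loss": 3.0},
    {"step": 20, "loss": 2.0},
    {"step": 30, "loss": 1.0},
    {"step": 30, "train_runtime": 12.3},  # summary entry without a "loss" key
]
steps, losses = zip(*extract_losses(toy_log))
print(list(steps), moving_average(list(losses), window=2))
```

With `logging_steps=10`, a window of around 50 entries would smooth over roughly 500 training steps, which makes plateaus and oscillations easier to see than in the raw values.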




<span>__Q5:__ Compare the off-the-shelf to the fine-tuned model: What do you notice in terms of quality of the generation? Are the generations responding to the query?</span> 


Comparing the outputs generated for the dialogues, the quality of the generations varies with the length of the dialogue and its context. However, some general tendencies emerge, which we summarize with the following three metrics.

**metric1: Task completion**
The base model often fails to book a hotel, or loses the context and asks redundant questions, while the fine-tuned model succeeds in booking and provides the reference number (Dialogue 2).

**metric2: Helpfulness**
The base model often states "I don't have that information" (Dialogues 5 and 9), while the fine-tuned model answers confidently by providing domain-specific questions.

**metric3: Dialogue state**
The base model gets confused by long contexts (Dialogue 6) or topic changes (Dialogue 7), while the fine-tuned model tracks the flow of the conversation smoothly and answers in a suitable manner.


<span>__Q6:__ Repeat the experiment with a different hyperparameter value for number of epochs, learning rate, or warmup steps. Fine-tune the model using these new settings, and describe your observations regarding the training loss, as well as the quality of the model's answers to the questions from the test dataset.</span> 


| Metric | 1st attempt | **2nd attempt** | 3rd attempt | 4th attempt | 5th attempt | 6th attempt | 7th attempt |
|---|---|---|---|---|---|---|---|
| NUM_TRAIN_EPOCHS | 2 | **2** | 2 | 2 | 2 | 2 | 2 |
| LEARNING_RATE | 5e-5 | **5e-5** | 5e-5 | 1e-4 | 1e-4 | 1e-4 | 7e-5 |
| WARMUP_STEPS | 490 | **600** | 815 | 490 | 650 | 815 | 650 |
| global_step | 8152 | **8152** | 8152 | 8152 | 8152 | 8152 | 8152 |
| training_loss | 1.16488689970923 | **0.906593855773853** | 1.18683568222249 | 1.06328832440744 | 1.07322859430687 | 1.08287470592542 | 1.12875533206306 |
| train_runtime | 6883.323 | **6779.2459** | 6828.8728 | 6831.3005 | 6874.0591 | 6867.4809 | 6837.0424 |
| train_samples_per_second | 9.473 | **9.619** | 9.549 | 9.545 | 9.486 | 9.495 | 9.537 |
| train_steps_per_second | 1.184 | **1.202** | 1.194 | 1.193 | 1.186 | 1.187 | 1.192 |
| total_flos | 1.4783868001286554e+17 | **1.4783868001286554e+17** | 1.4783868001286554e+17 | 1.4783868001286554e+17 | 1.4783868001286554e+17 | 1.4783868001286554e+17 | 1.4783868001286554e+17 |
| train_loss | 1.16488689970923 | **0.906593855773853** | 1.18683568222249 | 1.06328832440744 | 1.07322859430687 | 1.08287470592542 | 1.12875533206306 |
| entropy | 0.97457680106163 | **0.796342800060908** | 0.97774146993955 | 0.83813997109731 | 0.822829554478327 | 0.830711364746094 | 0.912445922692617 |
| num_tokens | 13116786 | **13116786** | 13116786 | 13116786 | 13116786 | 13116786 | 13116786 |
| mean_token_accuracy | 0.745005001624425 | **0.791683614253998** | 0.737687706947327 | 0.777122060457865 | 0.777821401755015 | 0.781337857246399 | 0.754364132881165 |
| epoch | 2 | **2** | 2 | 2 | 2 | 2 | 2 |


<span>__Q7:__ How does the fine-tuned model perform in the following scenarios? Compare this to your answer about the vanilla model in assignment 1.</span> 
1. When all reviews agree with each other
2. When only one review disagrees
3. When opinions in the reviews are mixed (i.e. high disagreement)

When all reviews agree with each other, the fine-tuned model performs excellently, providing specific details that align with the ground truth (Dialogues 5, 8, and 9), while the vanilla model in the previous assignment struggled and failed 2 out of 3 times on the same dialogues. For the case where only one review disagrees, there is no clear example among the first 10 dialogues. When opinions in the reviews are mixed (Dialogue 10), the fine-tuned model performed better by weighting the majority opinion, while the vanilla model confidently gave an incorrect answer based on a single review, ignoring the contradicting reviews (1 vs. 3).

<span>__Q8:__ How does the length of the dialogue/conversation history affect the fine-tuned model's generation? Compare this to your answer about the vanilla model in assignment 1.</span> 

The length of the dialogue/conversation history can be defined as the number of tokens in the input used to generate a response. Qwen3-1.7B natively supports a context window of about 32K tokens, which means it can capture long context and the user's implicit preferences in a given dialogue. During fine-tuning on data from the "hotel" domain, the model was trained to attend to the specific task the user is interested in, while the vanilla model generated its output based only on general pre-training data. This difference explains why the fine-tuned model is able to follow the user's intent more precisely in a long dialogue and complete the booking task successfully.

<span>__Q9:__ Create 5 samples that come from a different domain (e.g., airports, shops, web-stores; you can use an LLM or search the web for this). How does the fine-tuned LLM perform on these, compared to the dialogues in the DSTC-11 task5 domains?</span> 

This question seems to be a key point of the assignment.

The fine-tuning process has created a highly specialized, domain-specific agent. This is a double-edged sword: while the model's performance on the in-domain (DSTC-11) task is excellent, this specialization has caused a catastrophic drop in performance on the out-of-domain (OOD) airport task.

The model is no longer a general-purpose assistant; it is a task-oriented bot that has developed strong, rigid biases. It fails to set aside the logic of the travel-booking domain, resulting in hallucinations, confusion, and a complete failure to assist the user.

This is demonstrated across three key metrics:

**metric1: Task completion**
The fine-tuned model consistently fails to give a correct response to the user's question in the airport-domain dialogues. For example, it gives false information, such as claiming that there are no restaurants at SFO (Dialogue 2). On dialogues from the DSTC-11 Task 5 domains, in contrast, the model performs tasks very well and succeeds, for example, in booking and providing the reference number when asked (Dialogue 2).

**metric2: Helpfulness**
The fine-tuned model is very unhelpful in the airport domain, giving false and confusing information. In Dialogue 4, for example, when asked about lounges in Terminal 4 at JFK, it invents a lounge called "the Nusha", which does not exist, and then asks whether it should provide the address and phone number, which is irrelevant and would confuse the user. On the DSTC-11 Task 5 domains, in contrast, the fine-tuned model answers confidently by providing domain-specific questions.

**metric3: Dialogue state**
The fine-tuned model is unable to track the context and intent of the dialogue. It seems to keep applying logic from the DSTC-11 Task 5 domains. In Dialogue 3, for example, when asked to recommend a terminal for food, the model answers with places to eat near LAX and asks whether the user wants to stay in a specific area of town. The first part shows that the model fails to understand the context, and the second part shows that it fails to understand the user's intent. On the DSTC-11 Task 5 domains, in contrast, the fine-tuned model tracks the flow of the conversation smoothly and answers in a suitable manner.
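For reference, the out-of-domain samples need to follow the same two-file layout as the DSTC11 data so that `reformat_dataset` can read them: a logs file with one list of `speaker`/`text` turns per dialogue, and a labels file with one `response` entry per dialogue. The sketch below writes one abbreviated airport dialogue in that layout; the temporary folder and the placeholder reference answer are assumptions for illustration, not the actual files used in this notebook.

```python
import json
import os
import tempfile

# One out-of-domain dialogue: a list of turns with "speaker" ('U' or 'S') and "text".
logs = [[
    {"speaker": "U", "text": "What's SFO like? I'm flying there with my family."},
    {"speaker": "S", "text": "SFO is noted for being very modern, especially Harvey Milk Terminal 1."},
    {"speaker": "U", "text": "What about food options in Terminal 1, any recommendations?"},
]]
# One label per dialogue; the reference text here is a made-up placeholder.
labels = [{"response": "PLACEHOLDER reference answer written by the annotator."}]

out_dir = tempfile.mkdtemp()  # temporary folder, stands in for a real data directory
for name, payload in (("logs_airports.json", logs), ("labels_airports.json", labels)):
    with open(os.path.join(out_dir, name), "w") as f:
        json.dump(payload, f, indent=2)

# Reload to confirm the file round-trips and the dialogue ends on a user turn.
with open(os.path.join(out_dir, "logs_airports.json")) as f:
    reloaded = json.load(f)
print(len(reloaded), reloaded[0][-1]["speaker"])
```

Keeping the logs and labels lists the same length and in the same order matters, because the two files are paired by index.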



In [29]:
#preprocessing
log_dir_airport = "/home/song0409/Desktop/CAI/data_5samples/logs_airports.json"
labels_dir_airport = "/home/song0409/Desktop/CAI/data_5samples/labels_airports.json"
with open(log_dir_airport,'r') as f:
    ds_airport = json.load(f)
with open(labels_dir_airport,'r') as f:
    labels = json.load(f)

reformatted_dataset_airport = reformat_dataset(ds_airport, labels)
test_ds_airport = Dataset.from_dict(reformatted_dataset_airport)

#test
test_models(test_ds_airport, 5)



Dialogue 1:
    system: You are an assistant.
    user: What's SFO like? I'm flying there with my family.
    system: SFO is noted for being very modern, especially Harvey Milk Terminal 1. Guests also mention the terminals are clean and there are play areas for kids.
    user: That's great to hear. Are the play areas easy to find?
    system: Yes, they are located in several terminals post-security. You can find them in Terminal 1, Terminal 2, and the International Terminal G-gates.
    user: Perfect. What about food options in Terminal 1, any recommendations?

Model Response:

Base Model: 
     That's a great question! Terminal 1 at SFO is a popular spot for food, and there are plenty of options to choose from. Here are some recommendations:

### **Popular Food Options in Terminal 1:**

1. **Cafés and Restaurants:**
   - **The SFO Café (Harvey Milk Terminal 1):** A cozy spot with a great selection of coffee and pastries.
   - **The SFO CafÃ© (Terminal 2):** Also a good option, with