## Conversational AI (25/26) Assignment 1: LLM inference in task-oriented dialogues

LLMs are being widely deployed as dialogue systems, e.g. QA systems. 
In this assignment, we will be loading a large language model (Qwen3) and using it for task-oriented conversational modeling with subjective knowledge. 

The DSTC11 Task 5 and its dataset are explained in detail [here](https://github.com/alexa/dstc11-track5). Also, you can read more about it in the [paper](https://arxiv.org/pdf/2305.12091).

In essence, the task is to respond to subjective user requests regarding hotels and restaurants, e.g., "does the restaurant have a nice vibe?". Our conversational model has to generate a response to this question based on conversation history and previous reviews it has access to. 

This task consists of three sub-tasks: 
1. Knowledge-seeking Turn Detection
2. Knowledge Selection
3. **Knowledge-grounded Response Generation** (our focus)

We will be focusing on the last sub-task, _knowledge-grounded response generation_, for which we need the dialog history, request, and knowledge. With these as the input, the model should then generate a response.

**This notebook contains the following parts:**
1. Exploring the DSTC11 Task5 data and formatting it in the proper structure
2. Off-the-shelf inference with Qwen3-1.7B
3. Analyzing the model behavior

## Assignment

We will be using the dialogue history between a user and a system to have Qwen generate a response to a knowledge-seeking request from the user. 
The goal is to load an LLM and understand the strengths and limitations of its off-the-shelf usage for task-oriented dialogues.

Throughout the notebook, you will find questions you need to answer to complete the assignment. These are both coding questions and questions that evaluate your understanding of the data, the process, and the model behavior. These questions are indicated as  **<span style="background:yellow">Q#</span>**.

**Assignment steps:**
1. Load the dataset and understand its structure, individual components, what the inputs should be and what the output should be. **(Q1)**
2. Convert the dataset to a structure that is useful for our LLM to generate its responses with. **(Q2 - Q4)**
4. Answer questions about the assignment, based on your implementation and manual analysis of 10 samples. **(Q5 - Q9)**
    

**Submission:**
Please submit your code (as a Kaggle notebook) on Canvas by **3rd November 23:59**.

**Grading:** The assignment is graded with a pass/fail grade.

## A couple of notes about Kaggle notebooks
* [Terms and conditions](https://www.kaggle.com/terms)
* Short note on the [GPU limits](https://www.kaggle.com/discussions/general/108481)
* [Uploading your own model to Kaggle](https://www.kaggle.com/discussions/questions-and-answers/63328)

**Kaggle directories** may be confusing because they are present on another machine that we have no access to (effectively "in the cloud"). 
To help your orientation, keep this in mind:
* Input data files are available in the read-only "/kaggle/input/" directory
* You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
* You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Kaggle seems to run smoothly, including in terms of GPUs. There are two common environment issues that happen because we forget to turn on the internet or the GPUs. Specifically:
* `ERROR: Could not find a version that satisfies the requirement ... (from versions: none) ERROR: No matching distribution found for ...` - to solve this issue, turn your notebook Internet ON.
Make sure your accelerator is set to `GPU P100` or `GPU T4x2`

**<span style="color:red">Important</span>**

These are five good practices when working with Kaggle notebooks:
1. Pay attention to usage statistics, especially memory, CPU, and GPU
2. Pay attention to the quota of GPU (measured in hours)
3. "Turn OFF" the internet after each session. Turn on the internet when starting a session.
4. "Turn off" the accelerator after each use. Turn the accelerator on when starting a session.
5. Save a version after you make changes. This ensures that your teammate can see the latest changes. If you get a question from Kaggle about versions, you can revert to the latest version.

## Installation

The Kaggle notebooks use a Python 3 environment, and they are already "pre-loaded" with various analytic Python packages, like Json, Pandas, and Numpy. If you are curious, you can see the package definition in [this repository](https://github.com/kaggle/docker-python).

We will install the Transformers package for working with language models.

In [1]:
!pip install transformers==4.53.3

Collecting huggingface-hub<1.0,>=0.30.0 (from transformers==4.53.3)
  Downloading huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Downloading huggingface_hub-0.36.0-py3-none-any.whl (566 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m566.1/566.1 kB[0m [31m12.2 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25hInstalling collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 1.0.0rc2
    Uninstalling huggingface-hub-1.0.0rc2:
      Successfully uninstalled huggingface-hub-1.0.0rc2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.1.1 requires pyarrow>=21.0.0, but you have pyarrow 19.0.1 which is incompatible.
gradio 5.38.1 requires pydantic<2.12,>=2.0, but you have pydantic 2.12.0a1 which is incompatible.[0m[31m
[0mSuccessfully installed huggingface-hu

## Imports

The following code loads several standard packages and packages for working with LLMs.

In [2]:
import numpy as np 
import json
import os
import shutil
import subprocess
import sys
from typing import List
from datasets import Dataset


from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModelForSequenceClassification, pipeline

2025-11-02 11:38:14.946832: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762083495.141324      37 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762083495.195491      37 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Clone the data/task repository

We start by cloning the data and task repository and changing the working directory to that directory.

In [6]:
def setup_repo(repo_url: str, repo_name: str, work_dir: str = "/kaggle/working"):
    os.chdir(work_dir)
    
    # Remove repo if it exists
    if os.path.exists(os.path.join(work_dir, repo_name)):
        shutil.rmtree(os.path.join(work_dir, repo_name))
    
    # Clone repo
    subprocess.run(["git", "clone", repo_url], check=True)
    
    # Move into repo/data
    os.chdir(os.path.join(repo_name, "data"))


setup_repo("https://github.com/lkra/dstc11-track5.git", "dstc11-track5")

Cloning into 'dstc11-track5'...


## Exploring the data
The data we will use contains these essential components: 
- __knowledge__: The reviews for the corresponding restaurant or hotel (the data also contains FAQs - we won't be using those!)
- __dialogue__: The dialogue history between the user and the system
- __response__: The ground-truth response we expect from the system

Let's list all files in the current directory iteratively and then load the data first to understand it better!

In [7]:
## List all files in the current directory iteratively:
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./knowledge_aug_domain_reviews.json
./README.md
./knowledge_aug_reviews.json
./output_schema.json
./knowledge.json
./test/labels.json
./test/logs.json
./train/labels.json
./train/logs.json
./train/logs_bkp.json
./train/bkp/labels.json
./train/bkp/logs.json
./val/labels.json
./val/logs.json


Loading the data: For this assignment, we will only use the __training data__. So let's load it.

In [8]:
with open('train/logs.json', 'r') as f:
    train_ds=json.load(f)

Let's look at the first example:

In [9]:
sample = train_ds[0]
sample

[{'speaker': 'U',
  'text': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'},
 {'speaker': 'S', 'text': 'sure, i have 17 options for you'},
 {'speaker': 'U',
  'text': "Are any of them in the south? I'd like free parking too."},
 {'speaker': 'S',
  'text': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?'},
 {'speaker': 'U',
  'text': 'I have back issues. Does this place have comfortable beds?'}]

## Conversation History
The first index is the dialogue, clearly marking what the user and system have said until now. The last dictionary in this list is the last utterance from the dialogue, indicating the query that the model should respond to. 

<span style="background:yellow">__Q1:__ In the block below, write the code to first access the dialogue from _sample_, and then the query:</span>

In [10]:
dialogue = sample[:-1]
query = sample[-1]

dialogue, query

([{'speaker': 'U',
   'text': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'},
  {'speaker': 'S', 'text': 'sure, i have 17 options for you'},
  {'speaker': 'U',
   'text': "Are any of them in the south? I'd like free parking too."},
  {'speaker': 'S',
   'text': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?'}],
 {'speaker': 'U',
  'text': 'I have back issues. Does this place have comfortable beds?'})

<span style="background:yellow">__Q2:__ To make it more readable for the model (and also ourselves), let's reformat the conversation history to the following:</span>

>[
>{'role': 'user', 'content': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'},
>{'role': 'system', 'content': 'sure, i have 17 options for you'}
>...]

Formats a list of dialogue by turning it into a readable list.

Given a list of dictionaries, each representing a dialogue turn, this function formats the dialogue by prefixing
each turn with either "user" or "system" as the role based on the speaker. The speaker is identified by the 'speaker' key in the dictionary: 'U' for user, else it is the system.

In [11]:
def format_dialogue(sample: List[dict]) -> List[dict]: 
    """
    Args:
    sample (List[dict]): A list of dictionaries where each dictionary contains two keys:
        - 'speaker' (str): A string indicating the speaker of the turn ('U' for user, 'S' for system).
        - 'text' (str): The text spoken by the respective speaker.

    Returns:
        List[dict]: A new array with a specific role and content

    """
    # Your solution here
    messages=[]
    messages.append({"role": "system", "content": "You are an assistant."})
    for dialogue_element in sample:
        role = dialogue_element['speaker']
        role = 'user' if role == 'U' else 'system'
        messages.append({"role": role, "content": dialogue_element['text']})

    return messages
    
    

print(format_dialogue(dialogue))

[{'role': 'system', 'content': 'You are an assistant.'}, {'role': 'user', 'content': 'Can you help me find a place to stay that is moderately priced and includes free wifi?'}, {'role': 'system', 'content': 'sure, i have 17 options for you'}, {'role': 'user', 'content': "Are any of them in the south? I'd like free parking too."}, {'role': 'system', 'content': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?'}]


### Knowledge

Let's look at the knowledge, which contains the reviews for the conversation:

In [12]:
## first load the labels
with open('train/labels.json', 'r') as f:
    labels=json.load(f)

From the labels, load the necessary knowledge, and print it. 

In [13]:
knowledge_sample_list = labels[0]["knowledge"]
print(knowledge_sample_list)

[{'domain': 'hotel', 'entity_id': 11, 'doc_type': 'review', 'doc_id': 3, 'sent_id': 1}, {'domain': 'hotel', 'entity_id': 11, 'doc_type': 'review', 'doc_id': 2, 'sent_id': 6}, {'domain': 'hotel', 'entity_id': 11, 'doc_type': 'review', 'doc_id': 4, 'sent_id': 3}, {'domain': 'hotel', 'entity_id': 11, 'doc_type': 'review', 'doc_id': 2, 'sent_id': 5}, {'domain': 'hotel', 'entity_id': 11, 'doc_type': 'review', 'doc_id': 3, 'sent_id': 5}]


As you can see, here we just have some entries provided. We need to extract the knowledge from the knowledge base using these entries.

So, let's now load the knowledge base.

In [14]:
with open('knowledge.json', 'r') as f:
    knowledge_base=json.load(f)

This is again a list of different reviews. If we were to give this knowledge directly to the LLM, we are giving it extra information that is not necessary for generating the response. We only need the review _itself_. 

<span style="background:yellow">__Q3:__ In the block below, write a function that takes as input the knowledge of one sample and outputs the reviews in a list:</span>

This function extracts and returns a list of review sentences from the given knowledge data.

Given a list of dictionaries representing knowledge data, this function collects review text from each dictionary.

In [15]:
def get_reviews(knowledge: List[dict]) -> List[str]: 
    """
    Args:
        knowledge (List[dict]): A list of dictionaries containing review information.

    Returns:
        List[str]: A list of strings where each string is a review extracted from the knowledge data.

    """
    # Your solution here
    sources = []
    for k in knowledge:
        try:
            domain = k["domain"]
            entity_id = str(k["entity_id"])
            doc_type = k["doc_type"]
            doc_id = str(k["doc_id"])
            sent_id = str(k["sent_id"])
            
            # Extract the review sentence from the knowledge base
            sentence = knowledge_base[domain][entity_id][f"{doc_type}s"][doc_id]['sentences'][sent_id]
            sources.append(sentence)
        except:
            continue

    return sources
    

print(get_reviews(knowledge_sample_list))

['The room was clean and comfortable and not expensive.', 'It could ruin your stay if you mind that kind of thing.', "Sadly though, I found that the bed in the room wasn't very comfortable at all.", 'I do have to say, though, the bed is extremely uncomfortable.', 'and the interior of the room was very good and bed was also very much comfortable.']


### Response (Ground-truth)
Similarly, we can load the ground-truth response. 

In [16]:
response = labels[0]['response']
response

'The Bridge Guest House is known for having pretty uncomfortable beds according to most guests. Only one guest found it to be comfortable.'

## Create Dataset
Until now we have looked at and applied our functions to only one sample.

<span style="background:yellow">__Q4:__ In the block below, write a function that takes the content and labels as input and creates a dataset.</span>

Let us now correctly format our entire dataset! Essentially, we want our nicely formatted dialogue, knowledge, and the ground-truth response. We can do this as follows: 

__Important__: In this step, you should use the previously defined functions and the loaded data. Most of your work here involves calling those functions or using the provided data. Avoid starting from scratch.

In [17]:
def reformat_dataset(dataset, labels_dataset): 
    reformatted_dataset = {
        "dialogue": [],
        "knowledge": [],
        "response": [],
    }
    # Your solution here
    for sample_index in range(len(dataset)): 
        try:
            sample_dialogue = format_dialogue(dataset[sample_index])
            sample_knowledge = labels_dataset[sample_index].get("knowledge", [])
            sample_response = labels_dataset[sample_index].get("response", "")
            
            reformatted_dataset["dialogue"].append(sample_dialogue)
            reformatted_dataset["knowledge"].append(sample_knowledge)
            reformatted_dataset["response"].append(sample_response)
        except:
            continue
        
    return reformatted_dataset

reformatted_dataset = reformat_dataset(train_ds, labels)
dataset = Dataset.from_dict(reformatted_dataset)
dataset

Dataset({
    features: ['dialogue', 'knowledge', 'response'],
    num_rows: 32604
})

Now if we access the first sample, it is much more readable and we can access the different components directly!

In [18]:
dataset[0]

{'dialogue': [{'content': 'You are an assistant.', 'role': 'system'},
  {'content': 'Can you help me find a place to stay that is moderately priced and includes free wifi?',
   'role': 'user'},
  {'content': 'sure, i have 17 options for you', 'role': 'system'},
  {'content': "Are any of them in the south? I'd like free parking too.",
   'role': 'user'},
  {'content': 'Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?',
   'role': 'system'},
  {'content': 'I have back issues. Does this place have comfortable beds?',
   'role': 'user'}],
 'knowledge': [{'doc_id': 3,
   'doc_type': 'review',
   'domain': 'hotel',
   'entity_id': 11,
   'sent_id': 1},
  {'doc_id': 2,
   'doc_type': 'review',
   'domain': 'hotel',
   'entity_id': 11,
   'sent_id': 6},
  {'doc_id': 4,
   'doc_type': 'review',
   'domain': 'hotel',
   'entity_id': 11,
   'sent_id': 3},
  {'doc_id': 2,
   'doc_type': 'review',
   'doma

## Off-the-shelf usage

To see what our LLM does without any adaptation, let's run inference directly on a couple of samples!
We first need to load our model. 

In [19]:
model_name = "Qwen/Qwen3-1.7B"
qwen3_tokenizer = AutoTokenizer.from_pretrained(model_name)
qwen3_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]



model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/622M [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Now, let’s generate the responses and compare them to the ground-truth answers.

In [20]:
messages = dataset[0]['dialogue']

text = qwen3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = qwen3_tokenizer([text], return_tensors="pt").to(qwen3_model.device)

generated_ids = qwen3_model.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

print("MODEL: ", qwen3_tokenizer.decode(output_ids, skip_special_tokens=True).strip())
print("Ground-truth: ", dataset[0]["response"])

MODEL:  Yes, the Bridge Guesthouse has comfortable beds. They are designed to provide a restful and supportive sleeping experience, which is especially important for people with back issues. If you'd like, I can check if there are any specific amenities or features that would be helpful for your needs. Would you like me to assist with that?
Ground-truth:  The Bridge Guest House is known for having pretty uncomfortable beds according to most guests. Only one guest found it to be comfortable.


In [22]:
messages = dataset[1]['dialogue']

text = qwen3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = qwen3_tokenizer([text], return_tensors="pt").to(qwen3_model.device)

generated_ids = qwen3_model.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

print("MODEL: ", qwen3_tokenizer.decode(output_ids, skip_special_tokens=True).strip())
print("Ground-truth: ", dataset[1]["response"])

MODEL:  Yes, the Curry Garden has a lovely outdoor dining area with a beautiful garden. It's a great spot for a relaxed meal, especially if you're looking to try something new. Would you like me to check the reservation details for you?
Ground-truth:  The Curry Garden has a patio outside that guests say they enjoy. Do you want to make a reservation now?


In [21]:
messages = dataset[20]['dialogue']

text = qwen3_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
model_inputs = qwen3_tokenizer([text], return_tensors="pt").to(qwen3_model.device)

generated_ids = qwen3_model.generate(**model_inputs, max_new_tokens=500)
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]

print("MODEL: ", qwen3_tokenizer.decode(output_ids, skip_special_tokens=True).strip())
print("Ground-truth: ", dataset[20]["response"])

MODEL:  The entrance fee for the Broughton House Gallery is £6. Would you like me to proceed with the booking?
Ground-truth:  


In [24]:
# Assuming 'dataset' contains the reformatted dataset and 'model_outputs' contains the model's responses
for i in range(10):
    print(f"Dialogue {i + 1}:")
    print("Dialogue:")
    for turn in dataset[i]["dialogue"]:
        print(f"  {turn['role']}: {turn['content']}")
    
    print("\nKnowledge:")
    if labels[i]["target"]!=False:
        knowledges = get_reviews(labels[i]["knowledge"])
        for id,knowledge in enumerate(knowledges):
            print(f" review {id}. {knowledge}")
    else:
        print("  No reviews")
    
    print("\nGround Truth Response:")
    if dataset[i]["response"] !='':
        print(f"  {dataset[i]['response']}")
    else:
        print("  No ground truth")
    
    print("-" * 80)

Dialogue 1:
Dialogue:
  system: You are an assistant.
  user: Can you help me find a place to stay that is moderately priced and includes free wifi?
  system: sure, i have 17 options for you
  user: Are any of them in the south? I'd like free parking too.
  system: Yes, two are in the south and both have free parking and internet. I recommend the Bridge Guesthouse. Would you like me to book a reservation?
  user: I have back issues. Does this place have comfortable beds?

Knowledge:
 review 0. The room was clean and comfortable and not expensive.
 review 1. It could ruin your stay if you mind that kind of thing.
 review 2. Sadly though, I found that the bed in the room wasn't very comfortable at all.
 review 3. I do have to say, though, the bed is extremely uncomfortable.
 review 4. and the interior of the room was very good and bed was also very much comfortable.

Ground Truth Response:
  The Bridge Guest House is known for having pretty uncomfortable beds according to most guests. On

## Analyze Model Responses
Let's look at what types of generations the LLM produces. For this, pick the first **10** samples for the analysis. 
Then, based on your observations, answer the following 5 questions (remember to give examples to illustrate your findings).


#### Questions

Please answer each question with a single paragraph, consisting of 2-5 sentences. Include examples to support your answers.

<span style="background:yellow">__Q5:__ Describe the dataset in detail, focusing on its structure, domains, and size. </span> 

The dataset is consist of structured dialogue samples comfiled in json format where each includes 'dialogue', 'knowledge' and 'response'. The 'dialogue' represents a list of dictionaries with conversational turns between a 'user' and 'system; each turn is identifiued by the key, 'role' and 'content'. The field 'knowledge' contains external reference data such as 'domain', 'doc_id', 'doc_type', 'entity_id', and 'sent_id', serving as the basis for generating responses. The field 'response'  is the ground-truth reply which the system is expected to produce based on the 'dialogue' and 'knowledge' (generating a simlar contextual response). The dataset contains two types of domains in the hospitality secotors, 'hotel' and 'restaurant, and a planty of various and complex samples corresponding reviews on the domains. The size of the dataset is 32,604 elements on both domains, which means that there are sufficient data for analysis.

<span style="background:yellow">__Q6:__ What data pre-processing steps did we take and why?</span> 

The dataset was pre-processed to ensure its usability as input for the model. The dialogues were reformatted to align with the 'format_dialogue' function, this served to standardize the roles of 'user' and 'system' content structure in a robust manner. Then, the 'knowledge' and 'response' were extracted and linked to the specific dialogues, this created a complete sample. Additionally, the errors regarding invalid data or null data are filtered out during the processing through try .. except block to improve the reliability of the generated output.

<span style="background:yellow">__Q7:__ What do you observe about the quality of the generated responses? Do they effectively address the query? 
Explain what data components were considered by the LLM to generate the response. What role do reviews, as a source of knowledge, play in generating the responses?</span> 

The quality of the generated responses are various some extends. Some of them do effectively address the query and fullfill the user's inquairy considering given dialogue. As an example, the 'dataset[0]', the model confidently states that the Bridge Guesthouse has comfortable beds designed to provide a resful night's sleep; directly contradicting the ground_truth where it was mentioned that the beds were pretty uncomfortable according to the review from most guests, and only a single guest found the beds comfortable. This discrepancy suggests that the model could have been outweight engaging and pleasing output rather than representing the real feedback from the knowledge given from reviewers, which means that the model positively biased during the training for some reasons. Moreover, the model could provide far-fetched details on the response which were not supported by the knowledge, in the 'dataset[1]' for instance, the responses contained extra details about the Curry Garden's outdoor dining area 'beautiful garden' in a 'relaxed and scenic setting'. 

<span style="background:yellow">__Q8:__ How does the LLM perform in the following scenario?</span> 
1. When all reviews agree with each other
2. When only one review disagrees
3. When opinions in the reviews are mixed (i.e., high disagreement)

The LLM produces confident and accurate responses that well suppored by the knowledge when all reviews are in agreement(A1). For example, if all reviews of a hotel are praising the customer service, the model will generate a response in a high confidence. However, when faced with disagreement(A2,A3), the model behavior is mixed: it occasionally overlooks a single disagreeing review, focusing instead on the majority perspective, which suggests it does not always capture a balanced view of all aspectes (e.g., dataset[20]), More critically, when opinions are truly mixed, the model often defaults to generating positive and pleasing responses even if some positive and negative feedbacks on a particular perspective coexist in a hotel facility, no matter how many opinions are categorized on each side(dataset[0]).


<span style="background:yellow">__Q9:__ Compare the behavior of the model for dialogues of different length. Do you observe any impact of the length of the dialogue/conversation history on the LLM's generation?</span> 


We interpret "length" as the number of turns in each dialogue, as this represents the frequency of interaction between the system the user. This is particularly significant in the architecture of Mixture-of-Experts (MoE) models like Qwen3, whereas metrics like the number of tokens (common in BoW models) are more typical for classical NLP.

The number of turns indeed impacts the model's ability to generate coherent and relevant reponses. In brief dialogues, the model is able to perform reasonably well, generating concise and contextually accurate replies. For example, in 'dataset[1]' the model effectively responds to the query about Curry Garden's outdoor dining area. Although, in long dialogues, such as 'dataset[20] the model had trouble to maintain coherence and relevance; it generated a response which included some extra details that were not supported by the knowledge as well as the ground_truth. This phenomenon suggests that while the model is able to extract information from extended dialogues in some degree, however, the relevance and coherence may decrease as the number of turns increases suggesting the model loses important contextual elements for analysis. This limitations suggests a key area for future research should be improving the model's ability to maintain context and coherence in complex, multi-turn dialogues.