## Part 0: Dataset Analysis

### Motivation, Contributions and Methodology

Until the release of PragmatiCQA, open-domain QA datasets largely focused on evaluating how accurately a system produces the literal answer to a given question. They did not evaluate a system's ability to infer the questioner's unstated needs from context, whether those take the form of likely follow-up questions or of relevant information the questioner is not even aware of because they lack knowledge of the topic. This ability to grasp intent is key to efficient and productive conversations. The fact that common metrics and evaluation datasets had largely ignored it is what motivated the paper's authors to create this dataset, along with corresponding metrics, to allow a meaningful examination of this ability in NLP models.

Specifically, what they've produced is:

* A crowd-sourcing framework that achieves "incentive alignment". The authors claim that many recent crowd-sourced datasets suffer from "incentive misalignment": annotators are rewarded for producing as many examples as they can, so they create 'basic' examples that they can churn out quickly. These examples tend to lack nuance or to resemble one another, which lets a model achieve good results by learning surface-level patterns. This goes against the intent of the dataset creators, since such data does not truly test a model's reasoning abilities, and that gap is where the supposed "incentive misalignment" stems from.  
The authors claim to have solved this issue by letting annotators work on topics they're interested in and having them actually discuss these topics among themselves, which increases their engagement with the task and produces examples that feel natural and resemble ordinary human interaction better than those in other datasets.

* An open-domain ConvQA dataset that follows the prior framework, and features pragmatic answers and metrics that allow for the evaluation of pragmatic reasoning.

* An analysis of their dataset that shows that it presents a challenge to existing models, proving its relevance in the field.

I will now focus on the third point, their analysis of the dataset, and on why it proves challenging for current NLP models.

Each split of the dataset covers its own set of topics, so there is no overlap of topics, and thus no overlap of information, between questions in different splits. This forces the model to actually generalize and rely on its own reasoning capabilities.
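
The no-overlap claim can be checked directly from the split files. The following is a minimal sketch, assuming each line of a split file is a JSON object with a `topic` key (as in the `val.jsonl` records loaded in Part 1). The helper names `topics_in` and `overlapping_topics` and the toy records are made up for illustration; a real check would pass an open file handle for each split instead of the toy lists.

```python
import json
from itertools import combinations


def topics_in(jsonl_lines, key="topic"):
    """Collect the distinct values of `key` from an iterable of JSON lines."""
    return {json.loads(line)[key] for line in jsonl_lines if line.strip()}


def overlapping_topics(splits):
    """Map each pair of split names to the set of topics they share."""
    return {
        (a, b): splits[a] & splits[b]
        for a, b in combinations(sorted(splits), 2)
    }


# Toy in-memory stand-ins for the split files (made-up records, not real data).
toy_train = ['{"topic": "Batman"}', '{"topic": "Vampires"}']
toy_val = ['{"topic": "Edward Elric"}']

splits = {"train": topics_in(toy_train), "val": topics_in(toy_val)}
overlap = overlapping_topics(splits)
print(overlap)
```

An empty set for every pair means the splits are topic-disjoint. The same check can be repeated with `key="community"`.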

The answers in the dataset tend to be built from information found in several different places, substantially more so than in other commonly used datasets in the field. This is quite hard for NLP models, as it requires them to collate information from a large number of different sources.

The answers are also often a combination of small factoids and a larger narrative that ties these factoids to the answer to the original question. A model should be able to replicate this, which is more complex than labelling or giving a direct, literal answer without further consideration.

These aspects tell us the dataset seeks specific pragmatic phenomena:
* The recognition of potential follow-up questions and the inclusion of their answers.
* Being cooperative in the conversation: A model should attempt to keep the conversation flowing with its answers, whether by including relevant information that allows for further discussion or by other methods that humans employ (another would be returning the question, or asking the student follow-up questions after providing sufficient information, but I'm not sure whether the dataset actually covers this case).
* Being selective in providing information: A model shouldn't just provide a list of connected data, but should rather consider the question and the context in which it's asked, like the background of the questioner, and provide relevant information based on these elements.

These aspects all serve to complicate the task for NLP models, and thus challenge them.


### Sample Analysis

1) The topic is 'Vampires', with the starting question being "So, what is a vampire, exactly?"

   The literal answer we'd expect from a non-cooperative teacher would be something along the lines of "An undead monster", which doesn't elaborate greatly on what differentiates vampires from the plethora of other undead monsters in various fictions, like zombies, or even ghosts. 

   The given answer in the dataset was as follows: "Vampires are a kind of undead monster that feeds on the life essence of living creatures like humans."
   This answer provides the full information we expect to see from a literal interpretation of the question - "A kind of undead monster", and further specifies that it 'feeds on the life essence of living creatures', prompting the student to equate them to beasts of prey, as they feed on other living beings. This lets the student distinguish between vampires and say, ghosts, that are depicted as malevolent, metaphysical beings that exist to haunt people.

   * I've got to say that I don't actually like this answer since life essence is such a weird term. When has anyone ever seen a depiction of vampires that sustain themselves on something other than blood? just say blood...

2) The topic is 'The Wheel of Time', with the starting question being "who was the writer of the wheel of time?"

    The literal answer we'd expect from a non-cooperative teacher could simply state that the writer was Robert Jordan, as he is both the original author and the one who wrote the majority of the books, and the student asked about a singular writer. However, it is known that Brandon Sanderson wrote the later books, so we expect a pragmatic answer to include this fact, along with the reason why Brandon Sanderson ended up writing the last few books instead of Robert Jordan (Robert Jordan died).
    

    The given answer in the dataset was as follows: "Robert Jordan is the author but he sadly passed away and his books were finished by Brandon Sanderson."
    This answer is more pragmatic as we can easily see the added information as something necessary - Most people wouldn't think beforehand that a book series was written by more than one person, and they would default to asking about a singular writer or author. This is despite them actually wanting to know about all the potential writers, if there were indeed multiple writers. We expect this basic level of inference in daily conversation, and this answer provides that. The literal answer, however, does not.


3) The topic is 'Cats Musical Wiki', with the starting question being "I am a student and know nothing about cats musical wiki".

    This is an interesting 'question' as it's not phrased as a question, but rather, it is a simple statement when taken literally. If used to initiate a conversation, the other party would recognize this as a request to learn about the topic, or an attempt to make small talk by giving the teacher leeway to introduce tidbits of information of their own choosing to the conversation, thus steering it in their desired direction. 

    Surprisingly enough, both the literal and pragmatic answer spans do not even mention the 'wiki', but rather just talk about the cats musical itself.

    The text provided by the dataset for the literal answer just includes details about the creator, the source material, and a range of dates and locations when and where it was played.

    The answer was as follows: "Cats was one of the longest running plays ever, starting in London and running for 21 years. I was lucky enough to sit in the audience in New York city for a performance once."

    We can see that the teacher chose to share their own experience of the musical despite not being prompted for it. They actively steered the conversation towards that experience, thus proving to be a cooperative conversationalist (the conversation actually continued down that path, with questions like "Where were you seated" and so on), rather than simply providing basic details and closing off the conversation, as we'd expect from a literal answer by a non-cooperative teacher.

4) The topic is 'Edward Elric', with the starting question being "Who is Edward Elric?".

    A literal answer to this question would be quite succinct, such as "A fictional alchemist in 'The Fullmetal Alchemist'", "The main protagonist of 'The Fullmetal Alchemist' series" and so on. We'd expect a pragmatic answer to combine these details and then enrich the answer with context, sharing information about the character's traits or background.

    The answer given does indeed fulfill these expectations: "Edward Elric is the main protagonist of the Fullmetal Alchemist series. Edward lost hist right arm and left leg due to a failed Human Transplantation attempt and became the youngest State Alchemist in history at the age of twelve."

    The answer includes further details than we'd expect from a purely literal interpretation, in line with what one would likely want to know when asking about a fictional character - like a background that hints at the character's motivation, and thus also the main plot of the story.

I will stop here since I think this section is wordy enough already.

## Part 1: The "Traditional" NLP Approach

In [20]:
import json
import os
import torch
import dspy
import numpy as np
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer
from sentence_transformers import SentenceTransformer
from bs4 import BeautifulSoup
from dspy.evaluate import SemanticF1 #no longer necessary
import configparser
from dspy.evaluate.auto_evaluation import (
    SemanticRecallPrecision,
    DecompositionalSemanticRecallPrecision
)
from dspy.predict.chain_of_thought import ChainOfThought
import warnings
import logging
import time
import tqdm

config = configparser.ConfigParser()
config.read('grok_key.ini')
api_key = config['DEFAULT']['XAI_API_KEY']

In [2]:
def get_first_questions(filepath='../PragmatiCQA/data/val.jsonl'):

    questions = []
    
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            conv = json.loads(line)
            first_qa = conv['qas'][0]

            #the spans will only include the text strings, not the keys
            literal_spans = []
            pragmatic_spans = []

            if 'literal_obj' in first_qa['a_meta']:
                for span_obj in first_qa['a_meta']['literal_obj']:
                    literal_spans.append(span_obj['text'])
                    
            if 'pragmatic_obj' in first_qa['a_meta']:
                for span_obj in first_qa['a_meta']['pragmatic_obj']:
                    pragmatic_spans.append(span_obj['text'])

            questions.append({
            'question': first_qa['q'],
            'gold_answer': first_qa['a'],
            'topic': conv['topic'],
            'genre': conv.get('genre', ''),
            'community': conv.get('community', ''),
            'literal_spans': literal_spans,
            'pragmatic_spans': pragmatic_spans
            })

    return questions

questions = get_first_questions()

In [3]:
for i in range(10):
    print(questions[i])

{'question': 'who is freddy krueger?', 'gold_answer': "Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.", 'topic': 'A Nightmare on Elm Street (2010 film)', 'genre': 'Movies', 'community': 'A Nightmare on Elm Street', 'literal_spans': ['Cannot GET /wiki/A%20N'], 'pragmatic_spans': ['Cannot GET /wiki/A%20N']}
{'question': 'who was the star on this movie?', 'gold_answer': "Robert Englund IS Freddy Kruger, the bad guy for these films. Note to you and to Adam, the Pragmatic one, the link here is broken and I can't paste relevant things, as has always been Nightmare's case, I'm perfectly good with answering your questions and will quickly do it, but have to open a tab in another window separate from the hit, I WILL go quickly and answer

As can be seen from this excerpt, a few questions have no usable literal or pragmatic spans: the span text is just a `Cannot GET /wiki/...` error. This is not an issue on my end, as the teachers themselves state in their answers that they cannot access these wikis (see first, third and fourth questions).
Considering that, and considering that the NLP model requires context, I'm left with two choices:

1) Filter out the problematic questions
2) Ignore them and set the context to be the same as the question, which will likely lead to errors and understate DistilBERT's performance.

We'll go with the filtering:

In [3]:
def filter_valid_questions(questions):
    
    valid_questions = []
    
    for q in questions:
        # Check if any spans are invalid (start with "Cannot GET /wiki/")
        invalid_literal = any(span.startswith("Cannot GET /wiki/") for span in q['literal_spans'])
        invalid_pragmatic = any(span.startswith("Cannot GET /wiki/") for span in q['pragmatic_spans'])
        
        # Keep only questions with valid spans in both configurations
        if not invalid_literal and not invalid_pragmatic:
            valid_questions.append(q)
    
    return valid_questions

# Execute filtering
print(f"Original questions: {len(questions)}")
questions = filter_valid_questions(questions)
print(f"Valid questions: {len(questions)}")

Original questions: 179
Valid questions: 174


Five questions have been filtered out (under 3% of the 179), which is not a substantial amount, so it shouldn't really affect our testing.

Below is our model which will handle all three contexts - literal, pragmatic and retrieved spans.

In [5]:
class DistilbertRAG:
    def __init__(self):
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
        self.model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')

        retriever = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
        self.embedder = dspy.Embedder(retriever.encode)

        self.search_dict = {}  # a cache for retrievers

    def create_search(self, community, topk_docs_to_retrieve=5):
        
        if community in self.search_dict:
            return self.search_dict[community]

        if not community:
            return "No community given."
        
        directory = f'../PragmatiCQA-sources/{community}'
        corpus = []
        #just the read_html from rag.ipynb 
        for filename in os.listdir(directory):
            if filename.endswith(".html"):
                with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                    soup = BeautifulSoup(file, 'html.parser')
                    corpus.append(soup.get_text())

        search = dspy.retrievers.Embeddings(embedder=self.embedder, corpus=corpus, k=topk_docs_to_retrieve)
        self.search_dict[community] = search

        return search
    
    def answer(self, question, context_type):
        if context_type == 'literal':
            context = " ".join(question['literal_spans'])

        elif context_type == 'pragmatic':
            context = " ".join(question['pragmatic_spans'])

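        elif context_type == 'rag':
            search = self.create_search(question['community'], topk_docs_to_retrieve=3)
            # create_search returns a plain message string when no community is given;
            # calling that string would raise a TypeError, so bail out with an empty context instead.
            if isinstance(search, str):
                return {"answer": "[No answer found]", "context": ""}
            result = search(question['question'])
            
            # truncating each passage since we want to include multiple docs within a small token limit
            truncated_passages = []
            for passage in result.passages:
                truncated_passages.append((passage[:500] + "...") if len(passage) > 500 else passage)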
            
            context = " ".join(truncated_passages)

        else:
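            # callers index the result as a dict, so return the same shape as the normal path
            return {"answer": "[Invalid context_type]", "context": ""}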

        # calculate available space for context, this is necessary since we want to minimize context token length since we feed it to the LLM afterwards... and that costs money.
        question_tokens = self.tokenizer.encode(question['question'], add_special_tokens=False)
        max_context_length = 512 - len(question_tokens) - 3  # 3 for special tokens [CLS], [SEP], [SEP]
        
        context_tokens = self.tokenizer.encode(context, add_special_tokens=False)
        if len(context_tokens) > max_context_length:
            context_tokens = context_tokens[:max_context_length]
            context = self.tokenizer.decode(context_tokens)
        
        inputs = self.tokenizer.encode_plus(
            question['question'], 
            context,
            add_special_tokens=True,
            max_length=512, #the model can't handle more than 512 tokens as input anyway, so we cap it to prevent errors.
            truncation='only_second',
            return_tensors='pt'
        )
        
        with torch.no_grad():
            outputs = self.model(**inputs)
        start_idx = torch.argmax(outputs.start_logits).item()
        end_idx = torch.argmax(outputs.end_logits).item()
        
        if end_idx < start_idx:
            end_idx = start_idx
            
        tokens = inputs['input_ids'][0][start_idx:end_idx+1]
        answer = self.tokenizer.decode(tokens, skip_special_tokens=True)
        
        return {
            "answer": answer.strip() if answer.strip() else "[No answer found]",
            "context": context
        }
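
The independent `argmax` over start and end logits used above can pick an end position that precedes the start (handled above by collapsing the span to a single token), or a span inside the question segment. A common alternative is to score start/end pairs jointly and restrict them to the context. Below is a minimal sketch of that idea on plain Python lists; `best_span`, `context_start` and `max_answer_len` are hypothetical names, not part of the class above or of the transformers API, and the logits are toy values.

```python
def best_span(start_logits, end_logits, context_start=0, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to context_start <= s <= e < s + max_answer_len.
    Returns None if no valid span exists."""
    best = None
    for s in range(context_start, len(start_logits)):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Toy logits: independent argmax would pick start=3 and end=1, an invalid span.
start = [0.1, 0.2, 0.0, 2.0, 0.3]
end = [0.0, 3.0, 0.1, 0.5, 1.5]
print(best_span(start, end))
```

With the class above, the two lists could come from `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, with `context_start` set to the index of the first context token.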

I have checked out SemanticF1: it does not expose all three scores, it only computes the F1 score. I went through all of its attributes and found nothing for precision or recall, so I'll be calling the signature that SemanticF1 uses (according to the documentation) directly instead.

Below is a small test with two evaluator LMs, as I wanted to see whether I could run a local evaluator on my GPU with 6GB of VRAM.

Spoilers: The local model performed pretty badly.

In [None]:
SigClass = DecompositionalSemanticRecallPrecision
sig_module = ChainOfThought(SigClass)

def semantic_scores(example, pred):
    scores = sig_module(
        question=example.question,
        ground_truth=example.response,
        system_response=pred.response
    )
    precision = scores.precision
    recall = scores.recall
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}

def evaluate_model(model, n=5):
    for i, q in enumerate(questions[:n]):
        print("="*60)
        print(f"Question {i+1}: {q['question']}")
        print("Gold Answer:", q['gold_answer'])

        for context_type in ["literal", "pragmatic", "rag"]:
            ans = model.answer(q, context_type)

            gold_ex = dspy.Example(
                question=q['question'],
                response=q['gold_answer'],
                inputs={'context': ans['context']}
            )
            pred_ex = dspy.Example(response=ans['answer'])

            scores = semantic_scores(gold_ex, pred_ex)

            print(f"\n{context_type.capitalize()} Answer: {ans['answer']}")
            print(f"  Precision: {scores['precision']:.2f}, Recall: {scores['recall']:.2f}, F1: {scores['f1']:.2f}\n")

In [None]:

model = DistilbertRAG()

print("="*30, "EVALUATION WITH QWEN 2.5", "="*30)

evaluator_lm_qwen = dspy.LM('ollama_chat/qwen2.5:3b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=evaluator_lm_qwen)

evaluate_model(model, n=3)

print("="*30, "EVALUATION WITH GROK-3-MINI", "="*30)

evaluator_lm_grok = dspy.LM('xai/grok-3-mini', api_key=api_key)
dspy.configure(lm=evaluator_lm_grok)

evaluate_model(model, n=3)

Question 1: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill

Literal Answer: Bruce Wayne is born to Dr. Thomas Wayne and his wife Martha Kane, two very wealthy and charitable Gotham City socialites
  Precision: 0.25, Recall: 0.25, F1: 0.25

Pragmatic Answer: his parents were killed by a small - time criminal named Joe Chill
  Precision: 0.25, Recall: 0.50, F1: 0.33





Rag Answer: The Batman film franchise consists of a total of nine theatrical live - action films and two live - action serials featuring the DC Comics superhero Batman
  Precision: 0.00, Recall: 0.00, F1: 0.00

Question 2: what is batman's real name?
Gold Answer: Batman was created by Bob Kane and Bill Finger. His real identity is Bruce Wayne.

Literal Answer: Bruce Wayne
  Precision: 1.00, Recall: 0.33, F1: 0.50

Pragmatic Answer: Bruce Wayne
  Precision: 1.00, Recall: 0.33, F1: 0.50





Rag Answer: Bruce Wayne Aliases
  Precision: 1.00, Recall: 0.33, F1: 0.50

Question 3: How old was batman when he first became batman?
Gold Answer: I don't know. It is not clear when Bruce Wayne becomes Batman, but he becomes Batman sometime after his parents die.

Literal Answer: I don't know
  Precision: 0.00, Recall: 0.00, F1: 0.00

Pragmatic Answer: Bruce
  Precision: 0.00, Recall: 0.00, F1: 0.00

Rag Answer: February 23, 1948
  Precision: 0.00, Recall: 0.00, F1: 0.00

Question 1: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill

Literal Answer: Bruce Wayne is born to Dr. Thomas Wayne and his wife Martha Kane, two very wealthy and charitable Gotham City socialites
  Precision: 0.50, Recall: 0.25, F1: 0.33

Pragmatic Answer: his parents were killed by a small - time c

As can be seen above, the qwen 2.5 model is not good at evaluating. Frankly, neither is grok-3-mini. Answering "Bruce" to the question "How old was batman when he first became batman" should get a precision of near-zero, if not zero. Definitely not '1'. Anyway, I'll be proceeding with grok-3-mini as the evaluator model. 
Woe to the grok budget...

In [19]:
def evaluate_all_questions(model, questions):
    import time
    import warnings
    # i'm suppressing the output format warnings since they're frequent and annoying.
    warnings.filterwarnings("ignore", message="Failed to use structured output format")
    
    results = {"literal": [], "pragmatic": [], "rag": []}
    detailed_results = {"literal": [], "pragmatic": [], "rag": []}
    
    print(f"Evaluating {len(questions)} questions...")
    
    for i, q in enumerate(questions):
        print(f"Processing question {i + 1}/{len(questions)}.")
            
        for context_type in ["literal", "pragmatic", "rag"]:
            
            ans = model.answer(q, context_type)
            
            #creating an example like in the semanticf1 example. not sure if passing the context is strictly necessary; it'll be a massive waste of tokens in part 2.
            gold_ex = dspy.Example(
                question=q['question'],
                response=q['gold_answer'],
                inputs={'context': ans['context']}
            )
            pred_ex = dspy.Example(response=ans['answer'])
            
            scores = semantic_scores(gold_ex, pred_ex)
            results[context_type].append(scores)
            
            # storing data for future analysis
            detailed_results[context_type].append({
                "question": q['question'],
                "gold_answer": q['gold_answer'],
                "predicted_answer": ans['answer'],
                "context": ans['context'],
                "scores": scores
            })
            
            # short delay to avoid hitting limits
            time.sleep(3)
    
    # note that i can't use dspy.Evaluate since i need all three (or at least the first two) metrics individually...
    avg_results = {}
    for context_type in results:
        if results[context_type]:  
            avg_results[context_type] = {
                "precision": sum(s["precision"] for s in results[context_type]) / len(results[context_type]),
                "recall": sum(s["recall"] for s in results[context_type]) / len(results[context_type]),
                "f1": sum(s["f1"] for s in results[context_type]) / len(results[context_type])
            }
    
    json_data = {
        "summary": avg_results,
        "detailed_results": detailed_results
    }
    
    with open('part_1_eval.json', 'w') as f:
        json.dump(json_data, f)
    
    
    return avg_results

def print_results_table(avg_results):
    
    print("\nEVALUATION RESULTS")
    print("=" * 50)
    print(f"{'Context Type':<20} {'Precision':<12} {'Recall':<12} {'F1':<12}")
    print("-" * 50)
    
    for context_type in ["literal", "pragmatic", "rag"]:
        if context_type in avg_results:
            precision = avg_results[context_type]["precision"]
            recall = avg_results[context_type]["recall"]
            f1 = avg_results[context_type]["f1"]
            print(f"{context_type.capitalize():<20} {precision:<12f} {recall:<12f} {f1:<12f}")
        else:
            print(f"{context_type.capitalize():<20} {'N/A':<12} {'N/A':<12} {'N/A':<12}")

In [20]:
evaluator_lm_grok = dspy.LM('xai/grok-3-mini', api_key=api_key)
dspy.configure(lm=evaluator_lm_grok)

model = DistilbertRAG()

avg_results = evaluate_all_questions(model, questions)


Evaluating 174 questions...
Processing question 1/174.
Processing question 2/174.
Processing question 3/174.
Processing question 4/174.
Processing question 5/174.
Processing question 6/174.
Processing question 7/174.
Processing question 8/174.
Processing question 9/174.
Processing question 10/174.
Processing question 11/174.
Processing question 12/174.
Processing question 13/174.
Processing question 14/174.
Processing question 15/174.
Processing question 16/174.
Processing question 17/174.
Processing question 18/174.
Processing question 19/174.
Processing question 20/174.
Processing question 21/174.
Processing question 22/174.
Processing question 23/174.
Processing question 24/174.
Processing question 25/174.
Processing question 26/174.
Processing question 27/174.
Processing question 28/174.
Processing question 29/174.
Processing question 30/174.
Processing question 31/174.
Processing question 32/174.
Processing question 33/174.
Processing question 34/174.
Processing question 35/174.
P

#### Results:

In [21]:
print_results_table(avg_results)


EVALUATION RESULTS
Context Type         Precision    Recall       F1          
--------------------------------------------------
Literal              0.816571     0.289534     0.40379242303802887
Pragmatic            0.744828     0.268122     0.3715092992171465
Rag                  0.247222     0.085625     0.11761999779137157
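
When reading the F1 column, note that it is the mean of the per-question F1 scores (as computed in `evaluate_all_questions`), which is not the same quantity as the harmonic mean of the averaged precision and recall. The short sketch below computes the latter from the table's numbers for comparison; `harmonic_f1` is a hypothetical helper name.

```python
def harmonic_f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averaged precision/recall copied from the results table above.
table = {
    "literal": (0.816571, 0.289534),
    "pragmatic": (0.744828, 0.268122),
    "rag": (0.247222, 0.085625),
}

f1_of_means = {name: harmonic_f1(p, r) for name, (p, r) in table.items()}
for name, value in f1_of_means.items():
    print(f"{name:<10} {value:.3f}")
```

Each of these values comes out somewhat higher than the corresponding mean of per-question F1 in the table (for the literal context, roughly 0.43 against 0.40), so the two should not be compared against each other.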


### Analysis

Below I'll be printing some excerpts of questions, and then analyzing them.

In [31]:
with open('part_1_eval.json', 'r') as f:
    data = json.load(f)

print("ANALYSIS OF EVALUATION RESULTS")
print("=" * 60)

print("\n1. RAG EXAMPLES WITH F1 SCORE = 0.0")
print("-" * 40)

rag_zero_examples = []
for i, result in enumerate(data['detailed_results']['rag']):
    if result['scores']['f1'] == 0.0:
        rag_zero_examples.append((i, result))

print(f"Found {len(rag_zero_examples)} RAG examples with F1 = 0.0")
print("\nShowing first 5 examples:\n")

for idx, (i, example) in enumerate(rag_zero_examples[:5]):
    print(f"Example {idx + 1} (Question #{i + 1}):")
    print(f"Question: {example['question']}")
    print(f"Gold Answer: {example['gold_answer']}")
    print(f"RAG Answer: {example['predicted_answer']}")
    print(f"F1 Score: {example['scores']['f1']}")
    print("-" * 40)



ANALYSIS OF EVALUATION RESULTS

1. RAG EXAMPLES WITH F1 SCORE = 0.0
----------------------------------------
Found 126 RAG examples with F1 = 0.0

Showing first 5 examples:

Example 1 (Question #1):
Question: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill
RAG Answer: The Batman film franchise consists of a total of nine theatrical live - action films and two live - action serials featuring the DC Comics superhero Batman
F1 Score: 0.0
----------------------------------------
Example 2 (Question #3):
Question: How old was batman when he first became batman?
Gold Answer: I don't know. It is not clear when Bruce Wayne becomes Batman, but he becomes Batman sometime after his parents die.
RAG Answer: February 23, 1948
F1 Score: 0.0
----------------------------------------
Exa

In [36]:
print("ANALYSIS: LITERAL OUTPERFORMING PRAGMATIC")
print("=" * 60)

# Calculate gaps where literal F1 > pragmatic F1
literal_results = data['detailed_results']['literal']
pragmatic_results = data['detailed_results']['pragmatic']

gaps = []
for i, (lit_result, prag_result) in enumerate(zip(literal_results, pragmatic_results)):
    lit_f1 = lit_result['scores']['f1']
    prag_f1 = prag_result['scores']['f1']
    
    gap = lit_f1 - prag_f1
    
    # Only consider cases where literal is better by more than 0.2
    if gap > 0.2:
        gaps.append({
            'question_idx': i,
            'question': lit_result['question'],
            'gold_answer': lit_result['gold_answer'],
            'literal_answer': lit_result['predicted_answer'],
            'pragmatic_answer': prag_result['predicted_answer'],
            'literal_f1': lit_f1,
            'pragmatic_f1': prag_f1,
            'gap': gap,
            'literal_scores': lit_result['scores'],
            'pragmatic_scores': prag_result['scores']
        })

for idx, example in enumerate(gaps[:10]):
    print(f"Example {idx + 1} (Question #{example['question_idx'] + 1}):")
    print(f"Gap: {example['gap']:.3f} (Literal: {example['literal_f1']:.3f}, Pragmatic: {example['pragmatic_f1']:.3f})")
    print(f"Question: {example['question']}")
    print(f"Gold Answer: {example['gold_answer']}")
    print(f"Literal Answer: {example['literal_answer']}")
    print(f"Pragmatic Answer: {example['pragmatic_answer']}")
    print(f"Literal Scores - P: {example['literal_scores']['precision']:.3f}, R: {example['literal_scores']['recall']:.3f}")
    print(f"Pragmatic Scores - P: {example['pragmatic_scores']['precision']:.3f}, R: {example['pragmatic_scores']['recall']:.3f}")
    print("-" * 60)

ANALYSIS: LITERAL OUTPERFORMING PRAGMATIC
Example 1 (Question #11):
Gap: 0.500 (Literal: 0.500, Pragmatic: 0.000)
Question: how old is batman?
Gold Answer: Batman made his first appearence in media in May of 1939, so he is quite old. He premiered as a vigilante.
Literal Answer: May, 1939
Pragmatic Answer: Batman
Literal Scores - P: 1.000, R: 0.333
Pragmatic Scores - P: 1.000, R: 0.000
------------------------------------------------------------
Example 2 (Question #13):
Gap: 0.400 (Literal: 0.400, Pragmatic: 0.000)
Question: what is batman's real name? 
Gold Answer: Bruce Wayne is the real name of Batman who, after witnessing the murder of his parents as a child, donned a bat-themed costume in order to fight crime.
Literal Answer: Bruce Wayne
Pragmatic Answer: his parents
Literal Scores - P: 1.000, R: 0.250
Pragmatic Scores - P: 1.000, R: 0.000
------------------------------------------------------------
Example 3 (Question #15):
Gap: 0.667 (Literal: 0.667, Pragmatic: 0.000)
Question: 

The most glaring issue of the experiment is the low scoring on the retrieved answers. It's quite easy to see why this is the case.

First, we must consider DistilBERT's input limit, which is rather small (512 tokens), together with the fact that our retriever embeds and returns the full text of each document rather than specific, more relevant sentences or passages. Even after stripping the HTML (with BeautifulSoup), we still end up with quite a lot of "fluff", like section titles, that carries little meaningful information. This fluff consumes most of the tokens available per document, so even when a document is relevant, the text the model actually sees is unlikely to include the relevant information unless it sits at the very start of the document.
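To make the truncation problem concrete, here is a minimal sketch of one possible mitigation: splitting each document into overlapping passages before retrieval, so that a relevant sentence deep in a page can still be reached. This is not what the notebook does; `chunk_words`, its parameters, and the toy document are hypothetical, and word counts only approximate DistilBERT's token counts.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Hypothetical document: a long run of filler followed by the one relevant sentence.
doc = " ".join(["filler"] * 450) + " Batman first appeared in May 1939."
passages = chunk_words(doc, chunk_size=200, overlap=50)

# The relevant sentence now sits inside a short passage instead of past a cutoff.
print(len(passages), passages[-1][-34:])
```

With passages of this size, the retriever would rank and return short spans rather than whole pages, at the cost of a larger index.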

For example, the second question has a retrieved answer of 'February 23, 1948', which is nonsensical in the context of the question, which asks about Batman's age at a certain point in time. A little searching shows that this is the birth date of one of the writers of the Batman series. The date was likely present at the very start of one of the documents the module retrieved, and was one of the very few pieces of information there that are even tangentially related to 'age' or 'dates'.

Regarding pragmatic answers, I've printed out a few examples where the literal answer outperforms the pragmatic one, to see whether the model repeatedly fails at providing more details. After looking at some examples (I checked more than are shown here; the output is truncated), the reality is that the pragmatic answers were both short and very literal, while also being wrong. I believe this shows a clear limitation of the DistilBERT model in extracting answers, even literal ones, from a span that contains more information than what's immediately relevant to the question.
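For reference, the F1 values in the printed examples are consistent with the usual harmonic mean of precision and recall. The sketch below assumes that definition (the Part 1 scorer itself is not shown in this section); `f1_from_pr` is a hypothetical helper, and the inputs are the precision/recall pairs copied from the output above.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the printed examples.
literal_ex1 = f1_from_pr(1.000, 0.333)    # Example 1, literal answer
literal_ex2 = f1_from_pr(1.000, 0.250)    # Example 2, literal answer
pragmatic_ex1 = f1_from_pr(1.000, 0.000)  # Example 1, pragmatic answer

print(f"{literal_ex1:.3f} {literal_ex2:.3f} {pragmatic_ex1:.3f}")
```

A perfect precision with zero recall still gives an F1 of 0, which is why the short, wrong pragmatic answers score nothing despite their `P: 1.000`.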

The model seems to succeed at extracting relatively simple contextual information, like correlating "when" in the question with a date in the span, and so on.

Overall, this model proved to be incapable of proper pragmatic inference / answering.

## Part 2: The LLM Multi-Step Prompting Approach

A brief explanation of the model:
1) Retriever - essentially the same retriever I embedded directly in the previous model for the traditional approach. It caches the search index built for each community in a dictionary, since I can spare the memory for it (besides, we only evaluate part of the dataset at a time, not the entire thing).

2) QAModel - a model designed to handle a single question of a conversation, with I/O fields defined specifically for that purpose. I've decided to incorporate the reasoning and summary strategies, as they take little effort to implement and are likely to improve results. The summary is fed to the model with every question, and the model is expected to return an updated summary that includes the current QA pair, hopefully in a form that's not just a list of QA pairs.

    The summary should emphasize the questions rather than the answers, as the answers come from the model itself and are not ground truth; they could throw off future answers.

3) ConversationModel - mainly a wrapper around the QAModel that maintains the summary across the questions of a single conversation. Running it simply means calling `process_question` iteratively for every question in the conversation.

In [21]:
class Retriever:
    def __init__(self):
        retriever = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
        self.embedder = dspy.Embedder(retriever.encode)
        self.search_dict = {}

    def create_search(self, community, topk_docs_to_retrieve=5):
        if community in self.search_dict:
            return self.search_dict[community]

        if not community:
            return None
        
        directory = f'../PragmatiCQA-sources/{community}'
        corpus = []
        
        for filename in os.listdir(directory):
            if filename.endswith(".html"):
                with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                    soup = BeautifulSoup(file, 'html.parser')
                    corpus.append(soup.get_text())

        search = dspy.retrievers.Embeddings(embedder=self.embedder, corpus=corpus, k=topk_docs_to_retrieve)
        self.search_dict[community] = search
        return search
    

class CooperativeTeaching(dspy.Signature):

    # Inputs for a single turn of the conversation
    question = dspy.InputField(desc="The current question from the student.")
    topic = dspy.InputField(desc="The main topic being discussed.")
    context = dspy.InputField(desc="Retrieved context relevant to the question (max 10k characters).")
    previous_summary = dspy.InputField(desc="Summary of the conversation so far, focusing on student interests and queries")

    # Outputs produced by the model for this turn
    reasoning = dspy.OutputField(desc="A short justification for your answer, no more than a single sentence")
    answer = dspy.OutputField(desc="Cooperative answer that predicts other relevant information the student will need, while staying focused and no longer than two sentences.")
    updated_summary = dspy.OutputField(desc="A new summary that outlines both the previous summary and the current question and answer. No more than 200 words.")
        


class QAModel(dspy.Module):
    
    def __init__(self):
        super().__init__()

        base_prompt = (
            "You are a teacher who provides answers to their students' questions. "
            "You are given a single question, which is part of on-going discussion with a specific student. " 
            "You are tasked with answering their question in a helpful and cooperative manner, that keeps the conversation flow going. " 
            "Your answer should be derived largely from the provided context, which are excerpts from relevant pages of the Fandom wiki which corresponds to the topic of discussion. "
            "You are also tasked with maintaining an up-to-date summary of all student-teacher interactions in this conversation, which should include the current question and answer. " 
            "The summary should be natural, and be helpful to a teacher who might have to answer the next question of the student in a cooperative manner. "
            "Also provide a short reasoning for your answer, no more than a sentence long. "
            
        )

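        # ChainOfThought has no dedicated `prompt` parameter (extra keyword
        # arguments are forwarded as LM config), so the instructions are
        # attached to the signature itself.
        self.generate_response = dspy.ChainOfThought(
            CooperativeTeaching.with_instructions(base_prompt)
        )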
    
    def limit_summary_chars(self, summary, max_chars=1000):
        if len(summary) > max_chars:
            return summary[:max_chars]
        return summary
    
    def forward(self, question, topic, context, previous_summary="New conversation starting"):
        # Ensure previous_summary is within character limits
        limited_summary = self.limit_summary_chars(previous_summary)

        response = self.generate_response(
            question=question,
            topic=topic, 
            context=context,
            previous_summary=limited_summary
        )
        
        # Limit the updated summary to character count
        limited_updated_summary = self.limit_summary_chars(response.updated_summary)
        
        return dspy.Prediction(
            reasoning=response.reasoning,
            answer=response.answer,
            updated_summary=limited_updated_summary
        )


class ConversationModel:
    def __init__(self, qa_model, retriever):
        self.qa_model = qa_model
        self.retriever = retriever
        self.conversation_summary = "New conversation starting"
    
    def process_question(self, question, topic, community):
        
        search = self.retriever.create_search(community, topk_docs_to_retrieve=3)
        if search:
            result = search(question)
            
            # Truncate each passage (document) to 2000 characters
            truncated_passages = []
            for passage in result.passages:
                truncated_passages.append(passage[:2000] + "..." if len(passage) > 2000 else passage)
            
            context = " ".join(truncated_passages)
        else:
            context = "No context available"
        
        prediction = self.qa_model(
            question=question,
            topic=topic,
            context=context,
            previous_summary=self.conversation_summary
        )
        
        self.conversation_summary = prediction.updated_summary
        
        return prediction
    
    def reset_conversation(self):
        """Reset for new conversation"""
        self.conversation_summary = "New conversation starting"
    

In [25]:
# Set up the LM and load the conversations. The same LM is used for both prediction and evaluation.
lm = dspy.LM('xai/grok-3-mini', api_key=api_key)
dspy.configure(lm=lm)

filepath = '../PragmatiCQA/data/val.jsonl'

conversations = []

with open(filepath, 'r', encoding='utf-8') as f:
    for line in f:
        conversations.append(json.loads(line))

retriever = Retriever()
qa_model = QAModel()
conv_model = ConversationModel(qa_model, retriever)

In [None]:
# We'll do an evaluation of a single conversation to check the model's validity.

conversation = conversations[9] 

print(f"Testing conversation on topic: '{conversation['topic']}'")
print(f"Number of questions: {len(conversation['qas'])}")
print("=" * 60)

for i, qa_pair in enumerate(conversation['qas']):
    question = qa_pair['q']
    gold_answer = qa_pair['a']
    
    print(f"\nQuestion {i+1}:")
    print(f"Q: {question}")
    
    try:
        prediction = conv_model.process_question(
            question=question,
            topic=conversation['topic'],
            community=conversation['community']
        )
        
        predicted_answer = prediction.answer
        print(f"Predicted Answer: {predicted_answer}")
        print(f"Gold Answer: {gold_answer}")
        
        print(f"Reasoning: {prediction.reasoning}")
        print(f"Updated Summary: {prediction.updated_summary}")
        
    except Exception as e:
        print(f"Error processing question: {str(e)}")
        predicted_answer = "[ERROR]"
        print(f"Predicted Answer: {predicted_answer}")
        print(f"Gold Answer: {gold_answer}")
    
    print("-" * 60)

print(f"\nConversation complete! Processed {len(conversation['qas'])} questions.")

Testing conversation on topic: 'Batman'
Number of questions: 9

Question 1:
Q: What is Batmans real name?
Predicted Answer: Batman's real name is Bruce Wayne, the wealthy billionaire and philanthropist who uses his resources and skills to fight crime in Gotham City as a vigilante. Since you're asking about this, it seems like you might be curious about superhero origins—would you like me to explain more about Bruce Wayne's backstory, his role in the Justice League, or recommendations for Batman stories to dive deeper?
Gold Answer: Batman's real identity is Bruce Wayne. He lives in Gotham City and is the CEO of Wayne Enterprises.
Reasoning: First, the student's question is "What is Batman's real name?", which is a direct inquiry about the identity of the Batman character. The topic is "Batman", so I reviewed the provided context, which includes multiple references to Batman's real name. In the context, under sections like "The Batman" and "General Information", it explicitly states that

#### Two main issues can be spotted:

1) The model's answers and reasoning are both too long on average. This doesn't really matter for the reasoning, but we should try to keep the predicted answers shorter, as they appear to be two to three times the length of the gold answers.

2) The model isn't actually updating its summary to include the entire conversation - the summary is only about the latest question.

Both issues stem from the prompt, and so I've adjusted the prompt (and the field descriptions) to hopefully improve on these problems. Now, we'll proceed with the prompt optimization.

For the optimization, I've elected to use the F1 score on a question-by-question basis, instead of scores over a full conversation. The main reason is honestly that it's the simplest metric to implement, and it's also likely to be the one that produces the best overall results. A metric that looks at entire conversations might help engineer a prompt that results in better summaries, but it is also likely to overemphasize the historical context of the conversation and cause the model to underperform on questions that come somewhat "out of left field", so to speak.

All in all, the potential advantages of a conversation-based metric over a regular question-based metric are nebulous at best, and it is harder to implement besides. This line of thought is also what dissuaded me from other similar metrics.
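To illustrate the distinction, the sketch below contrasts averaging per question with one simple way a conversation-level score could be aggregated (the mean of per-conversation means). Both helper names and the scores are hypothetical, and a real conversation-level metric could be defined quite differently, for example by judging the whole dialogue at once.

```python
def per_question_average(conversation_scores):
    """Pool every question's score together, regardless of its conversation."""
    all_scores = [s for conv in conversation_scores for s in conv]
    return sum(all_scores) / len(all_scores)


def per_conversation_average(conversation_scores):
    """Average within each conversation first, then across conversations."""
    conv_means = [sum(conv) / len(conv) for conv in conversation_scores if conv]
    return sum(conv_means) / len(conv_means)


# Hypothetical scores for illustration only: one long and one short conversation.
scores = [[0.5, 0.25, 0.0], [0.75]]

question_level = per_question_average(scores)
conversation_level = per_conversation_average(scores)
print(question_level, conversation_level)
```

The two aggregates differ whenever conversations have different lengths: pooling per question gives long conversations more weight, while the per-conversation mean weights every conversation equally.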

A side effect is that the optimizer effectively optimizes for 'first questions': because we evaluate per question, every example starts from the default summary with no carry-over (I am optimizing the QAModel, not the full ConversationModel, which handles the summaries).

In [None]:
import random
import json
from dspy.teleprompt import MIPROv2
from dspy.evaluate import SemanticF1

filepath = '../PragmatiCQA/data/train.jsonl'

train_conversations = []

with open(filepath, 'r', encoding='utf-8') as f:
    for line in f:
        train_conversations.append(json.loads(line))

train_examples = train_conversations[:20]

val_examples = conversations[10:20]

def conversation_to_dspy_examples(conversation_dict):
    examples = []
    
    for qa_pair in conversation_dict['qas']:
        question = qa_pair['q']
        gold_answer = qa_pair['a']
        
        example = dspy.Example(
            question=question,
            topic=conversation_dict['topic'],
            context="", 
            previous_summary="New conversation starting", 
            gold_answer=gold_answer
        ).with_inputs('question', 'topic', 'context', 'previous_summary')
        
        examples.append(example)
    
    return examples


def qa_f1_metric(example, pred, trace=None):
    """F1 metric function using SemanticF1 evaluator"""
    
    gold_ex = dspy.Example(
        question=example.question,
        response=example.gold_answer 
    ).with_inputs('question')

    pred_ex = dspy.Example(response=pred.answer)
    
    semantic_f1 = SemanticF1()
    score = semantic_f1(gold_ex, pred_ex)
    
    return score

def train_qa_optimizer():
    
    train_dspy_examples = []
    for conv in train_examples:
        examples = conversation_to_dspy_examples(conv)
        train_dspy_examples.extend(examples)
    
    val_dspy_examples = []
    for conv in val_examples:
        examples = conversation_to_dspy_examples(conv)
        val_dspy_examples.extend(examples)
    
    optimizer = MIPROv2(
        metric=qa_f1_metric,
        max_bootstrapped_demos=3,
        init_temperature=1.0,
        track_stats=True,
    )
    
    predictor = QAModel()
    
    compiled_predictor = optimizer.compile(
        predictor,
        trainset=train_dspy_examples,
        valset=val_dspy_examples,
        requires_permission_to_run=False
    )
    
    return compiled_predictor

compiled_qa_model = train_qa_optimizer()


The optimizer finished, but it seems to have failed to improve on my original prompt. I'm not sure whether this is because I gave it too few trials, or because my prompt was already pretty good (though the scores are still quite low, an average F1 of roughly 0.3).

Either way, I'll simply proceed with the model as-is; I don't see a reason to spend many more tokens on what's likely to yield minimal improvement.

Now, we'll continue with the evaluation on the first questions of the validation set.

In [None]:
import copy
import warnings  # needed for warnings.filterwarnings below

from dspy.evaluate import Evaluate
from tqdm import tqdm



def evaluate_first_questions(qa_model, questions, num_threads=4):
    
    warnings.filterwarnings("ignore", message="Failed to use structured output format")
    
    devset = [
        dspy.Example(
            question=q['question'],
            topic=q['topic'],
            context="",  
            previous_summary="New conversation starting",
            gold_answer=q['gold_answer']
        ).with_inputs('question', 'topic', 'context', 'previous_summary')
        for q in questions
    ]
    
    def qa_f1_metric(example, pred, trace=None):
        gold_ex = dspy.Example(
            question=example.question,
            response=example.gold_answer
        ).with_inputs('question')
        
        pred_ex = dspy.Example(response=pred.answer)
        
        semantic_f1 = dspy.evaluate.SemanticF1()
        return semantic_f1(gold_ex, pred_ex)
    
    evaluator = Evaluate(
        devset=devset,
        metric=qa_f1_metric,
        display_progress=True,
        display_table=True,
        num_threads=num_threads
    )
    
    return evaluator(qa_model)


In [None]:
avg_f1_first = evaluate_first_questions(compiled_qa_model, questions, num_threads=4)
# I ran the evaluation once before while trying to make the table more presentable (less truncated), but couldn't get it to work well:
# every attempt either removed the truncation entirely or left the per-column string truncation in place. This thing is quite a pain to work with.

Average Metric: 58.78 / 174 (33.8%): 100%|██████████| 174/174 [00:00<00:00, 1189.55it/s]

2025/08/24 15:01:37 INFO dspy.evaluate.evaluate: Average Metric: 58.78399556618422 / 174 (33.8%)





| | question | topic | context | previous_summary | gold_answer | reasoning | answer | updated_summary | qa_f1_metric |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Is the Batman comic similar to the movies? | Batman | | New conversation starting | I would say the movie and comics has same story line, as Batmans p... | Using general knowledge of Batman since no specific context is pro... | Batman comics and movies share core elements like the character's ... | Previously, this was a new conversation starting on the topic of B... | ✔️ [0.286] |
| 1 | what is batman's real name? | Batman | | New conversation starting | Batman was created by Bob Kane and Bill Finger. His real identity ... | Based on general knowledge from DC Comics, as no specific context ... | Batman's real name is Bruce Wayne, a billionaire philanthropist wh... | This conversation is just starting, with the student asking about ... | ✔️ [0.286] |
| 2 | How old was batman when he first became batman? | Batman | | New conversation starting | I don't know. It is not clear when Bruce Wayne becomes Batman, but... | Since the provided context is empty, I am relying on general knowl... | Batman, or Bruce Wayne, first became Batman at the age of 25 after... | The conversation started as a new discussion on Batman. The studen... | ✔️ [0.286] |
| 3 | Does Batman Have super powers, like invisibility, or the ability t... | Batman | | New conversation starting | No, Batman has no super powers like other super heroes because he ... | Based on established Batman lore from DC Comics, he is depicted as... | No, Batman does not have superpowers like invisibility or the abil... | This conversation began with a new query about whether Batman has ... | ✔️ [0.615] |
| 4 | Who are Batman's biggest enemies? | Batman | | New conversation starting | The Joker and Catwoman are original enemies of Batman. However, th... | Based on general knowledge of Batman lore from comics and media, a... | Batman's biggest enemies include the Joker, his chaotic arch-nemes... | Previous summary: New conversation starting. The student asked abo... | ✔️ [0.250] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 169 | Who creat the game of thrones universe? | Game of Thrones | | New conversation starting | Game of Thrones was produced by the HBO cable network. It is based... | Based on general knowledge of the Game of Thrones series, the crea... | The Game of Thrones universe was created by author George R.R. Mar... | This conversation began with a new query about the creator of the ... | ✔️ [0.571] |
| 170 | Where was the Game of Thrones shot? | Game of Thrones | | New conversation starting | The pilot episode was filmed in Northern Ireland and Morocco. This... | Since the provided context is empty, I am drawing from general kno... | Game of Thrones was primarily filmed in Northern Ireland, with add... | New conversation starting. The student asked about the filming loc... | ✔️ [0.444] |
| 171 | who is the protagonist of the show? | Game of Thrones | | New conversation starting | Super tough one, many different episodes focus on different protag... | Game of Thrones features an ensemble cast without a single clear p... | In Game of Thrones, there isn't one definitive protagonist due to ... | This conversation is starting on Game of Thrones. The student aske... | ✔️ [0.286] |
| 172 | when was the firs series released? | Game of Thrones | | New conversation starting | It feels crazy that it's this long ago, but the first season aired... | Based on general knowledge of the Game of Thrones series, I can co... | The first season of Game of Thrones was released on April 17, 2011... | This conversation is about Game of Thrones, starting with the stud... | ✔️ [0.284] |


Comparison to the traditional model will be written below

In [29]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import copy
import warnings

def evaluate_single_conversation(conv_idx, conversation, compiled_qa_model):
    
    warnings.filterwarnings("ignore", message="Failed to use structured output format")
    
    retriever = Retriever()
    conv_model = ConversationModel(compiled_qa_model, retriever)
    
    pred_conversation = copy.deepcopy(conversation)
    
    semantic_f1 = dspy.evaluate.SemanticF1()
    conversation_scores = []
    
    conv_model.reset_conversation()
    
    for qa_idx, qa_pair in enumerate(pred_conversation['qas']):
        try:
            prediction = conv_model.process_question(
                question=qa_pair['q'],
                topic=pred_conversation['topic'],
                community=pred_conversation.get('community', '')
            )
            
            gold_ex = dspy.Example(
                question=qa_pair['q'],
                response=qa_pair['a']
            ).with_inputs('question')
            
            pred_ex = dspy.Example(response=prediction.answer)
            score = semantic_f1(gold_ex, pred_ex)
            
            qa_pair['predicted_answer'] = prediction.answer
            qa_pair['reasoning'] = prediction.reasoning
            qa_pair['updated_summary'] = prediction.updated_summary
            qa_pair['f1_score'] = score
            
            conversation_scores.append(score)
            
        except Exception as e:
            qa_pair['predicted_answer'] = f"[ERROR: {str(e)}]"
            qa_pair['reasoning'] = "[ERROR]"
            qa_pair['updated_summary'] = "[ERROR]"
            qa_pair['f1_score'] = 0.0
            conversation_scores.append(0.0)
    
    return conv_idx, pred_conversation, conversation_scores

def evaluate_full_conversations(qa_model, conversations, num_threads=4):
    
    warnings.filterwarnings("ignore", message="Failed to use structured output format")
    
    pred_conversations = [None] * len(conversations)
    all_scores = []
    
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        future_to_conv = {executor.submit(evaluate_single_conversation, i, conv, qa_model): i 
                         for i, conv in enumerate(conversations)}
        
        for future in tqdm(as_completed(future_to_conv), 
                          total=len(conversations), 
                          desc="Evaluating conversations"):
            conv_idx = future_to_conv[future]
            try:
                conv_idx, pred_conversation, conversation_scores = future.result()
                pred_conversations[conv_idx] = pred_conversation
                all_scores.extend(conversation_scores)
            except Exception as e:
                print(f"Error processing conversation {conv_idx}: {str(e)}")
                pred_conversations[conv_idx] = copy.deepcopy(conversations[conv_idx])
    
    avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
    
    return avg_score, pred_conversations

In [30]:
avg_f1_conv, eval_results_conv = evaluate_full_conversations(qa_model, conversations)

# Write one JSON object per line (no indent) so the file is valid JSONL
# and can be read back line by line.
with open('llm_conv_eval.jsonl', 'w', encoding='utf-8') as f:
    for conversation in eval_results_conv:
        json.dump(conversation, f, ensure_ascii=False)
        f.write('\n')

Evaluating conversations: 100%|██████████| 179/179 [2:09:59<00:00, 43.57s/it]   


In [None]:
pred_conversations = []

# Read the results back; this expects one JSON object per line.
with open('llm_conv_eval.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            pred_conversations.append(json.loads(line))
    

In [32]:
print("="*20, "AVERAGE F1 SCORE", "="*20)
print("SCORE: ", avg_f1_conv)

SCORE:  0.2727486136325308


## 5. Discussion Questions


#### Let's first address the gap between the average score for the first questions and the average score over full conversations: 0.338 vs 0.273.

In a way, this result is expected. As in ordinary human conversation, the longer a conversation goes on, the more context it carries beyond what is being stated at that very moment. For this dataset, this means the later questions in each conversation are essentially "saddled" with prior context that shapes the subsequent responses of both parties, teacher and student (though in our case, we care more about emulating the teacher's responses given the student's).

This issue makes the task of emulating a cooperative answer progressively harder, and it is expected that the resulting score would also be progressively lower.
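One way to check this expectation would be to average the per-question scores by turn position. The sketch below assumes the structure written by `evaluate_single_conversation` (a `qas` list whose entries carry an `f1_score`); `mean_f1_by_turn` is a hypothetical helper, and the toy scores are made up for illustration rather than taken from the notebook's results.

```python
from collections import defaultdict


def mean_f1_by_turn(conversations):
    """Average the per-question 'f1_score' at each turn position."""
    buckets = defaultdict(list)
    for conv in conversations:
        for turn_idx, qa in enumerate(conv["qas"]):
            buckets[turn_idx].append(qa["f1_score"])
    return {idx: sum(s) / len(s) for idx, s in sorted(buckets.items())}


# Hypothetical scores for illustration only, not results from the notebook.
toy = [
    {"qas": [{"f1_score": 0.5}, {"f1_score": 0.25}]},
    {"qas": [{"f1_score": 0.25}, {"f1_score": 0.25}, {"f1_score": 0.0}]},
]

by_turn = mean_f1_by_turn(toy)
for turn_idx, mean_score in by_turn.items():
    print(f"Turn {turn_idx + 1}: {mean_score:.3f}")
```

Running this on the loaded evaluation results would show whether the score really declines with turn position, or whether the gap comes from a few late turns scoring 0.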

Let's look at a bit of an extreme example of this:

- "q": "tell me about season 1 - 4?",

- "a": "As of 2016, five books have been published in the A Song of Ice and Fire series, and author George R.R. Martin has predicted that there will be two more (though he is struggling not to let the final book run long, in which case it would have to be split, for a total of eight books).  The third novel, A Storm of Swords , was so long that it pushed the limits of how large a published book could physically be without pages falling out. Because the third novel was so long, the production team realiz",

- "predicted_answer": "Season 1 of Game of Thrones introduces the complex political intrigue in Westeros, focusing on the Stark and Lannister families, the fight for the Iron Throne, and Daenerys Targaryen's beginnings in Essos. Seasons 2 through 4 escalate the conflicts with the War of the Five Kings, key battles, betrayals, and Daenerys's conquests, building toward major character developments that set up the rest of the series—you might also want to explore how these seasons influence later storylines in Seasons 5 and beyond.",

- "updated_summary": "This conversation began with the student asking if Game of Thrones is real, and I explained it's a fictional TV series based on George R.R. Martin's books, highlighting its fantasy elements. They then inquired about the number of seasons and episodes (8 seasons with varying counts), suggested resources like streaming platforms, and asked about the cast size, describing the large ensemble and recommending IMDb. For Peter Dinklage's character, I confirmed he played Tyrion Lannister, a clever nobleman central to the plot, and for Lord Tywin Lannister, I described him as the powerful head of House Lannister with Andal culture, suggesting related episodes and comparisons to other cultures. Most recently, the student asked about Essos, which I defined as a vast continent east of Westeros with diverse regions and cultural influences, and noted its relevance to character interactions. Now, the student asked about Seasons 1-4, which I summarized as introducing political intrigue and escalating ",

The score for this predicted answer was a surprising 0, even though it answers the question literally and opens an avenue for further discussion by tying in the later seasons. Taken out of the context of the full conversation, this is a pretty good answer. So why did it get a score of 0?

Looking at the gold answer (which was sadly kind of cut off for some reason), it's quite obvious that the topic of the conversation had veered into the realm of the Song of Ice and Fire books (Unless the teacher decided to take a rather sharp turn in discussion and mention the books out of nowhere), which is why the answer focused on the books. By the end, when the production team was mentioned, the answer was cut off, meaning the evaluator could find no real mention of the TV series in the ground truth, and thus found very little relation between it and the predicted answer, giving it an understandable score of 0.

Admittedly, this is kind of my own fault, as I'd decided to NOT forward context to the evaluator model. I don't know if it would've made a substantial difference in scoring, but logically speaking, forwarding the context would've allowed the evaluator to avoid this pitfall.

However, this also justifies what I've explained above; as the conversation progresses, the topic at hand diverges from the original, and the questions become more 'loaded' than they literally are. A language model, without sufficient context, is simply ill-equipped to handle that.

#### Comparing to the traditional model

Compared to the DistilBERT model's score on retrieved context, the LLM's first-question score is significantly higher: an average F1 of 0.338.

This score approaches the scores achieved by DistilBERT using the given information spans for literal (0.4) and pragmatic (0.37) answers. One caveat: `evaluate_first_questions` passes an empty `context` (the reasoning column in the table above says as much), so this number reflects what the LLM already knows about these topics rather than the quality of the retriever; retrieval only comes into play in the full-conversation evaluation. The closeness in score therefore indicates that the information the LLM supplied in its answers is not too far from what the spans contained, but it cannot be credited to the retriever.

The semantic evaluation is a bit iffy, since it's hard for me to tell if the score is low because the answer was strictly bad, or if it was a generally good answer, just very different from the gold answer - some gold answers can consist of some random anecdotes or strong opinions that a model isn't really supposed to be able to replicate. This means that a good answer by the model can be scored low because of this difference.

#### So which approach is better?

I think the answer to this question is pretty self-evident. The LLM performed much better than DistilBERT did with retrieved context (0.338 on first questions and 0.273 over full conversations for the LLM vs ~0.11 for DistilBERT), and while DistilBERT had better scores on the first questions when given the dataset's context spans, that means practically nothing. Since the traditional model only extracts spans that it finds relevant to the question and does not generate any text of its own, all it can output are dry answers that are inherently uncooperative. There is no reason to believe it has advantages over the LLM beyond exceedingly niche use cases like "I only want the model to output a succinct, literal answer when I already have a short passage containing all the necessary information". If I already have the relevant context neatly summarized like that, I have very little need for a model to extract the answers, so it's not a use case I bother considering.



### Theory of Mind

So let's go over whether our model exhibits aspects of ToM. Immediately, I want to say 'no', however, let's examine this a bit more thoroughly - after all, a good part of the reason why I've decided to include the 'summary' field is because it really helps with following the model's so-called "understanding" of the conversation, and by extension, the student's intentions and beliefs. I'll take a few random examples (I'm just going to scroll through the jsonl file until I find something interesting; I'm not going to use some elaborate filtering scheme or anything of the sort) and try to analyze them:

- "updated_summary": "This conversation began with the student inquiring about Jo Frost's role in Supernanny, including techniques like the Naughty Bench for behavioral issues, episodes on tantrums and diets, and the show's global success, while expressing nostalgia and seeking streaming options. Discussions emphasized its practical parenting advice, such as the Naughty Step with age-based timeouts, and addressed effectiveness by recommending adaptations and positive reinforcement. In the current exchange, the student, a fan of the timeout chair, asked about other methods Jo Frost advocated, and the response highlighted the \"Go to Bed Early\" technique for bedtime issues while suggesting exploration of positive discipline strategies like reward charts.",

So, while it seems like the summary really is just a summary of the questions and the model's subsequent answers (not the gold answers), we can spot something really important: it described the student as a fan of the timeout chair! Presumably, the model inferred this from the student's curiosity about the timeout chair, rather than from an explicit declaration that they're a fan of it. I'm actually quite surprised I found something like this, so I will look back and see if the student actually made such a declaration.

To my great disappointment, this was the question: "q": "I am a big fan of the timeout chair.. Any other methods of parenting she advocated?"

So clearly, in this case, the model doesn't really exhibit any aspects of ToM; it did not attribute any mental state to the student of its own volition. Rather, it simply used the information supplied literally within the question. No inference seems to have happened, and there's nothing else in the summary that exhibits ToM, in my opinion.
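The manual check above (does the summary's claim about the student appear almost verbatim in the question?) can be roughly automated. This is only a minimal sketch, assuming each record carries the `q` and `updated_summary` fields shown above; the helper names `tokens` and `literal_overlap` and the choice of word trigrams are my own illustrative assumptions, and the summary in the example is shortened from the one quoted earlier.

```python
import re

def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def literal_overlap(question, summary, n=3):
    """Return the word n-grams of the summary that also occur verbatim in the question."""
    q_tok, s_tok = tokens(question), tokens(summary)
    q_ngrams = {tuple(q_tok[i:i + n]) for i in range(len(q_tok) - n + 1)}
    shared = []
    for i in range(len(s_tok) - n + 1):
        gram = tuple(s_tok[i:i + n])
        if gram in q_ngrams and gram not in shared:
            shared.append(gram)
    return [" ".join(g) for g in shared]

# Hypothetical record built from the example above (summary shortened).
record = {
    "q": "I am a big fan of the timeout chair.. Any other methods of parenting she advocated?",
    "updated_summary": "In the current exchange, the student, a fan of the timeout chair, "
                       "asked about other methods Jo Frost advocated",
}
hits = literal_overlap(record["q"], record["updated_summary"])
print(hits)
```

A non-empty result suggests the summary is echoing the question rather than inferring anything. An empty result doesn't prove that inference happened; it only flags the record as worth reading by hand.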

Another example:

- "updated_summary": "In this ongoing discussion on Alexander Hamilton, the student previously asked about his significance as a Founding Father, roles like Secretary of the Treasury, influence on American finance, cultural impact through the musical \"Hamilton\" by Lin-Manuel Miranda, family life, and the duel with Aaron Burr, with follow-ups on the musical's hip-hop style and the American Revolution's role in its storyline. The response explained the Revolution as the 1775-1783 war central to Act One, focusing on Hamilton's rise and themes like ambition, and later noted Hamilton's death from duel wounds in 1804. Now, the student asked why Aaron Burr wished to kill Hamilton, with the answer highlighting Burr's motivations from Hamilton's public insults and political rivalries, as portrayed in the musical, potentially leading to questions about Burr's aftermath or dueling's societal effects."

Once again, we do not actually see inference on the model's side regarding the student's potential interests, let alone something harder to grasp like their inclinations or what kind of answers they seem to favor (though without feedback, a model can't really determine the latter). This could once again be my fault, as I should've outlined the goals of the summary better, perhaps by using the very same keywords (interests, inclinations) that I've mentioned here.

Last example:

- "updated_summary": "This conversation about Popeye started with questions on the first episode's year (1978 for The All-New Popeye Hour, with origins in 1929 comics and 1930s shorts) and the character's creation from Elzie Crisler Segar's experiences. The student then asked about the idea's origin, which was linked to Segar's background, and now inquires if the cartoon referenced any controversial issues of the day. Based on available context, Popeye touched on themes like nutrition without major controversies, and this builds on the student's interest in Popeye's history, origins, and cultural influences, potentially leading to discussions on character development or societal themes.",

Surprisingly, this summary actually addresses the student's interest directly. Admittedly, this is a very surface-level "inference", and I'm hesitant to consider it a proper example of ToM...

Overall, I'd consider this "sophisticated pattern-matching", as you referred to it. It is very likely that the model added the line about interest after the student directly stated something along the lines of "I'm interested to know if..." in one of the earlier questions. However, we must take into account the possibility that my own prompt steered the model towards a more objective analysis / summarization of the interaction, rather than one that allows inference to occur freely. I would not draw a conclusion from a single experiment.

It would be much better to run this experiment in small batches with multiple prompt strategies to actually reach a satisfactory conclusion.
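
A rough sketch of what such a batched comparison could look like. Everything here is hypothetical: the prompt wordings, the `summarize` callable standing in for the actual LLM call, and the keyword-based `mentions_mental_state` proxy are placeholders I made up for illustration, not what I actually ran.

```python
import random

# Hypothetical prompt variants; the wording is illustrative only.
PROMPTS = {
    "neutral": "Summarize the conversation so far.",
    "tom": "Summarize the conversation and state the student's likely interests and inclinations.",
}

# Crude vocabulary list; a real study would need human judgment instead.
MENTAL_STATE_WORDS = ("interest", "inclin", "prefer", "curious", "wants")

def mentions_mental_state(summary):
    """Proxy check: does the summary use any mental-state vocabulary?"""
    lowered = summary.lower()
    return any(word in lowered for word in MENTAL_STATE_WORDS)

def run_batches(conversations, summarize, batch_size=5, n_batches=3, seed=0):
    """For each prompt strategy, record the flagged fraction per sampled batch."""
    rng = random.Random(seed)
    results = {name: [] for name in PROMPTS}
    for _ in range(n_batches):
        # Every strategy sees the same batch, so rates are comparable.
        batch = rng.sample(conversations, min(batch_size, len(conversations)))
        for name, prompt in PROMPTS.items():
            flagged = sum(mentions_mental_state(summarize(prompt, conv)) for conv in batch)
            results[name].append(flagged / len(batch))
    return results

def fake_summarize(prompt, conversation):
    """Stand-in for the real model call so the sketch runs offline."""
    if "interests" in prompt:
        return f"The student seems interested in {conversation}."
    return f"The student asked about {conversation}."

convs = ["Supernanny", "Alexander Hamilton", "Popeye", "Jo Frost", "Aaron Burr", "Segar"]
rates = run_batches(convs, fake_summarize)
print(rates)
```

Swapping `fake_summarize` for the real model call and comparing the per-batch rates across strategies would show whether the prompt wording, rather than the model, is what suppresses the inference. The keyword proxy can't tell an echoed statement from a genuine inference, so flagged summaries would still need to be read by hand.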
