## Part 0: Dataset Analysis

### Motivation, Contributions and Methodology

Until the release of PragmatiCQA, open-domain QA datasets largely focused on evaluating how accurately a system produces the literal answer to a given question. They did not evaluate a system's ability to infer the questioner's unstated needs from context, whether in the form of likely follow-up questions or of relevant information the questioner is not even aware of because they lack knowledge of the topic. This ability to grasp intent is key to efficient and productive conversations. The fact that common metrics and evaluation datasets had largely ignored it is what motivated the paper's authors to create this dataset, along with corresponding metrics, so that this ability can be meaningfully examined in NLP models.

Specifically, what they've produced is:

* A crowd-sourcing framework that achieves "incentive alignment". The authors claim that many recent crowd-sourced datasets suffer from "incentive misalignment": annotators are rewarded for producing as many examples as they can, so they create 'basic' examples that they can churn out quickly. These examples tend to lack nuance or to resemble one another, which lets a model achieve good results by learning surface-level patterns. This goes against the intent of the dataset creators, since it does not truly test a model's reasoning abilities; that mismatch is the "incentive misalignment".     
The authors claim to have solved this issue by letting annotators work on topics they're interested in and having them actually discuss these topics with each other, which increases their engagement with the task and produces examples that feel natural and resemble ordinary human interaction more closely than those in other datasets.

* An open-domain ConvQA dataset built with the framework above, featuring pragmatic answers and metrics that allow for the evaluation of pragmatic reasoning.

* An analysis of their dataset showing that it presents a challenge to existing models, which supports its relevance in the field.

I will now focus on the third point, which is their analysis of the dataset, and why it proves to be challenging to current NLP models.

Each dataset split contains its own set of topics, so no topic, and therefore no topic-specific information, is shared between splits. This forces the model to actually generalize and rely on its internal reasoning capabilities.
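
The topic-disjointness claim can be checked directly. Below is a minimal sketch, assuming each split is a JSONL file whose records carry a `topic` field (the field that the loading code in Part 1 reads). The toy records and the helper names `topics_in_split` and `overlapping_topics` are hypothetical and only illustrate the check.

```python
import json
from itertools import combinations


def topics_in_split(lines):
    """Collect the set of topics from an iterable of JSONL lines."""
    return {json.loads(line)["topic"] for line in lines if line.strip()}


def overlapping_topics(splits):
    """Return {(split_a, split_b): shared_topics} for every pair sharing a topic."""
    overlaps = {}
    for (name_a, topics_a), (name_b, topics_b) in combinations(splits.items(), 2):
        shared = topics_a & topics_b
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Toy stand-ins for the real files (hypothetical records, not taken from the dataset).
toy_train = ['{"topic": "Batman"}', '{"topic": "Vampires"}']
toy_val = ['{"topic": "Edward Elric"}']

splits = {"train": topics_in_split(toy_train), "val": topics_in_split(toy_val)}
print(overlapping_topics(splits))  # an empty dict means the splits are topic-disjoint
```

With the real data, each list would be replaced by an open file handle for the corresponding split, since iterating over a file yields its lines.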

The answers in the dataset tend to be constructed with information from different elements, substantially more so than in other commonly used datasets in the field. This proves to be quite hard for NLP models, as it requires them to collate information from a large number of different sources.

The answers are also often formed of a combination of small factoids and a larger narrative that ties these factoids together with the answer to the original question. The model should be able to replicate this, which is more complex than giving a literal answer, such as a label or a direct reply with no further consideration.

These aspects tell us the dataset seeks specific pragmatic phenomena:
* The recognition of potential follow-up questions and the inclusion of their answers.
* Being cooperative in the conversation: A model should attempt to keep the conversation flowing with the answers it provides, whether by including relevant information that allows for further discussion or by other methods that humans employ (another one would be returning the question, or asking the student follow-up questions after providing sufficient information, but I'm not sure whether the dataset actually covers this case as well).
* Being selective in providing information: A model shouldn't just provide a list of connected data, but rather consider the question and the context in which it's asked, such as the background of the questioner, and provide relevant information based on these elements.

These aspects all serve to complicate the task for NLP models, and thus challenge them.


### Sample Analysis

1) The topic is 'Vampires', with the starting question being "So, what is a vampire, exactly?"

   The literal answer we'd expect from a non-cooperative teacher would be something along the lines of "An undead monster", which doesn't elaborate greatly on what differentiates vampires from the plethora of other undead monsters in various fictions, like zombies, or even ghosts. 

   The given answer in the dataset was as follows: "Vampires are a kind of undead monster that feeds on the life essence of living creatures like humans."
   This answer provides the full information we expect to see from a literal interpretation of the question - "A kind of undead monster", and further specifies that it 'feeds on the life essence of living creatures', prompting the student to liken them to beasts of prey, as they feed on other living beings. This lets the student distinguish between vampires and, say, ghosts, which are depicted as malevolent, metaphysical beings that exist to haunt people.

   * I've got to say that I don't actually like this answer since life essence is such a weird term. When has anyone ever seen a depiction of vampires that sustain themselves on something other than blood? just say blood...

2) The topic is 'The Wheel of Time', with the starting question being "who was the writer of the wheel of time?"

    The literal answer we'd expect from a non-cooperative teacher could simply state that the writer was Robert Jordan: the student asked about a single writer, and he is the one who wrote the majority of the books. However, it is known that Brandon Sanderson wrote the later books, so we expect a pragmatic answer to include this fact, along with the reason why Brandon Sanderson ended up writing the last few books instead of Robert Jordan (Robert Jordan died).
    

    The given answer in the dataset was as follows: "Robert Jordan is the author but he sadly passed away and his books were finished by Brandon Sanderson."
    This answer is more pragmatic as we can easily see the added information as something necessary - Most people wouldn't think beforehand that a book series was written by more than one person, and they would default to asking about a singular writer or author. This is despite them actually wanting to know about all the potential writers, if there were indeed multiple writers. We expect this basic level of inference in daily conversation, and this answer provides that. The literal answer, however, does not.


3) The topic is 'Cats Musical Wiki', with the starting question being "I am a student and know nothing about cats musical wiki".

    This is an interesting 'question' as it's not phrased as a question, but rather, it is a simple statement when taken literally. If used to initiate a conversation, the other party would recognize this as a request to learn about the topic, or an attempt to make small talk by giving the teacher leeway to introduce tidbits of information of their own choosing to the conversation, thus steering it in their desired direction. 

    Surprisingly enough, both the literal and pragmatic answer spans do not even mention the 'wiki', but rather just talk about the cats musical itself.

    The text provided by the dataset for the literal answer just includes details about the creator, the source material, and a range of dates and locations when and where it was played.

    The answer was as follows: "Cats was one of the longest running plays ever, starting in London and running for 21 years. I was lucky enough to sit in the audience in New York city for a performance once."

    We can see that the teacher actually chose to share their own experience regarding the musical, despite not being prompted for such a thing. They actively chose to steer the conversation towards that experience, thus proving to be a cooperative conversationalist (and the conversation actually continued down that path, with questions like "Where were you seated" and so on), rather than simply providing basic details and closing off the conversation, like we'd expect from a literal answer by a non-cooperative teacher.

4) The topic is 'Edward Elric', with the starting question being "Who is Edward Elric?".

    So a literal answer to this question would be quite succinct, such as "A fictional alchemist in 'The Fullmetal Alchemist'", "The main protagonist of 'The Fullmetal Alchemist' series" and so on. We'd expect a pragmatic answer to both combine these details and enrich the answer by giving context, such as information about the character's traits or background.

    The answer given does indeed fulfill these expectations: "Edward Elric is the main protagonist of the Fullmetal Alchemist series. Edward lost hist right arm and left leg due to a failed Human Transplantation attempt and became the youngest State Alchemist in history at the age of twelve."

    The answer includes more details than we'd expect from a purely literal interpretation, in line with what one would likely want to know when asking about a fictional character - like a background that hints at the character's motivation, and thus also the main plot of the story.

I will stop here since I think this section is wordy enough already.









## Part 1: The "Traditional" NLP Approach

In [1]:
import json
import os
import torch
import dspy
import numpy as np
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer
from sentence_transformers import SentenceTransformer
from bs4 import BeautifulSoup
from dspy.evaluate import SemanticF1 #no longer necessary
import configparser
from dspy.evaluate.auto_evaluation import (
    SemanticRecallPrecision,
    DecompositionalSemanticRecallPrecision
)
from dspy.predict.chain_of_thought import ChainOfThought
import warnings
import logging
import time

config = configparser.ConfigParser()
config.read('grok_key.ini')
api_key = config['DEFAULT']['XAI_API_KEY']

In [2]:
def get_first_questions(filepath='../PragmatiCQA/data/val.jsonl'):

    questions = []
    
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            conv = json.loads(line)
            first_qa = conv['qas'][0]

            #the spans will only include the text strings, not the keys
            literal_spans = []
            pragmatic_spans = []

            if 'literal_obj' in first_qa['a_meta']:
                for span_obj in first_qa['a_meta']['literal_obj']:
                    literal_spans.append(span_obj['text'])
                    
            if 'pragmatic_obj' in first_qa['a_meta']:
                for span_obj in first_qa['a_meta']['pragmatic_obj']:
                    pragmatic_spans.append(span_obj['text'])

            questions.append({
                'question': first_qa['q'],
                'gold_answer': first_qa['a'],
                'topic': conv['topic'],
                'genre': conv.get('genre', ''),
                'community': conv.get('community', ''),
                'literal_spans': literal_spans,
                'pragmatic_spans': pragmatic_spans
            })

    return questions

questions = get_first_questions()

In [3]:
for i in range(10):
    print(questions[i])

{'question': 'who is freddy krueger?', 'gold_answer': "Freddy Kruger is the nightmare in nighmare on Elm street. Please note, and to be very clear, the system that loads up wiki is not allowing access to Adam Prag, to the page... so I'll have to go from memory.  Normally you can paste things and back up what you are saying, but today that's not happening. alas.", 'topic': 'A Nightmare on Elm Street (2010 film)', 'genre': 'Movies', 'community': 'A Nightmare on Elm Street', 'literal_spans': ['Cannot GET /wiki/A%20N'], 'pragmatic_spans': ['Cannot GET /wiki/A%20N']}
{'question': 'who was the star on this movie?', 'gold_answer': "Robert Englund IS Freddy Kruger, the bad guy for these films. Note to you and to Adam, the Pragmatic one, the link here is broken and I can't paste relevant things, as has always been Nightmare's case, I'm perfectly good with answering your questions and will quickly do it, but have to open a tab in another window separate from the hit, I WILL go quickly and answer

As can be seen from this excerpt, a few questions have no usable literal or pragmatic spans: the spans contain only a "Cannot GET /wiki/..." error string. This is not an issue on my end, as the teachers themselves state in their answers that they cannot access these wikis (see first, third and fourth questions). 
Considering that, and considering that the NLP model requires context, I'm left with two choices:

1) Filter out the problematic questions
2) Keep them and set the context to be the same as the question, which will likely lead to errors and understate DistilBERT's performance.

We'll go with the filtering:

In [4]:
def filter_valid_questions(questions):
    
    valid_questions = []
    
    for q in questions:
        # Check if any spans are invalid (start with "Cannot GET /wiki/")
        invalid_literal = any(span.startswith("Cannot GET /wiki/") for span in q['literal_spans'])
        invalid_pragmatic = any(span.startswith("Cannot GET /wiki/") for span in q['pragmatic_spans'])
        
        # Keep only questions with valid spans in both configurations
        if not invalid_literal and not invalid_pragmatic:
            valid_questions.append(q)
    
    return valid_questions

# Execute filtering
print(f"Original questions: {len(questions)}")
questions = filter_valid_questions(questions)
print(f"Valid questions: {len(questions)}")

Original questions: 179
Valid questions: 174


Five questions have been filtered out, which is a small fraction of the original 179, so it shouldn't meaningfully affect our testing.

Below is our model which will handle all three contexts - literal, pragmatic and retrieved spans.

In [5]:
class DistilbertRAG:
    def __init__(self):
        self.tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased-distilled-squad')
        self.model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')

        retriever = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
        self.embedder = dspy.Embedder(retriever.encode)

        self.search_dict = {}  # a cache for retrievers

    def create_search(self, community, topk_docs_to_retrieve=5):
        
        if community in self.search_dict:
            return self.search_dict[community]

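        if not community:
            # Returning a string here would fail later, when the result is called as a retriever.
            raise ValueError("No community given; cannot build a retriever.")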
        
        directory = f'../PragmatiCQA-sources/{community}'
        corpus = []
        #just the read_html from rag.ipynb 
        for filename in os.listdir(directory):
            if filename.endswith(".html"):
                with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                    soup = BeautifulSoup(file, 'html.parser')
                    corpus.append(soup.get_text())

        search = dspy.retrievers.Embeddings(embedder=self.embedder, corpus=corpus, k=topk_docs_to_retrieve)
        self.search_dict[community] = search

        return search
    
    def answer(self, question, context_type):
        if context_type == 'literal':
            context = " ".join(question['literal_spans'])

        elif context_type == 'pragmatic':
            context = " ".join(question['pragmatic_spans'])

        elif context_type == 'rag':
            search = self.create_search(question['community'], topk_docs_to_retrieve=3)
            result = search(question['question'])
            
            # truncating each passage since we want to include multiple docs within a small token limit
            truncated_passages = []
            for passage in result.passages:
                truncated_passages.append((passage[:500] + "...") if len(passage) > 500 else passage)
            
            context = " ".join(truncated_passages)

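        else:
            # Callers index the result as a dict, so fail loudly instead of returning a string.
            raise ValueError(f"Invalid context_type: {context_type!r}")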

        # Calculate the space available for the context. Keeping the context short also matters
        # because it is fed to the LLM afterwards, and those tokens cost money.
        question_tokens = self.tokenizer.encode(question['question'], add_special_tokens=False)
        max_context_length = 512 - len(question_tokens) - 3  # 3 for special tokens [CLS], [SEP], [SEP]
        
        context_tokens = self.tokenizer.encode(context, add_special_tokens=False)
        if len(context_tokens) > max_context_length:
            context_tokens = context_tokens[:max_context_length]
            context = self.tokenizer.decode(context_tokens)
        
        inputs = self.tokenizer.encode_plus(
            question['question'], 
            context,
            add_special_tokens=True,
            max_length=512, #the model can't handle more than 512 tokens as input anyway, so we cap it to prevent errors.
            truncation='only_second',
            return_tensors='pt'
        )
        
        with torch.no_grad():
            outputs = self.model(**inputs)
        start_idx = torch.argmax(outputs.start_logits).item()
        end_idx = torch.argmax(outputs.end_logits).item()
        
        if end_idx < start_idx:
            end_idx = start_idx
            
        tokens = inputs['input_ids'][0][start_idx:end_idx+1]
        answer = self.tokenizer.decode(tokens, skip_special_tokens=True)
        
        return {
            "answer": answer.strip() if answer.strip() else "[No answer found]",
            "context": context
        }
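
The token budget used in `answer` can be illustrated on its own. This is a minimal sketch, assuming a whitespace split as a stand-in tokenizer (the class above uses `DistilBertTokenizer`); `fit_context` is a hypothetical helper, not part of the notebook.

```python
MAX_INPUT_TOKENS = 512   # input limit used in the class above
SPECIAL_TOKENS = 3       # [CLS] question [SEP] context [SEP]


def fit_context(question_tokens, context_tokens, max_len=MAX_INPUT_TOKENS):
    """Trim the context so question + context + special tokens fit in max_len."""
    budget = max_len - len(question_tokens) - SPECIAL_TOKENS
    return context_tokens[:max(budget, 0)]


# Hypothetical whitespace "tokenizer" standing in for the real subword tokenizer.
question = "who is freddy krueger?".split()
context = ("word " * 600).split()

trimmed = fit_context(question, context)
print(len(question), len(trimmed), len(question) + len(trimmed) + SPECIAL_TOKENS)
# prints: 4 505 512
```

With a real subword tokenizer, decoding the trimmed tokens and re-encoding the text may not give exactly the same token count, which is presumably why the class also passes `truncation='only_second'` as a safety net.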

So I have checked out SemanticF1: it does not report all three scores, only the F1 score. I went through all of its attributes and found nothing that exposes precision or recall. So instead I'll be using the function that SemanticF1 itself calls (according to the documentation), which does give me precision and recall.
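
For reference, F1 is the harmonic mean of precision and recall, so it can be recomputed once those two are available. A minimal sketch with illustrative values; the helper name `f1_from` and the small epsilon guarding against division by zero are assumptions of this sketch.

```python
def f1_from(precision, recall, eps=1e-8):
    """Harmonic mean of precision and recall; eps avoids division by zero."""
    return 2 * precision * recall / (precision + recall + eps)


print(f"{f1_from(0.25, 0.50):.2f}")   # prints 0.33
print(f"{f1_from(1.0, 1 / 3):.2f}")   # prints 0.50
print(f"{f1_from(0.0, 0.0):.2f}")     # prints 0.00
```

Note how a perfect precision still yields a modest F1 when recall is low, which is the typical pattern for short extractive answers scored against long gold answers.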

Below is a small test with two evaluator LMs, as I wanted to see whether I could get away with a local evaluator on my GPU, which has 6GB of VRAM. 

Spoilers: The local model performed pretty badly.

In [None]:
SigClass = DecompositionalSemanticRecallPrecision
sig_module = ChainOfThought(SigClass)

def semantic_scores(example, pred):
    scores = sig_module(
        question=example.question,
        ground_truth=example.response,
        system_response=pred.response
    )
    precision = scores.precision
    recall = scores.recall
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}

def evaluate_model(model, n=5):
    for i, q in enumerate(questions[:n]):
        print("="*60)
        print(f"Question {i+1}: {q['question']}")
        print("Gold Answer:", q['gold_answer'])

        for context_type in ["literal", "pragmatic", "rag"]:
            ans = model.answer(q, context_type)

            gold_ex = dspy.Example(
                question=q['question'],
                response=q['gold_answer'],
                inputs={'context': ans['context']}
            )
            pred_ex = dspy.Example(response=ans['answer'])

            scores = semantic_scores(gold_ex, pred_ex)

            print(f"\n{context_type.capitalize()} Answer: {ans['answer']}")
            print(f"  Precision: {scores['precision']:.2f}, Recall: {scores['recall']:.2f}, F1: {scores['f1']:.2f}\n")

In [None]:

model = DistilbertRAG()

print("="*30, "EVALUATION WITH QWEN 2.5", "="*30)

evaluator_lm_qwen = dspy.LM('ollama_chat/qwen2.5:3b', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=evaluator_lm_qwen)

evaluate_model(model, n=3)

print("="*30, "EVALUATION WITH GROK-3-MINI", "="*30)

evaluator_lm_grok = dspy.LM('xai/grok-3-mini', api_key=api_key)
dspy.configure(lm=evaluator_lm_grok)

evaluate_model(model, n=3)

Question 1: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill

Literal Answer: Bruce Wayne is born to Dr. Thomas Wayne and his wife Martha Kane, two very wealthy and charitable Gotham City socialites
  Precision: 0.25, Recall: 0.25, F1: 0.25

Pragmatic Answer: his parents were killed by a small - time criminal named Joe Chill
  Precision: 0.25, Recall: 0.50, F1: 0.33





Rag Answer: The Batman film franchise consists of a total of nine theatrical live - action films and two live - action serials featuring the DC Comics superhero Batman
  Precision: 0.00, Recall: 0.00, F1: 0.00

Question 2: what is batman's real name?
Gold Answer: Batman was created by Bob Kane and Bill Finger. His real identity is Bruce Wayne.

Literal Answer: Bruce Wayne
  Precision: 1.00, Recall: 0.33, F1: 0.50

Pragmatic Answer: Bruce Wayne
  Precision: 1.00, Recall: 0.33, F1: 0.50





Rag Answer: Bruce Wayne Aliases
  Precision: 1.00, Recall: 0.33, F1: 0.50

Question 3: How old was batman when he first became batman?
Gold Answer: I don't know. It is not clear when Bruce Wayne becomes Batman, but he becomes Batman sometime after his parents die.

Literal Answer: I don't know
  Precision: 0.00, Recall: 0.00, F1: 0.00

Pragmatic Answer: Bruce
  Precision: 0.00, Recall: 0.00, F1: 0.00

Rag Answer: February 23, 1948
  Precision: 0.00, Recall: 0.00, F1: 0.00

Question 1: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill

Literal Answer: Bruce Wayne is born to Dr. Thomas Wayne and his wife Martha Kane, two very wealthy and charitable Gotham City socialites
  Precision: 0.50, Recall: 0.25, F1: 0.33

Pragmatic Answer: his parents were killed by a small - time c

As can be seen above, the Qwen 2.5 model is not good at evaluating. Frankly, neither is grok-3-mini: answering "Bruce" to the question "How old was batman when he first became batman" should get a precision of near-zero, if not zero, and definitely not 1. Anyway, I'll be proceeding with grok-3-mini as the evaluator model. 
Woe to the grok budget...

In [19]:
def evaluate_all_questions(model, questions):
    # `time` and `warnings` are already imported in the first cell.
    # Suppress the output-format warnings, since they're frequent and noisy.
    warnings.filterwarnings("ignore", message="Failed to use structured output format")
    
    results = {"literal": [], "pragmatic": [], "rag": []}
    detailed_results = {"literal": [], "pragmatic": [], "rag": []}
    
    print(f"Evaluating {len(questions)} questions...")
    
    for i, q in enumerate(questions):
        print(f"Processing question {i + 1}/{len(questions)}.")
            
        for context_type in ["literal", "pragmatic", "rag"]:
            
            ans = model.answer(q, context_type)
            
            #creating an example like in the semanticf1 example. not sure if passing the context is strictly necessary; it'll be a massive waste of tokens in part 2.
            gold_ex = dspy.Example(
                question=q['question'],
                response=q['gold_answer'],
                inputs={'context': ans['context']}
            )
            pred_ex = dspy.Example(response=ans['answer'])
            
            scores = semantic_scores(gold_ex, pred_ex)
            results[context_type].append(scores)
            
            # storing data for future analysis
            detailed_results[context_type].append({
                "question": q['question'],
                "gold_answer": q['gold_answer'],
                "predicted_answer": ans['answer'],
                "context": ans['context'],
                "scores": scores
            })
            
            # short delay to avoid hitting limits
            time.sleep(3)
    
    # note that i can't use dspy.Evaluate since i need all three (or at least the first two) metrics individually...
    avg_results = {}
    for context_type in results:
        if results[context_type]:  
            avg_results[context_type] = {
                "precision": sum(s["precision"] for s in results[context_type]) / len(results[context_type]),
                "recall": sum(s["recall"] for s in results[context_type]) / len(results[context_type]),
                "f1": sum(s["f1"] for s in results[context_type]) / len(results[context_type])
            }
    
    json_data = {
        "summary": avg_results,
        "detailed_results": detailed_results
    }
    
    with open('part_1_eval.json', 'w') as f:
        json.dump(json_data, f)
    
    
    return avg_results

def print_results_table(avg_results):
    
    print("\nEVALUATION RESULTS")
    print("=" * 50)
    print(f"{'Context Type':<20} {'Precision':<12} {'Recall':<12} {'F1':<12}")
    print("-" * 50)
    
    for context_type in ["literal", "pragmatic", "rag"]:
        if context_type in avg_results:
            precision = avg_results[context_type]["precision"]
            recall = avg_results[context_type]["recall"]
            f1 = avg_results[context_type]["f1"]
            print(f"{context_type.capitalize():<20} {precision:<12f} {recall:<12f} {f1:<12f}")
        else:
            print(f"{context_type.capitalize():<20} {'N/A':<12} {'N/A':<12} {'N/A':<12}")

In [20]:
evaluator_lm_grok = dspy.LM('xai/grok-3-mini', api_key=api_key)
dspy.configure(lm=evaluator_lm_grok)

model = DistilbertRAG()

avg_results = evaluate_all_questions(model, questions)


Evaluating 174 questions...
Processing question 1/174.
Processing question 2/174.
Processing question 3/174.
Processing question 4/174.
Processing question 5/174.
...

#### Results:

In [21]:
print_results_table(avg_results)


EVALUATION RESULTS
Context Type         Precision    Recall       F1          
--------------------------------------------------
Literal              0.816571     0.289534     0.40379242303802887
Pragmatic            0.744828     0.268122     0.3715092992171465
Rag                  0.247222     0.085625     0.11761999779137157
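
One thing to keep in mind when reading this table: the F1 column is the mean of the per-question F1 scores (that is how `evaluate_all_questions` averages), so it need not equal the harmonic mean of the averaged precision and recall. A minimal sketch using the Literal row; the toy pairs at the end are hypothetical and only show why the two quantities can differ.

```python
def harmonic_mean(precision, recall):
    """F1 of a single precision/recall pair (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Averages copied from the Literal row of the table above.
avg_precision, avg_recall, mean_of_f1 = 0.816571, 0.289534, 0.403792

f1_of_averages = harmonic_mean(avg_precision, avg_recall)
print(f"F1 of averaged P/R:   {f1_of_averages:.4f}")
print(f"Mean per-question F1: {mean_of_f1:.4f}")

# Toy per-question scores (hypothetical): each question is all precision or all recall.
toy = [(1.0, 0.0), (0.0, 1.0)]
per_question_f1 = [harmonic_mean(p, r) for p, r in toy]
print(sum(per_question_f1) / len(toy))  # mean of per-question F1s: 0.0
print(harmonic_mean(0.5, 0.5))          # F1 of the averaged P and R: 0.5
```

The gap between the two numbers grows when precision and recall are unevenly distributed across questions, so the averaged columns should be compared column by column rather than recombined.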


### Analysis

Below I'll print some excerpts of low-scoring questions and then analyze them.

In [23]:
with open('part_1_eval.json', 'r') as f:
    data = json.load(f)

print("ANALYSIS OF EVALUATION RESULTS")
print("=" * 60)

# 1. Find RAG examples with F1 score of 0.0
print("\n1. RAG EXAMPLES WITH F1 SCORE = 0.0")
print("-" * 40)

rag_zero_examples = []
for i, result in enumerate(data['detailed_results']['rag']):
    if result['scores']['f1'] == 0.0:
        rag_zero_examples.append((i, result))

print(f"Found {len(rag_zero_examples)} RAG examples with F1 = 0.0")
print("\nShowing first 5 examples:\n")

for idx, (i, example) in enumerate(rag_zero_examples[:5]):
    print(f"Example {idx + 1} (Question #{i + 1}):")
    print(f"Question: {example['question']}")
    print(f"Gold Answer: {example['gold_answer']}")
    print(f"RAG Answer: {example['predicted_answer']}")
    print(f"F1 Score: {example['scores']['f1']}")
    print("-" * 40)

# 2. Find Pragmatic examples with F1 score <= 0.2
print("\n2. PRAGMATIC EXAMPLES WITH F1 SCORE <= 0.2")
print("-" * 40)

pragmatic_low_examples = []
for i, result in enumerate(data['detailed_results']['pragmatic']):
    if result['scores']['f1'] <= 0.2:
        pragmatic_low_examples.append((i, result))

print(f"Found {len(pragmatic_low_examples)} Pragmatic examples with F1 <= 0.2")
print("\nShowing first 5 examples:\n")

for idx, (i, example) in enumerate(pragmatic_low_examples[:5]):
    print(f"Example {idx + 1} (Question #{i + 1}):")
    print(f"Question: {example['question']}")
    print(f"Gold Answer: {example['gold_answer']}")
    print(f"Pragmatic Answer: {example['predicted_answer']}")
    print(f"Precision: {example['scores']['precision']:.3f}")
    print(f"Recall: {example['scores']['recall']:.3f}")
    print(f"F1 Score: {example['scores']['f1']:.3f}")
    print("-" * 40)


ANALYSIS OF EVALUATION RESULTS

1. RAG EXAMPLES WITH F1 SCORE = 0.0
----------------------------------------
Found 126 RAG examples with F1 = 0.0

Showing first 5 examples:

Example 1 (Question #1):
Question: Is the Batman comic similar to the movies?
Gold Answer: I would say the movie and comics has same story line, as Batmans parents were the most wealthy folks in Gotham city, and they were killd while returning from a function by a small time criminal called Joe Chill
RAG Answer: The Batman film franchise consists of a total of nine theatrical live - action films and two live - action serials featuring the DC Comics superhero Batman
F1 Score: 0.0
----------------------------------------
Example 2 (Question #3):
Question: How old was batman when he first became batman?
Gold Answer: I don't know. It is not clear when Bruce Wayne becomes Batman, but he becomes Batman sometime after his parents die.
RAG Answer: February 23, 1948
F1 Score: 0.0
----------------------------------------
Exa