# Part 0

## Summary of Key Motivations and Contributions

* Most QA systems today answer literally: they provide facts, but not necessarily what is most helpful in a conversation. For example, if you ask "Is there water on Mars?", a literal answer would be "Yes", but a friendly answer would be "Yes, but only in the form of ice caps near its poles."

* Recent open-domain question answering (QA) datasets still fall short on two crucial desiderata.
    1. Existing QA datasets only test whether systems can find literal answers, not whether they understand what the user really wants to know.
    2. Most of these datasets are crowd-sourced, which leaves them vulnerable to incentive misalignment between annotators and potential users.

* The contributions of the paper are:
    1. PragmatiCQA dataset: The authors created the first conversational QA dataset that includes pragmatic answers (going beyond literal responses) and developed new metrics to measure how well systems perform pragmatic reasoning.
    2. Improved data collection: They designed a crowdsourcing method that addresses the incentive misalignment problem, in which crowdworkers are not motivated the way real users would be. This produced more realistic, high-quality, and diverse conversation data.
    3. Current system analysis: They evaluated existing systems on the dataset and showed that it poses unique and important challenges that today's conversational QA systems do not handle well.

## Why is this dataset challenging?

* PragmatiCQA is hard for NLP models because it tests whether a model can hold a cooperative conversation rather than only return the literal answer. The dataset focuses on two human conversational skills: anticipating what the asker really wants to know (for example, explaining that Mars has water "but only as ice at the poles" instead of just saying "yes") and offering helpful context that keeps the conversation going (for example, mentioning that 23 places in our solar system have water). To do this, a model has to estimate what the person already knows, predict likely follow-up questions, and decide which extra information would actually be interesting. Humans provide this kind of context-aware help naturally, while current NLP systems miss over 90% of the useful extra information that humans include.

* It also targets several pragmatic phenomena:
    - **Implicature**: Providing information that is implied or suggested, not just what is explicitly asked.
    - **Presupposition**: Recognizing and addressing background assumptions in the question.
    - **Relevance**: Selecting information that is most useful or interesting to the user, not just factually correct.
    - **Anticipation of follow-up**: Including information that preempts likely follow-up questions, making the conversation smoother and more natural.
    - **Disambiguation**: Clarifying ambiguous questions by inferring the user's intent.

## Dataset samples

1. **The Legend of Zelda**
    - **Question**: "What year did the Legend of Zelda come out?"
    - **Literal Answer**: "1986"
    - **Pragmatic Answer**: "The Legend of Zelda came out as early as 1986 for the Famicom in Japan, and was later released in the western world, including Europe and the US in 1987. Would you like to know about the story?"
    - **Enrichment**: The pragmatic answer not only provides the year but also adds information about the release in different regions, and it anticipates a follow-up question by asking whether the user wants to know more about the story.
2. **Po**
    - **Question**: "who is the main character ?"
    - **Literal Answer**: "Po"
    - **Pragmatic Answer**: "The main character of the movie is a panda named Po. He is also known as the Dragon Warrior. The Dragon Warrior is a prodigy of legend who is described in the first film to know the secret behind the Dragon Scroll. The Dragon Scroll is a legendary scroll that contains the secret to \"limitless power\" and had been written by Oogway."
    - **Enrichment**: The pragmatic answer provides the name of the main character, Po, but also adds context about his role as the Dragon Warrior and the significance of the Dragon Scroll, which is relevant to understanding the character's importance in the story.
3. **Snoop Dogg**
    - **Question**: "Who is Snoop Dogg?"
    - **Literal Answer**: "Snoop Dogg is an American rapper."
    - **Pragmatic Answer**: "Snoop Dogg is an American rapper who has sold over 23 millions albums nationally and over 35 million albums worldwide. Snoop Dogg sings and writes his own songs and produces records and even acts in films and television with his own personality."
    - **Enrichment**: The pragmatic answer provides the basic fact that Snoop Dogg is a rapper but also adds significant context about his achievements, including album sales and his multifaceted career in music and entertainment, which makes the answer more informative and engaging.
4. **Scuderia Ferrari**
    - **Question**: "Who is Scuderia Ferrari?"
    - **Literal Answer**: "Scuderia Ferrari is the racing division of Ferrari."
    - **Pragmatic Answer**: "Scuderia Ferrari is the racing division of Ferrari, and they are one of the most successsful teams in F1. They've won the drivers' title 15 times since the 1950s, and started making their own cars since 1947."
    - **Enrichment**: The pragmatic answer not only identifies Scuderia Ferrari as the racing division of Ferrari but also provides context about their success in Formula 1, including their championship wins and history, which adds depth to the answer.

5. **King Julien**
    - **Question**: "Who is King Julien?"
    - **Literal Answer**: "King Julien is a character in the Madagascar franchise."
    - **Pragmatic Answer**: "King Julien XIII is one of the main characters in the Madagascar franchise. He is the King of the Kingdom of Madagascar, ruling the lemur kingdom since his uncle, King Julien XII abdicated in King Me."
    - **Enrichment**: The pragmatic answer provides the basic fact that King Julien is a character in the Madagascar franchise but also adds context about his role as the king of the lemur kingdom and his lineage, which enriches the understanding of his character within the story.
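
The evaluation code in Part 1 reads each conversation through the fields `topic`, `qas`, `q`, `a`, and `a_meta` (with `literal_obj` and `pragmatic_obj` span lists). The following is a minimal sketch of that access pattern on a hand-written record. The record values are hypothetical and only the field names are taken from the code in Part 1; the helper `join_spans` is likewise an illustrative name, not part of the dataset tooling.

```python
# Hypothetical record: field names mirror those accessed in Part 1;
# the values are illustrative and not copied from the dataset files.
record = {
    "topic": "The Legend of Zelda",
    "qas": [
        {
            "q": "What year did the Legend of Zelda come out?",
            "a": "The Legend of Zelda came out as early as 1986 for the Famicom in Japan.",
            "a_meta": {
                "literal_obj": [{"text": "The Legend of Zelda was released in 1986."}],
                "pragmatic_obj": [{"text": "It was released in Europe and the US in 1987."}],
            },
        }
    ],
}


def join_spans(turn, kind):
    """Join the text of all spans of one kind ('literal_obj' or 'pragmatic_obj')."""
    return " ".join(span["text"] for span in turn["a_meta"][kind])


first_turn = record["qas"][0]
literal_context = join_spans(first_turn, "literal_obj")
pragmatic_context = join_spans(first_turn, "pragmatic_obj")

print("Question:         ", first_turn["q"])
print("Literal context:  ", literal_context)
print("Pragmatic context:", pragmatic_context)
```

The literal spans carry the direct answer, while the pragmatic spans carry the additional material a cooperative answerer chose to include.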

# Part 1

In [19]:
import os
import dspy
import json
import types
import hashlib
from dspy.evaluate import SemanticF1
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Load environment variables and set up the language model
load_dotenv("grok_key.ini")
lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'], max_tokens=6000, temperature=0.1, top_p=0.9)
dspy.configure(lm=lm)

# Set up the retriever from rag.ipynb
model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1", device="cpu")
embedder = dspy.Embedder(model.encode)

# Load the pre-trained extractive QA model (returns a span copied from the context)
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def list_folders(directory="../PragmatiCQA-sources"):
    return [d for d in os.listdir(directory) if os.path.isdir(os.path.join(directory, d))]
folders = list_folders()

# Traverse a directory and read html files - extract text from the html files
def read_html_files(dir_name, directory="../PragmatiCQA-sources"):
    texts = []
    for filename in os.listdir(os.path.join(directory, dir_name)):
        if filename.endswith(".html"):
            with open(os.path.join(directory, dir_name, filename), 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file, 'html.parser')
                texts.append(soup.get_text())
    return texts

# Create retriever for a specific topic
def make_search(topic):
    corpus = read_html_files(topic)
    # Corpus size below which dspy uses exact (brute-force) search instead of a FAISS index
    brute_force_threshold = 10000
    topk_docs_to_retrieve = 5  # number of documents to retrieve per search query
    return dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve, brute_force_threshold=brute_force_threshold)

# Load PragmatiCQA dataset
def read_data(filename, dataset_dir="../PragmatiCQA/data"):
    corpus = []
    with open(os.path.join(dataset_dir, filename), 'r') as f:
        for line in f:
            corpus.append(json.loads(line))
    return corpus

class TraditionalQA:
    def __init__(self):
        self.qa_pipeline = qa_pipeline

    def answer_from_context(self, question, context):
        result = self.qa_pipeline(question=question, context=context)
        return result['answer']
    
    def answer_from_retriever(self, question, search):
        passages = search(question).passages
        context = " ".join(passages)
        return self.answer_from_context(question, context)
    
# Load the datasets
test_data = read_data("test.jsonl")
val_data = read_data("val.jsonl")

# Initialize the model
traditional_qa = TraditionalQA()

# Function to evaluate a configuration
def evaluate_configuration(dataset, config):
    examples = []
    predictions = []

    # Cache file for retrieved results
    cache_file = "retriever_cache.json"
    # Load cache if exists
    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            retriever_cache = json.load(f)
    else:
        retriever_cache = {}

    def get_cache_key(topic, question):
        key_str = f"{topic}|{question}"
        return hashlib.md5(key_str.encode()).hexdigest()

    for conversation in dataset:
        topic = conversation['topic']
        if topic not in folders:
            continue
        first_qa = conversation['qas'][0]

        question = first_qa['q']
        gold_answer = first_qa['a']

        if config == "Literal":
            lit_spans = [l['text'] for l in first_qa['a_meta']['literal_obj']]
            context = ' '.join(lit_spans)
            pred_answer = traditional_qa.answer_from_context(question, context)
        elif config == "Pragmatic":
            prag_spans = [l['text'] for l in first_qa['a_meta']['pragmatic_obj']]
            context = ' '.join(prag_spans)
            pred_answer = traditional_qa.answer_from_context(question, context)
        else:  # Retrieved
            cache_key = get_cache_key(topic, question)
            if cache_key in retriever_cache:
                pred_answer = retriever_cache[cache_key]
            else:
                search = make_search(topic)
                passages = search(question).passages
                context = " ".join(passages)
                pred_answer = traditional_qa.answer_from_context(question, context)
                retriever_cache[cache_key] = pred_answer
                
                with open(cache_file, "w", encoding="utf-8") as f:
                    json.dump(retriever_cache, f, ensure_ascii=False, indent=2)

        example = dspy.Example(question=question, response=gold_answer)
        pred = dspy.Example(question=question, response=pred_answer)

        examples.append(example)
        predictions.append(pred)

    return examples, predictions

Device set to use cpu
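
`evaluate_configuration` rewrites the whole JSON cache file after every cache miss. One possible variant, sketched here under the assumption that losing the in-memory entries of an interrupted run is acceptable, keeps the cache in memory during the loop and writes it once through a temporary file and an atomic rename. The names `cached_call`, `save_cache`, and `fake_qa` are hypothetical helpers for illustration, not part of the notebook.

```python
import json
import os
import tempfile


def cached_call(cache, key, compute):
    """Return cache[key], calling compute() and storing the result only on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]


def save_cache(cache, path):
    """Write to a temp file in the same directory, then rename it over the target.

    The rename is atomic on the same filesystem, so an interrupted run
    cannot leave a half-written JSON file behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)


# Stand-in for the expensive retrieve-then-answer step.
calls = []


def fake_qa():
    calls.append(1)
    return "1986"


cache = {}
first = cached_call(cache, "zelda|year", fake_qa)
second = cached_call(cache, "zelda|year", fake_qa)  # served from the cache

cache_path = os.path.join(tempfile.mkdtemp(), "demo_cache.json")
save_cache(cache, cache_path)
with open(cache_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(first, second, len(calls), reloaded)
```

Writing after every miss, as the notebook does, is the safer choice when each answer is costly to recompute; the single write mainly saves I/O on large caches.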


In [20]:
print("Running the model on the PRAGMATICQA test set...")
test_predictions = {}
for config in ["Literal", "Pragmatic", "Retrieved"]:
    print(f"Evaluating configuration: {config}")
    examples, predictions = evaluate_configuration(test_data, config)
    test_predictions[config] = {
        'examples': examples,
        'predictions': predictions,
    }

# Evaluate the predictions using SemanticF1
print("\nEvaluating predictions using SemanticF1...")
metric = SemanticF1(decompositional=True)

test_results = {}

for config in ["Literal", "Pragmatic", "Retrieved"]:
    examples = test_predictions[config]['examples']
    predictions = test_predictions[config]['predictions']

    f1_scores = []

    for example, prediction in zip(examples, predictions):
        score = metric(example, prediction)
        f1_scores.append(score)

    avg_f1 = sum(f1_scores) / len(f1_scores)
    test_results[config] = {
        'f1': avg_f1, 
        'count': len(examples)
    }

# Display test results table
print("\nTest set results:")
print("Configuration | F1 Score | Count")
print("-" * 35)
for config, results in test_results.items():
    print(f"{config:<13} | {results['f1']:.4f}   | {results['count']}")

Running the model on the PRAGMATICQA test set...
Evaluating configuration: Literal
Evaluating configuration: Pragmatic
Evaluating configuration: Retrieved

Evaluating predictions using SemanticF1...

Test set results:
Configuration | F1 Score | Count
-----------------------------------
Literal       | 0.4274   | 120
Pragmatic     | 0.3544   | 120
Retrieved     | 0.1223   | 120
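
One plausible reason the scores stay below 0.5 even when the gold spans are given as context is that the extractive model returns a short span, while the gold answers are full pragmatic responses. The sketch below illustrates this with plain token overlap on the Zelda example from Part 0. Token overlap is only a stand-in chosen for illustration: `SemanticF1` relies on a language model to judge precision and recall, so its numbers will differ.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_scores(prediction, gold):
    """Token-level precision and recall (a simple stand-in, not SemanticF1)."""
    pred_counts = Counter(tokens(prediction))
    gold_counts = Counter(tokens(gold))
    shared = sum((pred_counts & gold_counts).values())
    precision = shared / max(sum(pred_counts.values()), 1)
    recall = shared / max(sum(gold_counts.values()), 1)
    return precision, recall


gold = (
    "The Legend of Zelda came out as early as 1986 for the Famicom in Japan, "
    "and was later released in the western world, including Europe and the US "
    "in 1987. Would you like to know about the story?"
)

# An extractive model would typically return just the span "1986".
span_precision, span_recall = overlap_scores("1986", gold)
print(f"precision={span_precision:.2f} recall={span_recall:.3f}")
```

The short span is fully supported by the gold answer, so precision is high, but it covers only a small fraction of what the gold answer says, so recall is low.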


In [21]:
print("Running the model on the PRAGMATICQA val set...")
val_predictions = {}
for config in ["Literal", "Pragmatic", "Retrieved"]:
    print(f"Evaluating configuration: {config}")
    examples, predictions = evaluate_configuration(val_data, config)
    val_predictions[config] = {
        'examples': examples,
        'predictions': predictions,
    }

def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Wrap the forward method to extract scores manually
def patched_forward(self, example, pred, trace=None):
    scores = self.module(
        question=example.question,
        ground_truth=example.response,
        system_response=pred.response
    )
    f1 = f1_score(scores.precision, scores.recall)
    return {
        "precision": scores.precision,
        "recall": scores.recall,
        "f1": f1
    }

# Evaluate the predictions using SemanticF1
print("\nEvaluating predictions using SemanticF1...")
metric = SemanticF1(decompositional=True)
metric.forward = types.MethodType(patched_forward, metric)

val_results = {}

score_cache_file = "val_score_cache.json"
if os.path.exists(score_cache_file):
    with open(score_cache_file, "r", encoding="utf-8") as f:
        score_cache = json.load(f)
else:
    score_cache = {}

for config in ["Literal", "Pragmatic", "Retrieved"]:
    print(f"Evaluating configuration: {config}")
    examples = val_predictions[config]['examples']
    predictions = val_predictions[config]['predictions']

    precision_scores = []
    recall_scores = []
    f1_scores = []

    for example, prediction in zip(examples, predictions):
        # Create a unique key for each example-prediction pair
        cache_key = f"{config}|{example.question}|{prediction.response}"

        if cache_key not in score_cache:
            result = metric(example, prediction)
            score = [result["precision"], result["recall"], result["f1"]]
            score_cache[cache_key] = score
            # Save the scores to a JSON file
            with open(score_cache_file, "w", encoding="utf-8") as f:
                json.dump(score_cache, f, ensure_ascii=False, indent=2)
        else:
            score = score_cache[cache_key]

        # Extract precision, recall, F1 if available in decomposed format
        precision = score[0]
        recall = score[1]
        f1 = score[2]

        precision_scores.append(precision)
        recall_scores.append(recall)
        f1_scores.append(f1)

    avg_precision = sum(precision_scores) / len(precision_scores)
    avg_recall = sum(recall_scores) / len(recall_scores)
    avg_f1 = sum(f1_scores) / len(f1_scores)

    val_results[config] = {
        'precision': avg_precision,
        'recall': avg_recall,
        'f1': avg_f1,
        'count': len(examples)
    }

# Display validation results table
print("\nValidation set results:")
print("Configuration | Precision | Recall | F1 Score | Count")
print("-" * 60)
for config, results in val_results.items():
    print(f"{config:<13} | {results['precision']:.4f}   | {results['recall']:.4f}  | {results['f1']:.4f}   | {results['count']}")

print("\nAnalysis:")
best_test_config = max(test_results, key=lambda k: test_results[k]['f1'])
best_val_config = max(val_results, key=lambda k: val_results[k]['f1'])

print(f"Best configuration on test set: {best_test_config} with F1 score {test_results[best_test_config]['f1']:.4f}")
print(f"Best configuration on validation set: {best_val_config} with F1 score {val_results[best_val_config]['f1']:.4f}")

cost = sum([x['cost'] for x in lm.history if x['cost'] is not None])
print(f"Total Cost: {cost:.2f} usd")

Running the model on the PRAGMATICQA val set...
Evaluating configuration: Literal
Evaluating configuration: Pragmatic
Evaluating configuration: Retrieved

Evaluating predictions using SemanticF1...
Evaluating configuration: Literal
Evaluating configuration: Pragmatic
Evaluating configuration: Retrieved

Validation set results:
Configuration | Precision | Recall | F1 Score | Count
------------------------------------------------------------
Literal       | 0.8617   | 0.2738  | 0.4003   | 135
Pragmatic     | 0.8037   | 0.2585  | 0.3714   | 135
Retrieved     | 0.0914   | 0.0324  | 0.0394   | 135

Analysis:
Best configuration on test set: Literal with F1 score 0.4274
Best configuration on validation set: Literal with F1 score 0.4003
Total Cost: 0.13 usd
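
The F1 column above is the mean of per-example F1 scores, which is generally not the same as the F1 computed from the mean precision and mean recall. A small sketch of the difference, using the Literal row of the validation table and a hypothetical two-example toy set:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Literal row of the validation table above.
avg_precision, avg_recall, reported_mean_f1 = 0.8617, 0.2738, 0.4003
f1_of_averages = f1(avg_precision, avg_recall)
print(f"F1 of averaged P/R: {f1_of_averages:.4f}")
print(f"Mean per-example F1 (reported): {reported_mean_f1:.4f}")

# Toy illustration with two hypothetical (precision, recall) pairs.
pairs = [(1.0, 0.1), (0.5, 0.5)]
mean_of_f1 = sum(f1(p, r) for p, r in pairs) / len(pairs)
f1_of_means = f1(
    sum(p for p, _ in pairs) / len(pairs),
    sum(r for _, r in pairs) / len(pairs),
)
print(f"mean of F1: {mean_of_f1:.4f}, F1 of means: {f1_of_means:.4f}")
```

Because the two aggregates differ, the F1 column should not be re-derived from the precision and recall columns when comparing configurations.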
