# ImpPres with LLM

In this notebook, implement an improved ImpPres classifier using an LLM.
The classifier must be implemented with DSPy.


In [2]:
# Configure the DSPy environment with the language model.
# For Grok, the API key must be set in os.environ['XAI_API_KEY']
# and the model name is "xai/grok-3-mini".

import os
import dspy

from dotenv import load_dotenv
load_dotenv("grok_key.ini") 

lm = dspy.LM('xai/grok-3-mini', api_key=os.environ['XAI_API_KEY'])
# for ollama 
# lm = dspy.LM('ollama_chat/devstral', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)

In [3]:
from typing import Literal, List

## Implement the DSPy classifier program.
class ParadigmClassifier(dspy.Signature):
    pairs: str = dspy.InputField(desc="All premise-hypothesis pairs, numbered and separated by |")
    predictions: List[Literal['entailment', 'neutral', 'contradiction']] = dspy.OutputField(desc="List of predictions for each pair")

classifier = dspy.Predict(ParadigmClassifier)

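def classify(paradigm_pairs):
    """Classify every premise-hypothesis pair of one paradigm in a single LLM call."""
    # Number each pair so the model can keep its predictions in order.
    pairs = [
        f"{i + 1}. Premise: {pair['premise']}, Hypothesis: {pair['hypothesis']}"
        for i, pair in enumerate(paradigm_pairs)
    ]

    # Join the pairs with " | ", the separator announced in the signature.
    pairs_str = " | ".join(pairs)
    results = classifier(pairs=pairs_str)

    # One label per pair is expected; the length is not checked here.
    return results.predictions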

## Load ImpPres Dataset

In [4]:
from datasets import load_dataset

sections = ['presupposition_all_n_presupposition', 
            'presupposition_both_presupposition', 
            'presupposition_change_of_state', 
            'presupposition_cleft_existence', 
            'presupposition_cleft_uniqueness', 
            'presupposition_only_presupposition', 
            'presupposition_possessed_definites_existence', 
            'presupposition_possessed_definites_uniqueness', 
            'presupposition_question_presupposition']

dataset = {}
for section in sections:
    print(f"Loading dataset for section: {section}")
    dataset[section] = load_dataset("facebook/imppres", section)

Loading dataset for section: presupposition_all_n_presupposition
Loading dataset for section: presupposition_both_presupposition
Loading dataset for section: presupposition_change_of_state
Loading dataset for section: presupposition_cleft_existence
Loading dataset for section: presupposition_cleft_uniqueness
Loading dataset for section: presupposition_only_presupposition
Loading dataset for section: presupposition_possessed_definites_existence
Loading dataset for section: presupposition_possessed_definites_uniqueness
Loading dataset for section: presupposition_question_presupposition


## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the classifier.


In [5]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [6]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [7]:
from tqdm import tqdm
from collections import defaultdict
import random

def evaluate_paradigm_section(dataset):
    random.seed(42)
    
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    paradigm_scores = []

    # Make paradigm groups
    paradigms = []
    for i in range(0, len(dataset), 19):
        if i + 19 <= len(dataset):
            paradigm = []
            for j in range(19):
                example = dataset[i + j]
                paradigm.append({
                    'premise': example['premise'],
                    'hypothesis': example['hypothesis'],
                    'gold_label': label_names[example['gold_label']]
                })
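            # Shuffle the pairs within the paradigm (seeded above for reproducibility).
            # Note: results are stored in this shuffled order, so a list position
            # no longer identifies a transformation type.
            random.shuffle(paradigm)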
            paradigms.append(paradigm)

    for paradigm in tqdm(paradigms):
        pairs = [{'premise': pair['premise'], 'hypothesis': pair['hypothesis']} for pair in paradigm]
        predictions = classify(pairs)

        # Calculate accuracy score
        correct = 0
        for pred, gold in zip(predictions, paradigm):
            if pred == gold['gold_label']:
                correct += 1
        accuracy = correct / len(predictions)
    
        # Calculate consistency score - the proportion of the most common prediction
        pred_counts = defaultdict(int)
        for pred in predictions:
            pred_counts[pred] += 1
        consistency = max(pred_counts.values()) / len(predictions)

        combined_score = (accuracy + consistency) / 2
        paradigm_scores.append({
            'consistency': consistency,
            'combined': combined_score
        })

        paradigm_results = []
        for pred, gold in zip(predictions, paradigm):
            paradigm_results.append({
                'pred_label': pred,
                'gold_label': gold['gold_label']
            })  
        results.append(paradigm_results)

    return results, paradigm_scores                        

In [8]:
import pandas as pd
from IPython.display import display

accuracies = []
precisions = []
recalls = []
f1s = []
results_table = []
combined_scores = []
transformation_results = []

for section in sections:
    print(f"Working on section: {section}")
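    # The split name is the section name without the "presupposition_" prefix.
    sec = section.removeprefix("presupposition_")
    data = dataset[section][sec]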

    results, paradigm_scores = evaluate_paradigm_section(data)

    # Calculate metrics
    predictions = []
    references = []

    for paradigm in results:
        for result in paradigm:
            predictions.append(result['pred_label'])
            references.append(result['gold_label'])

    label_to_int = {"entailment": 0, "neutral": 1, "contradiction": 2}
    predictions_int = [label_to_int[pred] for pred in predictions]
    references_int = [label_to_int[ref] for ref in references]

    accuracy_score = accuracy.compute(predictions=predictions_int, references=references_int)['accuracy']
    f1_score = f1.compute(predictions=predictions_int, references=references_int, average='macro')['f1']
    precision_score = precision.compute(predictions=predictions_int, references=references_int, average='macro', zero_division=0)['precision']
    recall_score = recall.compute(predictions=predictions_int, references=references_int, average='macro', zero_division=0)['recall']

    accuracies.append(accuracy_score)
    precisions.append(precision_score)
    recalls.append(recall_score)
    f1s.append(f1_score)

    consistency = sum([p['consistency'] for p in paradigm_scores]) / len(paradigm_scores)
    combined_score = sum([p['combined'] for p in paradigm_scores]) / len(paradigm_scores)

    # Calculate metrics for each position within a paradigm.
    # Note: evaluate_paradigm_section shuffles each paradigm, so a position
    # does not map to a fixed transformation type.
    transformation_metrics = []
    for position in range(19):
        type_predictions = []
        type_references = []

        for paradigm in results:
            if position < len(paradigm):
                type_predictions.append(paradigm[position]['pred_label'])
                type_references.append(paradigm[position]['gold_label'])

        type_predictions_int = [label_to_int[pred] for pred in type_predictions]
        type_references_int = [label_to_int[ref] for ref in type_references]

        type_accuracy = accuracy.compute(predictions=type_predictions_int, references=type_references_int)['accuracy']
        type_f1 = f1.compute(predictions=type_predictions_int, references=type_references_int, average='macro')['f1']
        type_precision = precision.compute(predictions=type_predictions_int, references=type_references_int, average='macro', zero_division=0)['precision']
        type_recall = recall.compute(predictions=type_predictions_int, references=type_references_int, average='macro', zero_division=0)['recall']

        transformation_metrics.append({
            'Accuracy': f"{type_accuracy:.2f}",
            'Precision': f"{type_precision:.2f}",
            'Recall': f"{type_recall:.2f}",
            'F1': f"{type_f1:.2f}"
        })

    transformation_results.append(transformation_metrics)

    results_table.append({
        'Section': section,
        'Accuracy': f"{accuracy_score:.2f}",
        'Precision': f"{precision_score:.2f}",
        'Recall': f"{recall_score:.2f}",
        'F1': f"{f1_score:.2f}",
        'Consistency': f"{consistency:.2f}",
        'Combined': f"{combined_score:.2f}"
    })

# Calculate overall metrics
accuracy_all = sum(accuracies) / len(accuracies)
precision_all = sum(precisions) / len(precisions)
recall_all = sum(recalls) / len(recalls)
f1_all = sum(f1s) / len(f1s)
consistency_all = sum([float(r['Consistency']) for r in results_table]) / len(results_table)
combined_all = sum([float(r['Combined']) for r in results_table]) / len(results_table)

results_table.append({
    'Section': 'Overall',
    'Accuracy': f"{accuracy_all:.2f}",
    'Precision': f"{precision_all:.2f}",
    'Recall': f"{recall_all:.2f}",
    'F1': f"{f1_all:.2f}",
    'Consistency': f"{consistency_all:.2f}",
    'Combined': f"{combined_all:.2f}"
})

# Display section results as a table
print("\nSection Results:")
results_df = pd.DataFrame(results_table)
styled_df = results_df.style.set_properties(**{'text-align': 'center'})
styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
display(styled_df)

# Display transformation results as a table
transformations = []
num_sections = len(transformation_results)
num_transformations = 19

for t in range(num_transformations):
    acc_sum = 0
    prec_sum = 0
    rec_sum = 0
    f1_sum = 0
    for section_metrics in transformation_results:
        metrics = section_metrics[t]
        acc_sum += float(metrics['Accuracy'])
        prec_sum += float(metrics['Precision'])
        rec_sum += float(metrics['Recall'])
        f1_sum += float(metrics['F1'])
    transformations.append({
        'Accuracy': f"{acc_sum / num_sections:.2f}",
        'Precision': f"{prec_sum / num_sections:.2f}",
        'Recall': f"{rec_sum / num_sections:.2f}",
        'F1': f"{f1_sum / num_sections:.2f}"
    })

print("Transformation Results:")
transformation_df = pd.DataFrame(transformations)
styled_df = transformation_df.style.set_properties(**{'text-align': 'center'})
styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])
display(styled_df)

Working on section: presupposition_all_n_presupposition


100%|██████████| 100/100 [00:00<00:00, 501.25it/s]


Working on section: presupposition_both_presupposition


100%|██████████| 100/100 [00:00<00:00, 558.00it/s]


Working on section: presupposition_change_of_state


100%|██████████| 100/100 [00:00<00:00, 620.10it/s]


Working on section: presupposition_cleft_existence


100%|██████████| 100/100 [00:00<00:00, 605.18it/s]


Working on section: presupposition_cleft_uniqueness


100%|██████████| 100/100 [00:00<00:00, 590.38it/s]


Working on section: presupposition_only_presupposition


100%|██████████| 100/100 [00:00<00:00, 569.96it/s]


Working on section: presupposition_possessed_definites_existence


100%|██████████| 100/100 [00:00<00:00, 628.39it/s]


Working on section: presupposition_possessed_definites_uniqueness


100%|██████████| 100/100 [00:00<00:00, 651.91it/s]


Working on section: presupposition_question_presupposition


100%|██████████| 100/100 [00:00<00:00, 429.88it/s]



Section Results:


|   | Section | Accuracy | Precision | Recall | F1 | Consistency | Combined |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | presupposition_all_n_presupposition | 0.82 | 0.85 | 0.81 | 0.83 | 0.54 | 0.68 |
| 1 | presupposition_both_presupposition | 0.70 | 0.83 | 0.67 | 0.69 | 0.68 | 0.69 |
| 2 | presupposition_change_of_state | 0.54 | 0.60 | 0.48 | 0.45 | 0.80 | 0.67 |
| 3 | presupposition_cleft_existence | 0.64 | 0.80 | 0.58 | 0.58 | 0.75 | 0.70 |
| 4 | presupposition_cleft_uniqueness | 0.51 | 0.78 | 0.44 | 0.39 | 0.89 | 0.70 |
| 5 | presupposition_only_presupposition | 0.64 | 0.82 | 0.58 | 0.60 | 0.76 | 0.70 |
| 6 | presupposition_possessed_definites_existence | 0.85 | 0.88 | 0.83 | 0.85 | 0.55 | 0.70 |
| 7 | presupposition_possessed_definites_uniqueness | 0.47 | 0.71 | 0.39 | 0.31 | 0.93 | 0.70 |
| 8 | presupposition_question_presupposition | 0.65 | 0.80 | 0.60 | 0.62 | 0.73 | 0.69 |
| 9 | Overall | 0.65 | 0.79 | 0.60 | 0.59 | 0.74 | 0.69 |


Transformation Results:


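|   | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| 0 | 0.65 | 0.79 | 0.59 | 0.59 |
| 1 | 0.65 | 0.82 | 0.58 | 0.58 |
| 2 | 0.67 | 0.80 | 0.62 | 0.62 |
| 3 | 0.63 | 0.79 | 0.62 | 0.60 |
| 4 | 0.72 | 0.80 | 0.61 | 0.63 |
| 5 | 0.62 | 0.79 | 0.57 | 0.56 |
| 6 | 0.67 | 0.78 | 0.62 | 0.61 |
| 7 | 0.61 | 0.74 | 0.60 | 0.58 |
| 8 | 0.65 | 0.74 | 0.60 | 0.59 |
| 9 | 0.66 | 0.75 | 0.59 | 0.59 |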


## Approach Explanation

**Batch Processing**: I concatenate all pairs of a paradigm into one numbered string and have `ParadigmClassifier` predict the labels for all of them in a single call. The model therefore sees every transformation of the same presupposition together, which may help it recognize the pattern that links them. Classifying the pairs one at a time would discard this context, so the model could not relate one transformation to another.
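
One risk of batching is that the model may return more or fewer labels than there are pairs, in which case `zip` silently drops the unmatched items. Below is a minimal sketch of a guard. The helper `align_predictions` and the `"neutral"` fallback are assumptions made for illustration; they are not part of the notebook's code.

```python
VALID_LABELS = ("entailment", "neutral", "contradiction")

def align_predictions(predictions, n_pairs, fallback="neutral"):
    """Return exactly n_pairs labels (hypothetical helper).

    Extra labels are dropped; invalid or missing labels are replaced
    with `fallback`, which is an arbitrary choice for this sketch.
    """
    cleaned = [p if p in VALID_LABELS else fallback for p in predictions[:n_pairs]]
    cleaned += [fallback] * (n_pairs - len(cleaned))
    return cleaned

# Two labels returned for three pairs, one of them invalid.
print(align_predictions(["entailment", "maybe"], 3))
# ['entailment', 'neutral', 'neutral']
```

Padding with a fixed label keeps the evaluation loop running, but it also inflates the consistency score slightly, so counting how often the guard fires may be worthwhile.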

**Equal-weight scoring**: The combined score is the unweighted mean of accuracy and consistency, because both matter for this task. Accuracy measures how many predictions are correct, while consistency, computed as the share of the most common predicted label within a paradigm, measures how uniform the predictions are. Weighting one of them more heavily would let that property dominate the combined score.
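
As a worked example, the three per-paradigm scores can be computed as follows. This is a sketch that mirrors the formulas in `evaluate_paradigm_section`; the function name `paradigm_scores_for` and the four-pair toy paradigm are made up for illustration (real paradigms have 19 pairs).

```python
from collections import Counter

def paradigm_scores_for(predictions, gold_labels):
    """Return (accuracy, consistency, combined) for one paradigm."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    accuracy = correct / len(predictions)
    # Consistency: share of the most common predicted label.
    consistency = Counter(predictions).most_common(1)[0][1] / len(predictions)
    combined = (accuracy + consistency) / 2
    return accuracy, consistency, combined

# Toy paradigm with four pairs (hypothetical data).
gold = ["entailment", "entailment", "neutral", "contradiction"]
pred = ["entailment", "entailment", "entailment", "contradiction"]

print(paradigm_scores_for(pred, gold))  # (0.75, 0.75, 0.75)
```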

## Results Analysis

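**Overall Performance**: Averaged over the nine sections, the model achieves 0.65 accuracy and 0.74 consistency, for a combined score of 0.69. Accuracy is moderate, and on average about three quarters of the predictions within a paradigm share the same label. Macro-averaged precision (0.79) is noticeably higher than recall (0.60) and F1 (0.59).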

**Section Performance**: Performance varies considerably across presupposition types. `possessed_definites_existence` (0.85 accuracy) and `all_n_presupposition` (0.82 accuracy) performed best, while `possessed_definites_uniqueness` (0.47 accuracy) and `cleft_uniqueness` (0.51 accuracy) were the most challenging.

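**Consistency vs Accuracy Trade-off**: Sections with lower accuracy often show higher consistency. For example, `possessed_definites_uniqueness` has only 0.47 accuracy but 0.93 consistency, while `possessed_definites_existence` has 0.85 accuracy with 0.55 consistency. This follows from how consistency is defined: it is the share of the most common predicted label, so a model that assigns nearly the same label to every pair in a paradigm scores high on consistency even when many of those labels are wrong.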
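
The effect can be reproduced on toy data. The sketch below applies the same consistency formula to a hypothetical gold-label mix (not the real ImpPres distribution) and to a classifier that always answers `neutral`.

```python
from collections import Counter

def consistency(labels):
    """Share of the most frequent label in the list."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical gold labels for one paradigm (illustrative only).
gold = ["entailment"] * 5 + ["neutral"] * 3 + ["contradiction"] * 2
constant = ["neutral"] * len(gold)

# A perfect classifier reproduces the gold labels, so its consistency
# equals the share of the most common gold label.
perfect_consistency = consistency(gold)        # 0.5
constant_consistency = consistency(constant)   # 1.0
constant_accuracy = sum(p == g for p, g in zip(constant, gold)) / len(gold)  # 0.3

print(perfect_consistency, constant_consistency, constant_accuracy)
```

Under this definition a perfect classifier cannot exceed the majority share of the gold labels, which is why high consistency alone is not evidence of good predictions.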

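**Transformation Analysis**: The per-position scores shown above are all similar (0.61-0.72 accuracy), with position 4 highest (0.72) and position 7 lowest (0.61). Because each paradigm is shuffled before classification and the results are stored in shuffled order, a position in this table does not correspond to a fixed transformation type. Each row is effectively an average over a random mix of transformations, which explains the uniformity, so the table cannot show which transformations are more challenging than others.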
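
One way to make per-transformation metrics meaningful is to remember each pair's original position before shuffling and to map the predictions back afterwards. The sketch below illustrates the idea; `shuffle_with_positions` and `restore_order` are hypothetical helpers, and the list comprehension stands in for the call to `classify`.

```python
import random

def shuffle_with_positions(items, rng):
    """Shuffle items; positions[k] is the original index of shuffled[k]."""
    positions = list(range(len(items)))
    rng.shuffle(positions)
    return [items[i] for i in positions], positions

def restore_order(shuffled_values, positions):
    """Put values produced in shuffled order back into original order."""
    restored = [None] * len(positions)
    for value, original_index in zip(shuffled_values, positions):
        restored[original_index] = value
    return restored

rng = random.Random(42)
pairs = ["pair_0", "pair_1", "pair_2", "pair_3"]

shuffled, positions = shuffle_with_positions(pairs, rng)
# Stand-in for classify(shuffled): one output per shuffled pair.
predictions = [f"label_for_{p}" for p in shuffled]

restored = restore_order(predictions, positions)
# restored[i] now belongs to pairs[i], so index i can be read
# as a transformation type again.
```

With this change the model would still see the pairs in random order, while the stored results would line up with the dataset's original ordering.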
