# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to run natural language inference (NLI) on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 and follows a BERT-like encoding approach:
* The premise and the hypothesis are encoded together, as one input pair, by the DeBERTa-v3-base contextual encoder
* A fine-tuned classification head then maps this encoding to a distribution over the classification labels (entailment, neutral, contradiction)
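
The last step, turning the classifier's raw scores (logits) into a distribution over the three labels, can be sketched without loading the model. The logits below are made-up values for illustration only, not real model output; the label order matches the one used in the cells further down.

```python
import math

# Label order used throughout this notebook.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, chosen only to illustrate the computation.
example_logits = [-1.2, 0.3, 2.1]
probs = softmax(example_logits)
distribution = dict(zip(LABEL_NAMES, probs))
print(distribution)
```

The largest logit always receives the largest probability, so the predicted label is the same whether it is read from the logits or from the softmax output.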

Reported accuracy on ANLI round 3 (the `test_r3` split) is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)



In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
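model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # keep the model on the same device as the inputs
model.eval()      # inference only: disable dropout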

In [4]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")  # avoid shadowing the built-in input()
with torch.no_grad():  # no gradients needed for inference
    output = model(inputs["input_ids"].to(device))  # device = "cuda:0" or "cpu"
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}


In [5]:
def evaluate(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no gradients needed for inference
        output = model(inputs["input_ids"].to(device))
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction

In [6]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [7]:
def get_prediction(pred_dict):
    if pred_dict["entailment"] > pred_dict["contradiction"]  and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"]  and pred_dict["contradiction"] > pred_dict["neutral"]:
        return "contradiction"
    else:
        return "neutral"
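
A more compact way to pick the top label is `max` with a key function. This is a sketch rather than a drop-in replacement: `get_prediction` above falls back to "neutral" on exact ties, whereas `max` returns the first tied label in the dictionary's order. The name `get_prediction_argmax` is hypothetical.

```python
def get_prediction_argmax(pred_dict):
    """Return the label with the highest score.

    On exact ties, max() returns the first tied key in the dict's
    insertion order, which differs from the "neutral" fallback above.
    """
    return max(pred_dict, key=pred_dict.get)

# Scores taken from the first example output in this notebook.
scores = {"entailment": 6.6, "neutral": 17.3, "contradiction": 76.1}
top_label = get_prediction_argmax(scores)
print(top_label)
```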

In [8]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [9]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [10]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [11]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a non-empty annotator reason
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [12]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [13]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [14]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [02:28<00:00,  8.10it/s]


In [15]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [16]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [17]:
from evaluate import combine
clf_metrics = combine(["accuracy", "f1", "precision", "recall"])

In [18]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
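
These values can be checked by hand. The sketch below recomputes them in plain Python, treating label 1 as the positive class, which is the default for binary precision, recall and F1 in these metrics. The `manual_*` names are illustrative.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]

pairs = list(zip(predictions, references))
tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # true positives
fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # false positives
fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # false negatives

manual_accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(manual_accuracy, manual_precision, manual_recall, manual_f1)
```

The single predicted positive is correct (precision 1.0), but only one of the two gold positives is found (recall 0.5).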

## Your Turn

Compute the classification metrics for the baseline model on each test section (R1, R2, R3) of the ANLI dataset.

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Hugging Face `evaluate` library.

In [19]:
pred_test_r1 = evaluate_on_dataset(dataset['test_r1'])
# pred_test_r1[:5]


100%|██████████| 1000/1000 [02:05<00:00,  7.99it/s]


In [20]:
pred_test_r2 = evaluate_on_dataset(dataset['test_r2'])
# pred_test_r2[:5]

100%|██████████| 1000/1000 [02:05<00:00,  7.99it/s]


In [21]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])
# pred_test_r3[:5]


100%|██████████| 1200/1200 [02:31<00:00,  7.92it/s]


In [22]:
from evaluate import load

# load each metric
accuracy  = load("accuracy")
precision = load("precision")
recall    = load("recall")
f1        = load("f1")

label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}

def compute_section_metrics(pred_list):
    preds = [label2id[ex['pred_label']] for ex in pred_list]
    refs  = [label2id[ex['gold_label']] for ex in pred_list]

    return {
        "accuracy" : accuracy.compute(predictions=preds, references=refs)["accuracy"],
        "precision": precision.compute(predictions=preds,
                                       references=refs,
                                       average="macro")["precision"],
        "recall"   : recall.compute(predictions=preds,
                                    references=refs,
                                    average="macro")["recall"],
        "f1"       : f1.compute(predictions=preds,
                                references=refs,
                                average="macro")["f1"],
    }

for split_name, preds in [
    ("test_r1", pred_test_r1),
    ("test_r2", pred_test_r2),
    ("test_r3", pred_test_r3),
]:
    m = compute_section_metrics(preds)
    print(f"\n=== Metrics on {split_name} ({len(preds)} examples) ===")
    print(f"Accuracy : {m['accuracy']:.4f}")
    print(f"Precision: {m['precision']:.4f}")
    print(f"Recall   : {m['recall']:.4f}")
    print(f"F1       : {m['f1']:.4f}")


=== Metrics on test_r1 (1000 examples) ===
Accuracy : 0.7120
Precision: 0.7135
Recall   : 0.7120
F1       : 0.7119

=== Metrics on test_r2 (1000 examples) ===
Accuracy : 0.5470
Precision: 0.5472
Recall   : 0.5470
F1       : 0.5465

=== Metrics on test_r3 (1200 examples) ===
Accuracy : 0.4950
Precision: 0.4985
Recall   : 0.4946
F1       : 0.4943
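
With `average="macro"`, each metric is computed per class and the per-class values are then averaged with equal weight, regardless of how many examples each class has. A minimal sketch of that computation in plain Python follows. `macro_scores` and the toy labels are illustrative and not part of the `evaluate` library; scoring a class with no predictions or no gold examples as 0 is an assumption of this sketch.

```python
def macro_scores(preds, refs, num_classes=3):
    """Per-class precision/recall/F1, averaged with equal class weight."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for p, r in zip(preds, refs) if p == c and r == c)
        fp = sum(1 for p, r in zip(preds, refs) if p == c and r != c)
        fn = sum(1 for p, r in zip(preds, refs) if p != c and r == c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }

# Toy labels: 0 = entailment, 1 = neutral, 2 = contradiction.
toy_preds = [0, 1, 2, 2, 1, 0]
toy_refs = [0, 1, 2, 1, 1, 2]
toy_metrics = macro_scores(toy_preds, toy_refs)
print(toy_metrics)
```

When every class has the same number of gold examples, macro recall equals accuracy; the near-identical accuracy and recall values above suggest the test splits are close to balanced.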


In [23]:
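import random
# Fix the seed so the same errors are sampled on every run;
# any hand-written annotations must refer to that exact sample.
random.seed(0)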
import pandas as pd

all_predictions = pred_test_r1 + pred_test_r2 + pred_test_r3

errors = [p for p in all_predictions if p['pred_label'] != p['gold_label']]

print(f"Found {len(errors)} errors out of {len(all_predictions)} total predictions.")

# Sample up to 20 errors (all of them if fewer are available)
error_samples = random.sample(errors, min(20, len(errors)))

print(f"Investigating {len(error_samples)} sampled errors...")

pd.set_option('display.max_colwidth', None)
error_df = pd.DataFrame(error_samples)[['premise', 'hypothesis', 'pred_label', 'gold_label', 'reason']]
# error_df


Found 1347 errors out of 3200 total predictions.
Investigating 20 sampled errors...
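
Before reading individual examples, it can help to see which label confusions dominate. The sketch below counts (gold, predicted) pairs with `collections.Counter`. The records are hypothetical stand-ins that use the same `pred_label` and `gold_label` keys as the entries in `all_predictions`.

```python
from collections import Counter

# Hypothetical records mimicking the structure of all_predictions.
sample_predictions = [
    {"pred_label": "neutral", "gold_label": "entailment"},
    {"pred_label": "neutral", "gold_label": "contradiction"},
    {"pred_label": "entailment", "gold_label": "entailment"},
    {"pred_label": "neutral", "gold_label": "entailment"},
    {"pred_label": "contradiction", "gold_label": "neutral"},
]

# Count only the misclassified examples, keyed by (gold, predicted).
confusions = Counter(
    (p["gold_label"], p["pred_label"])
    for p in sample_predictions
    if p["pred_label"] != p["gold_label"]
)

for (gold, pred), count in confusions.most_common():
    print(f"gold={gold:<13} pred={pred:<13} count={count}")
```

Running the same counting over the real `all_predictions` list would show, for instance, how often gold entailment is predicted as neutral.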


In [24]:
report = pd.DataFrame({
    'error_reason': [
        "Missed link between enjoyment and stress relief.",
        "Assumed producer caused success without proof.",
        "Confused partial title with full one.",
        "Misread 'auto' as motorcycle-related.",
        "Brought in unrelated historical info.",
        "Failed to see contradiction in death dates.",
        "Didn’t connect walkie-talkies to communication.",
        "Took negative tone as contradiction.",
        "Misjudged shape vs. human features.",
        "Mistook protest for support due to cheering.",
        "Assumed criminality not mentioned in text.",
        "Guessed group inclusion without mention.",
        "Assumed gender without evidence.",
        "Miscounted word frequency in text.",
        "Assumed theatrical release by default.",
        "Missed foreign country implication.",
        "Guessed result not stated in text.",
        "Assumed stadium use without proof.",
        "Misread ‘always have’ as confirmed independence.",
        "Took argument as statement of fact."
    ]
})

# REASONING


In [26]:
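# concat with axis=1 aligns rows by index, so `report` must list its reasons
# in the same order as the sampled rows in `error_df`
df_reasoning = pd.concat([error_df, report], axis=1)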
df_reasoning

index,premise,hypothesis,pred_label,gold_label,reason,error_reason
0,"West Palm Beach Municipal Stadium, referred to as ""Municipal Stadium"", located at 755 Hank Aaron Drive, was a ballpark in West Palm Beach, Florida and the long-time spring training home for the Milwaukee and Atlanta Braves and Montreal Expos. The Braves played spring training games at the stadium from 1963 to 1997, while the Expos played there from 1969 to 1972 and from 1981 to 1997.",The Braves played at Municipal Stadium in the fall.,contradiction,neutral,The Braves definitely played there in the spring but it doesn't say whether or not they ever played a game there during other seasons.,Missed link between enjoyment and stress relief.
1,Red food dye<br>Tom was a middle school student. His teacher asked each student to bring food and drinks in. He brought in sprite with red food dye in it. Everyone loved the unique drink. Tom was happy about his decision.,Tom brought a red food.,entailment,contradiction,"This is false, he brought a red drink",Assumed producer caused success without proof.
2,"The 1941 Cabo San Lucas hurricane is considered one of the worst tropical cyclones on record to affect Cabo San Lucas. The hurricane was first reported on September 8 off the coast of Mexico. It slowly moved northwestward while intensifying. After peaking in intensity, it entered the Gulf of California, and weakened rapidly. It dissipated on September 13.",The 1941 Cabo San Lucas hurricane was not a weather formation that one would consider taking precautions with,neutral,contradiction,System did not understand that a strong storm is one that one would consider taking precautions with,Confused partial title with full one.
3,"Drew Barrymore and Justin Long have been cast in a romantic comedy called Going the Distance. The New Line Cinema film is to be directed by documentary filmmaker Nanette Burstein, who made the films The Kid Stays in the Picture and American Teen. It will be Burstein's feature film debut. Going the Distance, written by New Line staffer Geoff LaTulippe, will focus on a couple dealing with challenges arising from a cross-country romance. Media reports did not indicate a release date had been determined.",all the movies were directed by the same person,contradiction,entailment,The three movies listed in the paragraph have the same director. This makes my statement correct. The model seems to have trouble with vague statements.,Misread 'auto' as motorcycle-related.
4,"Vincent Edward ""Bo"" Jackson (born November 30, 1962) is a former baseball and American football player. He is one of the few athletes to be named an All-Star in two major sports, and the only one to do so in both baseball and football. He is widely considered one of the greatest athletes of all time.",Many professional sports players have been named All-Star in separate sports.,neutral,contradiction,This was a rewording of something in the context. The system doesn't do well when synonyms are used.,Brought in unrelated historical info.
5,"Completed in 1796, the Pawtucket Canal was originally built as a transportation canal to circumvent the Pawtucket Falls of the Merrimack River in East Chelmsford, Massachusetts. In the early 1820s it became a major component of the Lowell power canal system. with the founding of the textile industry at what became Lowell.",Transportation was more readily conducted after the construction of the Pawtucket Canal.,neutral,entailment,This is the reason the Pawtucket Canal was built.,Failed to see contradiction in death dates.
6,Lost phone<br>Hailey was a typical teenager attached to her cell phone all the time. She brought it everywhere she went and used it all the time. One day she wasn't feeling well and had to leave school early. She got home and was upset to find she didn't have her phone with her. She went to school to find her friend was holding on to her phone.,Hailey never found her phone.,neutral,contradiction,Hailey did find her phone when she went to school and found that her friend had it.,Didn’t connect walkie-talkies to communication.
7,"How to bathe your pet rabbit<br>Brush the rabbit to remove bits of dirt. Many rabbits loved to be brushed, and it's a great way to help them keep their fur clean. Buy a brush made specifically for rabbit fur (often finer-toothed than brushes intended for dogs).",Rabbits like to get baths.,entailment,neutral,I think it is a neither definitely correct or definitely incorrect statement to say rabbits like to get baths because it fits with the context and it does not say if they like baths or not..,Took negative tone as contradiction.
8,"Van Cleef & Arpels is a French jewelry, watch, and perfume company. It was founded in 1896 by Alfred Van Cleef and his uncle Salomon Arpels in Paris. Their pieces often feature flowers, animals, and fairies, and have been worn by style icons such as Farah Pahlavi, the Duchess of Windsor, Grace Kelly, and Elizabeth Taylor.",Van Cleef & Arpels was favoured by royalty,neutral,entailment,"The context says that the their pieces was worn by the Duchess of Windsor, and Grace Kelly, who are both members of European royalty. So it is true to say they are favoured by royalty. It may be hard for the system as it may not know that the Duchess and Windor and Grace Kelly have royal connections",Misjudged shape vs. human features.
9,"According to Naeye, the newfangled telescopes will be able to peer so far back in space and, thus, time that they ""will see the first galaxies assembling a few hundred million years after the Big Bang.""",The telescope is more powerful than previous versions of it,neutral,entailment,This telescope allows for viewing galaxies unseen to humans before,Mistook protest for support due to cheering.
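
Because `pd.concat(..., axis=1)` pairs rows by index (here, the row position), the hand-written reasons only describe the right examples if `error_df` still holds the sample they were written for. One way to make the pairing robust is to store each annotation under a key derived from the example itself. A minimal sketch in plain Python, where `example_key`, `attach_annotations`, `annotations` and the toy records are all hypothetical:

```python
def example_key(record):
    """Identify an example by its premise and hypothesis."""
    return (record["premise"], record["hypothesis"])

def attach_annotations(records, annotations):
    """Return copies of records with 'error_reason' looked up by key."""
    return [
        {**rec, "error_reason": annotations.get(example_key(rec), "")}
        for rec in records
    ]

# Toy error records with the same keys as the notebook's prediction dicts.
toy_errors = [
    {"premise": "P1", "hypothesis": "H1",
     "pred_label": "neutral", "gold_label": "entailment"},
    {"premise": "P2", "hypothesis": "H2",
     "pred_label": "entailment", "gold_label": "contradiction"},
]

# Annotations are keyed by example, not by row position.
annotations = {("P2", "H2"): "Toy reason for the second example."}

# Reordering the records does not break the pairing.
annotated = attach_annotations(list(reversed(toy_errors)), annotations)
print(annotated)
```

Examples without an annotation get an empty `error_reason`, which makes missing entries easy to spot.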


# Preparation for anli_llm_baseline

In [27]:
import json

# Save test_r3 results for LLM comparison
def save_deberta_results_for_comparison():
    """
    Save DeBERTa test_r3 results in a format that can be loaded by LLM baseline
    """
    
    # Make sure pred_test_r3 exists (you should have this from your existing code)
    if 'pred_test_r3' not in globals():
        print("ERROR: pred_test_r3 not found. Make sure you've run the DeBERTa evaluation first.")
        return
    
    # Convert to simple format for comparison
    comparison_results = []
    
    for result in pred_test_r3:
        comparison_results.append({
            'premise': result['premise'],
            'hypothesis': result['hypothesis'],
            'pred_label': result['pred_label'],    # DeBERTa prediction
            'gold_label': result['gold_label'],    # True label
            'correct': result['pred_label'] == result['gold_label']  # Whether correct
        })
    
    # Save as JSON (easier to load than pickle)
    with open('deberta_test_r3_results.json', 'w') as f:
        json.dump(comparison_results, f, indent=2)
    
    # Also save as CSV for easy inspection
    df = pd.DataFrame(comparison_results)
    df.to_csv('deberta_test_r3_results.csv', index=False)
    
    print(f"Saved {len(comparison_results)} DeBERTa results to:")
    print("- deberta_test_r3_results.json")
    print("- deberta_test_r3_results.csv")
    
    # Show sample
    print("\nSample of saved data:")
    print(df.head())
    
    return comparison_results

# Call the function to save results
deberta_comparison_results = save_deberta_results_for_comparison()

Saved 1200 DeBERTa results to:
- deberta_test_r3_results.json
- deberta_test_r3_results.csv

Sample of saved data:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             premise  \
0  It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles i