# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to perform natural language inference (NLI) on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 and follows a BERT-like approach:
* The premise and the hypothesis are encoded together, as a single sequence pair, by the DeBERTa-v3-base contextual encoder
* A fine-tuned classification head on top of the encoder predicts a distribution over the classification labels (entailment, contradiction, neutral)

Reported accuracy on the ANLI round 3 test set (test_r3) is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)



In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
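# Move the model to the same device as the inputs, otherwise inference fails on GPU machines
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)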


In [2]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)  # passes input_ids and attention_mask together
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}
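
The percentages printed above are a softmax over the three logits returned by the model, scaled to 100 and rounded. The sketch below reproduces that conversion in plain Python. The logit values are made up for illustration and are not the model's actual logits for this example; `softmax_percent` is a hypothetical helper name.

```python
import math

def softmax_percent(logits, names):
    """Convert raw logits into rounded percentages that sum to roughly 100."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {name: round(100 * e / total, 1) for name, e in zip(names, exps)}

# Hypothetical logits, chosen only to illustrate the computation
example_logits = [-1.2, -0.2, 1.3]
example_names = ["entailment", "neutral", "contradiction"]
example_scores = softmax_percent(example_logits, example_names)
print(example_scores)
```

Because each value is rounded separately, the three percentages may add up to slightly more or less than 100.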


In [26]:
def evaluate_with_NLI(premise, hypothesis):
    """Return the label probabilities (in percent) for a premise/hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction

In [27]:
evaluate_with_NLI("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [28]:
def get_prediction(pred_dict):
    """Return the label with the strictly highest score; ties fall back to "neutral"."""
    if pred_dict["entailment"] > pred_dict["contradiction"] and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"] and pred_dict["contradiction"] > pred_dict["neutral"]:
        return "contradiction"
    else:
        return "neutral"
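
The same decision can be written as an argmax over the score dictionary. This is a minimal sketch with a hypothetical name, `get_prediction_argmax`. It differs from `get_prediction` on ties: `max` returns the first key with the highest score, while `get_prediction` falls back to "neutral".

```python
def get_prediction_argmax(pred_dict):
    """Return the label with the highest score (first key wins on ties)."""
    return max(pred_dict, key=pred_dict.get)

example_label = get_prediction_argmax(
    {"entailment": 6.6, "neutral": 17.3, "contradiction": 76.1}
)
print(example_label)
```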

In [29]:
get_prediction(evaluate_with_NLI("The weather is nice today.", "It is sunny outside."))

'neutral'

In [30]:
get_prediction(evaluate_with_NLI("It is sunny outside.", "The weather is nice today."))

'entailment'

In [31]:
get_prediction(evaluate_with_NLI("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [9]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a non-empty 'reason' annotation
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")
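
The filter keeps an example only when its `reason` is neither `None` nor an empty string. Here is a plain-Python illustration of that predicate on made-up rows; it makes no `datasets` call, and `has_reason` is a hypothetical name.

```python
def has_reason(example):
    """True when the example carries a non-empty 'reason' string."""
    return example["reason"] is not None and example["reason"] != ""

# Made-up rows that only mimic the 'uid' and 'reason' columns
toy_rows = [
    {"uid": "a", "reason": "Sunday is considered Lord's Day"},
    {"uid": "b", "reason": ""},
    {"uid": "c", "reason": None},
]
kept_uids = [row["uid"] for row in toy_rows if has_reason(row)]
print(kept_uids)
```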


In [32]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [33]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate_with_NLI(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [34]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [06:16<00:00,  3.19it/s]


In [35]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [36]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [37]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [45]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
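
To see where these numbers come from: with the default binary averaging, label 1 is treated as the positive class. The sketch below recomputes the four values by hand from the true positive, false positive and false negative counts, in plain Python without calling `evaluate`. The `*_value` names are chosen to avoid shadowing the metric objects loaded above.

```python
toy_predictions = [0, 1, 0]
toy_references = [0, 1, 1]

pairs = list(zip(toy_predictions, toy_references))
tp = sum(p == 1 and r == 1 for p, r in pairs)  # predicted 1, gold 1
fp = sum(p == 1 and r == 0 for p, r in pairs)  # predicted 1, gold 0
fn = sum(p == 0 and r == 1 for p, r in pairs)  # predicted 0, gold 1

accuracy_value = sum(p == r for p, r in pairs) / len(pairs)
precision_value = tp / (tp + fp)
recall_value = tp / (tp + fn)
f1_value = 2 * precision_value * recall_value / (precision_value + recall_value)
print(accuracy_value, precision_value, recall_value, f1_value)
```

There is one true positive, no false positive and one false negative, which gives precision 1.0, recall 0.5 and an F1 of 2/3.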

## Your Turn

Compute the classification metrics of the baseline model on each section of the ANLI dataset.

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Hugging Face `evaluate` library.

1.1

Evaluate the baseline on the ANLI samples that have a non-empty 'reason' field in the 'test'
sections of the dataset (there are three such sections: test_r1, test_r2 and test_r3).

In [39]:
pred_test_r1 = evaluate_on_dataset(dataset['test_r1'])
pred_test_r2 = evaluate_on_dataset(dataset['test_r2'])

pred_test_r1[:5] 
pred_test_r2[:5]

100%|██████████| 1000/1000 [05:46<00:00,  2.89it/s]
100%|██████████| 1000/1000 [05:32<00:00,  3.01it/s]


[{'premise': 'There is a little Shia community in El Salvador. There is an Islamic Library operated by the Shia community, named "Fatimah Az-Zahra". They published the first Islamic magazine in Central America: "Revista Biblioteca Islámica". Additionally, they are credited with providing the first and only Islamic library dedicated to spreading Islamic culture in the country.',
  'hypothesis': 'The community is south of the United States.',
  'prediction': {'entailment': 94.5, 'neutral': 1.7, 'contradiction': 3.8},
  'pred_label': 'entailment',
  'gold_label': 'entailment',
  'reason': 'The community is in El Salvador which is south of the US.'},
 {'premise': '"Look at Me (When I Rock Wichoo)" is a song by American indie rock band Black Kids, taken from their debut album "Partie Traumatic". It was released in the UK by Almost Gold Recordings on September 8, 2008 and debuted on the Top 200 UK Singles Chart at number 175.',
  'hypothesis': 'The song was released in America in September 2

In [42]:
label_map = {
    "entailment": 0,
    "neutral": 1,
    "contradiction": 2
}

In [60]:
def display_evaluation_metrics(test_predictions, test_references):
    print("Accuracy for test dataset")
    print(accuracy.compute(predictions=test_predictions, references=test_references))

    print("Precision micro for test dataset")
    print(precision.compute(predictions=test_predictions, references=test_references, average="micro"))

    print("Recall micro for test dataset")
    print(recall.compute(predictions=test_predictions, references=test_references, average="micro"))

    print("F1 micro for test dataset")
    print(f1.compute(predictions=test_predictions, references=test_references, average="micro"))

In [61]:
test_r1_predictions = [label_map[e["pred_label"]] for e in pred_test_r1]
test_r1_gold = [label_map[e["gold_label"]] for e in pred_test_r1]

display_evaluation_metrics(test_r1_predictions, test_r1_gold)

Accuracy for test dataset
{'accuracy': 0.712}
Precision micro for test dataset
{'precision': 0.712}
Recall micro for test dataset
{'recall': 0.712}
F1 micro for test dataset
{'f1': 0.712}


In [62]:
test_r2_predictions = [label_map[e["pred_label"]] for e in pred_test_r2]
test_r2_gold = [label_map[e["gold_label"]] for e in pred_test_r2]

display_evaluation_metrics(test_r2_predictions, test_r2_gold)

Accuracy for test dataset
{'accuracy': 0.547}
Precision micro for test dataset
{'precision': 0.547}
Recall micro for test dataset
{'recall': 0.547}
F1 micro for test dataset
{'f1': 0.547}


In [63]:
test_r3_predictions = [label_map[e["pred_label"]] for e in pred_test_r3]
test_r3_gold = [label_map[e["gold_label"]] for e in pred_test_r3]

display_evaluation_metrics(test_r3_predictions, test_r3_gold)

Accuracy for test dataset
{'accuracy': 0.495}
Precision micro for test dataset
{'precision': 0.495}
Recall micro for test dataset
{'recall': 0.495}
F1 micro for test dataset
{'f1': 0.495}
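
Accuracy and the micro-averaged precision, recall and F1 are identical in all three sections. This is expected: when every example has exactly one gold label and one predicted label, each error counts once as a false positive and once as a false negative, so micro-averaging collapses to accuracy. Macro-averaging (the unweighted mean of the per-class scores) says more about per-label behaviour. The following is a minimal plain-Python sketch on made-up labels; `per_class_f1`, `macro_f1` and `micro_f1` are hypothetical helpers, not part of `evaluate`.

```python
def per_class_f1(predictions, references, label):
    """F1 for one label, treating it as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision_value, recall_value = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision_value * recall_value / (precision_value + recall_value)

def macro_f1(predictions, references, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(predictions, references, l) for l in labels) / len(labels)

def micro_f1(predictions, references):
    """Pool counts over all classes: every error is one FP and one FN."""
    tp = sum(p == r for p, r in zip(predictions, references))
    errors = len(predictions) - tp
    return 2 * tp / (2 * tp + errors + errors)

# Made-up labels: the classifier almost always answers class 0
toy_predictions = [0, 0, 0, 0, 0, 1]
toy_references = [0, 0, 0, 0, 1, 2]
toy_accuracy = sum(p == r for p, r in zip(toy_predictions, toy_references)) / len(toy_predictions)
print(toy_accuracy, micro_f1(toy_predictions, toy_references), macro_f1(toy_predictions, toy_references))
```

With the `evaluate` metrics loaded earlier, the corresponding change would be to pass `average="macro"` instead of `average="micro"`.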


1.2

Investigate Errors of the NLI Model.

Sample 20 errors from the baseline model, and investigate the reasons the model made a mistake.


In [70]:
def sample_n_errors(pred_test, n):
    """Return the first n misclassified examples, in dataset order."""
    samples = []
    for test_result in pred_test:
        if test_result['pred_label'] != test_result['gold_label']:
            samples.append(test_result)
            if len(samples) >= n:
                break
    return samples

# 7 + 7 + 6 = 20 errors across the three test sections
baseline_error_samples = sample_n_errors(pred_test_r1, 7) + sample_n_errors(pred_test_r2, 7) + sample_n_errors(pred_test_r3, 6)

for item in baseline_error_samples:
    print(item)

{'premise': "Shadowboxer is a 2005 crime thriller film directed by Lee Daniels and starring Academy Award winners Cuba Gooding Jr., Helen Mirren, and Mo'Nique. It opened in limited release in six cities: New York, Los Angeles, Washington, D.C., Baltimore, Philadelphia, and Richmond, Virginia.", 'hypothesis': "Shadowboxer was written and directed by Lee Daniels and was starring Academy Award winners Cuba Gooding Jr., Helen Mirren, and Mo'Nique.", 'prediction': {'entailment': 53.2, 'neutral': 46.1, 'contradiction': 0.8}, 'pred_label': 'entailment', 'gold_label': 'neutral', 'reason': 'It is not know who wrote the Shadowboxer. The system can get confused if a small detail is added for a person while many correct details are written.'}
{'premise': "Edmond (or Edmund) Halley, FRS (pronounced ; 8 November [O.S. 29 October] 1656 – 25 January 1742 [O.S. 14 January 1741] ) was an English astronomer, geophysicist, mathematician, meteorologist, and physicist who is best known for computing the orb
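
`sample_n_errors` returns the first n errors in dataset order. If a random sample is preferred, a seeded draw keeps it reproducible. This is a minimal sketch; `sample_random_errors` and the toy records are hypothetical and only mimic the `pred_label` and `gold_label` fields used above.

```python
import random

def sample_random_errors(pred_test, n, seed=0):
    """Draw up to n misclassified examples uniformly at random."""
    errors = [r for r in pred_test if r["pred_label"] != r["gold_label"]]
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(errors, min(n, len(errors)))

# Made-up records with the two fields the function relies on
toy_results = [
    {"pred_label": "entailment", "gold_label": "neutral"},
    {"pred_label": "neutral", "gold_label": "neutral"},
    {"pred_label": "contradiction", "gold_label": "entailment"},
    {"pred_label": "neutral", "gold_label": "contradiction"},
]
toy_sample = sample_random_errors(toy_results, 2)
print(len(toy_sample))
```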

Common patterns behind errors:

* Failure to recognize missing or extra information, leading the model to predict entailment when the gold label is neutral.
* Lack of external knowledge, such as geography or word implications (e.g. “defending champion”).
* Overgeneralization or assumption-based reasoning.
* Misinterpretation of named roles or entities.
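
One way to check how often each kind of confusion occurs is to tally the (gold, predicted) label pairs among the errors. The sketch below uses made-up records with the same `gold_label` and `pred_label` fields as the results above; `confusion_pairs` is a hypothetical helper.

```python
from collections import Counter

def confusion_pairs(results):
    """Count (gold, predicted) label pairs over the misclassified examples."""
    return Counter(
        (r["gold_label"], r["pred_label"])
        for r in results
        if r["gold_label"] != r["pred_label"]
    )

# Made-up records, for illustration only
toy_records = [
    {"gold_label": "neutral", "pred_label": "entailment"},
    {"gold_label": "neutral", "pred_label": "entailment"},
    {"gold_label": "entailment", "pred_label": "neutral"},
    {"gold_label": "contradiction", "pred_label": "contradiction"},
]
toy_counts = confusion_pairs(toy_records)
for (gold, pred), count in toy_counts.most_common():
    print(f"gold={gold:<14} pred={pred:<14} {count}")
```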

Exporting test_r3 results for the 1.3 evaluation:

In [71]:
import json

with open("pred_test_r3.json", "w") as f:
    json.dump(pred_test_r3, f, indent=2)
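
The saved file can be read back with `json.load` wherever the 1.3 evaluation runs. The round-trip sketch below writes a toy record to a temporary directory rather than touching `pred_test_r3.json`. The file name and the record are made up.

```python
import json
import os
import tempfile

# Made-up record with the same kind of fields as the saved predictions
toy_results = [
    {"premise": "p", "hypothesis": "h", "pred_label": "neutral", "gold_label": "entailment"}
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_predictions.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump(toy_results, f, indent=2)
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded == toy_results)
```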