# ANLI Baseline

This model illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to perform specialized inference on the ANLI dataset.
This dataset has 184M parameters. It was trained in 2021 on the basis of a BERT-like embedding approach: 
* The premise and the hypothesis are encoded using the DeBERTa-v3-base contextual encoder
* The encodings are then compared on a fine-tuned model to predict a distribution over the classification labels (entailment, contradiction, neutral)

Reported accuracy on ANLI is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli) 



In [65]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [66]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

input = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
output = model(input["input_ids"].to(device))  # device = "cuda:0" or "cpu"
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}


In [84]:
def evaluate(premise, hypothesis):
    input = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    output = model(input["input_ids"].to(device))
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction

In [68]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [69]:
def get_prediction(pred_dict):
    if pred_dict["entailment"] > pred_dict["contradiction"]  and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"]:
        return "contradiction"
    else:
        return "neutral"

In [70]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [71]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [72]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [73]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
dataset = dataset.filter(lambda x: x['reason'] != None and x['reason'] != "")

In [74]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [85]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [76]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [06:56<00:00,  2.88it/s]


In [77]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the huggingface `evaluate` package to compute the performance of the baseline.


In [78]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [79]:
import evaluate
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

In [80]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}

## Your Turn

Compute the classification metrics on the baseline model on each section of the ANLI dataset.

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Huggingface evaluate library.

## 1.1. Execute the NLI Notebook

In [86]:
def evaluate_all_test_sections(dataset):
    results = {}
    sections = ['test_r1', 'test_r2', 'test_r3']

    for section in sections:
        section_results = evaluate_on_dataset(dataset[section])
        results[section] = section_results

        predictions = [result['pred_label'] for result in section_results]
        references = [result['gold_label'] for result in section_results]

        #  Convert the lables to numerical values for evaluation
        label_to_int = {"entailment": 0, "neutral": 1, "contradiction": 2}
        predictions_int = [label_to_int[pred] for pred in predictions]
        references_int = [label_to_int[ref] for ref in references]

        accuracy_score = accuracy.compute(predictions=predictions_int, references=references_int)['accuracy']
        f1_score = f1.compute(predictions=predictions_int, references=references_int, average='macro')['f1']
        precision_score = precision.compute(predictions=predictions_int, references=references_int, average='macro')['precision']
        recall_score = recall.compute(predictions=predictions_int, references=references_int, average='macro')['recall']

        print(f"Results for section {section}:")
        print(f"\tAccuracy: {accuracy_score:.3f}")
        print(f"\tF1: {f1_score:.3f}")
        print(f"\tPrecision: {precision_score:.3f}")
        print(f"\tRecall: {recall_score:.3f}")
        print("-" * 50)

    return results

test_results = evaluate_all_test_sections(dataset)

100%|██████████| 1000/1000 [05:56<00:00,  2.81it/s]


Results for section test_r1:
	Accuracy: 0.619
	F1: 0.605
	Precision: 0.633
	Recall: 0.619
--------------------------------------------------


100%|██████████| 1000/1000 [05:32<00:00,  3.01it/s]


Results for section test_r2:
	Accuracy: 0.504
	F1: 0.489
	Precision: 0.508
	Recall: 0.504
--------------------------------------------------


100%|██████████| 1200/1200 [06:47<00:00,  2.95it/s]


Results for section test_r3:
	Accuracy: 0.481
	F1: 0.463
	Precision: 0.465
	Recall: 0.482
--------------------------------------------------


## 1.2. Investigate Errors of the NLI Model

In [88]:
def sample_errors(test_results):
    errors = []
    
    for section in ['test_r1', 'test_r2', 'test_r3']:
        for result in test_results[section]:
            if result['pred_label'] != result['gold_label']:
                errors.append(result)

            if len(errors) == 20:
                return errors

    return errors

error_samples = sample_errors(test_results)
for i, error in enumerate(error_samples):
    print(f"Error {i+1}:")
    print(f"\tPremise: {error['premise']}")
    print(f"\tHypothesis: {error['hypothesis']}")
    print(f"\tPredicted: {error['pred_label']}")
    print(f"\tGold Label: {error['gold_label']}")
    print(f"\tReason: {error['reason']}")
    print("-" * 50)

Error 1:
	Premise: Shadowboxer is a 2005 crime thriller film directed by Lee Daniels and starring Academy Award winners Cuba Gooding Jr., Helen Mirren, and Mo'Nique. It opened in limited release in six cities: New York, Los Angeles, Washington, D.C., Baltimore, Philadelphia, and Richmond, Virginia.
	Hypothesis: Shadowboxer was written and directed by Lee Daniels and was starring Academy Award winners Cuba Gooding Jr., Helen Mirren, and Mo'Nique.
	Predicted: entailment
	Gold Label: neutral
	Reason: It is not know who wrote the Shadowboxer. The system can get confused if a small detail is added for a person while many correct details are written.
--------------------------------------------------
Error 2:
	Premise: Michael T. Scuse (born 1954) is an American public official. He was the acting United States Deputy Secretary of Agriculture, and, following the resignation of Tom Vilsack on January 13, 2017, was acting United States Secretary of Agriculture until Donald Trump took office as 

factors: 

1. Lexical and Paraphrastic Variation

2. Syntactic Complexity and Ambiguity

3. World Knowledge and Commonsense Reasoning

4. Vague Cases and Graded Entailment

5. Hidden Assumptions and Event Coreference

## Error Analysis table

### Based on the analysis of 20 error samples from the ANLI baseline model, here are the key observations:

| **#** | **Error Summary** | **Reason Summary** | **Factor** |
|-------|-------------------|--------------------|------------|
|   1   | Assumed writing from directing | Small detail added, many other details correct → confused model | Hidden Assumptions |
|   2   | Secret sevice assignment | Not stated in text, model assumed cabinet role implies protection | World Knowledge |
|   3   | Halley born outside UK | English nationality ≠ confirmed UK birth | World Knowledge |
|   4   | Plant found all over the world | Native to North America, but global presence not mentioned | World Knowledge |
|   5   | Film broadcast assumptions | Broadcasting not mentioned, the AI can’t assume | Hidden Assumptions |
|   6   | Champion in 3rd edition | "Defending" implies prior win, the model didn’t infer that | Hidden Assumptions |
|   7   | Armenian Film Festival | 	No evidence of 2002 event, model over-inferred | Vague Cases |