# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to perform natural language inference (NLI) on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 and follows a BERT-like encoding approach:
* The premise and the hypothesis are encoded together using the DeBERTa-v3-base contextual encoder
* A fine-tuned classification head then maps the encoding to a distribution over the classification labels (entailment, neutral, contradiction)

Reported accuracy on ANLI round 3 (`test_r3`) is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)



In [29]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
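# Move the model to the same device as the inputs; otherwise inference fails on CUDA machines
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)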

In [30]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."

# Encode the pair and move all tensors to the device ("cuda" or "cpu")
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
output = model(**inputs)
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}
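
The percentages printed above come from applying a softmax to the model's three logits. As a minimal pure-Python sketch of that step, the block below uses made-up logits (the values are hypothetical, not taken from the model) and the same label order as the cell above.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, in the order entailment, neutral, contradiction.
example_logits = [-1.2, -0.2, 1.3]
label_names = ["entailment", "neutral", "contradiction"]

probs = softmax(example_logits)
percentages = {name: round(p * 100, 1) for name, p in zip(label_names, probs)}
print(percentages)
```

The largest logit always receives the largest probability, so the predicted label can be read from either the logits or the percentages.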


In [31]:
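def evaluate(premise, hypothesis):
    """Return label probabilities (in percent) for one premise/hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    # label_names is the global list defined in the previous cell
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction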

In [32]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [33]:
def get_prediction(pred_dict):
    if pred_dict["entailment"] > pred_dict["contradiction"] and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"] and pred_dict["contradiction"] > pred_dict["neutral"]:
        return "contradiction"
    else:
        return "neutral"
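
The function above picks the label with the highest score. A shorter way to express the same idea is an argmax over the dictionary, sketched below. `get_prediction_argmax` is a hypothetical name introduced only for this illustration, and it handles ties differently from `get_prediction`.

```python
def get_prediction_argmax(pred_dict):
    """Return the label with the highest score (hypothetical helper)."""
    # max() with key=pred_dict.get compares labels by their score.
    return max(pred_dict, key=pred_dict.get)

# Scores taken from the first example in this notebook.
scores = {'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}
print(get_prediction_argmax(scores))

# On an exact tie, max() returns the first label in insertion order,
# whereas get_prediction above falls back to "neutral".
tied = {'entailment': 50.0, 'neutral': 0.0, 'contradiction': 50.0}
print(get_prediction_argmax(tied))
```

Exact ties are unlikely with softmax outputs, but rounding to one decimal makes them possible, so the choice of tie-breaking rule is worth stating explicitly.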

In [34]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [35]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [36]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [37]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a non-empty annotator 'reason'
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

In [38]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [39]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [40]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [01:58<00:00, 10.11it/s]


In [41]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the Hugging Face `evaluate` package to compute the performance of the baseline.


In [42]:
from evaluate import load

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [43]:
from evaluate import combine
clf_metrics = combine(["accuracy", "f1", "precision", "recall"])

In [44]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
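
To see where these four numbers come from, the toy example can be worked by hand. The sketch below assumes the default binary setting of these metrics, with label 1 as the positive class, and recomputes each value from the true-positive, false-positive, and false-negative counts.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]
pairs = list(zip(predictions, references))

# Counts for the positive class (label 1).
tp = sum(1 for p, r in pairs if p == 1 and r == 1)  # predicted 1, truly 1
fp = sum(1 for p, r in pairs if p == 1 and r == 0)  # predicted 1, truly 0
fn = sum(1 for p, r in pairs if p == 0 and r == 1)  # predicted 0, truly 1
correct = sum(1 for p, r in pairs if p == r)

accuracy = correct / len(pairs)   # 2 of 3 predictions match
precision = tp / (tp + fp)        # 1 / (1 + 0)
recall = tp / (tp + fn)           # 1 / (1 + 1)
f1 = 2 * precision * recall / (precision + recall)

print({"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall})
```

The single positive prediction is correct, which gives a precision of 1.0, while one of the two true positives is missed, which gives a recall of 0.5.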

## Your Turn

Compute the classification metrics for the baseline model on each test split of the ANLI dataset (`test_r1`, `test_r2`, `test_r3`).

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Hugging Face `evaluate` library.

In [49]:
pred_test_r1 = evaluate_on_dataset(dataset['test_r1'])
# pred_test_r1[:5]


100%|██████████| 1000/1000 [01:48<00:00,  9.20it/s]


In [50]:
pred_test_r2 = evaluate_on_dataset(dataset['test_r2'])
# pred_test_r2[:5]

100%|██████████| 1000/1000 [01:53<00:00,  8.78it/s]


In [51]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])
# pred_test_r3[:5]


100%|██████████| 1200/1200 [02:12<00:00,  9.06it/s]


In [65]:
from evaluate import load

# load each metric
accuracy  = load("accuracy")
precision = load("precision")
recall    = load("recall")
f1        = load("f1")

label2id = {"entailment": 0, "neutral": 1, "contradiction": 2}

def compute_section_metrics(pred_list):
    preds = [label2id[ex['pred_label']] for ex in pred_list]
    refs  = [label2id[ex['gold_label']] for ex in pred_list]

    return {
        "accuracy" : accuracy.compute(predictions=preds, references=refs)["accuracy"],
        "precision": precision.compute(predictions=preds,
                                       references=refs,
                                       average="macro")["precision"],
        "recall"   : recall.compute(predictions=preds,
                                    references=refs,
                                    average="macro")["recall"],
        "f1"       : f1.compute(predictions=preds,
                                references=refs,
                                average="macro")["f1"],
    }

for split_name, preds in [
    ("test_r1", pred_test_r1),
    ("test_r2", pred_test_r2),
    ("test_r3", pred_test_r3),
]:
    m = compute_section_metrics(preds)
    print(f"\n=== Metrics on {split_name} ({len(preds)} examples) ===")
    print(f"Accuracy : {m['accuracy']:.4f}")
    print(f"Precision: {m['precision']:.4f}")
    print(f"Recall   : {m['recall']:.4f}")
    print(f"F1       : {m['f1']:.4f}")


=== Metrics on test_r1 (1000 examples) ===
Accuracy : 0.7120
Precision: 0.7135
Recall   : 0.7120
F1       : 0.7119

=== Metrics on test_r2 (1000 examples) ===
Accuracy : 0.5470
Precision: 0.5472
Recall   : 0.5470
F1       : 0.5465

=== Metrics on test_r3 (1200 examples) ===
Accuracy : 0.4950
Precision: 0.4985
Recall   : 0.4946
F1       : 0.4943
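
The precision, recall, and F1 values above use `average="macro"`: each metric is computed per class and the per-class values are averaged with equal weight. The following is a rough pure-Python sketch of that computation on made-up labels. `macro_scores` and the toy data are hypothetical, and treating an undefined ratio as 0.0 is an assumption of this sketch.

```python
def macro_scores(preds, refs, labels=(0, 1, 2)):
    """Unweighted mean of per-class precision, recall and F1 (hypothetical helper)."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for p, r in zip(preds, refs) if p == c and r == c)
        fp = sum(1 for p, r in zip(preds, refs) if p == c and r != c)
        fn = sum(1 for p, r in zip(preds, refs) if p != c and r == c)
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,  # mean of per-class F1, not F1 of the means
    }

# Toy labels: 0 = entailment, 1 = neutral, 2 = contradiction.
toy_preds = [0, 1, 2, 2, 1, 0]
toy_refs = [0, 1, 1, 2, 2, 0]
print(macro_scores(toy_preds, toy_refs))
```

Because every class counts equally, macro averaging does not let a frequent class dominate the score, which matters when the label distribution of a split is uneven.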


In [71]:
import random
import pandas as pd

all_predictions = pred_test_r1 + pred_test_r2 + pred_test_r3

errors = [p for p in all_predictions if p['pred_label'] != p['gold_label']]

print(f"Found {len(errors)} errors out of {len(all_predictions)} total predictions.")

# Sample at most 20 errors (fewer if there are not enough)
error_samples = random.sample(errors, min(20, len(errors)))

print(f"Investigating {len(error_samples)} sampled errors...")

pd.set_option('display.max_colwidth', None)
error_df = pd.DataFrame(error_samples)[['premise', 'hypothesis', 'pred_label', 'gold_label', 'reason']]
# error_df


Found 1347 errors out of 3200 total predictions.
Investigating 20 sampled errors...
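
Before reading individual samples, it can help to tally the errors by (gold label, predicted label) pair to see which confusions dominate. This is a minimal sketch, assuming each record has the `gold_label` and `pred_label` keys produced by `evaluate_on_dataset`. `confusion_counts` and the toy records are hypothetical.

```python
from collections import Counter

def confusion_counts(predictions):
    """Count misclassified examples per (gold, predicted) label pair."""
    return Counter(
        (p["gold_label"], p["pred_label"])
        for p in predictions
        if p["gold_label"] != p["pred_label"]
    )

# Made-up records with the same keys as the notebook's prediction dicts.
toy_predictions = [
    {"gold_label": "entailment", "pred_label": "neutral"},
    {"gold_label": "entailment", "pred_label": "neutral"},
    {"gold_label": "contradiction", "pred_label": "entailment"},
    {"gold_label": "neutral", "pred_label": "neutral"},
]

counts = confusion_counts(toy_predictions)
for (gold, pred), n in counts.most_common():
    print(f"gold={gold:<13} pred={pred:<13} count={n}")
```

In the notebook, the same function could be called as `confusion_counts(all_predictions)` to summarize all three test splits at once.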


In [73]:
error_df.to_csv("anli_errors.csv", index=False)
error_df

Unnamed: 0,premise,hypothesis,pred_label,gold_label,reason
0,"How to deal with stress about current events<br>Avoid the news in the morning. Negative news in the morning may affect your mood for the rest of the day. When you wake up, try to keep yourself positive.",You should find something enjoyable in the morning.,neutral,entailment,You should stay positive in the morning.
1,"Thank You Happy Birthday is the second studio album by American rock band Cage the Elephant. It was released on January 11, 2011 to positive critical reception. The album was produced by Jay Joyce, who worked in the same capacity on the band's eponymous debut album.",Cage the Elephant had positive reviews thanks to the work of Jay Joyce who produced it.,entailment,neutral,How can you possibly know from above if the producer is the reason the album was successful?
2,"It Takes a Village: And Other Lessons Children Teach Us is a book published in 1996 by First Lady of the United States Hillary Rodham Clinton. In it, Clinton presents her vision for the children of America. She focuses on the impact individuals and groups outside the family have, for better or worse, on a child's well-being, and advocates a society which meets all of a child's needs.","The rest of the title of ""It Takes a Village"" is ""The Impact Individuals and Groups Outside the Family Have"".",entailment,contradiction,It's incorrect because the rest of the title is something else. It fooled the system because I used part of a sentence in the info.
3,"The 2012 Supercheap Auto Bathurst 1000 was an Australian touring car motor race for V8 Supercars, the twenty-first race of the 2012 International V8 Supercars Championship. It was held on Sunday, 7 October 2012 at the Mount Panorama Circuit on the outskirts of Bathurst, New South Wales, in Australia.",The 2012 Supercheap Auto Bathurst 1000 has never been won by a motorcycle.,neutral,entailment,"It's a supercar race, not a motorcycle race."
4,"Mahalakshmi (Tamil: மகாலட்சுமி ) is an 2017 Indian-Tamil Language Family soap opera starring Kavya Shastry, Vallab, Anjali Rav and Lokesh. It replaced Nijangal and it broadcast on Sun TV on Monday to Saturday from 6 March 2017 at 12:30PM (IST). It was produced by Vision Time India Pvt Ltd and directed by Shan Karthik and M.K.Arunthavaraja.",in 2017 Mahalakshmi was broadcast for the first time when it replaced Nijangal,neutral,entailment,The show Mahalakshmi was broadcast first in 2017 after it replaced Nijangal. I think it was hard for the computer because i used parts of the context that were separated by big parts of text and changed some words
5,"In Italy, big protests by students and university staff against government reforms to higher education brought parts of central Rome to a standstill on Tuesday.",Big student and university staff protests brought parts of the Roman Empire to a standstill on Tuesday.,entailment,contradiction,It is common knowledge that the Roman Empire no longer exists so this is incorrect.
6,"Giovanni Visconti — according to Lorenzo Cardella nephew of Pope Gregory X. He was ostensibly created cardinal-bishop of Sabina by his uncle in 1275 and in 1276 was named judge in the case concerning the translation of bishop Giovanni of Potenza to the archbishopric of Monreale, postulated by the cathedral chapter of Monreale. He died in 1277 or 1278.",Giovanni Visconti died in both 1277 and 1278.,entailment,contradiction,"He died in one of the two listed years, not both, as reflected by the word ""or"" in the text."
7,The Walkie Talkie<br>The boys loved playing outside. They had walkie talkies that the used to talk to each other. They would run through the woods and keep in contact with them. They played this game all summer. THey couldn't wait until next summer to play it again!,The walkie talkies made it easy for the boys to keep in contact with each other,neutral,entailment,The whole context is about how the boys used the walkie talkies to talk to each other the whole summer. I think the system didn't guess right because I didn't use a lot of the same words from the context in my answer.
8,"Edward M. ""Teddy"" Sears (born April 6, 1977) is an American actor, known for his roles as Richard Patrick Woolsley on the TNT legal drama series ""Raising the Bar"", Patrick on the first season of FX anthology horror drama ""American Horror Story"" (retroactively titled """"), Dr. Austin Langham on the ""Showtime"" period drama series ""Masters of Sex"", and Hunter Zolomon/Zoom in ""The Flash"".",Edward was not good enough for movies.,contradiction,neutral,There was no mention of movies so that is not known.
9,"Scooter is an animated character used by Fox Sports during Major League Baseball games. The character, a baseball with human facial characteristics, is voiced by Tom Kenny (best known for his work as the voice of SpongeBob SquarePants) and was designed by Fox to explain different types of pitches with the education of children in mind.",Scooter is a sphere-shaped character with human facial characteristics.,contradiction,entailment,Baseball's are sphere-shaped and Scooter is a baseball with human facial characteristics. I think this was difficult for the system because they would have to know that a baseball is sphere-shaped.
