# ANLI Baseline

This notebook illustrates how to use the DeBERTa-v3-base-mnli-fever-anli model to perform natural language inference (NLI) on the ANLI dataset.
The model has 184M parameters. It was trained in 2021 using a BERT-like encoding approach:
* The premise and the hypothesis are encoded using the DeBERTa-v3-base contextual encoder
* The encodings are then compared on a fine-tuned model to predict a distribution over the classification labels (entailment, contradiction, neutral)

Reported accuracy on the ANLI R3 test set is 0.495 (see https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli)
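The last step described above, turning the classifier's three raw scores (logits) into a distribution over the labels, can be sketched in plain Python. The logit values below are made up for illustration and do not come from the model.

```python
import math

LABELS = ["entailment", "neutral", "contradiction"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one premise/hypothesis pair (illustration only)
logits = [-1.2, -0.3, 1.5]
probs = softmax(logits)

# Same presentation as the notebook: percentages keyed by label name
distribution = {name: round(p * 100, 1) for name, p in zip(LABELS, probs)}
print(distribution)
```

The label with the largest probability is then taken as the prediction.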



In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from random import sample,seed

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [2]:
premise = "I first thought that I liked the movie, but upon second thought it was actually disappointing."
hypothesis = "The movie was good."
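model.to(device)  # explicitly move the model to the GPU when one is available
# Tokenize the pair and move all tensors (ids and attention mask) to the same device
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(**inputs)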
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'entailment': 6.6, 'neutral': 17.3, 'contradiction': 76.1}


In [3]:
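def evaluate(premise, hypothesis):
    # Tokenize the pair and move all tensors (ids and attention mask) to the model's device
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():  # inference only, so skip gradient tracking
        output = model(**inputs)
    prediction = torch.softmax(output["logits"][0], -1).tolist()
    # Convert probabilities to percentages, keyed by label name
    prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
    return prediction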

In [4]:
evaluate("The weather is nice today.", "It is sunny outside.")

{'entailment': 0.1, 'neutral': 99.8, 'contradiction': 0.0}

In [5]:
def get_prediction(pred_dict):
    if pred_dict["entailment"] > pred_dict["contradiction"]  and pred_dict["entailment"] > pred_dict["neutral"]:
        return "entailment"
    elif pred_dict["contradiction"] > pred_dict["entailment"] and pred_dict["contradiction"] > pred_dict["neutral"]:
        return "contradiction"
    else:
        return "neutral"
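The chained comparisons in `get_prediction` amount to an argmax over the three scores. A more compact variant is sketched below; `get_prediction_argmax` is a hypothetical name, not part of the notebook. The one behavioral difference is tie handling: the original falls back to "neutral", while `max` returns the first label in dictionary order that reaches the top score.

```python
def get_prediction_argmax(pred_dict):
    """Return the label with the highest score (first one wins on ties)."""
    return max(pred_dict, key=pred_dict.get)

# Score dictionaries copied from the outputs shown earlier in this notebook
print(get_prediction_argmax({"entailment": 6.6, "neutral": 17.3, "contradiction": 76.1}))
print(get_prediction_argmax({"entailment": 0.1, "neutral": 99.8, "contradiction": 0.0}))
```

Because the scores are rounded to one decimal, exact ties are possible but rare.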

In [6]:
get_prediction(evaluate("The weather is nice today.", "It is sunny outside."))

'neutral'

In [7]:
get_prediction(evaluate("It is sunny outside.", "The weather is nice today."))

'entailment'

In [8]:
get_prediction(evaluate("It is sunny outside.", "The weather is terrible today."))

'contradiction'

## Load ANLI dataset

In [9]:
from datasets import load_dataset

dataset = load_dataset("facebook/anli")
# Keep only the examples that come with a non-empty annotator explanation
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [10]:
dataset

DatasetDict({
    train_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 2923
    })
    dev_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r1: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 4861
    })
    dev_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    test_r2: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1000
    })
    train_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 13375
    })
    dev_r3: Dataset({
        features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
        num_rows: 1200


In [11]:
# Evaluate the model on the ANLI dataset
from tqdm import tqdm
def evaluate_on_dataset(dataset):
    results = []
    label_names = ["entailment", "neutral", "contradiction"]
    for example in tqdm(dataset):
        premise = example['premise']
        hypothesis = example['hypothesis']
        prediction = evaluate(premise, hypothesis)
        results.append({
            'premise': premise,
            'hypothesis': hypothesis,
            'prediction': prediction,
            'pred_label': get_prediction(prediction),
            'gold_label': label_names[example['label']],
            'reason': example['reason']
        })
    return results

In [12]:
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1200/1200 [01:39<00:00, 12.04it/s]


In [13]:
pred_test_r3[:5]  # Display the first 5 predictions

[{'premise': "It is Sunday today, let's take a look at the most popular posts of the last couple of days. Most of the articles this week deal with the iPhone, its future version called the iPhone 8 or iPhone Edition, and new builds of iOS and macOS. There are also some posts that deal with the iPhone rival called the Galaxy S8 and some other interesting stories. The list of the most interesting articles is available below. Stay tuned for more rumors and don't forget to follow us on Twitter.",
  'hypothesis': 'The day of the passage is usually when Christians praise the lord together',
  'prediction': {'entailment': 2.4, 'neutral': 97.4, 'contradiction': 0.2},
  'pred_label': 'neutral',
  'gold_label': 'entailment',
  'reason': "Sunday is considered Lord's Day"},
 {'premise': 'By The Associated Press WELLINGTON, New Zealand (AP) — All passengers and crew have survived a crash-landing of a plane in a lagoon in the Federated States of Micronesia. WELLINGTON, New Zealand (AP) — All passeng

## Evaluate Metrics

Let's use the huggingface `evaluate` package to compute the performance of the baseline.


In [14]:
from evaluate import load,combine

accuracy = load("accuracy")
precision = load("precision")
recall = load("recall")
f1 = load("f1")


In [15]:
clf_metrics = combine(["accuracy", "f1", "precision", "recall"])

In [16]:
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])

{'accuracy': 0.6666666666666666,
 'f1': 0.6666666666666666,
 'precision': 1.0,
 'recall': 0.5}
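To see where these four numbers come from, the toy example can be worked by hand. The sketch below assumes the binary setting with class `1` as the positive class, which is how the result above reads; it uses no library.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]
pairs = list(zip(predictions, references))

# Confusion counts with class 1 as the positive class
tp = sum(p == 1 and r == 1 for p, r in pairs)  # predicted 1, truly 1
fp = sum(p == 1 and r == 0 for p, r in pairs)  # predicted 1, truly 0
fn = sum(p == 0 and r == 1 for p, r in pairs)  # predicted 0, truly 1

accuracy = sum(p == r for p, r in pairs) / len(pairs)  # 2 of 3 correct
precision = tp / (tp + fp)                             # 1 / 1
recall = tp / (tp + fn)                                # 1 / 2
f1 = 2 * precision * recall / (precision + recall)     # harmonic mean

print(accuracy, precision, recall, f1)
```

The single positive prediction is correct (precision 1.0), but one of the two true positives is missed (recall 0.5).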

## Your Turn

Compute the classification metrics of the baseline model on each section (test round) of the ANLI dataset.

https://www.kaggle.com/code/faijanahamadkhan/llm-evaluation-framework-hugging-face provides good documentation on how to use the Huggingface evaluate library.

In [17]:
# The dataset was already filtered when it was loaded; re-applying the filter is harmless
dataset = dataset.filter(lambda x: x['reason'] is not None and x['reason'] != "")

pred_test_r1 = evaluate_on_dataset(dataset['test_r1'])
pred_test_r2 = evaluate_on_dataset(dataset['test_r2'])
pred_test_r3 = evaluate_on_dataset(dataset['test_r3'])

100%|██████████| 1000/1000 [01:37<00:00, 10.27it/s]
100%|██████████| 1000/1000 [01:20<00:00, 12.46it/s]
100%|██████████| 1200/1200 [01:38<00:00, 12.24it/s]


In [18]:
pred_labels_r1 = [x['pred_label'] for x in pred_test_r1]
gold_labels_r1 = [x['gold_label'] for x in pred_test_r1]
pred_labels_r2 = [x['pred_label'] for x in pred_test_r2]
gold_labels_r2 = [x['gold_label'] for x in pred_test_r2]
pred_labels_r3 = [x['pred_label'] for x in pred_test_r3]
gold_labels_r3 = [x['gold_label'] for x in pred_test_r3]
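The label lists above hold strings, which scikit-learn accepts directly. If the Hugging Face `evaluate` metrics loaded earlier were used instead, the labels would, as far as we can tell, need to be integer class ids. A minimal conversion sketch follows; `LABEL_TO_ID` and `to_ids` are hypothetical helpers whose ordering mirrors `label_names` in this notebook.

```python
# Same ordering as label_names: index 0 = entailment, 1 = neutral, 2 = contradiction
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

def to_ids(labels):
    """Map string labels to integer class ids."""
    return [LABEL_TO_ID[label] for label in labels]

# Toy labels for illustration (not taken from the dataset)
pred_ids = to_ids(["neutral", "entailment", "contradiction"])
gold_ids = to_ids(["entailment", "entailment", "contradiction"])
print(pred_ids, gold_ids)
```

Note also that the `f1`, `precision`, and `recall` metrics appear to default to binary averaging, so a three-class evaluation would likely need an explicit `average` setting (for example `"macro"`) passed to each metric's `compute` call.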

In [19]:
from sklearn.metrics import classification_report
#Classification report on r1_test
print("Classification report on r1_test")
print(classification_report(gold_labels_r1, pred_labels_r1))
print("Classification report on r2_test")
print(classification_report(gold_labels_r2, pred_labels_r2))
print("Classification report on r3_test")
print(classification_report(gold_labels_r3, pred_labels_r3))

Classification report on r1_test
               precision    recall  f1-score   support

contradiction       0.75      0.68      0.71       333
   entailment       0.70      0.73      0.71       334
      neutral       0.70      0.73      0.71       333

     accuracy                           0.71      1000
    macro avg       0.71      0.71      0.71      1000
 weighted avg       0.71      0.71      0.71      1000

Classification report on r2_test
               precision    recall  f1-score   support

contradiction       0.55      0.50      0.53       333
   entailment       0.54      0.57      0.55       334
      neutral       0.55      0.57      0.56       333

     accuracy                           0.55      1000
    macro avg       0.55      0.55      0.55      1000
 weighted avg       0.55      0.55      0.55      1000

Classification report on r3_test
               precision    recall  f1-score   support

contradiction       0.51      0.42      0.46       396
   entailment 

## Investigate Errors of the NLI Model
First, we collect every error the baseline model made on each test round; we will then sample 20 of them:

In [20]:
pred_test_r1_errors = [x for x in pred_test_r1 if x['pred_label'] != x['gold_label']]
pred_test_r2_errors = [x for x in pred_test_r2 if x['pred_label'] != x['gold_label']]
pred_test_r3_errors = [x for x in pred_test_r3 if x['pred_label'] != x['gold_label']]
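Before sampling, it can help to see which confusions dominate. The sketch below counts errors by (gold label, predicted label) pair; `confusion_pairs` is a hypothetical helper, and the toy records only imitate the shape of the entries in `pred_test_r1_errors`.

```python
from collections import Counter

def confusion_pairs(errors):
    """Count errors by (gold_label, pred_label) pair."""
    return Counter((e["gold_label"], e["pred_label"]) for e in errors)

# Toy error records for illustration (same keys as the notebook's result dicts)
toy_errors = [
    {"gold_label": "entailment", "pred_label": "contradiction"},
    {"gold_label": "entailment", "pred_label": "neutral"},
    {"gold_label": "neutral", "pred_label": "contradiction"},
    {"gold_label": "entailment", "pred_label": "contradiction"},
]

for (gold, pred), count in confusion_pairs(toy_errors).most_common():
    print(f"gold={gold:<13} pred={pred:<13} count={count}")
```

Applied to the real error lists, this would show, for instance, how often a gold "entailment" is predicted as "contradiction".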

We now print the number of errors in each test round:

In [21]:
print(f"Amount of mistakes in test_r1:\t{len(pred_test_r1_errors)}")
print(f"Amount of mistakes in test_r2:\t{len(pred_test_r2_errors)}")
print(f"Amount of mistakes in test_r3:\t{len(pred_test_r3_errors)}")

Amount of mistakes in test_r1:	288
Amount of mistakes in test_r2:	453
Amount of mistakes in test_r3:	606
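As a consistency check, these error counts can be converted back into accuracies and compared with the classification reports. The split sizes (1000, 1000, and 1200) are taken from the progress bars above.

```python
# Error counts printed above and test-split sizes from the progress bars
errors = {"test_r1": 288, "test_r2": 453, "test_r3": 606}
sizes = {"test_r1": 1000, "test_r2": 1000, "test_r3": 1200}

# accuracy = 1 - (number of errors / number of examples)
accuracy = {name: 1 - errors[name] / sizes[name] for name in errors}

for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
# test_r1: 0.712
# test_r2: 0.547
# test_r3: 0.495
```

The values agree with the rounded accuracies in the reports (0.71 and 0.55), and the R3 figure matches the 0.495 quoted at the top of the notebook.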


In [22]:
seed(0)
samples = sample(pred_test_r1_errors,6)+sample(pred_test_r2_errors,7)+sample(pred_test_r3_errors,7)
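The cell above draws 6 + 7 + 7 = 20 errors after fixing the global seed. A variation is sketched below that uses a private `random.Random` instance, so the draw is reproducible without touching global state; `sample_errors` and the toy lists are hypothetical and stand in for the real error lists.

```python
from random import Random

def sample_errors(errors_by_round, quotas, seed_value=0):
    """Draw quotas[name] errors from each round, reproducibly."""
    rng = Random(seed_value)  # private generator, independent of the global seed
    picked = []
    for round_name, k in quotas.items():
        picked.extend(rng.sample(errors_by_round[round_name], k))
    return picked

# Toy stand-ins for the three error lists (integers instead of result dicts)
toy_errors = {"r1": list(range(100)), "r2": list(range(100, 200)), "r3": list(range(200, 300))}
quotas = {"r1": 6, "r2": 7, "r3": 7}

first = sample_errors(toy_errors, quotas)
second = sample_errors(toy_errors, quotas)
print(len(first), first == second)
```

Note that this draws different items than `seed(0)` followed by `sample(...)` on the global generator would, so it does not reproduce the 20 samples shown below.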

We now pretty-print the 20 samples:

In [23]:
import textwrap

def print_nli_errors(samples, width=100):
    for i, sample in enumerate(samples, 1):
        print(f"\n{'='*width}")
        print(f"🧠 Sample #{i}")
        print(f"{'-'*width}")

        print("📘 Premise:")
        print(textwrap.fill(sample['premise'], width=width))

        print("\n📗 Hypothesis:")
        print(textwrap.fill(sample['hypothesis'], width=width))

        print("\n🔍 Prediction Scores:")
        for label, score in sample['prediction'].items():
            print(f"  {label.capitalize():<13}: {score}%")

        print(f"\n✅ Predicted Label: {sample['pred_label']}")
        print(f"🎯 Gold Label     : {sample['gold_label']}")

        print("\n💡 Reasoning:")
        print(textwrap.fill(sample['reason'], width=width))
        print(f"{'='*width}")

# Then use:
print_nli_errors(samples)



🧠 Sample #1
----------------------------------------------------------------------------------------------------
📘 Premise:
Sully Diaz (July 12, 1960; New York City) is a Spanish actress and singer born to Sephardic parents
from, Puerto Rico. Sully's career started in Puerto Rican television with her first starring role as
Coralito in the "novela" called "Coralito". "Coralito" was her first starring role. Sully was
invited to star in various soap operas in Puerto Rico, Venezuela and Argentina.

📗 Hypothesis:
Sully Diaz was born in the United States.

🔍 Prediction Scores:
  Entailment   : 1.0%
  Neutral      : 0.6%
  Contradiction: 98.3%

✅ Predicted Label: contradiction
🎯 Gold Label     : entailment

💡 Reasoning:
Excerpt states she was born in NYC, which is in the USA. Model maybe got confused by the fact her
parents were from Puerto Rico.

🧠 Sample #2
----------------------------------------------------------------------------------------------------
📘 Premise:
The Stinky Puffs were 

One of our runs produced the following results:

## NLI Model Error Samples

---

### 🧠 Sample #1

**📘 Premise:**
Sully Diaz (July 12, 1960; New York City) is a Spanish actress and singer born to Sephardic parents from Puerto Rico. Sully's career started in Puerto Rican television with her first starring role as Coralito in the "novela" called "Coralito". "Coralito" was her first starring role. Sully was invited to star in various soap operas in Puerto Rico, Venezuela and Argentina.

**📗 Hypothesis:**
Sully Diaz was born in the United States.

**🔍 Prediction Scores:**

* Entailment: 1.0%
* Neutral: 0.6%
* Contradiction: 98.3%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** entailment

**💡 Reasoning:**
Excerpt states she was born in NYC, which is in the USA. Model maybe got confused by the fact her parents were from Puerto Rico.

---

### 🧠 Sample #2

**📘 Premise:**
The Stinky Puffs were an early 90's rock band started by then seven-year-old Simon Fair Timony, then-stepson of Jad Fair, and by Cody Linn Ranaldo, son of Sonic Youth guitarist Lee Ranaldo. After a 7" single an LP followed in 1995 titled "A Little Tiny Smelly Bit of...the Stinky Puffs" and an EP in 1996 titled "Songs and Advice for Kids Who Have Been Left Behind".

**📗 Hypothesis:**
Jad Fair married Simon Fair Timony's mother.

**🔍 Prediction Scores:**

* Entailment: 0.1%
* Neutral: 99.7%
* Contradiction: 0.2%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** entailment

**💡 Reasoning:**
Jad Fair was Simon Fair Timony's stepfather, so he would have married his mother. Difficult because the mother was never mentioned.

---

### 🧠 Sample #3

**📘 Premise:**
.lgbt is a sponsored top-level domain for the LGBT community, sponsored by Afilias. The domain name was delegated to the Root Zone on 18 July 2014. The creation of .lgbt is meant to promote diversity and LGBT businesses, and is open to LGBT businesses, organizations, and anyone wishing to reach the LGBT community.

**📗 Hypothesis:**
The .lgbt domain is for lgbt small businesses.

**🔍 Prediction Scores:**

* Entailment: 1.7%
* Neutral: 98.0%
* Contradiction: 0.2%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** entailment

**💡 Reasoning:**
It never directly said if it was for small businesses or all businesses in general.

---

### 🧠 Sample #4

**📘 Premise:**
Julian Peter McDonald Clary (born 25 May 1959) is an English comedian and novelist. Openly gay, Clary began appearing on television in the mid-1980s and became known for his deliberately stereotypical camp style. Since then he has also acted in films, television and stage productions, and was the winner of "Celebrity Big Brother 10" in 2012.

**📗 Hypothesis:**
Julian Peter McDonald Clary was less than 30 years old when he began his television career.

**🔍 Prediction Scores:**

* Entailment: 3.2%
* Neutral: 0.8%
* Contradiction: 96.0%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** entailment

**💡 Reasoning:**
The statement is true because Clary was around 26 when he started in television. The model was likely confused by vagueness or calculation.

---

### 🧠 Sample #5

**📘 Premise:**
The Best of David Bowie 1974/1979 is a compilation album by David Bowie released in 1998. It includes material released between 1974–1979.

**📗 Hypothesis:**
David Bowie didn't only release an album in 1998 but also in 1979.

**🔍 Prediction Scores:**

* Entailment: 98.6%
* Neutral: 0.7%
* Contradiction: 0.7%

**✅ Predicted Label:** entailment
**🎯 Gold Label:** neutral

**💡 Reasoning:**
We don't know if he released an album in 1979. Maybe the years confused the system.

---

### 🧠 Sample #6

**📘 Premise:**
Gyula Trebitsch (3 November 1914 - 12 December 2005) was a German film producer born in Budapest, Hungary. He was nominated in 1956 for the Academy Award for Best Foreign Language Film along with Walter Koppel for their film "The Captain of Kopenick".

**📗 Hypothesis:**
Gyula Trebitsch was nominated for the Academy Award for Best Foreign Language Film for his work on "The Captain of Kopenick" at the age of 43.

**🔍 Prediction Scores:**

* Entailment: 33.7%
* Neutral: 53.0%
* Contradiction: 13.3%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** contradiction

**💡 Reasoning:**
He was nominated in 1956 and didn't turn 43 until 1957. The system has trouble with ages and birthdays.

---

### 🧠 Sample #7

**📘 Premise:**
The Samsung Galaxy Tab 8.9 is an Android-based tablet computer introduced in 2011. It features an 8.9-inch display and a 1 GHz dual-core Nvidia Tegra 2 processor.

**📗 Hypothesis:**
Samsung Galaxy has about 9 inch display

**🔍 Prediction Scores:**

* Entailment: 26.9%
* Neutral: 15.0%
* Contradiction: 58.1%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** entailment

**💡 Reasoning:**
The display is about 8.9 inches, which is approximately 9 inches.

---

### 🧠 Sample #8

**📘 Premise:**
"May the Bird of Paradise Fly Up Your Nose" is a 1965 novelty song by Little Jimmy Dickens. It was on the chart for a total of 18 weeks.

**📗 Hypothesis:**
"May the Bird of Paradise Fly Up Your Nose" was not on the charts for 6 months

**🔍 Prediction Scores:**

* Entailment: 10.8%
* Neutral: 1.3%
* Contradiction: 87.9%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** entailment

**💡 Reasoning:**
18 weeks is less than 6 months. The hypothesis is true.

---

### 🧠 Sample #9

**📘 Premise:**
Alexandre "Xande" Ribeiro (born January 20, 1981) is a Brazilian Jiu-Jitsu champion.

**📗 Hypothesis:**
Alexandre "Xande" Ribeiro is 38 Years old

**🔍 Prediction Scores:**

* Entailment: 26.8%
* Neutral: 1.2%
* Contradiction: 72.0%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** entailment

**💡 Reasoning:**
Using his birthdate and the assumed evaluation year, he would be 38 years old.

---

### 🧠 Sample #10

**📘 Premise:**
Bronwen is a Welsh feminine given name introduced to the English-speaking public by a character in the novel "How Green Was My Valley" (1939).

**📗 Hypothesis:**
Bronwen was a named based on a novel

**🔍 Prediction Scores:**

* Entailment: 5.3%
* Neutral: 51.1%
* Contradiction: 43.5%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** entailment

**💡 Reasoning:**
It was popularized by the novel; this qualifies as being based on it in context.

---

### 🧠 Sample #11

**📘 Premise:**
On 10 September 2016, a man attacked another man and later sought to attack police.

**📗 Hypothesis:**
The perpetrator sought to attack the police from the beginning

**🔍 Prediction Scores:**

* Entailment: 0.1%
* Neutral: 2.4%
* Contradiction: 97.5%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** neutral

**💡 Reasoning:**
Text says he sought to attack the police, but doesn't say it was from the beginning.

---

### 🧠 Sample #12

**📘 Premise:**
The Panama Canal was completed between 1870–1914. The author David McCullough published a book about it in 1977.

**📗 Hypothesis:**
The Panama Canal was completed before David McCullough was born.

**🔍 Prediction Scores:**

* Entailment: 15.0%
* Neutral: 1.3%
* Contradiction: 83.7%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** neutral

**💡 Reasoning:**
The excerpt doesn’t say when the author was born, so the statement is unconfirmed.

---

### 🧠 Sample #13

**📘 Premise:**
"Anna Sun" is a song from the 2010 album "I Want! I Want!" by Walk the Moon.

**📗 Hypothesis:**
"Anna Sun" is from the 2010 album "I Want! I Want! I Want!"

**🔍 Prediction Scores:**

* Entailment: 73.5%
* Neutral: 2.9%
* Contradiction: 23.5%

**✅ Predicted Label:** entailment
**🎯 Gold Label:** contradiction

**💡 Reasoning:**
Album title is slightly misquoted; it's "I Want! I Want!", not triple.

---

### 🧠 Sample #14

**📘 Premise:**
Pit bulls can make great companions. Care is needed to properly raise them.

**📗 Hypothesis:**
Pit Bulls are a cat’s best friend

**🔍 Prediction Scores:**

* Entailment: 0.0%
* Neutral: 0.3%
* Contradiction: 99.7%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** neutral

**💡 Reasoning:**
No mention is made of their relationship to cats. Not contradicted, just unsupported.

---

### 🧠 Sample #15

**📘 Premise:**
Sandler and Schneider met as struggling comedians in Los Angeles.

**📗 Hypothesis:**
Comedians in Los Angeles struggle

**🔍 Prediction Scores:**

* Entailment: 97.8%
* Neutral: 2.1%
* Contradiction: 0.1%

**✅ Predicted Label:** entailment
**🎯 Gold Label:** neutral

**💡 Reasoning:**
This generalization isn't supported; it was just true for two individuals.

---

### 🧠 Sample #16

**📘 Premise:**
A customer came into a business, ordered and paid for a drink.

**📗 Hypothesis:**
The customer is out of money

**🔍 Prediction Scores:**

* Entailment: 0.1%
* Neutral: 5.5%
* Contradiction: 94.4%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** neutral

**💡 Reasoning:**
No info on remaining money. Payment doesn’t imply being broke.

---

### 🧠 Sample #17

**📘 Premise:**
Section 93 is rooted in 18th-century provisions like the Treaty of Paris (1763).

**📗 Hypothesis:**
Section 93 gives the details of Treaty of Paris of 1763

**🔍 Prediction Scores:**

* Entailment: 0.2%
* Neutral: 99.4%
* Contradiction: 0.4%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** entailment

**💡 Reasoning:**
If Section 93 is rooted in the treaty, it entails having its details.

---

### 🧠 Sample #18

**📘 Premise:**
Beth sings “Birds in their little nests agree” and resolves an argument.

**📗 Hypothesis:**
Beth and Jo detest birds in their little nests

**🔍 Prediction Scores:**

* Entailment: 0.2%
* Neutral: 0.5%
* Contradiction: 99.3%

**✅ Predicted Label:** contradiction
**🎯 Gold Label:** neutral

**💡 Reasoning:**
The quote is metaphorical. There's no evidence about how they feel about birds.

---

### 🧠 Sample #19

**📘 Premise:**
ACE Ltd beat analysts’ profit expectations.

**📗 Hypothesis:**
The analysts were disappointed by the quarterly profits

**🔍 Prediction Scores:**

* Entailment: 39.6%
* Neutral: 56.1%
* Contradiction: 4.3%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** contradiction

**💡 Reasoning:**
If the company beat expectations, analysts would not be disappointed.

---

### 🧠 Sample #20

**📘 Premise:**
Instructions for dressing up a boy like a girl include shaving legs, armpits, and face.

**📗 Hypothesis:**
Have him take a shower and shave his lower arms, upper back, and face.

**🔍 Prediction Scores:**

* Entailment: 0.7%
* Neutral: 92.4%
* Contradiction: 6.9%

**✅ Predicted Label:** neutral
**🎯 Gold Label:** contradiction

**💡 Reasoning:**
The specified body parts do not match. Therefore it contradicts the instruction.


### Reasons for the models to fail
During our runs, we identified five reasons why the model may fail.
#### World-knowledge
In some samples, we observed that the model misses facts a human can recall (or look up) instantly.
For example, in the first sample, the model treated "born in NYC" as *not* evidence of being born in the United States, likely because it fixated on "…parents from Puerto Rico" instead of the city → country mapping.
This was demonstrated in samples 1, 2, 3, 10, and 17.
We believe that the model's inference is inconsistent: explicit city, kinship, or source relations are sometimes overlooked when they are surrounded by a lot of other detail.
#### Numerical & date arithmetic
When even a quick mental calculation was needed, the model's reasoning was brittle.
For example, in sample 4, the model needed to work out that 1959 → mid-1980s is approximately 26 years, but it predicted contradiction instead of entailment.
This appeared in samples 4, 5, 6, 7, 8, and 9.
We therefore believe that rounding ("about 9 inch", sample 7) and "less than"/"greater than" comparisons trip it up, that off-by-one birthday boundaries (sample 6) are a common failure, and that it rarely "counts months" (sample 8) or "years between dates" (sample 5).
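The calculations the model would have needed for samples 4, 6, and 8 are small enough to spell out. The sketch below uses the dates given in the premises; reading "mid-1980s" as 1 July 1985 and six months as roughly 182 days are our assumptions.

```python
from datetime import date

def age_on(birth, day):
    """Full years elapsed between a birth date and a given day."""
    had_birthday = (day.month, day.day) >= (birth.month, birth.day)
    return day.year - birth.year - (0 if had_birthday else 1)

# Sample 4: Julian Clary, born 25 May 1959; "mid-1980s" assumed to be 1 July 1985
clary_age = age_on(date(1959, 5, 25), date(1985, 7, 1))

# Sample 6: Gyula Trebitsch, born 3 November 1914, nominated at some point in 1956
trebitsch_birth = date(1914, 11, 3)
trebitsch_ages = {age_on(trebitsch_birth, date(1956, 1, 1)),
                  age_on(trebitsch_birth, date(1956, 12, 31))}

# Sample 8: 18 weeks on the chart versus six months (assumed ~182 days)
chart_days = 18 * 7
six_months_days = 182

print(clary_age)                      # 26, so "less than 30" holds
print(sorted(trebitsch_ages))         # [41, 42], never 43
print(chart_days < six_months_days)   # True: 126 days is under six months
```

In each case the gold label follows from one subtraction or comparison, which is what makes these errors stand out.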
#### Scope & quantifier problems
We observed that in some samples (for example, 5, 11, 12, 14, and 20) the model made mistakes where qualifiers such as **"only", "all", "from the beginning",** or **"about"** changed the meaning.
For example, in sample 5, the model interpreted the presence of any 1974-79 material as evidence of a 1979 release.
#### Near-duplicate strings
Minor lexical edits or typos mislead the model. This was clearly shown in sample 13, where the model did not seem to distinguish between "I Want! I Want! **I Want!**" and "I Want! I Want!"
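One way to see why such pairs are tricky is to compare surface similarity with exact equality. The sketch below uses Python's standard `difflib` on the two titles from sample 13; it is only an illustration of the mismatch, not a description of how the model compares strings.

```python
from difflib import SequenceMatcher

premise_title = "I Want! I Want!"
hypothesis_title = "I Want! I Want! I Want!"

# Ratio in [0, 1]; higher means the two strings share more characters in order
similarity = SequenceMatcher(None, premise_title, hypothesis_title).ratio()
exact_match = premise_title == hypothesis_title

print(f"similarity={similarity:.2f}, exact_match={exact_match}")
```

The titles overlap heavily at the character level yet are not the same title, which is exactly the distinction the gold label "contradiction" depends on.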
#### Pragmatic/commonsense slips
In some samples, the model showed a lack of common sense beyond the explicit text. For example, in sample 19, it missed the causal inference that "beat analysts' estimates" ⇒ analysts not disappointed (a quarterly beat is a positive surprise).
This was shown in samples 15, 16, 18, and 19.

### Error Observations Table (Task 1.2)

| Error Type                  | Description                                                                 | Examples (Sample #s) |
|-----------------------------|-----------------------------------------------------------------------------|----------------------|
| World-knowledge             | The model misses real-world facts (e.g., city-country mappings or kinship relations) due to fixation on irrelevant details, leading to inconsistent inference. | 1, 2, 3, 10, 17     |
| Numerical & date arithmetic | Brittle handling of calculations like ages, durations, or approximations (e.g., off-by-one errors, failure to round or compare dates/quantities). | 4, 5, 6, 7, 8, 9    |
| Scope & quantifier problems | Misinterprets qualifiers like "only", "all", "from the beginning", or "about", altering the intended scope or meaning. | 5, 11, 12, 14, 20   |
| Near-duplicate strings      | Confused by minor lexical changes or typos, treating near-matches as equivalents (e.g., slight title variations). | 13                  |
| Pragmatic/commonsense slips | Lacks implicit causal reasoning or common sense beyond explicit text (e.g., implications of "beating expectations" or generalizations). | 15, 16, 18, 19      |