#### CommitmentBank (CB): Overview
The **CommitmentBank (CB)** is a small corpus of short texts in which at least one sentence contains an embedded clause, annotated for how strongly the writer or speaker appears committed to the truth of that clause. In the SuperGLUE benchmark it is framed as a three-class **Natural Language Inference (NLI)** task: given a premise (a short passage or dialogue excerpt) and a hypothesis (the embedded clause), a model must decide whether the premise entails the hypothesis, contradicts it, or leaves it undetermined.

#### Key Points About CommitmentBank (CB):
- **Logical Entailment**: The model is asked to determine if the hypothesis logically follows from the premise.  
  Example:  
  - **Premise**: "John is walking in the park."  
  - **Hypothesis**: "John is outdoors."  
  - The model should recognize that the hypothesis logically follows from the premise (**entailment**).

- **Contradiction**: The model is also asked to identify when the hypothesis contradicts the premise, meaning the hypothesis cannot be true if the premise is true.  
  Example:  
  - **Premise**: "It is raining outside."  
  - **Hypothesis**: "The ground is dry."  
  - The model should recognize that this is a contradiction, as rain implies wet ground.

- **Neutrality**: The model should determine when the hypothesis neither follows nor contradicts the premise, i.e., it is independent or neutral.  
  Example:  
  - **Premise**: "The sun is shining."  
  - **Hypothesis**: "The grass is green."  
  - This is neutral: the premise neither supports nor contradicts the hypothesis.

#### Purpose of the Dataset:
- **Assessing Reasoning Abilities**: The main aim is to assess a model's reasoning capabilities, particularly in understanding **entailment** and **contradiction** relationships.
- **Evaluating Language Understanding**: It tests how well the model can interpret and make judgments about relationships between two sentences, which is a key component of **Natural Language Inference (NLI)** tasks.

#### Usage:
- **Model Evaluation**: CB can be used to benchmark a model's logical reasoning skills, especially in tasks where **sentence pair relationships** must be determined (e.g., in **NLI** tasks).
- **Fine-tuning**: You can fine-tune pre-trained models on datasets like CommitmentBank to improve their performance on entailment and reasoning tasks.

#### Summary:
**CommitmentBank (CB)** is a small, three-class NLI benchmark for checking whether a model can tell entailment, contradiction, and neutrality apart, a skill that many NLP applications involving sentence understanding and inference depend on.


In [1]:
from datasets import load_dataset

In [2]:
# Load the CommitmentBank (CB) dataset
dataset = load_dataset("super_glue", "cb")

In [4]:
# Access train, validation, and test splits
train_data      = dataset['train']
validation_data = dataset['validation']
test_data       = dataset['test']

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 250
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 56
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'idx', 'label'],
        num_rows: 250
    })
})

- The CommitmentBank (CB) dataset is small:

    - Train: 250 examples
    - Validation: 56 examples
    - Test: 250 examples

- With so few examples, CB is better suited to evaluating or fine-tuning a pre-trained model than to training one from scratch.

- The gold labels of the SuperGLUE test split are withheld, so metrics are normally reported on the validation split.

- As one of the SuperGLUE tasks, CB is commonly used to check how well a pre-trained model transfers to a small, difficult inference dataset.
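
When benchmarking on CB, SuperGLUE reports accuracy together with F1 averaged over the three classes without weighting. A minimal pure-Python sketch of both metrics follows; the function names and the toy label lists are illustrative assumptions, not part of the notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never gold and never predicted scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical), using ids 0, 1 and 2 for the three CB classes
gold = [0, 1, 2, 0]
pred = [0, 1, 1, 0]
print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

Macro averaging gives each class the same weight, which matters on a dataset this small where one class can be rare.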

In [6]:
import pandas as pd

# Set display options for Pandas
pd.set_option('display.max_colwidth', None)  # No truncation of column content
pd.set_option('display.width', None)         # No truncation of DataFrame display width

In [7]:
# Convert the train, validation, and test splits to Pandas DataFrames
train_df      = train_data.to_pandas()
validation_df = validation_data.to_pandas()
test_df       = test_data.to_pandas()

In [9]:
train_df.sample(10)

|  | premise | hypothesis | idx | label |
|---|---|---|---|---|
| 83 | This was a sheer waste of time. He would probably land and then tell them to walk back. When she glanced at him again he looked very grim and she wondered if she should have told Mitch that he might well have a lot of opportunity to photograph Spain - on foot as he walked back to Malaga. | Mitch might well have a lot of opportunity to photograph Spain | 83 | 0 |
| 220 | A: So, we're comparable. B: Yeah. A: As a matter of fact, I just paid my Richardson taxes because I live in Richardson and supplemented the Robin Hoods very thoroughly, I think. B: Yeah, I think Yeah, we have got it on the line, don't we. | they have got it on the line | 220 | 0 |
| 78 | How do you know? she was going to ask, but his smile was answer enough. If DeVore said there was going to be a vacancy there would be a vacancy. | there was going to be a vacancy | 78 | 0 |
| 175 | A: I spend a lot of time reading about these things. I'm quite interested. I find it very exciting for the coverage we have now, today. B: Yes and I think we do get pretty good coverage. I don't feel that the American people is being shortchanged by uh, the news coverage. | the American people are being shortchanged by the news coverage | 175 | 1 |
| 35 | Miss Martindale had had a school, but her rigid ideas and stern manner had frightened the children, and their parents had taken them away. And gradually the school declined, until she had to give it up and retire to end her days in the white cottage with the inevitable cat as her only companion. Breeze had never imagined that digging was such hard work. | digging was such hard work | 35 | 0 |
| 234 | B: I do not know. I wonder where he gets it? You know, you must, I think TV is bad. Because they, uh, show all sorts of violence on, A: That and I do not think a lot of parents, I mean, I do not know how it is in the Air Force base. But, uh, I just do not think a lot of people, because of the economy, both need to work, you know. I just do not think a lot of parents are that involved any more. | a lot of parents are that involved | 234 | 1 |
| 231 | B: No, it was, I didn't like the way it ended. A: I know, well the only reason I know why it ended is on Arsenio Hall one night, Christopher Reeves told, that, you know, B: Uh-huh. A: I can't believe they killed them. | they killed them | 231 | 0 |
| 60 | I'm sorry, I 've put you in an invidious position. If you're being run by Morton, he 'll want to hear all this. It won't do any harm but I 'd rather not give him food for thought because I consider him an idiot and I don't think he's capable of interpreting it correctly. | Morton is capable of interpreting this food for thought correctly | 60 | 1 |
| 188 | B: Right, you know, like In packaging A: Yeah. B: and, uh, you know, just goodness. A: Yeah, I don't think they do the packaging at this plant, | they do the packaging at this plant | 188 | 1 |
| 134 | B: Yeah. Well, that's the guy that counts. A: Yes. But, maybe we'll get your guy. B: Oh, I don't think Jim Kelly is about to be swayed away from the Bills any time. | Jim Kelly is about to be swayed away from the Bills any time | 134 | 1 |


#### Output Columns in the DataFrames

The **CommitmentBank (CB)** dataset includes the following fields:

- **premise**: A short passage or dialogue excerpt that provides the context.
- **hypothesis**: The statement to judge against the premise.
- **label**: 
  - `0` for Entailment  
  - `1` for Contradiction  
  - `2` for Neutral  
- **idx**: A unique identifier for each example.
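
Integer labels are easier to read once decoded. The integer-to-name order in the `super_glue`/`cb` configuration is entailment, contradiction, neutral, which agrees with sampled row 175 above, where a statement the speaker denies carries label 1. The sketch below hard-codes that order; the helper name `decode_label` is hypothetical, and once the dataset is loaded the same names can be read from `dataset['train'].features['label'].names`.

```python
# Label order assumed from the super_glue "cb" configuration:
# 0 = entailment, 1 = contradiction, 2 = neutral.
CB_LABELS = ("entailment", "contradiction", "neutral")


def decode_label(label_id):
    """Return the label name for a CB label id (hypothetical helper)."""
    if label_id == -1:
        # Guard for rows without a gold label (stored as -1); without this
        # check, Python's negative indexing would silently return "neutral".
        return "unlabeled"
    return CB_LABELS[label_id]


# (idx, label) pairs copied from the sampled training rows shown above
sampled = [(83, 0), (175, 1), (188, 1)]
for idx, label_id in sampled:
    print(idx, decode_label(label_id))
```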


#### Pretrained Model

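- The code below loads `textattack/distilbert-base-uncased-RTE`. As its name indicates, this is a DistilBERT checkpoint fine-tuned on RTE (Recognizing Textual Entailment), a two-class task (entailment vs. not entailment) rather than CB's three-class task.
- Predicting all three CB labels requires a checkpoint fine-tuned on CB itself; `textattack/bert-base-uncased-SuperGLUE-CB` is the name given for such a BERT model, but it is not the one loaded below.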

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [12]:
import os
access_token = os.environ.get("HF_TOKEN")  # read the token from the environment; never hard-code secrets

In [16]:
# Load the tokenizer and model
model_name = "textattack/distilbert-base-uncased-RTE"
tokenizer  = AutoTokenizer.from_pretrained(model_name)
model      = AutoModelForSequenceClassification.from_pretrained(model_name, token=access_token)

In [27]:
# Dummy data (Premise and Hypothesis)
premise    = "The sky is blue."
hypothesis = "The sky is not green."

In [28]:
# Tokenize the inputs
inputs = tokenizer(premise, hypothesis, return_tensors="pt", padding=True, truncation=True)


In [29]:
# Perform the inference (model's forward pass) without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

In [30]:
# Get the logits (raw scores)
logits = outputs.logits

In [31]:
# Convert logits to probabilities (softmax)
probabilities = torch.nn.functional.softmax(logits, dim=-1)
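
To make the softmax step concrete, here is a dependency-free sketch of the same computation on a made-up pair of logits. The values and the helper name `softmax` are assumptions for illustration, not output from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [1.2, -0.1]  # hypothetical raw scores for two classes
probs = softmax(example_logits)
print(probs)
```

Softmax preserves the ordering of the scores, so the class with the largest logit is also the class with the largest probability.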

In [32]:
# Get the predicted class index.
# Note: the checkpoint loaded above is an RTE model, and RTE is a two-class task,
# so the index does not map onto CB's three labels; check model.config.id2label
# or the model card for what each index means.
pred_class = logits.argmax(dim=-1).item()

# Get prediction score (confidence)
score = probabilities.max().item()

# Print results
print(f"Premise: {premise}")
print(f"Hypothesis: {hypothesis}")
print(f"Predicted Class: {pred_class}")
print(f"Prediction Score: {score}")

Premise: The sky is blue.
Hypothesis: The sky is not green.
Predicted Class: 0
Prediction Score: 0.7878133058547974
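
Because `textattack/distilbert-base-uncased-RTE` is a two-class RTE checkpoint, its predictions cannot be compared directly with CB's three gold labels. One possible workaround, sketched below, is to collapse CB's contradiction and neutral labels into a single not-entailment class before scoring. The index order assumed for the RTE model (0 = entailment, 1 = not entailment) follows the GLUE convention and should be verified against the model card; the function names and toy values are hypothetical.

```python
# CB gold labels (super_glue "cb"): 0 = entailment, 1 = contradiction, 2 = neutral.
# ASSUMPTION: the RTE checkpoint uses 0 = entailment, 1 = not_entailment.
RTE_ENTAILMENT = 0
RTE_NOT_ENTAILMENT = 1


def cb_to_binary(cb_label):
    """Collapse a three-class CB label into the two-class RTE label space."""
    return RTE_ENTAILMENT if cb_label == 0 else RTE_NOT_ENTAILMENT


def binary_accuracy(cb_gold, rte_pred):
    """Accuracy of two-class predictions against collapsed CB gold labels."""
    gold = [cb_to_binary(g) for g in cb_gold]
    correct = sum(g == p for g, p in zip(gold, rte_pred))
    return correct / len(gold)


cb_gold = [0, 1, 2, 0]    # toy CB gold labels
rte_pred = [0, 1, 0, 0]   # toy two-class model outputs
print(binary_accuracy(cb_gold, rte_pred))
```

This only measures how well the model separates entailment from everything else; telling contradiction from neutral still requires a model trained on three classes.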
