# Pre-trained Language Models: SubTask B

This project implements SubTask B of the ComVE (SemEval-2020) challenge. The goal is to identify which of three candidate explanations best describes why a given nonsensical statement violates commonsense. Each sample contains:

* One nonsensical statement.
* Three possible reasons (A, B, C).
* A label indicating the correct reason.

Example:

* **Statement:** *He put an elephant into the fridge.*
* **Reason A:** An elephant is much bigger than a fridge.
* **Reason B:** Elephants are usually gray while fridges are usually white.
* **Reason C:** An elephant cannot eat a fridge.
* **Correct Label:** A

This is framed as a multiple-choice classification task using a pre-trained language model.

The implementation uses the Hugging Face **Transformers** ecosystem to fine-tune **RoBERTa**, chosen for its strong performance and stable pre-training recipe. In resource-constrained environments, a reduced dataset and the base RoBERTa model keep training manageable.


In [4]:
shrink_dataset = False
base_model = False
colab = True

The example outputs in the notebook were based on `shrink_dataset=True`, `base_model=True`, and `colab=False`. These settings did not affect the test cases but were used to generate the reference outputs.

Full training required a cloud environment such as **Google Colab**, since GPU or TPU acceleration was necessary for large-scale fine-tuning. For this setup, `shrink_dataset` and `base_model` were set to `False`, and `colab` was set to `True`, following the execution steps provided for running the notebook on Colab.

Running the notebook in Colab also required uploading the `datacollator.py` file from the assignment repository.

In [5]:
if colab:
    ! pip install transformers==4.28.0 datasets evaluate
    import os
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data



The following objects and functions were used:

In [6]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMultipleChoice,
                          TrainingArguments, Trainer,
                          enable_full_determinism)
from datacollator import DataCollatorForMultipleChoice

Randomized operations in neural network training, such as weight initialization, data shuffling, and sample selection, introduced variability across runs. To make runs reproducible, the random number generators were initialized with a fixed seed. In Transformers, the seed was set before training so that repeated runs in the same environment produced the same results.

In [7]:
enable_full_determinism(seed=42)
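
The effect of a fixed seed can be illustrated with a minimal sketch that uses only Python's standard `random` module. The `shuffled_indices` helper is hypothetical and stands in for data shuffling during training; it is not part of Transformers.

```python
import random


def shuffled_indices(seed, n=10):
    """Return the indices 0..n-1 shuffled by a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, global state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first_run = shuffled_indices(seed=42)
second_run = shuffled_indices(seed=42)
print(first_run == second_run)  # True: same seed, same shuffle order
```

`enable_full_determinism` applies the same idea to the random number generators that training relies on.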

Complex neural network models were still susceptible to minor reproducibility shifts caused by differences in software versions and hardware, so identical results were not guaranteed across environments even with a fixed seed. Model configuration depended on a set of hyperparameters, and identifying suitable values typically required multiple training and evaluation cycles on the development set. Because hyperparameter tuning was computationally expensive, this project relied on predetermined values.

In [8]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelB"  # The output directory where the model will be written to
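
These values determine how many optimizer steps a full run takes. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; under those assumptions the result agrees with the `global_step=3750` reported in the training output of the full run.

```python
import math

train_examples = 9997   # size of the ComVE training split used in this notebook
train_batch_size = 8
epochs = 3

# The last batch of each epoch is smaller than 8, so round up.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1250 3750
```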

## Loading the Pre-trained Model 

The process began by loading a pre-trained model and its tokenizer. The Transformers library provided a dedicated class for each supported architecture, but the AutoClasses offered a more flexible mechanism that retrieved the right class directly from a model name or path. For downstream tasks, the pre-training head had to be replaced with a task-specific output layer, and the AutoClasses handled this automatically. For multiple-choice reasoning, `AutoModelForMultipleChoice` initialized the model with the appropriate top layer. The `load_model` function was completed to load the selected pre-trained model and return both the model and its tokenizer.

In [9]:
def load_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMultipleChoice.from_pretrained(model_name)
    return model, tokenizer
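
As a rough sketch of what the multiple-choice top layer computes: each statement-reason pair receives one scalar score, the three scores of an example are normalized with a softmax, and training minimizes the cross-entropy against the correct choice. The scores below are hypothetical, and the helper functions are plain-Python stand-ins rather than library code.

```python
import math


def softmax(scores):
    """Turn raw choice scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the correct choice."""
    return -math.log(softmax(scores)[label])


scores = [1.5, 3.0, -0.5]  # hypothetical scores for reasons A, B, C
probs = softmax(scores)
print(probs.index(max(probs)))         # index of the most likely reason
print(cross_entropy(scores, label=1))  # small when the correct reason scores highest
```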

In [10]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForMultipleChoice: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.weight', 'classifier.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream

## Data Pre-processing 

The ComVE dataset contained 9997 nonsensical statements with three candidate reasons in the training split, 997 in the development split, and 1000 in the test split. Each statement was annotated with a label A, B, or C indicating the correct explanation. The dataset was loaded into three DataFrames corresponding to train, development, and test.

In [11]:
def load_data(data_csv, answers_csv, labels):
    data = pd.read_csv(data_csv).dropna()
    answers = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    answers["label"] = answers["label"].apply(lambda x: labels.index(x))
    return pd.merge(data, answers, on="id")

In [12]:
labels = ["A", "B", "C"]
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskB_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv, labels)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskB_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv, labels)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskB_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv, labels)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

| | id | FalseSent | OptionA | OptionB | OptionC | label |
|---|---|---|---|---|---|---|
| 0 | 0 | He poured orange juice on his cereal. | Orange juice is usually bright orange. | Orange juice doesn't taste good on cereal. | Orange juice is sticky if you spill it on the ... | 1 |
| 1 | 1 | He drinks apple. | Apple juice are very tasty and milk too | Apple can not be drunk | Apple cannot eat a human | 1 |
| 2 | 2 | Jeff ran 100,000 miles today | 100,000 miles is way to long for one person to... | Jeff is a four letter name and 100,000 has six... | 100,000 miles is longer than 100,000 km. | 0 |
| 3 | 3 | I sting a mosquito | A human is a mammal | A human is omnivorous | A human has not stings | 2 |
| 4 | 4 | A giraffe is a person. | Giraffes can drink water from a lake. | A giraffe is not a human being. | .Giraffes usually eat leaves. | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 9992 | 9995 | Mark ate a big bitter cherry pie | Mark is bad at making cherry pie | a cherry pie should be big | a cherry pie should be sweet | 2 |
| 9993 | 9996 | Gloria wears a cat on her head | a hat cannot be worn on a cat's head | a cat cannot be worn on a person's head | the cat is too heavy to be worn on her head | 1 |
| 9994 | 9997 | Harry went to the barbershop to have his glass... | a barbershop usually don't provide the service... | a barbershop usually repairs computers instead... | the barbershop lacked the necessary tools to r... | 0 |
| 9995 | 9998 | Reilly is sleeping on the window | the window is open and a person cannot lay on it | the window is too cold to sleep on it | a person cannot sleep on a window | 2 |


The `load_data` function converted the original labels A, B, and C into numerical indices 0, 1, and 2. The **Datasets** library was used to handle the data efficiently. It provided a `Dataset` class built on Apache Arrow, storing each example as a row and each field as a column, similar to a pandas DataFrame. Data could be loaded into a `Dataset` from multiple sources, including directly from a DataFrame.

In [13]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]

{'id': 0,
 'FalseSent': 'He poured orange juice on his cereal.',
 'OptionA': 'Orange juice is usually bright orange.',
 'OptionB': "Orange juice doesn't taste good on cereal.",
 'OptionC': 'Orange juice is sticky if you spill it on the table.',
 'label': 1}

The `map` function in the **Datasets** library was used to pre-process the dataset in batches. It accepted a callable and applied it to every row in the `Dataset`. The task required implementing a `preprocess_data` function to tokenize the statement–reason pairs that were later passed to `map`.

The `preprocess_data` function took a batch of examples, the tokenizer returned by `load_model`, and the `max_length` hyperparameter. For each input example, the function created three copies of the nonsensical statement in the `FalseSent` field and paired each copy with one of the three reasons in `OptionA`, `OptionB`, and `OptionC`. These statement–reason pairs were then tokenized jointly. The tokenizer padded and truncated all sequences to the specified `max_length` value, following the behavior described in the Transformers preprocessing and tokenizer documentation.

The tokenizer returned a `BatchEncoding` object containing two fields for each tokenized pair:

* `input_ids`, the token indices used as model input.
* `attention_mask`, the mask indicating which tokens the model attended to.
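
The relationship between the two fields can be sketched in plain Python. This is an illustration only: the real tokenizer builds the mask while it pads, rather than by comparing ids, and the shortcut below is equivalent only when the padding id never occurs as a real token. The padding id of 1 is taken from the padded RoBERTa sequences printed in this notebook; the shortened `ids` sequence is hypothetical.

```python
PAD_ID = 1  # RoBERTa's <pad> id, as seen in the padded sequences in this notebook


def attention_mask_from_ids(input_ids, pad_id=PAD_ID):
    """Mark real tokens with 1 and padding positions with 0."""
    return [0 if token_id == pad_id else 1 for token_id in input_ids]


ids = [0, 894, 13414, 2, 2, 37264, 2, 1, 1, 1]  # shortened, hypothetical sequence
print(attention_mask_from_ids(ids))  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```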

After tokenization, the `preprocess_data` function unflattened the results so that, for each example, `input_ids` became a list of three tokenized sequences (one for each choice), and `attention_mask` became a parallel list of three attention masks. This step followed the structure required for Multiple Choice tasks in Transformers. The output of this unflattening step was returned by the function.
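
The flatten and unflatten steps can be shown in isolation, without a tokenizer. The sketch below uses two rows from the training data; `flatten_pairs` and `unflatten` are hypothetical helper names introduced only for this illustration.

```python
def flatten_pairs(batch):
    """Expand each example into three (statement, reason) pairs, in A, B, C order."""
    pairs = []
    for statement, a, b, c in zip(batch["FalseSent"], batch["OptionA"],
                                  batch["OptionB"], batch["OptionC"]):
        pairs.extend([(statement, a), (statement, b), (statement, c)])
    return pairs


def unflatten(flat, num_choices=3):
    """Regroup a flat list so that each example owns `num_choices` consecutive items."""
    return [flat[i:i + num_choices] for i in range(0, len(flat), num_choices)]


batch = {
    "FalseSent": ["He drinks apple.", "I sting a mosquito"],
    "OptionA": ["Apple juice are very tasty and milk too", "A human is a mammal"],
    "OptionB": ["Apple can not be drunk", "A human is omnivorous"],
    "OptionC": ["Apple cannot eat a human", "A human has not stings"],
}

pairs = flatten_pairs(batch)   # 6 pairs: 2 examples x 3 reasons
grouped = unflatten(pairs)     # 2 groups of 3 pairs
print(len(pairs), len(grouped), len(grouped[0]))  # 6 2 3
```

In `preprocess_data` the same regrouping is applied to the tokenizer output instead of to the raw pairs.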

The `map` function inserted the resulting `input_ids` and `attention_mask` into the `Dataset` as new columns. For example, the first dataset row contained three tokenized sequences, one for each statement–reason pair, stored under `input_ids`, and three corresponding attention masks under `attention_mask`.

Each value in the inner lists of `input_ids` represented a subword token from the tokenizer’s vocabulary. The example below included three separate token sequences, each pairing the given nonsensical statement with one of the possible explanations.

The tokenizer followed the conventions of RoBERTa, where `<s>` served the role of BERT’s `[CLS]` token and `</s>` marked sentence boundaries. The prefix `Ġ` indicated the presence of a preceding whitespace in the original text, allowing the model to distinguish the first subword of each word.


> <pre>
{'id': 4122, 'FalseSent': 'You are likely to find a computer in the bathroom', 'OptionA': 'The computer needs to take a shower in the bathroom', 'OptionB': 'The computer may be broken in the bathroom', 'OptionC': "The computer won't walk into the bathroom", 'label': 1, '__index_level_0__': 4122, 'input_ids': [[0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 782, 7, 185, 10, 9310, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 189, 28, 3187, 11, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 1185, 32, 533, 7, 465, 10, 3034, 11, 5, 8080, 2, 2, 133, 3034, 351, 75, 1656, 88, 5, 8080, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
</pre>

For the example above, `input_ids` corresponded to the following three sequences of subwords:

> <pre>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġneeds', 'Ġto', 'Ġtake', 'Ġa', 'Ġshower', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġmay', 'Ġbe', 'Ġbroken', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
>
> ['&lt;s&gt;', 'You', 'Ġare', 'Ġlikely', 'Ġto', 'Ġfind', 'Ġa', 'Ġcomputer', 'Ġin', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;/s&gt;', 'The', 'Ġcomputer', 'Ġwon', "'t", 'Ġwalk', 'Ġinto', 'Ġthe', 'Ġbathroom', '&lt;/s&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;', '&lt;pad&gt;']
</pre>
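
The `Ġ` convention can be illustrated by turning the statement's tokens back into text. This is a simplified sketch: it handles only the whitespace marker and the special tokens shown above, and only a single segment, whereas real decoding also reverses RoBERTa's byte-level encoding and is normally left to the tokenizer itself. The `tokens_to_text` helper is hypothetical.

```python
def tokens_to_text(tokens, special=("<s>", "</s>", "<pad>")):
    """Drop special tokens, join subwords, and restore spaces marked by 'Ġ'."""
    subwords = [token for token in tokens if token not in special]
    return "".join(subwords).replace("Ġ", " ")


statement_tokens = ["<s>", "You", "Ġare", "Ġlikely", "Ġto", "Ġfind", "Ġa",
                    "Ġcomputer", "Ġin", "Ġthe", "Ġbathroom", "</s>"]
print(tokens_to_text(statement_tokens))
# You are likely to find a computer in the bathroom
```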


In [14]:
def preprocess_data(examples, tokenizer, max_length):
    # Build three (statement, reason) pairs per example, in A, B, C order
    pairs = []
    for false_sent, option_a, option_b, option_c in zip(examples['FalseSent'], examples['OptionA'], examples['OptionB'], examples['OptionC']):
        pairs.extend([(false_sent, option_a), (false_sent, option_b), (false_sent, option_c)])
    # Tokenize all pairs jointly, padding and truncating every sequence to max_length
    tokenized_pairs = tokenizer(pairs, padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')
    input_ids = tokenized_pairs['input_ids']
    attention_mask = tokenized_pairs['attention_mask']
    # Unflatten: regroup every three consecutive sequences into one example
    examples['input_ids'] = [input_ids[i:i+3].tolist() for i in range(0, len(input_ids), 3)]
    examples['attention_mask'] = [attention_mask[i:i+3].tolist() for i in range(0, len(attention_mask), 3)]
    return examples

In [15]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print("")
for seq in train_dataset[0]["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(seq))
    print("")

{'id': 0, 'FalseSent': 'He poured orange juice on his cereal.', 'OptionA': 'Orange juice is usually bright orange.', 'OptionB': "Orange juice doesn't taste good on cereal.", 'OptionC': 'Orange juice is sticky if you spill it on the table.', 'label': 1, 'input_ids': [[0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 16, 2333, 4520, 8978, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 630, 75, 5840, 205, 15, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 37264, 10580, 16, 25247, 114, 47, 10923, 24, 15, 5, 2103, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## Fine-tuning 

Fine-tuning was carried out using the Transformers **Trainer** API, which supported efficient model adaptation without custom training loops. Training behavior was controlled through **TrainingArguments**, which exposed the required hyperparameters and evaluation settings. The task required implementing the `create_training_arguments` function to construct a `TrainingArguments` object using the provided `epochs`, `train_batch_size`, `learning_rate`, and `output_dir` values. The configuration included evaluation on the development set at the end of each epoch and disabled intermediate checkpointing by setting `save_strategy="no"`.


In [16]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        evaluation_strategy="epoch",
        learning_rate=learning_rate,
        save_strategy="no"
    )
    return training_args

In [17]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)

Next, a `Trainer` instance was created using the previously defined training arguments. Because Multiple Choice tasks require handling batches where each example contains a list of candidate sequences, the trainer relied on a custom `DataCollatorForMultipleChoice` provided with the notebook. The `create_trainer` function was completed to accept the model from `load_model`, the `TrainingArguments` produced by `create_training_arguments`, the train and development `Datasets`, and the tokenizer required by the data collator. The function instantiated and returned a `Trainer` configured with the model, the specified arguments, the custom data collator, and the appropriate train and development splits for training and epoch-level evaluation.

In [18]:
def create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer):
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        data_collator=data_collator,
    )
    return trainer
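
The contents of `datacollator.py` are not reproduced in this notebook, so the following is only a plain-Python sketch of the batching idea, not the provided implementation. It assumes every example is already padded to the same length, as `preprocess_data` ensures, and that the collator renames `label` to the `labels` key that Transformers models accept. The `collate_multiple_choice` and `shape` helpers and the toy features are hypothetical.

```python
def collate_multiple_choice(features):
    """Stack already-padded examples into one batch of nested lists."""
    return {
        "input_ids": [f["input_ids"] for f in features],
        "attention_mask": [f["attention_mask"] for f in features],
        "labels": [f["label"] for f in features],
    }


def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Two hypothetical examples, three choices each, four token ids per choice.
features = [
    {"input_ids": [[0, 5, 2, 1]] * 3, "attention_mask": [[1, 1, 1, 0]] * 3, "label": 1},
    {"input_ids": [[0, 7, 9, 2]] * 3, "attention_mask": [[1, 1, 1, 1]] * 3, "label": 2},
]
batch = collate_multiple_choice(features)
print(shape(batch["input_ids"]), batch["labels"])  # (2, 3, 4) [1, 2]
```

The resulting layout is (batch size, number of choices, sequence length), with one label per example.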

In [19]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset, tokenizer)

The `trainer` object created by `create_trainer` was then ready to fine-tune the model with a single call:

In [20]:
trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.3194 | 0.292553 |
| 2 | 0.182 | 0.355037 |
| 3 | 0.0718 | 0.388304 |


TrainOutput(global_step=3750, training_loss=0.2177707056681315, metrics={'train_runtime': 3396.5205, 'train_samples_per_second': 8.83, 'train_steps_per_second': 1.104, 'total_flos': 8188318090403100.0, 'train_loss': 0.2177707056681315, 'epoch': 3.0})

After training, the model was used to generate predictions on the test set through the `Trainer.predict` method. The `make_predictions` function was completed to accept a `Trainer` instance and a test `Dataset`, run `predict` on the provided data, and extract the logits from the returned `NamedTuple`. Each logits vector contained one value per candidate reason, and only the index of the maximum logit was kept. This index was obtained by applying `numpy.argmax` along the last axis of the logits array, producing one predicted label index per input example.

In [21]:
import numpy as np

def make_predictions(trainer, test_dataset):
    # Get the predictions
    predictions = trainer.predict(test_dataset)
    # Extract the logits
    logits = predictions.predictions
    # Find the index of the highest logit value for each example
    predicted_labels = np.argmax(logits, axis=-1)
    return predicted_labels
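
The argmax step can be traced by hand on a few rows. The logits below are hypothetical values, not model output, and `argmax_rows` is a plain-Python stand-in for `numpy.argmax(logits, axis=-1)`.

```python
def argmax_rows(logits):
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


labels = ["A", "B", "C"]
logits = [
    [2.1, -0.3, 0.4],   # highest score at index 0
    [-1.0, 3.2, 0.5],   # highest score at index 1
    [0.2, 0.1, 1.7],    # highest score at index 2
]
predicted = argmax_rows(logits)
print(predicted)                       # [0, 1, 2]
print([labels[i] for i in predicted])  # ['A', 'B', 'C']
```

Mapping the indices back through `labels` recovers the original A, B, and C annotations.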

In [22]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

| | id | FalseSent | OptionA | OptionB | OptionC | label | prediction |
|---|---|---|---|---|---|---|---|
| 0 | 1175 | He loves to stroll at the park with his bed | A bed is too heavy to carry with when strollin... | walking at a park is good for health | Some beds are big while some are smaller | 0 | 0 |
| 1 | 452 | The inverter was able to power the continent. | An inverter is smaller than a car | An inverter is incapable of powering an entire... | An inverter is rechargeable. | 1 | 1 |
| 2 | 275 | The chef put extra lemons on the pizza. | Many types of lemons are to sour to eat. | Lemons and pizzas are both usually round. | Lemons are not a pizza topping. | 2 | 2 |
| 3 | 869 | sugar is used to make coffee sour | sugar is white while coffee is brown | sugar can dissolve in the coffee | sugar usually is used as a sweetener | 2 | 2 |
| 4 | 50 | There are beautiful planes here and there in t... | A plane flies upon the garden | You can have a small garden in your private plane | A plane can never be seen in garden | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 1114 | If it is a sunny day, you would got wet. | Usually a sunny day don't cause to wet. | People prefer to walk during sunny day. | People feel mess if they are wet. | 0 | 0 |
| 996 | 8 | ice hockey is a financial institution | Children's playing ice hockey requires financi... | Playing ice hockey well can bring you money | There are no relationships between ice hockey ... | 2 | 2 |
| 997 | 1945 | He put water without a container in the freeze... | Water and containers are two different element... | water cannot be in the freezer without a conta... | Water more deep in a container cannot always b... | 1 | 1 |
| 998 | 1053 | The desert has sand that you can drink. | Water is not the same color as the sand. | Sand is solid and inedible. | The desert has lots of sand. | 1 | 1 |


Subtask B of ComVE was evaluated with accuracy. The `evaluate` library provided this metric, and the `evaluate_prediction` function computed it by comparing the `prediction` and `label` columns of the test `DataFrame`. With `shrink_dataset` and `base_model` both set to `True`, the model did not learn the task, and the expected accuracy was about 0.51. A full training run with both flags set to `False` was expected to reach roughly 0.928; the run recorded in this notebook reached 0.925.

In [23]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

{'accuracy': 0.925}
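
Accuracy is the fraction of examples whose predicted index equals the gold label. The sketch below computes it in plain Python on hypothetical values; it mirrors what the metric measures and is not the `evaluate` implementation.

```python
def accuracy(predictions, references):
    """Fraction of positions where prediction and reference agree."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


predictions = [0, 1, 2, 2, 1]  # hypothetical predicted label indices
references = [0, 1, 2, 2, 0]   # hypothetical gold label indices
print(accuracy(predictions, references))  # 0.8
```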