# Pre-trained Language Models: SubTask A

This project implements SubTask A of the ComVE shared task from SemEval-2020. The objective is to identify which of two minimally different natural-language statements is nonsensical. The task is treated as a text-matching classification problem: the model receives a pair of statements and outputs the index of the illogical one. For example, between the pair “He put a turkey into the fridge” and “He put an elephant into the fridge,” the second statement is the nonsensical one.

The workflow fine-tunes a pre-trained language model using the Hugging Face Transformers library. RoBERTa is used because it keeps the BERT architecture but was pre-trained on a larger corpus. Because full fine-tuning is computationally intensive, the notebook can optionally shrink the dataset and switch to the base model variant to keep resource requirements manageable. Both options are controlled by the flags below.

In [None]:
shrink_dataset = False
base_model = False
colab = True

These variables do not affect the automated tests used for evaluation. The reference output examples for the assignment assume `shrink_dataset=True`, `base_model=True`, and `colab=False`. Running the full version of the model requires significantly more compute. For a complete training run and a more accurate performance estimate, a cloud environment such as Google Colab can be used. Colab provides a Jupyter-based interface with optional GPU and TPU acceleration suitable for fine-tuning large language models. In that case, set `shrink_dataset=False`, `base_model=False`, and `colab=True`, then follow the provided Colab instructions. This project was executed on Colab with that full configuration, so the outputs recorded below come from `roberta-large` trained on all 10 000 training pairs.

In [None]:
if colab:
    ! pip install transformers==4.28.0 datasets evaluate
    import os
    # Same path (including the double space in "Training  Data") that is used to load the data later
    if not os.path.exists("SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"):
        ! git clone https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.git SemEval2020-Task4-Data


The following objects and functions are used:

In [None]:
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer,
                          enable_full_determinism)

Neural network training involves multiple stochastic operations, including weight initialization, data shuffling, and sampling. As a result, repeated runs of the same model can yield different outputs. To make runs reproducible, the random number generators are initialized with a fixed seed. In the **Transformers** framework, `enable_full_determinism` does this before training: it seeds the Python, NumPy, and PyTorch generators and requests deterministic algorithms where they are available.

In [None]:
enable_full_determinism(seed=42)
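
The effect of a fixed seed can be illustrated with Python's standard `random` module alone. This is only a sketch of the principle, not of what `enable_full_determinism` does internally; the helper name `shuffled_ids` is hypothetical.

```python
import random

def shuffled_ids(seed, n=10):
    """Return the ids 0..n-1 in a shuffled order determined only by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    ids = list(range(n))
    rng.shuffle(ids)
    return ids

run_a = shuffled_ids(seed=42)
run_b = shuffled_ids(seed=42)

# Two runs with the same seed visit the examples in the same order.
print(run_a == run_b)
```

Changing the seed will generally change the order, which is why the seed is fixed once and reported together with the results.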

Reproducibility in neural network experiments can still vary across software versions and hardware configurations, even when a fixed seed is used. Minor differences in results are therefore expected. Neural network workflows also require selecting hyperparameters that define the model configuration. Identifying the right hyperparameter values typically involves repeated training and evaluation over many combinations, which is computationally expensive. For this assignment, predefined hyperparameter values are used instead of performing a full tuning process.

In [None]:
epochs = 3  # Number of epochs to train the model
train_batch_size = 8  # Number of examples used per gradient update
learning_rate = 1e-5  # The learning rate for the optimizer
max_length = 50  # Maximum length of the input sequence
output_dir = "modelA"  # The output directory where the model will be written to

## Loading the Pre-trained Model 

Loading a pre-trained model in this task requires retrieving both the sequence-classification version of the model and its tokenizer. AutoClasses streamline this process: `AutoModelForSequenceClassification` loads the model with a classification head suitable for SubTask A, and `AutoTokenizer` loads the matching tokenizer. The `load_model` function must therefore take the model name, use these AutoClasses to instantiate both components, and return them for downstream fine-tuning.

In [None]:
def load_model(model_name):
    #Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    #Load the model for sequence classification
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return model, tokenizer

In [None]:
model_name = "roberta-base" if base_model else "roberta-large"
model, tokenizer = load_model(model_name)


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight', 'classifier.dense.weight']
You should 

## Data Pre-processing 

The ComVE dataset provides paired statements and a binary label indicating which one is nonsensical. The train, development, and test splits contain 10 000, 997, and 1 000 pairs respectively. Each row includes two text fields and a label (`0` if the first statement is nonsensical, `1` if the second is nonsensical). These splits are loaded into separate DataFrames to prepare them for tokenization and model input.

In [None]:
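def load_data(data_csv, answers_csv):
    # Statement pairs: columns id, sent0, sent1
    data = pd.read_csv(data_csv)
    # The answers file has no header row: column 0 is the id, column 1 the label
    answers = pd.read_csv(answers_csv, header=None).rename(columns={0: "id", 1: "label"})
    # Attach each label to its statement pair through the shared id
    return pd.merge(data, answers, on="id")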

In [None]:
train_data_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_data_all.csv"
train_answers_csv = "SemEval2020-Task4-Data/ALL data/Training  Data/subtaskA_answers_all.csv"
train_data = load_data(train_data_csv, train_answers_csv)
dev_data_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_dev_data.csv"
dev_answers_csv = "SemEval2020-Task4-Data/ALL data/Dev Data/subtaskA_gold_answers.csv"
dev_data = load_data(dev_data_csv, dev_answers_csv)
test_data_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_test_data.csv"
test_answers_csv = "SemEval2020-Task4-Data/ALL data/Test Data/subtaskA_gold_answers.csv"
test_data = load_data(test_data_csv, test_answers_csv)
if shrink_dataset:
    train_data = train_data.sample(n=100, random_state=42)
    dev_data = dev_data.sample(n=100, random_state=42)
    test_data = test_data.sample(n=100, random_state=42)
train_data

Unnamed: 0,id,sent0,sent1,label
0,0,He poured orange juice on his cereal.,He poured milk on his cereal.,0
1,1,He drinks apple.,He drinks milk.,0
2,2,Jeff ran a mile today,"Jeff ran 100,000 miles today",1
3,3,A mosquito stings me,I sting a mosquito,1
4,4,A niece is a person.,A giraffe is a person.,1
...,...,...,...,...
9995,9995,Mark ate a big bitter cherry pie,Mark ate a big sweet cherry pie,0
9996,9996,Gloria wears a cat on her head,Gloria wears a hat on her head,0
9997,9997,Harry went to the barbershop to have his hair cut,Harry went to the barbershop to have his glass...,1
9998,9998,Reilly is sleeping on the couch,Reilly is sleeping on the window,1
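
The label convention can be made concrete with a small helper. This is a sketch for illustration only; `nonsensical_statement` is a hypothetical name that is not part of the assignment code, and the two rows are copied from the table above.

```python
def nonsensical_statement(sent0, sent1, label):
    """Return the statement that the gold label marks as nonsensical."""
    if label not in (0, 1):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return sent0 if label == 0 else sent1

rows = [
    ("He poured orange juice on his cereal.", "He poured milk on his cereal.", 0),
    ("A niece is a person.", "A giraffe is a person.", 1),
]
for sent0, sent1, label in rows:
    print(nonsensical_statement(sent0, sent1, label))
```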


The Datasets library provides an Arrow-based table structure through its `Dataset` class, enabling efficient manipulation, slicing, and integration with Transformers. It supports direct loading from multiple sources, including local files, the Hugging Face Hub, or existing pandas DataFrames. When loaded from a DataFrame, each column becomes a dataset field and each row becomes an example, ensuring compatibility with downstream tokenization and batching workflows.

In [None]:
train_dataset = Dataset.from_pandas(train_data)
dev_dataset = Dataset.from_pandas(dev_data)
test_dataset = Dataset.from_pandas(test_data)
train_dataset[0]

{'id': 0,
 'sent0': 'He poured orange juice on his cereal.',
 'sent1': 'He poured milk on his cereal.',
 'label': 0}

A key feature of Datasets is the `map` function, which applies a user-defined transformation to every example, or to whole batches of examples when `batched=True`. Batched processing keeps pre-processing efficient at scale, and the columns returned by the transformation are stored alongside the existing ones. In this project, `map` is used to tokenize pairs of statements for the ComVE SubTask A classification setup.
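
The idea behind a batched transformation can be sketched in plain Python. The code below is an imitation written for illustration, not the library's implementation; `batched_map` and `add_lengths` are hypothetical names, and the character count merely stands in for tokenization.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply `fn` to slices of a column-oriented table and collect the results.

    `columns` maps column names to equally long lists. `fn` receives one batch
    as a dict of lists and returns a dict of lists to add or overwrite.
    """
    n_rows = len(next(iter(columns.values())))
    result = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        new_columns = fn(batch)
        # Existing columns are kept; returned columns are added or replace them.
        for name, values in {**batch, **new_columns}.items():
            result.setdefault(name, []).extend(values)
    return result

def add_lengths(batch):
    # Stand-in for tokenization: exactly one output value per input row.
    return {"n_chars": [len(a) + len(b)
                        for a, b in zip(batch["sent0"], batch["sent1"])]}

table = {
    "sent0": ["He drinks apple.", "A mosquito stings me"],
    "sent1": ["He drinks milk.", "I sting a mosquito"],
    "label": [0, 1],
}
mapped = batched_map(table, add_lengths, batch_size=1)
print(sorted(mapped))  # column names after mapping
```

The function passed to `map` follows the same contract: it receives lists of values per column and returns lists of the same length.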

The next step is to implement a `preprocess_data` function. This function receives a batch of examples, the tokenizer loaded earlier, and the `max_length` hyperparameter. It must tokenize the `sent0` and `sent1` columns jointly using the tokenizer, with both padding and truncation applied to enforce uniform sequence length equal to `max_length`. The output must follow the standard Hugging Face preprocessing format.

The tokenizer returns a `BatchEncoding` object containing the fields required by Transformer models:

* `input_ids`: Token indices representing the tokenized input sequence.
* `attention_mask`: Binary masks indicating which tokens should be attended to by the model.

When passed to `map`, these fields are added to the dataset as new columns. For example, after preprocessing, a dataset row may look like:

```
{
 'id': 6252,
 'sent0': 'a duck walks on three legs',
 'sent1': 'a duck walks on two legs',
 'label': 0,
 '__index_level_0__': 6252,
 'input_ids': [0, 102, 15223, 5792, 15, 130, 5856, 2, 2, 102, 15223, 5792, 15, 80, 5856, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```

Each value in `input_ids` corresponds to a sub-word token. For example, the decoded input sequence for the row above is:

```
['<s>', 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġthree', 'Ġlegs', '</s>', '</s>',
 'a', 'Ġduck', 'Ġwalks', 'Ġon', 'Ġtwo', 'Ġlegs', '</s>', '<pad>', ...]
```

The RoBERTa tokenizer uses `<s>` as the sequence-start token whose representation feeds the classifier (analogous to `[CLS]` in BERT), and `</s>` as both the end-of-sequence marker and the separator: the two statements of a pair are separated by `</s></s>`. The leading `Ġ` indicates that the token was preceded by whitespace in the original text, which lets the model distinguish a token that starts a new word from one that continues the previous word.
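
The layout of the encoded pair and its attention mask can be reproduced by hand. The following is a minimal sketch that assumes the sentences are already converted to token ids (copied from the example row above) and that the special-token ids are 0, 1, and 2 as they appear in that row. `encode_pair` is a hypothetical helper, and truncation is deliberately left out.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s> as seen in the example row

def encode_pair(ids_a, ids_b, max_length):
    """Lay out two tokenized sentences as <s> A </s></s> B </s>, then pad."""
    input_ids = [BOS_ID, *ids_a, EOS_ID, EOS_ID, *ids_b, EOS_ID]
    if len(input_ids) > max_length:
        raise ValueError("pair exceeds max_length; truncation is not sketched here")
    n_pad = max_length - len(input_ids)
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return {"input_ids": input_ids + [PAD_ID] * n_pad,
            "attention_mask": attention_mask}

# "a duck walks on three legs" / "a duck walks on two legs"
ids_a = [102, 15223, 5792, 15, 130, 5856]
ids_b = [102, 15223, 5792, 15, 80, 5856]
encoded = encode_pair(ids_a, ids_b, max_length=50)

print(encoded["input_ids"][:16])
print(sum(encoded["attention_mask"]))  # 16 real tokens, the rest is padding
```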

In [None]:
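def preprocess_data(examples, tokenizer, max_length):
    # Tokenize each (sent0, sent1) pair jointly into a single input sequence
    tokenized_batch = tokenizer(
        examples["sent0"],
        examples["sent1"],
        padding="max_length",  # pad every sequence to max_length
        truncation=True,  # cut pairs that are longer than max_length
        max_length=max_length,
        return_tensors="pt"
    )
    # Add the model inputs to the batch; `map` stores them as new columns
    examples["input_ids"] = tokenized_batch["input_ids"]
    examples["attention_mask"] = tokenized_batch["attention_mask"]
    return examples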

In [None]:
train_dataset = train_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
dev_dataset = dev_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
test_dataset = test_dataset.map(lambda x: preprocess_data(x, tokenizer, max_length), batched=True)
print(train_dataset[0])
print(tokenizer.convert_ids_to_tokens(train_dataset[0]["input_ids"]))

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/997 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

{'id': 0, 'sent0': 'He poured orange juice on his cereal.', 'sent1': 'He poured milk on his cereal.', 'label': 0, 'input_ids': [0, 894, 13414, 8978, 10580, 15, 39, 25629, 4, 2, 2, 894, 13414, 5803, 15, 39, 25629, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
['<s>', 'He', 'Ġpoured', 'Ġorange', 'Ġjuice', 'Ġon', 'Ġhis', 'Ġcereal', '.', '</s>', '</s>', 'He', 'Ġpoured', 'Ġmilk', 'Ġon', 'Ġhis', 'Ġcereal', '.', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


## Fine-tuning

Transformers provides a high-level training interface through the `Trainer` API, which streamlines fine-tuning without requiring a hand-written training loop. Training behavior is controlled by the `TrainingArguments` class, which exposes a large set of configuration options covering optimization, evaluation, logging, checkpointing, and device management. For this project, the goal is to instantiate both `TrainingArguments` and the corresponding `Trainer` object to fine-tune RoBERTa on the ComVE SubTask A dataset.

The `create_training_arguments` function must be implemented to generate a `TrainingArguments` instance using the provided `epochs`, `train_batch_size`, `learning_rate`, and `output_dir` parameters. The configuration must ensure that the model is evaluated on the development set after each epoch by enabling the appropriate evaluation strategy. The assignment requires disabling the default checkpoint-saving behavior of `Trainer`, which normally saves a checkpoint every 500 steps; this is achieved by explicitly setting:

```
save_strategy = "no"
```

The resulting `TrainingArguments` object will serve as the configuration input for the `Trainer`, which will handle batching, gradient updates, evaluation, and metric reporting during fine-tuning.
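
To get a feel for the training length these settings imply, the number of gradient updates can be worked out by hand. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and a final partial batch that is kept; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_examples, batch_size, epochs):
    """Gradient updates for a full run: batches per epoch times epochs."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

# Full training split with the hyperparameters defined earlier
print(total_optimizer_steps(n_examples=10_000, batch_size=8, epochs=3))  # 3750
```

With 10 000 training pairs, a batch size of 8, and 3 epochs this gives 1 250 updates per epoch and 3 750 in total, which agrees with the `global_step=3750` reported in the training output.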


In [None]:
def create_training_arguments(epochs, train_batch_size, learning_rate, output_dir):
    training_args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=train_batch_size,
        learning_rate=learning_rate,
        logging_steps=500,
        save_strategy="no",  # disable checkpoint saving
        evaluation_strategy="epoch"  # evaluate on the dev set after every epoch
    )
    return training_args

In [None]:
train_args = create_training_arguments(epochs, train_batch_size, learning_rate, output_dir)

The `create_trainer` function builds a `Trainer` from the model returned by `load_model`, the `TrainingArguments` produced by `create_training_arguments`, and the train and development `Dataset` objects. The returned `Trainer` instance is configured to train on the training dataset and to evaluate on the development dataset during training.

In [None]:
def create_trainer(model, train_args, train_dataset, dev_dataset):
    # Create a Trainer with the given model, training arguments, and datasets
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
    )
    return trainer

In [None]:
trainer = create_trainer(model, train_args, train_dataset, dev_dataset)

The `trainer` object created by `create_trainer` can now fine-tune the model with a single call:

In [None]:
trainer.train()



Epoch,Training Loss,Validation Loss
1,0.6983,0.692558
2,0.6847,0.548349
3,0.372,0.28211


TrainOutput(global_step=3750, training_loss=0.5942439636230469, metrics={'train_runtime': 2025.7531, 'train_samples_per_second': 14.809, 'train_steps_per_second': 1.851, 'total_flos': 2730267666000000.0, 'train_loss': 0.5942439636230469, 'epoch': 3.0})

Predictions are generated by running the trained `Trainer` on the test `Dataset`. The `make_predictions` function calls `trainer.predict`, extracts the logits from the returned structure, applies `argmax` over the final axis, and returns, for each example, the index of the class with the highest logit.

In [None]:
import numpy as np

def make_predictions(trainer, test_dataset):
    #Use the trainer to make predictions on the test dataset
    predictions = trainer.predict(test_dataset)
    #Extract the logits from the predictions
    logits = predictions.predictions
    #Get the index of the label with the highest logit value for each example
    predicted_labels = np.argmax(logits, axis=-1)
    return predicted_labels


In [None]:
predictions = make_predictions(trainer, test_dataset)
test_data["prediction"] = predictions
test_data

Unnamed: 0,id,sent0,sent1,label,prediction
0,1175,He loves to stroll at the park with his bed,He loves to stroll at the park with his dog.,0,0
1,452,The inverter was able to power the continent.,The inverter was able to power the house,0,0
2,275,The chef put extra lemons on the pizza.,The chef put extra mushrooms on the pizza.,0,0
3,869,sugar is used to make coffee sour,sugar is used to make coffee sweet,0,0
4,50,There are beautiful flowers here and there in ...,There are beautiful planes here and there in t...,1,1
...,...,...,...,...,...
995,1114,"If it had rained, you would got wet.","If it is a sunny day, you would got wet.",1,1
996,8,ice hockey is a sport,ice hockey is a financial institution,1,1
997,1945,He put water without a container in the freeze...,He put a watermelon in the freezer for 24 hours,0,0
998,1053,The desert has sand that you can drink.,"The desert is very dry, so bring water when yo...",0,0
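
The argmax step inside `make_predictions` can be illustrated without a trained model. The sketch below uses made-up logits and a plain-Python stand-in for `np.argmax(logits, axis=-1)`; `argmax_last_axis` is a hypothetical helper.

```python
def argmax_last_axis(logits):
    """Index of the largest value in each row (first index wins on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Made-up logits for three statement pairs, two classes each
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
print(argmax_last_axis(logits))  # [0, 1, 0]
```

Each resulting index is directly the predicted label: `0` if the first statement is judged nonsensical, `1` if the second is.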


SubTask A of ComVE is evaluated using accuracy. The [evaluate](https://huggingface.co/docs/evaluate/index) library provides this and other metrics. The `evaluate_prediction` function takes the test DataFrame, which now includes the `prediction` column, and computes accuracy by comparing the predicted labels with the true labels. When `shrink_dataset` and `base_model` are set to `True`, the model only partially learns the task, yielding an expected accuracy of approximately 0.49, which is chance level for a two-way choice. A full training run, with both variables set to `False`, is expected to reach an accuracy of around 0.929; the run recorded in this notebook reports 0.895.

In [None]:
def evaluate_prediction(test_data):
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=test_data["prediction"].values, references=test_data["label"].values)
evaluate_prediction(test_data)

{'accuracy': 0.895}
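
For reference, the accuracy metric is simply the fraction of predictions that match the gold labels. The following sketch computes it by hand on made-up values; it is not the implementation used by the `evaluate` library, and the function name `accuracy` is chosen only for this example.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

labels = [0, 0, 1, 1, 0]  # made-up gold labels
preds = [0, 1, 1, 1, 0]   # made-up predictions with one mistake
print(accuracy(preds, labels))  # 0.8
```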