# Summary

This notebook demonstrates how to leverage transfer learning with a pretrained large language model (LLM) like RoBERTa Large. We begin with an instance of the model that has already been fine-tuned on a corpus such as Multi-Genre Natural Language Inference (MNLI), and show how to fine-tune it further on a related corpus, e.g. Stanford Natural Language Inference (SNLI). This process prepares the model for eventual fine-tuning on a task-specific dataset, such as SciEntsBank for Automated Short-Answer Grading. Here, we are transferring knowledge from MNLI and SNLI to the target task.

Fine-tuning a model on a second corpus differs slightly from fine-tuning it on a single corpus or a task-specific dataset. The classification head learned on the first corpus must be preserved, unlabeled examples must be removed, and the label IDs of the new corpus must match the IDs the model already uses. This notebook highlights these aspects, which make for a smoother transfer of knowledge and better performance on the target task.

# Install Required Packages

In [None]:
# For hardware acceleration
%pip install torch torchvision torchaudio

# For Hugging Face
%pip install transformers datasets accelerate

# For metrics
%pip install scikit-learn numpy

# For Notebook Widgets
%pip install ipywidgets widgetsnbextension

# Global Variables

In [1]:
dataset_name = 'stanfordnlp/snli'
model_name = 'FacebookAI/roberta-large-mnli'

# Load Dataset

In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset(dataset_name)



In [4]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
})


In [5]:
print('Label Map:', {index: label for index, label in enumerate(dataset['train'].features['label'].names)})

Label Map: {0: 'entailment', 1: 'neutral', 2: 'contradiction'}


# Load Model

It is important to omit `num_labels` when loading the model to ensure that the classification head retains its fine-tuned weights from the MNLI corpus rather than being reinitialized with random weights.

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name)



In [7]:
print('Label Map:', model.config.id2label)

Label Map: {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}


# Prepare Dataset

### Filter Out Unlabeled Examples

The dataset card states, "Dataset instances which don't have any gold label are marked with -1 label. Make sure you filter them before starting the training..." Although the label map shows only three distinct labels, we can identify the fourth label by retrieving the list of unique labels directly from the examples. It is crucial to inspect the labels, as such discrepancies may not always be documented in the dataset card, and the labels must align with the previous corpus and the current label map of the model.

In [8]:
print('Labels in examples:', set(dataset['train']['label']))

Labels in examples: {0, 1, 2, -1}


We need to filter out the unlabeled examples from each split before we can feed the data to the model. To keep only the examples with our target labels, we specify the labels by their IDs (rather than their names), as they appear in the examples.

In [9]:
dataset = dataset.filter(lambda example: example['label'] in [0, 1, 2])



In [10]:
print(dataset)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9824
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 9842
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 549367
    })
})
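
The effect of the filter can be read off the two `print(dataset)` outputs above. The following sketch only subtracts the reported row counts; the dictionary names are illustrative and not part of the notebook.

```python
# Row counts copied from the printed DatasetDict before and after filtering.
rows_before = {'train': 550152, 'validation': 10000, 'test': 10000}
rows_after = {'train': 549367, 'validation': 9842, 'test': 9824}

# Number of examples with label -1 that the filter removed from each split
removed = {split: rows_before[split] - rows_after[split] for split in rows_before}

for split, count in removed.items():
    share = 100 * count / rows_before[split]
    print(f'{split}: removed {count} unlabeled examples ({share:.2f}%)')
```

The training split loses only a small fraction of its rows, so filtering has little effect on the amount of training data.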


### Align Labels

The label map of the dataset is not aligned with the label map of the model. For example, while the label `0` represents *contradiction* in the model, it corresponds to *entailment* in the dataset. Aligning the labels between the model and the dataset is essential to ensure they interpret label indices consistently. Without alignment, the fine-tuning process will train the model on incorrect label relationships and compromise the training process.

In [11]:
print('Labels Before Alignment:', dataset['train']['label'][:10])
print('Label Map Before Alignment:', {index: label for index, label in enumerate(dataset['train'].features['label'].names)})

Labels Before Alignment: [1, 2, 0, 1, 0, 2, 2, 0, 1, 1]
Label Map Before Alignment: {0: 'entailment', 1: 'neutral', 2: 'contradiction'}


In [12]:
dataset = dataset.align_labels_with_mapping({'contradiction': 0, 'neutral': 1, 'entailment': 2}, 'label')



In [13]:
print('Labels After Alignment:', dataset['train']['label'][:10])
print('Label Map After Alignment:', {index: label for index, label in enumerate(dataset['train'].features['label'].names)})

Labels After Alignment: [1, 0, 2, 1, 2, 0, 0, 2, 1, 1]
Label Map After Alignment: {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
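
The remapping can also be checked by hand. The sketch below is plain Python and not the library's implementation; `build_id_remap` is a hypothetical helper that derives, from the two label maps printed earlier, which model ID each dataset ID should become.

```python
def build_id_remap(dataset_id2label, model_id2label):
    """Map each dataset label ID to the model's ID for the same label name."""
    # Compare names case-insensitively: the model uses upper case, the dataset lower case.
    model_label2id = {name.lower(): idx for idx, name in model_id2label.items()}
    return {idx: model_label2id[name.lower()] for idx, name in dataset_id2label.items()}

# Label maps copied from the outputs printed earlier in this notebook
dataset_id2label = {0: 'entailment', 1: 'neutral', 2: 'contradiction'}
model_id2label = {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

remap = build_id_remap(dataset_id2label, model_id2label)

# First ten training labels before alignment, as printed above
labels_before = [1, 2, 0, 1, 0, 2, 2, 0, 1, 1]
labels_after = [remap[label] for label in labels_before]

print('ID remap:', remap)
print('Labels after remap:', labels_after)
```

The remapped list matches the labels printed after alignment: entailment and contradiction swap IDs while neutral stays at `1`.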


### Tokenize

In [14]:
from transformers import AutoTokenizer

In [15]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenization_function(example):
    return tokenizer(text = example['premise'], text_pair = example['hypothesis'], truncation = True)



In [16]:
dataset = dataset.map(tokenization_function, batched = True)



# Fine-tune

### Data Collator

The data collator batches the examples and pads the input sequences to the same length, ensuring compatibility with the model during training and evaluation.

In [17]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
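
To illustrate what dynamic padding does, here is a minimal plain-Python sketch. It is not the collator's actual code: `pad_batch` is a hypothetical helper, the token IDs are made up, and the padding ID of `1` is assumed to be RoBERTa's `<pad>` token.

```python
PAD_TOKEN_ID = 1  # assumption: the ID of RoBERTa's <pad> token

def pad_batch(batch_input_ids, pad_token_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Made-up token IDs, for illustration only
batch = pad_batch([[0, 100, 200, 2], [0, 300, 2]])
print(batch['input_ids'])
print(batch['attention_mask'])
```

Because padding is applied per batch rather than to a fixed global length, short batches stay short, which saves computation.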

### Metrics

The following function evaluates the model on the validation set during training, reporting accuracy and macro-averaged F1.

In [18]:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [19]:
def compute_metrics(eval_pred):
    # Unpack the logits and the true labels
    logits, y_true = eval_pred
    
    # Convert logits (unnormalized scores) to class labels
    # by selecting the index with the highest score.
    y_pred = np.argmax(logits, axis = 1)
    
    # Calculate metrics by comparing the predicted labels against the true labels
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average = 'macro')
    
    # Return the calculated metrics
    return {
        'Acc': acc,
        'F1': f1
    }
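
As a worked example of the two metrics, the sketch below recomputes accuracy and macro-averaged F1 in plain Python on four made-up examples. The helper names and the toy logits are assumptions for illustration; the notebook itself relies on `scikit-learn`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, num_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy logits for four examples and three classes (made up)
logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.0, 3.0],
    [1.2, 0.9, -0.3],
]
y_true = [0, 1, 2, 1]
y_pred = [argmax(row) for row in logits]

print('Predictions:', y_pred)
print('Acc:', accuracy(y_true, y_pred))
print('F1:', round(macro_f1(y_true, y_pred), 4))
```

Macro averaging gives each class the same weight regardless of how many examples it has, which is why it is a useful complement to accuracy.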

### Early Stopping

The early stopping callback monitors the evaluation metric during training and halts the process if the metric does not improve by more than a threshold for a specified number of consecutive evaluations, helping to prevent overfitting and conserve computational resources.

In [20]:
from transformers import EarlyStoppingCallback

early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience = 3,
    early_stopping_threshold = 0.001
)
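
The patience rule can be traced by hand. The sketch below is a simplified, assumed model of the callback's behavior, not the library's code: the patience counter resets only when the metric beats the best value so far by more than the threshold, while the best value itself is updated on any improvement. `simulate_early_stopping` is a hypothetical helper, and the F1 values are the ones reported in this notebook's training log.

```python
def simulate_early_stopping(scores, patience=3, threshold=0.001):
    """Return (stop_epoch, best_epoch) for a metric where higher is better."""
    best_score, best_epoch, counter = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        # Patience resets only when the gain over the best score exceeds the threshold
        if best_score is None or score - best_score > threshold:
            counter = 0
        else:
            counter += 1
        # The best score is updated on any improvement, however small
        if best_score is None or score > best_score:
            best_score, best_epoch = score, epoch
        if counter >= patience:
            return epoch, best_epoch
    return None, best_epoch

# Validation F1 per epoch, copied from the training log in this notebook
f1_per_epoch = [0.926414, 0.92972, 0.930309, 0.930355, 0.929248]

stop_epoch, best_epoch = simulate_early_stopping(f1_per_epoch)
print('Stopped after epoch:', stop_epoch)
print('Best epoch:', best_epoch)
```

Under these assumptions, epochs 3 and 4 improve by less than the threshold and epoch 5 does not improve at all, so the counter reaches the patience of 3 after epoch 5.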

### Trainer

The `Trainer` manages the entire fine-tuning process, including training, evaluation, optimization, logging, and checkpoint creation, which simplifies model training and keeps execution efficient.

We use the same hyperparameters that were applied during the fine-tuning on the MNLI corpus, except for the number of epochs and the batch size. The number of epochs is set to 10, a limit that is unlikely to be reached because early stopping halts training once the evaluation metric stops improving significantly. The batch size is determined by the available memory of the computational hardware. For this setup, we set the per-device batch size to 64, as we use eight NVIDIA RTX A4000 GPUs to fine-tune the model.

In [21]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs = 10,
    per_device_train_batch_size = 64,
    per_device_eval_batch_size = 64,
    
    adam_epsilon = 1e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.98,
    
    learning_rate = 2e-5,
    weight_decay = 0.1,
    lr_scheduler_type = 'linear',
    warmup_ratio = 0.06,
    
    eval_strategy = 'epoch',
    logging_strategy = 'epoch',
    metric_for_best_model = 'F1',
    greater_is_better = True,
    
    output_dir = 'checkpoints',
    overwrite_output_dir = True,
    save_strategy = 'epoch',
    save_total_limit = 4,
    load_best_model_at_end = True
)

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = dataset['train'],
    eval_dataset = dataset['validation'],
    processing_class = tokenizer,
    data_collator = data_collator,
    compute_metrics = compute_metrics,
    callbacks = [early_stopping_callback]
)
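
To make the `lr_scheduler_type = 'linear'` and `warmup_ratio = 0.06` settings concrete, here is a minimal sketch of a linear schedule with warmup. It is a hand-written approximation, not the library's scheduler. `linear_schedule_lr` is a hypothetical helper, and the total of 21,460 steps is an assumption inferred from the 10,730 steps this run logs for 5 of the 10 configured epochs.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.06):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Assumed total: 2,146 steps per epoch * 10 epochs (inferred, see the lead-in)
TOTAL_STEPS = 21460
WARMUP_STEPS = math.ceil(TOTAL_STEPS * 0.06)

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f'step {step:>5}: lr = {linear_schedule_lr(step, TOTAL_STEPS):.2e}')
```

Because the decay is planned over all 10 epochs, a run that early stopping ends after 5 epochs never reaches the lowest learning rates of the schedule.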

### Train

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss,Acc,F1
1,0.2902,0.219194,0.926844,0.926414
2,0.229,0.215655,0.930197,0.92972
3,0.1822,0.223093,0.930502,0.930309
4,0.1457,0.247766,0.9304,0.930355
5,0.1183,0.257386,0.929384,0.929248


TrainOutput(global_step=10730, training_loss=0.02365808975774333, metrics={'train_runtime': 12869.618, 'train_samples_per_second': 3204.103, 'train_steps_per_second': 12.516, 'total_flos': 3.152317832477303e+17, 'train_loss': 0.02365808975774333, 'epoch': 5.0})

In [23]:
print('Best Epoch:', int(int(trainer.state.best_model_checkpoint.split('/')[-1].split('-')[-1]) / (trainer.state.max_steps / trainer.state.num_train_epochs)))

Best Epoch: 4
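
The one-liner above converts the step number in the best checkpoint's directory name into an epoch. The same arithmetic, spelled out as a sketch: `best_epoch_from_checkpoint` is a hypothetical helper, and the checkpoint path is illustrative, with its step number inferred as 4 × 2,146 rather than copied from the run.

```python
def best_epoch_from_checkpoint(checkpoint_path, steps_per_epoch):
    """Convert a 'checkpoint-<global_step>' directory name into an epoch number."""
    best_step = int(checkpoint_path.rstrip('/').split('-')[-1])
    return round(best_step / steps_per_epoch)

# global_step / completed epochs, both taken from the TrainOutput above
steps_per_epoch = 10730 // 5

# Illustrative path (assumption): the step number is inferred, not taken from the run
best_epoch = best_epoch_from_checkpoint('checkpoints/checkpoint-8584', steps_per_epoch)
print('Steps per epoch:', steps_per_epoch)
print('Best Epoch:', best_epoch)
```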


# Evaluate

In [24]:
from sklearn.metrics import classification_report

results = trainer.predict(dataset['test'])

print(classification_report(
    y_true = results.label_ids,
    y_pred = np.argmax(results.predictions, axis = 1),
    target_names = dataset['train'].features['label'].names
))

               precision    recall  f1-score   support

CONTRADICTION       0.95      0.95      0.95      3237
      NEUTRAL       0.89      0.90      0.89      3219
   ENTAILMENT       0.93      0.92      0.93      3368

     accuracy                           0.92      9824
    macro avg       0.92      0.92      0.92      9824
 weighted avg       0.92      0.92      0.92      9824
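
The two averages at the bottom of the report can be reproduced from the per-class rows. The sketch below uses the rounded F1 scores and supports printed above, so the results are approximate; the variable names are illustrative.

```python
# Per-class F1 and support copied from the classification report above
f1 = {'contradiction': 0.95, 'neutral': 0.89, 'entailment': 0.93}
support = {'contradiction': 3237, 'neutral': 3219, 'entailment': 3368}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[name] * support[name] for name in f1) / total

print('Total support:', total)
print('Macro F1:', round(macro_f1, 2))
print('Weighted F1:', round(weighted_f1, 2))
```

The classes are nearly balanced, which is why the macro and weighted averages agree to two decimal places.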



# Save Model

In [25]:
# Directory to save the model
output_dir = 'roberta-large-mnli-snli'

# Creates a draft model card.
# The model card is a README.md file that will be saved
# to the directory specified by training_args.output_dir.
trainer.create_model_card(
    model_name = 'roberta-large-mnli-snli',
    language = 'EN',
    license = 'mit',
    tags = ['RoBERTa Large', 'MNLI', 'SNLI'],
    finetuned_from = 'FacebookAI/roberta-large-mnli',
    tasks = ['RTE'],
    dataset_tags = ['MNLI', 'SNLI'],
    dataset = ['stanfordnlp/snli']
)

# Save the model including the tokenizer
trainer.save_model(output_dir = output_dir)

# Save the trainer state to resume training in the future
trainer.state.save_to_json(f'{output_dir}/trainer_state.json')