## **ConvBERT Reproduction Study: Fine-tuning for Performance on GLUE Benchmark**

This notebook reproduces ConvBERT's performance on the GLUE benchmark, a collection of natural language understanding tasks. ConvBERT is fine-tuned separately on each of the nine tasks, which cover sentiment analysis, paraphrase detection, and textual entailment among others, to check whether the original paper's findings hold. The implementation uses PyTorch and Hugging Face's Transformers library, with gradient accumulation and mixed-precision training to keep memory use within the limits of a single GPU.

In [None]:
# Install required packages
!pip install torch transformers datasets evaluate

import torch
from torch.utils.data import DataLoader
from transformers import ConvBertTokenizer, ConvBertForSequenceClassification
from datasets import load_dataset
from torch.nn.utils.rnn import pad_sequence
import os
from torch.optim.adamw import AdamW
from sklearn.metrics import accuracy_score, mean_squared_error
from scipy.stats import pearsonr
from torch.cuda.amp import autocast, GradScaler



In [None]:
# Define the GLUE tasks and their respective number of labels for classification or regression
# Each key represents a GLUE task, and the corresponding value is the number of labels for that task.
# For example:
# - SST-2 (Sentiment Analysis) and CoLA (Linguistic Acceptability) are binary classification tasks (2 labels).
# - MRPC (Paraphrase Detection), QQP (Question Paraphrase), QNLI, RTE, and WNLI are also binary classification tasks (2 labels).
# - STS-B (Semantic Textual Similarity) is a regression task (1 label) where the goal is to predict a continuous score.
# - MNLI (Multi-Genre Natural Language Inference) is a multi-class classification task with 3 labels (entailment, neutral, contradiction).

glue_tasks = {
    "sst2": 2,  # SST-2: Sentiment analysis (binary classification)
    "cola": 2,  # CoLA: Linguistic acceptability (binary classification)
    "mrpc": 2,  # MRPC: Paraphrase detection (binary classification)
    "stsb": 1,  # STS-B: Semantic similarity (regression task, 1 continuous value)
    "qqp": 2,   # QQP: Question paraphrase detection (binary classification)
    "mnli": 3,  # MNLI: Multi-genre NLI (3-class classification: entailment, neutral, contradiction)
    "qnli": 2,  # QNLI: Question-answering NLI (binary classification)
    "rte": 2,   # RTE: Recognizing textual entailment (binary classification)
    "wnli": 2   # WNLI: Winograd NLI (binary classification, known for its difficulty)
}
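
Besides the number of labels, each task stores its text under different column names in the Hugging Face `glue` dataset. As a sketch, the lookup below records those columns per task. `task_to_keys` and `text_columns` are illustrative names introduced here and are not used elsewhere in this notebook.

```python
# Text columns for each GLUE task in the Hugging Face "glue" dataset.
# A second entry of None marks a single-sentence task.
task_to_keys = {
    "sst2": ("sentence", None),
    "cola": ("sentence", None),
    "mrpc": ("sentence1", "sentence2"),
    "stsb": ("sentence1", "sentence2"),
    "qqp": ("question1", "question2"),
    "mnli": ("premise", "hypothesis"),
    "qnli": ("question", "sentence"),
    "rte": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}


def text_columns(task):
    """Return the list of text columns a task needs (one or two names)."""
    first, second = task_to_keys[task]
    return [first] if second is None else [first, second]


print(text_columns("sst2"))  # ['sentence']
print(text_columns("qnli"))  # ['question', 'sentence']
```

Note that QNLI is a pair task even though one of its columns is named `sentence`, the same name the single-sentence tasks use.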


In [None]:
# Load ConvBERT tokenizer and model for sequence classification
model_name = "YituTech/conv-bert-base"
tokenizer = ConvBertTokenizer.from_pretrained(model_name)




In [None]:
# Define a custom collate function to handle padding of sequences within a batch
# This function is useful for making sure that sequences in a batch have the same length
# by padding them appropriately. This is necessary when dealing with variable-length sequences (e.g., sentences).

def collate_fn(batch):
    # Extract 'input_ids', 'attention_mask', and 'label' from each item in the batch
    input_ids = [item['input_ids'] for item in batch]
    attention_mask = [item['attention_mask'] for item in batch]
    labels = [item['label'] for item in batch]

    # Pad the input_ids and attention_mask to the same length for all samples in the batch
    # pad_sequence pads sequences with zeros by default to make them of equal length.
    # 'batch_first=True' ensures that the output tensor has the batch size as the first dimension.
    input_ids_padded = pad_sequence(input_ids, batch_first=True)
    attention_mask_padded = pad_sequence(attention_mask, batch_first=True)

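    # Stack the per-example labels into one tensor. With set_format(type='torch'),
    # each label arrives as a 0-dim tensor rather than a plain Python number.
    labels = torch.stack([torch.as_tensor(label) for label in labels])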

    # Return a dictionary with the padded input_ids, attention_mask, and labels
    return {
        'input_ids': input_ids_padded,            # Padded input token IDs
        'attention_mask': attention_mask_padded,  # Padded attention mask
        'label': labels                           # Labels for the batch
    }
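
To make the padding step concrete, here is a small pure-Python sketch of what the collate function does to token IDs and attention masks. It uses plain lists instead of tensors, and `pad_to_longest` is a hypothetical helper written only for this illustration. The notebook itself relies on `pad_sequence`, and the token IDs below are made up.

```python
def pad_to_longest(sequences, pad_value=0):
    """Right-pad each list with pad_value so all lists match the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Two tokenized examples of different lengths (made-up token IDs).
input_ids = [[101, 2023, 102], [101, 2023, 2003, 2307, 102]]
attention_mask = [[1] * len(ids) for ids in input_ids]

padded_ids = pad_to_longest(input_ids)
padded_mask = pad_to_longest(attention_mask)

print(padded_ids)   # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 2307, 102]]
print(padded_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padded positions receive an attention-mask value of 0, so the model ignores them.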


In [None]:
# Mount the drive to save model checkpoints
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import warnings

warnings.filterwarnings('ignore')


# Initialize gradient scaler for mixed precision training
scaler = GradScaler()

# Gradient accumulation setup
accumulation_steps = 8  # Adjust this based on your memory requirements

# Directory to save checkpoints
checkpoint_dir = "/content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

def save_checkpoint(model, optimizer, task):
    checkpoint_path = os.path.join(checkpoint_dir, f"{task}_final.pth")
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, checkpoint_path)
    print(f"\nCheckpoint saved for {task}: {checkpoint_path}\n")

def load_checkpoint(model, optimizer, task):
    checkpoint_path = os.path.join(checkpoint_dir, f"{task}_final.pth")
    if os.path.exists(checkpoint_path):
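        # Load onto CPU so the checkpoint can be read without a GPU; the model is moved to the device later
        checkpoint = torch.load(checkpoint_path, map_location="cpu")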
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        print(f"\nCheckpoint loaded for {task}: {checkpoint_path}\n")
        return True
    else:
        print(f"\nNo checkpoint found for {task}. Starting fresh.\n")
        return False


# Initialize dictionary to store task accuracies/metrics
task_metrics = {}

# Fine-tune for each GLUE task
for task, num_labels in glue_tasks.items():
    print(f"\n\nFine-tuning for task: {task}")

    # Load dataset for the current task
    dataset = load_dataset("glue", task)

    if task == 'mnli':  # MNLI has 'validation_matched' and 'validation_mismatched'
        train_dataset = dataset['train']
        validation_matched_dataset = dataset['validation_matched']
        validation_mismatched_dataset = dataset['validation_mismatched']
    else:
        train_dataset = dataset['train']
        test_dataset = dataset['validation']  # GLUE test labels are not public, so evaluation uses the validation split

    # Tokenize the data
    def tokenize(batch):
        # Column names differ by task, so branch on the columns that are present
        if 'premise' in batch and 'hypothesis' in batch:  # MNLI
            return tokenizer(batch['premise'], batch['hypothesis'], padding=True, truncation=True, max_length=512)
        elif 'sentence1' in batch:  # STS-B, MRPC, RTE, WNLI
            return tokenizer(batch['sentence1'], batch['sentence2'], padding=True, truncation=True, max_length=512)
        elif 'question1' in batch:  # QQP
            return tokenizer(batch['question1'], batch['question2'], padding=True, truncation=True, max_length=512)
        elif 'question' in batch:  # QNLI pairs a question with a candidate answer sentence
            return tokenizer(batch['question'], batch['sentence'], padding=True, truncation=True, max_length=512)
        else:  # Single-sentence tasks: SST-2, CoLA
            return tokenizer(batch['sentence'], padding=True, truncation=True, max_length=512)


    # Apply tokenization
    train_dataset = train_dataset.map(tokenize, batched=True)

    if task == 'mnli':
        validation_matched_dataset = validation_matched_dataset.map(tokenize, batched=True)
        validation_mismatched_dataset = validation_mismatched_dataset.map(tokenize, batched=True)
    else:
        test_dataset = test_dataset.map(tokenize, batched=True)

    # Convert dataset to PyTorch tensors
    train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    if task == 'mnli':
        validation_matched_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
        validation_mismatched_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    else:
        test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

    # Load the ConvBERT model for classification
    model = ConvBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # Set up the optimizer
    optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

    # Load the checkpoint if it exists
    checkpoint_loaded = load_checkpoint(model, optimizer, task)

    # Prepare DataLoader with optimized settings
    train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True, collate_fn=collate_fn, num_workers=4, pin_memory=True)

    if task == 'mnli':
        validation_matched_dataloader = DataLoader(validation_matched_dataset, batch_size=16, collate_fn=collate_fn, num_workers=4, pin_memory=True)
        validation_mismatched_dataloader = DataLoader(validation_mismatched_dataset, batch_size=16, collate_fn=collate_fn, num_workers=4, pin_memory=True)
    else:
        test_dataloader = DataLoader(test_dataset, batch_size=16, collate_fn=collate_fn, num_workers=4, pin_memory=True)

    # Move model to the appropriate device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    if not checkpoint_loaded:  # Skip training if checkpoint is loaded
        # Fine-tuning loop with gradient accumulation and mixed precision
        model.train()
        for epoch in range(3):  # Use 3 epochs for each task
            total_loss = 0
            optimizer.zero_grad()

            for i, batch in enumerate(train_dataloader):
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                # Autocast for mixed precision training
                with autocast():
                    outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
                    loss = outputs.loss / accumulation_steps  # Scale the loss for accumulation

                # Backpropagation
                scaler.scale(loss).backward()

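                # Update weights after each full accumulation window, and once more at the end
                # of the epoch so gradients from a final partial window are not discarded
                if (i + 1) % accumulation_steps == 0 or (i + 1) == len(train_dataloader):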
                    scaler.step(optimizer)
                    scaler.update()
                    optimizer.zero_grad()

                total_loss += loss.item() * accumulation_steps  # Re-scale the loss back

            avg_loss = total_loss / len(train_dataloader)
            print(f"Epoch {epoch+1}, Task: {task}, Loss: {avg_loss:.4f}")

            torch.cuda.empty_cache()

        # Save checkpoint after completing the task
        save_checkpoint(model, optimizer, task)

    # Evaluate the model
    model.eval()

    if task == 'mnli':
        # Evaluate on both validation_matched and validation_mismatched datasets
        def evaluate_mnli(split_name, dataloader):
            predictions, true_labels = [], []
            with torch.no_grad():
                for batch in dataloader:
                    input_ids = batch['input_ids'].to(device)
                    attention_mask = batch['attention_mask'].to(device)
                    labels = batch['label'].to(device)

                    outputs = model(input_ids, attention_mask=attention_mask)
                    logits = outputs.logits
                    preds = torch.argmax(logits, dim=-1)

                    predictions.extend(preds.cpu().numpy())
                    true_labels.extend(labels.cpu().numpy())

            accuracy = accuracy_score(true_labels, predictions)
            print(f"\nTask: {task} ({split_name}), Test Accuracy: {accuracy:.4f}")
            # Store the accuracy for GLUE score calculation
            task_metrics[f"{task}_{split_name}"] = accuracy  # Save for both matched and mismatched

        # Evaluate on both splits
        evaluate_mnli('validation_matched', validation_matched_dataloader)
        evaluate_mnli('validation_mismatched', validation_mismatched_dataloader)

    else:
        predictions, true_labels = [], []
        with torch.no_grad():
            for batch in test_dataloader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['label'].to(device)

                outputs = model(input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                # Regression task (STS-B)
                if task == 'stsb':
                    preds = logits.squeeze(-1)  # Shape (batch,), even for a batch of one; no argmax in regression
                else:  # Classification tasks
                    preds = torch.argmax(logits, dim=-1)

                predictions.extend(preds.cpu().numpy())
                true_labels.extend(labels.cpu().numpy())

        # Calculate metrics
        if task == 'stsb':  # Pearson correlation for regression task
            pearson_corr = pearsonr(true_labels, predictions)[0]
            mse = mean_squared_error(true_labels, predictions)
            print(f"\nTask: {task}, Pearson Correlation: {pearson_corr:.4f}, MSE: {mse:.4f}")
            task_metrics[task] = pearson_corr  # Store Pearson correlation for GLUE score
        else:  # Accuracy for classification tasks
            accuracy = accuracy_score(true_labels, predictions)
            print(f"\nTask: {task}, Test Accuracy: {accuracy:.4f}")
            task_metrics[task] = accuracy  # Store accuracy for GLUE score

    torch.cuda.empty_cache()




Fine-tuning for task: sst2


Some weights of ConvBertForSequenceClassification were not initialized from the model checkpoint at YituTech/conv-bert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Checkpoint loaded for sst2: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/sst2_final.pth


Task: sst2, Test Accuracy: 0.9266


Fine-tuning for task: cola





Checkpoint loaded for cola: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/cola_final.pth


Task: cola, Test Accuracy: 0.8543


Fine-tuning for task: mrpc





Checkpoint loaded for mrpc: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/mrpc_final.pth


Task: mrpc, Test Accuracy: 0.8750


Fine-tuning for task: stsb





Checkpoint loaded for stsb: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/stsb_final.pth


Task: stsb, Pearson Correlation: 0.9028, MSE: 0.4367


Fine-tuning for task: qqp





Checkpoint loaded for qqp: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/qqp_final.pth


Task: qqp, Test Accuracy: 0.9146


Fine-tuning for task: mnli





Checkpoint loaded for mnli: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/mnli_final.pth


Task: mnli (validation_matched), Test Accuracy: 0.8639

Task: mnli (validation_mismatched), Test Accuracy: 0.8648


Fine-tuning for task: qnli





Checkpoint loaded for qnli: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/qnli_final.pth


Task: qnli, Test Accuracy: 0.6110


Fine-tuning for task: rte





Checkpoint loaded for rte: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/rte_final.pth


Task: rte, Test Accuracy: 0.7112


Fine-tuning for task: wnli





Checkpoint loaded for wnli: /content/drive/MyDrive/Advanced NLP/Ass_2/model_checkpoints/wnli_final.pth


Task: wnli, Test Accuracy: 0.5634
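
The training loop above accumulates gradients over `accumulation_steps = 8` batches of 16 examples, so each optimizer update reflects an effective batch of 128 examples. The sketch below works through that arithmetic and lists the batch indices after which an update would happen. It also flushes a final partial window at the end of the epoch, which is one common way to avoid discarding leftover gradients. `optimizer_step_indices` is a hypothetical helper for illustration only, and the 20-batch epoch is a made-up example.

```python
def optimizer_step_indices(num_batches, accumulation_steps):
    """Return the 0-based batch indices after which the optimizer would step."""
    steps = []
    for i in range(num_batches):
        window_full = (i + 1) % accumulation_steps == 0
        last_batch = (i + 1) == num_batches
        if window_full or last_batch:
            steps.append(i)
    return steps


batch_size = 16
accumulation_steps = 8

# Examples contributing to each optimizer update
effective_batch_size = batch_size * accumulation_steps
print(effective_batch_size)  # 128

# A 20-batch epoch: two full windows, then a partial window of 4 batches
print(optimizer_step_indices(20, accumulation_steps))  # [7, 15, 19]
```

The trade-off is that accumulation lowers peak memory at the cost of fewer parameter updates per epoch.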


In [None]:
# Print all the accuracies and calculate the GLUE score
print("\nTask Metrics Summary:")
for task, score in task_metrics.items():
    print(f"{task}: {score:.4f}")

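# Aggregate score: unweighted mean of the stored metrics (accuracy, or Pearson for STS-B; MNLI counts twice).
# This is not the official GLUE score, which for example uses Matthews correlation for CoLA.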
glue_score = sum(task_metrics.values()) / len(task_metrics)
print(f"\nGLUE Score: {glue_score:.4f}")



Task Metrics Summary:
sst2: 0.9266
cola: 0.8543
mrpc: 0.8750
stsb: 0.9028
qqp: 0.9146
mnli_validation_matched: 0.8639
mnli_validation_mismatched: 0.8648
qnli: 0.6110
rte: 0.7112
wnli: 0.5634

GLUE Score: 0.8088
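
The aggregate above is the plain mean of the ten numbers in the summary, which can be checked by hand. The sketch below repeats that arithmetic and, as an assumed alternative rather than the official GLUE procedure, also averages the two MNLI splits first so that MNLI counts once. It additionally includes a small Matthews correlation function, since the GLUE benchmark reports that metric for CoLA instead of accuracy. `matthews_corr` is an illustrative helper and the labels passed to it are made up.

```python
import math

# Metrics copied from the summary printed above.
task_metrics = {
    "sst2": 0.9266,
    "cola": 0.8543,
    "mrpc": 0.8750,
    "stsb": 0.9028,
    "qqp": 0.9146,
    "mnli_validation_matched": 0.8639,
    "mnli_validation_mismatched": 0.8648,
    "qnli": 0.6110,
    "rte": 0.7112,
    "wnli": 0.5634,
}

# Plain mean over all ten entries, as in the notebook.
mean_all = sum(task_metrics.values()) / len(task_metrics)
print(f"{mean_all:.4f}")  # 0.8088

# Assumed alternative: average the two MNLI splits, then average over nine tasks.
per_task = {k: v for k, v in task_metrics.items() if not k.startswith("mnli")}
per_task["mnli"] = (
    task_metrics["mnli_validation_matched"]
    + task_metrics["mnli_validation_mismatched"]
) / 2
mean_per_task = sum(per_task.values()) / len(per_task)
print(f"{mean_per_task:.4f}")  # 0.8026


def matthews_corr(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0.0 when the coefficient is undefined (a zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: one positive example is missed.
print(round(matthews_corr([1, 1, 0, 0], [1, 0, 0, 0]), 4))  # 0.5774
```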
