# Scaling up the model

Notebook written by Jeffrey Dick on 2025-04-14.

To build the model for my project, I fine-tuned a [pretrained DeBERTa model](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli) on scientific citation datasets.
These domain-specific datasets together have a few thousand examples:
- [SciFact](https://github.com/allenai/scifact) [[paper](https://doi.org/10.18653/v1/2022.findings-emnlp.347)] - 1,409 examples
- [Citation-Integrity](https://github.com/ScienceNLP-Lab/Citation-Integrity/) [[paper](https://doi.org/10.1093/bioinformatics/btae420)] - 3,063 examples

This notebook scales up the model by running inference on the pre-training datasets, each of which consists of 100k+ examples:
- [Fever](https://huggingface.co/datasets/copenlu/fever_gold_evidence) [[paper](https://doi.org/10.18653/v1/N18-1074)] - Fact Extraction and VERification, with claims derived by altering Wikipedia sentences.
- [Facebook Adversarial NLI](https://huggingface.co/datasets/facebook/anli) [[paper](https://aclanthology.org/2020.acl-main.441/)] - Dynamic benchmark built with a human-and-model-in-the-loop procedure over rounds of increasing difficulty.
- [NYU Multi NLI](https://huggingface.co/datasets/nyu-mll/multi_nli) [[paper](http://aclweb.org/anthology/N18-1101)] - Multi-Genre Natural Language Inference corpus, spanning genres including face-to-face, government, letters, 9/11, Oxford University Press books, Slate magazine, travel guides, telephone conversations, linguistics posts, and fiction.

## Implementation notes

- This notebook is compatible with a GPU runtime in the Google Colab environment.
- This notebook uses a Python package (**pyvers**) that I wrote for training and testing the model.
- Keeping the data processing and model-specific code in the package makes this notebook more readable.
- This notebook uses `git clone` to clone the **pyvers** package [from GitHub](https://github.com/jedick/pyvers), then installs the package with `pip`.


## Setup Python environment

In [None]:
# Clone pyvers package from GitHub
!git clone "https://github.com/jedick/pyvers.git"

In [None]:
# Install requirements for pyvers
!pip install -r pyvers/requirements.txt

In [None]:
# Install pyvers
!pip install ./pyvers

In [4]:
# Import required modules
import pytorch_lightning as pl
from pyvers.data import NLIDataModule
from pyvers.model import PyversClassifier
import pandas as pd

## Run inference

In [5]:
def run_inference(test_fold="test", fast_dev_run=False):
    """
    Runs inference on three pre-training datasets using two models:
    The first model has not undergone any fine-tuning;
    The second model has been fine-tuned on SciFact and Citation-Integrity.

    Args:
        test_fold: Split to use as the test fold in the data module (e.g. "test" or "train").
        fast_dev_run: Set to True (for 1 batch) or to an int to limit inference to that number of batches.
    """

    dataset_names = ["copenlu/fever_gold_evidence", "facebook/anli", "nyu-mll/multi_nli"]
    # Initialize output variables
    size = []
    acc_before = []
    acc_after = []

    # Use a batch size of 100, so the dataset size computed below
    # (number of batches * batch size) is a multiple of 100
    batch_size = 100

    for dataset_name in dataset_names:

        # Load before-fine-tuning model
        dm = NLIDataModule(dataset_name, "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", batch_size=batch_size, test_fold=test_fold)
        print(f"Running inference on {test_fold} split from {dataset_name} using model BEFORE fine-tuning")
        model = PyversClassifier(dm.model_name)
        # Setup trainer and run inference
        trainer = pl.Trainer(fast_dev_run=fast_dev_run)
        test_metrics = trainer.test(model, datamodule=dm)
        acc_before.append(test_metrics[0]["Accuracy"])

        # Load after-fine-tuning model
        dm = NLIDataModule(dataset_name, "jedick/DeBERTa-v3-base-mnli-fever-anli-scifact-citint", batch_size=batch_size, test_fold=test_fold)
        print(f"Running inference on {test_fold} split from {dataset_name} using model AFTER fine-tuning")
        model = PyversClassifier(dm.model_name)
        # Setup trainer and run inference
        trainer = pl.Trainer(fast_dev_run=fast_dev_run)
        test_metrics = trainer.test(model, datamodule=dm)
        acc_after.append(test_metrics[0]["Accuracy"])

        # Get the size of the dataset (number of batches * batch size)
        dm.setup("test")
        size.append(len(dm.test_dataloader()) * batch_size)

    # Combine the per-dataset results into a DataFrame
    zipped_lists = zip(dataset_names, size, acc_before, acc_after)
    columns = ["dataset_name", "size", "acc_before", "acc_after"]
    return pd.DataFrame(zipped_lists, columns=columns)

In [None]:
# Run inference on test splits
df_test = run_inference(test_fold="test")

In [None]:
# Run inference on train splits
df_train = run_inference(test_fold="train", fast_dev_run=1000)

## Results

The DataFrame columns show:
- dataset name
- size (number of examples in the split, rounded up to a multiple of 100 because of the batch size)
- accuracy for the base DeBERTa model (pretrained but without fine-tuning)
- accuracy for the model after fine-tuning on SciFact and Citation-Integrity
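
The size column can be illustrated with a short sketch. It assumes the data loader keeps the last partial batch (the PyTorch default), so the reported size is the number of batches times the batch size. The function name `reported_size` and the example counts are hypothetical, chosen only for illustration.

```python
import math


def reported_size(n_examples: int, batch_size: int = 100) -> int:
    """Number of batches (keeping the last partial batch) times the batch size."""
    n_batches = math.ceil(n_examples / batch_size)
    return n_batches * batch_size


# Hypothetical example counts, chosen for illustration
print(reported_size(392_702))  # rounds up to the next multiple of 100
print(reported_size(1_200))    # already a multiple of 100, so unchanged
```

Because the size comes from the length of the data loader, it reflects the full split even when `fast_dev_run` limits how many batches are actually run.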

There are two sets of results: first for the test splits in each dataset, and second for the train splits in each dataset.

In [8]:
# Print the results for inference on test splits
df_test

| | dataset_name | size | acc_before | acc_after |
|---|---|---|---|---|
| 0 | copenlu/fever_gold_evidence | 16100 | 0.806035 | 0.785897 |
| 1 | facebook/anli | 1200 | 0.485833 | 0.4925 |
| 2 | nyu-mll/multi_nli | 9900 | 0.902665 | 0.864626 |


In [9]:
# Print the results for inference on train splits
df_train

| | dataset_name | size | acc_before | acc_after |
|---|---|---|---|---|
| 0 | copenlu/fever_gold_evidence | 228300 | 0.85359 | 0.84988 |
| 1 | facebook/anli | 100500 | 0.98012 | 0.91848 |
| 2 | nyu-mll/multi_nli | 392800 | 0.97598 | 0.92239 |


## Design decisions

- To make this notebook run in a cloud environment, the datasets and models are downloaded from HuggingFace.
  - The pre-trained model is available from [Moritz Laurer's HF repo](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli).
  - The fine-tuned model is available from [my HF repo](https://huggingface.co/jedick/DeBERTa-v3-base-mnli-fever-anli-scifact-citint).
- Training and testing the model are done using the [PyTorch Lightning](https://github.com/Lightning-AI/pytorch-lightning) framework and classes from the **pyvers** package.
  - The data module (`NLIDataModule`) loads datasets from HuggingFace with consistent label encoding for the three classes (0 - Support, 1 - Not Enough Info, 2 - Refute).
  - The model class (`PyversClassifier`) implements fine-tuning of the pretrained model, with configuration for hyperparameter settings, metrics, and inference (see a [previous blog post](https://jedick.github.io/blog/experimenting-with-transformer-models-for-citation-verification/)).
  - Because the notebook uses PyTorch Lightning, no code changes are needed to port it from CPU to GPU.
- To reduce compute usage, I limited the runs to 1000 batches of 100 examples each (i.e., inference on 100,000 examples for each dataset).
  - This was done only for the train splits, and not for the test splits, which are considerably smaller.
  - Approximate time to run inference on the train splits (100,000 examples/run * 3 datasets * 2 models = 600,000 total examples) with an A100 GPU in Colab: 18 minutes.
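
The timing figure above can be turned into a rough throughput estimate. This is only back-of-the-envelope arithmetic: it assumes the 18 minutes is wall-clock time for all six runs, so any model loading or data download in that window is counted as inference time.

```python
# Figures taken from the notes above
examples_per_run = 1000 * 100  # 1000 batches of 100 examples
n_datasets = 3
n_models = 2
minutes = 18  # approximate wall-clock time on an A100 GPU

total_examples = examples_per_run * n_datasets * n_models
throughput = total_examples / (minutes * 60)  # examples per second

print(f"Total examples: {total_examples:,}")
print(f"Approximate throughput: {throughput:.0f} examples/second")
```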

## Findings

### Test splits

- The accuracy of predictions for the test split in MultiNLI is somewhat lower after fine-tuning (0.86 vs 0.90).
  - Fine-tuning on biomedical claims may have reduced performance on fiction and popular culture genres, which form part of the MultiNLI dataset.
  - This represents an acceptable tradeoff in order to achieve satisfactory performance on the in-domain datasets.
- There is a smaller accuracy loss on the Fever dataset (0.79 vs 0.81).
- There is a minor accuracy increase on the ANLI dataset (0.493 vs 0.486), but the accuracy for both models is quite low (less than 0.5).
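
The changes described above can be recomputed from the test-split results. The sketch below copies the accuracies from the results table into a plain dictionary; the helper name `accuracy_change` is hypothetical and is not part of **pyvers**.

```python
# (acc_before, acc_after) copied from the test-split results table
test_results = {
    "copenlu/fever_gold_evidence": (0.806035, 0.785897),
    "facebook/anli": (0.485833, 0.4925),
    "nyu-mll/multi_nli": (0.902665, 0.864626),
}


def accuracy_change(results):
    """Return {dataset: acc_after - acc_before}, rounded to 3 decimals."""
    return {
        name: round(after - before, 3)
        for name, (before, after) in results.items()
    }


# Negative values mean accuracy dropped after fine-tuning
for name, delta in accuracy_change(test_results).items():
    print(f"{name}: {delta:+.3f}")
```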

### Train splits
- As expected, accuracy on the train splits is higher than that for the test splits (note that the train splits were used for pretraining the DeBERTa model).
- The ANLI dataset has a large difference between train and test accuracy.
  - The Facebook ANLI dataset is a dynamic benchmark, with multiple rounds of increasing difficulty.
  - In each round, models are trained and human annotators devise a new test set that challenges the best-performing models.
  - The test split used here is from the last round (Round 3) of the Facebook ANLI dataset, representing the most challenging premise-hypothesis pairs. This explains the large difference in performance between the train and test sets for this dataset.
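
To put the train-test difference in numbers, the sketch below compares the base model's accuracy (`acc_before`) on the two splits, using values copied from the results tables. Keep in mind that the train accuracies come from runs limited to 1000 batches, so they are estimates over part of each train split.

```python
# acc_before values copied from the train- and test-split results tables
acc_before = {
    "copenlu/fever_gold_evidence": {"train": 0.85359, "test": 0.806035},
    "facebook/anli": {"train": 0.98012, "test": 0.485833},
    "nyu-mll/multi_nli": {"train": 0.97598, "test": 0.902665},
}

# Gap between train and test accuracy for each dataset
gaps = {name: acc["train"] - acc["test"] for name, acc in acc_before.items()}
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")
print(f"Largest gap: {largest}")
```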

## Conclusion

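The fine-tuned model exhibits some performance degradation on the pre-training datasets, most clearly on MultiNLI (both splits) and on the ANLI train split, while accuracy on the ANLI test split is essentially unchanged.
The results suggest that the fine-tuned model is most useful for in-domain data (scientific citations).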

With regard to scaling up the model, this notebook shows that running inference on 100k+ examples in a matter of minutes is possible with a GPU runtime.