# **arxiv-fine-tuning**


Fine-tuning a transformer model on arXiv paper abstracts for downstream scientific text embedding tasks.

## **Setup**

This notebook is designed to work in both Google Colab and local environments.

**For Google Colab:**
- **Mount Google Drive:** Enables saving files and accessing them across Colab.
    > ⚠ **Warning** <br>
    > This mounts your entire Google Drive, so the notebook could in principle read or modify any file in it. The code only touches the project folder, but consider using a dedicated Google account.
- **Clone the repository:** Ensures the latest code and utility modules are available.
- **Add repo to Python path:** Lets us import custom project modules as regular Python packages.

**For local environments:**
- **Add project root to Python path:** Lets us import custom project modules from the parent directory.

**Optional:**
- **Enable autoreload:** Lets us modify utility modules without having to reload them manually (useful for development).

In [1]:
import os
import sys

def setup_environment(repo_url, dev=False, drive_mount_path="/content/drive"):
    """Sets up the development environment for both Google Colab and local environments."""

    if "google.colab" not in sys.modules:
        # Define local project root
        project_root = os.path.dirname(os.getcwd())

        print("Not running in Google Colab.\nSkipping Colab setup.")

    else:
        # Mount Google Drive
        from google.colab import drive
        drive.mount(drive_mount_path, force_remount=True)

        # Define where within Drive to clone the git repository
        project_parent_dir = os.path.join(drive_mount_path, "MyDrive")
        project_name = repo_url.split("/")[-1].replace('.git', "")
        project_root = os.path.join(project_parent_dir, project_name)

        # Clone the repository if it doesn't exist
        if not os.path.exists(project_root):
            print(f"\nCloning repository into {project_root}")
            original_dir = os.getcwd()  # Remember where we started
            try:
                os.chdir(project_parent_dir)  # Change to the parent directory to clone the repo
                !git clone {repo_url}
            finally:
                os.chdir(original_dir)  # Always change back to the original directory, even if clone fails
        else:
            print(f"\nRepository already exists at {project_root}")

        print("\nColab setup complete.")

    # Add project to Python path
    if project_root not in sys.path:
        sys.path.insert(0, project_root)
        print(f"\n'{project_root}' added to Python path.")
    else:
        print(f"\n'{project_root}' in Python path.")

    # Enable autoreload (for development)
    if dev:
        from IPython import get_ipython
        ipython = get_ipython()

        # Load extension quietly if not already loaded
        # (run_line_magic replaces the deprecated ipython.magic)
        if "autoreload" not in ipython.extension_manager.loaded:
            ipython.run_line_magic("load_ext", "autoreload")

        print("\nAutoreload extension enabled (mode 2).")
        ipython.run_line_magic("autoreload", "2")

In [2]:
setup_environment("https://github.com/nadrajak/arxiv-semantic-search.git", dev=True)

Mounted at /content/drive

Repository already exists at /content/drive/MyDrive/arxiv-semantic-search

Colab setup complete.

'/content/drive/MyDrive/arxiv-semantic-search' added to Python path.

Autoreload extension enabled (mode 2).


## **Imports**

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import torch

from sentence_transformers import SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from sentence_transformers.losses import TripletLoss

from transformers import EarlyStoppingCallback

# Custom modules
from utils import config
from utils import data_loader
from utils import preprocessing
from utils import triplet_dataset

In [4]:
np.random.seed(config.RANDOM_SEED);
torch.manual_seed(config.RANDOM_SEED);

## **Load data**


We use the [arXiv dataset from Kaggle](https://www.kaggle.com/Cornell-University/arxiv), which contains metadata and abstracts for scholarly papers across STEM fields.

Below, we load a sample of the dataset and briefly inspect its structure.

In [5]:
# Download dataset from Kaggle
arxiv_dataset_path = data_loader.load_arxiv_dataset()

In [6]:
# Load json file as a pandas DataFrame
data = pd.read_json(arxiv_dataset_path, lines=True, nrows=config.FT_NROWS)

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              15000 non-null  float64
 1   submitter       15000 non-null  object 
 2   authors         15000 non-null  object 
 3   title           15000 non-null  object 
 4   comments        13240 non-null  object 
 5   journal-ref     7925 non-null   object 
 6   doi             9392 non-null   object 
 7   report-no       1358 non-null   object 
 8   categories      15000 non-null  object 
 9   license         1060 non-null   object 
 10  abstract        15000 non-null  object 
 11  versions        15000 non-null  object 
 12  update_date     15000 non-null  object 
 13  authors_parsed  15000 non-null  object 
dtypes: float64(1), object(13)
memory usage: 1.6+ MB
None


In [7]:
print(f"'{data.iloc[1]['abstract']}'")
print(f"\n'{data.iloc[1]['categories']}'")

'  We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use
it obtain a characterization of the family of $(k,\ell)$-sparse graphs and
algorithmic solutions to a family of problems concerning tree decompositions of
graphs. Special instances of sparse graphs appear in rigidity theory and have
received increased attention in recent years. In particular, our colored
pebbles generalize and strengthen the previous results of Lee and Streinu and
give a new proof of the Tutte-Nash-Williams characterization of arboricity. We
also present a new decomposition that certifies sparsity based on the
$(k,\ell)$-pebble game with colors. Our work also exposes connections between
pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and
Westermann and Hendrickson.
'

'math.CO cs.CG'


## **Preprocessing**

We apply light normalization to abstracts and simplify the category labels:
- **Whitespace normalization:** Removes extra spaces, tabs, and newlines from text columns.
- **Abstract normalization:** Filters abstracts to a predefined length range to ensure quality.
- **Category truncation:** Keeps only the main/top-level categories.
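
The three steps above can be sketched in plain Python. This is only an illustration, not the code in `utils.preprocessing`: the function names, the length bounds `MIN_CHARS` and `MAX_CHARS`, and the rule "take the first listed category and cut it at the dot" are all assumptions made for this example.

```python
import re

# Hypothetical bounds; the project's real range is defined elsewhere.
MIN_CHARS, MAX_CHARS = 100, 3000


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def has_valid_length(abstract: str) -> bool:
    """Keep only abstracts whose character count falls inside the range."""
    return MIN_CHARS <= len(abstract) <= MAX_CHARS


def top_level_category(categories: str) -> str:
    """Map 'math.CO cs.CG' to 'math' (assumes the first category is primary)."""
    primary = categories.split()[0]
    return primary.split(".")[0]


raw = "  We describe a new algorithm,\nthe pebble game with colors.\n"
print(normalize_whitespace(raw))
print(top_level_category("math.CO cs.CG"))
```

Under these assumptions, a category without a dot (such as `hep-ph`) is returned unchanged.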

Below, we select the relevant columns and preprocess the data.

In [8]:
data = data[["title", "abstract", "categories", "authors"]]

data.head(5)

| | title | abstract | categories | authors |
|---|---|---|---|---|
| 0 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... |
| 1 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | Ileana Streinu and Louis Theran |
| 2 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | Hongjun Pan |
| 3 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | David Callan |
| 4 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | Wael Abu-Shammala and Alberto Torchinsky |


In [9]:
# Apply light preprocessing to text columns
data = preprocessing.normalize_whitespace(data)
data = preprocessing.normalize_abstracts(data)

# Simplify categories
data = preprocessing.truncate_categories(data)

print(data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 14481 entries, 0 to 14999
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     14481 non-null  object
 1   abstract  14481 non-null  object
 2   category  14481 non-null  object
 3   authors   14481 non-null  object
dtypes: object(4)
memory usage: 565.7+ KB
None


In [10]:
print(f"'{data.iloc[1]['abstract']}'")
print(f"\n'{data.iloc[1]['category']}'")

'We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use it obtain a characterization of the family of $(k,\ell)$-sparse graphs and algorithmic solutions to a family of problems concerning tree decompositions of graphs. Special instances of sparse graphs appear in rigidity theory and have received increased attention in recent years. In particular, our colored pebbles generalize and strengthen the previous results of Lee and Streinu and give a new proof of the Tutte-Nash-Williams characterization of arboricity. We also present a new decomposition that certifies sparsity based on the $(k,\ell)$-pebble game with colors. Our work also exposes connections between pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and Westermann and Hendrickson.'

'math'


## **Fine-tuning**

We use [AllenAI SPECTER](https://huggingface.co/allenai/specter), a transformer model designed for scientific paper embeddings. We fine-tune it further on our arXiv subset using triplet loss, which pulls the embeddings of similar abstracts together and pushes dissimilar ones apart. The fine-tuned model is then pushed to the Hugging Face Hub for easy reuse.

For robust evaluation, the data is split approximately into:
- **(4/6) Training set:** For model fitting.
- **(1/6) Evaluation set:** For evaluating the base model and for validation during training.
- **(1/6) Test set:** Held out for the final evaluation of the fine-tuned model.

Performance is measured using **cosine accuracy** on triplet samples, where higher accuracy indicates better semantic understanding.
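
To make the metric concrete, here is a minimal sketch of cosine accuracy on toy 2-D vectors. It assumes a triplet counts as correct when the anchor is strictly more cosine-similar to the positive than to the negative. The helper names and toy vectors are made up for illustration; the notebook itself relies on the evaluator built by `triplet_dataset.create_triplet_evaluator_for_trainer`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(
        cosine_similarity(a, p) > cosine_similarity(a, n)
        for a, p, n in triplets
    )
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: correct
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),  # negative is closer: wrong
]
print(triplet_cosine_accuracy(toy_triplets))  # 0.5
```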

Below, we fine-tune the model.

In [11]:
# Initialize model
model = SentenceTransformer("allenai-specter");

# Move model to gpu if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device);

In [12]:
# Define training arguments -- mostly based on Transformers defaults (which SentenceTransformers wraps)
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer
args = SentenceTransformerTrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Max before CUDA out of memory
    per_device_eval_batch_size=8,   # Max before CUDA out of memory
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    learning_rate=2e-5,
    warmup_ratio=0.1,
    dataloader_pin_memory=False,
    dataloader_num_workers=1,
    fp16=True,
    logging_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    report_to="none",
)
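
As a rough illustration of what `lr_scheduler_type="cosine"` with `warmup_ratio=0.1` implies, the sketch below computes a learning rate that rises linearly during warmup and then follows half a cosine wave down to zero. It is a simplified stand-in written for this example, not the scheduler code from Transformers, and `lr_at_step` and `total_steps` are hypothetical names.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup followed by cosine decay to zero (simplified sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of a hypothetical 1000-step run
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

In this sketch the rate peaks at `base_lr` exactly when warmup ends and is halved at the midpoint of the decay phase.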

In [13]:
# Split data into train, eval, and test sets
train_data, eval_data = train_test_split(data, test_size=0.33, random_state=config.RANDOM_SEED)
eval_data, test_data = train_test_split(eval_data, test_size=0.5, random_state=config.RANDOM_SEED)

print(f"Train: {len(train_data)}, Eval: {len(eval_data)}, Test: {len(test_data)}")

Train: 9702, Eval: 2389, Test: 2390
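
The sizes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up and gives the remainder to the first split, which is how scikit-learn behaves to the best of our knowledge. The variable names are illustrative only.

```python
import math

n_total = 14481                        # rows remaining after preprocessing

# First split: test_size=0.33 goes to the hold-out pool
n_holdout = math.ceil(0.33 * n_total)  # 4779
n_train = n_total - n_holdout          # 9702

# Second split: test_size=0.5 of the hold-out pool becomes the test set
n_test = math.ceil(0.5 * n_holdout)    # 2390
n_eval = n_holdout - n_test            # 2389

print(f"Train: {n_train}, Eval: {n_eval}, Test: {n_test}")
```

The resulting fractions are about 0.67, 0.165, and 0.165, close to the 4/6, 1/6, 1/6 split described earlier.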


In [14]:
# Create train, eval, test datasets for SentenceTransformerTrainer
train_dataset = triplet_dataset.create_dataset_for_trainer(train_data, n=config.FT_TRAIN_TRIPLETS)
eval_dataset = triplet_dataset.create_dataset_for_trainer(eval_data, n=config.FT_EVAL_TRIPLETS)
test_dataset = triplet_dataset.create_dataset_for_trainer(test_data, n=config.FT_TEST_TRIPLETS)

print(f"Train: {train_dataset}\n\nEval: {eval_dataset}\n\nTest: {test_dataset}")

Train: Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 10000
})

Eval: Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 1000
})

Test: Dataset({
    features: ['anchor', 'positive', 'negative'],
    num_rows: 1000
})
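
The internals of `create_dataset_for_trainer` are not shown in this notebook. One plausible way to build such triplets, sketched below purely as an assumption, is to draw the anchor and positive from the same top-level category and the negative from a different one. The function `sample_triplets` and its arguments are hypothetical.

```python
import random


def sample_triplets(rows, n, seed=0):
    """Sample n triplets from (abstract, category) pairs.

    Assumption for illustration: same category means "similar".
    """
    rng = random.Random(seed)
    by_category = {}
    for abstract, category in rows:
        by_category.setdefault(category, []).append(abstract)

    # An anchor category needs at least two abstracts (anchor + positive)
    anchor_categories = [c for c, items in by_category.items() if len(items) >= 2]
    if not anchor_categories or len(by_category) < 2:
        raise ValueError("Need two categories, one with at least two abstracts.")

    triplets = []
    for _ in range(n):
        pos_cat = rng.choice(anchor_categories)
        neg_cat = rng.choice([c for c in by_category if c != pos_cat])
        anchor, positive = rng.sample(by_category[pos_cat], 2)
        negative = rng.choice(by_category[neg_cat])
        triplets.append(
            {"anchor": anchor, "positive": positive, "negative": negative}
        )
    return triplets


toy_rows = [
    ("graph pebble games", "math"),
    ("stirling cycle numbers", "math"),
    ("diphoton production", "hep-ph"),
    ("earth-moon evolution", "physics"),
]
for triplet in sample_triplets(toy_rows, n=2):
    print(triplet)
```

The dictionary keys match the `anchor`, `positive`, and `negative` features shown in the dataset summary above.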


In [15]:
# Create evaluators for the eval and test sets (used with SentenceTransformerTrainer)
dev_evaluator = triplet_dataset.create_triplet_evaluator_for_trainer(eval_dataset)
test_evaluator = triplet_dataset.create_triplet_evaluator_for_trainer(test_dataset)

In [16]:
# Evaluate the base model before fine-tuning (baseline for comparison)
base_eval_output = dev_evaluator(model)

In [17]:
# Define training loss
loss = TripletLoss(model)
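
For intuition, the quantity that triplet loss minimizes for a single triplet can be written as `max(d(a, p) - d(a, n) + margin, 0)`. The sketch below assumes Euclidean distance and a margin of 5, which we believe are the `TripletLoss` defaults in Sentence Transformers; check the documentation of your installed version. The helpers operate on plain lists rather than model embeddings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]                                  # distance 5 from anchor
print(triplet_loss(anchor, positive, [6.0, 8.0]))      # negative at distance 10
print(triplet_loss(anchor, positive, [0.0, 1.0]))      # negative at distance 1
```

The first call returns zero because the negative already sits a full margin farther away than the positive. The second is penalized because the negative is closer to the anchor than the positive.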

In [18]:
# Initialize trainer
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,    # For validation loss
    loss=loss,
    evaluator=dev_evaluator       # For meaningful metrics
)


In [19]:
trainer.train()

| Step | Training Loss | Validation Loss | Triplet Eval Cosine Accuracy |
|---|---|---|---|
| 500 | 2.5031 | 0.795555 | 0.941 |
| 1000 | 1.0464 | 0.75939 | 0.945 |
| 1500 | 0.5218 | 0.708552 | 0.948 |


TrainOutput(global_step=1875, training_loss=1.1498443074544271, metrics={'train_runtime': 1463.5962, 'train_samples_per_second': 20.497, 'train_steps_per_second': 1.281, 'total_flos': 0.0, 'train_loss': 1.1498443074544271, 'epoch': 3.0})
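
The step count and throughput in the output above follow from the training arguments. The arithmetic below assumes a single device, so the effective batch size is `per_device_train_batch_size * gradient_accumulation_steps`. The variable names are illustrative.

```python
import math

n_triplets = 10_000        # rows in train_dataset
per_device_batch = 8
grad_accum = 2
epochs = 3
train_runtime = 1463.5962  # seconds, from TrainOutput

batches_per_epoch = math.ceil(n_triplets / per_device_batch)  # 1250
updates_per_epoch = batches_per_epoch // grad_accum           # 625
total_steps = updates_per_epoch * epochs                      # 1875

# With eval_steps=500, evaluation runs at every multiple of 500 up to total_steps
eval_points = list(range(500, total_steps + 1, 500))

samples_per_second = n_triplets * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, eval_points)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This explains why the log contains only three rows: the run ends at step 1875, before a fourth evaluation at step 2000 would occur.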

In [20]:
tuned_eval_output = dev_evaluator(model)
tuned_test_output = test_evaluator(model)

In [21]:
print(f"Base model on eval dataset:  {base_eval_output}")
print(f"Tuned model on eval dataset: {tuned_eval_output}")
print(f"Tuned model on test dataset: {tuned_test_output}")

Base model on eval dataset:  {'triplet_eval_cosine_accuracy': 0.8209999799728394}
Tuned model on eval dataset: {'triplet_eval_cosine_accuracy': 0.9490000009536743}
Tuned model on test dataset: {'triplet_eval_cosine_accuracy': 0.9470000267028809}
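
To put these numbers in perspective, the short calculation below derives the absolute gain and the relative reduction in error rate on the eval set, using the accuracies printed above rounded to three decimals. The base model was not evaluated on the test set in this notebook, so the comparison is limited to the eval set.

```python
base_eval_acc = 0.821   # base model, eval set
tuned_eval_acc = 0.949  # fine-tuned model, eval set

absolute_gain = tuned_eval_acc - base_eval_acc
# Share of the base model's ranking errors removed by fine-tuning
relative_error_reduction = absolute_gain / (1.0 - base_eval_acc)

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative error reduction: {relative_error_reduction * 100:.1f}%")
```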


In [23]:
# Check for a Hugging Face login token (stored as a Colab secret)
hf_token = None
if "google.colab" in sys.modules:
    from google.colab import userdata
    hf_token = userdata.get('HF_TOKEN')

# Log in and push the fine-tuned model to the Hugging Face Hub
if hf_token:
    !huggingface-cli login --token $hf_token
    model.push_to_hub("nadrajak/allenai-specter-ft2")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `kaggle` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `kaggle`


Uploading...:   0%|          | 0.00/440M [00:00<?, ?B/s]