# Fine-tuning an embeddings model

## Installing libraries

PyTorch 2.5.1, TorchVision 0.20, and accelerate 0.26.0 are required by sentence_transformers and xformers. Run `pip install -U torch torchvision` first.

If you still see the accelerate version error after installing `accelerate==0.26.0`, **open the `Kernel` menu of this notebook and select `Restart Kernel`, so that the libraries are re-imported with the newly installed versions.**

In [None]:
!pip install accelerate==0.26.0
!pip install -U torch torchvision datasets
!pip install transformers[torch]
!pip install -U sentence_transformers
!pip install -U xformers --index-url https://download.pytorch.org/whl/cu124
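
To confirm which versions ended up installed, a quick check like the following can help. This is a minimal sketch: the helper `installed_versions` and the `PACKAGES` list are illustrative names, not part of the original notebook. It reads the installed package metadata, so it reports what pip installed even if the running kernel still holds an older import.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names for the packages installed in the cell above.
PACKAGES = ["torch", "torchvision", "accelerate", "transformers",
            "sentence-transformers", "xformers", "datasets"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:24s} {ver or 'not installed'}")
```

If the version printed here differs from `accelerate.__version__` in the running session, the kernel has not been restarted yet.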

## Download dataset 

This notebook uses the AmazonQA dataset: https://huggingface.co/datasets/embedding-data/Amazon-QA

The dataset consists of over 1M pairs of a question (feature name "query") and a list of answers (feature name "pos"). Fine-tuning the embeddings model requires each training example to be a pair of one question and one answer, so the pre-processing function `select_one_positive` keeps only the first answer in each list. The function is applied with `map` on the result of `load_dataset`.

The dataset is split into a training set (80%), a validation set (5%), and a test set (15%).

In [None]:
import os
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from transformers import EarlyStoppingCallback

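# Dataset
def select_one_positive(example, feature='pos'):
    # Replace the list of answers with its first element, so that each
    # example becomes a single (question, answer) pair.
    example[feature] = example[feature][0]
    return example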

dataset_name = "embedding-data/Amazon-QA"
train_dataset = load_dataset(dataset_name, split='train[:80%]').map(select_one_positive)
# train_dataset.info.dataset_name = dataset_name +"_train"
val_dataset = load_dataset(dataset_name, split='train[80%:85%]').map(select_one_positive)
# val_dataset.info.dataset_name = dataset_name +"_val"
test_dataset = load_dataset(dataset_name, split='train[85%:]').map(select_one_positive)
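
The effect of the pre-processing and the split can be illustrated without downloading anything. The sketch below uses made-up records in the same shape as the dataset (`toy_rows` is not real data) and approximates the percentage slicing with integer boundaries; the exact rounding that `load_dataset` applies to percent slices may differ.

```python
def select_one_positive(example, feature="pos"):
    # Same logic as in the cell above: keep only the first answer.
    example[feature] = example[feature][0]
    return example

# Made-up records with a "query" string and a "pos" list of answers.
toy_rows = [
    {"query": f"question {i}", "pos": [f"answer {i}a", f"answer {i}b"]}
    for i in range(20)
]

# Copy each row first so the toy input is left unchanged.
pairs = [select_one_positive(dict(row)) for row in toy_rows]

# Approximate the 80% / 5% / 15% slicing with integer boundaries.
n = len(pairs)
train_end = n * 80 // 100
val_end = n * 85 // 100
train, val, test = pairs[:train_end], pairs[train_end:val_end], pairs[val_end:]

print(pairs[0])
print(len(train), len(val), len(test))
```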



## Fine-tuning script

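This notebook uses the model "dunzhang/stella_en_400M_v5". Any other Hugging Face model that is compatible with the SentenceTransformer library can be used instead. Training takes 8-9 hours on ml.g5.16xlarge (a single A10G GPU instance). With the settings below, one training step covers 128 pairs (8 pairs per batch × 16 gradient accumulation steps). Validation loss is computed every 1000 training steps (`eval_steps=1000`) and a checkpoint is saved every 2000 training steps (`save_steps=2000`). Training loss appears in the log every 500 steps, as the log output at the end of this notebook shows.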
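
The batch-size arithmetic used in the training cell can be checked in isolation. The sketch below mirrors the formulas from that cell; the function name `batch_schedule` and its parameter names (`target_batch`, `base_lr`, `base_batch`) are illustrative and do not appear in the notebook.

```python
def batch_schedule(batch_size, target_batch=128, base_lr=2e-5, base_batch=64):
    """Mirror the gradient accumulation and learning rate arithmetic."""
    # Accumulate enough batches to reach the target, clamped to [1, 64].
    accumulation = min(max(target_batch // batch_size, 1), 64)
    effective_batch = batch_size * accumulation
    # Scale the base learning rate linearly with the effective batch size.
    learning_rate = base_lr * effective_batch / base_batch
    return accumulation, effective_batch, learning_rate

accumulation, effective_batch, learning_rate = batch_schedule(batch_size=8)
print(accumulation, effective_batch, learning_rate)
print("pairs seen between two evaluations:", 1000 * effective_batch)
```

For `batch_size = 8` this gives 16 accumulation steps, an effective batch of 128 pairs, and a peak learning rate of 4e-5.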


In [None]:
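output_dir = "./user-default-efs/output"
n_epochs = 1
batch_size = 8     # pairs per device in one forward pass
patience = 2       # early stopping patience, counted in evaluations
checkpoint = True  # passed to gradient_checkpointing below
train_files = 1000
model_name = "dunzhang/stella_en_400M_v5"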


# The default embedding dimension is 1024. If you need another dimension, clone the model and
# modify `modules.json` to replace `2_Dense_1024` with another dimension, e.g. `2_Dense_256` or `2_Dense_8192`.
# Load the model on the GPU
model = SentenceTransformer(model_name, trust_remote_code=True, config_kwargs={"use_memory_efficient_attention": False, "unpad_inputs": False}).cuda()

# Define loss
loss = losses.MultipleNegativesRankingLoss(model)

# Calculate gradient accumulation steps: 128/batch_size, clamped between 1 and 64
gradient_accumulation_steps = min(max(128 // batch_size, 1), 64)
effective_batch_size = batch_size * gradient_accumulation_steps
learning_rate = 2e-5 * effective_batch_size / 64

# Training arguments
training_args = SentenceTransformerTrainingArguments(
    output_dir=output_dir,
    num_train_epochs=n_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=learning_rate,
    warmup_ratio=0.1,
    bf16=True,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=checkpoint,
    eval_strategy="no" if val_dataset is None else "steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=2000,
    save_total_limit=10,
    load_best_model_at_end=val_dataset is not None,
)


# Initialize trainer, adding early stopping only if validation is enabled
trainer_kwargs = {
    "model": model,
    "args": training_args,
    "train_dataset": train_dataset,
    "loss": loss,
}
if val_dataset is not None:
    trainer_kwargs.update({
        "eval_dataset": val_dataset,
        "callbacks": [EarlyStoppingCallback(early_stopping_patience=patience)]
    })

trainer = SentenceTransformerTrainer(**trainer_kwargs)

# Train the model
trainer.train()

# Save the final model
last_model_name = model_name.split('/')[-1]
output_path = os.path.join(output_dir, dataset_name, f'fine_tuned_{last_model_name}')
model.save(output_path)
print(f"Model saved to {output_path}")
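
To give some intuition for `MultipleNegativesRankingLoss`, the following is a simplified sketch of the in-batch negatives idea, not the library's implementation. For each question, the paired answer is treated as the correct class and the other answers in the same batch as negatives. The function names, the toy vectors, and the choice of cosine similarity with a scale of 20 are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_vecs, answer_vecs, scale=20.0):
    """Mean cross-entropy where answer i is the correct class for query i."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, a) for a in answer_vecs]
        # Numerically stable log-sum-exp over all answers in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional "embeddings" (made up for illustration).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each answer is close to its own question
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each answer is close to the other question

print(in_batch_negatives_loss(queries, aligned))
print(in_batch_negatives_loss(queries, swapped))
```

The aligned batch yields a much lower loss than the swapped one. Under this view, negatives come from the same forward batch, so with `batch_size = 8` each question is contrasted with 7 other answers; gradient accumulation enlarges the effective batch for the optimizer but does not add negatives.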


In [11]:
import pandas as pd 
pd.DataFrame(trainer.state.log_history)

| | loss | grad_norm | learning_rate | epoch | step | eval_loss | eval_runtime | eval_samples_per_second | eval_steps_per_second | train_runtime | train_samples_per_second | train_steps_per_second | total_flos | train_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0988 | 40.514362 | 2.9e-05 | 0.07304 | 500 | | | | | | | | | |
| 1 | 0.0933 | 43.092133 | 3.8e-05 | 0.14608 | 1000 | | | | | | | | | |
| 2 | | | | 0.14608 | 1000 | 0.085428 | 491.5249 | 111.417 | 13.928 | | | | | |
| 3 | 0.0917 | 52.983276 | 3.5e-05 | 0.21912 | 1500 | | | | | | | | | |
| 4 | 0.0821 | 32.844742 | 3.1e-05 | 0.29216 | 2000 | | | | | | | | | |
| 5 | | | | 0.29216 | 2000 | 0.077204 | 491.489 | 111.425 | 13.929 | | | | | |
| 6 | 0.0768 | 24.372654 | 2.8e-05 | 0.3652 | 2500 | | | | | | | | | |
| 7 | 0.0736 | 43.456551 | 2.5e-05 | 0.43824 | 3000 | | | | | | | | | |
| 8 | | | | 0.43824 | 3000 | 0.072186 | 492.9578 | 111.093 | 13.888 | | | | | |
| 9 | 0.0722 | 29.331873 | 2.2e-05 | 0.51128 | 3500 | | | | | | | | | |
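
Training and evaluation entries are interleaved in the log, which is why each row has empty cells. A small sketch of separating the two curves is shown below. It uses a few entries copied from the log above, written as the list of dictionaries that `trainer.state.log_history` holds; the variable names are illustrative.

```python
# A few entries copied from the log above; each entry only has the keys
# that were logged at that step.
log_history = [
    {"loss": 0.0988, "epoch": 0.07304, "step": 500},
    {"loss": 0.0933, "epoch": 0.14608, "step": 1000},
    {"eval_loss": 0.085428, "epoch": 0.14608, "step": 1000},
    {"loss": 0.0917, "epoch": 0.21912, "step": 1500},
    {"loss": 0.0821, "epoch": 0.29216, "step": 2000},
    {"eval_loss": 0.077204, "epoch": 0.29216, "step": 2000},
]

# Split into (step, value) pairs for training loss and validation loss.
train_curve = [(e["step"], e["loss"]) for e in log_history if "loss" in e]
eval_curve = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]

# Steps per epoch implied by the log (approximate, since epoch is rounded).
steps_per_epoch = round(log_history[0]["step"] / log_history[0]["epoch"])

print(train_curve)
print(eval_curve)
print(steps_per_epoch)
```

Both curves decrease over the logged steps, and the implied number of steps per epoch gives a rough idea of how far through the single epoch each row is.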
