[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/179vQNSRsmT_GohvbFt_e8UucoW21pZPE?usp=sharing)

## Fine-Tuning Sentence Transformers

### **1. Install and Import Relevant Libraries**

First, we install the `sentence-transformers` library:

In [1]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)


We will be utilising datasets provided by Hugging Face:

In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-16.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (40.8 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.w

Note: when running the code in this article, you may see an error about the `accelerate` package. To fix this, run:

In [3]:
!pip install transformers[torch] accelerate -U

Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.31.0


Amazing! We have installed the relevant modules needed to start fine-tuning. Let’s import the necessary libraries. These will be explained as we continue through the code.

In [4]:
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.evaluation import TripletEvaluator

### **2. Load a Pre-Trained Model**

The powerful thing about sentence transformers is that you can take a pre-trained model and fine-tune it to adapt it to a specific task or domain, thereby significantly improving its performance on that particular application. So, we load a pre-trained model:

In [5]:
model = SentenceTransformer("microsoft/mpnet-base")


Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Let’s break this down:

- **`SentenceTransformer`**: This class provides a simple interface for using pre-trained sentence embedding models.
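- **`"microsoft/mpnet-base"`**: This is the Hugging Face Hub identifier for MPNet, a pre-trained language model developed by Microsoft. It is a general-purpose base model that has not yet been tuned for sentence embeddings, which makes it a suitable starting point for fine-tuning.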

### **3. Loading and Preparing the Dataset**

We now need a dataset to fine-tune our model on. We load the `"triplet"` configuration of the `all-nli` dataset, published under the `sentence-transformers` organisation on Hugging Face, which suits the triplet-based training discussed earlier. The dataset comes split into training, evaluation ("dev"), and test sets.

In [6]:
dataset = load_dataset("sentence-transformers/all-nli", "triplet")
train_dataset = dataset["train"].select(range(100_000))
eval_dataset = dataset["dev"]
test_dataset = dataset["test"]

Generating train split:   0%|          | 0/557850 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/6584 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6609 [00:00<?, ? examples/s]

We have the following variables:

- **`dataset`**: This variable holds the entire dataset loaded from the `"sentence-transformers/all-nli"` dataset with the `"triplet"` configuration (each example consists of triplets: anchor, positive, negative). The `load_dataset` function is used to download and prepare the dataset for use.
- **`train_dataset`**: This variable contains a subset of the training data: the first 100,000 samples from the "train" split of the dataset. The `select(range(100_000))` method call creates this subset, which is typically done to limit the dataset size for faster training and experimentation.
- **`eval_dataset`**: This variable contains the evaluation dataset, also known as the validation set. It is obtained from the "dev" split of the original dataset. The evaluation dataset is used to evaluate the model's performance during training, helping to tune hyperparameters and prevent overfitting.
- **`test_dataset`**: This variable contains the test dataset, which is obtained from the "test" split of the original dataset. The test dataset is used to assess the final performance of the trained model on unseen data, providing an unbiased evaluation of its generalization capabilities.
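To make the triplet layout and the subsetting step concrete, here is a minimal sketch in plain Python. The three rows are invented for illustration and are not taken from all-nli, and `select_first` is a hypothetical helper that only mimics what `select(range(n))` does on the real dataset.

```python
# Toy stand-in for a triplet dataset. Each row has an anchor, a sentence that
# should embed close to it (positive), and one that should not (negative).
# ASSUMPTION: these rows are invented for illustration, not taken from all-nli.
toy_triplets = [
    {"anchor": "A man is playing a guitar.",
     "positive": "A person plays an instrument.",
     "negative": "A man is sleeping."},
    {"anchor": "Two dogs run through a field.",
     "positive": "Animals are outside.",
     "negative": "Two cats sit on a sofa."},
    {"anchor": "A child eats an apple.",
     "positive": "A kid is eating fruit.",
     "negative": "A child is swimming."},
]


def select_first(rows, n):
    """Hypothetical helper: keep the first n rows, like select(range(n))."""
    return [rows[i] for i in range(n)]


subset = select_first(toy_triplets, 2)
print(len(subset), sorted(subset[0].keys()))

# Share of the full "train" split (557,850 examples) kept by select(range(100_000)).
fraction_kept = 100_000 / 557_850
print(f"{fraction_kept:.1%}")
```

In other words, the tutorial trains on a bit under a fifth of the available training triplets.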

### **4. Defining a Loss Function**

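We now define the loss function used during training. For this example we use Multiple Negatives Ranking (MNR) loss, which works well with (anchor, positive, negative) triplets: for each anchor, it treats the other examples in the batch as additional negatives.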

In [7]:
loss = MultipleNegativesRankingLoss(model)

You have the flexibility to use different loss functions if you wish to. See this [documentation](https://sbert.net/docs/sentence_transformer/loss_overview.html) for more information.
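To give a feel for what MNR loss computes, here is a simplified pure-Python sketch. It is not the library's implementation: the function names are hypothetical, the vectors are toy 2-D "embeddings", and the cosine similarity with `scale=20.0` is an assumption that mirrors what I understand the library defaults to be. For each anchor, every positive (and any negative) in the batch is a candidate, and the loss is the cross-entropy of picking the anchor's own positive.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, negatives=(), scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i among all candidates.

    ASSUMPTION: cosine similarity and scale=20.0 are illustrative choices.
    """
    candidates = list(positives) + list(negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, c) for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[i]  # negative log-probability of the true positive
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # positive i points the same way as anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives paired with the wrong anchors

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```

The loss is close to zero when each anchor is most similar to its own positive and grows when another candidate in the batch scores higher, which is the behaviour training tries to drive down.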

### **5. Creating a Trainer and Training the Model**

We start training with the code below, which uses:

- **`SentenceTransformerTrainer`**: This class handles the training loop, evaluation, and optimization of the Sentence Transformer model. We pass it `model`, `train_dataset`, `eval_dataset` and `loss`; every other setting is left at its default.
- **`train()`**: This method starts the training process.

Training may take a couple of hours; the run shown below took roughly two and a half. You will see a progress bar, and the **Step** and **Training Loss** columns are updated periodically until training is complete.

In [8]:
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.9699 |
| 1000 | 0.816  |
| 1500 | 0.6876 |
| 2000 | 0.602  |
| 2500 | 0.5693 |
| 3000 | 0.535  |
| 3500 | 0.517  |
| 4000 | 0.5586 |
| 4500 | 0.4695 |
| 5000 | 0.4818 |



TrainOutput(global_step=37500, training_loss=0.2865847708129883, metrics={'train_runtime': 8897.3008, 'train_samples_per_second': 33.718, 'train_steps_per_second': 4.215, 'total_flos': 0.0, 'train_loss': 0.2865847708129883, 'epoch': 3.0})

**Step** refers to the current iteration of the training loop. It is an indicator of how many batches have been processed. If you see "Step 10", it means the trainer has processed 10 batches of data. Each step usually corresponds to one forward and backward pass through the model with one batch of data.

**Training Loss** is a measure of how well the model is performing on the training data at each step. The loss function calculates the difference between the model's predictions and the actual target values. The objective of training is to minimize this loss.

As we can see, the training loss trends downward as training progresses, with small upticks along the way (for example at step 4000). This is a good sign!
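The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes a batch size of 8 on a single device; the trainer call above does not set a batch size, so this value is inferred from the reported step count rather than stated in the notebook.

```python
import math

num_examples = 100_000        # size of train_dataset
num_epochs = 3                # 'epoch': 3.0 in TrainOutput
batch_size = 8                # ASSUMPTION: inferred from the step count, single device
train_runtime_s = 8897.3008   # 'train_runtime' in TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s
hours = train_runtime_s / 3600

print(total_steps)                   # 37500, matching global_step
print(round(samples_per_second, 3))  # compare with 'train_samples_per_second'
print(round(steps_per_second, 3))    # compare with 'train_steps_per_second'
print(round(hours, 2))               # wall-clock training time in hours
```

Under that assumption, each epoch is 12,500 steps, so the ten logged rows (up to step 5000) cover less than half of the first epoch.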

### **6. Evaluating the Trained Model**

After fine-tuning, we can evaluate the model using a `TripletEvaluator` to assess its performance on the test set. This will tell us how well our fine-tuning has done.

Let’s first evaluate our *initial* pre-trained model before any fine-tuning.

Here, we define the default base model we had to begin with as `old_model`. We set up a `test_evaluator` which will assess the model’s performance on the `test_dataset`.

This will provide accuracy metrics for our model using different distance measures. If you’re unfamiliar with distance metrics, we cover them [in this article](https://www.marqo.ai/course/introduction-to-vector-embeddings).

In [9]:
old_model = SentenceTransformer(
    "microsoft/mpnet-base",
)

test_evaluator = TripletEvaluator(
    anchors=test_dataset["anchor"],
    positives=test_dataset["positive"],
    negatives=test_dataset["negative"],
    name="all-nli-test",
)
test_evaluator(old_model)

Some weights of MPNetModel were not initialized from the model checkpoint at microsoft/mpnet-base and are newly initialized: ['mpnet.pooler.dense.bias', 'mpnet.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'all-nli-test_cosine_accuracy': 0.6594038432440611,
 'all-nli-test_dot_accuracy': 0.452262066878499,
 'all-nli-test_manhattan_accuracy': 0.7147828718414283,
 'all-nli-test_euclidean_accuracy': 0.6562263579966712,
 'all-nli-test_max_accuracy': 0.7147828718414283}

Each key in the output names the similarity measure used. Each value is the fraction of test triplets for which the model placed the positive example closer to the anchor than the negative example under that measure; `max_accuracy` is simply the highest of the four.

For example,
`'all-nli-test_cosine_accuracy': 0.6594038432440611,`
says that the model correctly identified the positive example over the negative example 65.94% of the time using cosine similarity.
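The accuracy figure can be reproduced in miniature. The following is a sketch of the idea rather than the `TripletEvaluator` implementation: `triplet_accuracy` is a hypothetical helper, and the 2-D vectors are made-up stand-ins for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(anchors, positives, negatives, similarity=cosine):
    """Fraction of triplets where the positive is more similar to the anchor."""
    correct = sum(
        similarity(a, p) > similarity(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)


# ASSUMPTION: toy 2-D "embeddings" chosen by hand for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
positives = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6], [0.1, 0.9]]
negatives = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.5], [0.8, 0.2]]

accuracy = triplet_accuracy(anchors, positives, negatives)
print(accuracy)  # 0.75: the last triplet ranks its negative above its positive
```

Swapping `similarity` for another measure gives the other accuracy variants; a distance such as Euclidean would need the comparison flipped, since smaller distances mean closer vectors.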

These baseline accuracies are mediocre: between roughly 45% and 71% depending on the measure, where random guessing would score about 50%. There is plenty of room for improvement, so let’s see if our fine-tuned model does better.

We set up the evaluator again, this time running it on our fine-tuned model.

In [10]:
test_evaluator = TripletEvaluator(
    anchors=test_dataset["anchor"],
    positives=test_dataset["positive"],
    negatives=test_dataset["negative"],
    name="all-nli-test",
)
test_evaluator(model)

{'all-nli-test_cosine_accuracy': 0.908760780753518,
 'all-nli-test_dot_accuracy': 0.08987743985474353,
 'all-nli-test_manhattan_accuracy': 0.9067937660765623,
 'all-nli-test_euclidean_accuracy': 0.9073990013617794,
 'all-nli-test_max_accuracy': 0.908760780753518}

Accuracy is now high for cosine similarity and for Manhattan and Euclidean distance (about 0.91 each, up from between 0.66 and 0.71 before fine-tuning), which indicates that our model distinguishes well between positive and negative examples in the triplet evaluation. Awesome news! Dot-product accuracy, however, fell from 0.45 to 0.09, which suggests that the dot product is not an effective measure for this model and can be disregarded when evaluating it.
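One plausible reason for the dot-product result, which this notebook does not verify, is that the dot product depends on vector length as well as direction, so embeddings with unconstrained norms can rank differently under it than under cosine similarity. The toy vectors below are invented to show that effect and say nothing about the actual embeddings.

```python
import math


def dot(u, v):
    """Plain dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity: the dot product after normalising both lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# ASSUMPTION: hand-picked toy vectors, not real sentence embeddings.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # nearly the same direction as the anchor, small norm
negative = [3.0, 4.0]   # a different direction, but a much larger norm (5.0)

cosine_prefers_positive = cosine(anchor, positive) > cosine(anchor, negative)
dot_prefers_positive = dot(anchor, positive) > dot(anchor, negative)

print(cosine_prefers_positive)  # True
print(dot_prefers_positive)     # False: the long negative vector wins on dot product
```

If dot-product search matters for your application, normalising the embeddings to unit length makes the dot product equal to cosine similarity.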

We can see very clearly that our fine-tuning has significantly improved the overall performance of our model. Pretty cool!

Now, it’s not always the case that the more you fine-tune your model, the better it will become. There are several quantities that affect the overall performance of your model. Head back to our article to take a look at what these are.