# **Fine-Tuning with _Sentence Transformers_ Library**

In this report, I will:
1. load the pre-trained sentence embedder: **all-MiniLM-L6-v2**
2. pick a dataset to fine-tune the embedder: **stsb_multi_mt** in English
3. choose a loss function (**CosineSimilarityLoss**) and training parameters
4. train the embedder
5. save the model and upload it to my Hugging Face repository

## 0. Installing Libraries and Importing Dependencies


In [1]:
%%capture

!pip install --upgrade datasets sentence_transformers huggingface_hub

In [25]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

In [2]:
from datasets import load_dataset, Dataset
from transformers.utils.logging import disable_progress_bar
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from transformers import logging
from sentence_transformers.losses import CosineSimilarityLoss, CoSENTLoss
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction
from torch.utils.data import IterableDataset, DataLoader
from sentence_transformers import InputExample

disable_progress_bar()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

## 1. Loading Base Model

**all-MiniLM-L6-v2**


> Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector which captures the semantic information. The sentence vector may be used for information retrieval, clustering or sentence similarity tasks.

In [3]:
# Importing model
model = SentenceTransformer('all-MiniLM-L6-v2')
model.to(device)

# Testing on a few made up sentences
query = ['I am in love']
docs = ['I love walking', 'I love someone', 'I like spaghetti']

query_emb = model.encode(query)
docs_emb  = model.encode(docs)

similarities = model.similarity(query_emb, docs_emb)
for i, sim in enumerate(similarities[0]):
  print(f'{query} -> {docs[i]}: {sim}')

['I am in love'] -> I love walking: 0.2516159415245056
['I am in love'] -> I love someone: 0.6500446200370789
['I am in love'] -> I like spaghetti: 0.24325968325138092


In [12]:
# Count # trainable parameters
n_train_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"# Trainable Parameters: {n_train_params}")

# Trainable Parameters: 22713216


## 2. Dataset: STS



> This dataset provides pairs of sentences and a score of their similarity. Available languages are: *de, en, es, fr, it, nl, pl, pt, ru, zh*

Examples, one per score level:

*   **score 5**: The bird is bathing in the sink. **||** Birdie is washing itself in the water basin.
*   **score 4**: Two boys on a couch are playing video games. **||** Two boys are playing a video game.
*   **score 3**: John said he is considered a witness but not a suspect. **||** “He is not a suspect anymore.” John said.
*   **score 2**: They flew out of the nest in groups. **||** They flew into the nest together.
*   **score 1**: The woman is playing the violin. **||** The young lady enjoys listening to the guitar.
*   **score 0**: The black dog is running through the snow. **||** A race car driver is driving his car through the mud.

In [4]:
# Loading the training, development and test sets.
# For each split: rename similarity_score to score,
# then normalize scores from [0,5] to [0,1].

train = load_dataset("stsb_multi_mt", name="en", split="train")
train = train.rename_column("similarity_score", "score")
train = train.map(lambda row: {'score': row['score']/5})
print(f"size train: {len(train)}")

dev = load_dataset("stsb_multi_mt", name="en", split="dev")
dev = dev.rename_column("similarity_score", "score")
dev = dev.map(lambda row: {'score': row['score']/5})
print(f"size dev: {len(dev)}")

test = load_dataset("stsb_multi_mt", name="en", split="test")
test = test.rename_column("similarity_score", "score")
test = test.map(lambda row: {'score': row['score']/5})
print(f"size test: {len(test)}")

Map:   0%|          | 0/5749 [00:00<?, ? examples/s]

size train: 5749


Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

size dev: 1500


Map:   0%|          | 0/1379 [00:00<?, ? examples/s]

size test: 1379
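
The normalization step can be illustrated without downloading anything. This is a minimal sketch on plain dictionaries: the helper name `normalize_row` and its `max_score` parameter are my own, introduced only for illustration, and the rows reuse example sentences from the list above.

```python
def normalize_row(row, max_score=5.0):
    """Return a copy of the row with its score rescaled from [0, max_score] to [0, 1]."""
    return {**row, "score": row["score"] / max_score}

# A few rows shaped like the renamed dataset (sentence1, sentence2, score).
rows = [
    {"sentence1": "The bird is bathing in the sink.",
     "sentence2": "Birdie is washing itself in the water basin.",
     "score": 5.0},
    {"sentence1": "They flew out of the nest in groups.",
     "sentence2": "They flew into the nest together.",
     "score": 2.0},
    {"sentence1": "The black dog is running through the snow.",
     "sentence2": "A race car driver is driving his car through the mud.",
     "score": 0.0},
]

normalized = [normalize_row(row) for row in rows]
for row in normalized:
    print(f"{row['score']:.1f}  {row['sentence1']}")
```

Rescaling matters because the loss chosen later compares the label directly against a cosine similarity, so labels in [0, 5] would be on the wrong scale.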


## 3. Building a Test Evaluator (Before Training)

For each pair (sentence1, sentence2) and its score in the test set, *EmbeddingSimilarityEvaluator* encodes both sentences with the model, computes the cosine similarity between the two embeddings, and then measures the correlation between these predicted similarities and the true scores. Below I report the Pearson correlation (`pearson_cosine`).

In [5]:
# Testing Set
test_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test["sentence1"],
    sentences2=test["sentence2"],
    scores=test["score"],
)

test_evaluator(model)['pearson_cosine']

np.float64(0.8274064242198307)
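
To make the evaluation above concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helpers `cosine_similarity` and `pearson`, the 2-D toy embeddings, and the gold scores are all made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy embeddings standing in for model.encode(...) outputs.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly the same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold_scores = [0.9, 0.5, 0.1]  # made-up normalized labels

predicted = [cosine_similarity(u, v) for u, v in pairs]
correlation = pearson(predicted, gold_scores)
print(f"predicted similarities: {[round(p, 4) for p in predicted]}")
print(f"Pearson correlation: {correlation:.4f}")
```

A correlation close to 1 means the model ranks and spaces the pairs much like the human annotators did, even if the raw similarity values differ from the labels.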

## 4. Fine-Tuning Using the Training Set

We first pick a loss function: **CosineSimilarityLoss**


> CosineSimilarityLoss expects that the InputExamples consists of two texts and a float label. It computes the vectors u = model(sentence_A) and v = model(sentence_B) and measures the cosine-similarity between the two. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||$_2$.

Then we pick the training arguments, such as how many training steps to run between full evaluations on the development set (using an evaluator built the same way as the one in Section 3), and train on the full training set.
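
The loss described above can be sketched in plain Python. This assumes the default configuration as I understand it (a mean squared error with an identity score transformation); the function names `cosine_similarity` and `cosine_similarity_loss` and the toy vectors are illustrative, not the library's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(embedding_pairs, labels):
    """Mean squared error between predicted cosine similarities and gold labels."""
    squared_errors = [
        (cosine_similarity(u, v) - label) ** 2
        for (u, v), label in zip(embedding_pairs, labels)
    ]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical embeddings: one identical pair, one orthogonal pair.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # cosine similarity 1
    ([1.0, 0.0], [0.0, 1.0]),  # cosine similarity 0
]

perfect_loss = cosine_similarity_loss(embedding_pairs, [1.0, 0.0])
poor_loss = cosine_similarity_loss(embedding_pairs, [0.5, 0.5])
print(f"labels match similarities: {perfect_loss:.4f}")
print(f"labels off by 0.5 each:    {poor_loss:.4f}")
```

Training lowers this quantity by moving the embeddings of each pair so that their cosine similarity approaches the normalized human score.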

In [8]:
# Development Set
dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=dev["sentence1"],
    sentences2=dev["sentence2"],
    scores=dev["score"],
)

# Loss Function
loss = CosineSimilarityLoss(model=model)

# Training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="models/finetuned",
    report_to="none",
    eval_strategy="steps",
    learning_rate=1e-4,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    eval_steps=100,
    logging_steps=100
)

# Defining trainer
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train,
    eval_dataset=dev,
    loss=loss,
    args=args,
    evaluator=dev_evaluator
)

# Training
trainer.train()

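| Step | Training Loss | Validation Loss | Pearson Cosine | Spearman Cosine |
|------|---------------|-----------------|----------------|-----------------|
| 100  | 0.0108        | 0.026036        | 0.878463       | 0.877722        |
| 200  | 0.0091        | 0.024125        | 0.883701       | 0.882815        |
| 300  | 0.0167        | 0.021456        | 0.887633       | 0.886178        |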


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=360, training_loss=0.013861745595932007, metrics={'train_runtime': 169.8873, 'train_samples_per_second': 33.84, 'train_steps_per_second': 2.119, 'total_flos': 0.0, 'train_loss': 0.013861745595932007, 'epoch': 1.0})
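
The step counts in the training output can be checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the variable names are mine.

```python
import math

train_size = 5749   # size of the training split
batch_size = 16     # per_device_train_batch_size
num_epochs = 1      # num_train_epochs
eval_steps = 100    # evaluate on the dev set every 100 steps

# One optimizer step per batch; the final batch holds the 5 leftover examples.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a development-set evaluation is triggered.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(f"total steps: {total_steps}")
print(f"evaluations at steps: {eval_points}")
```

Under these assumptions the total agrees with `global_step=360`, and the evaluation points agree with the three logged rows at steps 100, 200 and 300.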

## 5. Building a Test Evaluator (After Training)

Running the same test evaluator again, now on the fine-tuned model, to compare against the score obtained before training.

In [9]:
test_evaluator(model)['pearson_cosine']

np.float64(0.8463898775235923)

## 6. Using Toy Examples on the Fine-Tuned Model

In [10]:
# Testing on a few made up sentences
query = ['I am in love']
docs = ['I love walking', 'I love someone', 'I like spaghetti']

query_emb = model.encode(query)
docs_emb  = model.encode(docs)

similarities = model.similarity(query_emb, docs_emb)
for i, sim in enumerate(similarities[0]):
  print(f'{query} -> {docs[i]}: {sim}')

['I am in love'] -> I love walking: 0.20138660073280334
['I am in love'] -> I love someone: 0.7428659200668335
['I am in love'] -> I like spaghetti: 0.17797380685806274
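
To compare these numbers with the ones from Section 1, the short sketch below computes the change per document. The values are copied from the two outputs and rounded to four decimals; three hand-picked sentences are only an anecdote, not an evaluation.

```python
# Similarities to the query 'I am in love', rounded from the cell outputs.
before = {
    "I love walking": 0.2516,
    "I love someone": 0.6500,
    "I like spaghetti": 0.2433,
}
after = {
    "I love walking": 0.2014,
    "I love someone": 0.7429,
    "I like spaghetti": 0.1780,
}

# Change in similarity for each document after fine-tuning.
deltas = {doc: after[doc] - before[doc] for doc in before}

for doc, delta in deltas.items():
    print(f"{doc!r}: {before[doc]:.4f} -> {after[doc]:.4f} ({delta:+.4f})")
```

The closest paraphrase moves up while the two unrelated sentences move down, which is the direction one would hope for after training on similarity scores.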


## 7. Saving and Uploading the Model to a Hugging Face Repository

In [32]:
# Only the names used below are imported
from huggingface_hub import HfApi, upload_folder

repo_id = "perticarari/fine-tuned-sts-embedder"

model.save("fine-tuned-sts-embedder")

api = HfApi()
# exist_ok=True avoids an error when the cell is re-run and the repo already exists
api.create_repo(repo_id=repo_id, private=False, exist_ok=True)

upload_folder(
    folder_path="fine-tuned-sts-embedder",  # Path to the folder containing model files
    repo_id=repo_id,                      # Repository name
    commit_message="Initial commit"       # Commit message
)