<a href="https://colab.research.google.com/github/ouma09/GenAi/blob/main/train_embeddinsg_matryoshka_768_64_NLI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a Matryoshka Embedding Model 🪆

This notebook combines `MultipleNegativesRankingLoss` with `MatryoshkaLoss` to train an embedding model whose vectors stay useful when truncated to any of the output dimensions `[768, 512, 256, 128, 64]`. Training uses a Natural Language Inference dataset (`AllNLI` in this case).
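To make the idea of "output dimensions" concrete, here is a minimal sketch of what using a Matryoshka embedding at a smaller size amounts to: keep the first `dim` values and rescale to unit length before comparing. The toy 8-dimensional vectors and the helper names `truncate_and_normalize` and `cosine` are illustrative assumptions, not part of the notebook or of Sentence-Transformers.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy 8-dimensional vectors standing in for 768-dimensional model outputs.
a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
b = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.05]

# Compare the same pair at progressively smaller dimensions.
for dim in (8, 4, 2):
    ta = truncate_and_normalize(a, dim)
    tb = truncate_and_normalize(b, dim)
    print(dim, round(cosine(ta, tb), 4))
```

A model trained without a Matryoshka objective gives no guarantee that the leading values carry most of the signal, which is what the training below is meant to encourage.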



> Colab by: [mrm8488](https://twitter.com/mrm8488), adapted from a [Sentence-Transformers](https://www.sbert.net/examples) example script

In [None]:
! nvidia-smi

Sun Jun  2 12:19:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Install required dependencies 📦

In [None]:
! pip install -q sentence-transformers datasets "accelerate>=0.21.0" wandb

### Imports

In [None]:
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SequentialEvaluator, SimilarityFunction
from sentence_transformers.training_args import BatchSamplers

### Set main variables ⚙️

In [None]:
model_name = "distilroberta-base" # Choose the model you want
batch_size = 128  # Larger batches usually give better results, but they require more GPU memory
num_train_epochs = 1
matryoshka_dims = [768, 512, 256, 128, 64]

In [None]:
# Save path of the model
output_dir = f"output/matryoshka_nli_{model_name.replace('/', '-')}_{batch_size}_bs_{num_train_epochs}_e"

In [None]:
# 1. Here we define our SentenceTransformer model. If not already a Sentence Transformer model, it will automatically
# create one with "mean" pooling.
model = SentenceTransformer(model_name)
# If we want, we can limit the maximum sequence length for the model
# model.max_seq_length = 75



### Load the Dataset 📚

In [None]:
# 2. Load the AllNLI dataset: https://huggingface.co/datasets/sentence-transformers/all-nli
train_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="train")
eval_dataset = load_dataset("sentence-transformers/all-nli", "triplet", split="dev")

In [None]:
train_dataset, train_dataset[0]

(Dataset({
     features: ['anchor', 'positive', 'negative'],
     num_rows: 557850
 }),
 {'anchor': 'A person on a horse jumps over a broken down airplane.',
  'positive': 'A person is outdoors, on a horse.',
  'negative': 'A person is at a diner, ordering an omelette.'})

#### (Optional) Training on the entire dataset can take a long time, so for demonstration purposes we use only a small subset.



In [None]:
MAX_EXAMPLES = 10000
train_dataset = train_dataset.shuffle(seed=21).select(range(MAX_EXAMPLES))

### Define our training loss functions 📉

In [None]:
inner_train_loss = losses.MultipleNegativesRankingLoss(model)
train_loss = losses.MatryoshkaLoss(model, inner_train_loss, matryoshka_dims=matryoshka_dims)
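As a rough mental model of how the two losses combine, the sketch below applies an in-batch-negatives ranking loss to the embeddings truncated at each dimension and sums the results. It is a simplified pure-Python illustration, not the library implementation: it ignores the explicit `negative` column, and the similarity scale of 20 and the equal per-dimension weights are assumptions here.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Cross-entropy where anchor i must rank positive i above all other positives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights=None):
    """Weighted sum of the inner loss over embeddings truncated to each dimension."""
    weights = weights or [1.0] * len(dims)
    return sum(
        w * in_batch_negatives_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d, w in zip(dims, weights)
    )


# Toy batch: three (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.2, 0.0, 0.1], [0.1, 1.0, 0.3, 0.0], [0.0, 0.2, 1.0, 0.4]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.2, 0.8, 0.2, 0.1], [0.1, 0.3, 0.9, 0.5]]

print(matryoshka_loss(anchors, positives, dims=[4, 2]))
```

Because every other positive in the batch acts as a negative for a given anchor, a duplicated sentence in the same batch would be penalized as a wrong answer even though it is correct.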

### Define evaluators to track alongside the evaluation loss

In [None]:
stsb_eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
evaluators = []
for dim in matryoshka_dims:
    evaluators.append(
        EmbeddingSimilarityEvaluator(
            sentences1=stsb_eval_dataset["sentence1"],
            sentences2=stsb_eval_dataset["sentence2"],
            scores=stsb_eval_dataset["score"],
            main_similarity=SimilarityFunction.COSINE,
            name=f"sts-dev-{dim}",
            truncate_dim=dim,
        )
    )

In [None]:
dev_evaluator = SequentialEvaluator(evaluators, main_score_function=lambda scores: scores[0])

### Define the training args ⚙️

In [None]:
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir=output_dir,
    # Optional training parameters:
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=30,
    save_strategy="steps",
    save_steps=30,
    save_total_limit=2,
    logging_steps=30,
    run_name="matryoshka-nli_128_bs_1e",  # Will be used in W&B if `wandb` is installed
)
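A quick back-of-the-envelope check shows what these settings imply for the 10,000-example subset. The sketch assumes a single GPU, no gradient accumulation, that the final partial batch is kept, and that the warmup step count is rounded up; none of these are stated in the notebook.

```python
import math

num_examples = 10_000   # MAX_EXAMPLES from the subsampling cell
batch_size = 128
num_train_epochs = 1
warmup_ratio = 0.1
eval_steps = 30

# One optimizer step per batch; the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Learning-rate warmup covers the first 10% of training.
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Steps at which evaluation (and checkpointing, since save_steps matches) happens.
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(total_steps, warmup_steps, eval_points)  # 79 8 [30, 60]
```

This agrees with the training log in this notebook, which reports `global_step=79` and evaluation rows at steps 30 and 60.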

### Create the Trainer and run it 🏋️‍♀️

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=train_loss,
    evaluator=dev_evaluator,
)

In [None]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mmrm8488[0m. Use [1m`wandb login --relogin`[0m to force relogin


Step,Training Loss,Validation Loss,Sts-dev-768 Pearson Cosine,Sts-dev-768 Spearman Cosine,Sts-dev-768 Pearson Manhattan,Sts-dev-768 Spearman Manhattan,Sts-dev-768 Pearson Euclidean,Sts-dev-768 Spearman Euclidean,Sts-dev-768 Pearson Dot,Sts-dev-768 Spearman Dot,Sts-dev-768 Pearson Max,Sts-dev-768 Spearman Max,Sts-dev-512 Pearson Cosine,Sts-dev-512 Spearman Cosine,Sts-dev-512 Pearson Manhattan,Sts-dev-512 Spearman Manhattan,Sts-dev-512 Pearson Euclidean,Sts-dev-512 Spearman Euclidean,Sts-dev-512 Pearson Dot,Sts-dev-512 Spearman Dot,Sts-dev-512 Pearson Max,Sts-dev-512 Spearman Max,Sts-dev-256 Pearson Cosine,Sts-dev-256 Spearman Cosine,Sts-dev-256 Pearson Manhattan,Sts-dev-256 Spearman Manhattan,Sts-dev-256 Pearson Euclidean,Sts-dev-256 Spearman Euclidean,Sts-dev-256 Pearson Dot,Sts-dev-256 Spearman Dot,Sts-dev-256 Pearson Max,Sts-dev-256 Spearman Max,Sts-dev-128 Pearson Cosine,Sts-dev-128 Spearman Cosine,Sts-dev-128 Pearson Manhattan,Sts-dev-128 Spearman Manhattan,Sts-dev-128 Pearson Euclidean,Sts-dev-128 Spearman Euclidean,Sts-dev-128 Pearson Dot,Sts-dev-128 Spearman Dot,Sts-dev-128 Pearson Max,Sts-dev-128 Spearman Max,Sts-dev-64 Pearson Cosine,Sts-dev-64 Spearman Cosine,Sts-dev-64 Pearson Manhattan,Sts-dev-64 Spearman Manhattan,Sts-dev-64 Pearson Euclidean,Sts-dev-64 Spearman Euclidean,Sts-dev-64 Pearson Dot,Sts-dev-64 Spearman Dot,Sts-dev-64 Pearson Max,Sts-dev-64 Spearman Max,Sequential Score
30,15.8875,6.108927,0.799079,0.807643,0.798068,0.797434,0.799267,0.79851,0.568376,0.585174,0.799267,0.807643,0.809987,0.814291,0.79835,0.797495,0.800138,0.798993,0.653782,0.667895,0.809987,0.814291,0.806137,0.812251,0.79596,0.795933,0.797646,0.797016,0.644407,0.66393,0.806137,0.812251,0.791715,0.803576,0.790396,0.791495,0.789726,0.790735,0.614015,0.629194,0.791715,0.803576,0.785613,0.801044,0.77937,0.783728,0.778577,0.783513,0.586909,0.607534,0.785613,0.801044,0.799079
60,7.4874,5.018856,0.817021,0.825594,0.808521,0.809334,0.809014,0.809879,0.578794,0.605063,0.817021,0.825594,0.821716,0.82773,0.808582,0.809043,0.809644,0.809735,0.636027,0.659483,0.821716,0.82773,0.817895,0.825711,0.807215,0.808102,0.807561,0.808229,0.6316,0.653106,0.817895,0.825711,0.808366,0.819529,0.80171,0.804404,0.800453,0.803275,0.605499,0.6324,0.808366,0.819529,0.796901,0.813815,0.790738,0.795324,0.788661,0.793537,0.536989,0.55289,0.796901,0.813815,0.817021


Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]


TrainOutput(global_step=79, training_loss=10.388897183575208, metrics={'train_runtime': 93.2846, 'train_samples_per_second': 107.199, 'train_steps_per_second': 0.847, 'total_flos': 0.0, 'train_loss': 10.388897183575208, 'epoch': 1.0})

### Evaluate on the STS Benchmark test dataset 🧪

In [None]:
test_dataset = load_dataset("sentence-transformers/stsb", split="test")
evaluators = []
for dim in matryoshka_dims:
    evaluators.append(
        EmbeddingSimilarityEvaluator(
            sentences1=test_dataset["sentence1"],
            sentences2=test_dataset["sentence2"],
            scores=test_dataset["score"],
            main_similarity=SimilarityFunction.COSINE,
            name=f"sts-test-{dim}",
            truncate_dim=dim,
        )
    )

In [None]:
test_evaluator = SequentialEvaluator(evaluators)

In [None]:
test_evaluator(model)

{'sts-test-768_pearson_cosine': 0.7830238745430493,
 'sts-test-768_spearman_cosine': 0.7773358019943875,
 'sts-test-768_pearson_manhattan': 0.7760333176930047,
 'sts-test-768_spearman_manhattan': 0.7571481372933749,
 'sts-test-768_pearson_euclidean': 0.776789479061736,
 'sts-test-768_spearman_euclidean': 0.7576814286884955,
 'sts-test-768_pearson_dot': 0.5696962552851287,
 'sts-test-768_spearman_dot': 0.5537713996518868,
 'sts-test-768_pearson_max': 0.7830238745430493,
 'sts-test-768_spearman_max': 0.7773358019943875,
 'sts-test-512_pearson_cosine': 0.7907935653551746,
 'sts-test-512_spearman_cosine': 0.7782713191251893,
 'sts-test-512_pearson_manhattan': 0.7764478901183614,
 'sts-test-512_spearman_manhattan': 0.7566433974497622,
 'sts-test-512_pearson_euclidean': 0.7782923050865409,
 'sts-test-512_spearman_euclidean': 0.7586940810248578,
 'sts-test-512_pearson_dot': 0.6258176622518048,
 'sts-test-512_spearman_dot': 0.6181814350373276,
 'sts-test-512_pearson_max': 0.7907935653551746,
 

### Save the model locally

In [None]:
final_output_dir = f"{output_dir}/final"
model.save(final_output_dir)

Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]
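Once saved, the model can be reloaded so that it returns shorter vectors directly. The sketch below relies on the `truncate_dim` argument of `SentenceTransformer`; the helper name `encode_truncated` is hypothetical, and the call in the trailing comment assumes the `final_output_dir` defined in the cell above.

```python
def encode_truncated(model_path, sentences, dim):
    """Load a saved model and encode sentences into `dim`-dimensional vectors."""
    # Imported inside the function so the helper can be defined before the
    # library is loaded; in a notebook a top-level import works just as well.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_path, truncate_dim=dim)
    # normalize_embeddings=True returns unit-length vectors, so a dot product
    # between two of them equals their cosine similarity.
    return model.encode(sentences, normalize_embeddings=True)


# Example call (not executed here):
# embeddings = encode_truncated(final_output_dir, ["A person is outdoors, on a horse."], dim=64)
# embeddings.shape would then be (1, 64)
```

Choosing one of the dimensions listed in `matryoshka_dims` is the safest option, since those are the sizes the loss was applied to during training.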

### Push to the Hugging Face Hub 🤗
You may need a token. Get one here: https://huggingface.co/settings/tokens

In [None]:
model.push_to_hub(f"{model_name}-nli-matryoshka", token="<your_token>")