**Description**: Sigmoid loss for text similarity training. Inspired by
[SigLIP](https://arxiv.org/abs/2303.15343). MPNRL = Multiple Positives and Negatives
Ranking Loss.

**Usage**: run on a T4 GPU.

Modified from this SentenceTransformers
[script](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/matryoshka/matryoshka_nli.py).

In [None]:
%pip install git+https://github.com/kddubey/mpnrl.git

In [None]:
import os

os.environ["WANDB_PROJECT"] = "mpnrl"
os.environ["WANDB_LOG_MODEL"] = "false"

In [None]:
!wandb login

In [1]:
from datetime import datetime

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    SimilarityFunction,
)
from sentence_transformers.training_args import BatchSamplers
import torch

from mpnrl.collator import group_positives_by_anchor, MPNRLDataCollator
from mpnrl.loss import MultiplePositivesNegativesRankingLoss

In [2]:
USE_CUSTOM = True

# Load model and data

In [3]:
model_name = "distilroberta-base"

batch_size = 128 if torch.cuda.is_available() else 8
num_train_epochs = 1

# Save path of the model
output_dir = f"output/sigltt_nli_{model_name.replace('/', '-')}-{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"

In [4]:
# 1. Here we define our SentenceTransformer model. If not already a Sentence Transformer model, it will automatically
# create one with "mean" pooling.
model = SentenceTransformer(model_name)
# If we want, we can limit the maximum sequence length for the model
# model.max_seq_length = 75

No sentence-transformers model found with name distilroberta-base. Creating a new one with mean pooling.


In [44]:
# 2. Load the AllNLI dataset:
#    https://huggingface.co/datasets/sentence-transformers/all-nli
dataset_name = "sentence-transformers/all-nli"
train_dataset = load_dataset(dataset_name, "triplet", split="train")

# If you wish, you can limit the number of training samples
train_dataset = train_dataset.select(range(10_000))

In [6]:
len(set(train_dataset["anchor"]))

356

In [7]:
len(set(train_dataset["positive"]))

1860

In [8]:
len(set(train_dataset["negative"]))

820

Whoa! Pretty sure this is the unfolded version of an `anchor: {positives: ...,
negatives: ...}` dataset. We could fold the data back into that format, but there's a
huge skew in the number of positives and negatives per anchor. If we plugged the folded
data in as is for training, the large anchors would need way more memory than the small
ones, which is inefficient. So we need to change how we feed/batch the data. There are a
few ways to do it.
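
One quick way to look at that skew is to count rows per anchor in the unfolded data. A
minimal sketch on made-up records (the `toy_triplets` list is hypothetical and only
mimics the anchor/positive/negative layout):

```python
from collections import Counter

# Hypothetical toy records in the same layout as the triplet dataset.
toy_triplets = [
    {"anchor": "a1", "positive": "p1", "negative": "n1"},
    {"anchor": "a1", "positive": "p2", "negative": "n1"},
    {"anchor": "a1", "positive": "p3", "negative": "n2"},
    {"anchor": "a2", "positive": "p4", "negative": "n3"},
]

# Number of unfolded rows contributed by each anchor.
rows_per_anchor = Counter(record["anchor"] for record in toy_triplets)
largest = max(rows_per_anchor.values())
smallest = min(rows_per_anchor.values())

print(rows_per_anchor.most_common())
print(f"largest / smallest anchor: {largest / smallest:.1f}x")
```

The same two lines should work on the real data by iterating over
`train_dataset["anchor"]` instead of the toy list.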

I wanna explore the simplest one I can think of: feed the unfolded data via default
(simple random) batch sampling, and group anchors in the data collator to create a batch
of training data. One wrinkle is that the labels need to account for every positive of an
anchor that lands in the batch.

The goal is to get more stable batches than MNRL + NoDuplicatesBatchSampler, i.e., to
avoid batch size decay.

# Sketch data collator

We'll need the full set of positives for each anchor so that we don't have false
negatives in the `labels` matrix.
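
A rough idea of what such a grouping could look like, written from scratch. This is a
guess at the behavior, not the actual `mpnrl.collator.group_positives_by_anchor`
implementation; `group_positives_sketch` and the toy records are hypothetical.

```python
from collections import defaultdict


def group_positives_sketch(records):
    """Map each anchor text to the set of all of its positive texts."""
    anchor_to_positives = defaultdict(set)
    for record in records:
        anchor_to_positives[record["anchor"]].add(record["positive"])
    return dict(anchor_to_positives)


# Hypothetical records: the first anchor has two positives.
toy_records = [
    {"anchor": "The man is standing.", "positive": "A man stands by a rack."},
    {"anchor": "The man is standing.", "positive": "A guy stands and smiles."},
    {"anchor": "A dog runs.", "positive": "A puppy is running outside."},
]
toy_anchor_to_positives = group_positives_sketch(toy_records)
print(len(toy_anchor_to_positives))  # one entry per unique anchor
```

In this sketch the values are sets, so a membership check like
`positive in toy_anchor_to_positives[anchor]` takes constant time on average.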

In [10]:
anchor_to_positives = group_positives_by_anchor(train_dataset)
len(anchor_to_positives)

356

Now we'll figure out how data collation will work. Assume the sampler gave us batch
indices. (I specifically chose these batch indices b/c 1 of the anchors has multiple
positives.)

In [11]:
# batch_indices = random.sample(range(len(train_dataset)), k=4)
batch_indices = [2679, 7675, 7221, 7768]

batch_indices

[2679, 7675, 7221, 7768]

In [12]:
batch = [train_dataset[idx] for idx in batch_indices]
batch

[{'anchor': 'A group of people are outside.',
  'positive': 'People walking down path below palm trees.',
  'negative': 'A group of people are gathered at some type of outdoor event.'},
 {'anchor': 'The man is standing.',
  'positive': 'A man is standing next to a rack with hats hanging on it.',
  'negative': 'A man laying on a sidewalk.'},
 {'anchor': 'The man is playing basketball.',
  'positive': 'A man in a basketball game is shooting the ball.',
  'negative': 'A swimmer swimming butterfly in a pool'},
 {'anchor': 'The man is standing.',
  'positive': 'Guy in blue jacket and hat standing in front of some type of machinery while starring up with a big smile.',
  'negative': 'A man sitting on a newspaper dispenser, behind a sign.'}]

In [13]:
# Using a dict for deterministic (insertion) order.
anchors = {record["anchor"]: None for record in batch}
positives = {record["positive"]: None for record in batch}
negatives = {record["negative"]: None for record in batch}

In [14]:
anchors

{'A group of people are outside.': None,
 'The man is standing.': None,
 'The man is playing basketball.': None}

`'The man is standing.'` appears twice in this batch, so deduplicating leaves 3 anchors
for 4 records. That's fine.

In [15]:
a = len(anchors)
p = len(positives)
n = len(negatives)

# TODO: this is often a really sparse matrix. I don't think PyTorch supports sparse
# label matrices. Is there something else we can do?
labels = torch.zeros((a, p + n))
# Everything from the p'th column onward (the negatives) stays 0. In the first p columns,
# labels[i, j] is 1 iff positive j belongs to anchor i, so a row can hold several 1s.
#
# TODO: this labeling scales quadratically! Pretty damning operation for batch sizes >
# 10_000. I don't think that's worrisome b/c in reality we chunk and accumulate at that
# scale (b/c the similarity matrix will be too large).
for i, anchor in enumerate(anchors):
    for j, positive in enumerate(positives):
        labels[i, j] = positive in anchor_to_positives[anchor]

In [16]:
labels

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 1., 0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0.]])

In [17]:
positives

{'People walking down path below palm trees.': None,
 'A man is standing next to a rack with hats hanging on it.': None,
 'A man in a basketball game is shooting the ball.': None,
 'Guy in blue jacket and hat standing in front of some type of machinery while starring up with a big smile.': None}

In [18]:
anchors

{'A group of people are outside.': None,
 'The man is standing.': None,
 'The man is playing basketball.': None}

In [19]:
positive_pairs = [
    [
        j
        for j, positive in enumerate(positives)
        if positive in anchor_to_positives[anchor]
    ]
    for anchor in anchors
]
positive_pairs

[[0], [1, 3], [2]]
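
`positive_pairs` holds the same information as the first `p` columns of `labels`, just
as column indices per anchor. A small sketch of going from one to the other, using plain
lists so it runs without PyTorch; `pairs_to_labels` is a hypothetical helper, not part of
`mpnrl`.

```python
def pairs_to_labels(positive_pairs, num_positives, num_negatives):
    """Expand per-anchor positive column indices into a dense 0/1 matrix."""
    num_columns = num_positives + num_negatives
    labels = [[0.0] * num_columns for _ in positive_pairs]
    for i, columns in enumerate(positive_pairs):
        for j in columns:
            labels[i][j] = 1.0  # columns >= num_positives (negatives) stay 0
    return labels


# Values from the batch above: 3 unique anchors, 4 positives, 4 negatives.
dense = pairs_to_labels([[0], [1, 3], [2]], num_positives=4, num_negatives=4)
for row in dense:
    print(row)
```

Storing only the index lists costs memory proportional to the number of positive pairs
rather than `a * (p + n)`, which may be one way around the sparse-matrix TODO.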

# Demo - training

In [22]:
if torch.cuda.is_available():
    bf16 = torch.cuda.is_bf16_supported()
else:
    bf16 = False

if bf16:
    print("Using mixed precision in bf16")
else:
    print("Not using mixed precision")

Using mixed precision in bf16


In [23]:
# 5, 6. Define training arguments, create the trainer
common_args = dict(
    # Required parameter:
    output_dir=output_dir,
    use_mps_device=False,
    # Optional training parameters:
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    fp16=False,
    bf16=bf16,
    seed=42,
    # Wandb
    report_to="wandb",
    logging_steps=1,
)

if USE_CUSTOM:
    print("Using **CUSTOM** MPNRL")
    train_loss = MultiplePositivesNegativesRankingLoss(model)
    data_collator = MPNRLDataCollator(train_dataset, tokenize_fn=model.tokenize)
    more_args = dict(
        batch_sampler=BatchSamplers.BATCH_SAMPLER,
        # Wandb
        run_name=f"{dataset_name}-mpnrl",
    )
else:
    print("Using OG MNRL")
    train_loss = losses.MultipleNegativesRankingLoss(model)
    data_collator = None
    more_args = dict(
        batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
        # Wandb
        run_name=f"{dataset_name}-mnrl",
    )

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(**common_args, **more_args),
    train_dataset=train_dataset,
    loss=train_loss,
    data_collator=data_collator,
)

Using MPNRL


In [24]:
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()

In [25]:
trainer.train()  # MNRL

TrainOutput(global_step=79, training_loss=1.7303571097458466, metrics={'train_runtime': 100.4997, 'train_samples_per_second': 99.503, 'train_steps_per_second': 0.786, 'total_flos': 0.0, 'train_loss': 1.7303571097458466, 'epoch': 1.0})

In [25]:
trainer.train()  # MPNRL (custom)

TrainOutput(global_step=79, training_loss=213.12996934335442, metrics={'train_runtime': 84.6281, 'train_samples_per_second': 118.164, 'train_steps_per_second': 0.933, 'total_flos': 0.0, 'train_loss': 213.12996934335442, 'epoch': 1.0})

In [26]:
if USE_CUSTOM:  # TODO: understand / debug these
    print(train_loss.scale)
    print(train_loss.bias)

Parameter containing:
tensor(19.9979, device='cuda:0', requires_grad=True)
Parameter containing:
tensor(-10.0021, device='cuda:0', requires_grad=True)
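
For context on what `scale` and `bias` do in a SigLIP-style loss, here is a minimal
sketch in plain Python. It assumes the loss scores every anchor/candidate pair
independently as `sigmoid(scale * similarity + bias)` and divides the summed binary
log-losses by the number of anchors, as in the SigLIP paper. Whether
`MultiplePositivesNegativesRankingLoss` matches this exactly (in particular its
reduction) is an assumption; `sigmoid_pairwise_loss` and the toy numbers are
hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def sigmoid_pairwise_loss(similarities, labels, scale, bias):
    """Sum the per-pair binary log-loss, then divide by the number of anchors."""
    total = 0.0
    for sim_row, label_row in zip(similarities, labels):
        for sim, label in zip(sim_row, label_row):
            sign = 1.0 if label else -1.0  # +1 for positives, -1 for negatives
            total -= log_sigmoid(sign * (scale * sim + bias))
    return total / len(similarities)


# Toy cosine similarities: 2 anchors x 3 candidates.
similarities = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = [[1, 0, 0], [0, 1, 0]]
flipped_labels = [[0, 1, 0], [1, 0, 0]]

good = sigmoid_pairwise_loss(similarities, labels, scale=20.0, bias=-10.0)
flipped = sigmoid_pairwise_loss(similarities, flipped_labels, scale=20.0, bias=-10.0)
print(good, flipped)  # labels that agree with the similarities give the lower loss
```

With a bias near -10, a pair with similarity 0 starts at a logit of about -10, i.e. a
predicted match probability near zero, which suits batches where most pairs are
negatives. Also note that a per-pair sum like this lives on a different scale than MNRL's
cross-entropy, so the two training losses aren't directly comparable.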


In [27]:
# MNRL
peak_memory_allocated = torch.cuda.max_memory_allocated()
peak_memory_reserved = torch.cuda.max_memory_reserved()

print(f"Peak memory allocated: {peak_memory_allocated / 1024**3:.2f} GB")
print(f"Peak memory reserved: {peak_memory_reserved / 1024**3:.2f} GB")

Peak memory allocated: 4.43 GB
Peak memory reserved: 4.99 GB


In [27]:
# MPNRL (custom)
peak_memory_allocated = torch.cuda.max_memory_allocated()
peak_memory_reserved = torch.cuda.max_memory_reserved()

print(f"Peak memory allocated: {peak_memory_allocated / 1024**3:.2f} GB")
print(f"Peak memory reserved: {peak_memory_reserved / 1024**3:.2f} GB")

Peak memory allocated: 4.62 GB
Peak memory reserved: 5.83 GB


# Demo - eval

In [28]:
# 7. Evaluate the model performance on the STS Benchmark test dataset
test_dataset = load_dataset("sentence-transformers/stsb", split="test")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=test_dataset["sentence1"],
    sentences2=test_dataset["sentence2"],
    scores=test_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="sts-test",
)

In [29]:
test_result = evaluator(model)

In [30]:
test_result  # MNRL

{'sts-test_pearson_cosine': 0.7173359658890913,
 'sts-test_spearman_cosine': 0.7179704210761317}

In [30]:
test_result  # MPNRL (custom)

{'sts-test_pearson_cosine': 0.7469811106886857,
 'sts-test_spearman_cosine': 0.7287394643279475}

I'm pretty sure the scores are higher b/c MPNRL went through more data than MNRL did.
The gap would probably shrink (diminishing returns) with a larger training set.
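
As a reminder of what those two metrics measure, here is a small dependency-free sketch:
cosine similarity per sentence pair, then Pearson and Spearman correlation against the
gold scores. The embeddings and gold scores are made-up numbers, the helper names are
hypothetical, and the rank function ignores ties for brevity, so this is not the
evaluator's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up 2-d embeddings for three sentence pairs, plus made-up gold scores.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [0.9, 0.6, 0.1]
predicted = [cosine(u, v) for u, v in pairs]
print(pearson(predicted, gold), spearman(predicted, gold))
```

Spearman only looks at the ordering of the pairs, so it can stay flat while Pearson
moves, which is worth keeping in mind when comparing the two runs.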

In [None]:
# # 8. Save the trained & evaluated model locally
# final_output_dir = f"{output_dir}/final"
# model.save(final_output_dir)
# final_output_dir