<h1>Chapter 10 - Creating Text Embedding Models</h1>
<i>Exploring methods for both training and fine-tuning embedding models.</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter10/Chapter%2010%20-%20Creating%20Text%20Embedding%20Models.ipynb)

---

This notebook is for Chapter 10 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
# %%capture
# Quote the version specifiers so the shell does not treat `>` as an output redirect
# !pip install -q "accelerate>=0.27.2" "peft>=0.9.0" "bitsandbytes>=0.43.0" "transformers>=4.38.2" "trl>=0.7.11" "sentencepiece>=0.1.99"
# !pip install -q "sentence-transformers>=3.0.0" "mteb>=1.1.2" "datasets>=2.18.0"

# Creating an Embedding Model

## **Data**

In [1]:
from datasets import load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

Generating train split:   0%|          | 0/392702 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9815 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9832 [00:00<?, ? examples/s]

Generating test_matched split:   0%|          | 0/9796 [00:00<?, ? examples/s]

Generating test_mismatched split:   0%|          | 0/9847 [00:00<?, ? examples/s]

In [4]:
print(train_dataset)
train_dataset[2]

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 50000
})


{'premise': 'One of our number will carry out your instructions minutely.',
 'hypothesis': 'A member of my team will execute your orders with immense precision.',
 'label': 0}

## **Model**

In [5]:
from sentence_transformers import SentenceTransformer

# Use a base model
embedding_model = SentenceTransformer("bert-base-uncased")

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

## **Loss Function**

In [6]:
from sentence_transformers import losses

# Define the loss function. Softmax loss needs the number of labels set explicitly (3 for MNLI).
train_loss = losses.SoftmaxLoss(
    model=embedding_model,
    sentence_embedding_dimension=embedding_model.get_sentence_embedding_dimension(),
    num_labels=3,
)
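As a rough illustration of what this loss optimizes: in the Sentence-BERT formulation, the two sentence embeddings `u` and `v` are concatenated with their element-wise difference `|u - v|` and passed to a linear classifier over the labels. The sketch below uses plain Python with toy vectors and made-up classifier weights; `softmax_loss_features` and `cross_entropy` are hypothetical helper names for illustration, not part of sentence-transformers.

```python
import math


def softmax_loss_features(u, v):
    """Concatenate u, v and |u - v| into one feature vector."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def cross_entropy(logits, label):
    """Cross-entropy of a single example, computed from raw logits."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


# Toy 2-dimensional "sentence embeddings" (illustrative values only)
u = [0.5, -1.0]
v = [0.25, 1.0]
features = softmax_loss_features(u, v)

# Hypothetical classifier weights: 3 labels x 6 features
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
logits = [sum(w * f for w, f in zip(row, features)) for row in weights]

# Loss for a pair whose true label is 0 (entailment)
loss = cross_entropy(logits, label=0)
print(len(features), loss)
```

The feature vector is three times the embedding size, which is why the loss needs `sentence_embedding_dimension` in addition to `num_labels`.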

## **Evaluation**

In [None]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
print(val_sts)
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 1500
})
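To make the evaluator's output easier to read: it embeds both sentences of each pair, takes their cosine similarity, and reports how well those similarities correlate (Pearson and Spearman) with the human scores, which are divided by 5 to map the 0 to 5 STSB scale onto 0 to 1. The following is a minimal pure-Python sketch of that idea using toy vectors in place of real embeddings; the helper names are hypothetical and ties in the ranking are ignored for brevity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy "embeddings" for three sentence pairs and their gold scores (0-5 scale)
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [score / 5 for score in [4.8, 2.5, 0.2]]

predicted = [cosine(a, b) for a, b in pairs]
print(pearson(predicted, gold), spearman(predicted, gold))
```

Spearman only compares rankings, so it is less sensitive than Pearson to how the similarity values are scaled.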


## **Training**

In [9]:
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="base_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

In [10]:
from sentence_transformers.trainer import SentenceTransformerTrainer

# Train embedding model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

Column 'hypothesis' is at index 1, whereas a column with this name is usually expected at index 0. Note that the column order can be important for some losses, e.g. MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names. Consider renaming the columns to match the expected order, e.g.:
dataset = dataset.select_columns(['hypothesis', 'entailment', 'contradiction'])


Step,Training Loss
100,1.0786
200,0.9484
300,0.8918
400,0.8431
500,0.8183
600,0.8281
700,0.8068
800,0.7995
900,0.7801
1000,0.7697


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=1563, training_loss=0.8138990200870097, metrics={'train_runtime': 181.1458, 'train_samples_per_second': 276.021, 'train_steps_per_second': 8.628, 'total_flos': 0.0, 'train_loss': 0.8138990200870097, 'epoch': 1.0})
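The `global_step=1563` reported above follows directly from the dataset size and batch size. A small worked check, assuming a single device and no gradient accumulation (`steps_per_epoch` is a hypothetical helper, not a library function):

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps in one epoch, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size)


# 50,000 MNLI examples with per_device_train_batch_size=32
print(steps_per_epoch(50_000, 32))  # 1563
```

The final batch is smaller than 32, which is why the division is rounded up.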

In [11]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.5353668381643834, 'spearman_cosine': 0.609079110878603}

# MTEB

In [18]:
from pprint import pprint

from mteb import MTEB

# Choose evaluation task
evaluation = MTEB(tasks=["Banking77Classification"])

# Calculate results
results = evaluation.run(embedding_model)
pprint(results[0].to_dict())

Passing task names as strings is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(tasks=[...])` method to get tasks instead.


{'dataset_revision': '0fd18e25b25c072e09e0d92ab615fda904d66300',
 'evaluation_time': 13.096168994903564,
 'kg_co2_emissions': None,
 'mteb_version': '1.28.0',
 'scores': {'test': [{'accuracy': 0.592175,
                      'f1': 0.590248,
                      'f1_weighted': 0.590248,
                      'hf_subset': 'default',
                      'languages': ['eng-Latn'],
                      'main_score': 0.592175,
                      'scores_per_experiment': [{'accuracy': 0.587987,
                                                 'f1': 0.587427,
                                                 'f1_weighted': 0.587427},
                                                {'accuracy': 0.570779,
                                                 'f1': 0.568697,
                                                 'f1_weighted': 0.568697},
                                                {'accuracy': 0.607468,
                                                 'f1': 0.603815,
             

⚠️ **VRAM Clean-up** - Run the code below to partially empty the VRAM (GPU RAM). If that does not work, restart the notebook instead. If you are using Google Colab, the resource panel on the right-hand side shows whether VRAM usage is indeed low. You can also run `!nvidia-smi` to check current usage.

In [19]:
# Empty and delete trainer/model
trainer.accelerator.clear()
del trainer, embedding_model

# Garbage collection and empty cache
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
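One way to confirm that the clean-up had an effect is to query PyTorch's own memory counters. This is a small sketch rather than a required step; `format_gib` is a hypothetical helper, and the snippet falls back to a message when PyTorch or a CUDA device is not available.

```python
def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024 ** 3:.2f} GiB"


try:
    import torch
except ImportError:  # PyTorch not installed in this environment
    torch = None

if torch is not None and torch.cuda.is_available():
    # Memory held by live tensors vs. memory reserved by the caching allocator
    print("allocated:", format_gib(torch.cuda.memory_allocated()))
    print("reserved: ", format_gib(torch.cuda.memory_reserved()))
else:
    print("No CUDA device visible")
```

If the reserved figure stays high after `torch.cuda.empty_cache()`, some objects are probably still referenced, and restarting the notebook is the more reliable option.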

# Loss Functions

⚠️ **VRAM Clean-up**
* `Restart` the notebook to free memory before moving on to the next training example.

## Cosine Similarity Loss

In [20]:
from datasets import Dataset, load_dataset

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# (neutral/contradiction)=0 and (entailment)=1
mapping = {2: 0, 1: 0, 0: 1}
train_dataset = Dataset.from_dict(
    {
        "sentence1": train_dataset["premise"],
        "sentence2": train_dataset["hypothesis"],
        "label": [float(mapping[label]) for label in train_dataset["label"]],
    }
)
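The labels are converted to floats because cosine similarity loss treats them as regression targets: it pushes the cosine similarity of each sentence pair toward its label, using mean squared error by default. A minimal pure-Python sketch with toy vectors in place of real embeddings (the helper names are hypothetical):

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_similarity_loss(pairs, labels):
    """Mean squared error between pairwise cosine similarities and target labels."""
    errors = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# One identical pair (similarity 1.0) and one orthogonal pair (similarity 0.0)
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]

print(cosine_similarity_loss(pairs, [1.0, 0.0]))  # 0.0
print(cosine_similarity_loss(pairs, [0.0, 1.0]))  # 1.0
```

With the mapping above, entailment pairs are pulled toward a similarity of 1 and neutral or contradiction pairs toward 0.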

In [21]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

In [22]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define model
embedding_model = SentenceTransformer("bert-base-uncased")

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="cosineloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
100,0.2298
200,0.171
300,0.1683
400,0.1604
500,0.1492
600,0.1568
700,0.1501
800,0.1545
900,0.1483
1000,0.1477


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=1563, training_loss=0.15677150914246504, metrics={'train_runtime': 182.0847, 'train_samples_per_second': 274.598, 'train_steps_per_second': 8.584, 'total_flos': 0.0, 'train_loss': 0.15677150914246504, 'epoch': 1.0})

In [23]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.7251246311202354, 'spearman_cosine': 0.7273200557984623}

⚠️ **VRAM Clean-up**
* `Restart` the notebook to free memory before moving on to the next training example.

In [24]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

## Multiple Negatives Ranking Loss

In [25]:
import random
from tqdm import tqdm
from datasets import Dataset, load_dataset

# Load MNLI dataset from GLUE and keep only entailment pairs (label 0)
mnli = load_dataset("glue", "mnli", split="train").select(range(50_000))
mnli = mnli.remove_columns("idx")
mnli = mnli.filter(lambda x: x["label"] == 0)

# Prepare data and add a soft negative: a randomly shuffled hypothesis
train_dataset = {"anchor": [], "positive": [], "negative": []}
soft_negatives = list(mnli["hypothesis"])  # copy into a list so it can be shuffled in place
random.shuffle(soft_negatives)
for row, soft_negative in tqdm(zip(mnli, soft_negatives)):
    train_dataset["anchor"].append(row["premise"])
    train_dataset["positive"].append(row["hypothesis"])
    train_dataset["negative"].append(soft_negative)
train_dataset = Dataset.from_dict(train_dataset)
len(train_dataset)

16875it [00:00, 26267.57it/s]


16875
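Multiple negatives ranking loss scores each anchor against every candidate in the batch and applies cross-entropy so that the anchor's own positive ranks first; the other rows' positives (and any explicit negatives, such as the shuffled hypotheses above) serve as negatives. The sketch below is a simplified pure-Python version on toy vectors. It assumes cosine similarity scaled by 20, which is meant to mirror the library's defaults but should be treated as an assumption, and `mnr_loss` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy where candidates[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each positive is close to the other anchor

print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

Because every other example in the batch acts as a negative, larger batch sizes give each anchor more negatives to be ranked against.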

In [26]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

In [27]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define model
embedding_model = SentenceTransformer("bert-base-uncased")

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="mnrloss_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
100,0.3246
200,0.1069
300,0.0824
400,0.0676
500,0.0679


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=528, training_loss=0.12604896769379126, metrics={'train_runtime': 77.6926, 'train_samples_per_second': 217.202, 'train_steps_per_second': 6.796, 'total_flos': 0.0, 'train_loss': 0.12604896769379126, 'epoch': 1.0})

In [None]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.8070727434643791,
 'spearman_cosine': 0.8106193672462586,
 'pearson_manhattan': 0.8213132116968124,
 'spearman_manhattan': 0.8164551132664518,
 'pearson_euclidean': 0.820988086354926,
 'spearman_euclidean': 0.8160139830687847,
 'pearson_dot': 0.7429357515240518,
 'spearman_dot': 0.7316164586329814,
 'pearson_max': 0.8213132116968124,
 'spearman_max': 0.8164551132664518}

# **Fine-tuning**

⚠️ **VRAM Clean-up**
* `Restart` the notebook to free memory before moving on to the next training example.

In [28]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

## **Supervised**

In [1]:
from datasets import load_dataset
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Load MNLI dataset from GLUE
# 0 = entailment, 1 = neutral, 2 = contradiction
train_dataset = load_dataset("glue", "mnli", split="train").select(range(50_000))
train_dataset = train_dataset.remove_columns("idx")

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

Downloading readme:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

In [2]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define model
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Loss function
train_loss = losses.MultipleNegativesRankingLoss(model=embedding_model)

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

Column 'hypothesis' is at index 1, whereas a column with this name is usually expected at index 0. Note that the column order can be important for some losses, e.g. MultipleNegativesRankingLoss will always consider the first column as the anchor and the second as the positive, regardless of the dataset column names. Consider renaming the columns to match the expected order, e.g.:
dataset = dataset.select_columns(['hypothesis', 'entailment', 'contradiction'])


Step,Training Loss
100,0.1614
200,0.1132
300,0.1185
400,0.114
500,0.114
600,0.0975
700,0.1112
800,0.1016
900,0.1062
1000,0.1079


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=1563, training_loss=0.11048706204785953, metrics={'train_runtime': 64.2082, 'train_samples_per_second': 778.716, 'train_steps_per_second': 24.343, 'total_flos': 0.0, 'train_loss': 0.11048706204785953, 'epoch': 1.0})

In [3]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.8449820425863115, 'spearman_cosine': 0.845499141455288}

In [4]:
# Evaluate the pre-trained model
original_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluator(original_model)

{'pearson_cosine': 0.8696194558644238, 'spearman_cosine': 0.8671631197908374}

⚠️ **VRAM Clean-up**
* `Restart` the notebook to free memory before moving on to the next training example.

In [5]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

## **Augmented SBERT**

**Step 1:** Fine-tune a cross-encoder

In [6]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset, Dataset
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader

# Prepare a small gold set of 10,000 sentence pairs for the cross-encoder
dataset = load_dataset("glue", "mnli", split="train").select(range(10_000))
mapping = {2: 0, 1: 0, 0: 1}

# Data Loader
gold_examples = [
    InputExample(texts=[row["premise"], row["hypothesis"]], label=mapping[row["label"]])
    for row in tqdm(dataset)
]
gold_dataloader = NoDuplicatesDataLoader(gold_examples, batch_size=32)

# Pandas DataFrame for easier data handling
gold = pd.DataFrame(
    {
        "sentence1": dataset["premise"],
        "sentence2": dataset["hypothesis"],
        "label": [mapping[label] for label in dataset["label"]],
    }
)

100%|█████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 42291.39it/s]


In [7]:
from sentence_transformers.cross_encoder import CrossEncoder

# Train a cross-encoder on the gold dataset
cross_encoder = CrossEncoder("bert-base-uncased", num_labels=2)
cross_encoder.fit(
    train_dataloader=gold_dataloader,
    epochs=1,
    show_progress_bar=True,
    warmup_steps=100,
    use_amp=False,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/312 [00:00<?, ?it/s]

**Step 2:** Create new sentence pairs

In [8]:
# Collect the unlabeled sentence pairs for the silver dataset; the cross-encoder labels them in the next step
silver = load_dataset("glue", "mnli", split="train").select(range(10_000, 50_000))
pairs = list(zip(silver["premise"], silver["hypothesis"]))

**Step 3:** Label new sentence pairs with the fine-tuned cross-encoder (silver dataset)

In [9]:
import numpy as np

# Label the sentence pairs using our fine-tuned cross-encoder
output = cross_encoder.predict(pairs, apply_softmax=True, show_progress_bar=True)
silver = pd.DataFrame(
    {
        "sentence1": silver["premise"],
        "sentence2": silver["hypothesis"],
        "label": np.argmax(output, axis=1),
    }
)
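With `apply_softmax=True`, the cross-encoder returns one probability per class for each pair, and `np.argmax` keeps the most likely class as the silver label. The same step in plain Python, using made-up logits purely for illustration:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for two sentence pairs over the classes (0, 1)
logits_per_pair = [[-1.2, 2.3], [0.8, -0.4]]

probabilities = [softmax(logits) for logits in logits_per_pair]
silver_labels = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
print(silver_labels)  # [1, 0]
```

Softmax preserves the ordering of the logits, so the argmax is the same with or without it; the probabilities are mainly useful if you want to inspect how confident each silver label is.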

Batches:   0%|          | 0/1250 [00:00<?, ?it/s]

**Step 4:** Train a bi-encoder (SBERT) on the extended dataset (gold + silver dataset)

In [10]:
# Combine gold + silver
data = pd.concat([gold, silver], ignore_index=True, axis=0)
data = data.drop_duplicates(subset=["sentence1", "sentence2"], keep="first")
train_dataset = Dataset.from_pandas(data, preserve_index=False)

In [11]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

In [12]:
from sentence_transformers import losses, SentenceTransformer
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define model
embedding_model = SentenceTransformer("bert-base-uncased")

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="augmented_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
100,0.2156
200,0.1589
300,0.143
400,0.1405
500,0.1391
600,0.136
700,0.136
800,0.1353
900,0.1329
1000,0.1318


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=1563, training_loss=0.14066580450847527, metrics={'train_runtime': 183.3412, 'train_samples_per_second': 272.705, 'train_steps_per_second': 8.525, 'total_flos': 0.0, 'train_loss': 0.14066580450847527, 'epoch': 1.0})

In [13]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.7121389296385138, 'spearman_cosine': 0.7195521073910296}

In [14]:
trainer.accelerator.clear()

[]

**Step 5:** Evaluate without the silver dataset

In [15]:
# Use only the gold dataset
data = pd.concat([gold], ignore_index=True, axis=0)
data = data.drop_duplicates(subset=["sentence1", "sentence2"], keep="first")
train_dataset = Dataset.from_pandas(data, preserve_index=False)

# Define model
embedding_model = SentenceTransformer("bert-base-uncased")

# Loss function
train_loss = losses.CosineSimilarityLoss(model=embedding_model)

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="gold_only_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.


Step,Training Loss
100,0.2274
200,0.1719
300,0.1614


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=313, training_loss=0.18601570266504258, metrics={'train_runtime': 36.7881, 'train_samples_per_second': 271.827, 'train_steps_per_second': 8.508, 'total_flos': 0.0, 'train_loss': 0.18601570266504258, 'epoch': 1.0})

In [16]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.6248541039369981, 'spearman_cosine': 0.6508723068251712}

Compared to training on both the gold and silver datasets (Pearson cosine of about 0.71), training on the gold dataset alone lowers performance (Pearson cosine of about 0.62).

⚠️ **VRAM Clean-up**
* `Restart` the notebook to free memory before moving on to the next training example.

In [17]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

## **Unsupervised Learning**

### Transformer-based Sequential Denoising Auto-Encoder (TSDAE)

In [4]:
# Download additional tokenizer
import nltk

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [5]:
from tqdm import tqdm
from datasets import Dataset, load_dataset
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Create a flat list of sentences
mnli = load_dataset("glue", "mnli", split="train").select(range(25_000))
flat_sentences = mnli["premise"] + mnli["hypothesis"]

# Add noise to our input data
damaged_data = DenoisingAutoEncoderDataset(list(set(flat_sentences)))

# Create dataset
train_dataset = {"damaged_sentence": [], "original_sentence": []}
for data in tqdm(damaged_data):
    train_dataset["damaged_sentence"].append(data.texts[0])
    train_dataset["original_sentence"].append(data.texts[1])
train_dataset = Dataset.from_dict(train_dataset)

100%|██████████████████████████████████████████████████████████████████████████████████████████| 48353/48353 [00:06<00:00, 8051.70it/s]


In [7]:
train_dataset[:5]

{'damaged_sentence': ['brain',
  'Nearly agrees today that was was America could.',
  "don't be expert any aspect at",
  'The did identify statutes executive requirements on rulemaking',
  "n't"],
 'original_sentence': ['Their brain is not clear.',
  'Nearly everyone agrees today that the Vietnam War was unwinnable and was needlessly prolonged so America could save face.',
  "so you know you don't have to be an expert in any aspect of it at all",
  'The FCC did not identify any other statutes or executive orders imposing requirements on the rulemaking.',
  "Don't be offended."]}

In [None]:
# # Choose a different deletion ratio
# flat_sentences = list(set(flat_sentences))
# damaged_data = DenoisingAutoEncoderDataset(
#     flat_sentences,
#     noise_fn=lambda s: DenoisingAutoEncoderDataset.delete(s, del_ratio=0.6)
# )
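To give a feel for what the deletion ratio controls, here is a simplified sketch of word-deletion noise. It assumes plain whitespace tokenization, whereas the library tokenizes with NLTK, and `delete_words` is a hypothetical helper rather than the library's implementation.

```python
import random


def delete_words(sentence, del_ratio=0.6, rng=random):
    """Drop each word with probability del_ratio, always keeping at least one word."""
    words = sentence.split()
    if not words:
        return sentence
    kept = [word for word in words if rng.random() > del_ratio]
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)


rng = random.Random(42)  # seeded only to make the example repeatable
original = "The FCC did not identify any other statutes or executive orders."
damaged = delete_words(original, del_ratio=0.6, rng=rng)
print(damaged)
```

A higher ratio removes more words and makes reconstruction harder for the decoder; a lower ratio leaves the damaged sentence closer to the original.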

In [8]:
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Create an embedding similarity evaluator for stsb
val_sts = load_dataset("glue", "stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=val_sts["sentence1"],
    sentences2=val_sts["sentence2"],
    scores=[score / 5 for score in val_sts["label"]],
    main_similarity="cosine",
)

In [9]:
from sentence_transformers import models, SentenceTransformer

# Create your embedding model
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), "cls"
)
embedding_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
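This model uses `"cls"` pooling, whereas the earlier `SentenceTransformer("bert-base-uncased")` calls fell back to mean pooling, as their log messages show. The difference, sketched in plain Python on toy token embeddings (the helper names are hypothetical):

```python
def cls_pooling(token_embeddings):
    """Use the first token's ([CLS]) embedding as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the token embeddings, ignoring padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    count = sum(attention_mask)
    return [
        sum(token[d] * mask for token, mask in zip(token_embeddings, attention_mask)) / count
        for d in range(dim)
    ]


# Three real tokens followed by one padding token, each with 2 dimensions
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(cls_pooling(tokens))         # [1.0, 2.0]
print(mean_pooling(tokens, mask))  # [3.0, 4.0]
```

Both approaches produce a vector of the same size as a single token embedding; they differ only in which token states contribute to it.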

In [10]:
from sentence_transformers import losses

# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(embedding_model, tie_encoder_decoder=True)
train_loss.decoder = train_loss.decoder.to("cuda")

Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.e

In [11]:
from sentence_transformers.trainer import SentenceTransformerTrainer
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Define the training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="tsdae_embedding_model",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    fp16=True,
    eval_steps=100,
    logging_steps=100,
)

# Train model
trainer = SentenceTransformerTrainer(
    model=embedding_model,
    args=args,
    train_dataset=train_dataset,
    loss=train_loss,
    evaluator=evaluator,
)
trainer.train()

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Step,Training Loss
100,6.8465
200,4.9158
300,4.6496
400,4.4589
500,4.3798
600,4.267
700,4.2227
800,4.1702
900,4.0728
1000,4.0341


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

TrainOutput(global_step=3023, training_loss=4.028012769410754, metrics={'train_runtime': 459.6198, 'train_samples_per_second': 105.202, 'train_steps_per_second': 6.577, 'total_flos': 0.0, 'train_loss': 4.028012769410754, 'epoch': 1.0})

In [12]:
# Evaluate our trained model
evaluator(embedding_model)

{'pearson_cosine': 0.7415350044212735, 'spearman_cosine': 0.7465172711959415}

In [None]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()