# Train and Fine-Tune Sentence Transformers Models - Notebook Companion

In [94]:
%%capture
!pip install sentence-transformers

## How Sentence Transformers models work


In [95]:
from sentence_transformers import SentenceTransformer, models

## Step 1: use an existing language model
word_embedding_model = models.Transformer('microsoft/deberta-v3-small', max_seq_length=500)

## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

## Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.bias']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Special tokens have 
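To make step 2 more concrete, here is a minimal sketch of mean pooling written in plain Python. It assumes the pooling layer averages the token embeddings while skipping padding tokens, which is our understanding of the default behavior of `models.Pooling`. The `mean_pool` helper and the toy vectors are hypothetical and only for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [t / max(count, 1) for t in total]


# Three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

Whatever the sentence length, the result is a single fixed-size vector, which is what makes sentence embeddings directly comparable.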

## How to prepare your dataset for training a Sentence Transformers model


In [96]:
%%capture
!pip install datasets

In [97]:
from datasets import load_dataset

dataset_id = "embedding-data/QQP_triplets"
# dataset_id = "embedding-data/sentence-compression"

dataset = load_dataset(dataset_id)



  0%|          | 0/1 [00:00<?, ?it/s]

In [98]:
print(f"- The {dataset_id} dataset has {dataset['train'].num_rows} examples.")
print(f"- Each example is a {type(dataset['train'][0])} with a {type(dataset['train'][0]['set'])} as value.")
print(f"- Examples look like this: {dataset['train'][0]}")

- The embedding-data/QQP_triplets dataset has 101762 examples.
- Each example is a <class 'dict'> with a <class 'dict'> as value.
- Examples look like this: {'set': {'query': 'Why in India do we not have one on one political debate as in USA?', 'pos': ['Why cant we have a public debate between politicians in India like the one in US?'], 'neg': ['Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?', 'Why do politicians, instead of having a decent debate on issues going in and around the world, end up fighting always?', 'Can educated politicians make a difference in India?', 'What are some unusual aspects about politics and government in India?', 'What is debate?', 'Why does civic public communication and discourse seem so hollow in modern India?', 'What is a Parliamentary debate?', "Why do we always have two candidates at the U.S. presidential debate. yet the ballot has about 7 candidates? Isn't that a misrepresentation of democracy?", 'Wh

Convert the examples into `InputExample`s. It might take around 10 minutes in Google Colab.

In [100]:
from tqdm.auto import tqdm
from sentence_transformers import InputExample

train_examples = []
n_examples = 1000 
## For training with the entire dataset you can use `for i in range(dataset['train'].num_rows):`

for i in tqdm(range(n_examples)):
  example = dataset['train']['set'][i]
  train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))
  # Print every 50th example to show what it looks like
  if i % 50 == 0:
    print(f"Anchor: {example['query']} --- Positive: {example['pos'][0]} --- Negative: {example['neg'][0]}")

  0%|          | 0/1000 [00:00<?, ?it/s]

Anchor: Why in India do we not have one on one political debate as in USA? --- Positive: Why cant we have a public debate between politicians in India like the one in US? --- Negative: Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?
Anchor: Can imaginary time, energy and gravity exist? --- Positive: Does imaginary gravity exist? --- Negative: How does gravity exist?
Anchor: What were the books Aman Bansal used for his Jee preparation? --- Positive: Which books were used my Aman Bansal for JEE preparation? --- Negative: What are some good books for IIT JEE preparation for class 10?
Anchor: How does Donald Trump expect Mexico to pay for his proposed border wall? --- Positive: How Donald Trump will make Mexico pay for the wall? --- Negative: What if Trump's wall isn't a physical wall at all? He said Mexico would pay. What are some financial deterrents the US could impose? Is it possible?
Anchor: Is World War III on its way right now? ---
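The loop above keeps only the first positive and the first negative of each record. If you wanted to use every combination instead, one possible approach is sketched below. The `expand_triplets` helper and the sample record are hypothetical; the record only mimics the `query`/`pos`/`neg` layout shown earlier.

```python
from itertools import product


def expand_triplets(record):
    """Return every (anchor, positive, negative) combination in one record."""
    return [
        (record["query"], pos, neg)
        for pos, neg in product(record["pos"], record["neg"])
    ]


# Hypothetical record shaped like one entry of the 'set' column.
record = {
    "query": "What is debate?",
    "pos": ["What does debating mean?"],
    "neg": ["What is a Parliamentary debate?", "Who won the debate?"],
}
triplets = expand_triplets(record)
print(len(triplets))  # 2
```

Each tuple could then be passed as `texts=list(triplet)` when building an `InputExample`. Keep in mind that this multiplies the number of training examples.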

In [101]:
print(f"We have a {type(train_examples)} of length {len(train_examples)} containing {type(train_examples[0])}'s.")

We have a <class 'list'> of length 1000 containing <class 'sentence_transformers.readers.InputExample.InputExample'>'s.


We wrap our training examples in a PyTorch `DataLoader` to shuffle them and group them into batches.

In [102]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

## Loss functions for training a Sentence Transformers model


In [103]:
from sentence_transformers import losses

train_loss = losses.TripletLoss(model=model)
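As a rough illustration of what a triplet loss computes, the sketch below works on plain Python lists instead of model outputs. It assumes Euclidean distance and a margin of 5, which we believe match the defaults of `losses.TripletLoss`; treat both as assumptions. The helper names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
positive = [3.0, 4.0]    # distance 5 from the anchor
negative = [9.0, 12.0]   # distance 15 from the anchor
print(triplet_loss(anchor, positive, negative))  # 0.0
print(triplet_loss(anchor, negative, positive))  # 15.0
```

The second call swaps the roles of the positive and the negative, so the "positive" is now farther away and the loss becomes large.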

## How to train a Sentence Transformer model


In [104]:
num_epochs = 10

warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% of the total training steps
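For reference, the arithmetic behind this value can be worked out by hand. The sketch below assumes the 1000 examples and batch size of 16 used above, and that the `DataLoader` keeps the last partial batch (the PyTorch default).

```python
import math

n_examples = 1000
batch_size = 16
num_epochs = 10

# Equivalent to len(train_dataloader) when the last partial batch is kept.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)  # 63 630 63
```

So the learning rate warms up over roughly the first epoch (63 of 630 steps).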

In [105]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps) 

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/63 [00:00<?, ?it/s]


## How to share a Sentence Transformers model on the Hugging Face Hub

In [73]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
[1m[31mAuthenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in yo

In [107]:
model.save_to_hub(
    "deberta-sentence-transformer", 
    organization="embedding-data",
    train_datasets=["embedding-data/QQP_triplets"],
    exist_ok=True, 
    )

Cloning https://huggingface.co/embedding-data/deberta-sentence-transformer into local empty directory.


Download file pytorch_model.bin:   0%|          | 3.47k/539M [00:00<?, ?B/s]

Download file spm.model:   0%|          | 3.48k/2.35M [00:00<?, ?B/s]

Download file tokenizer.json:   0%|          | 3.48k/8.26M [00:00<?, ?B/s]

Clean file spm.model:   0%|          | 1.00k/2.35M [00:00<?, ?B/s]

Clean file tokenizer.json:   0%|          | 1.00k/8.26M [00:00<?, ?B/s]

Clean file pytorch_model.bin:   0%|          | 1.00k/539M [00:00<?, ?B/s]

Upload file pytorch_model.bin:   0%|          | 3.34k/539M [00:00<?, ?B/s]

To https://huggingface.co/embedding-data/deberta-sentence-transformer
   e04b3da..424b703  main -> main




'https://huggingface.co/embedding-data/deberta-sentence-transformer/commit/424b7034e4e5f4315716c2b5800fa296b02c92d7'

## Extra: How to fine-tune a Sentence Transformer model


Now we will fine-tune our Sentence Transformer model.

In [108]:
modelB = SentenceTransformer('embedding-data/deberta-sentence-transformer')

In [109]:
dataset_id = "embedding-data/sentence-compression"
datasetB = load_dataset(dataset_id)



  0%|          | 0/1 [00:00<?, ?it/s]

In [110]:
print(f"Examples look like this: {datasetB['train']['set'][0]}")

Examples look like this: ["The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints.", 'USHL completes expansion draft']


In [111]:
train_examplesB = []
n_examples = 500 
## For training with the entire dataset you can use `for i in range(datasetB['train'].num_rows):`

for i in tqdm(range(n_examples)):
  example = datasetB['train']['set'][i]
  train_examplesB.append(InputExample(texts=[example[0], example[1]]))
  # Print every 50th example to show what it looks like
  if i % 50 == 0:
    print(f"Anchor: {example[0]} --- Positive: {example[1]}")

  0%|          | 0/500 [00:00<?, ?it/s]

Anchor: The USHL completed an expansion draft on Monday as 10 players who were on the rosters of USHL teams during the 2009-10 season were selected by the League's two newest entries, the Muskegon Lumberjacks and Dubuque Fighting Saints. --- Positive: USHL completes expansion draft
Anchor: A WOMAN has been seriously injured in a collision with a police van in North Devon. --- Positive: Woman seriously injured in collision with police van
Anchor: The war of words between Team Anna and the government grew intense on Sunday, with Kiran Bedi describing Prime Minister Manmohan Singh as Dhritarashtra. --- Positive: Team Anna's Kiran Bedi describes Manmohan Singh as 'Dhritarashtra'
Anchor: Moga, July 15 Two unidentified migrant labourers were crushed to death by a speeding vehicle near Chuharchak village here when they were standing near the roadside. --- Positive: Two migrant labourers crushed to death
Anchor: Partition affected the Indian Muslims adversely, especially those in the country's

In [119]:
train_dataloaderB = DataLoader(train_examplesB, shuffle=True, batch_size=64)
train_lossB = losses.MultipleNegativesRankingLoss(model=modelB)
num_epochsB = 10
warmup_stepsB = int(len(train_dataloaderB) * num_epochsB * 0.1)  # 10% of the total training steps
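Because this dataset only has (anchor, positive) pairs, the loss has to find negatives somewhere else. The sketch below illustrates the idea as we understand it: every other positive in the batch acts as a negative for a given anchor, and the loss is a cross-entropy over scaled cosine similarities. The scale of 20 and the use of cosine similarity are assumptions about the library defaults, and `mnr_loss` is a hypothetical helper on plain lists, not the library implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    per_anchor = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over every candidate in the batch.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        per_anchor.append(log_sum_exp - scores[i])
    return sum(per_anchor) / len(per_anchor)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive matches the other anchor
print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

This also suggests why a larger batch size (64 here) can help with this loss: each anchor is compared against more in-batch negatives.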

In [None]:
modelB.fit(train_objectives=[(train_dataloaderB, train_lossB)],
           epochs=num_epochsB,
           warmup_steps=warmup_stepsB)