<a href="https://colab.research.google.com/github/gupta24789/sentence-transformers/blob/main/02_train_sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Train Sentence Transformers Models

In [3]:
!pip install -q datasets
!pip install -q sentence-transformers

In [4]:
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

## Define Model

In [25]:
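# Build a sentence-embedding model on top of a pretrained transformer (the weights are not random; only the pooling head is new)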
## Step 1: use an existing language model
word_embedding_model = models.Transformer('distilroberta-base')
## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
## Join steps 1 and 2 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
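The pooling step turns a variable number of token embeddings into one fixed-size sentence vector. When only the embedding dimension is passed, as above, `models.Pooling` should fall back to mean pooling over the non-padding tokens (worth confirming against the library docs for your installed version). The sketch below shows that idea in plain Python; `mean_pool` and the toy inputs are hypothetical names for illustration, not part of sentence-transformers.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                summed[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [s / count for s in summed]


# Toy input: three tokens of dimension 2; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding token is excluded from the average, so its large values do not affect the result.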

## Load Dataset

In [5]:
dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)


In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 101762
    })
})

In [14]:
dataset['train']['set'][0]

{'query': 'Why in India do we not have one on one political debate as in USA?',
 'pos': ['Why cant we have a public debate between politicians in India like the one in US?'],
 'neg': ['Can people on Quora stop India Pakistan debate? We are sick and tired seeing this everyday in bulk?',
  'Why do politicians, instead of having a decent debate on issues going in and around the world, end up fighting always?',
  'Can educated politicians make a difference in India?',
  'What are some unusual aspects about politics and government in India?',
  'What is debate?',
  'Why does civic public communication and discourse seem so hollow in modern India?',
  'What is a Parliamentary debate?',
  "Why do we always have two candidates at the U.S. presidential debate. yet the ballot has about 7 candidates? Isn't that a misrepresentation of democracy?",
  'Why is civic public communication and discourse so hollow in modern India?',
  "Aren't the Presidential debates teaching our whole country terrible c

In [22]:
## Convert the examples into InputExamples
train_examples = []
n_examples = 10000     ## use only the first 10,000 records to keep training short
train_data = dataset['train']['set']

## From each record, take the anchor (query), the first positive, and the first negative
for i in range(n_examples):
  example = train_data[i]
  train_examples.append(InputExample(texts=[example['query'], example['pos'][0], example['neg'][0]]))
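The loop above assumes every record has at least one positive and one negative, and it discards all but the first negative. A minimal sketch of a more defensive conversion follows. `record_to_triplets`, `max_negatives`, and the toy record are hypothetical and written only for illustration; the positive sentence in the toy record is made up, not taken from the dataset.

```python
def record_to_triplets(record, max_negatives=1):
    """Return (anchor, positive, negative) tuples for one QQP-style record.

    Records with no positive or no negative yield an empty list instead of
    raising an IndexError.
    """
    query = record["query"]
    positives = record.get("pos") or []
    negatives = record.get("neg") or []
    if not positives or not negatives:
        return []
    return [(query, positives[0], neg) for neg in negatives[:max_negatives]]


# Toy record with the same keys as dataset['train']['set'][i].
record = {
    "query": "What is debate?",
    "pos": ["What does debate mean?"],  # invented example sentence
    "neg": [
        "What is a Parliamentary debate?",
        "Can educated politicians make a difference in India?",
    ],
}

one_triplet = record_to_triplets(record)
two_triplets = record_to_triplets(record, max_negatives=2)
no_triplets = record_to_triplets({"query": "x", "pos": [], "neg": ["y"]})

print(len(one_triplet), len(two_triplets), len(no_triplets))  # 1 2 0
```

Each tuple can then be wrapped as `InputExample(texts=list(triplet))`, matching the cell above.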

In [23]:
## DataLoaders
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

In [26]:
## loss function
train_loss = losses.TripletLoss(model=model)
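Triplet loss pushes the anchor closer to the positive than to the negative by at least a margin: `max(d(anchor, positive) - d(anchor, negative) + margin, 0)`. The sketch below computes that quantity in plain Python. It assumes Euclidean distance and a margin of 5, which should match the defaults of `losses.TripletLoss` but is worth checking in the library docs; `euclidean` and `triplet_loss` are hypothetical helper names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge on the gap between anchor-positive and anchor-negative distance."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [0.0, 0.0]
near = [3.0, 4.0]   # distance 5 from the anchor
far = [6.0, 8.0]    # distance 10 from the anchor

# Positive is closer than the negative by exactly the margin: no loss.
loss_satisfied = triplet_loss(anchor, positive=near, negative=far)
# Roles swapped: the "positive" is farther away, so the loss is positive.
loss_violated = triplet_loss(anchor, positive=far, negative=near)

print(loss_satisfied, loss_violated)  # 0.0 10.0
```

A triplet contributes no gradient once the negative is at least `margin` farther from the anchor than the positive is.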

## Train Model

In [None]:
num_epochs = 5
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # warm up over the first 10% of all training steps

## training
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps)
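For the settings used in this notebook, the warmup arithmetic works out as below. The sketch assumes the `DataLoader` keeps the last partial batch (its default), so the number of batches per epoch is the ceiling of examples divided by batch size.

```python
import math

n_examples = 10000
batch_size = 16
num_epochs = 5

# Batches per epoch, counting a final partial batch if there is one.
steps_per_epoch = math.ceil(n_examples / batch_size)
# Total optimizer steps across all epochs.
total_steps = steps_per_epoch * num_epochs
# Learning-rate warmup covers the first 10% of those steps.
warmup_steps = int(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)  # 625 3125 312
```

During those first 312 steps the learning rate ramps up from zero rather than starting at its full value, which tends to stabilize early fine-tuning.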

In [None]:
## save model to disk (the output path is a placeholder; choose your own)
model.save('output/qqp-triplet-model')