# Fine-tune text embeddings
Reference: [Hugging Face blog post on how to train sentence transformers](https://huggingface.co/blog/how-to-train-sentence-transformers)

# Imports

In [9]:
!pip install sentence-transformers datasets -qqq

In [10]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

# Initialize the model

In [5]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_id)



# Datasets

- Several of these losses build negatives implicitly, so the dataset does not always need to provide them. With MultipleNegativesRankingLoss, for example, the negatives for a query are the documents paired with the other queries in the same batch. When explicit (hard) negatives are provided, that loss scores them in addition to the in-batch negatives, whereas TripletLoss uses only the explicit negative from each triplet.
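
To make the idea of in-batch negatives concrete, here is a minimal pure-Python sketch of the kind of objective MultipleNegativesRankingLoss optimizes. It is an illustration, not the library's code: the function names, the toy 2-D vectors, and the `scale` value of 20 (chosen to mirror what I understand to be the library default) are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i and
    every other document in the batch serves as a negative."""
    losses = []
    for i, query in enumerate(query_vecs):
        scores = [scale * cosine(query, doc) for doc in doc_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


# Toy 2-D "embeddings", made up purely for illustration.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
docs_aligned = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.8]]
# Rotating the documents breaks the query/document pairing.
docs_shuffled = [docs_aligned[1], docs_aligned[2], docs_aligned[0]]

loss_aligned = in_batch_negatives_loss(queries, docs_aligned)
loss_shuffled = in_batch_negatives_loss(queries, docs_shuffled)
print(loss_aligned < loss_shuffled)
```

The loss is small when each query is closest to its own document and large when the pairing is broken. No negatives appear in the input: they come from the rest of the batch.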

| dataset_structure           | examples                                                                          | loss                                                                                                | application                                                           |
|-----------------------------|-----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| <query, document, label>    | snli                                                                              | ContrastiveLoss; SoftmaxLoss; CosineSimilarityLoss                                                  | natural language inference (NLI) {entailment, neutral, contradiction} |
| <query, document>           | embedding-data/flickr30k_captions_quintets; embedding-data/coco_captions_quintets | MultipleNegativesRankingLoss; MegaBatchMarginLoss                                                   | natural language inference (NLI) {entailment}                         |
| <query, class>              | trec; yahoo_answers_topics                                                        | BatchHardTripletLoss; BatchAllTripletLoss; BatchHardSoftMarginTripletLoss; BatchSemiHardTripletLoss |                                                                       |
| <query, document, negative> | embedding-data/QQP_triplets                                                       | TripletLoss                                                                                         |                                                                       |

In [11]:
dataset_id = "embedding-data/QQP_triplets"
dataset = load_dataset(dataset_id)

Generating train split: 101762 examples [00:00, 254111.56 examples/s]
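
Each record in this dataset nests a `query`, a list of `pos`, and a list of `neg` under a `set` key (see the sample printed further down), while TripletLoss expects flat (query, document, negative) triplets. The following is a minimal sketch of one way to expand a record. The helper `to_triplets` and its `max_negatives` parameter are hypothetical names introduced here, and the record is a hand-shortened copy of the printed sample.

```python
def to_triplets(record, max_negatives=None):
    """Expand one {'set': {'query', 'pos', 'neg'}} record into
    (query, positive, negative) tuples."""
    item = record["set"]
    negatives = item["neg"]
    if max_negatives is not None:
        # Optionally cap the number of negatives used per record.
        negatives = negatives[:max_negatives]
    return [
        (item["query"], positive, negative)
        for positive in item["pos"]
        for negative in negatives
    ]


# Hand-written record in the same shape as the dataset, shortened.
record = {
    "set": {
        "query": "Why do you use an iPhone?",
        "pos": ["Why do people buy the iPhone?"],
        "neg": ["Why shouldn't I buy an iPhone?", "Why is iPhone so expensive?"],
    }
}

triplets = to_triplets(record)
first_only = to_triplets(record, max_negatives=1)
print(len(triplets), len(first_only))
```

Each tuple could then be wrapped as `InputExample(texts=[query, positive, negative])` from `sentence_transformers`, which is the approach the linked blog post takes using only the first positive and first negative of each record.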


In [20]:
print(f"- The {dataset_id} dataset has {dataset['train'].num_rows} examples.")
print(
    f"- Each example is a {type(dataset['train'][0])} with a {type(dataset['train'][0]['set'])} as value."
)
sample = dataset["train"][-1]
print(f"- Examples look like this: {sample}")
print(f"- Positives: {len(sample.get('set').get('pos'))}")
print(f"- Negatives: {len(sample.get('set').get('neg'))}")

- The embedding-data/QQP_triplets dataset has 101762 examples.
- Each example is a <class 'dict'> with a <class 'dict'> as value.
- Examples look like this: {'set': {'query': 'Why do you use an iPhone?', 'pos': ['Why do people buy the iPhone?'], 'neg': ["Why shouldn't I buy an iPhone?", 'Why is iPhone so expensive?', 'Why are iPhones so expensive?', 'Why iphone are so costly?', 'Why are iPhones costly?', 'Is the iPhone really more expensive? Why or why not?', 'Why people are madly buying iPhone 4 in India, given that it is a more than 3-year-old hardware?', 'Why should I not buy the iPhone 5?', 'Why should I not buy an iPhone 7?', 'Why do some people prefer iPhones to Androids?', 'What are the reasons why people buy Samsung phones?', 'Why are iPhone users so loyal to the brand?', 'Why is the iPhone 6 so expensive?', 'Are iPhones seriously worth the price?', 'Are Apple iPhones worth the price?', 'Why is the iPhone 6s so expensive?', 'Is the iPhone really worth its price?', 'Is iPhone re