# SetFit

## Introduction

SetFit is a method for fine-tuning Sentence Transformers for text classification in a few-shot setting.
See the introductory blog post: https://huggingface.co/blog/setfit


It typically needs far fewer labeled examples than conventional task-specific fine-tuning of BERT-style models. For example, with only 8 labeled examples per class on the Customer Reviews (CR) sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples.

Its main features are:

1) No handcrafted prompts are needed, even though it is a few-shot method. It directly takes a small number of labeled classification examples (8-16 per class) to fine-tune a Sentence Transformer.
2) Fast to train
3) Multilingual support: any multilingual Sentence Transformer checkpoint can serve as the base model


## Details

It is a two-step training process:

1) Fine-tune a Sentence Transformer on a small number of labeled examples. This is done contrastively, using text-similarity pairs or triplets, as discussed in the sentence transformers notebook. The pairs are built from the labeled examples by sampling: positive pairs come from in-class sampling (two examples with the same label) and negative pairs from out-of-class sampling (two examples with different labels).
2) Once the Sentence Transformer is fine-tuned, encode the training examples with it and train a simple classification head on the resulting embeddings.
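
The in-class and out-of-class sampling in step 1 can be sketched in plain Python. This is a minimal illustration, not SetFit's own implementation: the function name `generate_pairs`, the toy texts, and the choice to exclude an example from being paired with itself are all assumptions made for the example.

```python
import random

def generate_pairs(texts, labels, num_iterations=1, seed=0):
    """Build (text_a, text_b, target) pairs: 1.0 for same class, 0.0 for different classes."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # In-class sampling: candidates share the anchor's label (anchor itself excluded).
            same = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            # Out-of-class sampling: candidates have a different label.
            diff = [t for t, l in zip(texts, labels) if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))
            if diff:
                pairs.append((text, rng.choice(diff), 0.0))
    return pairs

# Hypothetical toy data: two positive and two negative reviews.
texts = ["great phone", "love the screen", "battery died fast", "awful support"]
labels = [1, 1, 0, 0]
pairs = generate_pairs(texts, labels, num_iterations=2)
print(len(pairs))  # 4 examples x 2 pairs x 2 iterations = 16
```

Because every labeled example yields several pairs, a handful of examples per class expands into a much larger contrastive training set.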

Open question: why are the two steps not trained jointly?

During inference, an unseen sample is passed through the fine-tuned Sentence Transformer from step 1 to get an embedding, and that embedding is passed through the classification head from step 2 to get the final label.
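
The data flow at inference time can be sketched as follows. Everything here is a toy stand-in: `toy_encode` is a hypothetical replacement for the fine-tuned Sentence Transformer (a real one returns a dense vector with hundreds of dimensions), and the head weights are made-up values rather than trained ones. Only the shape of the pipeline, encoder first and head second, reflects the description above.

```python
import math

def toy_encode(text):
    """Hypothetical stand-in for the fine-tuned encoder from step 1: text -> embedding."""
    positive_words = {"great", "love", "good"}
    negative_words = {"bad", "awful", "broken"}
    tokens = text.lower().split()
    return [
        float(sum(tok in positive_words for tok in tokens)),
        float(sum(tok in negative_words for tok in tokens)),
    ]

def head_predict(embedding, weights, bias):
    """Logistic-regression style head from step 2: sigmoid(w . x + b), thresholded at 0.5."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    prob = 1.0 / (1.0 + math.exp(-z))
    return 1 if prob >= 0.5 else 0

def predict(text, weights=(2.0, -2.0), bias=0.0):
    # Inference = encode with the (frozen) encoder, then classify the embedding.
    return head_predict(toy_encode(text), weights, bias)

print(predict("great screen , love it"))  # 1
print(predict("awful battery , broken"))  # 0
```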

## Code

In [5]:
!pip install setfit

Successfully installed evaluate-0.4.0 responses-0.18.0 setfit-0.7.0


In [6]:
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

In [7]:
dataset = load_dataset("SetFit/SentEval-CR")

In [11]:
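# Take 8 * 2 = 16 random training examples (about 8 per class for the 2 classes)
# Note: shuffle + select does not guarantee exactly 8 examples of each class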
train_ds = dataset["train"].shuffle(seed=42).select(range(8 * 2))
test_ds = dataset["test"]

In [12]:
# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20, # Pair-generation iterations per example (each yields a positive and a negative pair)
    num_epochs=1 # Number of epochs to use for contrastive learning
)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
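
The `loss_class=CosineSimilarityLoss` argument passed to the trainer selects the objective for the contrastive step. As a rough sketch, that loss is the mean squared error between the cosine similarity of the two embeddings in a pair and the pair's target (1.0 for a positive pair, 0.0 for a negative pair). The version below works on plain Python lists with made-up 2-dimensional embeddings; the real loss operates on batched tensors produced by the encoder.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs):
    """Mean squared error between cos(u, v) and each pair's target (1.0 or 0.0)."""
    errors = [(cosine_similarity(u, v) - target) ** 2 for u, v, target in pairs]
    return sum(errors) / len(errors)

# Hypothetical embeddings chosen so each squared error is easy to read off.
batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # same direction, positive pair: error 0
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, negative pair: error 0
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # same direction, negative pair: error 1
]
loss = cosine_similarity_loss(batch)
print(loss)  # (0 + 0 + 1) / 3
```

Minimizing this pulls same-class embeddings toward the same direction and pushes different-class embeddings toward orthogonality, which is what makes a simple head sufficient afterwards.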


In [13]:
trainer.train()
metrics = trainer.evaluate()

***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 40
  Total train batch size = 16
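
The numbers in this training log can be reproduced from the trainer settings. The sketch below assumes each iteration generates one positive and one negative pair per training example, which is consistent with the logged 640 examples; the variable names are illustrative only.

```python
num_train_examples = 8 * 2  # rows selected for train_ds
num_iterations = 20         # SetFitTrainer(num_iterations=20)
pairs_per_example = 2       # assumption: one positive + one negative pair per iteration
batch_size = 16
num_epochs = 1

num_pairs = num_train_examples * num_iterations * pairs_per_example
steps = (num_pairs // batch_size) * num_epochs
print(num_pairs, steps)  # 640 40
```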


***** Running evaluation *****


## References

1) https://huggingface.co/blog/setfit
2) https://arxiv.org/abs/2209.11055