# SetFit Text Classification Hyperparameter Search

In this notebook, we'll learn how to do hyperparameter search with SetFit.

## Setup

If you're running this notebook on Colab or another cloud platform, you will need to install the `setfit` and `optuna` libraries. Uncomment the following cell and run it:

In [1]:
# %pip install setfit[optuna] matplotlib

In [1]:
import numpy as np
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

First, we prepare the dataset. For details, see the multilabel training notebook.

In [2]:
def encode_labels(record):
    return {"labels": [record[feature] for feature in features]}


num_samples = 8
dataset = load_dataset("ethos", "multilabel")

features = dataset["train"].column_names
features.remove("text")
samples = np.concatenate([np.random.choice(np.where(dataset["train"][f])[0], num_samples) for f in features])
dataset = dataset.map(encode_labels)
train_dataset = dataset["train"].select(samples)
eval_dataset = dataset["train"].select(np.setdiff1d(np.arange(len(dataset["train"])), samples))
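The sampling step above can be illustrated on toy data. The sketch below uses only the standard library and hypothetical label columns (`toy_labels` is not part of the ETHOS dataset). Note that `np.random.choice` draws with replacement by default, and a row can be positive for several labels, so the number of distinct training rows may be smaller than `num_samples` times the number of labels. The sketch draws without replacement per label and fixes a seed so that the split is reproducible; treat it as one possible variation, not the notebook's method.

```python
import random

# Toy stand-in for the label columns: one list of 0/1 flags per label (hypothetical data).
toy_labels = {
    "violence": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    "gender": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
}
num_samples = 2
rng = random.Random(0)  # fixed seed so the split is reproducible

train_indices = set()
for name, flags in toy_labels.items():
    # Row indices where this label is set, analogous to np.where(...)[0]
    positives = [i for i, flag in enumerate(flags) if flag]
    # sample() draws without replacement, so no row is picked twice for one label
    train_indices.update(rng.sample(positives, num_samples))

# Everything not used for training becomes the evaluation set (analogous to np.setdiff1d)
all_indices = range(len(toy_labels["violence"]))
eval_indices = [i for i in all_indices if i not in train_indices]

print(sorted(train_indices), eval_indices)
```

Because a row can still be chosen for two different labels, collecting the indices in a set keeps each training row only once.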

Reusing dataset ethos (/home/lewis/.cache/huggingface/datasets/ethos/multilabel/1.0.0/898d3d005459ee3ff80dbeec2f169c6b7ea13de31a08458193e27dec3dd9ae38)



## Hyperparameter search
For a hyperparameter search we need several changes to the normal training setup:

* Instead of a model, we pass a `model_init` function, which can optionally use the dictionary of hyperparameters
* We set up a function that defines which parameters we are interested in optimizing

In [3]:
model_id = "sentence-transformers/paraphrase-mpnet-base-v2"


def make_model(params=None):
    multi_target_strategy = params["multi_target_strategy"] if params else "one-vs-rest"
    return SetFitModel.from_pretrained(
        model_id, multi_target_strategy=multi_target_strategy
    )
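To make the factory pattern concrete without downloading a model, here is a minimal sketch with a hypothetical `StubModel` class standing in for `SetFitModel` (both `StubModel` and `make_stub_model` are illustrative names, not part of SetFit). It shows that the factory falls back to a default when called without hyperparameters and ignores keys it does not need. It uses `dict.get` so that a dictionary lacking `multi_target_strategy` also falls back to the default, whereas indexing with `params["multi_target_strategy"]` would raise a `KeyError` in that case.

```python
class StubModel:
    """Hypothetical stand-in for SetFitModel, used only to show the factory pattern."""

    def __init__(self, model_id, multi_target_strategy):
        self.model_id = model_id
        self.multi_target_strategy = multi_target_strategy


def make_stub_model(params=None):
    # Treat a missing dictionary the same as an empty one
    params = params or {}
    # Fall back to the default when the key is absent
    strategy = params.get("multi_target_strategy", "one-vs-rest")
    return StubModel("some-model-id", strategy)


# Called without hyperparameters: the default strategy is used
default_model = make_stub_model()

# Called with a trial's hyperparameters: unrelated keys are simply ignored here
trial_model = make_stub_model(
    {"learning_rate": 1e-4, "multi_target_strategy": "classifier-chain"}
)

print(default_model.multi_target_strategy, trial_model.multi_target_strategy)
```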

In [4]:
trainer = SetFitTrainer(
    model_init=make_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_epochs=1,
    num_iterations=5,
    column_mapping={"text": "text", "labels": "label"},
)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In this search, we look for the best learning rate, `multi_target_strategy`, and batch size when training for only 5 iterations on 8 samples per class:

In [5]:
def hyperparameter_search_function(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32]),
        "multi_target_strategy": trial.suggest_categorical(
            "multi_target_strategy", ["one-vs-rest", "multi-output", "classifier-chain"]
        ),
    }
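With `log=True`, `suggest_float` samples the learning rate from the log domain, so each order of magnitude in the range is equally likely. The sketch below reproduces that distribution with the standard library only. `sample_log_uniform` and `sample_search_space` are hypothetical helpers, and plain random sampling is used here purely to illustrate the search space; it is not a reimplementation of Optuna's sampler.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_search_space(rng):
    # Hypothetical stand-in for hyperparameter_search_function, without Optuna
    return {
        "learning_rate": sample_log_uniform(rng, 1e-5, 1e-3),
        "batch_size": rng.choice([4, 8, 16, 32]),
        "multi_target_strategy": rng.choice(
            ["one-vs-rest", "multi-output", "classifier-chain"]
        ),
    }


rng = random.Random(0)
candidates = [sample_search_space(rng) for _ in range(1000)]

# 1e-4 is the midpoint of [1e-5, 1e-3] in log space, so roughly half of the
# sampled learning rates should fall below it
below_1e4 = sum(c["learning_rate"] < 1e-4 for c in candidates) / len(candidates)
print(f"fraction of learning rates below 1e-4: {below_1e4:.2f}")
```

A linear-scale sample over the same range would instead put about 90% of the draws above 1e-4, which is why a log scale is the usual choice for learning rates.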

Now we are ready to run the hyperparameter search. The default settings maximize accuracy, which is fine here. We pass `n_trials=10` explicitly, which matches the default; since we are searching three parameters, a larger value may explore the space more thoroughly:

In [6]:
best = trainer.hyperparameter_search(hyperparameter_search_function, n_trials=10)
best

[I 2022-11-02 17:18:47,494] A new study created in memory with name: no-name-4c556f33-0ba8-4d70-b3f9-0ada38f0fc34
Trial: {'learning_rate': 0.00010629008736634152, 'batch_size': 8, 'multi_target_strategy': 'multi-output'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 113
  Total train batch size = 8



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:19:11,914] Trial 0 finished with value: 0.32978723404255317 and parameters: {'learning_rate': 0.00010629008736634152, 'batch_size': 8, 'multi_target_strategy': 'multi-output'}. Best is trial 0 with value: 0.32978723404255317.
Trial: {'learning_rate': 0.0003604528096739407, 'batch_size': 4, 'multi_target_strategy': 'classifier-chain'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 225
  Total train batch size = 4



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:19:35,114] Trial 1 finished with value: 0.010638297872340425 and parameters: {'learning_rate': 0.0003604528096739407, 'batch_size': 4, 'multi_target_strategy': 'classifier-chain'}. Best is trial 0 with value: 0.32978723404255317.
Trial: {'learning_rate': 3.824697536038996e-05, 'batch_size': 8, 'multi_target_strategy': 'classifier-chain'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 113
  Total train batch size = 8



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:19:58,174] Trial 2 finished with value: 0.3377659574468085 and parameters: {'learning_rate': 3.824697536038996e-05, 'batch_size': 8, 'multi_target_strategy': 'classifier-chain'}. Best is trial 2 with value: 0.3377659574468085.
Trial: {'learning_rate': 5.746413410832957e-05, 'batch_size': 16, 'multi_target_strategy': 'multi-output'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 57
  Total train batch size = 16



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:20:25,896] Trial 3 finished with value: 0.2925531914893617 and parameters: {'learning_rate': 5.746413410832957e-05, 'batch_size': 16, 'multi_target_strategy': 'multi-output'}. Best is trial 2 with value: 0.3377659574468085.
Trial: {'learning_rate': 0.0003365182729005924, 'batch_size': 4, 'multi_target_strategy': 'one-vs-rest'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 225
  Total train batch size = 4



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:20:48,944] Trial 4 finished with value: 0.010638297872340425 and parameters: {'learning_rate': 0.0003365182729005924, 'batch_size': 4, 'multi_target_strategy': 'one-vs-rest'}. Best is trial 2 with value: 0.3377659574468085.
Trial: {'learning_rate': 0.0006737237417791992, 'batch_size': 8, 'multi_target_strategy': 'one-vs-rest'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 113
  Total train batch size = 8



Applying column mapping to evaluation dataset
***** Running evaluation *****
[I 2022-11-02 17:21:12,004] Trial 5 finished with value: 0.010638297872340425 and parameters: {'learning_rate': 0.0006737237417791992, 'batch_size': 8, 'multi_target_strategy': 'one-vs-rest'}. Best is trial 2 with value: 0.3377659574468085.
Trial: {'learning_rate': 1.667171552236077e-05, 'batch_size': 32, 'multi_target_strategy': 'classifier-chain'}
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to training dataset
***** Running training *****
  Num examples = 900
  Num epochs = 1
  Total optimization steps = 29
  Total train batch size = 32



[W 2022-11-02 17:21:14,273] Trial 6 failed because of the following error: OutOfMemoryError('CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.65 GiB total capacity; 22.32 GiB already allocated; 19.06 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF')
Traceback (most recent call last):
  File "/home/lewis/miniconda3/envs/setfit/lib/python3.10/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "/home/lewis/git/setfit/src/setfit/integrations.py", line 27, in _objective
    trainer.train(trial=trial)
  File "/home/lewis/git/setfit/src/setfit/trainer.py", line 352, in train
    self.model.model_body.fit(
  File "/home/lewis/miniconda3/envs/setfit/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 721, in fit
    loss_va

OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.65 GiB total capacity; 22.32 GiB already allocated; 19.06 MiB free; 22.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
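The step counts reported in the log above are consistent with one epoch over 900 training examples, with the last partial batch rounded up. The short check below reproduces them; the formula is inferred from the logged numbers rather than taken from the SetFit source, so treat it as an assumption.

```python
import math

num_examples = 900  # "Num examples" as reported in the training log

# One epoch: number of batches needed to cover all examples, rounding up
steps_per_batch_size = {
    batch_size: math.ceil(num_examples / batch_size)
    for batch_size in (4, 8, 16, 32)
}

for batch_size, steps in steps_per_batch_size.items():
    print(f"batch_size={batch_size:>2} -> {steps} optimization steps")
```

This gives 225, 113, 57, and 29 steps for batch sizes 4, 8, 16, and 32, matching the "Total optimization steps" lines of the corresponding trials. Halving the batch size doubles the number of steps, which matters when comparing trials that share the same learning rate range.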

We can examine the optimization results in more detail through the `backend` attribute of the returned result, which holds the underlying Optuna study:

In [None]:
from optuna.visualization.matplotlib import plot_param_importances

plot_param_importances(best.backend);

The final step is to train using the optimal parameters and check the model's performance again using the `evaluate()` method:

In [None]:
trainer.apply_hyperparameters(best.hyperparameters, final_model=True)  # replaces model_init with a fixed model
trainer.train()

In [None]:
metrics = trainer.evaluate()
best.objective, metrics