# SetFit testing

This notebook follows https://github.com/huggingface/setfit/blob/main/notebooks/text-classification_multilabel.ipynb almost exactly. I used it to test whether SetFit can be trained on CPU in a reasonable time; training took 30-40 minutes. The code from their notebook is tweaked to save the model locally rather than push it to the Hub.

In [5]:
from datasets import load_dataset

model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
#dataset = load_dataset("ethos", "multilabel")

Found cached dataset ethos (C:/Users/gkf18/.cache/huggingface/datasets/ethos/multilabel/1.0.0/898d3d005459ee3ff80dbeec2f169c6b7ea13de31a08458193e27dec3dd9ae38)



In [6]:
# I already had the dataset cached, so I saved it to disk to get a working notebook with data
#dataset.save_to_disk("ethos-dataset")


In [7]:
from datasets import load_from_disk

# Load the dataset from disk
dataset = load_from_disk("ethos-dataset")

In [8]:
import numpy as np

features = dataset["train"].column_names
features.remove("text")
features

['violence',
 'directed_vs_generalized',
 'gender',
 'race',
 'national_origin',
 'disability',
 'religion',
 'sexual_orientation']

In [9]:
num_samples = 8
samples = np.concatenate(
    [np.random.choice(np.where(dataset["train"][f])[0], num_samples) for f in features]
)
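Note that `np.random.choice` draws with replacement by default and is unseeded here, so the same row can be picked more than once and each run selects a different training set. A minimal sketch of a reproducible, duplicate-free alternative is below; the helper name `sample_per_label`, the `seed` parameter, and the toy label columns are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np


def sample_per_label(label_columns, num_samples, seed=0):
    """Pick up to `num_samples` positive rows per label, without duplicates.

    `label_columns` maps a label name to a 0/1 array (hypothetical input shape).
    """
    rng = np.random.default_rng(seed)  # seeded generator for repeatable draws
    picks = []
    for values in label_columns.values():
        positives = np.flatnonzero(values)  # row indices where the label is set
        k = min(num_samples, len(positives))
        if k > 0:
            picks.append(rng.choice(positives, size=k, replace=False))
    if not picks:
        return np.array([], dtype=int)
    # A row may be positive for several labels, so deduplicate across labels too.
    return np.unique(np.concatenate(picks))


# Toy data standing in for dataset["train"] columns.
toy = {
    "gender": np.array([1, 0, 1, 1, 0, 0]),
    "race": np.array([0, 1, 0, 1, 1, 0]),
}
indices = sample_per_label(toy, num_samples=2, seed=42)
print(indices)
```

With the real dataset, the dictionary could be built as `{f: np.asarray(dataset["train"][f]) for f in features}`.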

In [10]:
def encode_labels(record):
    return {"labels": [record[feature] for feature in features]}


dataset = dataset.map(encode_labels)


In [11]:
train_dataset = dataset["train"].select(samples)
eval_dataset = dataset["train"].select(
    np.setdiff1d(np.arange(len(dataset["train"])), samples)
)

In [12]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id, multi_target_strategy="one-vs-rest")


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [14]:
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=5,  # should be 20; using 5 for a quick training run to test the rest of the notebook
    column_mapping={"text": "text", "labels": "label"},
)

In [15]:
trainer.train()

Applying column mapping to training dataset
***** Running training *****
  Num examples = 970
  Num epochs = 1
  Total optimization steps = 61
  Total train batch size = 16



In [16]:
metrics = trainer.evaluate()
metrics

Applying column mapping to evaluation dataset
***** Running evaluation *****


{'accuracy': 0.3602150537634409}
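The single accuracy figure above is probably an exact-match (subset) score, where a row only counts as correct if all eight labels match. That is an assumption about how the `accuracy` metric treats multilabel output; the notebook itself does not confirm it. The sketch below, on made-up arrays, shows how much stricter exact match is than per-label accuracy.

```python
import numpy as np

# Toy multilabel targets and predictions (3 rows, 4 labels); not real results.
y_true = np.array([[0, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 0, 0]])
y_pred = np.array([[0, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 1, 1, 0]])

matches = y_true == y_pred

# Exact match: every label in the row must agree.
subset_accuracy = np.mean(np.all(matches, axis=1))

# Per-label accuracy: fraction of rows that agree, computed column by column.
per_label_accuracy = np.mean(matches, axis=0)

# Overall fraction of individual label decisions that agree.
label_level_accuracy = np.mean(matches)

print(subset_accuracy, per_label_accuracy, label_level_accuracy)
```

Here only one of three rows matches exactly, even though ten of the twelve individual label decisions are right, so a low exact-match score does not by itself mean most labels are wrong.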

In [17]:
trainer.model.save_pretrained("setfit-test-model-fasttest")  # remove -fasttest for the full model trained with 20 iterations

In [18]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("setfit-test-model-fasttest")  # remove -fasttest for the full model trained with 20 iterations

In [19]:
# this is a toxicity model
preds = model(
    [
        "something about feminism",
        "Something about race",
    ]
)
preds

tensor([[0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0]], dtype=torch.int32)

In [20]:
# Show predicted labels, requires you to have stored the 'features' somewhere
[[f for f, p in zip(features, ps) if p] for ps in preds]

[['gender'], ['race']]
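Since decoding depends on having `features` available, one option is to store the label names next to the saved model. The sketch below uses only the standard library; the file name `label_names.json` and the helper functions are hypothetical and not something SetFit provides, and a temporary directory stands in for the model directory.

```python
import json
import tempfile
from pathlib import Path

features = [
    "violence", "directed_vs_generalized", "gender", "race",
    "national_origin", "disability", "religion", "sexual_orientation",
]


def save_label_names(model_dir, names):
    """Write the label names to a JSON file inside the model directory."""
    Path(model_dir, "label_names.json").write_text(json.dumps(names))


def load_label_names(model_dir):
    """Read the label names back from the model directory."""
    return json.loads(Path(model_dir, "label_names.json").read_text())


def decode(pred_rows, names):
    """Turn rows of 0/1 predictions into lists of label names."""
    return [[n for n, p in zip(names, row) if p] for row in pred_rows]


with tempfile.TemporaryDirectory() as tmp:
    save_label_names(tmp, features)
    restored = load_label_names(tmp)

# Same prediction pattern as the output above, written as plain lists.
preds = [[0, 0, 1, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0, 0, 0]]
decoded = decode(preds, restored)
print(decoded)  # [['gender'], ['race']]
```

For the real model output, `preds.tolist()` would convert the tensor into the nested lists that `decode` expects.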