# SetFit Example

##### Hello! We will walk through a quick example of using SetFit for text classification. To complete this tutorial you need SetFit and the dependencies from requirements.txt installed in your environment. You can install them by running these cells:

In [None]:
!python -m pip install setfit
!python -m pip install -r requirements.txt

In [None]:
!pip install evaluate==0.1.2


##### Before we proceed, we must choose a model and the dataset we would like to classify. For this tutorial we'll use SST2, which is already available on the Hugging Face Hub as "SetFit/sst2".

In [1]:
from datasets import load_dataset



In [2]:
dataset = "SetFit/sst2"

##### We load the "train" and "test" portions of the data

In [3]:
train_sst2 = load_dataset(dataset, split="train")

Using custom data configuration SetFit--sst2-4811211b52125821
Reusing dataset json (/home/eun_seo_huggingface_co/.cache/huggingface/datasets/SetFit___json/SetFit--sst2-4811211b52125821/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)


In [4]:
test_sst2 = load_dataset(dataset, split="test")

Using custom data configuration SetFit--sst2-4811211b52125821
Reusing dataset json (/home/eun_seo_huggingface_co/.cache/huggingface/datasets/SetFit___json/SetFit--sst2-4811211b52125821/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)


In [5]:
from setfit.data import SAMPLE_SIZES, create_fewshot_splits

##### We now sample our data so that we have n examples for each class. We start with n = 4 in this case.

In [52]:
n = 4
fewshot_sst2 = create_fewshot_splits(train_sst2, [n])

##### create_fewshot_splits has sampled 10 different data splits, each with n = 4 examples per class.

In [23]:
for name in fewshot_sst2:
    print(name)

train-4-0
train-4-1
train-4-2
train-4-3
train-4-4
train-4-5
train-4-6
train-4-7
train-4-8
train-4-9
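
##### To make the sampling step concrete, here is a minimal pure-Python sketch of drawing n examples per class for several seeds. It is not the SetFit implementation: the function name sample_fewshot_splits, the base_seed parameter and the toy data are assumptions made for illustration. Only the "train-<n>-<index>" naming mirrors the output above.

```python
import random
from collections import defaultdict


def sample_fewshot_splits(examples, n_per_class, num_splits=10, base_seed=0):
    """Return {"train-<n>-<i>": [examples]} with n_per_class examples per label.

    Hypothetical helper for illustration; not part of setfit.
    """
    # Group the examples by their label.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    splits = {}
    for idx in range(num_splits):
        # A different seed per split gives a different random draw each time.
        rng = random.Random(base_seed + idx)
        split = []
        for label in sorted(by_label):
            pool = by_label[label]
            split.extend(rng.sample(pool, min(n_per_class, len(pool))))
        splits[f"train-{n_per_class}-{idx}"] = split
    return splits


# Toy binary dataset: 40 fake reviews with alternating labels.
toy = [{"text": f"review {i}", "label": i % 2} for i in range(40)]
splits = sample_fewshot_splits(toy, n_per_class=4)
print(list(splits)[:3])          # ['train-4-0', 'train-4-1', 'train-4-2']
print(len(splits["train-4-0"]))  # 8
```

##### With two classes and n = 4, every split holds 8 rows, which matches the num_rows: 8 reported for the real split further down.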


##### Let's try our SetFit test on just one run. We'll call it try1. This means we're training our model on just one run of 4 examples of each class.

In [55]:
try1 = 'train-4-0'
fewshot_sst2[try1]

Dataset({
    features: ['text', 'label', 'label_text'],
    num_rows: 8
})

##### We then download the backbone model, in this case "paraphrase-mpnet-base-v2" from Sentence Transformers

In [17]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("paraphrase-mpnet-base-v2")

##### There are many loss options to choose from but for this simple example we'll use SupConLoss. We can import SupConLoss from setfit.modeling.

In [18]:
from setfit.modeling import SupConLoss
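
##### For intuition, here is a minimal pure-Python sketch of a supervised contrastive loss: each anchor is pulled towards the other examples with the same label and pushed away from the rest. This is an illustrative sketch, not the setfit SupConLoss code. The function name supcon_loss, the temperature value 0.07 and the toy embeddings are assumptions.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.07):
    """Mean supervised contrastive loss over anchors that have a positive.

    Illustrative sketch; assumes every embedding is a non-zero vector.
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner contributes nothing
        # Temperature-scaled cosine similarity to every other example.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        # Average negative log-probability of the positives.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(toy_embeddings, [0, 0, 1, 1])   # labels match the clusters
shuffled = supcon_loss(toy_embeddings, [0, 1, 0, 1])  # labels cut across clusters
print(aligned < shuffled)  # True
```

##### The loss is small when same-label embeddings already sit close together and large when they do not, which is the signal used to fine-tune the sentence embeddings.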

##### We split the data into train/test and text/labels parts

In [26]:
x_train = fewshot_sst2[try1]["text"]
y_train = fewshot_sst2[try1]["label"]

In [27]:
x_test = test_sst2["text"]
y_test = test_sst2["label"]

##### We set the batch size to 16, the max sequence length to 256 and the number of epochs to 10

In [28]:
batch_size = 16
model.max_seq_length = 256
num_epochs = 10


##### Now we have to build our training data loader. For this, we'll import DataLoader from PyTorch and helper classes from Sentence Transformers

In [29]:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.datasets import SentenceLabelDataset

In [49]:
train_examples = [InputExample(texts=[text], label=label) for text, label in zip(x_train, y_train)]
train_data_sampler = SentenceLabelDataset(train_examples)
batch_size = min(batch_size, len(train_data_sampler))
train_dataloader = DataLoader(train_data_sampler, batch_size=batch_size, drop_last=True)

In [50]:
loss_class = SupConLoss
train_loss = loss_class(model=model)

In [37]:
train_steps = len(train_dataloader) * num_epochs

##### Set our warm-up steps...

In [38]:
import math
warmup_steps = math.ceil(train_steps*0.1)
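
##### As a worked example of the arithmetic above, the sketch below recomputes the step counts by hand. It assumes the data loader sees all 8 few-shot examples and that drop_last=True discards any incomplete batch. The names num_examples, requested_batch_size and batches_per_epoch are introduced here for illustration only.

```python
import math

num_examples = 8            # 4 examples per class, 2 classes
requested_batch_size = 16
num_epochs = 10

# The batch size is capped at the number of available examples.
batch_size = min(requested_batch_size, num_examples)

# With drop_last=True only full batches count towards the loader length.
batches_per_epoch = num_examples // batch_size

train_steps = batches_per_epoch * num_epochs
warmup_steps = math.ceil(train_steps * 0.1)  # 10% of the training steps

print(batch_size, batches_per_epoch, train_steps, warmup_steps)  # 8 1 10 1
```

##### Under these assumptions the model takes 10 optimizer steps in total, the first of which is a warm-up step.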

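##### And we finally train our model! Note that we pass epochs=1 and steps_per_epoch=train_steps, so all training steps run within a single epoch.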

In [39]:
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    steps_per_epoch=train_steps,
    warmup_steps=warmup_steps,
    show_progress_bar=False,
)

##### We are now ready to evaluate our model's performance. We'll be using accuracy as our evaluation metric.

In [41]:
from sklearn.linear_model import LogisticRegression
from setfit.modeling import SKLearnWrapper
from evaluate import load


In [42]:
metric = "accuracy"
clf = SKLearnWrapper(model, LogisticRegression())
metric_fn = load(metric)

In [45]:
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
metrics = metric_fn.compute(predictions=y_pred, references=y_test)

In [56]:
print(metrics)

{'accuracy': 0.8165842943437671}
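
##### The accuracy reported above is the fraction of test examples whose predicted label equals the reference label. A minimal sketch of that computation follows. The helper name accuracy and the toy labels are assumptions for illustration, and this is not the evaluate library's code.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


toy_pred = [1, 0, 1, 1]
toy_true = [1, 0, 0, 1]
print(accuracy(toy_pred, toy_true))  # 0.75
```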
