# **Multi-Label Text Classification: Sparse Transfer Learning with the Python API**

In this example, you will fine-tune a 90% pruned BERT model onto the GoEmotions dataset (a multi-label classification problem) using SparseML's Hugging Face integration.

### **Sparse Transfer Learning Overview**

Sparse Transfer Learning is very similar to the typical fine-tuning you are used to when training models. The difference is that with Sparse Transfer Learning, we start the training process from a pre-sparsified checkpoint and maintain the sparsity structure while the fine-tuning occurs.
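To make "maintaining the sparsity structure" concrete, here is a minimal pure-Python sketch of a masked gradient step. The helper names (`build_mask`, `masked_sgd_step`, `sparsity`) and the toy numbers are hypothetical and only illustrate the idea; they are not SparseML's implementation.

```python
def build_mask(weights):
    """Mark pruned (zero) weights with 0.0 and trainable weights with 1.0."""
    return [0.0 if w == 0.0 else 1.0 for w in weights]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Plain SGD update, except that masked-out weights stay pinned at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy "pre-sparsified checkpoint": 8 of 10 weights are already pruned.
weights = [0.0, 0.5, 0.0, 0.0, -0.3, 0.0, 0.0, 0.0, 0.0, 0.0]
grads = [0.2, -0.1, 0.4, -0.3, 0.1, 0.2, -0.2, 0.1, 0.3, -0.1]

mask = build_mask(weights)  # computed once, from the starting checkpoint
updated = masked_sgd_step(weights, grads, mask)

print("sparsity before:", sparsity(weights))
print("sparsity after: ", sparsity(updated))
```

The two unpruned weights move with their gradients, while every pruned position ignores its gradient and stays at zero, so the sparsity level is unchanged after the update.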

At the end, you will have a sparse model trained on your dataset, ready to be deployed with DeepSparse for GPU-class performance on CPUs!

### **Pre-Sparsified BERT**
SparseZoo, Neural Magic's open source repository of pre-sparsified models, contains a 90% pruned version of BERT, which has been sparsified on the upstream Wikipedia and BookCorpus datasets with the
masked language modeling objective. [Check out the model card](https://sparsezoo.neuralmagic.com/models/nlp%2Fmasked_language_modeling%2Fobert-base%2Fpytorch%2Fhuggingface%2Fwikipedia_bookcorpus%2Fpruned90-none). We will use this model as the starting point for the transfer learning process.


**Let's dive in!**

## **Installation**

Install SparseML via `pip`.



In [None]:
!pip install sparseml[transformers]

If you are running on Google Colab, restart the runtime after this step.

In [None]:
import sparseml
from sparsezoo import Model
from sparseml.transformers.utils import SparseAutoModel
from sparseml.transformers.sparsification import Trainer, TrainingArguments
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoConfig, 
    AutoTokenizer, 
    EvalPrediction, 
    default_data_collator
)
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support

## **Step 1: Load a Dataset**

SparseML is integrated with Hugging Face, so we can use the `datasets` library to load datasets from the Hugging Face Hub or from local files.

[GoEmotions Dataset Card](https://huggingface.co/datasets/go_emotions)

In [None]:
# load dataset from HF Hub
dataset = load_dataset("go_emotions", "simplified")

# alternatively, save the splits as local JSON files and load from those
# (this mirrors how you would load your own dataset from disk)
dataset["train"].to_json("go_emotions-train.json")
dataset["validation"].to_json("go_emotions-validation.json")
data_files = {}
data_files["train"] = "go_emotions-train.json"
data_files["validation"] = "go_emotions-validation.json"

dataset_from_json = load_dataset('json', data_files=data_files)

In [None]:
# configs for below
INPUT_COL_1 = "text"
INPUT_COL_2 = None
LABEL_COL = "labels"

all_labels = set()
for i in range(len(dataset_from_json["train"])):
  for l in dataset_from_json["train"][i]["labels"]:
    all_labels.add(l)
NUM_LABELS = len(all_labels)

Each example has a `text` field (a string) and a `labels` field (a list of integers, each identifying one of 28 possible emotions). A single text can carry several labels, so this is a multi-label classification problem in which we predict each of the 28 classes independently.
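As a quick illustration of this structure, the sketch below counts label frequencies and multi-label rows on a handful of made-up records shaped like the GoEmotions rows. The records are hypothetical placeholders, not real dataset entries.

```python
from collections import Counter

# Hypothetical records with the same fields as the GoEmotions "simplified" rows.
records = [
    {"text": "hypothetical example one", "labels": [27]},
    {"text": "hypothetical example two", "labels": [8, 20]},
    {"text": "hypothetical example three", "labels": [0]},
    {"text": "hypothetical example four", "labels": [8]},
]

# How often each label id occurs across all rows.
label_counts = Counter(label for row in records for label in row["labels"])

# Rows that carry more than one label at once.
num_multi_label = sum(1 for row in records if len(row["labels"]) > 1)

print("distinct labels:", len(label_counts))
print("rows with several labels:", num_multi_label)
```

Running the same two expressions over `dataset_from_json["train"]` would give the label distribution of the real training split.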

In [None]:
!head go_emotions-train.json --lines=10

{"text":"My favourite food is anything I didn't have to cook myself.","labels":[27],"id":"eebbqej"}
{"text":"Now if he does off himself, everyone will think hes having a laugh screwing with people instead of actually dead","labels":[27],"id":"ed00q6i"}
{"text":"WHY THE FUCK IS BAYLESS ISOING","labels":[2],"id":"eezlygj"}
{"text":"To make her feel threatened","labels":[14],"id":"ed7ypvh"}
{"text":"Dirty Southern Wankers","labels":[3],"id":"ed0bdzj"}
{"text":"OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe PlAyOfFs! Dumbass Broncos fans circa December 2015.","labels":[26],"id":"edvnz26"}
{"text":"Yes I heard abt the f bombs! That has to be why. Thanks for your reply:) until then hubby and I will anxiously wait \ud83d\ude1d","labels":[15],"id":"ee3b6wu"}
{"text":"We need more boards and to create a bit more space for [NAME]. Then we\u2019ll be good.","labels":[8,20],"id":"ef4qmod"}
{"text":"Damn youtube and outrage drama is super lucrative for reddit","labels":[0],"id":"ed8wbdn"}

## **Step 2: Setup Evaluation Metric**

GoEmotions is a multi-label classification problem, where each label is predicted independently and sequences can have multiple labels. We will evaluate class-level precision, recall, and F1, and also aggregate them with a simple (macro) average across classes.

Since SparseML is integrated with Hugging Face, we can define a `compute_metrics` function for evaluation and pass it to the `Trainer` class below.

In [None]:
# helper function for computing per-class and macro-averaged precision, recall, and F1
def compute_p_r_f1(targets, predictions):
  precision, recall, f1, _ = precision_recall_fscore_support(targets, predictions)

  # compile results into required str -> float dict
  results = {}
  for idx in range(predictions.shape[1]):
    results[f"precision_{idx}"] = precision[idx]
    results[f"recall_{idx}"] = recall[idx]
    results[f"f1_{idx}"] = f1[idx]

  # add macro averages and std to results
  results["precision_macro_average"] = precision.mean()
  results["recall_macro_average"] = recall.mean()
  results["f1_macro_average"] = f1.mean()

  results["precision_std"] = precision.std()
  results["recall_std"] = recall.std()
  results["f1_std"] = f1.std()

  return results

# we predict each label independently
THRESHOLD = 0.3 # < per go_emotions paper
def compute_metrics(p: EvalPrediction):
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds_sigmoid = 1 / (1 + np.exp(-preds))
  multi_label_preds = (preds_sigmoid > THRESHOLD).astype(np.float32)
  # scikit-learn expects (y_true, y_pred), so pass the labels first
  return compute_p_r_f1(p.label_ids, multi_label_preds)
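To see what the thresholding and per-class scoring do, here is a dependency-free sketch on a toy batch of three examples and three classes. The logits and targets are made up, and `per_class_precision_recall` is a hypothetical stand-in for the scikit-learn call, treating undefined ratios as 0.0.

```python
import math

THRESHOLD = 0.3


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binarize(logits_batch, threshold=THRESHOLD):
    """Turn raw logits into independent 0/1 decisions, one per class."""
    return [[1 if sigmoid(z) > threshold else 0 for z in row] for row in logits_batch]


def per_class_precision_recall(targets, preds):
    """Return a (precision, recall) pair for every class column."""
    scores = []
    for c in range(len(targets[0])):
        tp = sum(1 for t, p in zip(targets, preds) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(targets, preds) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(targets, preds) if t[c] == 1 and p[c] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append((precision, recall))
    return scores


# Toy batch: 3 examples, 3 classes.
logits = [[2.0, -3.0, 0.0], [-1.0, 1.5, -0.5], [0.5, -2.0, -4.0]]
targets = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

preds = binarize(logits)
scores = per_class_precision_recall(targets, preds)
macro_precision = sum(p for p, _ in scores) / len(scores)
macro_recall = sum(r for _, r in scores) / len(scores)

print(preds)   # [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
print(scores)  # [(1.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
```

Because the threshold is 0.3 rather than 0.5, a logit of -0.5 (sigmoid of about 0.38) still counts as a positive prediction, which is why the third class picks up two false positives here.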

## **Step 3: Download Files for Sparse Transfer Learning**

First, we need to select a sparse checkpoint to begin the training process. In this case, we will fine-tune a 90% pruned version of BERT onto the GoEmotions dataset. This model is available in SparseZoo, identified by the following stub:
```
zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none
```

Next, we need a sparsification recipe to use in the training process. Recipes are YAML files that encode the sparsity-related algorithms and parameters to be applied by SparseML. For Sparse Transfer Learning, we need a recipe that instructs SparseML to maintain sparsity during the training process and to apply quantization over the final few epochs.

In the case of GoEmotions, there is a transfer learning recipe available in the SparseZoo, identified by the following stub:

```
zoo:nlp/multilabel_text_classification/obert-base/pytorch/huggingface/goemotions/pruned90_quant-none
```

Finally, SparseML has the optional ability to apply model distillation from a teacher model during the transfer learning process to boost accuracy. In this case, we will use a dense version of BERT trained on the GoEmotions dataset which is hosted in SparseZoo. This model is identified by the following stub:

```
zoo:nlp/multilabel_text_classification/obert-base/pytorch/huggingface/goemotions/base-none
```

Use the `sparsezoo` Python client to download the models and recipe using their SparseZoo stubs.

In [None]:
# 90% pruned upstream BERT trained on MLM objective
model_stub = "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none" 
model_path = Model(model_stub, download_path="./model").training.path 

# dense BERT trained on Go Emotions
teacher_stub = "zoo:nlp/multilabel_text_classification/obert-base/pytorch/huggingface/goemotions/base-none"
teacher_path = Model(teacher_stub, download_path="./teacher").training.path 

# download transfer recipe for GoEmotions
transfer_stub = "zoo:nlp/multilabel_text_classification/obert-base/pytorch/huggingface/goemotions/pruned90_quant-none"
recipe_path = Model(transfer_stub, download_path="./transfer_recipe").recipes.default.path

We can see that the upstream model (trained on Wikipedia and BookCorpus) and its configuration files have been downloaded to the local directory.

In [None]:
%ls ./model/training

all_results.json   special_tokens_map.json  trainer_state.json  vocab.txt
config.json        tokenizer_config.json    training_args.bin
pytorch_model.bin  tokenizer.json           train_results.json


We can see that a transfer learning recipe has been downloaded. The `ConstantPruningModifier` instructs SparseML to maintain the sparsity structure of the network as the model trains and the `QuantizationModifier` instructs SparseML to run Quantization Aware Training at the end of training.

In [None]:
%cat ./transfer_recipe/recipe/recipe_original.md

#### **Inspecting the Recipe**

Here is the transfer learning recipe:

```yaml
version: 1.1.0

# General Variables
num_epochs: 9.0

transfer_init_lr: 3.5e-5
transfer_final_lr: 1e-8

transfer_distill_hardness: 0.6
transfer_distill_temperature: 5.0

transfer_weight_decay: 0.09

qat_start_epoch: 6.0
qat_warm_epoch: 6.1

qat_init_lr: 5e-5
qat_final_lr: 1e-8

qat_observer_epoch: 8.0
qat_quantize_embeddings: 1

qat_distill_hardness: 0.4
qat_distill_temperature: 5.0

qat_weight_decay: 0.09

# Modifiers:

training_modifiers:
  - !EpochRangeModifier
      start_epoch: 0.0
      end_epoch: eval(num_epochs)

  - !LearningRateFunctionModifier
      start_epoch: 0.0
      end_epoch: eval(qat_start_epoch)
      lr_func: linear
      init_lr: eval(transfer_init_lr)
      final_lr: eval(transfer_final_lr)

  - !SetLearningRateModifier
      start_epoch: eval(qat_start_epoch)
      learning_rate: eval(qat_final_lr)

  - !LearningRateFunctionModifier
      start_epoch: eval(qat_start_epoch)
      end_epoch: eval(qat_warm_epoch)
      lr_func: linear
      init_lr: eval(qat_final_lr)
      final_lr: eval(qat_init_lr)

  - !LearningRateFunctionModifier
      start_epoch: eval(qat_warm_epoch)
      end_epoch: eval(num_epochs)
      lr_func: linear
      init_lr: eval(qat_init_lr)
      final_lr: eval(qat_final_lr)
    
quantization_modifiers:
  - !QuantizationModifier
      start_epoch: eval(qat_start_epoch)
      disable_quantization_observer_epoch: eval(qat_observer_epoch)
      freeze_bn_stats_epoch: eval(qat_observer_epoch)
      quantize_embeddings: eval(qat_quantize_embeddings)
      quantize_linear_activations: 0
      custom_quantizable_module_types: ['GELUActivation']
      exclude_module_types: ['LayerNorm', 'GELUActivation', 'Tanh']
      submodules:
        - bert.embeddings
        - bert.encoder
        - bert.pooler
        - classifier

distillation_modifiers:
  - !DistillationModifier
      start_epoch: 0.0
      end_epoch: eval(qat_start_epoch)
      hardness: eval(transfer_distill_hardness)
      temperature: eval(transfer_distill_temperature)
      distill_output_keys: [logits]
  - !DistillationModifier
      start_epoch: eval(qat_start_epoch)
      hardness: eval(qat_distill_hardness)
      temperature: eval(qat_distill_temperature)
      distill_output_keys: [logits]

constant_modifiers:
  - !ConstantPruningModifier
      start_epoch: 0.0
      params: __ALL_PRUNABLE__

regularization_modifiers:
  - !SetWeightDecayModifier
      start_epoch: 0.0
      end_epoch: eval(qat_start_epoch)
      weight_decay: eval(transfer_weight_decay)

  - !SetWeightDecayModifier
      start_epoch: eval(qat_start_epoch)
      weight_decay: eval(qat_weight_decay)
```
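To make the learning rate modifiers concrete, here is a small sketch that traces the schedule the recipe describes. It assumes each `LearningRateFunctionModifier` interpolates linearly between `init_lr` and `final_lr` over its epoch range and that the later modifier takes over at each boundary. The `learning_rate` helper is our own illustration, not a SparseML function.

```python
# Values copied from the recipe variables above.
NUM_EPOCHS = 9.0
QAT_START, QAT_WARM = 6.0, 6.1
TRANSFER_INIT_LR, TRANSFER_FINAL_LR = 3.5e-5, 1e-8
QAT_INIT_LR, QAT_FINAL_LR = 5e-5, 1e-8


def linear(epoch, start, end, init_lr, final_lr):
    """Linearly interpolate from init_lr (at start) to final_lr (at end)."""
    frac = (epoch - start) / (end - start)
    return init_lr + frac * (final_lr - init_lr)


def learning_rate(epoch):
    """Learning rate implied by the recipe at a (fractional) epoch."""
    if epoch < QAT_START:
        # Transfer phase: decay from 3.5e-5 toward 1e-8.
        return linear(epoch, 0.0, QAT_START, TRANSFER_INIT_LR, TRANSFER_FINAL_LR)
    if epoch < QAT_WARM:
        # Short warm-up at the start of QAT: ramp from 1e-8 up to 5e-5.
        return linear(epoch, QAT_START, QAT_WARM, QAT_FINAL_LR, QAT_INIT_LR)
    # Remainder of QAT: decay from 5e-5 back toward 1e-8.
    return linear(min(epoch, NUM_EPOCHS), QAT_WARM, NUM_EPOCHS, QAT_INIT_LR, QAT_FINAL_LR)


for epoch in (0.0, 3.0, 6.0, 6.1, 9.0):
    print(f"epoch {epoch:>4}: lr = {learning_rate(epoch):.3e}")
```

The schedule therefore has two decay phases separated by a brief warm-up at epoch 6, when quantization-aware training begins.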


The `Modifiers` in the transfer learning recipe are the important items that encode how SparseML should modify the training process for Sparse Transfer Learning:
- `ConstantPruningModifier` tells SparseML to pin weights at 0 over all epochs, maintaining the sparsity structure of the network
- `QuantizationModifier` tells SparseML to quantize the weights with quantization-aware training over the final 3 epochs (from `qat_start_epoch: 6.0` to `num_epochs: 9.0`)
- `DistillationModifier` tells SparseML how to apply distillation during the training process, targeting the logits

Below, SparseML's `Trainer` parses the modifiers and updates the training process to implement the algorithms specified here.
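For intuition about what quantization-aware training simulates, the sketch below applies "fake quantization" to a few weights: values are rounded to an 8-bit integer grid and immediately converted back to floats. This is a simplified symmetric, per-tensor scheme chosen for illustration. The `fake_quantize` helper is hypothetical, and the scheme SparseML and PyTorch actually use (observers, per-layer settings) may differ.

```python
def fake_quantize(values, num_bits=8):
    """Round values to a signed integer grid, then map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [q * scale for q in quantized]
    return dequantized, quantized, scale


# Toy weights, including one pruned (zero) entry.
weights = [0.0, 0.51, -0.31, 0.127]
dequantized, quantized, scale = fake_quantize(weights)

print("integer codes:", quantized)
print("largest rounding error:", max(abs(w - d) for w, d in zip(weights, dequantized)))
```

Two things are worth noticing: the rounding error of each weight is at most half the step size (`scale / 2`), and a pruned weight at exactly 0 maps to integer 0, so quantizing in this way does not disturb the sparsity structure.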

## **Step 4: Setup Hugging Face Model Objects**

Next, we will set up the Hugging Face tokenizer, config, and model.

These are all native Hugging Face objects, so check out the Hugging Face docs for more details on `AutoModel`, `AutoConfig`, and `AutoTokenizer` as needed. 

We instantiate these classes by passing the local path to the directory containing the `pytorch_model.bin`, `tokenizer.json`, and `config.json` files from the SparseZoo download.

In [None]:
# initialize config, with multi_label_classification problem type
config = AutoConfig.from_pretrained(model_path, 
                                    num_labels=NUM_LABELS,
                                    problem_type="multi_label_classification")

teacher_config = AutoConfig.from_pretrained(teacher_path, 
                                    num_labels=NUM_LABELS,
                                    problem_type="multi_label_classification")

# initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# initialize model using familiar HF AutoModel
model_kwargs = {"config": config}
model_kwargs["state_dict"], s_delayed = SparseAutoModel._loadable_state_dict(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, **model_kwargs,)
SparseAutoModel.log_model_load(model, model_path, "student", s_delayed)     # prints metrics on sparsity profile

# initialize teacher using familiar HF AutoModel
teacher_kwargs = {"config": teacher_config}
teacher_kwargs["state_dict"], t_delayed = SparseAutoModel._loadable_state_dict(teacher_path)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_path, **teacher_kwargs,)
SparseAutoModel.log_model_load(teacher, teacher_path, "teacher", t_delayed) # prints metrics on sparsity profile

## **Step 5: Tokenize Dataset**

Run the tokenizer on the dataset. This is standard Hugging Face functionality. We also map the lists of label indices to one-hot (multi-hot) label vectors of length `NUM_LABELS`.

In [None]:
MAX_SEQ_LEN = 30

# helper function
def one_hot_labels(target_labels):
  # use 1 - 1e-9 for now as workaround target, values get cast to int somewhere
  # when encoded as 1.0/0.0. must be float for compatibility w/ BCE logits loss
  return [
      1 - 1e-9 if label in target_labels else 0.0 for label in range(NUM_LABELS)
  ]

# preprocessing function
def preprocess_fn(examples):
  args = None
  if INPUT_COL_2 is None:
    args = (examples[INPUT_COL_1], )
  else:
    args = (examples[INPUT_COL_1], examples[INPUT_COL_2])
  result = tokenizer(*args, 
                   padding="max_length", 
                   max_length=min(tokenizer.model_max_length, MAX_SEQ_LEN), 
                   truncation=True)
  
  result[LABEL_COL] = [
      one_hot_labels(target_labels) for target_labels in examples[LABEL_COL]
  ]
  return result

# tokenize the dataset
tokenized_dataset = dataset_from_json.map(
    preprocess_fn,
    batched=True,
    desc="Running tokenizer on dataset",
)

Running tokenizer on dataset:   0%|          | 0/44 [00:00<?, ?ba/s]

Running tokenizer on dataset:   0%|          | 0/6 [00:00<?, ?ba/s]

In [None]:
print(tokenized_dataset["train"][0][LABEL_COL])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.999999999]


In [None]:
print(tokenized_dataset["validation"][0][LABEL_COL])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.999999999]
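The `1 - 1e-9` target can look odd, so here is a small standalone sketch of the encoding. It also checks what happens if the value is later stored as a 32-bit float. That cast is an assumption about the downstream pipeline, and `to_float32` is a hypothetical helper built on the standard `struct` module.

```python
import struct

NUM_LABELS = 28


def one_hot_labels(target_labels, num_labels=NUM_LABELS):
    """Same encoding as above: a float vector with (almost) 1 at each active label."""
    return [1 - 1e-9 if label in target_labels else 0.0 for label in range(num_labels)]


def to_float32(value):
    """Round-trip a Python float through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


encoded = one_hot_labels([8, 20])
active = [idx for idx, value in enumerate(encoded) if value > 0]

print("active positions:", active)                  # [8, 20]
print("as float64:", encoded[8])
print("as float32:", to_float32(encoded[8]))        # 1.0
```

In 64-bit precision the target is slightly below 1, which keeps it a float, but it is far closer to 1.0 than float32 can resolve, so under the float32 assumption the loss would see an ordinary target of 1.0.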


## **Step 6: Run Training**

SparseML has a custom `Trainer` class that inherits from the [Hugging Face `Trainer` Class](https://huggingface.co/docs/transformers/main_classes/trainer). As such, the SparseML `Trainer` has all of the existing functionality of the HF trainer. However, in addition, we can supply a `recipe` and (optionally) a `teacher`. 


As we saw above, the `recipe` encodes the sparsity-related algorithms and hyperparameters of the training process in a YAML file. The SparseML `Trainer` parses the `recipe` and adjusts the training workflow to apply the algorithms in the recipe.

The `teacher` is an optional argument that instructs SparseML to apply model distillation to support the training process.
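To give a feel for how the recipe's `hardness` and `temperature` values enter the loss, here is a sketch of a common distillation formulation: a KL divergence between temperature-softened teacher and student distributions, blended with the ordinary task loss. Whether SparseML's `DistillationModifier` computes exactly this is an assumption, and all function names below are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(task_loss, student_logits, teacher_logits, hardness, temperature):
    """Blend the task loss with a teacher-matching term, weighted by hardness."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    return (1.0 - hardness) * task_loss + hardness * kd_term


# Values from the transfer phase of the recipe: hardness 0.6, temperature 5.0.
teacher_logits = [2.0, -1.0, 0.5]
matching = distillation_loss(0.5, [2.0, -1.0, 0.5], teacher_logits, 0.6, 5.0)
differing = distillation_loss(0.5, [-1.0, 2.0, 0.5], teacher_logits, 0.6, 5.0)

print("student matches teacher:", matching)
print("student disagrees:      ", differing)
```

When the student reproduces the teacher's logits, the distillation term vanishes and only the down-weighted task loss remains. Any disagreement with the teacher adds a positive penalty on top.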

In [None]:
# setup trainer arguments
training_args = TrainingArguments(
    output_dir="./training_output",
    do_train=True,
    do_eval=True,
    resume_from_checkpoint=False,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=False)

# initialize trainer
trainer = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    teacher=teacher,
    metadata_args=["per_device_train_batch_size","per_device_eval_batch_size","fp16"],
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics)

In [None]:
# step 6: run training
%rm -rf training_output
train_result = trainer.train(resume_from_checkpoint=False)
trainer.save_model()
trainer.save_state()
trainer.save_optimizer_and_scheduler(training_args.output_dir)

## **Step 7: Export To ONNX**

Run the following to export the model to ONNX. The script creates a `deployment` folder containing the ONNX file and the configuration files (e.g. `tokenizer.json`) needed for deployment with DeepSparse.

In [None]:
!sparseml.transformers.export_onnx \
  --model_path training_output \
  --task text_classification