# **Token Classification: Sparse Transfer Learning with the Python API**

In this example, you will fine-tune a 90% pruned BERT model onto the WNUT NER dataset using SparseML's Hugging Face Integration.

### **Sparse Transfer Learning Overview**

Sparse Transfer Learning is very similar to the typical fine-tuning you are used to when training models. However, with Sparse Transfer Learning, we start the training process from a pre-sparsified checkpoint and maintain the sparsity structure while the fine-tuning occurs.

At the end, you will have a sparse model trained on your dataset, ready to be deployed with DeepSparse for GPU-class performance on CPUs!
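
As a rough illustration of what "maintaining the sparsity structure" means, the sketch below applies a few gradient-style updates to a toy weight matrix and re-applies a fixed mask after each one. The matrix, the random "gradients", and the learning rate are all made up for illustration. This is not SparseML's implementation, only a minimal NumPy sketch of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-sparsified weight matrix: roughly 90% of entries are zero.
weights = rng.normal(size=(20, 20))
weights[rng.random(weights.shape) < 0.9] = 0.0

# The sparsity structure is the set of positions that are currently non-zero.
mask = weights != 0.0


def sparsity(w):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(w == 0.0))


before = sparsity(weights)

# Simulate a few fine-tuning steps with made-up dense gradients.
for _ in range(5):
    grad = rng.normal(size=weights.shape)
    weights = weights - 0.01 * grad
    weights = weights * mask  # re-apply the mask so pruned weights stay at zero

after = sparsity(weights)
print(f"sparsity before: {before:.2f}, after: {after:.2f}")
```

The remaining non-zero weights are free to change, but every pruned position is still zero after training, so the sparsity level is unchanged.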

### **Pre-Sparsified BERT**
SparseZoo, Neural Magic's open source repository of pre-sparsified models, contains a 90% pruned version of BERT, which has been sparsified on the upstream Wikipedia and BookCorpus datasets with the
masked language modeling objective. [Check out the model card](https://sparsezoo.neuralmagic.com/models/nlp%2Fmasked_language_modeling%2Fobert-base%2Fpytorch%2Fhuggingface%2Fwikipedia_bookcorpus%2Fpruned90-none). We will use this model as the starting point for the transfer learning process.


**Let's dive in!**

## **Installation**

Install SparseML via `pip`.



In [None]:
!pip install sparseml[transformers]

If you are running on Google Colab, restart the runtime after this step.

In [None]:
import sparseml
from sparsezoo import Model
from sparseml.transformers.utils import SparseAutoModel
from sparseml.transformers.sparsification import Trainer, TrainingArguments
import numpy as np
from transformers import (
    AutoModelForTokenClassification,
    AutoConfig, 
    AutoTokenizer,
    EvalPrediction,
    DataCollatorForTokenClassification,
    PreTrainedTokenizerFast
)
from datasets import load_dataset, load_metric

## **Step 1: Load a Dataset**

SparseML is integrated with Hugging Face, so we can use the `datasets` library to load datasets from the Hugging Face hub or from local files.

[WNUT Dataset Card](https://huggingface.co/datasets/wnut_17)

In [None]:
# load dataset from HF hub
dataset = load_dataset("wnut_17")
dataset["train"].to_json("wnut_17-train.json")
dataset["validation"].to_json("wnut_17-validation.json")

# alternatively, load from JSONL file
data_files = {}
data_files["train"] = "wnut_17-train.json"
data_files["validation"] = "wnut_17-validation.json"
dataset_from_json = load_dataset('json', data_files=data_files)

We can see that the input column is `tokens`, a list of words, and the label column is `ner_tags`, a list of integers where each integer corresponds to a tag type.

In [None]:
!head wnut_17-train.json --lines=10
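
If you prefer to inspect the records from Python rather than with `head`, a sketch like the following may help. It assumes the exported file holds one JSON object per line with `tokens` and `ner_tags` fields. The sample record is hypothetical and is written to a temporary file so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record in the assumed shape of the exported file:
# a list of words plus one integer tag per word.
sample = {"id": "0", "tokens": ["Empire", "State", "Building"], "ner_tags": [7, 8, 8]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"
    path.write_text(json.dumps(sample) + "\n")

    # Read the file back line by line, one JSON object per line.
    with path.open() as f:
        records = [json.loads(line) for line in f if line.strip()]

first = records[0]
# Pair each word with its integer tag to make the alignment visible.
pairs = list(zip(first["tokens"], first["ner_tags"]))
print(pairs)
```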

## **Step 2: Setup Evaluation Metric**

WNUT is a NER task. We will use the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) metric to evaluate the accuracy of the pipeline. 

The seqeval metric needs to be passed tags rather than tag indexes, so we need to create a mapping between the indexes and the tags so that we can pass the tags to the seqeval metric.

Per the [WNUT dataset card](https://huggingface.co/datasets/wnut_17), the NER tags map to the following classes:

```
{
  0: "O", 
  1: "B-corporation", 
  2: "I-corporation", 
  3: "B-creative-work", 
  4: "I-creative-work", 
  5: "B-group", 
  6: "I-group", 
  7: "B-location", 
  8: "I-location", 
  9: "B-person", 
  10: "I-person", 
  11: "B-product", 
  12: "I-product"
}
```

In [None]:
# label mapping
LABEL_MAP = {
  0: "O", 
  1: "B-corporation", 
  2: "I-corporation", 
  3: "B-creative-work", 
  4: "I-creative-work", 
  5: "B-group", 
  6: "I-group", 
  7: "B-location", 
  8: "I-location", 
  9: "B-person", 
  10: "I-person", 
  11: "B-product", 
  12: "I-product"
}

# other configs
INPUT_COL = "tokens"
LABEL_COL = "ner_tags"
SPECIAL_TOKEN_ID = -100
NUM_LABELS = len(LABEL_MAP)


print(dataset_from_json)
print(dataset_from_json["train"][0][INPUT_COL])
print(dataset_from_json["train"][0][LABEL_COL])

In [None]:
# load evaluation metric - seqeval
metric = load_metric("seqeval")

# setup metrics function
def compute_metrics(p: EvalPrediction):
  predictions, labels = p
  predictions = np.argmax(predictions, axis=2)

  # Remove ignored index (special tokens) and convert indexed tags to labels
  true_predictions = [
    [LABEL_MAP[pred] for (pred, lab) in zip(prediction, label) if lab != SPECIAL_TOKEN_ID]
    for prediction, label in zip(predictions, labels)
  ]
  true_labels = [
    [LABEL_MAP[lab] for (_, lab) in zip(prediction, label) if lab != SPECIAL_TOKEN_ID]
    for prediction, label in zip(predictions, labels)
  ]

  # example: results = metric.compute(predictions=[["O", "B-group", "O"]], references=[["O", "B-person", "B-group"]])
  #   we used the LABEL_MAP to convert the tags (which are integers in wnut) into the corresponding label strings
  results = metric.compute(predictions=true_predictions, references=true_labels)

  return {
    "precision": results["overall_precision"],
    "recall": results["overall_recall"],
    "f1": results["overall_f1"],
    "accuracy": results["overall_accuracy"],
  }
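
To see the filtering step of `compute_metrics` in isolation, here is a small sketch on made-up data. It repeats a subset of the label map so that it runs on its own. The logits and labels are hypothetical, and the seqeval call is left out.

```python
import numpy as np

# Subset of the tutorial's label map, repeated here so the sketch is self-contained.
LABEL_MAP = {0: "O", 5: "B-group", 9: "B-person"}
SPECIAL_TOKEN_ID = -100

# Made-up predicted classes for one sequence of 5 token positions.
predicted_classes = [0, 0, 5, 5, 0]

# Build logits of shape (batch, sequence, classes) whose argmax matches the above.
logits = np.zeros((1, 5, 13))
for pos, cls in enumerate(predicted_classes):
    logits[0, pos, cls] = 1.0

# Only positions 1 and 2 carry a real label; the rest are ignored positions.
labels = np.array([[SPECIAL_TOKEN_ID, 0, 9, SPECIAL_TOKEN_ID, SPECIAL_TOKEN_ID]])

predictions = np.argmax(logits, axis=2)

# Keep only positions with a real label, and map integer tags to tag strings.
true_predictions = [
    [LABEL_MAP[pred] for pred, lab in zip(prediction, label) if lab != SPECIAL_TOKEN_ID]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [LABEL_MAP[lab] for lab in label if lab != SPECIAL_TOKEN_ID]
    for label in labels
]

print(true_predictions)
print(true_labels)
```

Only the two labeled positions survive the filter, so the prediction at the ignored fourth position never reaches the metric.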

## **Step 3: Train the Teacher**

To support the sparse transfer learning process, we will first fine-tune a dense teacher model on the WNUT dataset, which we can then distill from during the sparse transfer learning process.

In this case, we will use BERT base.

In [None]:
# load teacher model and tokenizer
TEACHER = "bert-base-uncased"
teacher_config = AutoConfig.from_pretrained(TEACHER, num_labels=NUM_LABELS)
tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForTokenClassification.from_pretrained(TEACHER, config=teacher_config)

### **Tokenize Dataset**

Run the tokenizer on the dataset. 

In this function, we handle the case where an individual word is tokenized into multiple tokens. In particular, we set the `label_id = SPECIAL_TOKEN_ID` for each token besides the first token in a word. 

When evaluating the accuracy with `compute_metrics` (defined above), we filter out tokens with `SPECIAL_TOKEN_ID`, such that each word counts only once in the precision and recall calculations.

In [None]:
def preprocess_fn(examples):
  tokenized_inputs = tokenizer(
    examples[INPUT_COL], 
    padding="max_length", 
    max_length=min(tokenizer.model_max_length, 128), 
    truncation=True,
    is_split_into_words=True # the texts in our dataset are lists of words (with a label for each word)
  )
  
  labels = []
  for i, label in enumerate(examples[LABEL_COL]):
    word_ids = tokenized_inputs.word_ids(batch_index=i)
    previous_word_idx = None
    label_ids = []

    for word_idx in word_ids:
      # Special tokens have a word id that is None. We set the label to SPECIAL_TOKEN_ID
      # so they are automatically ignored in the loss function.
      if word_idx is None:
        label_ids.append(SPECIAL_TOKEN_ID)

      # We set the label for the first token of each word.
      elif word_idx != previous_word_idx:
        label_ids.append(label[word_idx])

      # We do not label the other tokens of a word, so set them to SPECIAL_TOKEN_ID
      else:
        label_ids.append(SPECIAL_TOKEN_ID)
      previous_word_idx = word_idx
    labels.append(label_ids)

  tokenized_inputs["labels"] = labels
  return tokenized_inputs

# tokenize the dataset
tokenized_dataset = dataset_from_json.map(
    preprocess_fn,
    batched=True,
    desc="Running tokenizer on dataset"
)
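
The label alignment rule used in `preprocess_fn` can be tried out without a tokenizer. The sketch below applies the same three-way branch to a hand-written `word_ids` list. The helper name `align_labels` and the example inputs are hypothetical and exist only for illustration.

```python
SPECIAL_TOKEN_ID = -100


def align_labels(word_ids, word_labels):
    """Give each word's label to its first sub-token and ignore everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens ([CLS], [SEP], padding) are ignored.
            aligned.append(SPECIAL_TOKEN_ID)
        elif word_idx != previous_word_idx:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Remaining sub-tokens of the same word are ignored.
            aligned.append(SPECIAL_TOKEN_ID)
        previous_word_idx = word_idx
    return aligned


# Hypothetical tokenization of three words where the second word is split in two:
# [CLS] w0 w1a w1b w2 [SEP] [PAD]
word_ids = [None, 0, 1, 1, 2, None, None]
word_labels = [0, 9, 10]

aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 0, 9, -100, 10, -100, -100]
```

Each word contributes exactly one labeled position, which is why each word counts only once in the precision and recall calculations.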

### **Teacher Training: Fine-Tune the Teacher**

We use the native Hugging Face `Trainer` (which we import as `HFTrainer`) to train the model. Check out the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/trainer) for more details on the `Trainer` as needed.


In [None]:
from transformers import Trainer as HFTrainer
from transformers import TrainingArguments as HFTrainingArguments

# setup trainer arguments
teacher_training_args = HFTrainingArguments(
    output_dir="./teacher_training",
    do_train=True,
    do_eval=True,
    num_train_epochs=20.0,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32)

# initialize trainer
teacher_trainer = HFTrainer(
    model=teacher,
    args=teacher_training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=8 if teacher_training_args.fp16 else None),
    compute_metrics=compute_metrics)

In [None]:
# run training
%rm -rf ./teacher_training
teacher_trainer.train(resume_from_checkpoint=False)

## **Step 4: Sparse Transfer Learning**

Now that we have the teacher trained, we can sparse transfer learn from the pre-sparsified version of BERT with distillation support. 

First, we need to select a sparse checkpoint to begin the training process. In this case, we are fine-tuning a 90% pruned version of BERT onto the WNUT NER dataset. This model is available in SparseZoo, identified by the following stub:
```
zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none
```

Next, we need to create a sparsification recipe for usage in the training process. Recipes are YAML files that encode the sparsity related algorithms and parameters to be applied by SparseML. For Sparse Transfer Learning, we need to use a recipe that instructs SparseML to maintain sparsity during the training process and to apply quantization over the final few epochs. 

In SparseZoo, there is a transfer recipe which was used to fine-tune BERT onto the CoNLL2003 task (which is also a NER task). Since the WNUT dataset is a similar problem to CoNLL2003, we will use the CoNLL2003 recipe, which is identified by the following stub:

```
zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none
```

Use the `sparsezoo` python client to download the models and recipe using their SparseZoo stubs.

In [None]:
# 90% pruned upstream BERT trained on MLM objective (pruned90)
model_stub = "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none" 
model_path = Model(model_stub, download_path="./model").training.path

# sparse transfer learning recipe for conll2003 (pruned90_quant)
transfer_stub = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
recipe_path = Model(transfer_stub, download_path="./transfer_recipe").recipes.default.path

We can see that the upstream model (trained on Wikipedia and BookCorpus) and its configuration files have been downloaded to the local directory.

In [None]:
%ls ./model/training

We can see that a transfer learning recipe has been downloaded. The `ConstantPruningModifier` instructs SparseML to maintain the sparsity structure of the network as the model trains and the `QuantizationModifier` instructs SparseML to run Quantization Aware Training at the end of training.

In [None]:
%cat ./transfer_recipe/recipe/recipe_original.md

#### Inspecting the Recipe

Here is the transfer learning recipe:

```yaml
version: 1.1.0

# General Variables
num_epochs: 13
init_lr: 1.5e-4 
final_lr: 0

qat_start_epoch: 8.0
observer_epoch: 12.0
quantize_embeddings: 1

distill_hardness: 1.0
distill_temperature: 2.0

# Modifiers:

training_modifiers:
  - !EpochRangeModifier
      end_epoch: eval(num_epochs)
      start_epoch: 0.0

  - !LearningRateFunctionModifier
      start_epoch: 0
      end_epoch: eval(num_epochs)
      lr_func: linear
      init_lr: eval(init_lr)
      final_lr: eval(final_lr)
    
quantization_modifiers:
  - !QuantizationModifier
      start_epoch: eval(qat_start_epoch)
      disable_quantization_observer_epoch: eval(observer_epoch)
      freeze_bn_stats_epoch: eval(observer_epoch)
      quantize_embeddings: eval(quantize_embeddings)
      quantize_linear_activations: 0
      exclude_module_types: ['LayerNorm']
      submodules:
        - bert.embeddings
        - bert.encoder
        - classifier

distillation_modifiers:
  - !DistillationModifier
     hardness: eval(distill_hardness)
     temperature: eval(distill_temperature)
     distill_output_keys: [logits]

constant_modifiers:
  - !ConstantPruningModifier
      start_epoch: 0.0
      params: __ALL_PRUNABLE__
```


The `Modifiers` in the transfer learning recipe are the important items that encode how SparseML should modify the training process for Sparse Transfer Learning:
- `ConstantPruningModifier` tells SparseML to pin weights at 0 over all epochs, maintaining the sparsity structure of the network
- `QuantizationModifier` tells SparseML to quantize the weights with quantization aware training over the last 5 epochs
- `DistillationModifier` tells SparseML how to apply distillation during the training process, targeting the logits

Below, SparseML's `Trainer` parses the modifiers and updates the training process to implement the algorithms specified here.
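
The recipe also contains a `LearningRateFunctionModifier` with `lr_func: linear`, which the list above does not cover. A plausible reading is a straight-line interpolation from `init_lr` to `final_lr` between the start and end epochs. The sketch below implements that reading with the recipe's default values. The function name `linear_lr` and the clamping outside the epoch range are assumptions, not SparseML's code.

```python
def linear_lr(epoch, init_lr=1.5e-4, final_lr=0.0, start_epoch=0.0, end_epoch=13.0):
    """Linearly interpolate the learning rate between start_epoch and end_epoch."""
    progress = (epoch - start_epoch) / (end_epoch - start_epoch)
    # Assumption: hold the boundary value for epochs outside the range.
    progress = min(max(progress, 0.0), 1.0)
    return init_lr + (final_lr - init_lr) * progress


for epoch in (0.0, 6.5, 13.0):
    print(f"epoch {epoch:>4}: lr = {linear_lr(epoch):.2e}")
```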

Next, we will set up the Hugging Face `tokenizer`, `config`, and `model`. Since the teacher and student use the same tokenizer, we can reuse the `tokenizer` and tokenized dataset from the teacher training above.

These are all native Hugging Face objects, so check out the Hugging Face docs for more details on `AutoModel`, `AutoConfig`, and `AutoTokenizer` as needed. 

We instantiate these classes by passing the local path to the directory containing the `pytorch_model.bin`, `tokenizer.json`, and `config.json` files from the SparseZoo download.

In [None]:
# note: since the teacher and student have the same tokenizer, we can use the one from the teacher training

# initialize config
config = AutoConfig.from_pretrained(model_path, num_labels=NUM_LABELS)

# initialize model
model_kwargs = {"config": config}
model_kwargs["state_dict"], s_delayed = SparseAutoModel._loadable_state_dict(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path, **model_kwargs,)
SparseAutoModel.log_model_load(model, model_path, "student", s_delayed) # prints metrics on sparsity profile

### **Sparse Transfer Learning: Fine-Tune the Model**

SparseML has a custom `Trainer` class that inherits from the [Hugging Face `Trainer` Class](https://huggingface.co/docs/transformers/main_classes/trainer). As such, the SparseML `Trainer` has all of the existing functionality of the HF trainer. However, in addition, we can supply a `recipe` and (optionally) a `teacher`. 


As we saw above, the `recipe` encodes the sparsity related algorithms and hyperparameters of the training process in a YAML file. The SparseML `Trainer` parses the `recipe` and adjusts the training workflow to apply the algorithms in the recipe. We use the `recipe_args` argument to modify the recipe slightly (training for more epochs than were used for CoNLL2003).
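
To check what the overrides do to the schedule, the sketch below merges the `recipe_args` values used in this tutorial over the recipe's default variables and derives the length of each phase. Treating the override as a plain dictionary merge is an assumption made for illustration, and the helper `schedule_summary` is hypothetical.

```python
# Variables copied from the transfer recipe shown earlier.
recipe_defaults = {
    "num_epochs": 13,
    "init_lr": 1.5e-4,
    "qat_start_epoch": 8.0,
    "observer_epoch": 12.0,
}

# Overrides passed to the SparseML Trainer through recipe_args in this tutorial.
recipe_args = {
    "num_epochs": 25,
    "init_lr": 5e-5,
    "qat_start_epoch": 20.0,
    "observer_epoch": 24.0,
}


def schedule_summary(variables):
    """Derive the length of each training phase from the recipe variables."""
    return {
        "epochs_before_qat": variables["qat_start_epoch"],
        "qat_epochs": variables["num_epochs"] - variables["qat_start_epoch"],
        "epochs_with_observers_disabled": variables["num_epochs"] - variables["observer_epoch"],
    }


default_schedule = schedule_summary(recipe_defaults)
# Assumption: overrides simply replace the default values of the same name.
overridden_schedule = schedule_summary({**recipe_defaults, **recipe_args})

print(default_schedule)
print(overridden_schedule)
```

With these values, the sparse fine-tuning phase before quantization grows from 8 to 20 epochs, while the quantization aware training phase stays at 5 epochs.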


The `teacher` is an optional argument that instructs SparseML to apply model distillation to support the training process. We pass the teacher trained above here.
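
For intuition about the `hardness` and `temperature` settings in the recipe, the sketch below implements a common textbook form of logit distillation in NumPy: a temperature-softened KL term between teacher and student, blended with the ordinary cross-entropy loss. SparseML's actual loss may differ in its details. The function names and toy logits here are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels, hardness=1.0, temperature=2.0):
    """Blend cross-entropy on labels with a KL term against the teacher."""
    # Cross-entropy of the student against the ground-truth labels.
    probs = softmax(student_logits)
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

    # KL(teacher || student) on temperature-softened distributions, scaled by T^2.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1))
    kl = kl * temperature ** 2

    # In this sketch, hardness=1.0 means the loss comes entirely from the teacher term.
    return (1.0 - hardness) * ce + hardness * kl


# Toy logits for two tokens and three classes.
student = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
teacher = np.array([[3.0, 0.0, -2.0], [0.0, 0.0, 4.0]])
labels = np.array([0, 2])

loss_vs_teacher = distillation_loss(student, teacher, labels)
loss_vs_self = distillation_loss(student, student, labels)
print(loss_vs_teacher, loss_vs_self)
```

When the student's logits match the teacher's, the distillation term vanishes. With `hardness` below 1.0, the label cross-entropy would contribute as well.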

In [None]:
# setup trainer arguments
training_args = TrainingArguments(
    output_dir="./training_output",
    do_train=True,
    do_eval=True,
    resume_from_checkpoint=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=False)

# initialize trainer
trainer = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    recipe_args={
        "num_epochs":25,
        "init_lr": 5e-5,
        "qat_start_epoch": 20.0,
        "observer_epoch": 24.0},
    teacher=teacher,
    metadata_args=["per_device_train_batch_size","per_device_eval_batch_size","fp16"],
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None),
    compute_metrics=compute_metrics)

In [None]:
# run training
%rm -rf training_output
train_result = trainer.train(resume_from_checkpoint=False)
trainer.save_model()
trainer.save_state()
trainer.save_optimizer_and_scheduler(training_args.output_dir)

## **Step 5: Export To ONNX**

Run the following to export the model to ONNX. The script creates a `deployment` folder containing the ONNX file and the necessary configuration files (e.g. `tokenizer.json`) for deployment with DeepSparse.

In [None]:
!sparseml.transformers.export_onnx \
  --model_path training_output \
  --task token_classification

## **Optional: Deploy with DeepSparse**

In [None]:
%pip install deepsparse

In [None]:
from deepsparse import Pipeline

pipeline = Pipeline.create("token_classification", model_path="./deployment")

In [None]:
from pprint import pprint
prediction = pipeline("Japan, co-hosts of the World Cup in 2002 and ranked 20th in the world by FIFA, are favourites to regain their title here.")
pprint(prediction.predictions)

In [None]:
prediction = pipeline("China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net.")
pprint(prediction.predictions)