# **Token Classification: Sparse Transfer Learning with the Python API**

In this example, you will fine-tune a 90% pruned BERT model onto the Conll2003 dataset using SparseML's Hugging Face Integration.

### **Sparse Transfer Learning Overview**

Sparse Transfer Learning is very similar to the typical fine-tuning you are used to when training models. However, with Sparse Transfer Learning, we start the training process from a pre-sparsified checkpoint and maintain the sparsity structure while the fine-tuning occurs. At the end, you will have a sparse model trained on your dataset, ready to be deployed with DeepSparse for GPU-class performance on CPUs!
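
To make "maintain the sparsity structure" concrete, here is a minimal toy sketch in pure Python. It does not use SparseML; the function names `sparsity_mask` and `masked_sgd_step` are hypothetical and only illustrate the idea that weights pruned in the starting checkpoint stay at zero after every update.

```python
# Toy illustration of keeping a sparsity pattern fixed during fine-tuning.
# Pure Python, no SparseML involved; all names here are hypothetical.

def sparsity_mask(weights):
    """Return 1 for weights that are non-zero and 0 for pruned weights."""
    return [0 if w == 0.0 else 1 for w in weights]

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Take one SGD step, then force every pruned position back to zero."""
    return [(w - lr * g) if m else 0.0 for w, g, m in zip(weights, grads, mask)]

# A pre-sparsified "checkpoint": 3 of 5 weights are already pruned to zero.
weights = [0.0, 0.5, 0.0, -0.25, 0.0]
mask = sparsity_mask(weights)

# Gradients are generally non-zero everywhere, including pruned positions.
grads = [1.0, -1.0, 2.0, 0.5, -3.0]

for _ in range(3):
    weights = masked_sgd_step(weights, grads, mask)

sparsity = weights.count(0.0) / len(weights)
print(weights)
print(f"sparsity: {sparsity:.0%}")
```

The surviving weights move with the gradient while the pruned positions never change, so the share of zero weights is the same before and after training.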

### **Pre-Sparsified BERT**
SparseZoo, Neural Magic's open-source repository of pre-sparsified models, contains a 90% pruned version of BERT, which has been sparsified on the upstream Wikipedia and BookCorpus datasets with the masked language modeling objective. [Check out the model card](https://sparsezoo.neuralmagic.com/models/nlp%2Fmasked_language_modeling%2Fobert-base%2Fpytorch%2Fhuggingface%2Fwikipedia_bookcorpus%2Fpruned90-none). We will use this model as the starting point for the transfer learning process.


***Let's dive in!***

## **Installation**

Install SparseML via `pip`.



In [None]:
%pip uninstall torch -y
%pip install sparseml[torch]

If you are running on Google Colab, restart the runtime after this step.

In [None]:
import sparseml
from sparsezoo import Model
from sparseml.transformers.utils import SparseAutoModel
from sparseml.transformers.sparsification import Trainer, TrainingArguments
import numpy as np
from transformers import (
    AutoModelForTokenClassification,
    AutoConfig, 
    AutoTokenizer,
    EvalPrediction,
    DataCollatorForTokenClassification,
    PreTrainedTokenizerFast
)
from datasets import ClassLabel, load_dataset, load_metric

## **Step 1: Load a Dataset**

SparseML is integrated with Hugging Face, so we can use the `datasets` library to load datasets from the Hugging Face hub or from local files.

[Conll2003 Dataset Card](https://huggingface.co/datasets/conll2003)

In [None]:
# load dataset from HF hub
dataset = load_dataset("conll2003")
dataset["train"].to_json("conll2003-train.json")
dataset["validation"].to_json("conll2003-validation.json")

# alternatively, load from JSONL file
data_files = {}
data_files["train"] = "conll2003-train.json"
data_files["validation"] = "conll2003-validation.json"
dataset_from_json = load_dataset('json', data_files=data_files)

We can see that the input is `tokens`, a list of words, and the labels are `ner_tags`, a list of integers giving the tag type of each word.
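
As a quick sketch of that structure, the snippet below pairs each word with its tag name. The record is an illustrative example written inline in the same shape as the JSON lines (it is not read from disk), and `id_to_tag` is a hypothetical helper that follows the Conll2003 index order used in Step 2.

```python
# One record in the same shape as the JSON lines written above.
# The sentence and tags are an illustrative example, not read from disk.
record = {
    "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
    "ner_tags": [3, 0, 7, 0, 0, 0, 7, 0, 0],
}

# Index-to-tag lookup for Conll2003 (same order as the mapping in Step 2).
id_to_tag = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Every word has exactly one tag, so the two lists must have equal length.
assert len(record["tokens"]) == len(record["ner_tags"])

tagged = [(word, id_to_tag[tag]) for word, tag in zip(record["tokens"], record["ner_tags"])]
for word, tag in tagged:
    print(f"{word:10s} {tag}")
```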

In [None]:
!head conll2003-train.json --lines=10

## **Step 2: Setup Evaluation Metric**

Token classification predicts a category for every word in the input sentence. We can use the [seqeval metric](https://huggingface.co/spaces/evaluate-metric/seqeval) to evaluate the tag-level precision and recall of the pipeline. 

The seqeval metric needs to be passed tags rather than tag indexes, so we need to create a mapping between the indexes and the tags so that we can pass the tags to the seqeval metric.
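
The following is a simplified sketch of entity-level scoring, included to show why tag strings (not indexes) are needed. It is not the seqeval implementation: `extract_entities` and `entity_scores` are hypothetical helpers, and the short `id_to_tag` dictionary covers only the tags used in the example.

```python
def extract_entities(tags):
    """Collect (entity_type, start, end) spans from a list of BIO tags."""
    entities = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes an open entity
        prefix, _, label = tag.partition("-")
        # Close the open entity unless this tag continues it (I- of same type).
        if current is not None and (prefix != "I" or label != current):
            entities.append((current, start, i - 1))
            start, current = None, None
        # A B- tag (or a stray I- tag) opens a new entity.
        if prefix in ("B", "I") and current is None:
            start, current = i, label
    return entities

def entity_scores(true_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching entity spans."""
    true_set = set(extract_entities(true_tags))
    pred_set = set(extract_entities(pred_tags))
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Integer predictions must be converted to tag strings before scoring.
id_to_tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG"}
true_tags = [id_to_tag[i] for i in [1, 2, 0, 3, 4]]
pred_tags = [id_to_tag[i] for i in [1, 2, 0, 3, 0]]  # ORG span cut short

precision, recall, f1 = entity_scores(true_tags, pred_tags)
print(precision, recall, f1)
```

Here the person span matches exactly but the organization span is truncated, so only one of the two predicted entities counts as correct.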

The Conll2003 named-entity-recognition tags map to the following classes:

```
{
  'O': 0, 
  'B-PER': 1, 
  'I-PER': 2, 
  'B-ORG': 3, 
  'I-ORG': 4, 
  'B-LOC': 5, 
  'I-LOC': 6, 
  'B-MISC': 7, 
  'I-MISC': 8
}
```

In [None]:
# label mapping
LABEL_MAP = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}

# other configs
INPUT_COL = "tokens"
LABEL_COL = "ner_tags"
NUM_LABELS = len(LABEL_MAP)
SPECIAL_TOKEN_ID = -100

In [None]:
# load evaluation metric
metric = load_metric("seqeval")

# setup metrics function
def compute_metrics(p: EvalPrediction):
  predictions, labels = p
  predictions = np.argmax(predictions, axis=2)
   
  # Remove ignored index (special tokens) and convert indexed tags to labels
  true_predictions = [
    [LABEL_MAP[pred] for (pred, lab) in zip(prediction, label) if lab != SPECIAL_TOKEN_ID]
    for prediction, label in zip(predictions, labels)
  ]
  true_labels = [
    [LABEL_MAP[lab] for (_, lab) in zip(prediction, label) if lab != SPECIAL_TOKEN_ID]
    for prediction, label in zip(predictions, labels)
  ]
  
  # example: results = metric.compute(predictions=[["O", "B-ORG", "O"]], references=[["O", "B-ORG", "I-ORG"]])
  #   we used LABEL_MAP to convert the tags (which are integers) into the corresponding labels
  #   seqeval should be passed the actual labels, as one list of tags per sentence
  results = metric.compute(predictions=true_predictions, references=true_labels)
  return {
    "precision": results["overall_precision"],
    "recall": results["overall_recall"],
    "f1": results["overall_f1"],
    "accuracy": results["overall_accuracy"],
  }

## **Step 3: Download Files for Sparse Transfer Learning**

First, we need to select a sparse checkpoint to begin the training process. In this case, we will fine-tune a 90% pruned version of BERT onto the Conll2003 dataset. This model is available in SparseZoo, identified by the following stub:
```
zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none
```

Next, we need to create or select a sparsification recipe for use in the training process. Recipes are YAML files that encode the sparsity-related algorithms and parameters to be applied by SparseML. For Sparse Transfer Learning, we use a recipe that instructs SparseML to maintain sparsity during the training process and to apply quantization over the final few epochs. In the case of Conll2003, there is a transfer learning recipe available in the SparseZoo, identified by the following stub:
```
zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none
```

Finally, SparseML has the optional ability to apply model distillation from a teacher model during the transfer learning process to boost accuracy. In this case, we will use a dense version of BERT trained on the Conll2003 dataset which is hosted in SparseZoo. This model is identified by the following stub:
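
For intuition, here is a minimal sketch of a common form of distillation loss for a single token, written in pure Python. It is an assumption for illustration, not SparseML's implementation: the parameter names `hardness` and `temperature` and their default values are hypothetical, and the actual settings come from the recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Standard loss against the ground-truth label."""
    return -math.log(softmax(logits)[target_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, target_index,
                      hardness=0.5, temperature=2.0):
    """Blend the hard-label loss with a soft loss that mimics the teacher."""
    hard = cross_entropy(student_logits, target_index)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    ) * temperature ** 2
    return hardness * hard + (1.0 - hardness) * soft

student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.2, -2.0]
loss = distillation_loss(student, teacher, target_index=0)
print(loss)
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label loss remains.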

```
zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/base-none
```

Use the `sparsezoo` Python client to download the models and recipe using their SparseZoo stubs.

In [None]:
# downloads 90% pruned upstream BERT trained on MLM objective (pruned90)
model_stub = "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none" 
model_path = Model(model_stub, download_path="./model").training.path

# downloads dense BERT trained on CONLL2003 (base_none)
teacher_stub = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/base-none"
teacher_path = Model(teacher_stub, download_path="./teacher").training.path

# download pruned quantized transfer recipe for CONLL2003 (pruned90_quant)
transfer_stub = "zoo:nlp/token_classification/obert-base/pytorch/huggingface/conll2003/pruned90_quant-none"
recipe_path = Model(transfer_stub, download_path="./transfer_recipe").recipes.default.path

We can see that the upstream model (trained on Wikipedia and BookCorpus) and its configuration files have been downloaded to the local directory.

In [None]:
%ls ./model/training

We can see that a transfer learning recipe has been downloaded. The `ConstantPruningModifier` instructs SparseML to maintain the sparsity structure of the network as the model trains and the `QuantizationModifier` instructs SparseML to run Quantization Aware Training at the end of training.
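
As a rough sketch of what Quantization Aware Training simulates, the snippet below rounds weights to 8-bit integers and maps them back to floats ("fake quantization"). It assumes a symmetric per-tensor scheme for simplicity; the helper `fake_quantize` is hypothetical, and the scheme SparseML actually applies is determined by the recipe.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to signed integers with a symmetric scale, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [q * scale for q in quantized]
    return quantized, dequantized, scale

# A sparse weight vector: the zero entry represents a pruned weight.
weights = [0.0, 0.5, -1.27, 0.3]
quantized, dequantized, scale = fake_quantize(weights)

print(quantized)
print(dequantized)
```

With a symmetric scale, zero maps to integer zero, so pruned weights stay exactly zero after quantization, and every other weight moves by at most half a quantization step.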

In [None]:
%cat ./transfer_recipe/recipe/recipe_original.md

## **Step 4: Setup Hugging Face Model Objects**

Next, we will set up the Hugging Face `tokenizer`, `config`, and `model`. These are all native Hugging Face objects, so check out the Hugging Face docs for more details on `AutoModel`, `AutoConfig`, and `AutoTokenizer` as needed. We instantiate these classes by passing the local path to the directory containing the `pytorch_model.bin`, `tokenizer.json`, and `config.json` files from Step 3.

In [None]:
# shared tokenizer between teacher and student
tokenizer = AutoTokenizer.from_pretrained(model_path)
assert isinstance(tokenizer, PreTrainedTokenizerFast)

# setup configs
model_config = AutoConfig.from_pretrained(model_path, num_labels=NUM_LABELS)
teacher_config = AutoConfig.from_pretrained(teacher_path, num_labels=NUM_LABELS)

# initialize model using familiar HF AutoModel
model_kwargs = {"config": model_config}
model_kwargs["state_dict"], s_delayed = SparseAutoModel._loadable_state_dict(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path, **model_kwargs,)

# initialize teacher using familiar HF AutoModel
teacher_kwargs = {"config": teacher_config}
teacher_kwargs["state_dict"], t_delayed = SparseAutoModel._loadable_state_dict(teacher_path)
teacher = AutoModelForTokenClassification.from_pretrained(teacher_path, **teacher_kwargs,)

# optional - prints metrics about sparsity profiles of the models
SparseAutoModel.log_model_load(model, model_path, "student", s_delayed) # prints metrics on sparsity profile
SparseAutoModel.log_model_load(teacher, teacher_path, "teacher", t_delayed) # prints metrics on sparsity profile

## **Step 5: Tokenize Dataset**

Run the tokenizer on the dataset. 

In this function, we handle the case where an individual word is tokenized into multiple tokens. In particular, we set the `label_id = SPECIAL_TOKEN_ID` for each token besides the first token in a word. 

When evaluating the accuracy with `compute_metrics` (defined above), we filter out tokens with `SPECIAL_TOKEN_ID`, such that each word counts only once in the precision and recall calculations.
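
To see the alignment on a single sentence without running the tokenizer, here is a small standalone sketch. The `word_ids` list is a hypothetical tokenization (the real split depends on the tokenizer's vocabulary), and `align_labels` is an illustrative helper that mirrors the rule described above.

```python
SPECIAL_TOKEN_ID = -100

def align_labels(word_ids, word_labels):
    """Label only the first token of each word; mask everything else."""
    label_ids = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS], [SEP], and [PAD]
            label_ids.append(SPECIAL_TOKEN_ID)
        elif word_idx != previous:
            # First token of a new word keeps the word's label
            label_ids.append(word_labels[word_idx])
        else:
            # Remaining tokens of the same word are masked
            label_ids.append(SPECIAL_TOKEN_ID)
        previous = word_idx
    return label_ids

# Hypothetical tokenization of the words ["Uzbek", "striker", "Shkvyrin"],
# where the last word is split into four tokens:
# [CLS] uzbek striker sh ##kv ##yri ##n [SEP] [PAD]
word_ids = [None, 0, 1, 2, 2, 2, 2, None, None]
word_labels = [7, 0, 1]  # B-MISC, O, B-PER

aligned = align_labels(word_ids, word_labels)
print(aligned)

# Filtering the masked positions recovers exactly one label per word.
kept = [label for label in aligned if label != SPECIAL_TOKEN_ID]
print(kept)
```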

In [None]:
MAX_LEN = 128

def preprocess_fn(examples):
  tokenized_inputs = tokenizer(
    examples[INPUT_COL], 
    padding="max_length", 
    max_length=min(tokenizer.model_max_length, MAX_LEN), 
    truncation=True,
    is_split_into_words=True # the texts in our dataset are lists of words (with a label for each word)
  )
  
  labels = []
  for i, label in enumerate(examples[LABEL_COL]):
    word_ids = tokenized_inputs.word_ids(batch_index=i)
    previous_word_idx = None
    label_ids = []
    for word_idx in word_ids:
      # Special tokens have a word id that is None. We set the label to SPECIAL_TOKEN_ID
      # so they are automatically ignored in the loss function.
      if word_idx is None:
        label_ids.append(SPECIAL_TOKEN_ID)

      # We set the label for the first token of each word.
      elif word_idx != previous_word_idx:
        label_ids.append(label[word_idx])

      # We will not label the other tokens of a word, so set to SPECIAL_TOKEN_ID
      else:
        label_ids.append(SPECIAL_TOKEN_ID)
      previous_word_idx = word_idx

    labels.append(label_ids)

  tokenized_inputs["labels"] = labels
  return tokenized_inputs

# tokenize the dataset
tokenized_dataset = dataset_from_json.map(
    preprocess_fn,
    batched=True,
    desc="Running tokenizer on dataset"
)

## **Step 6: Run Training**

SparseML has a custom `Trainer` class that inherits from the [Hugging Face `Trainer` Class](https://huggingface.co/docs/transformers/main_classes/trainer). As such, the SparseML `Trainer` has all of the existing functionality of the HF trainer. However, in addition, we can supply a `recipe` and (optionally) a `teacher`. 


As we saw above, the `recipe` encodes the sparsity-related algorithms and hyperparameters of the training process in a YAML file. The SparseML `Trainer` parses the `recipe` and adjusts the training workflow to apply the algorithms in the recipe.

The `teacher` is an optional argument that instructs SparseML to apply model distillation to support the training process.

In [None]:
# run with subset of dataset so we can complete in 10 minutes
MAX_SAMPLES = 2000
if MAX_SAMPLES is not None:
  train_dataset = tokenized_dataset["train"].select(range(MAX_SAMPLES))
else:
  train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["validation"]

# setup trainer arguments
training_args = TrainingArguments(
    output_dir="./training_output",
    do_train=True,
    do_eval=True,
    resume_from_checkpoint=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True)

# initialize trainer
trainer = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    teacher=teacher,
    metadata_args=["per_device_train_batch_size","per_device_eval_batch_size","fp16"],
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=8 if training_args.fp16 else None),
    compute_metrics=compute_metrics)

In [None]:
%rm -rf training_output
train_result = trainer.train(resume_from_checkpoint=False)
trainer.save_model()
trainer.save_state()
trainer.save_optimizer_and_scheduler(training_args.output_dir)

In [None]:
!sparseml.transformers.export_onnx \
  --model_path training_output \
  --task token_classification

## **Optional: Deploy with DeepSparse**

In [None]:
%pip install deepsparse

In [None]:
from deepsparse import Pipeline

pipeline = Pipeline.create("token_classification", model_path="./deployment")

In [None]:
from pprint import pprint
prediction = pipeline("Japan, co-hosts of the World Cup in 2002 and ranked 20th in the world by FIFA, are favourites to regain their title here.")
pprint(prediction.predictions)

In [None]:
prediction = pipeline("China controlled most of the match and saw several chances missed until the 78th minute when Uzbek striker Igor Shkvyrin took advantage of a misdirected defensive header to lob the ball over the advancing Chinese keeper and into an empty net.")
pprint(prediction.predictions)