# **Text Classification: Sparse Transfer Learning with the Python API**

In this example, you will fine-tune a 90% pruned BERT model onto the TweetEval Emotions dataset using SparseML's Hugging Face Integration.

### **Sparse Transfer Learning Overview**

Sparse Transfer Learning is very similar to the typical fine-tuning you are used to when training models. The difference is that we start the training process from a pre-sparsified checkpoint and maintain the sparsity structure while fine-tuning occurs. At the end, you will have a sparse model trained on your dataset, ready to be deployed with DeepSparse for GPU-class performance on CPUs!
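
To make "maintaining the sparsity structure" concrete, here is a minimal NumPy sketch of a masked gradient step. It is only an illustration of the idea, not SparseML's implementation: the weight matrix, the mask, the random "gradients", and the `masked_sgd_step` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 layer in which 12 of 16 weights (75%) are pruned.
# mask: 1 = weight is kept, 0 = weight is pruned.
mask = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
weights = rng.normal(size=(4, 4)) * mask


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step; re-applying the mask keeps pruned weights at zero."""
    return (weights - lr * grads) * mask


for _ in range(10):
    grads = rng.normal(size=weights.shape)  # stand-in for real gradients
    weights = masked_sgd_step(weights, grads, mask)

# The kept weights have moved, but every pruned position is still zero.
pruned_still_zero = bool(np.all(weights[mask == 0] == 0.0))
print(pruned_still_zero)
```

The kept weights are free to adapt to the new dataset, while the zero pattern that makes the model fast at inference time is never disturbed.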

### **Pre-Sparsified BERT**
SparseZoo, Neural Magic's open source repository of pre-sparsified models, contains a 90% pruned version of BERT, which has been sparsified on the upstream Wikipedia and BookCorpus datasets with the
masked language modeling objective. [Check out the model card](https://sparsezoo.neuralmagic.com/models/nlp%2Fmasked_language_modeling%2Fobert-base%2Fpytorch%2Fhuggingface%2Fwikipedia_bookcorpus%2Fpruned90-none). We will use this model as the starting point for the transfer learning process.


***Let's dive in!***

## **Installation**

Install SparseML via `pip`.



In [None]:
%pip uninstall torch -y
%pip install sparseml[torch]

If you are running on Google Colab, restart the runtime after this step.

In [None]:
import sparseml
from sparsezoo import Model
from sparseml.transformers.utils import SparseAutoModel
from sparseml.transformers.sparsification import Trainer, TrainingArguments
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoConfig, 
    AutoTokenizer, 
    EvalPrediction, 
    default_data_collator
)
from datasets import load_dataset, load_metric

## **Step 1: Load a Dataset**

SparseML is integrated with Hugging Face, so we can use the `datasets` library to load datasets from the Hugging Face Hub or from local files.

This example uses the [TweetEval Emotions](https://huggingface.co/datasets/tweet_eval/viewer/emotion/train) dataset.

In [None]:
# load_dataset from HF hub
dataset = load_dataset("tweet_eval", "emotion")

# alternatively, write the splits to CSV and load them back from local files
dataset["train"].to_csv("tweet_eval_emotion-train.csv")
dataset["validation"].to_csv("tweet_eval_emotion-validation.csv")

data_files = {
  "train": "tweet_eval_emotion-train.csv",
  "validation": "tweet_eval_emotion-validation.csv"
}
dataset_from_json = load_dataset("csv", data_files=data_files)

In [None]:
!head tweet_eval_emotion-train.csv --lines=5

In [None]:
print(dataset_from_json)

In [None]:
# configs for below
INPUT_COL_1 = "text"
INPUT_COL_2 = None
LABEL_COL = "label"
NUM_LABELS = len(dataset_from_json["train"].unique(LABEL_COL))

## **Step 2: Setup Evaluation Metric**

TweetEval Emotion is a simple multi-class classification problem (we are predicting one of 4 emotions). We will use `accuracy` as the evaluation metric, wrapped in a native Hugging Face `compute_metrics` function (which will be passed to the `Trainer` class below).

In [None]:
metric = load_metric("accuracy")

# setup metrics function
def compute_metrics(p: EvalPrediction):
  preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
  preds = np.argmax(preds, axis=1)
  result = metric.compute(predictions=preds, references=p.label_ids)
  if len(result) > 1:
    result["combined_score"] = np.mean(list(result.values())).item()
  return result

print(dataset["train"])
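
As a small illustration of what `compute_metrics` does with the model outputs, the sketch below takes the argmax over hypothetical logits and compares the result with hypothetical labels. The numbers are made up for the example.

```python
import numpy as np

# Hypothetical logits for 3 tweets over the 4 emotion classes.
logits = np.array([
    [2.0, 0.1, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.5],
    [0.4, 1.2, 0.9, 0.1],
])
labels = np.array([0, 3, 2])

# Predicted class = index of the largest logit in each row.
preds = np.argmax(logits, axis=1)

# Accuracy = fraction of predictions that match the reference labels.
accuracy = float(np.mean(preds == labels))
print(preds.tolist(), accuracy)
```

Here the first two predictions match their labels and the third does not, so the accuracy is 2/3.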

## **Step 3: Train the Teacher**

To support the sparse transfer learning process, we will first fine-tune a dense teacher model from a pretrained checkpoint, which we can then distill from during the sparse transfer learning process.

Although we will be fine-tuning BERT, we can use a different model as the teacher. In this case, we will use `roberta-base` as the teacher, training with native Hugging Face objects.

In [None]:
TEACHER = "roberta-base"
teacher_config = AutoConfig.from_pretrained(TEACHER, num_labels=NUM_LABELS)
teacher_tokenizer = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForSequenceClassification.from_pretrained(TEACHER, config=teacher_config)

### **Teacher Training: Tokenize the Dataset**

Run the RoBERTa tokenizer on the dataset.

In [None]:
MAX_LEN = 128
def teacher_preprocess_fn(examples):
  args = None
  if INPUT_COL_2 is None:
    args = (examples[INPUT_COL_1], )
  else:
    args = (examples[INPUT_COL_1], examples[INPUT_COL_2])
  result = teacher_tokenizer(*args, 
                   padding="max_length", 
                   max_length=min(teacher_tokenizer.model_max_length, MAX_LEN), 
                   truncation=True)
  return result

# tokenize the dataset
teacher_tokenized_dataset = dataset_from_json.map(
    teacher_preprocess_fn,
    batched=True,
    desc="Running teacher tokenizer on dataset"
)
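
The `padding="max_length"` and `truncation=True` settings above make every example the same length. The toy sketch below shows the effect on a list of token ids. It is a simplification under stated assumptions: `pad_or_truncate` and the ids are hypothetical, the pad id differs between tokenizers, and real tokenizers also keep special tokens in place when truncating, which this version ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])          # truncate long inputs
    attention_mask = [1] * len(ids)             # 1 = real token
    n_pad = max_length - len(ids)
    # pad short inputs; 0 in the mask marks padding positions
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length inputs are what allow `default_data_collator` to stack examples into a batch without any further padding logic.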

### **Teacher Training: Fine-Tune the Teacher**

We use the native Hugging Face `Trainer` (which we import as `HFTrainer`) to train the model. Check out the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/trainer) for more details on the `Trainer` as needed.


In [None]:
from transformers import Trainer as HFTrainer
from transformers import TrainingArguments as HFTrainingArguments

# setup trainer arguments
teacher_training_args = HFTrainingArguments(
    output_dir="./teacher_training",
    do_train=True,
    do_eval=True,
    num_train_epochs=6.0,
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32)

# initialize trainer
teacher_trainer = HFTrainer(
    model=teacher,
    args=teacher_training_args,
    train_dataset=teacher_tokenized_dataset["train"],
    eval_dataset=teacher_tokenized_dataset["validation"],
    tokenizer=teacher_tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics)

# run training
%rm -rf ./teacher_training
teacher_trainer.train(resume_from_checkpoint=False)
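
The teacher uses `lr_scheduler_type="linear"` with a base learning rate of `5e-6`. As a rough sketch of what that schedule does when no warmup is configured, the learning rate decays linearly from the base value to zero over training. The `linear_lr` helper and the step count are hypothetical; the real number of steps depends on the dataset size and batch size.

```python
def linear_lr(step, total_steps, base_lr=5e-6):
    """Learning rate that decays linearly from base_lr to 0 (no warmup)."""
    remaining = max(0.0, (total_steps - step) / total_steps)
    return base_lr * remaining


total_steps = 600  # hypothetical: 6 epochs x 100 optimizer steps per epoch

for step in (0, 300, 600):
    print(step, linear_lr(step, total_steps))
```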

## **Step 4: Sparse Transfer Learning**

Now that we have the teacher trained, we can run sparse transfer learning from the pre-sparsified version of BERT, with the teacher providing distillation support.

First, we need to select a sparse checkpoint to begin the training process. In this case, we are fine-tuning a 90% pruned version of BERT onto the TweetEval Emotion dataset. This model is available in SparseZoo, identified by the following stub:
```
zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none
```

We also need to create or select a sparsification recipe for the training process. Recipes are YAML files that encode the sparsity-related algorithms and parameters to be applied by SparseML. For Sparse Transfer Learning, we use a recipe that instructs SparseML to maintain sparsity during training and to apply quantization over the final few epochs. SparseZoo contains a transfer recipe that was used to fine-tune BERT onto the SST2 task (which is also a single-sequence classification problem). Since the TweetEval Emotion task is similar to SST2, we will use the SST2 recipe, which is identified by the following stub:

```
zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
```

Use the `sparsezoo` Python client to download the model and recipe using their SparseZoo stubs.

In [None]:
# downloads 90% pruned upstream BERT trained on MLM objective
model_stub = "zoo:nlp/masked_language_modeling/obert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned90-none" 
model_path = Model(model_stub, download_path="./model").training.path

# downloads transfer recipe for SST2 (pruned90_quant)
transfer_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
recipe_path = Model(transfer_stub, download_path="./transfer_recipe").recipes.default.path

We can see that the upstream model (trained on Wikipedia and BookCorpus) and its configuration files have been downloaded to the local directory.

In [None]:
%ls ./model/training

We can see that a transfer learning recipe has been downloaded. The `ConstantPruningModifier` instructs SparseML to maintain the sparsity structure of the network as the model trains and the `QuantizationModifier` instructs SparseML to run Quantization Aware Training at the end of training.

In [None]:
%cat ./transfer_recipe/recipe/recipe_original.md
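
To give some intuition for what Quantization Aware Training simulates, the sketch below "fake quantizes" a few weights: it rounds them onto an 8-bit grid and maps them back to floats, so the rest of the computation sees the rounding error. This is a conceptual illustration only, not the code that the `QuantizationModifier` runs; the `fake_quantize` helper and the sample weights are hypothetical.

```python
import numpy as np


def fake_quantize(x, num_bits=8):
    """Round x onto a num_bits unsigned integer grid, then map back to floats."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # fall back to 1.0 if range is 0
    zero_point = round(qmin - x_min / scale)         # integer that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale


weights = np.array([-1.0, -0.25, 0.0, 0.4, 1.0])
dequantized = fake_quantize(weights)
max_error = float(np.max(np.abs(weights - dequantized)))

print(dequantized)
print(max_error)
```

Training with this rounding in the loop lets the model adapt to the precision it will have after export, which is why the recipe only switches quantization on for the final epochs.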

Next, we will set up the Hugging Face tokenizer, config, and model. These are all native Hugging Face objects, so check out the Hugging Face docs for more details on `AutoModel`, `AutoConfig`, and `AutoTokenizer` as needed. We instantiate these classes by passing the local path to the directory containing the `pytorch_model.bin`, `tokenizer.json`, and `config.json` files from the SparseZoo download.

In [None]:
# initialize config
config = AutoConfig.from_pretrained(model_path, num_labels=NUM_LABELS)

# initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# initialize model using familiar HF AutoModel
model_kwargs = {"config": config}
model_kwargs["state_dict"], s_delayed = SparseAutoModel._loadable_state_dict(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, **model_kwargs,)

# prints metrics on sparsity profile
SparseAutoModel.log_model_load(model, model_path, "student", s_delayed)
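
As a rough illustration of what a per-layer sparsity figure means, the sketch below computes the fraction of exactly-zero entries for each named weight array. It is not what `log_model_load` does internally; the `sparsity_report` helper, the layer names, and the values are hypothetical.

```python
import numpy as np


def sparsity_report(named_weights):
    """Map each layer name to the fraction of its weights that are exactly zero."""
    return {name: float(np.mean(w == 0)) for name, w in named_weights.items()}


layers = {
    # hypothetical pruned encoder layer: 8 of 10 weights are zero
    "encoder.dense": np.array([[0.0, 0.0, 0.7, 0.0, 0.0],
                               [0.0, -0.2, 0.0, 0.0, 0.0]]),
    # hypothetical dense classification head: no zeros
    "classifier": np.array([[0.5, -0.1], [0.3, 0.9]]),
}

report = sparsity_report(layers)
for name, value in report.items():
    print(f"{name}: {value:.0%} sparse")
```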

### **Sparse Transfer Learning: Tokenize The Dataset**

Next, tokenize the dataset. 

Since we are using RoBERTa as the teacher model, which has a different tokenizer than BERT, we also run `teacher_tokenizer` and add its outputs to `result` under keys prefixed with `distill_teacher:` (for example, `distill_teacher:input_ids`). SparseML parses `result` and sends each set of tokens to the correct model during training.

Note that if the teacher and student share a tokenizer, we can skip adding the teacher tokens and SparseML will pass the single set of tokens to each model during training.


In [None]:
MAX_LEN = 128
def preprocess_fn(examples):
  args = None
  if INPUT_COL_2 is None:
    args = (examples[INPUT_COL_1], )
  else:
    args = (examples[INPUT_COL_1], examples[INPUT_COL_2])
  result = tokenizer(*args, 
                   padding="max_length", 
                   max_length=min(tokenizer.model_max_length, MAX_LEN), 
                   truncation=True)
  
  teacher_result = teacher_tokenizer(*args, 
                   padding="max_length", 
                   max_length=min(teacher_tokenizer.model_max_length, MAX_LEN), 
                   truncation=True)
  
  teacher_result = {
      f"distill_teacher:{tokenizer_key}": value
      for tokenizer_key, value in teacher_result.items()
  }
  result.update(teacher_result)
  return result

# tokenize the dataset
tokenized_dataset = dataset_from_json.map(
    preprocess_fn,
    batched=True,
    desc="Running tokenizer on dataset"
)
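
The `distill_teacher:` prefix is just a naming convention on dictionary keys. The sketch below shows how such a combined example could be separated back into student and teacher inputs. The `split_inputs` helper and the token ids are hypothetical and only illustrate the convention; SparseML's own routing logic is not shown in this notebook.

```python
TEACHER_PREFIX = "distill_teacher:"


def split_inputs(example):
    """Separate student inputs from teacher inputs by key prefix."""
    student, teacher = {}, {}
    for key, value in example.items():
        if key.startswith(TEACHER_PREFIX):
            # strip the prefix so the teacher sees ordinary input names
            teacher[key[len(TEACHER_PREFIX):]] = value
        else:
            student[key] = value
    return student, teacher


# Hypothetical tokenized example with both sets of tokens.
example = {
    "input_ids": [101, 2023, 102],
    "attention_mask": [1, 1, 1],
    "distill_teacher:input_ids": [0, 713, 2],
    "distill_teacher:attention_mask": [1, 1, 1],
    "label": 1,
}

student_inputs, teacher_inputs = split_inputs(example)
print(student_inputs)
print(teacher_inputs)
```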

### **Sparse Transfer Learning: Fine-Tune the Model**

SparseML has a custom `Trainer` class that inherits from the [Hugging Face `Trainer` Class](https://huggingface.co/docs/transformers/main_classes/trainer). As such, the SparseML `Trainer` has all of the existing functionality of the HF trainer. However, in addition, we can supply a `recipe` and (optionally) a `teacher`. 


As we saw above, the `recipe` encodes the sparsity-related algorithms and hyperparameters of the training process in a YAML file. The SparseML `Trainer` parses the `recipe` and adjusts the training workflow to apply the algorithms in the recipe. We use the `recipe_args` argument to modify the recipe slightly (training for fewer epochs than used for SST2).

The `teacher` is an optional argument that instructs SparseML to apply model distillation to support the training process. Here we pass the RoBERTa teacher trained in Step 3. Setting `teacher` to `disable` instead turns off distillation.
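
For intuition, a common formulation of knowledge distillation blends the usual cross-entropy on the labels with a KL-divergence term that pulls the student's softened predictions toward the teacher's. The NumPy sketch below follows that textbook formulation; it is not taken from SparseML, and the `hardness` and `temperature` parameter names, their defaults, and the logits are assumptions made for this example.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, labels,
                      hardness=0.5, temperature=2.0):
    """(1 - hardness) * cross-entropy + hardness * T^2 * KL(teacher || student)."""
    p = softmax(student_logits)
    ce = -np.mean(np.log(p[np.arange(len(labels)), labels]))
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)) * temperature ** 2
    return float((1.0 - hardness) * ce + hardness * kl)


# Hypothetical logits for 2 tweets over the 4 emotion classes.
student_logits = np.array([[1.0, 0.2, 0.1, 0.3], [0.1, 0.4, 0.2, 0.9]])
teacher_logits = np.array([[3.0, 0.1, 0.2, 0.1], [0.2, 0.1, 0.3, 2.5]])
labels = np.array([0, 3])

loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss)
```

With `hardness=0.0` the teacher has no influence and the loss reduces to plain cross-entropy, which is the behavior you would expect with distillation turned off.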

In [None]:
# setup trainer arguments
training_args = TrainingArguments(
    output_dir="./training_output",
    do_train=True,
    do_eval=True,
    resume_from_checkpoint=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=False)

# initialize trainer
trainer = Trainer(
    model=model,
    model_state_path=model_path,
    recipe=recipe_path,
    recipe_args='{"num_epochs": 10.0, "qat_start_epoch": 7.0, "observer_epoch": 9.0}',
    teacher=teacher,
    metadata_args=["per_device_train_batch_size","per_device_eval_batch_size","fp16"],
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics)

# run training
%rm -rf training_output
train_result = trainer.train()
trainer.save_model()
trainer.save_state()
trainer.save_optimizer_and_scheduler(training_args.output_dir)
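
The `recipe_args` value passed to the `Trainer` above is a JSON string. If you prefer not to hand-write JSON, one option is to build the overrides as a dictionary and serialize it. The sketch below uses only the standard library and the same three values as the cell above.

```python
import json

# Same overrides as in the Trainer call above.
overrides = {"num_epochs": 10.0, "qat_start_epoch": 7.0, "observer_epoch": 9.0}

# Serialize to the JSON string form used for recipe_args.
recipe_args = json.dumps(overrides)
print(recipe_args)

# Round trip back to a dict to confirm the string is valid JSON.
parsed = json.loads(recipe_args)
```

Building the string this way avoids quoting mistakes when you experiment with different epoch settings.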

In [None]:
!sparseml.transformers.export_onnx \
  --model_path training_output \
  --task text_classification

## **Optional: Deploy with DeepSparse**

In [None]:
%pip install deepsparse

In [None]:
from deepsparse import Pipeline

pipeline = Pipeline.create("text_classification", model_path="./deployment")

In [None]:
prediction = pipeline("@user Get Donovan out of your soccer booth. He's awful. He's bitter. He makes me want to mute the tv. #horrid")
print(prediction) # label 0 is anger

In [None]:
prediction = pipeline("@user Welcome to #MPSVT! We are delighted to have you! #grateful #MPSVT #relationships")
print(prediction) # label 1 is joy

In [None]:
prediction = pipeline("“The #optimist proclaims that we live in the best of all possible worlds; and the #pessimist fears this is true.” ~ James Branch Cabell")
print(prediction) # label 2 is optimism

In [None]:
prediction = pipeline("In need of a change! #restless")
print(prediction) # label 3 is sadness
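
The comments in the cells above give the meaning of each label id. A small helper can turn ids into readable names. This is a sketch under stated assumptions: the exact shape of the pipeline output is not shown in this notebook, so the `LABEL_<id>` string form handled below is an assumption, and `label_name` is a hypothetical helper.

```python
# Label ids and names as given in the comments above.
ID2LABEL = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}


def label_name(raw_label):
    """Map an id such as 2, "2", or "LABEL_2" (assumed format) to its emotion name."""
    class_id = int(str(raw_label).rsplit("_", 1)[-1])
    return ID2LABEL[class_id]


print(label_name(0))
print(label_name("LABEL_3"))
```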