# Sparse Transfer Hugging Face NLP Models using SparseML!

This notebook uses Neural Magic's [SparseML](https://github.com/neuralmagic/sparseml) library to convert a dense Hugging Face model into a lightweight, much faster sparsified model! This in turn has the potential to unlock thousands of dense models previously fine-tuned and uploaded to the Hugging Face Models Hub. 🚀🚀🚀

<br>

To learn more about sparse transfer learning, check out the docs [here](https://docs.neuralmagic.com/get-started/transfer-a-sparsified-model/nlp-text-classification).

<br>

This notebook allows devs to:
*   Install the SparseML library for sparse-transfer training.
*   Distill a dense model (teacher) onto a sparse pre-trained transformer (sparse student).
*   Export the sparse and dense models to ONNX format.
*   Benchmark the dense and sparse models using DeepSparse.

For a Hugging Face model to be eligible for sparse transfer, SparseML must:
*   Currently support the NLP task.
*   Currently support the model architecture.
*   Have the dataset of interest available for download.

<br>

---

<br>

In the example below, we'll sparse-transfer a dense BERT base uncased onto a pruned quantized oBERT from the Neural Magic [SparseZoo](https://sparsezoo.neuralmagic.com/?domain=nlp&sub_domain=masked_language_modeling&page=1). We'll use a [dense BERT](https://huggingface.co/nateraw/bert-base-uncased-emotion?text=I+like+you.+I+love+you) previously fine-tuned on the emotion dataset, which is a multi-class classification task, as our teacher model.

The [Emotion dataset](https://huggingface.co/datasets/emotion) consists of English Twitter messages with six basic emotions: `anger`, `fear`, `joy`, `love`, `sadness`, and `surprise`.

In [1]:
!nvidia-smi # double check you're in a GPU runtime

Thu Nov 17 15:35:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Install the SparseML library to get access to its fork of the Transformers library and the version of PyTorch required for training.

In [2]:
!pip install sparseml[torch]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sparseml[torch]
  Downloading sparseml-1.2.0-py3-none-any.whl (827 kB)
Collecting onnx<=1.10.1,>=1.5.0
  Downloading onnx-1.10.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.3 MB)
Collecting GPUtil>=1.4.0
  Downloading GPUtil-1.4.0.tar.gz (5.5 kB)
Collecting click~=8.0.0
  Downloading click-8.0.4-py3-none-any.whl (97 kB)
Collecting merge-args>=0.1.0
  Downloading merge_args-0.1.4-py2.py3-none-any.whl (5.8 kB)
Collecting jupyter>=1.0.0
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting sparsezoo~=1.2.0
  Downloading sparsezoo-1.2.0-py3-none-any.whl (90 kB)
Collecting toposort>=1.0
  Downloading to

Run the following CLI command to start sparse transfer learning. The dense model, passed in the `distill_teacher` argument, will transfer its knowledge onto a 6-layer pruned quantized BERT base student model, passed in the `model_name_or_path` argument.
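To make "transfer its knowledge" more concrete, here is a minimal, dependency-free sketch of a typical knowledge-distillation loss: a weighted mix of the usual cross-entropy on the true label and a temperature-softened KL divergence between teacher and student predictions. This is a generic textbook formulation, not necessarily what SparseML computes internally; the function names, the `temperature` and `alpha` parameters, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    `alpha` (hypothetical name) is the weight given to the teacher signal.
    """
    # Hard-label term: cross-entropy of the student against the true class.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label])

    # Soft-target term: KL(teacher || student) at the given temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        p * math.log(p / q) for p, q in zip(teacher_soft, student_soft) if p > 0
    )

    # Scaling by temperature**2 keeps gradient magnitudes comparable.
    return (1 - alpha) * hard_loss + alpha * (temperature ** 2) * soft_loss


# Made-up logits for one tweet over the six emotion classes.
teacher_logits = [0.2, -1.0, 4.0, 0.5, -0.3, -2.0]
student_logits = [0.0, -0.5, 2.5, 1.0, 0.1, -1.5]
loss = distillation_loss(student_logits, teacher_logits, label=2)
print(f"combined loss: {loss:.4f}")
```

When the student's logits match the teacher's exactly, the soft-target term is zero and only the hard-label term remains.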

The modifiers required to do this transfer can be found in the `recipe`. To learn more about recipes and their modifiers, see the [docs](https://docs.neuralmagic.com/user-guide/recipes).

Sparse transfer learning is considerably more sensitive to its training parameters than traditional fine-tuning. A good starting heuristic is to calibrate the two most critical ones first: the `initial learning rate` and the `number of epochs`. Both are hard-coded in the recipe. To speed things up for this example, we've already experimented with several values for these two parameters and override the recipe's defaults with custom values in the `recipe_args` argument. You will most likely need to override these values for any model you sparse-transfer.
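If you end up trying several combinations, it can help to generate the `recipe_args` value programmatically rather than hand-editing JSON. The sketch below assumes only the two keys used in this notebook (`num_epochs` and `init_lr`); the helper name `build_recipe_args` and the second candidate pair are hypothetical and only illustrate the idea.

```python
import json
import shlex


def build_recipe_args(num_epochs, init_lr):
    """Return the JSON string expected by the --recipe_args flag."""
    return json.dumps({"num_epochs": num_epochs, "init_lr": init_lr})


# The first pair matches this notebook; the second is a made-up alternative.
candidates = [(9, 0.000057), (12, 0.00003)]

for num_epochs, init_lr in candidates:
    value = build_recipe_args(num_epochs, init_lr)
    # shlex.quote wraps the JSON so it survives shell parsing intact.
    print("--recipe_args", shlex.quote(value))
```

Each printed line can be pasted into the training command in place of the existing `--recipe_args` flag.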

The following command produces a sparse model with roughly 92.5% accuracy on the validation dataset. Total training and evaluation time is about 55 minutes on a T4 GPU.
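As a rough sanity check on the run length, the arithmetic below estimates how many optimizer steps and evaluation passes the command performs. It assumes the Emotion train split holds 16,000 examples, a single GPU, and no gradient accumulation; the batch size, epoch count, and evaluation interval are taken from the command's flags.

```python
import math

TRAIN_EXAMPLES = 16_000  # assumption: size of the Emotion train split
BATCH_SIZE = 32          # --per_device_train_batch_size (single GPU assumed)
NUM_EPOCHS = 9           # "num_epochs" in --recipe_args
EVAL_STEPS = 200         # --eval_steps

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS
num_evals = total_steps // EVAL_STEPS  # evaluations triggered by the step interval

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"evaluations:     {num_evals}")
```

Because `--save_steps` uses the same interval as `--eval_steps`, a checkpoint is written at each of those evaluation points, with `--save_total_limit 3` capping how many are kept on disk.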

In [3]:
!sparseml.transformers.train.text_classification \
  --output_dir sparse_model \
  --model_name_or_path zoo:nlp/masked_language_modeling/obert-medium/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni \
  --distill_teacher nateraw/bert-base-uncased-emotion \
  --recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/pruned80_quant-none-vnni?recipe_type=transfer-text_classification \
  --dataset_name "emotion" \
  --recipe_args '{"num_epochs":9, "init_lr":0.000057}' \
  --do_train \
  --do_eval \
  --eval_steps 200 \
  --max_seq_length 128 \
  --evaluation_strategy steps \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 32 \
  --preprocessing_num_workers 8 \
  --fp16 \
  --seed 42 \
  --save_strategy steps \
  --save_steps 200 \
  --save_total_limit 3 \
  --overwrite_output_dir \
  --load_best_model_at_end

Streaming output truncated to the last 5000 lines.
        )
        (weight_fake_quant): FakeQuantize(
          fake_quant_enabled=tensor([1], device='cuda:0', dtype=torch.uint8), observer_enabled=tensor([1], device='cuda:0', dtype=torch.uint8), quant_min=-128, quant_max=127, dtype=torch.qint8, qscheme=torch.per_tensor_affine, ch_axis=-1, scale=tensor([0.0125], device='cuda:0'), zero_point=tensor([7], device='cuda:0')
          (activation_post_process): MovingAverageMinMaxObserver(min_val=-1.6807489395141602, max_val=1.498458743095398)
        )
      )
      (token_type_embeddings): Embedding(
        2, 768
        (activation_post_process): FakeQuantize(
          fake_quant_enabled=tensor([1], device='cuda:0', dtype=torch.uint8), observer_enabled=tensor([1], device='cuda:0', dtype=torch.uint8), quant_min=0, quant_max=255, dtype=torch.quint8, qscheme=torch.per_tensor_affine, ch_axis=-1, scale=tensor([0.0165], device='cuda:0'), zero_point=tensor([134], device='cuda:0

Now that we have a trained sparse model, we can export its PyTorch weights into ONNX format with the following command:

In [4]:
!sparseml.transformers.export_onnx --model_path sparse_model --task 'text_classification' --sequence_length 128

  warn(f"Failed to load image Python extension: {e}")
2022-11-17 16:25:55 sparseml.transformers.export INFO     Attempting onnx export for model at sparse_model for task text-classification
INFO:sparseml.transformers.export:Attempting onnx export for model at sparse_model for task text-classification
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sparse_model and are newly initialized: ['encoder.layer.7.attention.output.dense.weight', 'encoder.layer.0.attention.self.key.weight', 'encoder.layer.10.attention.self.value.weight', 'encoder.layer.5.attention.self.key.bias', 'encoder.layer.6.attention.self.key.weight', 'encoder.layer.7.attention.self.key.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.6.attention.self.key.bias', 'encoder.layer.8.attention.self.key.weight', 'encoder.layer.6.attention.output.dense.bias', 'encoder.layer.9.intermediate.dense.weight', 'encoder.layer.5.output.dense.bias', 'encoder.layer.7.output.Laye

Let's do the same for the dense model. First we'll download it to a directory named `dense_model`, and then we'll export it to ONNX:

In [5]:
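from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "nateraw/bert-base-uncased-emotion"  # dense teacher on the Hugging Face Hub
SAVE_DIR = "dense_model"  # relative path, matching --model_path in the export command

# Download the tokenizer and the fine-tuned weights from the Hub.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
# Save both into one directory so the ONNX exporter can load them.
tokenizer.save_pretrained(SAVE_DIR)
model.save_pretrained(SAVE_DIR)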

In [6]:
!sparseml.transformers.export_onnx --model_path dense_model --task 'text_classification' --sequence_length 128

  warn(f"Failed to load image Python extension: {e}")
2022-11-17 16:26:29 sparseml.transformers.export INFO     Attempting onnx export for model at dense_model for task text-classification
INFO:sparseml.transformers.export:Attempting onnx export for model at dense_model for task text-classification
2022-11-17 16:26:31 sparseml.transformers.utils.model INFO     Loaded model from dense_model with 109486854 total params. Of those there are 85529088 prunable params which have 0.0 avg sparsity.
INFO:sparseml.transformers.utils.model:Loaded model from dense_model with 109486854 total params. Of those there are 85529088 prunable params which have 0.0 avg sparsity.
2022-11-17 16:26:33 sparseml.transformers.utils.model INFO     dense model detected, all sparsification info: {"params_summary": {"total": 109486854, "sparse": 0, "sparsity_percent": 0.0, "prunable": 85529088, "prunable_sparse": 0, "prunable_sparsity_percent": 0.0, "quantizable": 85612806, "quantized": 0, "quantized_percent": 0.0}, 

Let's now install [DeepSparse](https://github.com/neuralmagic/deepsparse), benchmark both models on Colab's single CPU core, and compare their speeds!

In [7]:
!pip install deepsparse

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deepsparse
  Downloading deepsparse-1.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (39.0 MB)
Installing collected packages: deepsparse
Successfully installed deepsparse-1.2.0


In [8]:
!deepsparse.benchmark dense_model/model.onnx --batch_size 1

2022-11-17 16:26:58 deepsparse.benchmark.benchmark_model INFO     Thread pinning to cores enabled
2022-11-17 16:26:58 deepsparse.benchmark.benchmark_model INFO     num_streams default value chosen of 1. This requires tuning and may be sub-optimal
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.2.0 COMMUNITY EDITION | (45d54d49) (release) (optimized) (system=avx2, binary=avx2)
2022-11-17 16:27:30 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
	onnx_file_path: dense_model/model.onnx
	batch_size: 1
	num_cores: 1
	num_streams: 1
	scheduler: Scheduler.multi_stream
	cpu_avx_type: avx2
	cpu_vnni: False
2022-11-17 16:27:30 deepsparse.utils.onnx INFO     Generating input 'input_ids', type = int64, shape = [1, 128]
2022-11-17 16:27:30 deepsparse.utils.onnx INFO     Generating input 'attention_mask', type = int64, shape = [1, 128]
2022-11-17 16:27:30 deepsparse.utils.onnx INFO     Generating input 'token_type_ids', type = int64, shape = [1, 128]
2

In [9]:
!deepsparse.benchmark sparse_model/model.onnx --batch_size 1

2022-11-17 16:27:44 deepsparse.benchmark.benchmark_model INFO     Thread pinning to cores enabled
2022-11-17 16:27:44 deepsparse.benchmark.benchmark_model INFO     num_streams default value chosen of 1. This requires tuning and may be sub-optimal
DeepSparse Engine, Copyright 2021-present / Neuralmagic, Inc. version: 1.2.0 COMMUNITY EDITION | (45d54d49) (release) (optimized) (system=avx2, binary=avx2)
2022-11-17 16:28:10 deepsparse.benchmark.benchmark_model INFO     deepsparse.engine.Engine:
	onnx_file_path: sparse_model/model.onnx
	batch_size: 1
	num_cores: 1
	num_streams: 1
	scheduler: Scheduler.multi_stream
	cpu_avx_type: avx2
	cpu_vnni: False
2022-11-17 16:28:10 deepsparse.utils.onnx INFO     Generating input 'input_ids', type = int64, shape = [1, 128]
2022-11-17 16:28:10 deepsparse.utils.onnx INFO     Generating input 'attention_mask', type = int64, shape = [1, 128]
2022-11-17 16:28:10 deepsparse.utils.onnx INFO     Generating input 'token_type_ids', type = int64, shape = [1, 128]


Pretty incredible: the dense model gives us a latency of `410 ms`, while the new sparse model gives us a latency of only `54 ms`, roughly a 7.6X speedup! 🤯🤯🤯
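For reference, the speedup and the implied throughput follow directly from the two latencies quoted above. The short calculation below assumes batch size 1 and a single stream, as in the benchmark commands, so that throughput is simply the inverse of latency.

```python
dense_latency_ms = 410.0   # dense model latency reported above
sparse_latency_ms = 54.0   # sparse model latency reported above

# Speedup is the ratio of the two latencies.
speedup = dense_latency_ms / sparse_latency_ms

# With batch size 1 and one stream, items per second = 1000 / latency in ms.
dense_items_per_sec = 1000.0 / dense_latency_ms
sparse_items_per_sec = 1000.0 / sparse_latency_ms

print(f"speedup:           {speedup:.1f}x")
print(f"dense throughput:  {dense_items_per_sec:.1f} items/sec")
print(f"sparse throughput: {sparse_items_per_sec:.1f} items/sec")
```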

<br>

For more resources, you can always give [SparseML](https://github.com/neuralmagic/sparseml) and [DeepSparse](https://github.com/neuralmagic/deepsparse) a ⭐, and let us know what you think on our [Slack community channel](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ)!