# Approach
- Explore techniques to reduce the size of a trained huggingface.co transformers model
    - serialization
      - ONNX
      - TorchScript
    - quantization
      - native PyTorch
      - ONNX Runtime
    - pruning
    - change model architecture (requires retraining)
- Measure and compare runtime and RAM usage for each technique
- Measure change in model performance metrics for a sentiment analysis model

In [1]:
import os

import torch
torch.set_num_threads(1)

import optimize_models
from model_utils import save_model_and_tokenizer

## Baseline - roberta-base HuggingFace model
### Run inference with memory profiling and create plots
Note: An extra 0.4s wait time is added to model loading to ensure that memory due to model load (rather than inference) is measured properly.

In [2]:
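def profile_memory(model_path):
    """Profile inference memory with mprof and save the plot under plots/."""
    # https://pypi.org/project/memory-profiler/
    # Build a filesystem-safe plot name from the model path.
    model_name = model_path.replace("/", "_")
    model_name = model_name.replace(".", "")
    # Sample memory every 0.1 s, then plot the 0-30 s window of the run.
    command = "mprof run --interval 0.1 inference_profiling.py " + model_path + "; mprof plot -o plots/" + model_name + ".png -w 0,30"
    os.system(command)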

In [3]:
# Save a roberta-base model for sequence classification. Note that the classification
# head of this model is randomly initialized (we haven't trained the base model
# for the classification task), so the actual predictions are useless. The timing and
# memory consumption metrics, however, are still valid even if the predictions aren't.
save_model_and_tokenizer("roberta-base", "models/roberta-base")

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

In [4]:
profile_memory("models/roberta-base")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


***** Running Prediction *****
  Num examples = 16
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 2/2 [00:09<00:00,  4.92s/it]


Using last profile data.


# Serialization

## TorchScript Serialization
PyTorch models can be compiled into TorchScript using either tracing or scripting. During tracing, sample input is fed into the trained model and followed (traced) through the model computation graph, which is then frozen. HuggingFace seems to currently only support tracing.
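
To make the "traced, then frozen" idea concrete, here is a toy sketch in plain Python. It is not the torch.jit API: `Proxy`, `trace`, `run`, and `model` are hypothetical names invented for illustration. Operations applied to one sample input are recorded, and the recorded list is then replayed as the frozen graph. The point it shows is that a data-dependent branch is fixed to whichever path the sample input took.

```python
class Proxy:
    """Stands in for a tensor and records every operation applied to it."""

    def __init__(self, value, graph):
        self.value = value
        self.graph = graph

    def __mul__(self, k):
        self.graph.append(("mul", k))
        return Proxy(self.value * k, self.graph)

    def __add__(self, k):
        self.graph.append(("add", k))
        return Proxy(self.value + k, self.graph)


def model(x):
    # Data-dependent branch: tracing only sees the path taken by the sample input.
    if x.value > 0:
        return x * 2
    return x + 100


def trace(fn, sample):
    """Run fn once on the sample input and return the recorded operations."""
    graph = []
    fn(Proxy(sample, graph))
    return graph


def run(graph, x):
    """Replay the frozen graph on a new input."""
    for op, k in graph:
        x = x * k if op == "mul" else x + k
    return x


graph = trace(model, 3.0)
print(graph)                            # [('mul', 2)]
print(run(graph, -1.0))                 # -2.0 (frozen graph replays the "mul" branch)
print(model(Proxy(-1.0, [])).value)     # 99.0 (eager execution takes the other branch)
```

This is why the sample input used for tracing should exercise the same code path that real inputs will take.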

In [5]:
## PyTorch --> TorchScript
model_name = "models/roberta-base"
tokenizer_name = "models/roberta-base"
optimize_models.to_torchscript(model_name, tokenizer_name, "models/torchscript", "tracing")

In [6]:
profile_memory("models/torchscript/tracing.pt")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...
Using last profile data.


## ONNX Serialization
ONNX (Open Neural Network Exchange) is an open standard format for representing ML models. Deep learning models like transformers are converted to a computation graph.

ONNX Runtime is maintained by Microsoft and provides tools for running inference and training with ONNX models. It also provides tools for optimizing models, e.g., quantization.

Note: onnxruntime-tools contains tools for optimizing transformers-based models (https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers; https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/bert/Bert-GLUE_OnnxRuntime_quantization.ipynb). I have not experimented with these tools yet.

In [7]:
# https://github.com/huggingface/transformers/blob/af8afdc88dcb07261acf70aee75f2ad00a4208a4/src/transformers/convert_graph_to_onnx.py
# roberta-base
optimize_models.to_onnx("models/roberta-base", "models/onnx")

Using framework PyTorch: 1.10.2
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'logits'})
	- Validating ONNX Model output "logits":
		-[✓] (2, 2) matches (2, 2)
		-[✓] all values close (atol: 1e-05)
All good, model saved at: models/onnx/model.onnx


In [8]:
profile_memory("models/onnx/model.onnx")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...
Using last profile data.


# Quantization

Convert 32-bit floating-point values to int8. References:

https://pytorch.org/blog/introduction-to-quantization-on-pytorch/

https://onnxruntime.ai/docs/how-to/quantization.html

There are three types of quantization:

1. Dynamic quantization: Weights are converted to int8 ahead of time. Activations are converted to int8 on the fly, just before each computation, but are read from and written to memory as floating point.

2. Static quantization: Uses int8 arithmetic (like dynamic quantization) and also int8 memory access. Activation ranges are calibrated ahead of time by running representative data through the model.

3. Quantization-aware training (QAT): During training, weights and activations are "fake quantized" in the forward and backward passes. Computations are still performed with floating-point numbers, but the values are rounded to mimic int8.

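QAT typically gives the highest accuracy of the three, followed by dynamic, then static quantization.

Static quantization typically runs somewhat faster than dynamic quantization, because the activation ranges are fixed ahead of time instead of being computed during inference.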

From the ONNX Runtime documentation: "In general, it is recommended to use dynamic quantization for RNN and transformer-based models, and static quantization for CNN models."
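
As a rough illustration of what "convert to int8" means numerically, the sketch below maps a list of floats to int8 with an affine scale and zero point, then maps them back. It is a plain-Python sketch under simplifying assumptions (one scale for the whole list, min/max range), not the PyTorch or ONNX Runtime implementation; `quantize` and `dequantize` are hypothetical helper names. The quantize-then-dequantize round trip is also, in spirit, what "fake quantization" in QAT simulates.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using an affine (scale, zero-point) scheme."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The representable range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)         # int8 codes
print(restored)  # close to the original weights
print(max_err)   # rounding error, on the order of the scale
```

Each value now needs 8 bits instead of 32, at the cost of a rounding error bounded by roughly one quantization step.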

## PyTorch dynamic quantization

Quantize the linear layers of a PyTorch model and save the result as a TorchScript model using torch.jit.trace.

Note: There is currently no CUDA support, so model inference should be performed on CPU.

In [9]:
optimize_models.quantize_pytorch_model("models/roberta-base", "models/roberta-base", "models/quantized-int8")

In [10]:
profile_memory("models/quantized-int8/model.pt")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...
Using last profile data.


## ONNX dynamic quantization

https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/bert/Bert-GLUE_OnnxRuntime_quantization.ipynb

Quantizes both the linear and embedding layers.

In [11]:
optimize_models.quantize_onnx_model("models/onnx/model.onnx", "models/onnx/quantized-model.onnx")

In [12]:
profile_memory("models/onnx/quantized-model.onnx")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...
Using last profile data.


# Pruning

## Magnitude pruning: Discard weights with low absolute values.
This approach is most effective for models trained from scratch for a specific task, since the values of the weights reflect their importance for the task the model was trained on.
In transfer learning, however, this method is less effective, since the values of the weights mostly reflect the pre-training task rather than the fine-tuning task.
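
A minimal plain-Python sketch of magnitude pruning on a flat list of weights, assuming unstructured pruning with a single global threshold. `magnitude_prune` is a hypothetical helper, not part of `optimize_models`.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by |w|, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, -1.1, 0.02, 0.6]
print(magnitude_prune(weights, sparsity=0.5))  # [0.8, 0.0, 0.0, -1.1, 0.0, 0.6]
```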

## Movement pruning: Discard weights that decrease in absolute value during training.
This is more appropriate for the transfer learning/fine-tuning task - weights that shrink during training are not important for the fine-tuning task (their large values were actually counterproductive). These weights may be removed irrespective of their absolute value.

Reference: "Movement Pruning: Adaptive Sparsity by Fine-Tuning", Victor Sanh et al. (Hugging Face, Cornell), October 2020.

- Movement pruning demonstrates a better ability to adapt to the end task.
- It retains 95% of the original BERT performance with only 5% of the encoder's weights on NLI and question answering.
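
The contrast with magnitude pruning can be sketched in plain Python. This is a deliberate simplification: it scores each weight by how much its absolute value changed between the pre-trained and fine-tuned checkpoints, whereas the paper learns importance scores from gradients during fine-tuning. `movement_prune` and the example values are hypothetical.

```python
def movement_prune(before, after, sparsity):
    """Zero out the weights whose absolute value shrank the most during fine-tuning."""
    n_prune = int(len(after) * sparsity)
    # Movement score: positive if the weight moved away from zero, negative if it shrank.
    scores = [abs(a) - abs(b) for b, a in zip(before, after)]
    order = sorted(range(len(after)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(after)]


pretrained = [0.9, 0.1, -0.5, 0.05]
finetuned = [0.6, 0.3, -0.55, 0.01]
print(movement_prune(pretrained, finetuned, sparsity=0.5))  # [0.0, 0.3, -0.55, 0.0]
```

Note that the first weight is removed even though it still has the largest absolute value after fine-tuning; magnitude pruning would have kept it.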

## Pruning heads
Reference: "Are Sixteen Heads Really Better than One?", Michel et al., 2019, https://arxiv.org/abs/1905.10650

- Models that are trained with many heads can be pruned at inference time without significantly affecting performance.
- Approach: compute the relative importance of the attention heads and prune the least important ones.
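
The experiment below prunes heads at random rather than by importance. The implementation of `optimize_models.prune_random_heads` is not shown in this notebook, so the following is only a sketch of how such a random selection could be made. `choose_random_heads` is a hypothetical helper; it returns a `{layer_index: [head_indices]}` dictionary, which, as far as I know, is the shape that the `prune_heads` method of transformers models expects.

```python
import random


def choose_random_heads(num_layers, num_heads, fraction, seed=0):
    """Pick a random `fraction` of all attention heads, grouped by layer index."""
    rng = random.Random(seed)
    all_heads = [(layer, head) for layer in range(num_layers) for head in range(num_heads)]
    chosen = rng.sample(all_heads, int(len(all_heads) * fraction))
    heads_to_prune = {}
    for layer, head in sorted(chosen):
        heads_to_prune.setdefault(layer, []).append(head)
    return heads_to_prune


# roberta-base has 12 layers with 12 attention heads each (144 heads in total).
heads = choose_random_heads(num_layers=12, num_heads=12, fraction=0.25)
print(sum(len(h) for h in heads.values()))  # 36
```

An importance-based variant would replace the random sample with the lowest-scoring heads under some importance measure.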

In [13]:
# pruning heads - randomly prune some heads to get a sense of the impact on inference time and RAM usage:
base_model = "models/roberta-base"
optimize_models.prune_random_heads(base_model, 0.25, "models/pruned/25percent")
optimize_models.prune_random_heads(base_model, 0.5, "models/pruned/50percent")
optimize_models.prune_random_heads(base_model, 0.9, "models/pruned/90percent")

In [14]:
profile_memory("models/pruned/25percent")
profile_memory("models/pruned/50percent")
profile_memory("models/pruned/90percent")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


***** Running Prediction *****
  Num examples = 16
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 2/2 [00:08<00:00,  4.33s/it]


Using last profile data.
mprof: Sampling memory every 0.1s
running new process
running as a Python program...


***** Running Prediction *****
  Num examples = 16
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 2/2 [00:07<00:00,  3.70s/it]


Using last profile data.
mprof: Sampling memory every 0.1s
running new process
running as a Python program...


***** Running Prediction *****
  Num examples = 16
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 2/2 [00:06<00:00,  3.25s/it]


Using last profile data.


# Change Model Architecture (requires re-training)

DistilBERT: https://arxiv.org/abs/1910.01108
October 2019 - Retains 97% of BERT's language understanding performance while being 40% smaller and 60% faster.

DistilRoBERTa: 95% of RoBERTa-base's performance on GLUE, twice as fast as RoBERTa while being 35% smaller.

How does it work?

Knowledge distillation / student-teacher: a "student" (a smaller model) is trained to reproduce the behavior of a "teacher" (a larger model or an ensemble of models).

Triple loss function:

1. masked language modeling (MLM) objective
2. distillation loss - how closely the student's output probability distribution matches the teacher's
3. cosine embedding loss - aligns the directions of the student and teacher hidden-state vectors
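
The three terms can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the DistilBERT training code: the distillation term is written as a KL divergence between temperature-softened distributions, the MLM loss is passed in as a precomputed number, and the temperature and the weights `w_mlm`, `w_distil`, `w_cos` are hypothetical defaults.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity: zero when both vectors point in the same direction."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(mlm_loss, student_logits, teacher_logits, student_hidden, teacher_hidden,
                w_mlm=1.0, w_distil=1.0, w_cos=1.0):
    """Weighted sum of the three terms (the weights here are illustrative assumptions)."""
    return (w_mlm * mlm_loss
            + w_distil * distillation_loss(student_logits, teacher_logits)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


print(triple_loss(0.5, [2.0, 0.5, -1.0], [3.0, 0.2, -2.0], [0.1, 0.9], [0.2, 0.8]))
```

When the student matches the teacher exactly, the second and third terms vanish and only the MLM loss remains.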

### Question: Can we do our own student-teacher set-up?

Answer: Yes, it should be possible, though it would require some custom code. It's probably better to start by trying to fine-tune a classification model from distilroberta-base rather than roberta-base.

In [15]:
save_model_and_tokenizer("distilroberta-base", "models/distilroberta-base")

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.out_proj.weig

In [16]:
profile_memory("models/distilroberta-base")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


***** Running Prediction *****
  Num examples = 16
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 2/2 [00:05<00:00,  2.54s/it]


Using last profile data.


In [None]:
# distilroberta with serialization and dynamic quantization
optimize_models.to_onnx("models/distilroberta-base", "models/distilroberta-onnx")
optimize_models.quantize_onnx_model("models/distilroberta-onnx/model.onnx", "models/distilroberta-onnx/quantized-model.onnx")

In [None]:
# distilroberta with pruning (90%), serialization, and dynamic quantization
optimize_models.prune_random_heads("models/distilroberta-base", 0.9, "models/distilroberta-pruned/90percent")
optimize_models.to_onnx("models/distilroberta-pruned/90percent", "models/distilroberta-onnx-pruned")
optimize_models.quantize_onnx_model("models/distilroberta-onnx-pruned/model.onnx", "models/distilroberta-onnx-pruned/quantized-model.onnx")

In [18]:
profile_memory("models/distilroberta-onnx/quantized-model.onnx")
profile_memory("models/distilroberta-onnx-pruned/quantized-model.onnx")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...
Using last profile data.


# Examine Model Performance Metrics
- Accuracy
- Confusion Matrix
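
The two metrics are related: accuracy is the sum of the confusion matrix diagonal divided by the total number of examples. A small sketch, using the baseline confusion matrix reported further down in this notebook. `accuracy_from_confusion` and `per_row_rate` are hypothetical helpers, and reading the rows as true labels is an assumption, since `sentiment_inference.py` is not shown here.

```python
def accuracy_from_confusion(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_row_rate(matrix):
    """Diagonal entry divided by row sum; this is per-class recall if rows are true labels."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


baseline = [[19, 10, 0],
            [13, 37, 5],
            [1, 1, 14]]

print(accuracy_from_confusion(baseline))  # 0.7
print(per_row_rate(baseline))
```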

## Task: sentiment analysis using a model available on huggingface.co, based on roberta-base, trained on tweets

In [19]:
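def profile_memory_measure_performance(model_path):
    """Profile memory with mprof while sentiment_inference.py reports performance metrics."""
    # https://pypi.org/project/memory-profiler/
    # Build a filesystem-safe plot name from the model path.
    model_name = model_path.replace("/", "_")
    model_name = model_name.replace(".", "")
    # Sample memory every 0.1 s, then plot the 0-16 s window of the run.
    command = "mprof run --interval 0.1 sentiment_inference.py " + model_path + "; mprof plot -o plots/" + model_name + ".png -w 0,16"
    os.system(command)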

In [20]:
save_model_and_tokenizer("cardiffnlp/twitter-roberta-base-sentiment", "models/baseline-sentiment")

In [21]:
# Measure baseline model performance metrics
profile_memory_measure_performance("models/baseline-sentiment")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
***** Running Prediction *****
  Num examples = 100
  Batch size = 8


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 13/13 [00:13<00:00,  1.01s/it]



-------------------------------------
----- MODEL PERFORMANCE METRICS -----

accuracy 0.7

confusion matrix
[[19 10  0]
 [13 37  5]
 [ 1  1 14]]

-------------------------------------
-------------------------------------

Using last profile data.


## Model Serialization (shouldn't change model performance metrics)

### ONNX

In [22]:
optimize_models.to_onnx("models/baseline-sentiment", "models/onnx-sentiment")

Using framework PyTorch: 1.10.2
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'logits'})
	- Validating ONNX Model output "logits":
		-[✓] (2, 3) matches (2, 3)
		-[✓] all values close (atol: 1e-05)
All good, model saved at: models/onnx-sentiment/model.onnx


In [23]:
profile_memory_measure_performance("models/onnx-sentiment/model.onnx")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



-------------------------------------
----- MODEL PERFORMANCE METRICS -----

accuracy 0.7

confusion matrix
[[19 10  0]
 [13 37  5]
 [ 1  1 14]]

-------------------------------------
-------------------------------------

Using last profile data.


### TorchScript

In [24]:
model_name = "models/baseline-sentiment"
optimize_models.to_torchscript(model_name, model_name, "models/torchscript-sentiment", "tracing")

In [26]:
profile_memory_measure_performance("models/torchscript-sentiment/tracing.pt")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



-------------------------------------
----- MODEL PERFORMANCE METRICS -----

accuracy 0.7

confusion matrix
[[19 10  0]
 [13 37  5]
 [ 1  1 14]]

-------------------------------------
-------------------------------------

Using last profile data.


## ONNX quantization (should change model performance metrics somewhat)

In [27]:
optimize_models.quantize_onnx_model("models/onnx-sentiment/model.onnx", "models/onnx-sentiment/quantized-model.onnx")

In [28]:
profile_memory_measure_performance("models/onnx-sentiment/quantized-model.onnx")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



-------------------------------------
----- MODEL PERFORMANCE METRICS -----

accuracy 0.69

confusion matrix
[[19  9  1]
 [13 37  5]
 [ 1  2 13]]

-------------------------------------
-------------------------------------

Using last profile data.


## PyTorch native quantization (should change model performance metrics somewhat)

In [29]:
model_name = "models/baseline-sentiment"
optimize_models.quantize_pytorch_model(model_name, model_name, "models/quantized-int8-sentiment")

In [31]:
profile_memory_measure_performance("models/quantized-int8-sentiment/model.pt")

mprof: Sampling memory every 0.1s
running new process
running as a Python program...


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



-------------------------------------
----- MODEL PERFORMANCE METRICS -----

accuracy 0.71

confusion matrix
[[18 11  0]
 [10 40  5]
 [ 0  3 13]]

-------------------------------------
-------------------------------------

Using last profile data.
