diff --git a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune80.md b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune80.md index caf834a9d24..b6802a2d97c 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune80.md +++ b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune80.md @@ -15,10 +15,10 @@ limitations under the License. --> --- -# General variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.80 pruning_start_epoch: &pruning_start_epoch 2 @@ -26,7 +26,7 @@ pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -35,12 +35,12 @@ training_modifiers: pruning_modifiers: - !GMPruningModifier params: - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight + - re:bert.encoder.layer.*.attention.self.query.weight + - re:bert.encoder.layer.*.attention.self.key.weight + - re:bert.encoder.layer.*.attention.self.value.weight + - re:bert.encoder.layer.*.attention.output.dense.weight + - re:bert.encoder.layer.*.intermediate.dense.weight + - re:bert.encoder.layer.*.output.dense.weight start_epoch: *pruning_start_epoch end_epoch: *pruning_end_epoch init_sparsity: *init_sparsity @@ -52,21 +52,21 @@ pruning_modifiers: log_types: __ALL__ --- -# Bert model with pruned encoder layers +# BERT Model with Pruned Encoder Layers -This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 80% sparsity. It was used together with knowledge distillation to create sparse model that achives 100% recovery from its baseline accuracy on the Squad dataset. -Training was done using 1 GPU at half precision using a training batch size of 16 with the +This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 80% sparsity. It was used together with knowledge distillation to create a sparse model that completely recovers the F1 metric (88.596) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.) +Training was done using one V100 GPU at half precision and a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers). ## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/18qdx7b3?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/18qdx7b3?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation.
-Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -91,7 +91,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_80sparse_freq0.01_18prune10fine.md \ + --recipe ../recipes/bert-base-12layers_prune80.md \ --onnx_export_path MODELS_DIR/sparse80/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune90.md b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune90.md index 0518ee0dcec..778ec08020a 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune90.md +++ b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune90.md @@ -15,17 +15,17 @@ limitations under the License. --> --- -# General variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.90 pruning_start_epoch: &pruning_start_epoch 2 pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -34,12 +34,12 @@ training_modifiers: pruning_modifiers: - !GMPruningModifier params: - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight + - re:bert.encoder.layer.*.attention.self.query.weight + - re:bert.encoder.layer.*.attention.self.key.weight + - re:bert.encoder.layer.*.attention.self.value.weight + - re:bert.encoder.layer.*.attention.output.dense.weight + - re:bert.encoder.layer.*.intermediate.dense.weight + - re:bert.encoder.layer.*.output.dense.weight start_epoch: *pruning_start_epoch end_epoch: *pruning_end_epoch init_sparsity: *init_sparsity @@ -50,21 +50,21 @@ pruning_modifiers: mask_type: unstructured log_types: __ALL__ --- -# Bert model with pruned encoder layers +# BERT Model with Pruned Encoder Layers -This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 90% sparsity. It was used together with knowledge distillation to create sparse model that achives 98.4% recovery from its baseline accuracy on the Squad dataset. -Training was done using 1 GPU at half precision using a training batch size of 16 with the +This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 90% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 98.4% recovery from the F1 metric (88.596) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.) +Training was done using one V100 GPU at half precision and a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2ht2eqsn?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2ht2eqsn?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation. -Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -89,7 +89,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_90sparse_freq0.01_18prune10fine.md \ + --recipe ../recipes/bert-base-12layers_prune90.md \ --onnx_export_path MODELS_DIR/sparse90/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune95.md b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune95.md index 65fd9a7c121..b3b4e5e6516 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-12layers_prune95.md +++ b/integrations/huggingface-transformers/recipes/bert-base-12layers_prune95.md @@ -15,17 +15,17 @@ limitations under the License. --> --- -# General variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.95 pruning_start_epoch: &pruning_start_epoch 2 pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -34,12 +34,12 @@ training_modifiers: pruning_modifiers: - !GMPruningModifier params: - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.query.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.key.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.self.value.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).attention.output.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).intermediate.dense.weight - - re:bert.encoder.layer.([0,2,4,6,8]|11).output.dense.weight + - re:bert.encoder.layer.*.attention.self.query.weight + - re:bert.encoder.layer.*.attention.self.key.weight + - re:bert.encoder.layer.*.attention.self.value.weight + - re:bert.encoder.layer.*.attention.output.dense.weight + - re:bert.encoder.layer.*.intermediate.dense.weight + - re:bert.encoder.layer.*.output.dense.weight start_epoch: *pruning_start_epoch end_epoch: *pruning_end_epoch init_sparsity: *init_sparsity @@ -51,21 +51,21 @@ pruning_modifiers: log_types: __ALL__ --- -# Bert model with pruned encoder layers +# BERT Model with Pruned Encoder Layers -This recipe defines a pruning strategy to sparsify all encoder layers of a Bert model at 95% sparsity. It was used together with knowledge distillation to create sparse model that achives 94.7% recovery from its baseline accuracy on the Squad dataset. +This recipe defines a pruning strategy to sparsify all encoder layers of a BERT model at 95% sparsity. 
It was used together with knowledge distillation to create a sparse model that achieves 94.7% recovery from the F1 metric of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.) Training was done using 1 GPU at half precision using a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers). ## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3gv0arxd?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3gv0arxd?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation. -Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -90,7 +90,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_95sparse_freq0.01_18prune10fine.md \ + --recipe ../recipes/bert-base-12layers_prune95.md \ --onnx_export_path MODELS_DIR/sparse95/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune80.md b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune80.md index 7e81ed98059..235574b81c4 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune80.md +++ b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune80.md @@ -15,17 +15,17 @@ limitations under the License. --> --- -# General variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.80 pruning_start_epoch: &pruning_start_epoch 2 pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -60,21 +60,21 @@ pruning_modifiers: - bert.encoder.layer.10 --- -# Bert model with dropped and pruned encoder layers +# BERT Model with Dropped and Pruned Encoder Layers -This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 80% sparsity. It was used together with knowledge distillation to create sparse model that achives 97% recovery from its (teacher) baseline accuracy on the Squad dataset. -Training was done using 1 GPU at half precision using a training batch size of 16 with the +This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 80% sparsity. It was used together with knowledge distillation to create a sparse model that exceeds the F1 metric (83.632) of the baseline model by 0.02% on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.)
+Training was done using one V100 GPU at half precision and a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers). ## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/ebab4np4?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/ebab4np4?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation. -Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_80sparse_freq0.01_18prune10fine_6layers.md \ + --recipe ../recipes/bert-base-6layers_prune80.md \ --onnx_export_path MODELS_DIR/sparse80_6layers/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune90.md b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune90.md index 12ace7a668c..5a1bda3c294 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune90.md +++ b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune90.md @@ -15,17 +15,17 @@ limitations under the License. --> --- -# General Epoch/LR variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.90 pruning_start_epoch: &pruning_start_epoch 2 pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -60,21 +60,21 @@ pruning_modifiers: - bert.encoder.layer.10 --- -# Bert model with dropped and pruned encoder layers +# BERT Model with Dropped and Pruned Encoder Layers -This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 90% sparsity. It was used together with knowledge distillation to create sparse model that achives 94.5% recovery from its (teacher) baseline accuracy on the Squad dataset. -Training was done using 1 GPU at half precision using a training batch size of 16 with the +This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 90% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 99.9% recovery from the F1 metric (83.632) of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.) +Training was done using one V100 GPU at half precision and a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3qvxoroz?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3qvxoroz?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation. -Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_90sparse_freq0.01_18prune10fine_6layers.md \ + --recipe ../recipes/bert-base-6layers_prune90.md \ --onnx_export_path MODELS_DIR/sparse90_6layers/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune95.md b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune95.md index a7291986147..c46d0d15001 100644 --- a/integrations/huggingface-transformers/recipes/bert-base-6layers_prune95.md +++ b/integrations/huggingface-transformers/recipes/bert-base-6layers_prune95.md @@ -15,17 +15,17 @@ limitations under the License. --> --- -# General Epoch/LR variables +# General Variables num_epochs: &num_epochs 30 -# pruning hyperparameters +# Pruning Hyperparameters init_sparsity: &init_sparsity 0.00 final_sparsity: &final_sparsity 0.95 pruning_start_epoch: &pruning_start_epoch 2 pruning_end_epoch: &pruning_end_epoch 20 update_frequency: &pruning_update_frequency 0.01 -# modifiers: +# Modifiers training_modifiers: - !EpochRangeModifier end_epoch: 30 @@ -60,21 +60,21 @@ pruning_modifiers: - bert.encoder.layer.10 --- -# Bert model with dropped and pruned encoder layers +# BERT Model with Dropped and Pruned Encoder Layers -This recipe defines a dropping and pruning strategy to sparsify 6 encoder layers of a Bert model at 95% sparsity. It was used together with knowledge distillation to create sparse model that achives 90% recovery from its (teacher) baseline accuracy on the Squad dataset. -Training was done using 1 GPU at half precision using a training batch size of 16 with the +This recipe defines a dropping and pruning strategy to sparsify six encoder layers of a BERT model at 95% sparsity. It was used together with knowledge distillation to create a sparse model that achieves 96.2% recovery from the F1 metric of the baseline model on the SQuAD dataset. (We use the checkpoint at the end of the first 2 epochs as the baseline model for comparison, right before the pruning takes effect.) +Training was done using one V100 GPU at half precision and a training batch size of 16 with the [SparseML integration with huggingface/transformers](https://github.com/neuralmagic/sparseml/tree/main/integrations/huggingface-transformers).
## Weights and Biases -- [Sparse Bert on Squad](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3plynclw?workspace=user-neuralmagic) +- [Sparse BERT on SQuAD](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3plynclw?workspace=user-neuralmagic) ## Training To set up the training environment, follow the instructions on the [integration README](https://github.com/neuralmagic/sparseml/blob/main/integrations/huggingface-transformers/README.md). Using the `run_qa.py` script from the question-answering examples, the following command can be used to launch this recipe with distillation. -Adjust the training command below with your setup for GPU device, checkpoint saving frequency and logging options. +Adjust the training command below with your setup for GPU device, checkpoint saving frequency, and logging options. *training command* ``` @@ -99,7 +99,7 @@ python transformers/examples/pytorch/question-answering/run_qa.py \ --distill_temperature 2.0 \ --save_steps 1000 \ --save_total_limit 2 \ - --recipe ../recipes/uni_95sparse_freq0.01_18prune10fine_6layers.md \ + --recipe ../recipes/bert-base-6layers_prune95.md \ --onnx_export_path MODELS_DIR/sparse95_6layers/onnx \ --report_to wandb ``` diff --git a/integrations/huggingface-transformers/setup_integration.sh b/integrations/huggingface-transformers/setup_integration.sh index d3ad9d377fd..7a851c50893 100644 --- a/integrations/huggingface-transformers/setup_integration.sh +++ b/integrations/huggingface-transformers/setup_integration.sh @@ -4,6 +4,7 @@ # Creates a transformers folder next to this script with all required dependencies from the huggingface/transformers repository. # Command: `bash setup_integration.sh` -git clone https://github.com/huggingface/transformers.git +git clone https://github.com/neuralmagic/transformers.git cd transformers pip install -e . +pip install datasets diff --git a/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_EM.png b/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_EM.png new file mode 100644 index 00000000000..3fedade6fd0 Binary files /dev/null and b/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_EM.png differ diff --git a/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_F1.png b/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_F1.png new file mode 100644 index 00000000000..d4c45f18402 Binary files /dev/null and b/integrations/huggingface-transformers/tutorials/images/bert_12_6_layers_F1.png differ diff --git a/integrations/huggingface-transformers/tutorials/sparsifying_bert_using_recipes.md b/integrations/huggingface-transformers/tutorials/sparsifying_bert_using_recipes.md new file mode 100644 index 00000000000..1f07f4ab322 --- /dev/null +++ b/integrations/huggingface-transformers/tutorials/sparsifying_bert_using_recipes.md @@ -0,0 +1,155 @@

# Sparsifying BERT Models Using Recipes

This tutorial presents an essential extension from SparseML to the Hugging Face Transformers training workflow to support model sparsification that includes knowledge distillation, parameter pruning, and layer dropping. The examples used in this tutorial are specifically for BERT base uncased models, trained and pruned on the SQuAD dataset; further support and results will be available for other datasets in the near future.
## Overview

Neural Magic’s ML team creates recipes that allow anyone to plug in their data and leverage SparseML’s recipe-driven approach on top of Hugging Face’s robust training pipelines. Sparsifying involves removing redundant information from neural networks using algorithms such as pruning and quantization, among others. This sparsification process results in many benefits for deployment environments, including faster inference and smaller file sizes. Unfortunately, many have not realized the benefits due to the complicated process and number of hyperparameters involved.

Working through this tutorial, you will experience how Neural Magic recipes simplify the sparsification process by:

- Creating a pre-trained teacher model for knowledge distillation.

- Applying a recipe to select the trade-off between the amount of recovery of the baseline training performance and the amount of sparsification for inference performance.

- Exporting a pruned model to the ONNX format to run with an inference engine such as DeepSparse.

All the results listed in this tutorial are available publicly through a [Weights and Biases project](https://wandb.ai/neuralmagic/sparse-bert-squad?workspace=user-neuralmagic).

![F1 scores of the pruned 12-layer and 6-layer BERT models](images/bert_12_6_layers_F1.png)

![EM scores of the pruned 12-layer and 6-layer BERT models](images/bert_12_6_layers_EM.png)
## Need Help?
For Neural Magic Support, sign up or log in to get help with your questions in our **Tutorials channel:** [Discourse Forum](https://discuss.neuralmagic.com/) and/or [Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).

## Creating a Pretrained Teacher Model

Before applying one of the pruning recipes with the distillation approach, we need a "teacher" model pretrained on the dataset. In our experiments, we trained the BERT model adapted to SQuAD for two epochs, resulting in a teacher model with EM/F1 metrics of 80.9/88.4. The `run_qa.py` script can be used for this purpose as follows.

```bash
python transformers/examples/pytorch/question-answering/run_qa.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --per_device_train_batch_size 16 \
  --learning_rate 3e-5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir MODELS_DIR/bert-base-12layers \
  --cache_dir cache \
  --preprocessing_num_workers 8 \
  --fp16 \
  --num_train_epochs 2 \
  --warmup_steps 5400 \
  --report_to wandb
```

If the command runs successfully, you should have a model folder called `bert-base-12layers` in the provided model directory `MODELS_DIR`.

## Applying Pruning Recipes

Using the teacher model `bert-base-12layers` above, you can now train and prune a "student" BERT model on the same dataset using knowledge distillation. `SparseML` extends the training script `run_qa.py` with the following arguments to support recipes and knowledge distillation:

- `--recipe`: path to a YAML recipe file that defines, among other information, the parameters and the desired sparsity levels to prune;
- `--distill_teacher`: path to the teacher model for distillation; the student model is trained to learn from both its correct targets and those "instructed" by the teacher model;
- `--distill_hardness`: ratio (in `[0.0, 1.0]`) of the loss defined on the teacher model targets;
- `--distill_temperature`: the temperature used to soften the distribution of the targets.

Additionally, you will use the argument `--onnx_export_path` to specify the destination folder for the exported ONNX model. The resulting exported model can then be used for inference with the `DeepSparse Engine`.

The following command prunes the model over 30 epochs to 80% sparsity of the encoder layers:

```bash
python transformers/examples/pytorch/question-answering/run_qa.py \
  --model_name_or_path bert-base-uncased \
  --distill_teacher MODELS_DIR/bert-base-12layers \
  --distill_hardness 1.0 \
  --distill_temperature 2.0 \
  --dataset_name squad \
  --do_train \
  --do_eval \
  --evaluation_strategy epoch \
  --per_device_train_batch_size 16 \
  --learning_rate 5e-5 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir MODELS_DIR/bert-base-12layers_prune80 \
  --cache_dir cache \
  --preprocessing_num_workers 6 \
  --fp16 \
  --num_train_epochs 30 \
  --recipe ../recipes/bert-base-12layers_prune80.md \
  --onnx_export_path MODELS_DIR/bert-base-12layers_prune80/onnx \
  --report_to wandb
```

The directory `recipes` contains information about the recipes and training commands used to produce our pruned BERT models on the SQuAD dataset.

### Dropping Layers

In some situations, you might want to drop certain layers from a BERT model and retrain it on your own dataset.
`SparseML` supports these use cases with a modifier called `LayerPruningModifier` that can be used as part of a pruning recipe. As an example, below is a modifier that drops layers 5 and 7 from a BERT model:
```
!LayerPruningModifier
  layers: ['bert.encoder.layer.5', 'bert.encoder.layer.7']
```

The directory `recipes` contains recipes, for example `bert-base-6layers_prune80`, that drop six layers from the model before applying pruning.

The following table presents the recipes in the directory, the corresponding results, and `wandb` logging for our pruned BERT models.

| Recipe name | Description | EM / F1 | Weights and Biases |
|-------------|-------------|---------|--------------------|
| bert-base-12layers | BERT fine-tuned on SQuAD | 80.927 / 88.435 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/w3b1ggyq?workspace=user-neuralmagic) |
| bert-base-12layers_prune80 | Prune baseline model fine-tuned on SQuAD at 80% sparsity of encoder units | 81.372 / 88.62 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/18qdx7b3?workspace=user-neuralmagic) |
| bert-base-12layers_prune90 | Prune baseline model fine-tuned on SQuAD at 90% sparsity of encoder units | 79.376 / 87.229 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/2ht2eqsn?workspace=user-neuralmagic) |
| bert-base-12layers_prune95 | Prune baseline model fine-tuned on SQuAD at 95% sparsity of encoder units | 74.939 / 83.929 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3gv0arxd?workspace=user-neuralmagic) |
| bert-base-6layers_prune80 | Prune 6-layer model fine-tuned on SQuAD at 80% sparsity of encoder units | 78.042 / 85.915 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/ebab4np4?workspace=user-neuralmagic) |
| bert-base-6layers_prune90 | Prune 6-layer model fine-tuned on SQuAD at 90% sparsity of encoder units | 75.08 / 83.602 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3qvxoroz?workspace=user-neuralmagic) |
| bert-base-6layers_prune95 | Prune 6-layer model fine-tuned on SQuAD at 95% sparsity of encoder units | 70.946 / 80.483 | [wandb](https://wandb.ai/neuralmagic/sparse-bert-squad/runs/3plynclw?workspace=user-neuralmagic) |


## Exporting for Inference

The sparsification run with the argument `--onnx_export_path` creates an ONNX model that can be used for benchmarking with the `DeepSparse Engine`. You can export a model as part of the training and pruning process (as in the commands above), or after the model is pruned.

The following command evaluates a pruned model and converts it to the ONNX format:

```bash
python transformers/examples/pytorch/question-answering/run_qa.py \
  --model_name_or_path MODELS_DIR/bert-base-12layers_prune80 \
  --dataset_name squad \
  --do_eval \
  --per_device_eval_batch_size 64 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --cache_dir cache \
  --preprocessing_num_workers 6 \
  --onnx_export_path MODELS_DIR/bert-base-12layers_prune80/onnx
```

If it runs successfully, you will have the converted `model.onnx` in `MODELS_DIR/bert-base-12layers_prune80/onnx`. You can now run it in ONNX-compatible inference engines such as [DeepSparse](https://github.com/neuralmagic/deepsparse). The `DeepSparse Engine` is explicitly coded to support running sparsified models for significant improvements in inference performance.
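If you want a quick sanity check of the exported model outside of the training script, the sketch below loads it with the DeepSparse Python API (installable via `pip install deepsparse`) and runs a single forward pass. The model path, sequence length, and input order (`input_ids`, `attention_mask`, `token_type_ids`) are assumptions based on the export command above; for real predictions, tokenize a question/context pair with the matching Hugging Face tokenizer instead of using dummy arrays.

```python
# Minimal sketch (not part of the original scripts): run the exported ONNX model
# with the DeepSparse Engine using dummy inputs shaped like the QA export above.
import numpy as np
from deepsparse import compile_model

onnx_path = "MODELS_DIR/bert-base-12layers_prune80/onnx/model.onnx"  # assumed export location
batch_size, seq_len = 1, 384  # matches --max_seq_length used during export

engine = compile_model(onnx_path, batch_size=batch_size)

# Dummy int64 inputs in the order the exported QA model typically expects:
# input_ids, attention_mask, token_type_ids.
inputs = [
    np.zeros((batch_size, seq_len), dtype=np.int64),  # input_ids
    np.ones((batch_size, seq_len), dtype=np.int64),   # attention_mask
    np.zeros((batch_size, seq_len), dtype=np.int64),  # token_type_ids
]

outputs = engine.run(inputs)  # start and end logits for answer-span prediction
print([o.shape for o in outputs])
```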
## Wrap-Up

Neural Magic recipes simplify the sparsification process by encoding the hyperparameters and instructions needed to create highly accurate pruned BERT models. In this tutorial, you created a pre-trained model to establish a baseline, applied a Neural Magic recipe for sparsification, and exported the pruned model to ONNX to run through an inference engine.

For Neural Magic Support, sign up or log in to get help with your questions in our **Tutorials channel:** [Discourse Forum](https://discuss.neuralmagic.com/) and/or [Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).