From cf4f6dcd51674fa303edee0da73e1410c7b9dd09 Mon Sep 17 00:00:00 2001 From: "Zhang, Weiwei1" Date: Thu, 14 Sep 2023 20:27:11 +0800 Subject: [PATCH 01/10] refine pruner docs Signed-off-by: Zhang, Weiwei1 --- docs/source/pruning.md | 124 ++++++++++++++++- .../language-modeling/pruning/eager/README.md | 16 ++- .../compression/pruner/README.md | 127 +++++++++++++++++- 3 files changed, 254 insertions(+), 13 deletions(-) diff --git a/docs/source/pruning.md b/docs/source/pruning.md index c91865a8862..10de002d8d4 100644 --- a/docs/source/pruning.md +++ b/docs/source/pruning.md @@ -243,6 +243,24 @@ Regularization is a technique that discourages learning a more complex model and +### Large Language Model Pruning + +To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1 / 2:4 / 4:8). + +- Retrain-free + + The retrain-free algorithm is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layer in Multi-Layer Perceptron (MLP), it effectively achieves a 20% sparsity per layer while preserving accuracy with an accuracy loss of less than 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs. + + For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions [Retrain-free Pruning API](#Retrain-free-Pruning-API) + +- SparseGPT + + This algorithm is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1, 2:4, and 4:8, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. By applying this method, it is possible to achieve up to 50% sparsity on models with sizes larger than 10 billion parameters(More model parameters are less sensitive to sparsity), all while maintaining an accuracy loss of less than 1%. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs. Additionally, it is worth mentioning that larger model parameters tend to be less impacted by sparsity. + + For a smooth initiation with sparseGPT algorithm, please refer to the API instructions provided [SparseGPT Pruning API](#SparseGPT-Pruning-API) + + + ### Pruning Support Matrix @@ -255,12 +273,11 @@ Regularization is a technique that discourages learning a more complex model and ## Get Started with Pruning API - Neural Compressor `Pruning` API is defined under `neural_compressor.training`, which takes a user-defined configure object as input. Users can pass the customized training/evaluation functions to `Pruning` in various scenarios. - +### Training-aware pruning API The following section exemplifies how to use hooks in user pass-in training function to perform model pruning. Through the pruning API, multiple pruner objects are supported in one single Pruning object to enable layer-specific configurations and a default set is used as a complement. - Step 1: Define a dict-like configuration in your training codes. Usually only 5-7 configuration items need to be identified. 
For customized pruning, a configuration template is shown below: @@ -297,7 +314,7 @@ The following section exemplifies how to use hooks in user pass-in training func ] ``` -- Step 2: Enable pruning functionalities +- Step 2: Enable pruning functionalities [**Experimental option** ]Modify model and optimizer. @@ -345,7 +362,98 @@ The following section exemplifies how to use hooks in user pass-in training func compression_manager.callbacks.on_train_end() ``` -In the case mentioned above, pruning process can be done by pre-defined hooks in Neural Compressor. Users need to place those hooks inside the training function. + In the case mentioned above, pruning process can be done by pre-defined hooks in Neural Compressor. Users need to place those hooks inside the training function. + + +### Retrain-free Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 5-7 configuration items need to be identified. + + If the name of the layer to be pruned is known, you can create a pruning_config manually by inputting the relevant information. This allows for greater customization and control over the pruning process. + + ```python + pruning_configs = [ + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": ['.fc', '.mlp'], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. + "pattern": "channelx1", # Default pruning pattern. + "pruning_op_types": ["Linear"], + }, + ] + ``` + + If you find yourself uncertain about the names of the linear modules within the MLP of the model or desire a simplified approach to setting up pruning, you can utilize a module that automatically generates a config: + + ```python + # auto config + from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs=[] + auto_configs = parse_auto_slim_config( + model, + ffn2_sparsity = args.target_sparsity, #e.g. 0.2 + mha_sparsity = 0, + pruning_scope = "global", + pruning_type = "retrain_free", + ) + pruning_configs += auto_configs + ``` + + +- Step 2: Enable pruning functionalities + The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. + + ```python + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners(optional) + pattern=args.pruning_pattern, + start_step=pruning_start, + end_step=pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + +### SparseGPT Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. 
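+        # note: op_names/excluded_op_names entries are matched against module names regex-style,
+        # which is why ".*" above selects every Linear layer.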
+ "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] + ``` + +- Step 2: Enable pruning functionalities + By providing the pruning config, calibration dataset, and specifying the desired device card number, the pruning process can be executed automatically with a simple API call. + + ```python + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners + pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 + ) + config = WeightPruningConfig(pruning_configs) + # for example: device = "cuda:1" + pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device) # modify the model and complete the pruning + ``` + ## Examples @@ -360,6 +468,10 @@ The pruning technique is validated on typical models across various domains (in "Experimental" annotation means these examples codes are ready but pruning results are under improvements. Please don't hesitate to try these codes with different configurations to get better pruning results! +- Language Modeling + + Sparsity is effectively implemented through various pruning patterns in Causal language modeling (CLM) tasks. [Language-modeling examples](../../../examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager). + - Text Classification Sparsity is implemented in different pruning patterns of MRPC and SST-2 tasks [Text-classification examples](../../examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager). @@ -395,3 +507,7 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. [2] Zafrir, Ofir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. "Prune once for all: Sparse pre-trained language models." arXiv preprint arXiv:2111.05754 (2021). + +[3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. + +[4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. diff --git a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md index 93e2d009ab1..010c74d9aef 100644 --- a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md +++ b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md @@ -23,11 +23,14 @@ The dataset will be downloaded automatically from the datasets Hub. See more about loading [huggingface dataset](https://huggingface.co/docs/datasets/loading_datasets.html) + # Run Examples -Intel® Neural Compressor supports pruning and slimming operations for LLMs without retraining. Experimentally verified pruning at the MLP layers with channel-wise pattern, which can achieve 10%-20% sparsity and speed up inference while accuracy drops < 1% [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py). 
-The 1x1 and N:M pruning formats are supported using the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py), with models up to 70B being able to be pruned in less than an hour and achieving 40%-50% sparsity in the MHA and MLP layers, while models of 7B and above may have a accuracy drop < 1%. -There are pruning scripts for LLM sparse models (GPT-j, BLOOM, OPT, LLaMA etc). The sparse model can be obtained by modifying pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/). +Intel® Neural Compressor provides pruning and slimming capabilities for Large Language Models(LLMs) without the need for retraining. +Through experimental verification, we have found that pruning MLP layers using channel-wise patterns can achieve a sparsity of 10%-20% without sacrificing accuracy, thereby significantly speeding up inference by <1%. [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py). + +The pruning patterns of 1x1, 2:4 and 4:8 are supported through the use of the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py), It is possible to prune models up to 70B in size within two hour, achieving a sparsity of 40%-50% in both the MHA and MLP layers. For models of 7B and above, the drop in accuracy is less than 1%. +Pruning scripts are available for LLM sparse models such as GPT-j, BLOOM, OPT, LLaMA, and the sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/). ### Results @@ -82,8 +85,11 @@ The last word acc of the 1x1 pattern sparse model using the sparseGPT algorithm ## References -* [A Fast Post-Training Pruning Framework for Transformers](https://arxiv.org/abs/2204.09656) -* [SparseGPT: Massive Language Models Can be Accurately Pruned in One-shot](https://arxiv.org/abs/2301.00774) + +[1] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. + +[2] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. + diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index e6d403c979d..f5cce88a4db 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -244,6 +244,25 @@ Regularization is a technique that discourages learning a more complex model and +### Large Language Model Pruning + +To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1 / 2:4 / 4:8). + +- Retrain-free + + The retrain-free algorithm is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. 
By incorporating channel pruning and sparse model slimming for the linear layer in Multi-Layer Perceptron (MLP), it effectively achieves a 20% sparsity per layer while preserving accuracy with an accuracy loss of less than 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs. + + For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions [Retrain-free Pruning API](#Retrain-free-Pruning-API) + +- SparseGPT + + This algorithm is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1, 2:4, and 4:8, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. By applying this method, it is possible to achieve up to 50% sparsity on models with sizes larger than 10 billion parameters(More model parameters are less sensitive to sparsity), all while maintaining an accuracy loss of less than 1%. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs. Additionally, it is worth mentioning that larger model parameters tend to be less impacted by sparsity. + + For a smooth initiation with sparseGPT algorithm, please refer to the API instructions provided [SparseGPT Pruning API](#SparseGPT-Pruning-API) + + + + ### Pruning Support Matrix |Framework | Status | @@ -255,12 +274,11 @@ Regularization is a technique that discourages learning a more complex model and ## Get Started with Pruning API - Neural Compressor `Pruning` API is defined under `neural_compressor.training`, which takes a user-defined configure object as input. Users can pass the customized training/evaluation functions to `Pruning` in various scenarios. - +### Training-aware pruning API The following section exemplifies how to use hooks in user pass-in training function to perform model pruning. Through the pruning API, multiple pruner objects are supported in one single Pruning object to enable layer-specific configurations and a default set is used as a complement. - Step 1: Define a dict-like configuration in your training codes. Usually only 5-7 configuration items need to be identified. For customized pruning, a configuration template is shown below: @@ -297,7 +315,7 @@ The following section exemplifies how to use hooks in user pass-in training func ] ``` -- Step 2: Enable pruning functionalities +- Step 2: Enable pruning functionalities [**Experimental option** ]Modify model and optimizer. @@ -345,7 +363,99 @@ The following section exemplifies how to use hooks in user pass-in training func compression_manager.callbacks.on_train_end() ``` -In the case mentioned above, pruning process can be done by pre-defined hooks in Neural Compressor. Users need to place those hooks inside the training function. + In the case mentioned above, pruning process can be done by pre-defined hooks in Neural Compressor. Users need to place those hooks inside the training function. + + +### Retrain-free Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 5-7 configuration items need to be identified. + + If the name of the layer to be pruned is known, you can create a pruning_config manually by inputting the relevant information. 
This allows for greater customization and control over the pruning process. + + ```python + pruning_configs = [ + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": ['.fc', '.mlp'], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. + "pattern": "channelx1", # Default pruning pattern. + "pruning_op_types": ["Linear"], + }, + ] + ``` + + If you find yourself uncertain about the names of the linear modules within the MLP of the model or desire a simplified approach to setting up pruning, you can utilize a module that automatically generates a config: + + ```python + # auto config + from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs=[] + auto_configs = parse_auto_slim_config( + model, + ffn2_sparsity = args.target_sparsity, #e.g. 0.2 + mha_sparsity = 0, + pruning_scope = "global", + pruning_type = "retrain_free", + ) + pruning_configs += auto_configs + ``` + + +- Step 2: Enable pruning functionalities + The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. + + ```python + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners(optional) + pattern=args.pruning_pattern, + start_step=pruning_start, + end_step=pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + +### SparseGPT Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] + ``` + +- Step 2: Enable pruning functionalities + By providing the pruning config, calibration dataset, and specifying the desired device card number, the pruning process can be executed automatically with a simple API call. + + ```python + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners + pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 + ) + config = WeightPruningConfig(pruning_configs) + # for example: device = "cuda:1" + pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device) # modify the model and complete the pruning + ``` + + ## Examples @@ -360,6 +470,10 @@ The pruning technique is validated on typical models across various domains (in "Experimental" annotation means these examples codes are ready but pruning results are under improvements. 
Please don't hesitate to try these codes with different configurations to get better pruning results! +- Language Modeling + + Sparsity is effectively implemented through various pruning patterns in Causal language modeling (CLM) tasks. [Language-modeling examples](../../../examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager). + - Text Classification Sparsity is implemented in different pruning patterns of MRPC and SST-2 tasks [Text-classification examples](../../../examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager). @@ -395,3 +509,8 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [1] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. [2] Zafrir, Ofir, Ariel Larey, Guy Boudoukh, Haihao Shen, and Moshe Wasserblat. "Prune once for all: Sparse pre-trained language models." arXiv preprint arXiv:2111.05754 (2021). + +[3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. + +[4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. + From c17d6448e11e79e6d6dab4230a00772a5a86fce2 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu, 14 Sep 2023 12:30:28 +0000 Subject: [PATCH 02/10] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/pruning.md | 91 +++++++++--------- .../compression/pruner/README.md | 92 +++++++++---------- 2 files changed, 92 insertions(+), 91 deletions(-) diff --git a/docs/source/pruning.md b/docs/source/pruning.md index 10de002d8d4..6fba34e59f5 100644 --- a/docs/source/pruning.md +++ b/docs/source/pruning.md @@ -372,17 +372,17 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. - "pattern": "channelx1", # Default pruning pattern. - "pruning_op_types": ["Linear"], + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. + "pattern": "channelx1", # Default pruning pattern. 
+ "pruning_op_types": ["Linear"], }, ] ``` @@ -392,13 +392,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config - pruning_configs=[] + + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -408,35 +409,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners(optional) - pattern=args.pruning_pattern, - start_step=pruning_start, - end_step=pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - -### SparseGPT Pruning API -- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. - } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners(optional) + pattern=args.pruning_pattern, + start_step=pruning_start, + end_step=pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + + ### SparseGPT Pruning API + - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. 
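+            # "1x1" is element-wise (unstructured) sparsity; "2:4" or "4:8"
+            # would request the structured N:M patterns instead.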
+ } + ] ``` - Step 2: Enable pruning functionalities diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index f5cce88a4db..9f9c113d1d9 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -373,17 +373,17 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. - "pattern": "channelx1", # Default pruning pattern. - "pruning_op_types": ["Linear"], + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. + "pattern": "channelx1", # Default pruning pattern. + "pruning_op_types": ["Linear"], }, ] ``` @@ -393,13 +393,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config - pruning_configs=[] + + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -409,35 +410,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners(optional) - pattern=args.pruning_pattern, - start_step=pruning_start, - end_step=pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - -### SparseGPT Pruning API -- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. 
- } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity=args.target_sparsity, # global setting for all pruners(optional) + pattern=args.pruning_pattern, + start_step=pruning_start, + end_step=pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + + ### SparseGPT Pruning API + - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] ``` - Step 2: Enable pruning functionalities @@ -513,4 +514,3 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. [4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. - From f50c254e56914377fd802d38c39d1507deb7b78c Mon Sep 17 00:00:00 2001 From: "Zhang, Weiwei1" Date: Thu, 14 Sep 2023 20:33:38 +0800 Subject: [PATCH 03/10] fixtypo Signed-off-by: Zhang, Weiwei1 --- neural_compressor/compression/pruner/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index f5cce88a4db..a9eea837c07 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -65,8 +65,8 @@ Pruning Neural network pruning is a promising model compression technique that removes the least important parameters/neurons in the network and achieves compact architectures of minimal accuracy drop and maximal inference acceleration. As state-of-the-art model sizes have grown at an unprecedented speed, pruning has become increasingly crucial to reducing the computational and memory footprint that huge neural networks require.
[figure: "pruning intro" image block]
From d813a52e063a9032dbd7fd7f353ef292b9b5e44b Mon Sep 17 00:00:00 2001 From: "Zhang, Weiwei1" Date: Fri, 15 Sep 2023 10:09:32 +0800 Subject: [PATCH 04/10] Remove UT with randomness issue Signed-off-by: Zhang, Weiwei1 --- .../pruning_1.x_v2/test_pruning_regs.py | 105 ------------------ 1 file changed, 105 deletions(-) delete mode 100644 test/pruning_with_pt/pruning_1.x_v2/test_pruning_regs.py diff --git a/test/pruning_with_pt/pruning_1.x_v2/test_pruning_regs.py b/test/pruning_with_pt/pruning_1.x_v2/test_pruning_regs.py deleted file mode 100644 index df37b1c7010..00000000000 --- a/test/pruning_with_pt/pruning_1.x_v2/test_pruning_regs.py +++ /dev/null @@ -1,105 +0,0 @@ -import unittest - -import torch -import torch.nn as nn -import torchvision - -from neural_compressor.conf.pythonic_config import Config, WeightPruningConfig -from neural_compressor.data import Datasets -from neural_compressor.experimental.data.dataloaders.pytorch_dataloader import PyTorchDataLoader -from neural_compressor.experimental.pruning_v2 import Pruning - -local_regs_config = [ - { - "start_step": 0, - "end_step": 10, - "pruning_type": "magnitude", - "op_names": ["layer1.*"], - "excluded_op_names": ["layer2.*"], - "pruning_scope": "global", - "target_sparsity": 0.5, - "pattern": "4x1", - "reg_type": "group_lasso", - "parameters": {"reg_coeff": 0.2}, - }, - { - "start_step": 1, - "end_step": 1, - "target_sparsity": 0.5, - "pruning_type": "snip_momentum", - "pruning_frequency": 2, - "op_names": ["layer2.*"], - "pruning_scope": "local", - "pattern": "1x1", - "sparsity_decay_type": "exp", - "reg_type": "group_lasso", - "parameters": {"reg_coeff": 0.1}, - }, - { - "start_step": 2, - "end_step": 8, - "pruning_type": "gradient", - "pruning_frequency": 2, - "op_names": ["fc"], - "pruning_scope": "local", - "target_sparsity": 0.75, - "pattern": "1x1", - "sparsity_decay_type": "cube", - "reg_type": "group_lasso", - "parameters": {"reg_coeff": 0.0}, - }, -] - -fake_snip_config = WeightPruningConfig( - local_regs_config, target_sparsity=0.9, start_step=0, end_step=10, pruning_frequency=1, sparsity_decay_type="exp" -) - - -class TestPruningRegs(unittest.TestCase): - model = torchvision.models.resnet18() - - def test_pruning_regs(self): - config = Config(quantization=None, benchmark=None, pruning=fake_snip_config, distillation=None) - prune = Pruning(config) - prune.model = self.model - criterion = nn.CrossEntropyLoss() - optimizer = torch.optim.SGD(self.model.parameters(), lr=0.0001) - datasets = Datasets("pytorch") - dummy_dataset = datasets["dummy"](shape=(10, 3, 224, 224), low=0.0, high=1.0, label=True) - dummy_dataloader = PyTorchDataLoader(dummy_dataset) - prune.on_train_begin() - prune.update_config(pruning_frequency=1) - for epoch in range(2): - self.model.train() - prune.on_epoch_begin(epoch) - local_step = 0 - for image, target in dummy_dataloader: - prune.on_step_begin(local_step) - output = self.model(image) - loss = criterion(output, target) - optimizer.zero_grad() - loss.backward() - prune.on_before_optimizer_step() - optimizer.step() - prune.on_after_optimizer_step() - prune.on_step_end() - local_step += 1 - - prune.on_epoch_end() - ( - elementwise_over_matmul_gemm_conv, - elementwise_over_all, - blockwise_over_matmul_gemm_conv, - ) = prune.get_sparsity_ratio() - prune.on_train_end() - prune.on_before_eval() - prune.on_after_eval() - - # assert sparsity ratio - self.assertAlmostEqual(elementwise_over_matmul_gemm_conv, 0.02244, delta=0.0001) - self.assertAlmostEqual(elementwise_over_all, 0.02242, delta=0.0001) - 
self.assertAlmostEqual(blockwise_over_matmul_gemm_conv, 0.02244, delta=0.0001)
-
-
-if __name__ == "__main__":
-    unittest.main()

From e8997012828b2e83d62444dccb517f13b4f159ef Mon Sep 17 00:00:00 2001
From: "Zhang, Weiwei1"
Date: Fri, 15 Sep 2023 10:59:01 +0800
Subject: [PATCH 05/10] fixtypos

Signed-off-by: Zhang, Weiwei1
---
 docs/source/pruning.md                        | 99 +++++++++----------
 .../language-modeling/pruning/eager/README.md |  8 +-
 .../compression/pruner/README.md              | 99 +++++++++----------
 3 files changed, 103 insertions(+), 103 deletions(-)

diff --git a/docs/source/pruning.md b/docs/source/pruning.md
index 6fba34e59f5..c728b23c6ae 100644
--- a/docs/source/pruning.md
+++ b/docs/source/pruning.md
@@ -245,17 +245,17 @@ Regularization is a technique that discourages learning a more complex model and

 ### Large Language Model Pruning

-To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1 / 2:4 / 4:8).
+To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1/N:M).

 - Retrain-free

-  The retrain-free algorithm is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layer in Multi-Layer Perceptron (MLP), it effectively achieves a 20% sparsity per layer while preserving accuracy with an accuracy loss of less than 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.
+  [The retrain-free algorithm](https://arxiv.org/abs/2204.09656) is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layers in the Multi-Layer Perceptron (MLP), it achieves 20% sparsity per layer while keeping the accuracy loss below 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.

   For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions [Retrain-free Pruning API](#Retrain-free-Pruning-API)

 - SparseGPT

-  This algorithm is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1, 2:4, and 4:8, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. By applying this method, it is possible to achieve up to 50% sparsity on models with sizes larger than 10 billion parameters(More model parameters are less sensitive to sparsity), all while maintaining an accuracy loss of less than 1%. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs. Additionally, it is worth mentioning that larger model parameters tend to be less impacted by sparsity.
+  [The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. With this method it is possible to achieve up to 50% sparsity on models larger than 10 billion parameters while maintaining an accuracy loss of less than 1%; larger models tend to be less sensitive to sparsity. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs.
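+
+  To make the N:M notation concrete, the minimal sketch below applies a 2:4 magnitude mask to a random matrix. It only illustrates the pattern itself (read here as two pruned weights in every four consecutive ones, i.e. 50% sparsity); SparseGPT selects which weights to prune using approximate second-order information rather than plain magnitude.
+
+  ```python
+  import torch
+
+  W = torch.randn(8, 16)  # toy weight matrix
+  groups = W.reshape(-1, 4)  # groups of 4 consecutive weights along each row
+  drop = groups.abs().argsort(dim=1)[:, :2]  # the 2 smallest-magnitude entries per group
+  mask = torch.ones_like(groups).scatter_(1, drop, 0.0)
+  W_2to4 = (groups * mask).reshape_as(W)
+  assert (W_2to4 == 0).float().mean() == 0.5  # exactly half of the weights are zero
+  ```
+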
+ [The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. By applying this method, it is possible to achieve up to 50% sparsity on models with sizes larger than 10 billion parameters(More model parameters are less sensitive to sparsity), all while maintaining an accuracy loss of less than 1%. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs. Additionally, it is worth mentioning that larger model parameters tend to be less impacted by sparsity. For a smooth initiation with sparseGPT algorithm, please refer to the API instructions provided [SparseGPT Pruning API](#SparseGPT-Pruning-API) @@ -372,17 +372,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": [".fc", ".mlp"], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. - "pattern": "channelx1", # Default pruning pattern. - "pruning_op_types": ["Linear"], + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": ['.fc', '.mlp'], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. }, ] ``` @@ -392,14 +391,13 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config - pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity=args.target_sparsity, # e.g. 0.2 - mha_sparsity=0, - pruning_scope="global", - pruning_type="retrain_free", + ffn2_sparsity = args.target_sparsity, #e.g. 0.2 + mha_sparsity = 0, + pruning_scope = "global", + pruning_type = "retrain_free", ) pruning_configs += auto_configs ``` @@ -409,35 +407,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. 
```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners(optional) - pattern=args.pruning_pattern, - start_step=pruning_start, - end_step=pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - - ### SparseGPT Pruning API - - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. - } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity = args.target_sparsity, # global setting for all pruners(optional) + pattern = args.pruning_pattern, + start_step = pruning_start, + end_step = pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + +### SparseGPT Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] ``` - Step 2: Enable pruning functionalities @@ -447,8 +445,8 @@ The following section exemplifies how to use hooks in user pass-in training func from neural_compressor.training import prepare_pruning, WeightPruningConfig configs = WeightPruningConfig( pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners - pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 + target_sparsity = args.target_sparsity, # global setting for all pruners + pattern = args.pruning_pattern, # e.g. 1x1 / 2:4 ) config = WeightPruningConfig(pruning_configs) # for example: device = "cuda:1" @@ -457,6 +455,7 @@ The following section exemplifies how to use hooks in user pass-in training func + ## Examples The pruning technique is validated on typical models across various domains (including CV and NLP). 
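+
+Putting the pieces together, a sketch of a complete post-training SparseGPT run is shown below. It only assembles the calls documented above; the checkpoint name, the literal sparsity/pattern values, and `train_dataloader` (a small tokenized calibration set) are placeholders to replace with your own.
+
+```python
+from transformers import AutoModelForCausalLM
+from neural_compressor.training import prepare_pruning, WeightPruningConfig
+
+model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder checkpoint
+configs = WeightPruningConfig(
+    [{"pruning_type": "sparse_gpt", "op_names": [".*"], "pruning_op_types": ["Linear"],
+      "excluded_op_names": ["lm_head", "embed_out"]}],
+    target_sparsity=0.5,
+    pattern="2:4",  # or "1x1" / "4:8"
+)
+# train_dataloader: calibration data; pruning runs inside prepare_pruning (post-training)
+pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device="cuda:0")
+```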
diff --git a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
index 010c74d9aef..93b19d150c9 100644
--- a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
+++ b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
@@ -26,10 +26,12 @@ See more about loading [huggingface dataset](https://huggingface.co/docs/datasets/loading_datasets.html)

 # Run Examples

-Intel® Neural Compressor provides pruning and slimming capabilities for Large Language Models(LLMs) without the need for retraining.
-Through experimental verification, we have found that pruning MLP layers using channel-wise patterns can achieve a sparsity of 10%-20% without sacrificing accuracy, thereby significantly speeding up inference by <1%. [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).
+Intel® Neural Compressor provides support for pruning and model slimming operations in Large Language Models (LLMs) without the need for retraining.
+
+Through experimental verification, it has been observed that pruning the Multi-Layer Perceptron (MLP) layers using a channel-wise pattern can achieve a sparsity level of 10%-20%. This pruning technique speeds up inference while maintaining an accuracy drop of less than 1%. [Retrain-free Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_no_trainer.py).
+
+The 1x1 and N:M pruning patterns are supported through the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py). It is possible to prune models up to 70B in size within two hours, achieving a sparsity of 40%-50% in both the Multi-Head Attention (MHA) and MLP layers. For models of 7B and above, the drop in accuracy is less than 1%.
-The pruning patterns of 1x1, 2:4 and 4:8 are supported through the use of the [SparseGPT Example](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/run_clm_sparsegpt.py), It is possible to prune models up to 70B in size within two hour, achieving a sparsity of 40%-50% in both the MHA and MLP layers. For models of 7B and above, the drop in accuracy is less than 1%.
 Pruning scripts are available for LLM sparse models such as GPT-j, BLOOM, OPT, and LLaMA; the sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).
diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md
index a156ac9d131..2878caa2849 100644
--- a/neural_compressor/compression/pruner/README.md
+++ b/neural_compressor/compression/pruner/README.md
@@ -246,17 +246,17 @@ Regularization is a technique that discourages learning a more complex model and

 ### Large Language Model Pruning

-To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1 / 2:4 / 4:8).
+To efficiently achieve pruning for Large Language Models (LLMs), we have implemented two post-training pruning methods that utilize different pruning patterns: **Retrain-free** (channel-wise) and **SparseGPT** (1x1/N:M).

 - Retrain-free

-  The retrain-free algorithm is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layer in Multi-Layer Perceptron (MLP), it effectively achieves a 20% sparsity per layer while preserving accuracy with an accuracy loss of less than 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.
+  [The retrain-free algorithm](https://arxiv.org/abs/2204.09656) is a lightweight method that utilizes mask retrieval and rearrangement techniques within the Transformer architecture. By incorporating channel pruning and sparse model slimming for the linear layers in the Multi-Layer Perceptron (MLP), it achieves 20% sparsity per layer while keeping the accuracy loss below 1%. This algorithm seamlessly supports popular models like GPT, OPT, LLaMA, and BLOOM. Its capability to enhance model efficiency while maintaining performance makes it a valuable pruning approach for LLMs.

   For a quick and efficient start with the retrain-free algorithm, please refer to the API instructions [Retrain-free Pruning API](#Retrain-free-Pruning-API)

 - SparseGPT

-  This algorithm is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1, 2:4, and 4:8, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. By applying this method, it is possible to achieve up to 50% sparsity on models with sizes larger than 10 billion parameters(More model parameters are less sensitive to sparsity), all while maintaining an accuracy loss of less than 1%. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs. Additionally, it is worth mentioning that larger model parameters tend to be less impacted by sparsity.
+  [The SparseGPT algorithm](https://arxiv.org/abs/2301.00774) is an efficient post-training pruning method that operates on a block-wise basis. It supports multiple pruning patterns, including 1x1 and N:M, targeting the linear layers within the Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP) components. With this method it is possible to achieve up to 50% sparsity on models larger than 10 billion parameters while maintaining an accuracy loss of less than 1%; larger models tend to be less sensitive to sparsity. It is compatible with a wide range of models, including OPT, GPT, LLaMA, BLOOM, Dolly, MPT, Falcon, Stable-LM, and LaMini-LM, providing flexibility and effectiveness in pruning LLMs.
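+
+  To make the channel-wise pattern from the Retrain-free bullet above concrete: once a whole intermediate channel of an MLP is masked to zero, it can be physically removed from both surrounding Linear layers without changing the model's output, which is exactly what model slimming exploits. The toy sketch below (an illustration only, not the Neural Compressor implementation) verifies this equivalence:
+
+  ```python
+  import torch
+  import torch.nn as nn
+
+  torch.manual_seed(0)
+  d, hidden, n_prune = 8, 32, 6
+  fc1, fc2, act = nn.Linear(d, hidden), nn.Linear(hidden, d), nn.ReLU()
+  x = torch.randn(4, d)
+
+  # channel-wise mask: zero the lowest-magnitude intermediate channels of fc1
+  idx = fc1.weight.abs().sum(dim=1).argsort()[:n_prune]
+  with torch.no_grad():
+      fc1.weight[idx] = 0
+      fc1.bias[idx] = 0
+  masked_out = fc2(act(fc1(x)))
+
+  # slimming: drop those channels from fc1's rows and fc2's columns entirely
+  keep = torch.ones(hidden, dtype=torch.bool)
+  keep[idx] = False
+  slim_fc1, slim_fc2 = nn.Linear(d, hidden - n_prune), nn.Linear(hidden - n_prune, d)
+  with torch.no_grad():
+      slim_fc1.weight.copy_(fc1.weight[keep])
+      slim_fc1.bias.copy_(fc1.bias[keep])
+      slim_fc2.weight.copy_(fc2.weight[:, keep])
+      slim_fc2.bias.copy_(fc2.bias)
+  assert torch.allclose(masked_out, slim_fc2(act(slim_fc1(x))), atol=1e-6)
+  ```
+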
For a smooth initiation with sparseGPT algorithm, please refer to the API instructions provided [SparseGPT Pruning API](#SparseGPT-Pruning-API) @@ -373,17 +373,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": [".fc", ".mlp"], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, The recommended setting is one fortieth of the pruning steps. - "pattern": "channelx1", # Default pruning pattern. - "pruning_op_types": ["Linear"], + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": ['.fc', '.mlp'], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. }, ] ``` @@ -393,14 +392,13 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config - pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity=args.target_sparsity, # e.g. 0.2 - mha_sparsity=0, - pruning_scope="global", - pruning_type="retrain_free", + ffn2_sparsity = args.target_sparsity, #e.g. 0.2 + mha_sparsity = 0, + pruning_scope = "global", + pruning_type = "retrain_free", ) pruning_configs += auto_configs ``` @@ -410,35 +408,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners(optional) - pattern=args.pruning_pattern, - start_step=pruning_start, - end_step=pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - - ### SparseGPT Pruning API - - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. 
- } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity = args.target_sparsity, # global setting for all pruners(optional) + pattern = args.pruning_pattern, + start_step = pruning_start, + end_step = pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + +### SparseGPT Pruning API +- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] ``` - Step 2: Enable pruning functionalities @@ -448,8 +446,8 @@ The following section exemplifies how to use hooks in user pass-in training func from neural_compressor.training import prepare_pruning, WeightPruningConfig configs = WeightPruningConfig( pruning_configs, - target_sparsity=args.target_sparsity, # global setting for all pruners - pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 + target_sparsity = args.target_sparsity, # global setting for all pruners + pattern = args.pruning_pattern, # e.g. 1x1 / 2:4 ) config = WeightPruningConfig(pruning_configs) # for example: device = "cuda:1" @@ -514,3 +512,4 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. [4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. + From d4d8df72926058e1e560d2c6d9cd8a8accceb27b Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 15 Sep 2023 03:00:15 +0000 Subject: [PATCH 06/10] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/pruning.md | 87 +++++++++--------- .../compression/pruner/README.md | 88 +++++++++---------- 2 files changed, 88 insertions(+), 87 deletions(-) diff --git a/docs/source/pruning.md b/docs/source/pruning.md index c728b23c6ae..7c139734683 100644 --- a/docs/source/pruning.md +++ b/docs/source/pruning.md @@ -372,16 +372,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, - "pattern": "channelx1", # Default pruning pattern. 
+ { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. }, ] ``` @@ -391,13 +391,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -407,35 +408,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners(optional) - pattern = args.pruning_pattern, - start_step = pruning_start, - end_step = pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - -### SparseGPT Pruning API -- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. - } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity = args.target_sparsity, # global setting for all pruners(optional) + pattern = args.pruning_pattern, + start_step = pruning_start, + end_step = pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + + ### SparseGPT Pruning API + - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. 
+ } + ] ``` - Step 2: Enable pruning functionalities diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index 2878caa2849..54f5b32fb7c 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -373,16 +373,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, - "pattern": "channelx1", # Default pruning pattern. + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. }, ] ``` @@ -392,13 +392,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -408,35 +409,35 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners(optional) - pattern = args.pruning_pattern, - start_step = pruning_start, - end_step = pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - -### SparseGPT Pruning API -- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. 
- } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity = args.target_sparsity, # global setting for all pruners(optional) + pattern = args.pruning_pattern, + start_step = pruning_start, + end_step = pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + + ### SparseGPT Pruning API + - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "pruning_op_types": ["Linear"], + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] ``` - Step 2: Enable pruning functionalities @@ -512,4 +513,3 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. [4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. - From 3546fe34948f9139bcf83c138d4294de02ebb55f Mon Sep 17 00:00:00 2001 From: "Zhang, Weiwei1" Date: Fri, 15 Sep 2023 11:03:56 +0800 Subject: [PATCH 07/10] fixtypos_1 Signed-off-by: Zhang, Weiwei1 --- docs/source/pruning.md | 1 - neural_compressor/compression/pruner/README.md | 1 - 2 files changed, 2 deletions(-) diff --git a/docs/source/pruning.md b/docs/source/pruning.md index c728b23c6ae..76a2abc39a2 100644 --- a/docs/source/pruning.md +++ b/docs/source/pruning.md @@ -430,7 +430,6 @@ The following section exemplifies how to use hooks in user pass-in training func { #example pruner "pruning_type": "sparse_gpt", "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. "target_sparsity": 0.5, # Target sparsity ratio of modules. "pattern": "1x1", # Default pruning pattern. diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index 2878caa2849..867759c4154 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -431,7 +431,6 @@ The following section exemplifies how to use hooks in user pass-in training func { #example pruner "pruning_type": "sparse_gpt", "op_names": [".*"], # Prunes all linear modules by default. - "pruning_op_types": ["Linear"], "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. "target_sparsity": 0.5, # Target sparsity ratio of modules. "pattern": "1x1", # Default pruning pattern. 
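For readers who want to try the retrain-free flow documented in the patches above end to end, a minimal sketch follows. The `WeightPruningConfig`, `prepare_pruning`, and `parse_auto_slim_config` calls are exactly the APIs these docs introduce; the model choice, the toy calibration texts, and the dataloader construction are illustrative assumptions only, not part of the patch series.

```python
# Minimal end-to-end sketch of the retrain-free flow described above.
# Assumptions (not from the docs): the model/tokenizer choice and the
# toy calibration data; a real run would use a proper dataset.
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor.compression.pruner import parse_auto_slim_config
from neural_compressor.training import prepare_pruning, WeightPruningConfig

model_name = "facebook/opt-125m"  # placeholder; any supported LLM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy calibration batch: a few tokenized sentences, padded to equal length.
texts = ["Intel Neural Compressor supports post-training pruning."] * 16
input_ids = tokenizer(texts, return_tensors="pt", padding=True).input_ids
train_dataloader = DataLoader(input_ids, batch_size=4)  # yields input_ids tensors

# Auto-generate per-layer pruner configs for the FFN (MLP) linear modules.
pruning_configs = parse_auto_slim_config(
    model,
    ffn2_sparsity=0.2,  # target sparsity of the MLP blocks
    mha_sparsity=0,
    pruning_scope="global",
    pruning_type="retrain_free",
)

configs = WeightPruningConfig(pruning_configs)
pruning = prepare_pruning(model, configs, dataloader=train_dataloader)  # prunes in place
```

Whether the toy dataloader's batch shape matches what a given model's forward pass expects depends on that model; treat the sketch as a template rather than a drop-in script.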
From c898c32e22ebb6ce47c79d5484419e5733894613 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Fri, 15 Sep 2023 03:13:41 +0000 Subject: [PATCH 08/10] [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/pruning.md | 94 ++++++++++--------- .../compression/pruner/README.md | 62 ++++++------ 2 files changed, 82 insertions(+), 74 deletions(-) diff --git a/docs/source/pruning.md b/docs/source/pruning.md index 2a848bb09e8..bc88a6829b2 100644 --- a/docs/source/pruning.md +++ b/docs/source/pruning.md @@ -372,16 +372,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, - "pattern": "channelx1", # Default pruning pattern. + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. }, ] ``` @@ -391,13 +391,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -407,34 +408,34 @@ The following section exemplifies how to use hooks in user pass-in training func The process itself is quite straightforward. By passing the prepared config and the calibration dataset, the pruning process can be automatically carried out with a simple API call. ```python - from neural_compressor.training import prepare_pruning, WeightPruningConfig - configs = WeightPruningConfig( - pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners(optional) - pattern = args.pruning_pattern, - start_step = pruning_start, - end_step = pruning_end, - ) - config = WeightPruningConfig(pruning_configs) - - pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning - ``` - - - -### SparseGPT Pruning API -- Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: - - ```python - pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. 
- "pattern": "1x1", # Default pruning pattern. - } - ] + from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( + pruning_configs, + target_sparsity = args.target_sparsity, # global setting for all pruners(optional) + pattern = args.pruning_pattern, + start_step = pruning_start, + end_step = pruning_end, + ) + config = WeightPruningConfig(pruning_configs) + + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning + ``` + + + + ### SparseGPT Pruning API + - Step 1: Define a dict-like configuration in your training codes. Usually only 3-5 configuration items need to be identified, for example: + + ```python + pruning_configs = [ + { #example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. + } + ] ``` - Step 2: Enable pruning functionalities @@ -442,14 +443,17 @@ The following section exemplifies how to use hooks in user pass-in training func ```python from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners - pattern = args.pruning_pattern, # e.g. 1x1 / 2:4 + target_sparsity=args.target_sparsity, # global setting for all pruners + pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 ) config = WeightPruningConfig(pruning_configs) # for example: device = "cuda:1" - pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device) # modify the model and complete the pruning + pruning = prepare_pruning( + model, configs, dataloader=train_dataloader, device=device + ) # modify the model and complete the pruning ``` diff --git a/neural_compressor/compression/pruner/README.md b/neural_compressor/compression/pruner/README.md index 7b4fc02a130..d6aab362bb9 100644 --- a/neural_compressor/compression/pruner/README.md +++ b/neural_compressor/compression/pruner/README.md @@ -373,16 +373,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { # config of a single pruner - "pruning_type": "retrain_free", - "pruning_scope": "global", - "op_names": ['.fc', '.mlp'], # MLP layer_names - "start_step": 1, - "end_step": 300, # set end_step for Few shot pruning. - "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. - "target_sparsity": 0.2, # Target sparsity ratio of modules. - "pruning_frequency": 50, # Frequency of applying pruning, - "pattern": "channelx1", # Default pruning pattern. + { # config of a single pruner + "pruning_type": "retrain_free", + "pruning_scope": "global", + "op_names": [".fc", ".mlp"], # MLP layer_names + "start_step": 1, + "end_step": 300, # set end_step for Few shot pruning. + "excluded_op_names": ["lm_head"], # A list of modules that would not be pruned. + "target_sparsity": 0.2, # Target sparsity ratio of modules. + "pruning_frequency": 50, # Frequency of applying pruning, + "pattern": "channelx1", # Default pruning pattern. 
}, ] ``` @@ -392,13 +392,14 @@ The following section exemplifies how to use hooks in user pass-in training func ```python # auto config from neural_compressor.compression.pruner import parse_auto_slim_config + pruning_configs = [] auto_configs = parse_auto_slim_config( model, - ffn2_sparsity = args.target_sparsity, #e.g. 0.2 - mha_sparsity = 0, - pruning_scope = "global", - pruning_type = "retrain_free", + ffn2_sparsity=args.target_sparsity, # e.g. 0.2 + mha_sparsity=0, + pruning_scope="global", + pruning_type="retrain_free", ) pruning_configs += auto_configs ``` @@ -409,15 +410,16 @@ The following section exemplifies how to use hooks in user pass-in training func ```python from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners(optional) - pattern = args.pruning_pattern, - start_step = pruning_start, - end_step = pruning_end, + target_sparsity=args.target_sparsity, # global setting for all pruners(optional) + pattern=args.pruning_pattern, + start_step=pruning_start, + end_step=pruning_end, ) config = WeightPruningConfig(pruning_configs) - + pruning = prepare_pruning(model, configs, dataloader=train_dataloader) # modify the model and complete the pruning ``` @@ -428,12 +430,12 @@ The following section exemplifies how to use hooks in user pass-in training func ```python pruning_configs = [ - { #example pruner - "pruning_type": "sparse_gpt", - "op_names": [".*"], # Prunes all linear modules by default. - "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. - "target_sparsity": 0.5, # Target sparsity ratio of modules. - "pattern": "1x1", # Default pruning pattern. + { # example pruner + "pruning_type": "sparse_gpt", + "op_names": [".*"], # Prunes all linear modules by default. + "excluded_op_names": ["lm_head", "embed_out"], # A list of modules that would not be pruned. + "target_sparsity": 0.5, # Target sparsity ratio of modules. + "pattern": "1x1", # Default pruning pattern. } ] ``` @@ -443,14 +445,17 @@ The following section exemplifies how to use hooks in user pass-in training func ```python from neural_compressor.training import prepare_pruning, WeightPruningConfig + configs = WeightPruningConfig( pruning_configs, - target_sparsity = args.target_sparsity, # global setting for all pruners - pattern = args.pruning_pattern, # e.g. 1x1 / 2:4 + target_sparsity=args.target_sparsity, # global setting for all pruners + pattern=args.pruning_pattern, # e.g. 1x1 / 2:4 ) config = WeightPruningConfig(pruning_configs) # for example: device = "cuda:1" - pruning = prepare_pruning(model, configs, dataloader=train_dataloader, device=device) # modify the model and complete the pruning + pruning = prepare_pruning( + model, configs, dataloader=train_dataloader, device=device + ) # modify the model and complete the pruning ``` @@ -511,4 +516,3 @@ For more details, please refer to [HPO document](../../neural_compressor/compres [3] Kwon, W., Kim, S., Mahoney, M.W., Hassoun, J., Keutzer, K. and Gholami, A., 2022. A fast post-training pruning framework for transformers. Advances in Neural Information Processing Systems, 35, pp.24101-24116. [4] Frantar, E. and Alistarh, D., Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023. URL https://arxiv. org/abs/2301.00774. 
- From d5f2691734ed7ddb5e2375522e3f38b5935cc83b Mon Sep 17 00:00:00 2001 From: "Zhang, Weiwei1" Date: Fri, 15 Sep 2023 11:51:18 +0800 Subject: [PATCH 09/10] add scripts link Signed-off-by: Zhang, Weiwei1 --- .../language-modeling/pruning/eager/README.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md index 93b19d150c9..28ac1c98a2f 100644 --- a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md +++ b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md @@ -22,7 +22,7 @@ pip install -r examples/pytorch/nlp/huggingface_models/language-modeling/pruning The dataset will be downloaded automatically from the datasets Hub. See more about loading [huggingface dataset](https://huggingface.co/docs/datasets/loading_datasets.html) - +
# Run Examples

@@ -34,10 +34,11 @@ The pruning patterns of 1x1 and N:M are supported through the use of the [Sparse

Pruning scripts are available for LLM sparse models such as GPT-J, BLOOM, OPT, and LLaMA; a sparse model can be obtained by modifying the pruning parameters. [Pruning Scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/).

+
-### Results
+## Retrain-free Results

-The last token accuracy for channel pruning using the retrain-free algorithm is presented in the following table.
+The last token accuracy for channel pruning using [the retrain-free scripts](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_gptj_pruning.sh) is presented in the following table.
| Model | Calibration dataset | Evaluation dataset | Sparsity pattern | Over MLP block sparsity | Element-wise/matmul, Gemm, conv ratio | Dense last token accuracy | Sparse last token accuracy | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| EleutherAI/gpt-j-6b | lambada | lambada | channelx1 | 0.1999 | 0.1242 | 0.7917 | 0.8038 | +1.50% |

@@ -69,7 +70,9 @@

The last word accuracy of the channel-wise sparse model is shown in the following table.
-The last word accuracy of the 1x1 pattern sparse model using the SparseGPT algorithm is shown in the following table.
+## SparseGPT Results
+
+The last word accuracy of the 1x1 pattern sparse model using [the SparseGPT script](https://github.com/intel/neural-compressor/tree/master/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/scripts/run_llm_sparsegpt.sh) is shown in the following table.
| Model | Task | Calibration dataset | Evaluation dataset | Sparsity | Precision | Dense last word accuracy | Sparse last word accuracy | Relative drop |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| EleutherAI/gpt-j-6b | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.6831 | 0.6911 | +1.17% |

From ecbb0efbc7c22570757446e9094247bf2b75f450 Mon Sep 17 00:00:00 2001
From: "Zhang, Weiwei1"
Date: Fri, 15 Sep 2023 14:23:38 +0800
Subject: [PATCH 10/10] add scripts link

Signed-off-by: Zhang, Weiwei1
---
 .../language-modeling/pruning/eager/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
index 28ac1c98a2f..34ea6b780b3 100644
--- a/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
+++ b/examples/pytorch/nlp/huggingface_models/language-modeling/pruning/eager/README.md
@@ -86,7 +86,7 @@ The last word accuracy of the 1x1 pattern sparse model using [the SparseGPT scr
 | bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | FP32 | 0.5764 | 0.5575 | -3.28% |
 | bigscience/bloom-7b1 | CLM | wikitext-2-raw-v1 | lambada_openai | 40% | BF16 | 0.5723 | 0.5513 | -3.67% |
 | decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | FP32 | 0.7627 | 0.7584 | -0.56% |
-| decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | BF16 | 0.7601 | 0.7545 | -0.74% | 
+| decapoda-research/llama-13b-hf | CLM | wikitext-2-raw-v1 | lambada_openai | 50% | BF16 | 0.7601 | 0.7545 | -0.74% |

 ## References
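The tables above report lambada-style last-token and last-word accuracy. The harness behind those numbers is not included in the patches; the sketch below illustrates what the metric measures, namely whether the model's argmax prediction recovers the final token of each text. The model name is a placeholder assumption, not one of the evaluated checkpoints.

```python
# Illustrative sketch of the "last token accuracy" metric reported above:
# the model is given a text minus its final token and is scored on whether
# its argmax prediction recovers that token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # assumption; not a model from the tables
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)


def last_token_accuracy(texts):
    correct, total = 0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] < 2:  # need at least one context token plus the target
            continue
        with torch.no_grad():
            logits = model(ids[:, :-1]).logits  # next-token logits per position
        pred = logits[0, -1].argmax().item()  # prediction for the final token
        correct += int(pred == ids[0, -1].item())
        total += 1
    return correct / max(total, 1)


print(last_token_accuracy(["The capital of France is Paris"]))
```

The same score can be computed for a dense and a pruned checkpoint to reproduce the "dense vs. sparse" comparison the tables make, with the relative drop being the percentage change between the two.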