
[Feature] Add Channel Wise Quantization Support #441

Merged: 1 commit into main, Feb 12, 2024

Conversation


@rahul-tuli rahul-tuli commented Feb 12, 2024

This PR adds channel-wise quantization support to the `deepsparse.analyze` API and the `ModelAnalysis` class.
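For context, channel-wise (per-axis) quantization gives each output channel its own scale and zero point, instead of a single pair for the whole tensor. A minimal sketch of the dequantization semantics this implies, matching ONNX `DequantizeLinear` with an `axis` attribute (function name and shapes are illustrative, not from this PR):

```python
import numpy as np

def dequantize_per_channel(q_weight, scales, zero_points, axis=0):
    """Dequantize an INT8 weight with one (scale, zero_point) pair per channel.

    Illustrative sketch of per-axis (channel-wise) quantization:
        real = (quantized - zero_point) * scale
    where `scales` and `zero_points` are 1-D with one entry per channel
    along `axis`.
    """
    # Reshape the per-channel parameters so they broadcast along `axis`.
    shape = [1] * q_weight.ndim
    shape[axis] = -1
    scales = np.asarray(scales, dtype=np.float32).reshape(shape)
    zero_points = np.asarray(zero_points, dtype=np.int32).reshape(shape)
    return (q_weight.astype(np.int32) - zero_points) * scales
```

With tensor-wise quantization, `scales` and `zero_points` collapse to scalars; the channel-wise case is what the pre-PR analysis code rejected.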

Before this PR:

deepsparse.analyze /network/alexandre/tyler/single_layer/deployment/model.onnx
  File "/home/ubuntu/venv/bin/deepsparse.analyze", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/cli.py", line 98, in wrap_common_options
    return command(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/cli.py", line 152, in wrap_with_performance_options
    return command(*args, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/deepsparse/analyze.py", line 77, in main
    analysis = ModelAnalysis.create(model_path)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/analysis.py", line 1308, in create
    result = ModelAnalysis.from_onnx(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/analysis.py", line 922, in from_onnx
    node_analyses = cls.analyze_nodes(model_graph)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/analysis.py", line 1371, in analyze_nodes
    node_analysis = NodeAnalysis.from_node(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/analyze/analysis.py", line 365, in from_node
    sparse_node = is_sparse_layer(model_graph, node)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/utils/onnx/analysis.py", line 257, in is_sparse_layer
    return get_node_sparsity(model_graph, node) > 0
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/utils/onnx/analysis.py", line 318, in get_node_sparsity
    num_zeros, weight_size = get_node_num_zeros_and_size(model_graph, node)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/utils/onnx/analysis.py", line 148, in get_node_num_zeros_and_size
    zero_point = get_zero_point(model_graph, node)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sparsezoo/utils/onnx/analysis.py", line 244, in get_zero_point
    raise NotImplementedError("Channel-wise zero points are not supported")
NotImplementedError: Channel-wise zero points are not supported
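The failure comes from the sparsity analysis, which counts a quantized weight element as "zero" when it equals the zero point; with a per-channel zero-point array rather than a scalar, the old code raised. A hedged sketch of how the count can handle both cases via NumPy broadcasting (hypothetical helper, not the actual sparsezoo implementation):

```python
import numpy as np

def count_zeros_per_channel(weight, zero_point, channel_axis=0):
    """Count quantized-weight elements equal to their zero point.

    `weight` is the INT8 weight tensor; `zero_point` may be a scalar
    (tensor-wise quantization) or a 1-D array with one entry per channel
    (channel-wise quantization). Returns (num_zeros, weight_size).
    Illustrative only; names are assumptions.
    """
    zero_point = np.asarray(zero_point)
    if zero_point.ndim > 0:
        # Reshape the per-channel zero points so each one compares
        # against its own channel of the weight tensor.
        shape = [1] * weight.ndim
        shape[channel_axis] = zero_point.shape[0]
        zero_point = zero_point.reshape(shape)
    num_zeros = int(np.count_nonzero(weight == zero_point))
    return num_zeros, weight.size
```

The scalar path is unchanged; the per-channel path only adds a reshape so the equality comparison broadcasts channel by channel.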

After this PR (the command runs successfully):

2024-02-12 09:00:04 deepsparse.analyze INFO     Starting Analysis ...
2024-02-12 09:00:40 deepsparse.analyze INFO     Analysis complete, collating results...
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx512, binary=avx512)
[7f3c2eff4740 >WARN<  operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
Node Timings for Benchmark # 1:
 NODE_NAME                                                     AVG_RUNTIME
 /model/layers.0/input_layernorm/ReduceMean                    2.10
 /model/layers.0/input_layernorm/Mul                           3.89
 /model/layers.0/self_attn/v_proj/module/MatMul_quant          183.68
 /model/layers.0/self_attn/k_proj/module/MatMul_quant          182.47
 /model/layers.0/self_attn/q_proj/module/MatMul_quant          181.16
 /model/layers.0/self_attn/attn_weights_matmul/MatMul_quant    125.34
 /model/Sub                                                    5.61
 /model/Add_1                                                  3.01
 /model/layers.0/self_attn/attn_output_matmul/MatMul_quant     157.15
 /model/layers.0/self_attn/o_proj/module/MatMul_quant          186.67
 /model/layers.0/mlp/up_proj/module/MatMul_quant               565.44
 /model/layers.0/mlp/gate_proj/module/MatMul_quant             567.57
 /model/layers.0/mlp/Mul                                       8.28
 /model/layers.0/mlp/down_proj/module/MatMul_quant             506.74
 /lm_head/module/MatMul_quant                                  25.92

Params:
 MODEL                                                                      SPARSITY  QUANTIZED  COUNT      SIZE
 /home/rahul/models/llama-single-layer-channel-quant/deployment/model.onnx  29.77     100.00     464519168  2609854493

Ops:
 MODEL                                                                      SPARSITY  QUANTIZED  COUNT      SIZE
 /home/rahul/models/llama-single-layer-channel-quant/deployment/model.onnx  29.77     100.00     464519374  2609859089

Overall:
 MODEL                                                                      LATENCY  THROUGHPUT  SUPPORTED_GRAPH  SPARSITY  QUANTIZED
 /home/rahul/models/llama-single-layer-channel-quant/deployment/model.onnx  2895.26  0.35        1.00             29.77     100.00

This can also be tested with SparseZoo directly using the following snippet:

from sparsezoo.analyze import ModelAnalysis

model_path = "/network/alexandre/tyler/single_layer/deployment/model.onnx"
analysis = ModelAnalysis.create(model_path)
my_yaml = analysis.yaml()
print(my_yaml)

@rahul-tuli rahul-tuli force-pushed the analyze/add-channel-wise-quantization-support branch from ff51003 to 68b685a on February 12, 2024 at 14:32
@rahul-tuli rahul-tuli self-assigned this Feb 12, 2024
@rahul-tuli rahul-tuli marked this pull request as ready for review February 12, 2024 14:34
@bfineran bfineran left a comment

lgtm - let's sync more on why we need to group four block for the zero points - I don't believe we need to

@bfineran bfineran merged commit 944128f into main Feb 12, 2024
4 checks passed
@bfineran bfineran deleted the analyze/add-channel-wise-quantization-support branch February 12, 2024 19:38
rahul-tuli added a commit that referenced this pull request Feb 12, 2024
bfineran added a commit that referenced this pull request Feb 13, 2024
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Satrat added a commit that referenced this pull request Feb 22, 2024
* `RegistryMixin` improved alias management (#404)

* initial commit

* add docstrings

* simplify

* hardening

* refactor

* format registry lookup strings to be lowercases

* standardise aliases

* Move evaluator registry (#411)

* More control over external data size (#412)

* When splitting external data, avoid renaming `model.data` to `model.data.1` if only one external data file gets eventually saved (#414)

* [model.download] fix function returning nothing (#420)

* [BugFix] Path not expanded (#418)

* [Fix] Allow for processing Path in the sparsezoo analysis (#417)

* Raise TypeError instead of ValueError (#426)

* Fix misleading docstring (#416)

Add test

* add support for benchmark.yaml (#415)

* add support for benchmark.yaml

recent zoo models use `benchmark.yaml` instead of `benchmarks.yaml`. adding this additional pathway so `benchmark.yaml` is downloaded in the bulk model download

* update files filter

* fix tests

---------

Co-authored-by: dbogunowicz <damian@neuralmagic.com>

* [BugFix] Add analyze to init (#421)

* Add analyze to init

* Move onnxruntime to deps

* Print model analysis (#423)

* [model.download] fix function returning nothing (#420)

* [BugFix] Path not expanded (#418)

* print model-analysis

* [Fix] Allow for processing Path in the sparsezoo analysis (#417)

* add print statement at the end of cli run

---------

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>

* Omit scalar weight (#424)

* ommit scalar weights:

* remove unwanted files

* comment

* Update src/sparsezoo/utils/onnx/analysis.py

Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>

---------

Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>

---------

Co-authored-by: George <george@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>

* update analyze help message for correctness (#432)

* initial commit (#430)

* [sparsezoo.analyze] Fix pathway such that it works for larger models (#437)

* fix analyze to work with larger models

* update for failing tests; add comments

* Update src/sparsezoo/utils/onnx/external_data.py

Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>

---------

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.coom>
Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>

* Delete hehe.py (#439)

* Download deployment dir for llms (#435)

* Download deployment dir for llms

* Use path instead of download

* only set save_as_external_data to true if the model originally had external data (#442)

* Add Channel Wise Quantization Support (#441)

* Chunk download (#429)

* chunk download, break down into 10

* lint

* threads download

* draft

* chunk download draft

* job based download and combining/deleteing chunks

* delete old code

* lint

* fix num jobs if file_size is less than the chunk size

* doc string and return types

* test

* lint

* fix type hints (#445)

* fix bug if the value is a dict (#447)

* [deepsparse.analyze] Fix v1 functionality to  work with llms (#451)

* fix equivalent changes made to analyze_v2 such that inference session works for llms; update wanrings to be debug printouts

* typo

* overwrite file (#450)

Co-authored-by: 21 <a21@21s-MacBook-Pro.local>

* Adds a `numpy_array_representer` to yaml (#454)

on runtime, to avoid serialization issues

* Avoid division by zero (#457)

Avoid log of zero

* op analysis total counts had double sparse counts (#461)

* Rename legacy analyze to analyze_v1 (#459)

* Fixing Quant % Calcuation (#462)

* initial fix

* style

* Include Sparsity in Size Calculation (#463)

* initial fix

* style

* incorporate sparsity into size calculation

* quality

* op analysis total counts had double sparse counts (#461)

* Fixing Quant % Calcuation (#462)

* initial fix

* style

* Include Sparsity in Size Calculation (#463)

* initial fix

* style

* incorporate sparsity into size calculation

* quality

* Revert "Merge branch 'main' into analyze_cherry_picks"

This reverts commit 509fa1a, reversing
changes made to 08f94c4.

---------

Co-authored-by: dbogunowicz <97082108+dbogunowicz@users.noreply.github.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Co-authored-by: Benjamin Fineran <bfineran@users.noreply.github.com>
Co-authored-by: dbogunowicz <damian@neuralmagic.com>
Co-authored-by: George <george@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.coom>
Co-authored-by: 21 <a21@21s-MacBook-Pro.local>