Enable Intel® Neural Compressor 4-bits weight-only quantization and add related example #614

yuwenzho · 2023-09-27T08:49:31Z

Describe your changes

Support 4-bit weight-only quantization with Intel® Neural Compressor and add a related example.

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that can meet the computational demands of these architectures while maintaining accuracy. Compared to conventional quantization such as W8A8 (8-bit weights and 8-bit activations), weight-only quantization (WOQ) often offers a better trade-off between performance and accuracy.

Two weight-only algorithms are provided in this PR. Round-to-nearest (RTN) is the most straightforward way to quantize weights using scale maps. The GPTQ algorithm provides more accurate quantization but requires more computational resources.
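For readers unfamiliar with RTN, the idea can be sketched in a few lines of NumPy. This is only an illustration and not the Intel® Neural Compressor implementation: the function names `rtn_quantize` and `rtn_dequantize`, the symmetric scheme, the group size of 32, and storing the 4-bit values unpacked in `int8` are all assumptions made for this example.

```python
import numpy as np


def rtn_quantize(weight, bits=4, group_size=32):
    """Symmetric round-to-nearest quantization with one scale per group.

    Hypothetical helper for illustration; not a Neural Compressor API.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    groups = weight.reshape(out_features, in_features // group_size, group_size)

    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits

    # One scale per group, chosen so the largest magnitude maps to qmax.
    absmax = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / qmax).astype(np.float32)

    # Round each weight to the nearest integer level and clamp to the 4-bit range.
    q = np.clip(np.round(groups / scale), qmin, qmax).astype(np.int8)
    return q, scale


def rtn_dequantize(q, scale, shape):
    """Map integer levels back to floating point using the per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)
q, scale = rtn_quantize(w, bits=4, group_size=32)
w_hat = rtn_dequantize(q, scale, w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Because every weight is rounded to its nearest level, the reconstruction error of each element is at most half of its group's scale. GPTQ spends extra computation (it needs calibration data) to choose levels that reduce the layer's output error rather than the per-weight error.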

Checklist before requesting a review

Add unit tests for this change.
Make sure all tests can pass.
Update documents if necessary.
Format your code by running pre-commit run --all-files
Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

github-advanced-security

lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

guotuofeng · 2023-09-27T09:08:56Z

/azp run

azure-pipelines · 2023-09-27T09:09:10Z

Azure Pipelines successfully started running 2 pipeline(s).

enable inc weight-only quantization and add related example

d88f5cb

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

github-advanced-security bot found potential problems Sep 27, 2023

fix format

4c5c46c

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

guotuofeng approved these changes Sep 28, 2023

guotuofeng merged commit 6ea3e72 into microsoft:main Sep 28, 2023
32 checks passed
