
Added AWQ option to llm-chatbot notebook #2043

Merged

Conversation

@nikita-savelyevv (Collaborator) commented May 24, 2024

Add an option to run the AWQ algorithm during INT4 model compression in the llm-chatbot and llm-rag-langchain notebooks. Applying AWQ slightly improves model generation quality, but it requires a significant amount of additional memory and time, so it is disabled by default.
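For reference, a minimal sketch of what enabling AWQ during INT4 weight compression looks like with NNCF's `compress_weights` API. The model path, tokenizer name, and `transform_fn` below are illustrative assumptions, not the notebook's exact code (real LLM inputs usually also need `attention_mask`, `position_ids`, etc.):

```python
import nncf
import openvino as ov
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative paths/names; the notebook constructs these differently.
core = ov.Core()
model = core.read_model("llama-2-chat-7b/openvino_model.xml")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# AWQ needs calibration samples, unlike plain data-free INT4 compression.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def transform_fn(sample):
    # Map a raw text sample to a dict of model inputs (simplified sketch).
    return tokenizer(sample["text"], return_tensors="np")

calibration_dataset = nncf.Dataset(wikitext, transform_fn)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,
    ratio=0.8,
    dataset=calibration_dataset,  # required when awq=True
    awq=True,                     # disabled by default, as in the notebook
)
```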

Some evaluation results are below. The wikitext task is considered the more accurate one. Unless stated otherwise, AWQ was calibrated on the wikitext2 dataset.

| Model | Compression | PPL on lambada-openai | PPL on wikitext | Compression Time |
|---|---|---|---|---|
| gemma-2b-it | FP16 | 8.23 | | |
| gemma-2b-it | INT8_asym | 8.36 | | |
| gemma-2b-it | INT4_sym, group size 64, ratio 60% | 8.9 | | 59 sec. |
| gemma-2b-it | INT4_sym, group size 64, ratio 60% + AWQ | 8.62 | | 202 sec. |
| llama-2-chat-7b | FP16 | 3.26 | 11.6 | |
| llama-2-chat-7b | INT8_asym | 3.27 | 11.6 | |
| llama-2-chat-7b | INT4_sym, group size 128, ratio 80% | 3.38 | 11.95 | 215 sec. |
| llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (wikitext2) | 3.44 | 11.88 | 768 sec. |
| llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (ptb) | 3.42 | 11.87 | |
| llama-3-8b-instruct | FP16 | 3.1 | | |
| llama-3-8b-instruct | INT8_asym | 3.08 | | |
| llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% | 3.38 | | 242 sec. |
| llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% + AWQ | 3.26 | | 956 sec. |

Ticket: 141233


@nikita-savelyevv marked this pull request as ready for review on June 7, 2024, 07:47
@andreyanufr left a comment


Maybe it’s worth stating on the web page exactly which model precision is used for chat.

@@ -32,6 +32,7 @@
"- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)\n",
@eaidova (Contributor) commented Jun 10, 2024


Please highlight the note about when AWQ is not applicable, and could you provide details on what you mean by "skip" (will there be a warning message, is there an explicit configuration that skips it, or something else?).

Also, please provide details about the dataset used.



@nikita-savelyevv (Collaborator, Author) replied:

Highlighted the note about skipping the algorithm and added information about which dataset is used for calibration.

When AWQ is skipped, NNCF emits an INFO-level log message: "No matching patterns were found for applying AWQ algorithm, it will be skipped."
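To make that skip message visible when experimenting, one option is to raise the log level of NNCF's logger before running compression. This is a sketch; the `"nncf"` logger name is an assumption based on NNCF using the standard `logging` module:

```python
import logging

# Surface INFO-level NNCF messages, such as the AWQ skip notice
# ("nncf" logger name is an assumption, not taken from the notebook).
logging.getLogger("nncf").setLevel(logging.INFO)
```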

@eaidova merged commit 0238a6e into openvinotoolkit:latest on Jun 11, 2024. 5 of 18 checks passed.