Added AWQ option to llm-chatbot notebook (#2043)
Add an option to run the AWQ algorithm during INT4 model compression in the
`llm-chatbot` and `llm-rag-langchain` notebooks. Applying AWQ slightly
improves model generation quality, but requires a significant amount of
additional memory and time, so it is disabled by default.

Some evaluation results are shown below. The wikitext task is considered the
more accurate one. Unless stated otherwise, AWQ was calibrated on the
wikitext2 dataset.
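
For reference, a minimal sketch of the export command the notebooks assemble when the new AWQ option is enabled, run from Python the way the notebooks do. The INT4/AWQ flags mirror the ones appended by this change; the model id and output directory are illustrative assumptions.

```python
# Sketch only: the model id and output directory are illustrative, not taken from
# the notebooks; the compression flags mirror the ones appended by this change.
import subprocess

export_command = (
    "optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf "
    "--weight-format int4 --group-size 128 --ratio 0.8 --sym "
    "--awq --dataset wikitext2 --num-samples 128 "  # appended when "Enable AWQ" is checked
    "llama-2-chat-7b/INT4_compressed_weights"
)
subprocess.run(export_command.split(), check=True)
```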

Model | Compression | PPL on lambada-openai | PPL on wikitext | Compression Time
-- | -- | -- | -- | --
gemma-2b-it | FP16 | 8.23 | |
gemma-2b-it | INT8_asym | 8.36 | |
gemma-2b-it | INT4_sym, group size 64, ratio 60% | 8.9 | | 59 sec.
gemma-2b-it | INT4_sym, group size 64, ratio 60% + AWQ | 8.62 | | 202 sec.
llama-2-chat-7b | FP16 | 3.26 | 11.6 |
llama-2-chat-7b | INT8_asym | 3.27 | 11.6 |
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% | 3.38 | 11.95 | 215 sec.
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (wikitext2) | 3.44 | 11.88 | 768 sec.
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (ptb) | 3.42 | 11.87 |
llama-3-8b-instruct | FP16 | 3.1 | |
llama-3-8b-instruct | INT8_asym | 3.08 | |
llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% | 3.38 | | 242 sec.
llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% + AWQ | 3.26 | | 956 sec.


**Ticket**
141233
nikita-savelyevv committed Jun 11, 2024
1 parent 5c03914 commit 0238a6e
Showing 3 changed files with 70 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -41,6 +41,7 @@ autogenerated
autoregressive
autoregressively
AutoTokenizer
AWQ
backend
backends
Baevski
@@ -853,6 +854,7 @@ WebUI
WER
WIKISQL
WikiTable
Wikitext
WIKITQ
Wofk
WTQ
34 changes: 34 additions & 0 deletions notebooks/llm-chatbot/llm-chatbot.ipynb
@@ -32,6 +32,7 @@
"- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)\n",
"- [Compress model weights](#Compress-model-weights)\n",
" - [Weights Compression using Optimum-CLI](#Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
"- [Instantiate Model using Optimum Intel](#Instantiate-Model-using-Optimum-Intel)\n",
"- [Run Chatbot](#Run-Chatbot)\n",
@@ -453,6 +454,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use the `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can choose whether to additionally apply AWQ during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: There may be no matching patterns in the model to which AWQ can be applied; in that case it will be skipped."
]
},
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "11a8473e509aa040"
},
{
"attachments": {},
"cell_type": "markdown",
@@ -613,6 +645,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
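
For a rough end-to-end picture of what the new checkbox enables, below is a sketch of the same INT4 + AWQ compression done through the optimum-intel Python API instead of the CLI. It assumes `OVWeightQuantizationConfig` exposes AWQ via `quant_method="awq"`; the model id and output directory are illustrative.

```python
# Sketch only: assumes OVWeightQuantizationConfig accepts quant_method="awq";
# the model id and output directory are illustrative, not taken from the notebooks.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,
    dataset="wikitext2",   # same calibration set the notebooks pass to --dataset
    num_samples=128,
    quant_method="awq",    # activation-aware weight tuning before INT4 compression
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("llama-2-chat-7b/INT4_compressed_weights")
```
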
34 changes: 34 additions & 0 deletions notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
@@ -31,6 +31,7 @@
"- [login to huggingfacehub to get access to pretrained model](#login-to-huggingfacehub-to-get-access-to-pretrained-model)\n",
"- [Convert model and compress model weights](#convert-model-and-compress-model-weights)\n",
" - [LLM conversion and Weights Compression using Optimum-CLI](#LLM-conversion-and-Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
" - [Convert embedding model using Optimum-CLI](#Convert-embedding-model-using-Optimum-CLI)\n",
" - [Convert rerank model using Optimum-CLI](#Convert-rerank-model-using-Optimum-CLI)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
@@ -417,6 +418,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use the `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can choose whether to additionally apply AWQ during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: There may be no matching patterns in the model to which AWQ can be applied; in that case it will be skipped."
]
},
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "e4531bbd67d8753d"
},
{
"cell_type": "code",
"execution_count": 8,
@@ -531,6 +563,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
