Added AWQ option to llm-chatbot notebook #2043

Merged
2 changes: 2 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -41,6 +41,7 @@ autogenerated
autoregressive
autoregressively
AutoTokenizer
AWQ
backend
backends
Baevski
@@ -853,6 +854,7 @@ WebUI
WER
WIKISQL
WikiTable
Wikitext
WIKITQ
Wofk
WTQ
34 changes: 34 additions & 0 deletions notebooks/llm-chatbot/llm-chatbot.ipynb
@@ -32,6 +32,7 @@
"- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)\n",
Contributor @eaidova commented on Jun 10, 2024:
Please highlight the note about AWQ not being applicable, and could you provide details on what you mean by "skip"? Will there be a warning message, is there an explicit configuration that skips it, or something else?

Also, please provide details about the dataset used.


Collaborator Author replied:

Highlighted the note about the algorithm being skipped and added information about which dataset is used for calibration.

When AWQ is skipped, there will be an NNCF INFO level log message: "No matching patterns were found for applying AWQ algorithm, it will be skipped."
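For readers who want to confirm which path was taken, one option is to raise NNCF's logger to INFO before running the export, so the message above is not filtered out. A minimal sketch, assuming the installed NNCF version exposes `set_log_level`:

```python
import logging

import nncf

# Ensure NNCF INFO messages reach the notebook output, so that
# "No matching patterns were found for applying AWQ algorithm, it will be skipped."
# is visible if AWQ cannot be applied to the model.
nncf.set_log_level(logging.INFO)
```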

"- [Compress model weights](#Compress-model-weights)\n",
" - [Weights Compression using Optimum-CLI](#Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
"- [Instantiate Model using Optimum Intel](#Instantiate-Model-using-Optimum-Intel)\n",
"- [Run Chatbot](#Run-Chatbot)\n",
@@ -453,6 +454,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can enable AWQ to be additionally applied during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: It is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped."
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "11a8473e509aa040"
},
{
"attachments": {},
"cell_type": "markdown",
@@ -614,6 +646,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
34 changes: 34 additions & 0 deletions notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
@@ -31,6 +31,7 @@
"- [login to huggingfacehub to get access to pretrained model](#login-to-huggingfacehub-to-get-access-to-pretrained-model)\n",
"- [Convert model and compress model weights](#convert-model-and-compress-model-weights)\n",
" - [LLM conversion and Weights Compression using Optimum-CLI](#LLM-conversion-and-Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
" - [Convert embedding model using Optimum-CLI](#Convert-embedding-model-using-Optimum-CLI)\n",
" - [Convert rerank model using Optimum-CLI](#Convert-rerank-model-using-Optimum-CLI)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
@@ -417,6 +418,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can enable AWQ to be additionally applied during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: It is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped."
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "e4531bbd67d8753d"
},
{
"cell_type": "code",
"execution_count": 8,
@@ -531,6 +563,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",