Added AWQ option to llm-chatbot notebook (#2043)
Add an option to run the AWQ algorithm during INT4 model compression in the
`llm-chatbot` and `llm-rag-langchain` notebooks. Applying AWQ slightly
improves model generation quality, but requires a significant amount of
additional memory and time, so it is disabled by default.

Some evaluation results are shown below. The wikitext task is considered the
more accurate one. Unless stated otherwise, AWQ was calibrated on the
wikitext2 dataset.
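
For reference, a minimal sketch of the export command the notebooks assemble when the new AWQ option is enabled, run from Python the way the notebooks do. The INT4/AWQ flags mirror the ones appended by this change; the model id and output directory are illustrative assumptions.

```python
# Sketch only: the model id and output directory are illustrative, not taken from
# the notebooks; the compression flags mirror the ones appended by this change.
import subprocess

export_command = (
    "optimum-cli export openvino --model meta-llama/Llama-2-7b-chat-hf "
    "--weight-format int4 --group-size 128 --ratio 0.8 --sym "
    "--awq --dataset wikitext2 --num-samples 128 "  # appended when "Enable AWQ" is checked
    "llama-2-chat-7b/INT4_compressed_weights"
)
subprocess.run(export_command.split(), check=True)
```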

Model | Compression | PPL on lambada-openai | PPL on wikitext | Compression Time
-- | -- | -- | -- | --
gemma-2b-it | FP16 | 8.23 | |
gemma-2b-it | INT8_asym | 8.36 | |
gemma-2b-it | INT4_sym, group size 64, ratio 60% | 8.9 | | 59 sec.
gemma-2b-it | INT4_sym, group size 64, ratio 60% + AWQ | 8.62 | | 202 sec.
llama-2-chat-7b | FP16 | 3.26 | 11.6 |
llama-2-chat-7b | INT8_asym | 3.27 | 11.6 |
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% | 3.38 | 11.95 | 215 sec.
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (wikitext2) | 3.44 | 11.88 | 768 sec.
llama-2-chat-7b | INT4_sym, group size 128, ratio 80% + AWQ (ptb) | 3.42 | 11.87 |
llama-3-8b-instruct | FP16 | 3.1 | |
llama-3-8b-instruct | INT8_asym | 3.08 | |
llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% | 3.38 | | 242 sec.
llama-3-8b-instruct | INT4_sym, group size 128, ratio 80% + AWQ | 3.26 | | 956 sec.


**Ticket**
141233
nikita-savelyevv committed Jun 11, 2024
1 parent 5c03914 commit 0238a6e
Showing 3 changed files with 70 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -41,6 +41,7 @@ autogenerated
autoregressive
autoregressively
AutoTokenizer
AWQ
backend
backends
Baevski
@@ -853,6 +854,7 @@ WebUI
WER
WIKISQL
WikiTable
Wikitext
WIKITQ
Wofk
WTQ
34 changes: 34 additions & 0 deletions notebooks/llm-chatbot/llm-chatbot.ipynb
@@ -32,6 +32,7 @@
"- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)\n",
"- [Compress model weights](#Compress-model-weights)\n",
" - [Weights Compression using Optimum-CLI](#Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
"- [Instantiate Model using Optimum Intel](#Instantiate-Model-using-Optimum-Intel)\n",
"- [Run Chatbot](#Run-Chatbot)\n",
@@ -453,6 +454,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use the `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can choose whether to additionally apply AWQ during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: There may be no matching patterns in the model to which AWQ can be applied; in that case it will be skipped."
]
},
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "11a8473e509aa040"
},
{
"attachments": {},
"cell_type": "markdown",
@@ -613,6 +645,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
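
For a rough end-to-end picture of what the new checkbox enables, below is a sketch of the same INT4 + AWQ compression done through the optimum-intel Python API instead of the CLI. It assumes `OVWeightQuantizationConfig` exposes AWQ via `quant_method="awq"`; the model id and output directory are illustrative.

```python
# Sketch only: assumes OVWeightQuantizationConfig accepts quant_method="awq";
# the model id and output directory are illustrative, not taken from the notebooks.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.8,
    dataset="wikitext2",   # same calibration set the notebooks pass to --dataset
    num_samples=128,
    quant_method="awq",    # activation-aware weight tuning before INT4 compression
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("llama-2-chat-7b/INT4_compressed_weights")
```
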
34 changes: 34 additions & 0 deletions notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
@@ -31,6 +31,7 @@
"- [login to huggingfacehub to get access to pretrained model](#login-to-huggingfacehub-to-get-access-to-pretrained-model)\n",
"- [Convert model and compress model weights](#convert-model-and-compress-model-weights)\n",
" - [LLM conversion and Weights Compression using Optimum-CLI](#LLM-conversion-and-Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
" - [Convert embedding model using Optimum-CLI](#Convert-embedding-model-using-Optimum-CLI)\n",
" - [Convert rerank model using Optimum-CLI](#Convert-rerank-model-using-Optimum-CLI)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
@@ -417,6 +418,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use the `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can choose whether to additionally apply AWQ during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: There may be no matching patterns in the model to which AWQ can be applied; in that case it will be skipped."
]
},
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "e4531bbd67d8753d"
},
{
"cell_type": "code",
"execution_count": 8,
@@ -531,6 +563,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
