Added AWQ option to llm-chatbot notebook #2043

Merged
2 changes: 2 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -41,6 +41,7 @@ autogenerated
autoregressive
autoregressively
AutoTokenizer
AWQ
backend
backends
Baevski
@@ -853,6 +854,7 @@ WebUI
WER
WIKISQL
WikiTable
Wikitext
WIKITQ
Wofk
WTQ
34 changes: 34 additions & 0 deletions notebooks/llm-chatbot/llm-chatbot.ipynb
@@ -32,6 +32,7 @@
"- [Convert model using Optimum-CLI tool](#Convert-model-using-Optimum-CLI-tool)\n",
Contributor @eaidova commented on Jun 10, 2024:
Please highlight the note about AWQ not being applicable, and could you provide details on what you mean by "skip"? Will there be a warning message, is there an explicit configuration that skips it, or something else?

Also, please provide details about the dataset used.


Collaborator Author replied:

Highlighted the note about the algorithm being skipped and added information about which dataset is used for calibration.

When AWQ is skipped, there will be an NNCF INFO level log message: "No matching patterns were found for applying AWQ algorithm, it will be skipped."
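For readers who want to confirm which path was taken, one option is to raise NNCF's logger to INFO before running the export, so the message above is not filtered out. A minimal sketch, assuming the installed NNCF version exposes `set_log_level`:

```python
import logging

import nncf

# Ensure NNCF INFO messages reach the notebook output, so that
# "No matching patterns were found for applying AWQ algorithm, it will be skipped."
# is visible if AWQ cannot be applied to the model.
nncf.set_log_level(logging.INFO)
```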

"- [Compress model weights](#Compress-model-weights)\n",
" - [Weights Compression using Optimum-CLI](#Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
"- [Instantiate Model using Optimum Intel](#Instantiate-Model-using-Optimum-Intel)\n",
"- [Run Chatbot](#Run-Chatbot)\n",
@@ -453,6 +454,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can enable AWQ to be additionally applied during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: It is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped."
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "11a8473e509aa040"
},
{
"attachments": {},
"cell_type": "markdown",
@@ -614,6 +646,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",
34 changes: 34 additions & 0 deletions notebooks/llm-rag-langchain/llm-rag-langchain.ipynb
@@ -31,6 +31,7 @@
"- [login to huggingfacehub to get access to pretrained model](#login-to-huggingfacehub-to-get-access-to-pretrained-model)\n",
"- [Convert model and compress model weights](#convert-model-and-compress-model-weights)\n",
" - [LLM conversion and Weights Compression using Optimum-CLI](#LLM-conversion-and-Weights-Compression-using-Optimum-CLI)\n",
" - [Weight compression with AWQ](#Weight-compression-with-AWQ)\n",
" - [Convert embedding model using Optimum-CLI](#Convert-embedding-model-using-Optimum-CLI)\n",
" - [Convert rerank model using Optimum-CLI](#Convert-rerank-model-using-Optimum-CLI)\n",
"- [Select device for inference and model variant](#Select-device-for-inference-and-model-variant)\n",
@@ -417,6 +418,37 @@
"display(prepare_fp16_model)"
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"#### Weight compression with AWQ\n",
"[back to top ⬆️](#Table-of-contents:)\n",
"\n",
"[Activation-aware Weight Quantization](https://arxiv.org/abs/2306.00978) (AWQ) is an algorithm that tunes model weights for more accurate INT4 compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time for tuning weights on a calibration dataset. We use `wikitext-2-raw-v1/train` subset of the [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext) dataset for calibration.\n",
"\n",
"Below you can enable AWQ to be additionally applied during model export with INT4 precision.\n",
"\n",
">**Note**: Applying AWQ requires significant memory and time.\n",
"\n",
">**Note**: It is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped."
]
},
{
"metadata": {},
"cell_type": "code",
"outputs": [],
"execution_count": null,
"source": [
"enable_awq = widgets.Checkbox(\n",
" value=False,\n",
" description=\"Enable AWQ\",\n",
" disabled=not prepare_int4_model.value,\n",
")\n",
"display(enable_awq)"
],
"id": "e4531bbd67d8753d"
},
{
"cell_type": "code",
"execution_count": 8,
@@ -531,6 +563,8 @@
" int4_compression_args = \" --group-size {} --ratio {}\".format(model_compression_params[\"group_size\"], model_compression_params[\"ratio\"])\n",
" if model_compression_params[\"sym\"]:\n",
" int4_compression_args += \" --sym\"\n",
" if enable_awq.value:\n",
" int4_compression_args += \" --awq --dataset wikitext2 --num-samples 128\"\n",
" export_command_base += int4_compression_args\n",
" if remote_code:\n",
" export_command_base += \" --trust-remote-code\"\n",