[Neuralchat] Support RAG with HF endpoint & add notebook (#1334)
* support rag with hf endpoint & add notebook

Signed-off-by: LetongHan <letong.han@intel.com>
Showing 5 changed files with 409 additions and 2 deletions.
intel_extension_for_transformers/neural_chat/docs/notebooks/build_chatbot_with_rag.ipynb (341 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NeuralChat is a customizable chat framework designed to create your own chatbot within a few minutes on multiple architectures. This notebook demonstrates how to build a Retrieval-Augmented Generation (RAG) application on 4th Generation Intel® Xeon® Scalable Processors (Sapphire Rapids) and Habana Gaudi processors (HPU)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prepare Environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install Intel® Extension for Transformers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the requirements:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/intel/intel-extension-for-transformers.git"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/\n",
"!pip install -r requirements.txt\n",
"%cd ../../../"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!conda list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Consume RAG via NeuralChat Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consume RAG with Python API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Users can leverage the NeuralChat Retrieval plugin for domain-specific chat by feeding it documents, as shown below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/\n",
"!pip install -r requirements.txt\n",
"%cd ../../../../../../"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir docs\n",
"%cd docs\n",
"!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.jsonl\n",
"!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.txt\n",
"!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/assets/docs/sample.xlsx\n",
"%cd .."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from intel_extension_for_transformers.neural_chat import PipelineConfig\n",
"from intel_extension_for_transformers.neural_chat import build_chatbot\n",
"from intel_extension_for_transformers.neural_chat import plugins\n",
"plugins.retrieval.enable=True\n",
"plugins.retrieval.args[\"input_path\"]=\"./docs/\"\n",
"config = PipelineConfig(plugins=plugins)\n",
"chatbot = build_chatbot(config)\n",
"response = chatbot.predict(\"How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?\")\n",
"print(response)"
]
},
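{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the Retrieval plugin contributes, you can optionally build a second chatbot with retrieval disabled and ask the same question. This is a minimal, optional sketch using the same `plugins`/`PipelineConfig` API as above; it assumes toggling `plugins.retrieval.enable` is sufficient to switch retrieval off, and it is not required for the rest of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: the same question without the Retrieval plugin,\n",
"# to compare against the RAG-grounded answer above.\n",
"plugins.retrieval.enable=False  # turn retrieval off for this comparison\n",
"plain_chatbot = build_chatbot(PipelineConfig(plugins=plugins))\n",
"plain_response = plain_chatbot.predict(\"How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?\")\n",
"print(plain_response)  # typically less grounded than the RAG-enabled answer\n",
"plugins.retrieval.enable=True  # re-enable retrieval for the rest of the notebook"
]
},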
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consume RAG with HTTP Restful API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Users should start the `neuralchat_server` with the commands below to expose the HTTP Restful APIs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"cp -r ./docs ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/deployment/rag/docs\n",
"cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/deployment/rag\n",
"python askdock.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NeuralChat supports an OpenAI-protocol-compatible HTTP Restful API. Users can consume the RAG Restful API with a cURL command like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"curl -X POST localhost:8000/v1/chat/completions \\\n",
"  -H 'Content-Type: application/json' \\\n",
"  -d '{\"model\": \"Intel/neural-chat-7b-v3-1\", \"messages\": \"How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?\"}'"
]
},
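{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the endpoint follows the OpenAI protocol, it can also be consumed from Python. Below is a minimal sketch using the `requests` library, assuming the server started above is listening on localhost:8000 and accepts the same payload as the cURL call:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: call the RAG endpoint from Python via the OpenAI-style route.\n",
"# Assumes the neuralchat server above is running on localhost:8000.\n",
"import requests\n",
"\n",
"payload = {\n",
"    \"model\": \"Intel/neural-chat-7b-v3-1\",\n",
"    \"messages\": \"How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?\",\n",
"}\n",
"resp = requests.post(\"http://localhost:8000/v1/chat/completions\", json=payload, timeout=120)\n",
"resp.raise_for_status()\n",
"print(resp.json())"
]
},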
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Consume RAG with TGI service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This part covers two scenarios: running the service on SPR and on Habana Gaudi. Before consuming RAG with a TGI service, users need to launch a TGI service locally or remotely as below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch TGI service on SPR"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"model=Intel/neural-chat-7b-v3-1\n",
"volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n",
"\n",
"docker run --shm-size 1g -p 8080:80 -v $volume:/data -e https_proxy -e http_proxy -e HTTPS_PROXY -e HTTP_PROXY -e no_proxy -e NO_PROXY ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model"
]
},
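{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the container is up, you can optionally smoke-test the TGI endpoint before wiring it into NeuralChat. Below is a minimal sketch against TGI's `/generate` route, assuming the service is published on localhost:8080 as in the command above; the same check works for the Gaudi service in the next section."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: verify the TGI service responds before building the chatbot.\n",
"# Assumes the TGI container above is published on localhost:8080.\n",
"import requests\n",
"\n",
"resp = requests.post(\n",
"    \"http://localhost:8080/generate\",\n",
"    json={\"inputs\": \"What is RAG?\", \"parameters\": {\"max_new_tokens\": 32}},\n",
"    timeout=120,\n",
")\n",
"resp.raise_for_status()\n",
"print(resp.json())  # e.g. {\"generated_text\": \"...\"}"
]
},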
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Launch TGI service on Habana Gaudi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Habana Gaudi, you need to build the TGI Docker image on your server first, then start the TGI service using this Gaudi Docker image."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"git clone https://github.com/huggingface/tgi-gaudi.git\n",
"cd tgi-gaudi\n",
"docker build -t tgi_gaudi ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"model=Intel/neural-chat-7b-v3-1\n",
"volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run\n",
"tgi_habana_visible_devices=${your_visible_habana_devices} # define the Habana devices visible to the TGI service, such as `all` or `0,1`\n",
"tgi_sharded=true # boolean, whether to shard across more than one card\n",
"tgi_num_shard=2 # integer, between 1 and the number of your physical Gaudi cards (usually 8)\n",
"\n",
"docker run -p 8080:80 --name tgi_service_gaudi -v $volume:/data -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true --runtime=habana -e HABANA_VISIBLE_DEVICES=$tgi_habana_visible_devices -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model --sharded $tgi_sharded --num-shard $tgi_num_shard"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consume RAG service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the TGI service is ready, you can leverage its endpoint to construct a NeuralChat chatbot. Please follow [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) to get an access token, then export it as your Hugging Face API token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "powershell"
}
},
"outputs": [],
"source": [
"export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}"
]
},
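{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that an `export` in a shell cell does not necessarily propagate to the Jupyter Python kernel, so the `os.getenv` call in the next cell may come back empty. Below is a minimal sketch that sets the token directly from Python instead; the `your_hf_api_token` string is a placeholder for your own token."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: make sure the token is visible to the Python kernel,\n",
"# since an `export` in a separate shell cell may not propagate here.\n",
"import os\n",
"\n",
"if not os.getenv(\"HUGGINGFACEHUB_API_TOKEN\"):\n",
"    os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = \"your_hf_api_token\"  # placeholder, fill in your token\n",
"print(\"token set:\", bool(os.environ.get(\"HUGGINGFACEHUB_API_TOKEN\")))"
]
},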
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from intel_extension_for_transformers.neural_chat import PipelineConfig\n",
"from intel_extension_for_transformers.neural_chat import build_chatbot\n",
"from intel_extension_for_transformers.neural_chat import plugins\n",
"\n",
"plugins.retrieval.enable=True\n",
"plugins.retrieval.args[\"input_path\"]=\"./docs/\"\n",
"config = PipelineConfig(\n",
"    model_name_or_path=\"Intel/neural-chat-7b-v3-1\",\n",
"    plugins=plugins,\n",
"    hf_endpoint_url=\"http://localhost:8080/\",\n",
"    hf_access_token=os.getenv(\"HUGGINGFACEHUB_API_TOKEN\", \"\")\n",
")\n",
"chatbot = build_chatbot(config)\n",
"response = chatbot.predict(\"How many cores does the Intel® Xeon® Platinum 8480+ Processor have in total?\")\n",
"print(response)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}