[NeuralChat] Refine Document (#1127)

[NeuralChat] Refactor RAG code and structure

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: Tian, Feng <feng.tian@intel.com>
Co-authored-by: hshen14 <haihao.shen@intel.com>
lvliang-intel and hshen14 committed Jan 13, 2024
1 parent 04f5ef6 commit aabb2fc
Showing 6 changed files with 327 additions and 114 deletions.
215 changes: 101 additions & 114 deletions intel_extension_for_transformers/neural_chat/README.md

NeuralChat
===========================
<h3>A customizable framework to create your own LLM-driven AI apps within minutes</h3>

---
<div align="left">
[🌟RESTful API](./docs/neuralchat_api.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻Examples](./examples)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Notebooks](./docs/full_notebooks.md)
</div>

# Introduction

NeuralChat is a powerful and flexible open framework that empowers you to effortlessly create LLM-centric AI applications, including chatbots and copilots.
* Support a range of hardware, such as [Intel Xeon Scalable processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Gaudi AI processors](https://habana.ai/products), [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html), and NVIDIA GPUs
* Leverage the leading AI frameworks (e.g., [PyTorch](https://pytorch.org/)) and popular domain libraries (e.g., [Hugging Face](https://github.com/huggingface), [Langchain](https://www.langchain.com/)) with their extensions
* Support model customization through parameter-efficient fine-tuning, quantization, and sparsity. Released [Intel NeuralChat-7B LLM](https://huggingface.co/Intel/neural-chat-7b-v3-1), ranking #1 on the Hugging Face leaderboard in Nov'23
* Provide a rich set of plugins that can augment the AI applications through retrieval-augmented generation (RAG) (e.g., [fastRAG](https://github.com/IntelLabs/fastRAG/tree/main)), content moderation, query caching, and more
* Integrate with popular serving frameworks (e.g., [vLLM](https://github.com/vllm-project/vllm), [TGI](https://github.com/huggingface/text-generation-inference), [Triton](https://developer.nvidia.com/triton-inference-server)). Support [OpenAI](https://platform.openai.com/docs/introduction)-compatible APIs to simplify the creation or migration of AI applications

<a target="_blank" href="./assets/pictures/neuralchat.png">
<a target="_blank" href="./docs/images/neuralchat_arch.png">
<p align="center">
<img src="./assets/pictures/neuralchat.png" alt="NeuralChat" width=600 height=250>
<img src="./docs/images/neuralchat_arch.png" alt="NeuralChat" width=600 height=340>
</p>
</a>

> NeuralChat is under active development. APIs are subject to change.

# Installation

NeuralChat is part of Intel Extension for Transformers, so install Intel Extension for Transformers first by following the [installation](../../docs/installation.md) guide. After that, install the additional dependencies for NeuralChat for your device:

```shell
# For CPU device
pip install -r requirements_cpu.txt

# For HPU device
pip install -r requirements_hpu.txt

# For XPU device
pip install -r requirements_xpu.txt

# For CUDA device
pip install -r requirements.txt
```

# Getting Started

## OpenAI-Compatible RESTful APIs

NeuralChat provides OpenAI-compatible RESTful APIs for LLM inference, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. The NeuralChat service is also accessible through the [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and the `requests` library. See [neuralchat_api.md](./docs/neuralchat_api.md) for details.

### Launch OpenAI-compatible Service

NeuralChat launches a chatbot service using [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) by default. You can customize the chatbot service by configuring the YAML file.

```shell
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```

### Access the Service

Once the service is running, it exposes an OpenAI-compatible endpoint `/v1/chat/completions`. You can use any of the ways below to access the endpoint.

#### Using OpenAI Client Library
```python
from openai import Client
# Replace 'your_api_key' with your actual OpenAI API key
api_key = 'your_api_key'
# The client appends endpoint paths (e.g., /chat/completions) to this base URL
backend_url = 'http://127.0.0.1:80/v1'
client = Client(api_key=api_key, base_url=backend_url)
response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."},
    ]
)
print(response)
```

#### Using Curl
```shell
curl http://127.0.0.1:80/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
        ]
    }'
```

#### Using Python Requests Library

```python
import requests

url = 'http://127.0.0.1:80/v1/chat/completions'
headers = {'Content-Type': 'application/json'}
data = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
    ]
}
response = requests.post(url, headers=headers, json=data)
print(response.json())
```

## Langchain Extension APIs

Intel Extension for Transformers provides a comprehensive suite of Langchain-based extension APIs, including advanced retrievers, embedding models, and vector stores. These enhancements are carefully crafted to expand the capabilities of the original Langchain API, ultimately boosting overall performance. This extension is specifically tailored to enhance the functionality and performance of RAG.

### Vector Stores

We introduce enhanced vector store operations, enabling users to adjust and fine-tune their settings even after the chatbot has been initialized, offering a more adaptable and user-friendly experience. For Langchain users, integrating and utilizing optimized vector stores is straightforward: simply replace the original Chroma API in Langchain.

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```
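
For example, a minimal sketch of building the optimized vector store from loaded documents. The file path and embedding model below are illustrative assumptions, and `from_documents` is assumed to mirror the Langchain `Chroma` API:

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Load raw documents (hypothetical local file) and pick a local embedding model.
docs = TextLoader("./docs/sample.txt").load()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the optimized Chroma store from the documents.
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
```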

### Retrievers

We provide optimized retrievers such as `VectorStoreRetriever` and `ChildParentRetriever` to efficiently handle vector store operations, ensuring optimal retrieval performance. In the child-parent scheme, small child chunks are indexed for precise matching while their larger parent documents are returned to preserve context.

```python
from intel_extension_for_transformers.langchain.retrievers import ChildParentRetriever
from langchain.vectorstores import Chroma

retriever = ChildParentRetriever(
    vectorstore=Chroma(documents=child_documents),
    parentstore=Chroma(documents=parent_documents),
    search_type="mmr",  # placeholder in the original; "similarity" is another common choice
    search_kwargs={...},
)
docs = retriever.get_relevant_documents("Intel")
```
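
Putting the pieces together, here is a hedged end-to-end sketch that wires an optimized retriever into a Langchain QA chain. The model id, task, and the pre-split `child_documents`/`parent_documents` are illustrative assumptions, not a definitive implementation:

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from intel_extension_for_transformers.langchain.retrievers import ChildParentRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Assume child_documents/parent_documents were produced by a document splitter beforehand.
retriever = ChildParentRetriever(
    vectorstore=Chroma(documents=child_documents),
    parentstore=Chroma(documents=parent_documents),
)

# Load the LLM through a standard Hugging Face pipeline.
llm = HuggingFacePipeline.from_model_id(
    model_id="Intel/neural-chat-7b-v3-1",
    task="text-generation",
)

qa = RetrievalQA.from_llm(llm=llm, retriever=retriever)
print(qa.invoke({"query": "Tell me about Intel Xeon Scalable Processors."}))
```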

Please refer to this [documentation](./pipeline/plugins/retrieval/README.md) for more details.

## Advanced Features

NeuralChat introduces `plugins` that offer a wide range of useful LLM utilities and features, enhancing the capabilities of the chatbot. Additionally, NeuralChat provides advanced model optimization technologies such as `Automatic Mixed Precision (AMP)` and `Weight Only Quantization`. These technologies enable users to run a high-throughput chatbot efficiently. NeuralChat further supports fine-tuning the pretrained LLMs for tasks such as text generation, summarization, code generation, and even Text-to-Speech (TTS) models, allowing users to create customized chatbots tailored to their specific needs.
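
For instance, a minimal sketch, adapted from an earlier revision of this README, of enabling the retrieval plugin together with AMP. Combining `plugins` and `optimization_config` in one `PipelineConfig` is an assumption, and the document folder is illustrative:

```python
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, MixedPrecisionConfig, plugins

# Enable the retrieval plugin and point it at a local document folder (illustrative path).
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./assets/docs/"

# Build the chatbot with the plugin setup and AMP optimization.
pipeline_cfg = PipelineConfig(plugins=plugins, optimization_config=MixedPrecisionConfig())
chatbot = build_chatbot(pipeline_cfg)
```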

Please refer to this [documentation](./docs/advanced_features.md) for more details.

# Models

## Supported Models
The table below displays the validated model list in NeuralChat for both inference and fine-tuning.
|Pretrained model| Text Generation (Completions) | Text Generation (Chat Completions) | Summarization | Code Generation |
|------------------------------------|:---:|:---:|:---:|:---:|
|Intel/neural-chat-7b-v1-1|✅|✅|✅|✅|
|Intel/neural-chat-7b-v3-1|✅|✅|✅|✅|
|LLaMA series|✅|✅|✅|✅|
|LLaMA2 series|✅|✅|✅|✅|
|GPT-J|✅|✅|✅|✅|
|MPT series|✅|✅|✅|✅|
|Mistral series|✅|✅|✅|✅|
|Mixtral series|✅|✅|✅|✅|
|SOLAR Series|✅|✅|✅|✅|
|ChatGLM series|✅|✅|✅|✅|
|Qwen series|✅|✅|✅|✅|
|StarCoder series| | | |✅|
|CodeLLaMA series| | | |✅|
|CodeGen series| | | |✅|
|MagicCoder series| | | |✅|

# Notebooks

We provide Jupyter notebooks to help users explore how to create, deploy, and customize chatbots on different hardware architectures. The selected notebooks are shown below:

| Notebook | Title | Description | Link |
| ------- | --------------------------------------------- | ---------------------------------------------------------- | ------------------------------------------------------- |
