[NeuralChat] Refine Document (#1127)

[NeuralChat] Refactor RAG code and structure

Signed-off-by: XuhuiRen <xuhui.ren@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: Tian, Feng <feng.tian@intel.com>
Co-authored-by: hshen14 <haihao.shen@intel.com>
lvliang-intel and hshen14 committed Jan 13, 2024
1 parent 04f5ef6 commit aabb2fc
Showing 6 changed files with 327 additions and 114 deletions.
215 changes: 101 additions & 114 deletions intel_extension_for_transformers/neural_chat/README.md

NeuralChat
===========================
<h3>A customizable framework to create your own LLM-driven AI apps within minutes</h3>

---
<div align="left">
[🌟RESTful API](./docs/neuralchat_api.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻Examples](./examples)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Notebooks](./docs/full_notebooks.md)
</div>

# Introduction

NeuralChat is a powerful and flexible open framework that empowers you to effortlessly create LLM-centric AI applications, including chatbots and copilots.
* Support a range of hardware, such as [Intel Xeon Scalable processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html), [Intel Gaudi AI processors](https://habana.ai/products), [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html), and NVIDIA GPUs
* Leverage the leading AI frameworks (e.g., [PyTorch](https://pytorch.org/)) and popular domain libraries (e.g., [Hugging Face](https://github.com/huggingface), [Langchain](https://www.langchain.com/)) with their extensions
* Support model customization through parameter-efficient fine-tuning, quantization, and sparsity. Released [Intel NeuralChat-7B LLM](https://huggingface.co/Intel/neural-chat-7b-v3-1), ranking #1 on the Hugging Face leaderboard in Nov'23
* Provide a rich set of plugins that can augment the AI applications through retrieval-augmented generation (RAG) (e.g., [fastRAG](https://github.com/IntelLabs/fastRAG/tree/main)), content moderation, query caching, and more
* Integrate with popular serving frameworks (e.g., [vLLM](https://github.com/vllm-project/vllm), [TGI](https://github.com/huggingface/text-generation-inference), [Triton](https://developer.nvidia.com/triton-inference-server)). Support [OpenAI](https://platform.openai.com/docs/introduction)-compatible APIs to simplify the creation or migration of AI applications

<a target="_blank" href="./assets/pictures/neuralchat.png">
<a target="_blank" href="./docs/images/neuralchat_arch.png">
<p align="center">
<img src="./assets/pictures/neuralchat.png" alt="NeuralChat" width=600 height=250>
<img src="./docs/images/neuralchat_arch.png" alt="NeuralChat" width=600 height=340>
</p>
</a>

> NeuralChat is under active development. APIs are subject to change.

# Installation

NeuralChat is part of Intel Extension for Transformers, so install Intel Extension for Transformers first by following the [installation](../../docs/installation.md) guide. After that, install the additional dependencies for NeuralChat for your device:

```shell
# For CPU device
pip install -r requirements_cpu.txt

# For HPU device
pip install -r requirements_hpu.txt

# For XPU device
pip install -r requirements_xpu.txt

# For CUDA device
pip install -r requirements.txt
```

# Getting Started

## OpenAI-Compatible RESTful APIs

NeuralChat provides OpenAI-compatible RESTful APIs for LLM inference, so you can use NeuralChat as a drop-in replacement for OpenAI APIs. The NeuralChat service is also accessible through the [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and the `requests` library. See [neuralchat_api.md](./docs/neuralchat_api.md) for details.

### Launch OpenAI-compatible Service

NeuralChat launches a chatbot service using [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) by default. You can customize the chatbot service by configuring the YAML file.

```shell
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```

### Access the Service

Once the service is running, it exposes an OpenAI-compatible endpoint `/v1/chat/completions`. You can use any of the ways below to access the endpoint.

#### Using OpenAI Client Library
```python
from openai import Client
# Replace 'your_api_key' with your actual OpenAI API key
api_key = 'your_api_key'
# The client appends endpoint paths (e.g., /chat/completions) to this base URL
backend_url = 'http://127.0.0.1:80/v1'
client = Client(api_key=api_key, base_url=backend_url)
response = client.chat.completions.create(
    model="Intel/neural-chat-7b-v3-1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."},
    ]
)
print(response)
```

#### Using Curl
```shell
curl http://127.0.0.1:80/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
        ]
    }'
```

#### Using Python Requests Library

```python
import requests

url = 'http://127.0.0.1:80/v1/chat/completions'
headers = {'Content-Type': 'application/json'}
data = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Intel Xeon Scalable Processors."}
    ]
}
response = requests.post(url, headers=headers, json=data)
print(response.json())
```

## Langchain Extension APIs

Intel Extension for Transformers provides a comprehensive suite of Langchain-based extension APIs, including advanced retrievers, embedding models, and vector stores. These enhancements are carefully crafted to expand the capabilities of the original Langchain API, ultimately boosting overall performance. This extension is specifically tailored to enhance the functionality and performance of RAG.

### Vector Stores

We introduce enhanced vector store operations, enabling users to adjust and fine-tune their settings even after the chatbot has been initialized, offering a more adaptable and user-friendly experience. For Langchain users, integrating and utilizing optimized vector stores is straightforward: simply replace the original Chroma API in Langchain.

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```
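
For example, a minimal sketch of building the optimized vector store from loaded documents. The file path and embedding model below are illustrative assumptions, and `from_documents` is assumed to mirror the Langchain `Chroma` API:

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Load raw documents (hypothetical local file) and pick a local embedding model.
docs = TextLoader("./docs/sample.txt").load()
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build the optimized Chroma store from the documents.
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
```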

### Retrievers

We provide optimized retrievers such as `VectorStoreRetriever` and `ChildParentRetriever` to efficiently handle vector store operations, ensuring optimal retrieval performance. In the child-parent scheme, small child chunks are indexed for precise matching while their larger parent documents are returned to preserve context.

```python
from intel_extension_for_transformers.langchain.retrievers import ChildParentRetriever
from langchain.vectorstores import Chroma

retriever = ChildParentRetriever(
    vectorstore=Chroma(documents=child_documents),
    parentstore=Chroma(documents=parent_documents),
    search_type="mmr",  # placeholder in the original; "similarity" is another common choice
    search_kwargs={...},
)
docs = retriever.get_relevant_documents("Intel")
```
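
Putting the pieces together, here is a hedged end-to-end sketch that wires an optimized retriever into a Langchain QA chain. The model id, task, and the pre-split `child_documents`/`parent_documents` are illustrative assumptions, not a definitive implementation:

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from intel_extension_for_transformers.langchain.retrievers import ChildParentRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma

# Assume child_documents/parent_documents were produced by a document splitter beforehand.
retriever = ChildParentRetriever(
    vectorstore=Chroma(documents=child_documents),
    parentstore=Chroma(documents=parent_documents),
)

# Load the LLM through a standard Hugging Face pipeline.
llm = HuggingFacePipeline.from_model_id(
    model_id="Intel/neural-chat-7b-v3-1",
    task="text-generation",
)

qa = RetrievalQA.from_llm(llm=llm, retriever=retriever)
print(qa.invoke({"query": "Tell me about Intel Xeon Scalable Processors."}))
```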

Please refer to this [documentation](./pipeline/plugins/retrieval/README.md) for more details.

## Advanced Features

NeuralChat introduces `plugins` that offer a wide range of useful LLM utilities and features, enhancing the capabilities of the chatbot. Additionally, NeuralChat provides advanced model optimization technologies such as `Automatic Mixed Precision (AMP)` and `Weight Only Quantization`. These technologies enable users to run a high-throughput chatbot efficiently. NeuralChat further supports fine-tuning the pretrained LLMs for tasks such as text generation, summarization, code generation, and even Text-to-Speech (TTS) models, allowing users to create customized chatbots tailored to their specific needs.
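
For instance, a minimal sketch, adapted from an earlier revision of this README, of enabling the retrieval plugin together with AMP. Combining `plugins` and `optimization_config` in one `PipelineConfig` is an assumption, and the document folder is illustrative:

```python
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, MixedPrecisionConfig, plugins

# Enable the retrieval plugin and point it at a local document folder (illustrative path).
plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./assets/docs/"

# Build the chatbot with the plugin setup and AMP optimization.
pipeline_cfg = PipelineConfig(plugins=plugins, optimization_config=MixedPrecisionConfig())
chatbot = build_chatbot(pipeline_cfg)
```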

Please refer to this [documentation](./docs/advanced_features.md) for more details.

# Models

## Supported Models
The table below displays the validated model list in NeuralChat for both inference and fine-tuning.
|Pretrained model| Text Generation (Completions) | Text Generation (Chat Completions) | Summarization | Code Generation |
|------------------------------------|:---:|:---:|:---:|:---:|
|Intel/neural-chat-7b-v1-1|✅|✅|✅|✅|
|Intel/neural-chat-7b-v3-1|✅|✅|✅|✅|
|LLaMA series|✅|✅|✅|✅|
|LLaMA2 series|✅|✅|✅|✅|
|GPT-J|✅|✅|✅|✅|
|MPT series|✅|✅|✅|✅|
|Mistral series|✅|✅|✅|✅|
|Mixtral series|✅|✅|✅|✅|
|SOLAR Series|✅|✅|✅|✅|
|ChatGLM series|✅|✅|✅|✅|
|Qwen series|✅|✅|✅|✅|
|StarCoder series| | | |✅|
|CodeLLaMA series| | | |✅|
|CodeGen series| | | |✅|
|MagicCoder series| | | |✅|

# Notebooks

We provide Jupyter notebooks to help users explore how to create, deploy, and customize chatbots on different hardware architectures. The selected notebooks are shown below:

| Notebook | Title | Description | Link |
| ------- | --------------------------------------------- | ---------------------------------------------------------- | ------------------------------------------------------- |
