Welcome to the Text Chatbot! This notebook provides instructions for setting up the Text Chatbot system with a caching plugin on Intel Xeon Scalable processors.
When an LLM service handles heavy traffic, the cost of LLM API calls can become substantial, and response times may also degrade. To address both problems, we use GPTCache to build a semantic caching plugin that stores LLM responses.

The caching plugin offers the following primary benefits:

- **Decreased expenses**: The caching plugin reduces costs by caching query results, which in turn lowers the number of requests and tokens sent to the LLM service.
- **Enhanced performance**: Serving a cached response avoids a model call, so the caching plugin can provide higher query throughput than a standard LLM service.
- **Improved scalability and availability**: The caching plugin can scale to accommodate an increasing volume of queries, helping keep performance consistent as your application's user base expands.
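
To make the idea of semantic caching concrete, the sketch below shows the general lookup pattern: embed the query, compare it with stored queries, and reuse a stored response when the similarity exceeds a threshold. This is an illustration only and not GPTCache's implementation; `embed`, `SemanticCache`, `answer`, and the `0.8` threshold are hypothetical names and values, and the bag-of-words embedding is a toy stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by query similarity."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value, tune for your workload
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        """Return the most similar cached response, or None below the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, llm):
    """Return (response, cache_hit); call the LLM only on a cache miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = llm(query)
    cache.put(query, response)
    return response, False
```

Here `llm` is any callable that maps a query string to a response string. A repeated or closely rephrased query is served from the cache, which is where the savings in requests and tokens come from.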

# Setup backend

## Setup environment

In [None]:
!pip install intel-extension-for-transformers
!git clone https://github.com/intel/intel-extension-for-transformers.git
%cd ./intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/
!pip install -r requirements.txt
!sudo apt install -y numactl
!conda install astunparse ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses -y
!conda install jemalloc gperftools -c conda-forge -y
!pip install nest_asyncio

## Start the backend server

❗ Please be aware that the server runs in the background. First, download the `textbot.yaml` configuration file locally; this file enables the caching and security checker plugins by default.

In [None]:
!curl -OL https://raw.githubusercontent.com/intel/intel-extension-for-transformers/main/intel_extension_for_transformers/neural_chat/examples/deployment/textbot/backend_with_cache/textbot.yaml

In [None]:
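import multiprocessing
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
import nest_asyncio

# Allow nested event loops so the server can start from inside the notebook.
nest_asyncio.apply()

def start_service():
    # Launch the NeuralChat server using the downloaded configuration.
    server_executor = NeuralChatServerExecutor()
    server_executor(config_file="textbot.yaml", log_file="neuralchat.log")

# Run the server in a separate process so the notebook stays responsive.
multiprocessing.Process(target=start_service).start()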
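
Because the server starts in a background process, the cell above returns before the backend is ready to accept requests. The sketch below is one way to wait for it, by polling until a TCP connection succeeds. `wait_for_port` is a hypothetical helper, not part of NeuralChat, and the port to pass is whatever `textbot.yaml` configures.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): wait and retry.
            time.sleep(interval)
    return False
```

For example, call `wait_for_port("127.0.0.1", port)` with the port from your configuration before sending the first request. If it returns `False`, check `neuralchat.log` for startup errors.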

# Setup frontend

Hugging Face Spaces makes ML applications more accessible to the community. Inspired by this, we can create a chatbot frontend on Hugging Face Spaces. Alternatively, you can deploy the frontend on your own server.

## Deploy on Hugging Face Spaces

### Create a new space on Hugging Face
To create a new application space on Hugging Face, visit [https://huggingface.co/new-space](https://huggingface.co/new-space) and follow the steps below.

![Create New Space](https://i.imgur.com/QyjqUd6.png)

The new space is like a new project that supports GitHub-style code repository management.

### Check configuration
We recommend using Gradio as the Space SDK, keeping the default values for the other settings.

For detailed information about the configuration settings, please refer to the [Hugging Face Spaces Config Reference](https://huggingface.co/docs/hub/spaces-config-reference).

### Setup application
We strongly recommend using the provided textbot frontend code, as it is the reference implementation already deployed on Hugging Face Spaces. To set up your application, copy the code files from the `intel_extension_for_transformers/neural_chat/examples/textbot/frontend` directory and adjust their configuration as necessary (e.g., the backend service URL in the `app.py` file, as shown below).

![Update backend URL](https://i.imgur.com/rQxPOV7.png)

Alternatively, you can clone the existing space from [https://huggingface.co/spaces/Intel/NeuralChat-GNR-1](https://huggingface.co/spaces/Intel/NeuralChat-GNR-1).

![Clone Space](https://i.imgur.com/76N8m5B.png)

Please also update the backend service URL in the `app.py` file.



## Deploy frontend on your server

### Install the required Python dependencies

In [None]:
!pip install -r ./examples/deployment/textbot/frontend/requirements.txt

### Run the frontend

Launch the chatbot frontend on your server using the following command:

In [None]:
!cd ./examples/deployment/textbot/frontend/ && nohup python app.py &

This will run the chatbot application in the background on your server. The port is defined in `server_port=` at the end of the `app.py` file.

Once the application is running, you can find the access URL in the trace log:

```log
INFO | gradio_web_server | Models: meta-llama/Llama-2-7b-chat-hf
INFO | stdout | Running on local URL:  http://0.0.0.0:7860
```
The URL to access the chatbot frontend is http://SERVER_IP_ADDRESS:7860. Please remember to replace SERVER_IP_ADDRESS with your server's actual IP address.
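
As a small convenience, the substitution described above can be scripted. The sketch below extracts the local URL from a Gradio-style log line and swaps in the server's address. `access_url` is a hypothetical helper, and it assumes the log line has the `Running on local URL:` format shown above.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def access_url(log_line, server_ip):
    """Build the externally reachable URL from a 'Running on local URL' log line."""
    match = re.search(r"Running on local URL:\s+(\S+)", log_line)
    if match is None:
        return None  # the line does not announce a URL
    parts = urlsplit(match.group(1))
    # Keep the scheme and port from the log, replace only the host.
    netloc = server_ip if parts.port is None else f"{server_ip}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Pass the log line and your server's IP address; the function returns `None` for lines that do not contain a URL.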

![URL](https://i.imgur.com/La3tJ8d.png)

Please update the backend service URL in the `app.py` file.

![Update backend URL](https://i.imgur.com/gRtZHrJ.png)

## Performance Comparison: Caching vs. No Caching