### Objectives
FastChat is a software backend designed to run LLMs.
This notebook tests the following:
- running a single LLM on a GPU in Google Colab
- starting an API server
- querying the API with the OpenAI package, since one of FastChat's features is an API that follows the OpenAI syntax

### Setup
Download FastChat from GitHub and install it with its dependencies.

In [None]:
!pip install openai
%cd /content/

# clone FastChat
!git clone https://github.com/lm-sys/FastChat.git

# install dependencies
%cd FastChat
!python3 -m pip install -e ".[model_worker,webui]" --quiet
import subprocess

### Initialization
FastChat uses 2 components to serve a model:
1. A controller server, which usually runs on port 21001 and coordinates the model workers
2. A model worker, which loads the model and is managed by the controller

Both components are needed in API mode and in web app mode.

Notes:
- Because Colab does not print the standard output of the background processes started below, query the controller port before starting the model worker, or use `ps` to check that the Python controller process has started properly.
- The model worker may crash often: for instance, cancelling a running cell that interacts with the model might kill the worker. Unfortunately, restarting the worker almost always requires fully recreating the runtime environment: force-kill the session and restart.
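
The first note can also be handled from Python instead of by re-running `curl` by hand. The helper below is a minimal sketch, not part of FastChat: `wait_for_port` is a hypothetical name, and it only checks that something accepts TCP connections on the port, not that the controller is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True as soon as host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and try again.
            time.sleep(interval)
    return False
```

For example, calling `wait_for_port("127.0.0.1", 21001)` after starting the controller should return `True` once port 21001 is listening, at which point the model worker can be started.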

In [None]:
# To check that the controller is running
!curl -X GET http://localhost:21001

In [None]:
# To check that the python processes are running
!ps -e -o pid,pcpu,pmem,command | grep python

In [None]:
# Start controller
controller_process = subprocess.Popen(["python3", "-m", "fastchat.serve.controller", "--host", "127.0.0.1"], stdout=subprocess.PIPE)

In [None]:
# Once you are sure that the controller is running, start the model worker.
# This process might take a while to start, so you should regularly run
# `ps` to check that the worker is still running
worker_process = subprocess.Popen(["python3", "-m", "fastchat.serve.model_worker", "--host", "127.0.0.1", "--controller-address", "http://127.0.0.1:21001", "--model-path", "lmsys/vicuna-7b-v1.5", "--load-8bit"], stdout=subprocess.PIPE)
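
As an alternative to running `ps` repeatedly, the `Popen` handles created above can be polled directly. This is a small sketch using only the standard library; `process_status` is a hypothetical helper name, not a FastChat function.

```python
import subprocess
from typing import Optional


def process_status(proc: subprocess.Popen) -> str:
    """Describe a Popen handle without blocking."""
    # poll() returns None while the process is still running,
    # and its exit code once it has terminated.
    code: Optional[int] = proc.poll()
    if code is None:
        return f"running (pid {proc.pid})"
    return f"exited with code {code}"
```

In this notebook, `process_status(controller_process)` and `process_status(worker_process)` would report whether each process is still alive. A worker that has exited needs to be restarted before any query can succeed.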

In [None]:
# To test that the port is responding (no connection error), use the following command.
# If the port is listening, it should return `{"detail":"Not Found"}`
!curl -X GET http://localhost:21001/

In [None]:
# To test that your worker is up and running, use
!python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5

### API Server

In [None]:
# Finally, once you are sure that the worker is running, start the API server
api_process = subprocess.Popen(["python3", "-m", "fastchat.serve.openai_api_server", "--host", "127.0.0.1", "--controller-address", "http://127.0.0.1:21001", "--port", "8000"], stdout=subprocess.PIPE)
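
Before sending prompts, it can help to confirm that the API server knows about the model. The sketch below assumes the server exposes the OpenAI-style `GET /v1/models` endpoint and returns the usual `{"data": [{"id": ...}]}` shape; both function names are hypothetical, and the URL matches the host and port used above.

```python
import json
import urllib.request
from typing import List


def extract_model_ids(models_response: dict) -> List[str]:
    """Pull the model ids out of an OpenAI-style model list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url: str = "http://127.0.0.1:8000/v1",
                       timeout: float = 10.0) -> List[str]:
    """Ask the API server which models it can currently serve."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_model_ids(json.load(response))
```

If the worker registered correctly, `list_served_models()` would be expected to include `vicuna-7b-v1.5`. An empty list suggests that the worker is not yet registered with the controller.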

### Querying the API server

In [None]:
# Send a chat completion request to the API server
!curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ \
    "model": "vicuna-7b-v1.5", \
    "messages": [{"role": "user", "content": "Hello, can you tell me a joke for me?"}], \
    "temperature": 0.5 \
  }'
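
The same request can be sent from Python without any extra package. This is a minimal sketch that mirrors the `curl` call above with the standard library; `build_chat_request` and `send_chat_request` are hypothetical helper names, and the response parsing assumes the OpenAI chat completion format.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str,
                       temperature: float = 0.5) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A typical call would be `send_chat_request(build_chat_request("http://127.0.0.1:8000/v1", "vicuna-7b-v1.5", "Hello, can you tell me a joke?"))`, which needs the API server and the worker to be running.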

In [None]:
# Use the openai package to run queries
# (the module-level client syntax below requires openai>=1.0)
import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model = "vicuna-7b-v1.5"
prompt = "Once upon a time"

# create a completion
completion = openai.completions.create(model=model, prompt=prompt, max_tokens=64)
# print the completion
print(prompt + completion.choices[0].text)

# create a chat completion
completion = openai.chat.completions.create(
  model=model,
  messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
# print the completion
print(completion.choices[0].message.content)

### Web app

In [None]:
!python3 -m fastchat.serve.gradio_web_server