# FastChat Jupyter Notebook Guide
FastChat is an open-source platform for training, serving, and evaluating chatbots based on large language models. This guide covers installation, obtaining the Vicuna weights, command-line inference, and serving with a web GUI.
- [Installation](#Installation)
- [Vicuna Weights](#Vicuna-Weights)
- [Inference with Command Line Interface](#Inference-with-Command-Line-Interface)
- [Serving with Web GUI](#Serving-with-Web-GUI)


## Installation
FastChat must be installed before you can use it. There are two ways to install it: with pip, or by cloning the repository and installing from source.

### Method 1: With pip
The following commands can be used to install FastChat and the latest main branch of transformers with pip:

In [None]:

!pip3 install fschat
!pip3 install git+https://github.com/huggingface/transformers
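
After the install cell finishes, it can help to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report_versions` are made up for this guide, and the distribution names are the ones passed to pip above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(dist_names=("fschat", "transformers")):
    """Map each distribution name to its installed version (None if not installed)."""
    return {name: installed_version(name) for name in dist_names}


for name, version in report_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If either line prints `not installed`, rerun the install cell (restarting the kernel may also be needed before newly installed packages can be imported).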


### Method 2: From source
To install FastChat from source, clone the repository, navigate to the FastChat folder, and install the package with pip.

**If you are running on Mac, you need to install Rust and CMake before proceeding:**

In [None]:
# !brew install rust cmake

Then, install FastChat by running the following commands:

In [None]:
# !git clone https://github.com/lm-sys/FastChat.git
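# Use the %cd magic here: "!cd" runs in a temporary subshell, so the directory change would not persist.
# %cd FastChat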

# !pip3 install --upgrade pip
# !pip3 install -e .

## Vicuna Weights

FastChat provides [Vicuna](https://vicuna.lmsys.org/) weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. The following scripts can be used to get Vicuna weights by applying our delta:

1. Get the original LLaMA weights in the Hugging Face format by running the code cell below or by following the instructions [here](https://huggingface.co/docs/transformers/main/model_doc/llama).

In [None]:
import os
import transformers

# Llama model directory
Llama_model_dir = "LLaMA"

# Huggingface model directory
Huggingface_model_dir = "huggingface_LLaMA"

# Transformers library directory
transformers_dir = os.path.dirname(transformers.__file__)

model_size = "7B"

# Create the arguments required to run the command
cmd = f"python3 {os.path.join(transformers_dir, 'models', 'llama', 'convert_llama_weights_to_hf.py')} \
    --input_dir {Llama_model_dir} --model_size {model_size} --output_dir {Huggingface_model_dir}/{model_size}"

# Run the command
os.system(cmd)
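
`os.system` returns the exit status but does not raise on failure, so a failed conversion is easy to miss in a notebook. As a hedged alternative, the sketch below builds the same command as an argument list and runs it with `subprocess.run(..., check=True)`. The helper names `build_convert_cmd` and `run_checked` are hypothetical, and `sys.executable` is used in place of `python3` on the assumption that the conversion should run in the same interpreter as the notebook kernel.

```python
import os
import subprocess
import sys


def build_convert_cmd(transformers_dir, input_dir, model_size, output_dir):
    """Build the conversion command as an argument list (no shell quoting needed)."""
    script = os.path.join(
        transformers_dir, "models", "llama", "convert_llama_weights_to_hf.py"
    )
    return [
        sys.executable, script,
        "--input_dir", input_dir,
        "--model_size", model_size,
        "--output_dir", os.path.join(output_dir, model_size),
    ]


def run_checked(cmd):
    """Run a command and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


# Usage in the notebook, reusing the variables defined in the cell above:
# run_checked(build_convert_cmd(transformers_dir, Llama_model_dir,
#                               model_size, Huggingface_model_dir))
```

With this variant, a missing input directory or an out-of-memory failure stops the cell with an exception instead of passing silently.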


2. Use the following scripts to get Vicuna weights by applying our delta. They will automatically download delta weights from our Hugging Face [account](https://huggingface.co/lmsys).

**NOTE**:
Our released weights are only compatible with the latest main branch of huggingface/transformers.
We install the correct version of transformers when FastChat is installed.




### Vicuna-7B
This conversion command needs around 30 GB of CPU RAM.
If you do not have enough memory, you can create a large swap file that allows the operating system to automatically utilize the disk as virtual memory.

In [None]:
import os

delta_path = "lmsys/vicuna-7b-delta-v1.1"
vicuna_model_dir = "vicuna_LLaMA"

# Create the arguments required to run the command
cmd = f"python3 -m fastchat.model.apply_delta \
    --base {Huggingface_model_dir}/7B \
    --target {vicuna_model_dir}/7B \
    --delta {delta_path}"

# Run the command
os.system(cmd)

### Vicuna-13B
This conversion command needs around 60 GB of CPU RAM.
If you do not have enough memory, you can create a large swap file that allows the operating system to automatically utilize the disk as virtual memory.


In [None]:
import os

delta_path = "lmsys/vicuna-13b-delta-v1.1"
vicuna_model_dir = "vicuna_LLaMA"

# Create the arguments required to run the command
cmd = f"python3 -m fastchat.model.apply_delta \
    --base {Huggingface_model_dir}/13B \
    --target {vicuna_model_dir}/13B \
    --delta {delta_path}"

# Run the command
os.system(cmd)
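
The 7B and 13B cells differ only in the model size, so the command can be derived from that one value. The sketch below is one way to do so; `apply_delta_cmd` is a hypothetical helper, the default directory names match the ones used in this guide, and the delta path pattern is inferred from the two delta names shown above.

```python
import sys


def apply_delta_cmd(size, base_root="huggingface_LLaMA", target_root="vicuna_LLaMA"):
    """Build the apply_delta command for a model size such as "7B" or "13B"."""
    # Delta repositories in this guide follow the pattern lmsys/vicuna-<size>-delta-v1.1.
    delta_path = f"lmsys/vicuna-{size.lower()}-delta-v1.1"
    return [
        sys.executable, "-m", "fastchat.model.apply_delta",
        "--base", f"{base_root}/{size}",
        "--target", f"{target_root}/{size}",
        "--delta", delta_path,
    ]


# Print the commands instead of running them, since each one needs a lot of RAM.
for size in ("7B", "13B"):
    print(" ".join(apply_delta_cmd(size)))
```

Passing the list to `subprocess.run(..., check=True)` would execute the conversion and raise an exception if it fails.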

## Inference with Command Line Interface
The FastChat CLI provides a command-line interface for inference. You can specify different options to configure the inference process.

### Single GPU
The command below requires around 28GB of GPU memory for Vicuna-13B and 14GB of GPU memory for Vicuna-7B.
If you do not have enough memory, see the section on 8-bit compression below.

In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
!python3 -m fastchat.serve.cli --model-path {vicuna_weight_path} # or /path/to/vicuna/weights

### Multiple GPUs
If you do not have enough GPU memory, you can use model parallelism to aggregate memory from multiple GPUs on the same machine.


In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
!python3 -m fastchat.serve.cli --num-gpus 2 --model-path {vicuna_weight_path} # or /path/to/vicuna/weights


### CPU Only
Use `--device cpu` to run on the CPU only; no GPU is required. This needs around 60GB of CPU memory for Vicuna-13B and around 30GB of CPU memory for Vicuna-7B.

In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
!python3 -m fastchat.serve.cli --device cpu --model-path {vicuna_weight_path} # or /path/to/vicuna/weights

### Metal Backend (Mac Computers with Apple Silicon or AMD GPUs)
Use `--device mps` to enable GPU acceleration on Mac computers (requires torch >= 2.0).
Use `--load-8bit` to turn on 8-bit compression. Vicuna-7B can run on a 32GB M1 MacBook at 1-2 words per second.

In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
!python3 -m fastchat.serve.cli --device mps --load-8bit --model-path {vicuna_weight_path} # or /path/to/vicuna/weights
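
The backend options above can be combined programmatically. The sketch below picks a device and assembles the matching command using only the flags shown in this section. The helpers `pick_device`, `cli_cmd`, and `detect_backends` are hypothetical, and we assume the CLI accepts `--device cuda` in the same way the model worker later in this guide does.

```python
import sys


def pick_device(cuda_available, mps_available):
    """Prefer CUDA, then Metal (mps), then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def cli_cmd(model_path, device, num_gpus=1, load_8bit=False):
    """Assemble a fastchat.serve.cli command from the flags shown in this section."""
    cmd = [sys.executable, "-m", "fastchat.serve.cli",
           "--model-path", model_path, "--device", device]
    if num_gpus > 1:
        cmd += ["--num-gpus", str(num_gpus)]
    if load_8bit:
        cmd.append("--load-8bit")
    return cmd


def detect_backends():
    """Return (cuda_available, mps_available); both False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False, False
    mps = getattr(torch.backends, "mps", None)  # absent in older torch releases
    return torch.cuda.is_available(), bool(mps and mps.is_available())


device = pick_device(*detect_backends())
# Enable 8-bit compression on Metal, mirroring the cell above.
print(" ".join(cli_cmd("vicuna_LLaMA/7B", device, load_8bit=(device == "mps"))))
```

The printed command can be pasted into a `!` cell or a terminal. The CLI is interactive, so it is usually more convenient to run it that way than through `subprocess`.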

### Not Enough Memory or Other Platforms
If you do not have enough memory, you can enable 8-bit compression by adding `--load-8bit` to the commands above.
This can reduce memory usage by around half, at the cost of slightly degraded model quality.
It is compatible with the CPU, GPU, and Metal backends.
Vicuna-13B with 8-bit compression can run on a single NVIDIA 3090/4080/V100(16GB) GPU.

In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
!python3 -m fastchat.serve.cli  --load-8bit --model-path {vicuna_weight_path} # or /path/to/vicuna/weights
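
The "around half" rule can be turned into a quick back-of-the-envelope check before launching a model. The sketch below uses only the approximate figures quoted in this guide; `estimated_memory_gb` and `fits` are hypothetical helpers, and real usage will vary with sequence length and library versions.

```python
# Approximate memory needs in GB, copied from the figures quoted in this guide.
MEMORY_GB = {
    ("gpu", "7B"): 14,
    ("gpu", "13B"): 28,
    ("cpu", "7B"): 30,
    ("cpu", "13B"): 60,
}


def estimated_memory_gb(backend, size, load_8bit=False):
    """Estimate memory use; 8-bit compression is assumed to halve it."""
    base = MEMORY_GB[(backend, size)]
    return base / 2 if load_8bit else base


def fits(available_gb, backend, size, load_8bit=False):
    """Return True if the estimate does not exceed the available memory."""
    return estimated_memory_gb(backend, size, load_8bit) <= available_gb


# Vicuna-13B on a 16GB GPU, without and with 8-bit compression.
print(fits(16, "gpu", "13B"))
print(fits(16, "gpu", "13B", load_8bit=True))
```

Under these rough numbers, Vicuna-13B needs about 14GB with compression, which is consistent with the 16GB V100 mentioned above.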

## Serving with Web GUI

To serve using the web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller that coordinates the web servers and model workers. The cells below launch each component from the notebook.


### Launch the controller
The controller is responsible for coordinating the web server and model workers. It needs to be launched first.

In [None]:
import subprocess # we need this to run the controller in a separate process in the jupyter notebook
subprocess.Popen(["python3", "-m", "fastchat.serve.controller"])

### Launch the model worker

In [None]:
vicuna_weight_path = f"{vicuna_model_dir}/7B"
device = "cuda" # or "cpu" / "mps"
subprocess.Popen(["python3", "-m", "fastchat.serve.model_worker", "--device", device, "--model-path", vicuna_weight_path])


Wait until the process finishes loading the model and you see **"Uvicorn running on ..."**. You can launch multiple model workers to serve multiple models concurrently. The model worker will connect to the controller automatically.
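
Because the processes are started in the background, the notebook does not block while they load. One way to wait without watching the logs is to poll the port each service listens on. The sketch below assumes the controller listens on port 21001 and the worker on port 21002, as in the kill commands at the end of this guide, and that an open port means the server has started. It does not prove that the worker has registered with the controller. `wait_for_port` is a hypothetical helper.

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=300.0, interval=1.0):
    """Poll until host:port accepts TCP connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Usage in the notebook (ports are assumptions taken from the kill commands below):
# wait_for_port(21001)  # controller
# wait_for_port(21002)  # model worker
```

Loading a large model can take several minutes, so the default timeout is deliberately generous.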

To check that your model worker is connected to your controller properly, send a test message using the following command. The `--model-name` value must match the name under which the worker registered with the controller:

In [None]:
!python3 -m fastchat.serve.test_message --model-name vicuna-7b

### Launch the Gradio web server
The Gradio web server provides the interface that users interact with.

In [None]:
subprocess.Popen(["python3", "-m", "fastchat.serve.gradio_web_server"])

### Kill the processes
To kill the processes, you can use the following commands, which look up each process by the port it listens on:

In [None]:

!kill -9 $(lsof -t -i:21001) # kill the controller
!kill -9 $(lsof -t -i:21002) # kill the worker
!kill -9 $(lsof -t -i:7860) # kill the gradio server
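
`kill -9` gives the processes no chance to clean up. If you keep the objects returned by `subprocess.Popen` when launching the services, you can stop them more gently from Python. The sketch below is one possible approach: `stop_processes` is a hypothetical helper, and the variable names in the usage comment assume you assigned each `Popen` result to a name, which the launch cells above do not do.

```python
import subprocess
import sys


def stop_processes(procs, grace_seconds=10.0):
    """Terminate each process politely, then force-kill any that ignore the request."""
    for proc in procs:
        if proc.poll() is None:  # still running
            proc.terminate()     # SIGTERM on POSIX
    for proc in procs:
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()          # SIGKILL on POSIX, equivalent to kill -9
            proc.wait()


# Usage, assuming the launch cells were written as, for example:
#   controller = subprocess.Popen([...])
#   worker = subprocess.Popen([...])
#   web_server = subprocess.Popen([...])
# stop_processes([web_server, worker, controller])
```

Stopping the web server and worker before the controller is a design choice for this sketch, so that the coordinator goes down last.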
