kv.run

A model serving framework for various research and production scenarios, seamlessly built on the PyTorch and HuggingFace ecosystems.

(Limited) comparison of popular model serving solutions:

| Solution | Inference backend | Serving backend | Advanced kernel support | Model support |
|---|---|---|---|---|
| HuggingFace TGI | PyTorch | HF TGI (Rust) | Paged + Flash attention | Language |
| DeepSpeed MII | PyTorch | DeepSpeed (Python) | DeepSpeed-Kernels | Language |
| TensorRT-LLM | TensorRT-LLM | TensorRT-LLM (C++) | TensorRT XQA | Language |
| vLLM | vLLM | vLLM (Python) | Paged + Flash attention | Language |
| kv.run | PyTorch | HF TGI + more (Rust) | Paged + Flash attention, FlashInfer | Language, Diffusion models (soon) |

Installation

Install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
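
If the installer does not update your current shell, you can load Cargo's environment and verify the toolchain (standard rustup install locations assumed):

. "$HOME/.cargo/env"
rustc --version && cargo --version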

Install Protobuf:

sudo apt-get install libssl-dev gcc -y
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
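
A quick sanity check (not part of the original steps) that the protobuf compiler landed on your PATH:

protoc --version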

Install Kernel Libraries (optional)

# Install FlashInfer
# For CUDA 12.1 & torch 2.3
pip install flashinfer==0.1.1 -i https://flashinfer.ai/whl/cu121/torch2.3
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

# Install Flash and Paged Attention
cd server && make install-flash-attention && make install-vllm-cuda && make install-flash-attention-v2-cuda
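
To confirm the optional kernels are visible to the server's Python environment, a minimal import check (module names flashinfer and flash_attn are assumed to match the packages installed above):

python -c "import flashinfer, flash_attn; print('kernels OK')"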

Build Code Base

make install

Build Docker Image (optional)

Dockerfile_kvrun provides a script for building the Docker image. Pre-built Docker images will be provided shortly.
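
Until pre-built images are available, a local build and run might look like the following; the image tag, runtime flags, entrypoint arguments, and port mapping (3000, matching the curl examples below) are illustrative assumptions, not project defaults:

docker build -f Dockerfile_kvrun -t kvrun:latest .
docker run --gpus all --shm-size 1g -p 3000:3000 kvrun:latest --model-id tjluyao/llama-3-8b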

Usages

Deploy services

text-generation-launcher --model-id tjluyao/llama-3-8b

You can pass --disable-flashinfer to force classic TGI serving.
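
For example, to launch the same model with FlashInfer disabled (combining the command and flag shown above):

text-generation-launcher --model-id tjluyao/llama-3-8b --disable-flashinfer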

Query the model

You can query the model either through curl:

curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'

or using the Python client; please refer to its README.md.

Local API tests

cd server/examples && python test_local_api.py

Local UI demo

(Inherited from Punica)

python server/examples/test_ui.py
Demo video: demo.mp4

Using quantized models

Add --quantize [Method] to the command above, for example:

text-generation-launcher --model-id TechxGenus/gemma-2b-GPTQ --lora-ids tjluyao/gemma-2b-it-math --quantize gptq

The supported quantization methods include:

  • AWQ: 4-bit; requires an AWQ-quantized model.
  • EETQ: 8-bit; works with any model (see the example after this list).
  • GPTQ: 4-bit; requires a GPTQ-quantized model.
  • bitsandbytes: 8-bit; works with any model.
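
Because EETQ and bitsandbytes quantize on the fly, they can be applied directly to an unquantized base model. A sketch reusing the model id from above and assuming the standard --quantize values:

text-generation-launcher --model-id tjluyao/llama-3-8b --quantize eetq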

For AWQ, EETQ, and GPTQ quantization, you need to build their specific kernels:

# AWQ
cd server && make install-awq
git clone https://github.com/casper-hansen/AutoAWQ && cd AutoAWQ
pip install -e .
# EETQ
cd server && make install-eetq
# GPTQ
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install -vvv --no-build-isolation -e .

Multi-LoRA support

  • To load LoRA adapters, you may either (1) specify them at launch time using --lora-ids:

text-generation-launcher --model-id tjluyao/llama-3-8b --lora-ids "tjluyao/llama-3-8b-math;tjluyao/llama-3-8b-zh"

or (2) load them dynamically through the client after the model is launched:

curl 127.0.0.1:3000/download_lora_adapter -X POST -d '{"lora_id":"tjluyao/llama-3-8b-math"}' -H 'Content-Type: application/json'

  • To query the model, pass lora_id in the request parameters (make sure the adapter is loaded), as in the end-to-end example after this list:

curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'
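
Putting the pieces together, loading and querying a second adapter at runtime uses the same two endpoints (tjluyao/llama-3-8b-zh is the adapter id listed in the launcher example above):

# Load another adapter dynamically
curl 127.0.0.1:3000/download_lora_adapter -X POST -d '{"lora_id":"tjluyao/llama-3-8b-zh"}' -H 'Content-Type: application/json'
# Query it
curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-zh", "max_new_tokens":20}}' -H 'Content-Type: application/json'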

Benchmarks

Testing Llama-2-7b on an RTX 6000 Ada (Vast AI):

| Step | Batch size | Average FlashInfer (tokens/sec) | Average TGI (tokens/sec) |
|---|---|---|---|
| Prefill | 1 | 52.16 | 41.14 |
| Prefill | 2 | 101.64 | 78.69 |
| Prefill | 4 | 191.48 | 154.11 |
| Prefill | 8 | 323.21 | 290.82 |
| Prefill | 16 | 512.50 | 538.15 |
| Prefill | 32 | 697.89 | 783.61 |
| Decode | 1 | 56.55 | 40.84 |
| Decode | 2 | 108.55 | 77.85 |
| Decode | 4 | 207.10 | 154.27 |
| Decode | 8 | 383.92 | 297.53 |
| Decode | 16 | 682.78 | 562.83 |
| Decode | 32 | 1119.92 | 993.33 |

Testing Llama-2-7b on an RTX 3090 (Vast AI):

| Step | Batch size | Average FlashInfer (tokens/sec) | Average TGI (tokens/sec) |
|---|---|---|---|
| Prefill | 1 | 44.33 | 23.32 |
| Prefill | 2 | 74.81 | 46.68 |
| Prefill | 4 | 133.93 | 90.51 |
| Prefill | 8 | 189.78 | 168.27 |
| Prefill | 16 | 231.24 | 218.12 |
| Prefill | 32 | 270.12 | 265.74 |
| Decode | 1 | 50.21 | 23.13 |
| Decode | 2 | 89.70 | 47.26 |
| Decode | 4 | 174.92 | 93.09 |
| Decode | 8 | 324.06 | 175.21 |
| Decode | 16 | 567.67 | 337.92 |
| Decode | 32 | 861.50 | 601.03 |

Model and kernel support matrix

Note: L = Language, I = Image

| Model | MoE | Size | Modality | Flash & Paged Attention | FlashInfer |
|---|---|---|---|---|---|
| Idefics | | 9B | L, I ⇒ L | | |
| Idefics 2 | | 8B | L, I ⇒ L | | |
| Llava Next (1.6) | | 13B | L, I ⇒ L | | |
| Llama 2 | | 7B | L ⇒ L | | |
| Llama 3 | | 8B | L ⇒ L | | |
| Phi 1.5 | | 2.7B | L ⇒ L | | |
| Phi 3 | | 3.8B | L ⇒ L | | |
| Gemma | | 2B | L ⇒ L | | |
| Cohere | | 104B | L ⇒ L | | |
| Dbrx | | 132B | L ⇒ L | | |
| Mamba | | 2.8B | L ⇒ L | | |
| Mistral | | 7B | L ⇒ L | | |
| Mixtral | | 8x22B | L ⇒ L | | |
| Gpt Bigcode | | 1.1B | L ⇒ L | | |
| Baichuan | | 7B | L ⇒ L | | |
| Falcon | | 7B | L ⇒ L | | |
| StarCoder 2 | | 15B | L ⇒ L | | |
| Qwen 2 | | 7B | L ⇒ L | | |
| Qwen 1.5 | | 7B | L ⇒ L | | |
| Opt | | 6.7B | L ⇒ L | | |
| T5 | | 11B | L ⇒ L | | |
| Galactica | | 120B | L ⇒ L | | |
| SantaCoder | | 1.1B | L ⇒ L | | |
| Bloom | | 560M | L ⇒ L | | |
| Mpt | | 7B | L ⇒ L | | |
| Gpt2 | | 124M | L ⇒ L | | |
| Gpt Neox | | 20B | L ⇒ L | | |
| Yi 1.5 | | 9B | L ⇒ L | | |
| ChatGLM 4 | | 9B | L ⇒ L | | |
