In this repository, we give an example of how to efficiently package and deploy Llama2 using NVIDIA Triton Inference Server, making it production-ready in no time. The setup demonstrates:
- Concurrent model execution
- Multi GPU support
- Dynamic Batching
- vLLM support
We cover three different deployment approaches:
- Using HuggingFace models with Triton’s Python Backend (a minimal sketch follows this list)
- Using HuggingFace models with Triton’s Ensemble models
- Using the vLLM framework
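For the Python Backend approach, the model repository holds a `model.py` that wraps the HuggingFace checkpoint. The sketch below is illustrative only: the model id, the `prompt`/`generated_text` tensor names, and the generation settings are assumptions, not the repository's exact files.

```python
# model.py -- illustrative Triton Python-backend sketch wrapping a HuggingFace
# Llama2 checkpoint. Names and parameters are assumptions for demonstration.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load tokenizer and model once per model instance.
        model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            load_in_8bit=True,  # 8-bit quantization (requires bitsandbytes)
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "prompt" is an assumed BYTES input tensor name.
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")
            text = prompt.as_numpy()[0].decode("utf-8")

            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
            completion = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

            # "generated_text" is an assumed BYTES output tensor name.
            out_tensor = pb_utils.Tensor(
                "generated_text",
                np.array([completion.encode("utf-8")], dtype=object),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```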
By exploiting Triton’s concurrent model execution feature, we gained a 1.5x increase in throughput by deploying two parallel instances of the Llama2 7B model quantized to 8-bit.
|            | 1 instance   | 2 instances  |
|------------|--------------|--------------|
| Exec time  | 9.79 s       | 6.72 s       |
| Throughput | 10.6 token/s | 15.5 token/s |
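Concurrent execution is controlled by the model's `config.pbtxt`. A minimal snippet matching the setup above (two instances on one GPU) might look like the following; the GPU index is an assumption.

```
# config.pbtxt -- illustrative snippet; input/output definitions omitted.
# Two instances of the model are loaded on GPU 0, so Triton can execute
# two inference requests concurrently against the same model.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```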
Implementing dynamic batching added a further 5x increase in the model’s throughput.
|            | Batch size = 1 | Batch size = 2 |
|------------|----------------|----------------|
| Exec time  | 9.79 s         | 17.07 s        |
| Throughput | 10.6 token/s   | 66.5 token/s   |
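Dynamic batching is also enabled in `config.pbtxt`. A sketch of such a configuration is shown below; the batch sizes and queue delay are assumptions, not the repository's tuned values.

```
# config.pbtxt -- illustrative snippet enabling dynamic batching.
# Triton groups requests arriving within the queue-delay window into a
# single batch before dispatching them to the model.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8 ]
  max_queue_delay_microseconds: 100
}
```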
Incorporating the vLLM framework outperformed the dynamic batching results with a 6x increase.
|            | Batch size = 1 | Batch size = 2 |
|------------|----------------|----------------|
| Exec time  | 2.06 s         | 3.12 s         |
| Throughput | 50 token/s     | 363 token/s    |
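vLLM's continuous batching is what drives these numbers. As a rough, standalone illustration of the engine (not the Triton integration itself), a script along these lines can be used; the model id and sampling parameters are assumptions.

```python
# Illustrative standalone vLLM usage; model id and parameters are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(max_tokens=128, temperature=0.8)

prompts = [
    "What is Triton Inference Server?",
    "Explain dynamic batching in one paragraph.",
]
# vLLM schedules and batches these prompts internally (continuous batching).
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```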
Resources:
- Deploying Llama2 with NVIDIA Triton Inference Server blog post
- NVIDIA Triton Inference Server official documentation