In this repository, we give an example of how to efficiently package and deploy Llama2 using NVIDIA Triton Inference Server, making it production-ready in no time. The setup demonstrates:
- Concurrent model execution
- Multi GPU support
- Dynamic Batching
- vLLM support
We cover three different deployment approaches:
- Using HuggingFace models with Triton’s Python Backend (a minimal sketch follows this list)
- Using HuggingFace models with Triton’s Ensemble models
- Using the vLLM framework
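For the Python Backend approach, the model repository holds a `model.py` that wraps the HuggingFace checkpoint. The sketch below is illustrative only: the model id, the `prompt`/`generated_text` tensor names, and the generation settings are assumptions, not the repository's exact files.

```python
# model.py -- illustrative Triton Python-backend sketch wrapping a HuggingFace
# Llama2 checkpoint. Names and parameters are assumptions for demonstration.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load tokenizer and model once per model instance.
        model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            load_in_8bit=True,  # 8-bit quantization (requires bitsandbytes)
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "prompt" is an assumed BYTES input tensor name.
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")
            text = prompt.as_numpy()[0].decode("utf-8")

            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
            completion = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

            # "generated_text" is an assumed BYTES output tensor name.
            out_tensor = pb_utils.Tensor(
                "generated_text",
                np.array([completion.encode("utf-8")], dtype=object),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```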
By exploiting Triton’s concurrent model execution feature, we gained a 1.5x increase in throughput by deploying two parallel instances of the Llama2 7B model quantized to 8-bit.
|            | 1 instance   | 2 instances  |
|------------|--------------|--------------|
| Exec time  | 9.79 s       | 6.72 s       |
| Throughput | 10.6 token/s | 15.5 token/s |
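Concurrent execution is controlled by the model's `config.pbtxt`. A minimal snippet matching the setup above (two instances on one GPU) might look like the following; the GPU index is an assumption.

```
# config.pbtxt -- illustrative snippet; input/output definitions omitted.
# Two instances of the model are loaded on GPU 0, so Triton can execute
# two inference requests concurrently against the same model.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```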
Implementing dynamic batching added a further 5x increase in the model’s throughput.
|            | Batch size = 1 | Batch size = 2 |
|------------|----------------|----------------|
| Exec time  | 9.79 s         | 17.07 s        |
| Throughput | 10.6 token/s   | 66.5 token/s   |
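Dynamic batching is also enabled in `config.pbtxt`. A sketch of such a configuration is shown below; the batch sizes and queue delay are assumptions, not the repository's tuned values.

```
# config.pbtxt -- illustrative snippet enabling dynamic batching.
# Triton groups requests arriving within the queue-delay window into a
# single batch before dispatching them to the model.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 2, 4, 8 ]
  max_queue_delay_microseconds: 100
}
```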
Incorporating the vLLM framework outperformed the dynamic batching results with a 6x increase.
|            | Batch size = 1 | Batch size = 2 |
|------------|----------------|----------------|
| Exec time  | 2.06 s         | 3.12 s         |
| Throughput | 50 token/s     | 363 token/s    |
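vLLM's continuous batching is what drives these numbers. As a rough, standalone illustration of the engine (not the Triton integration itself), a script along these lines can be used; the model id and sampling parameters are assumptions.

```python
# Illustrative standalone vLLM usage; model id and parameters are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(max_tokens=128, temperature=0.8)

prompts = [
    "What is Triton Inference Server?",
    "Explain dynamic batching in one paragraph.",
]
# vLLM schedules and batches these prompts internally (continuous batching).
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```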
Resources:
- Deploying Llama2 with NVIDIA Triton Inference Server blog post
- NVIDIA Triton Inference Server official documentation