# Deploying Llama2 with NVIDIA Triton Server tutorial

This repository shows how to efficiently package and deploy Llama2 with NVIDIA Triton Inference Server, making it production-ready in no time.
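At the core of a deployment like this is a `model.py` implementing Triton's Python-backend interface. The following is a minimal sketch, not the repository's exact code: the tensor names (`prompt`, `generated_text`), the checkpoint, and the 8-bit loading flag are all illustrative assumptions.

```python
# model.py -- minimal Triton Python-backend sketch for Llama2 (illustrative).
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer and 8-bit quantized model once per model instance.
        name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(
            name, device_map="auto", load_in_8bit=True  # requires bitsandbytes
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "prompt" is a BYTES tensor holding one UTF-8 encoded string.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            text = in_tensor.as_numpy()[0].decode("utf-8")

            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
            generated = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

            out_tensor = pb_utils.Tensor(
                "generated_text",
                np.array([generated.encode("utf-8")], dtype=object),
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```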

## Features

- Concurrent model execution
- Multi-GPU support
- Dynamic batching
- vLLM support
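The first two features translate to a handful of lines in the model's `config.pbtxt`. A minimal sketch, assuming the tensor names from the backend sketch above (instance counts and GPU ids are illustrative):

```
# config.pbtxt -- illustrative, not the repository's exact file.
name: "llama2"
backend: "python"
max_batch_size: 0

input [
  { name: "prompt", data_type: TYPE_STRING, dims: [ 1 ] }
]
output [
  { name: "generated_text", data_type: TYPE_STRING, dims: [ 1 ] }
]

# Concurrent model execution across GPUs: one instance pinned to each
# device, served in parallel by Triton's scheduler.
instance_group [
  { count: 1, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_GPU, gpus: [ 1 ] }
]
```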

## Examples

We cover three different deployment approaches: concurrent model execution with multiple model instances, dynamic batching, and serving through vLLM.
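Whichever approach is used, querying the server looks the same from the client side. A hedged example with `tritonclient`, assuming the default HTTP port and the tensor names used in the sketches above:

```python
# client.py -- query the deployed model over HTTP (names are assumptions).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# BYTES inputs are passed as numpy object arrays of encoded strings.
prompt = np.array(
    ["What is the NVIDIA Triton Inference Server?".encode("utf-8")], dtype=object
)
infer_input = httpclient.InferInput("prompt", prompt.shape, "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llama2", inputs=[infer_input])
print(result.as_numpy("generated_text")[0].decode("utf-8"))
```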

## Results

- By exploiting Triton's concurrent model execution feature, we gained a 1.5x increase in throughput by deploying two parallel instances of the Llama2 7B model quantized to 8 bits.

  |            | 1 instance    | 2 instances   |
  |------------|---------------|---------------|
  | Exec time  | 9.79 s        | 6.72 s        |
  | Throughput | 10.6 tokens/s | 15.5 tokens/s |

- Implementing dynamic batching added a further 5x increase in the model's throughput (see the config sketch after this list).

  |            | Batch size = 1 | Batch size = 2 |
  |------------|----------------|----------------|
  | Exec time  | 9.79 s         | 17.07 s        |
  | Throughput | 10.6 tokens/s  | 66.5 tokens/s  |

- Incorporating the vLLM framework outperformed the dynamic-batching results with a 6x increase (see the `model.json` sketch after this list).

  |            | Batch size = 1 | Batch size = 2 |
  |------------|----------------|----------------|
  | Exec time  | 2.06 s         | 3.12 s         |
  | Throughput | 50 tokens/s    | 363 tokens/s   |
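Enabling dynamic batching is itself a small `config.pbtxt` change; a sketch with assumed values (the real maximum batch size and queue delay may differ):

```
# Requests arriving within the queue delay window are grouped into a
# single batch before execution.
max_batch_size: 8
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```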
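For the vLLM numbers, Triton's vLLM backend takes its engine settings from a `model.json` in the model directory. A minimal sketch, where the checkpoint and tuning values are assumptions:

```json
{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "disable_log_requests": true,
    "gpu_memory_utilization": 0.9
}
```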

## Documentation

- Deploying Llama2 with NVIDIA Triton Inference Server blog post.
- NVIDIA Triton Inference Server official documentation.
