pulzeai-oss/tritonserver

Triton Inference Server + TensorRT-LLM

A thin wrapper for simplifying the deployment of Triton Inference Server with the TensorRT-LLM backend.

Instructions for deploying TensorRT-LLM engines leave much to be desired:

  • They rely on ad-hoc scripts for:
    • Bootstrapping the model repository from configuration templates
    • Rendering those templates with user-specified values
  • There is no clear compatibility matrix between backend and server versions
  • The Python entrypoint fails to forward signals, preventing graceful termination

We simplify this by:

  • Baking the necessary configuration templates into the Docker image
  • Using an entrypoint that:
    • Renders the configuration templates on the fly from environment variables
    • Is then replaced by mpirun via an exec call, so signals are forwarded correctly to the workers and in-flight requests can drain before termination (sketched just after this list)
  • Bundling the correct revision of the TensorRT-LLM backend for the base Triton NGC image
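
A minimal sketch of that entrypoint pattern, not the actual script shipped in the image: render_templates is a made-up name, and the /srv/run/repo model-repository root is inferred from the mount point described below; only the exec-into-mpirun idea and the signal-forwarding rationale come from this README.

    #!/usr/bin/env bash
    set -euo pipefail

    # Render the baked-in config templates from TRTLLM__* environment variables.
    # (render_templates is a hypothetical helper; the real image handles this internally.)
    render_templates /srv/run/repo

    # Replace this shell with mpirun via exec so that SIGTERM/SIGINT reach the
    # Triton workers directly and in-flight requests can drain before shutdown.
    exec mpirun --allow-run-as-root -n 1 \
      tritonserver --model-repository=/srv/run/repo "$@"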

Configuration

Model configuration is set via environment variables. For instance, setting TRTLLM__ENGINE__TRITON_MAX_BATCH_SIZE=1 replaces the ${triton_max_batch_size} template variable in tensorrt_llm/config.pbtxt. Similarly, setting TRTLLM__BLS__ACCUMULATE_TOKENS=false replaces the ${accumulate_tokens} template variable in tensorrt_llm_bls/config.pbtxt. You get the idea!
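
A hedged illustration of the naming convention (only the two variables above appear in this README; the general TRTLLM__<SECTION>__<NAME> pattern is an inference from them):

    # Each TRTLLM__<SECTION>__<NAME> variable fills the ${name} placeholder in the
    # corresponding model's config.pbtxt when the container starts.
    export TRTLLM__ENGINE__TRITON_MAX_BATCH_SIZE=1   # -> ${triton_max_batch_size} in tensorrt_llm/config.pbtxt
    export TRTLLM__BLS__ACCUMULATE_TOKENS=false      # -> ${accumulate_tokens} in tensorrt_llm_bls/config.pbtxt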

The pre-built TensorRT-LLM engine should be mounted into /srv/run/repo/tensorrt_llm/1.

See the examples directory for instructions on deploying on Kubernetes or locally using docker-compose.
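
For a quick local run, a condensed sketch under stated assumptions: the image reference and host-side engine path are placeholders, the mount target and environment variables come from this README, and 8000/8001/8002 are Triton's default HTTP/gRPC/metrics ports.

    # The image reference and host-side engine path are placeholders; substitute your own.
    docker run --rm --gpus all \
      -e TRTLLM__ENGINE__TRITON_MAX_BATCH_SIZE=1 \
      -e TRTLLM__BLS__ACCUMULATE_TOKENS=false \
      -v "$PWD/engines/my-model:/srv/run/repo/tensorrt_llm/1" \
      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
      your-registry/tritonserver:tag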

TODOs

  • Helm chart
