diff --git a/docs/source/en/serving.md b/docs/source/en/serving.md
index 218cb18682e3..f421a284950a 100644
--- a/docs/source/en/serving.md
+++ b/docs/source/en/serving.md
@@ -21,7 +21,7 @@ Transformer models can be efficiently deployed using libraries such as vLLM, Tex
 > [!TIP]
 > Responses API is now supported as an experimental API! Read more about it [here](#responses-api).
 
-Apart from that you can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
+You can also serve transformer models with the `transformers serve` CLI. With Continuous Batching, `serve` now delivers solid throughput and latency, making it well suited for evaluation, experimentation, and moderate-load local or self-hosted deployments. While vLLM, SGLang, and other dedicated inference engines remain our recommendation for large-scale production, `serve` avoids their extra runtime and operational overhead, and is on track to gain more production-oriented features.
 
 In this document, we dive into the different supported endpoints and modalities; we also cover the setup of several user interfaces that can be used on top of `transformers serve` in the following guides:
 - [Jan (text and MCP user interface)](./jan.md)
@@ -58,7 +58,7 @@ or by sending an HTTP request, like we'll see below.
 
 ## Chat Completions - text-based
 
-See below for examples for text-based requests. Both LLMs and VLMs should handle
+See below for examples of text-based requests. Both LLMs and VLMs should handle
@@ -366,6 +366,40 @@ The `transformers serve` server is also an MCP client, so it can interact with M
 
+## Continuous Batching
+
+Continuous Batching (CB) lets the server dynamically group and interleave requests so they can share forward passes on the GPU. Instead of processing each request sequentially, `serve` adds new requests as others progress (prefill) and drops finished ones during decode. The result is significantly higher GPU utilization and better throughput without sacrificing latency for most workloads.
+
+Thanks to this, `transformers serve` can now comfortably handle evaluation, experimentation, and moderate-load local or self-hosted use without introducing an extra runtime to operate.
+
+### Enable CB in `serve`
+
+CB is opt-in and currently applies to chat completions.
+
+```sh
+transformers serve \
+    --continuous_batching \
+    --attn_implementation sdpa_paged
+```
+
+### Performance tips
+
+- Use an efficient attention backend when available:
+
+```sh
+transformers serve \
+    --continuous_batching \
+    --attn_implementation paged_attention
+```
+
+> [!TIP]
+> If you choose `paged_attention`, you must install `flash-attn` separately: `pip install flash-attn --no-build-isolation`
+
+- `--dtype {bfloat16|float16}` typically improves throughput and memory use compared to `float32`
+
+- `--load_in_4bit`/`--load_in_8bit` can reduce the memory footprint for LoRA setups
+
+- `--force-model <model_id>` avoids per-request model hints and helps produce stable, repeatable runs
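+These flags compose. As a starting point rather than a tuned configuration, the sketch below combines the options above into a single launch command; `Qwen/Qwen3-0.6B` is only a placeholder model id:
+
+```sh
+# Continuous batching with a paged SDPA backend, half precision, and a pinned model
+transformers serve \
+    --continuous_batching \
+    --attn_implementation sdpa_paged \
+    --dtype bfloat16 \
+    --force-model Qwen/Qwen3-0.6B
+```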
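+To check that batching kicks in, send several requests concurrently to the Chat Completions endpoint. The sketch below assumes the server above is running on the default `http://localhost:8000` and reuses the placeholder model id:
+
+```sh
+# Fire 8 requests in parallel; with CB enabled they share forward passes instead of queuing
+for i in $(seq 1 8); do
+  curl -s http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+          "model": "Qwen/Qwen3-0.6B",
+          "messages": [{"role": "user", "content": "Briefly explain continuous batching."}],
+          "max_tokens": 64
+        }' > /dev/null &
+done
+wait
+```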