38 changes: 36 additions & 2 deletions docs/source/en/serving.md
@@ -21,7 +21,7 @@ Transformer models can be efficiently deployed using libraries such as vLLM, Tex
> [!TIP]
> Responses API is now supported as an experimental API! Read more about it [here](#responses-api).

Apart from that you can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
You can also serve transformer models with the `transformers serve` CLI. With Continuous Batching, `serve` now delivers solid throughput and latency well suited for evaluation, experimentation, and moderate-load local or self-hosted deployments. While vLLM, SGLang, or other inference engines remain our recommendations for large-scale production, `serve` avoids the extra runtime and operational overhead, and is on track to gain more production-oriented features.

In this document, we dive into the different supported endpoints and modalities; we also cover the setup of several user interfaces that can be used on top of `transformers serve` in the following guides:
- [Jan (text and MCP user interface)](./jan.md)
@@ -58,7 +58,7 @@ or by sending an HTTP request, like we'll see below.

## Chat Completions - text-based

See below for examples for text-based requests. Both LLMs and VLMs should handle

<hfoptions id="chat-completion-http">
<hfoption id="curl">
@@ -366,6 +366,40 @@ The `transformers serve` server is also an MCP client, so it can interact with M

<!-- TODO: example with a minimal python example, and explain that it is possible to pass a full generation config in the request -->
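A minimal Python sketch of such a request is shown below. It assumes the server is running locally on the default `http://localhost:8000`, that the chat completions route is OpenAI-compatible, and that the `openai` client is installed; the model id is a placeholder. Standard sampling parameters in the request body are forwarded to generation.

```python
# Minimal sketch: a single chat completion request against a local `transformers serve` instance.
# Assumptions: default host/port, OpenAI-compatible route, placeholder model id.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",  # placeholder; assumed to be ignored by the local server
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Briefly explain continuous batching."}],
    # Standard sampling parameters act as the generation config for this request
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```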

## Continuous Batching

Continuous Batching (CB) lets the server dynamically group and interleave requests so they can share forward passes on the GPU. Instead of processing each request sequentially, `serve` admits new requests (prefill) while others are still generating, and drops finished ones during decode. The result is significantly higher GPU utilization and better throughput, without sacrificing latency for most workloads.

Thanks to this, evaluation, experimentation, and moderate-load local/self-hosted use can now be handled comfortably by `transformers serve` without introducing an extra runtime to operate.
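
To see the effect from the client side, the sketch below fires several requests concurrently; with CB enabled the server can fold them into shared forward passes instead of serving them strictly one after another. It assumes the default local endpoint and the `openai` client, and the model id is a placeholder.

```python
# Hypothetical load sketch: concurrent requests that a CB-enabled server can batch together.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumed default endpoint

def ask(prompt: str) -> str:
    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return completion.choices[0].message.content

prompts = [f"Name one fact about the number {i}." for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```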

### Enable CB in serve

CB is opt-in and currently applies to chat completions.

```sh
transformers serve \
--continuous_batching \
--attn_implementation sdpa_paged
```
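
Once the server is running with CB, requests go through the same chat completions endpoint as before, for example (assuming the default `localhost:8000` and a placeholder model id):

```sh
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 64
}'
```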


### Performance tips

- Use an efficient attention backend when available:

```sh
transformers serve \
--continuous_batching \
--attn_implementation paged_attention
```

> [!TIP]
> If you choose `paged_attention`, you must install `flash-attn` separately: `pip install flash-attn --no-build-isolation`

- `--dtype {bfloat16|float16}` typically improves throughput and memory use compared to `float32`

- `--load_in_4bit`/`--load_in_8bit` can reduce the memory footprint for LoRA setups

- `--force-model <repo_id>` avoids per-request model hints and helps produce stable, repeatable runs
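
As a rough sketch, several of these tips can be combined into a single launch command (flag values are illustrative, and `paged_attention` still requires `flash-attn` as noted above):

```sh
transformers serve \
--continuous_batching \
--attn_implementation paged_attention \
--dtype bfloat16 \
--force-model Qwen/Qwen2.5-0.5B-Instruct
```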