38 changes: 36 additions & 2 deletions docs/source/en/serving.md
@@ -21,7 +21,7 @@ Transformer models can be efficiently deployed using libraries such as vLLM, Tex
> [!TIP]
> Responses API is now supported as an experimental API! Read more about it [here](#responses-api).

Apart from that you can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
You can also serve transformer models with the `transformers serve` CLI. With Continuous Batching, `serve` now delivers solid throughput and latency well suited for evaluation, experimentation, and moderate-load local or self-hosted deployments. While vLLM, SGLang, or other inference engines remain our recommendations for large-scale production, `serve` avoids the extra runtime and operational overhead, and is on track to gain more production-oriented features.

In this document, we dive into the different supported endpoints and modalities; we also cover the setup of several user interfaces that can be used on top of `transformers serve` in the following guides:
- [Jan (text and MCP user interface)](./jan.md)
@@ -58,7 +58,7 @@ or by sending an HTTP request, like we'll see below.

## Chat Completions - text-based

See below for examples for text-based requests. Both LLMs and VLMs should handle

<hfoptions id="chat-completion-http">
<hfoption id="curl">
@@ -366,6 +366,40 @@ The `transformers serve` server is also an MCP client, so it can interact with M

<!-- TODO: example with a minimal python example, and explain that it is possible to pass a full generation config in the request -->
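A minimal Python sketch of such a request is shown below. It assumes the server is running locally on the default `http://localhost:8000`, that the chat completions route is OpenAI-compatible, and that the `openai` client is installed; the model id is a placeholder. Standard sampling parameters in the request body are forwarded to generation.

```python
# Minimal sketch: a single chat completion request against a local `transformers serve` instance.
# Assumptions: default host/port, OpenAI-compatible route, placeholder model id.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",  # placeholder; assumed to be ignored by the local server
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Briefly explain continuous batching."}],
    # Standard sampling parameters act as the generation config for this request
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```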

## Continuous Batching

Continuous Batching (CB) lets the server dynamically group and interleave requests so they can share forward passes on the GPU. Instead of processing each request sequentially, `serve` admits new requests (prefill) while others are still generating, and drops finished ones during decode. The result is significantly higher GPU utilization and better throughput, without sacrificing latency for most workloads.

Thanks to this, evaluation, experimentation, and moderate-load local/self-hosted use can now be handled comfortably by `transformers serve` without introducing an extra runtime to operate.
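
To see the effect from the client side, the sketch below fires several requests concurrently; with CB enabled the server can fold them into shared forward passes instead of serving them strictly one after another. It assumes the default local endpoint and the `openai` client, and the model id is a placeholder.

```python
# Hypothetical load sketch: concurrent requests that a CB-enabled server can batch together.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumed default endpoint

def ask(prompt: str) -> str:
    completion = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    return completion.choices[0].message.content

prompts = [f"Name one fact about the number {i}." for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```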

### Enable CB in serve

CB is opt-in and currently applies to chat completions.

```sh
transformers serve \
--continuous_batching \
--attn_implementation sdpa_paged
```
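
Once the server is running with CB, requests go through the same chat completions endpoint as before, for example (assuming the default `localhost:8000` and a placeholder model id):

```sh
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "Qwen/Qwen2.5-0.5B-Instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 64
}'
```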


### Performance tips

- Use an efficient attention backend when available:

```sh
transformers serve \
--continuous_batching \
--attn_implementation paged_attention
```

> [!TIP]
> If you choose `paged_attention`, you must install `flash-attn` separately: `pip install flash-attn --no-build-isolation`

- `--dtype {bfloat16|float16}` typically improves throughput and memory use compared to `float32`

- `--load_in_4bit`/`--load_in_8bit` can reduce the memory footprint for LoRA setups

- `--force-model <repo_id>` avoids per-request model hints and helps produce stable, repeatable runs
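
As a rough sketch, several of these tips can be combined into a single launch command (flag values are illustrative, and `paged_attention` still requires `flash-attn` as noted above):

```sh
transformers serve \
--continuous_batching \
--attn_implementation paged_attention \
--dtype bfloat16 \
--force-model Qwen/Qwen2.5-0.5B-Instruct
```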