Propagate the model loading from transformers serve to chat#44758

Merged
SunMarc merged 8 commits into main from transformers-chat-serve-progress
Mar 19, 2026

Conversation


@LysandreJik LysandreJik commented Mar 16, 2026

Provides nicer feedback while transformers chat loads a model, instead of hanging.

Screen.Recording.2026-03-17.at.00.00.52.mov

Adds a POST /load_model endpoint to transformers serve that streams model loading progress via Server-Sent Events (SSE). Previously, transformers chat would hang with no feedback while downloading and loading a model — now users see real-time progress for each phase.

transformers chat calls this endpoint on startup and renders the progress with Rich progress bars.

The /load_model SSE protocol

The endpoint uses a simple, implementation-agnostic SSE protocol designed to be easy to consume from any language or framework. No Python or library-specific details leak into the wire format.

Request:

POST /load_model
Content-Type: application/json

{"model": "HuggingFaceTB/SmolLM2-135M-Instruct"}

Response: text/event-stream — each frame is data: <json>\n\n

Every event has status and model. Three possible statuses:

Status    Terminal?  Extra fields                 Description
loading   No         stage, optionally progress   Something is being loaded
ready     Yes        cached: bool                 Model is available
error     Yes        message: str                 Loading failed

The download stage is skipped when files are already cached locally. Progress events include {"current": int, "total": int}.
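For illustration, a frame of this protocol could be produced as follows (a sketch, not the actual server code; `sse_frame` is a hypothetical helper):

```python
import json

def sse_frame(event: dict) -> str:
    # One SSE frame per event: "data: <json>\n\n", as described above.
    return f"data: {json.dumps(event)}\n\n"

# Event shapes follow the table: every event carries "status" and "model".
loading = {
    "status": "loading",
    "model": "org/model@main",
    "stage": "download",
    "progress": {"current": 1024, "total": 4096},
}
frame = sse_frame(loading)
```

Decoding is the inverse: strip the `data: ` prefix and `json.loads` the remainder of the line.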

Example stream (fresh load with download):

data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "processor"}
data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "config"}
data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "download", "progress": {"current": 67020052, "total": 269060552}}
data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "download", "progress": {"current": 269060552, "total": 269060552}}
data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "weights", "progress": {"current": 136, "total": 272}}
data: {"status": "loading", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "cached": false}

When the model is already in memory, a single event is returned:

data: {"status": "ready", "model": "HuggingFaceTB/SmolLM2-135M-Instruct@main", "cached": true}
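A minimal client for this stream (independent of the chat.py implementation; `iter_events` is a hypothetical helper) just decodes `data:` lines and stops at a terminal status:

```python
import json
from typing import Iterable, Iterator

def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    # Decode "data: <json>" frames; stop after a terminal status.
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines between frames
        event = json.loads(line[len("data: "):])
        yield event
        if event["status"] in ("ready", "error"):
            break

# Against a live server this could be wired up with, e.g., requests:
#   resp = requests.post(url + "/load_model", json={"model": model_id}, stream=True)
#   for event in iter_events(resp.iter_lines(decode_unicode=True)):
#       ...
```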

Integration in chat.py

The RichInterface.print_model_load() method in chat.py consumes this SSE stream and renders it with Rich progress bars. It's intentionally kept simple as a reference implementation — the client just reads events sequentially and reacts to each status/stage:

  • processor / config → status text
  • download → byte-progress bar with speed and ETA
  • weights → item-progress bar
  • ready → done message
  • error → raise
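The per-status handling above can be sketched as a plain dispatcher (a simplified, Rich-free stand-in; `describe_event` is illustrative, not the actual print_model_load code):

```python
def describe_event(event: dict) -> str:
    # Map one /load_model event to a one-line status description.
    status = event["status"]
    if status == "error":
        raise RuntimeError(event["message"])  # terminal failure
    if status == "ready":
        return "already cached" if event["cached"] else "loaded"
    # status == "loading": render the stage, with progress when present
    stage = event["stage"]
    if "progress" in event:
        p = event["progress"]
        return f"{stage}: {p['current']}/{p['total']}"
    return f"{stage}..."
```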

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@SunMarc SunMarc left a comment


Thanks! Just a few nits.

@LysandreJik LysandreJik force-pushed the transformers-chat-serve-progress branch from ebd54fa to decc747 Compare March 18, 2026 11:18
@LysandreJik LysandreJik marked this pull request as ready for review March 18, 2026 11:52

@SunMarc SunMarc left a comment


Thanks a lot, just a few nits and we can merge!


  def load_model_and_processor(
-     self, model_id_and_revision: str
+     self, model_id_and_revision: str, progress_callback: Callable[[dict], None] | None = None
Member


Should we also propagate to load_audio_model_and_processor?

Member Author


I would be down for us to do that, but maybe in a follow-up? It's not a hugely used path for now, and no client would benefit from it at this time.

Comment on lines 2160 to +2161
  def load_model_and_processor(
-     self, model_id_and_revision: str
+     self, model_id_and_revision: str, progress_callback: Callable[[dict], None] | None = None
Member


Can you add a docstring for progress_callback?

}

progress = Progress(
    TextColumn("[bold]{task.description}", table_column=Column(width=50, no_wrap=True)),
Member


This is cropping the description sometimes when it is too long. Maybe we shouldn't put the model name in the description

Member Author


I put the model name in the description only if we have the space to show it; otherwise I remove it.


@SunMarc SunMarc left a comment


Thanks for this clean PR!

Comment on lines +1007 to +1008

def test_concurrent_load_same_model(self):
Member


Nice, we should add more concurrent tests in general for our endpoints, to see if they behave correctly. I will probably do that in the refactor.
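The property a concurrent-load test checks (several simultaneous requests for the same model should trigger only one underlying load) can be sketched independently of the server (all names hypothetical):

```python
import threading

class ModelRegistry:
    # Deduplicate concurrent loads: the first caller loads, others wait.
    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._lock = threading.Lock()
        self._models = {}     # model_id -> loaded model
        self._in_flight = {}  # model_id -> Event set when loading finishes

    def load(self, model_id):
        with self._lock:
            if model_id in self._models:
                return self._models[model_id]
            if model_id in self._in_flight:
                done, is_loader = self._in_flight[model_id], False
            else:
                done = self._in_flight[model_id] = threading.Event()
                is_loader = True
        if is_loader:
            model = self._load_fn(model_id)  # slow: download + load weights
            with self._lock:
                self._models[model_id] = model
                del self._in_flight[model_id]
            done.set()
            return model
        done.wait()  # another request is already loading this model
        with self._lock:
            return self._models[model_id]
```

A real implementation would also need to release waiters when the load fails; this sketch omits error handling for brevity.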

@SunMarc SunMarc added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@SunMarc SunMarc added this pull request to the merge queue Mar 19, 2026
Merged via the queue into main with commit e94695e Mar 19, 2026
29 checks passed
@SunMarc SunMarc deleted the transformers-chat-serve-progress branch March 19, 2026 17:20