docs(serving): add minimal Python client examples for chat completion…#43149
docs(serving): add minimal Python client examples for chat completion…#43149khajamoddin wants to merge 5 commits intohuggingface:mainfrom
Conversation
…s (streaming) and responses with generation_config; clarify request-vs-generation_config precedence and note tool-call streaming limits (Qwen-only)
stevhliu
left a comment
There was a problem hiding this comment.
thanks, but the AsyncInferenceClient seems like a simpler/better choice to me so i'm not sure if we should add this example. wdyt @Wauplin @LysandreJik?
…letions; retain raw HTTP reference; include generation_config usage in both examples
|
Thanks for the feedback — I agree that AsyncInferenceClient should be the recommended path for most users, especially for streaming use cases. To address this, I’ve updated the docs so that: Primary example: uses AsyncInferenceClient for streaming chat completions, now explicitly demonstrating generation_config. Secondary reference: retains a compact raw HTTP (httpx) example as an advanced/low-level reference for users who need to understand SSE framing, strict request validation, or integrate with custom runtimes. This keeps the onboarding path simple while still documenting the bare HTTP contract, which can be useful for debugging, MCP-style integrations, or non-Python clients. Changes pushed: beff747 — docs(serving): recommend AsyncInferenceClient for streaming chat completions; retain raw HTTP reference; include generation_config usage in both examples If preferred, I’m also happy to: Move the raw HTTP snippet into an “Advanced / Reference” subsection, or Collapse it behind a short note pointing to it as an implementation reference. If there’s a strong preference to move the raw HTTP example to an appendix or collapse it further, I’m happy to make that change. |
Wauplin
left a comment
There was a problem hiding this comment.
Hi @khajamoddin , agree with @stevhliu that showcasing the inference client would be nice. For simplicity I'd be more inclined to document with huggingface_hub.InferenceClient and mention that the async version AsyncInferenceClient exists. I would also document how to use openai client since this is the most popular one.
Regarding the snippets themselves, did you try them to ensure they are correct? For instance generation_config is not a valid parameter of AsyncInferenceClient so I doubt it'll work. Also, it feels weird to pass the same values in the payload and in generation_config like this:
"max_tokens": 128,
"temperature": 0.7,
"top_p": 0.95,
"generation_config": json.dumps({
"max_new_tokens": 128,
"temperature": 0.7,
"top_p": 0.95,
}),Would be better to document why passing generation_config is useful in this context (in practice it's useful when you want to pass argument that are supported by the model but are not made available in the generic API).
Disclaimer: it's not the part of the codebase I'm the most familiar with so I'll defer to @LysandreJik if I've said something wrong.
…ample Update the minimal Python client example to verify that generation_config is passed properly via extra_body, as it is not a direct parameter of the AsyncInferenceClient.chat_completion method.
|
@Wauplin @stevhliu Thanks for the review! I’ve updated the documentation to address the feedback. Verification & Changes: Tested AsyncInferenceClient behavior: I confirmed that AsyncInferenceClient.chat_completion does not accept Updated Documentation: The docs/source/en/serving.md Example of the fix applied: stream = await client.chat_completion( |
khajamoddin
left a comment
There was a problem hiding this comment.
The requested changes have all been addressed in this commit: docs(serving): fix generation_config usage in AsyncInferenceClient example (commit 9f3c238).
|
@khajamoddin is this PR AI-generated? If yes, it needs to be mentioned. If not, have you tested the (I'm in favor of closing this PR without merging if it's AI generated, given the flaws it has) |
- Fixes a ValueError in transformers serve when stop_strings are provided in generation_config. - Adds regression test and fixes CPU device mapping in tests/cli/test_serve.py. - Updates documentation for AsyncInferenceClient endpoint.
|
No, this PR is not AI-generated; I manually debugged the issue, wrote the fix, and created the tests myself. Regarding your question: Yes, I have tested extra_body in AsyncInferenceClient (and InferenceClient). In fact, the regression test I added (test_stop_strings_in_generation_config in tests/cli/test_serve.py) specifically relies on this mechanism to reproduce the original crash. The test uses self.run_server(request), which calls InferenceClient.chat_completion(**request). The server receives it and triggers the stop_strings logic. My fix (passing "tokenizer": processor to generation_kwargs) prevents the ValueError that used to occur. I also updated the documentation to explicitly use the /v1 endpoint in the AsyncInferenceClient examples to ensure they work out-of-the-box for users. |
This PR adds a “Minimal Python Clients” section to
docs/source/en/serving.md, replacing the existing TODO after the MCP integration section.It introduces dependency-light, copy-pasteable Python examples that demonstrate how to call the Transformers
serveHTTP API directly while explicitly passinggeneration_config. The examples are intentionally minimal and align exactly with the current server-side request validation and runtime behavior.What’s included
Streaming
/v1/chat/completionsexamplehttpxonly (no SDK dependency)generation_configexplicitly as a JSON stringNon-streaming
/v1/responsesexamplehttpxstream=falsemax_output_tokensandgeneration_configClarification notes
generation_configValidation & compatibility
The examples were verified against the current
serve.pyimplementation:This is a docs-only improvement that improves onboarding and clarifies the bare HTTP contract for
transformers serve, without introducing any behavioral changes.Before submitting
Who can review?
@stevhliu