Conversation

@kozistr (Contributor) commented Sep 20, 2025

What does this PR do?

Fixes #723
Fixes #694

Changes

  • Raise an error when max_input_length is greater than max_batch_tokens and auto-truncate is disabled.
  • Reduce max_input_length to max_batch_tokens when auto-truncate is enabled.

Feel free to let me know whether or not this approach seems appropriate 🤗
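
For illustration, here is a minimal self-contained Rust sketch of that validation logic. resolve_max_input_length is a hypothetical helper name; the actual implementation in router/src/lib.rs may be structured differently.

fn resolve_max_input_length(
    max_input_length: usize,
    max_batch_tokens: usize,
    auto_truncate: bool,
) -> Result<usize, String> {
    if max_input_length <= max_batch_tokens {
        // Nothing to do: a single request can never exceed the batch budget.
        return Ok(max_input_length);
    }
    if !auto_truncate {
        // Fail fast at startup instead of letting oversized requests
        // spin forever in the batching queue.
        return Err(format!(
            "`max_input_length` must be smaller than `max_batch_tokens` when \
             `auto_truncate` is disabled ({max_input_length} > {max_batch_tokens})"
        ));
    }
    // Auto-truncate is enabled: cap the limit instead of failing
    // (the real router emits this message via tracing::warn!).
    eprintln!("Reduce `max_input_length` to `max_batch_tokens` (from {max_input_length} to {max_batch_tokens})");
    Ok(max_batch_tokens)
}

fn main() {
    // Mirrors the two runs in the log below (model limit 32768, --max-batch-tokens 1024).
    assert!(resolve_max_input_length(32768, 1024, false).is_err());
    assert_eq!(resolve_max_input_length(32768, 1024, true), Ok(1024));
}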

Log

The first run below (auto-truncate disabled) now fails fast with a clear error; the second run (with --auto-truncate) caps the limit and starts normally.

./target/release/text-embeddings-router --model-id ../Qwen3-Embedding-0.6B/ --pooling last-token --port 8080 --dtype float32 --max-batch-tokens 1024
2025-09-20T10:05:20.198290Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "../Qwe**-*********-0.6B/", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: Some(LastToken), max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-09-20T10:05:20.516435Z  WARN text_embeddings_router: router/src/lib.rs:191: Could not find a Sentence Transformers config
Error: `max_input_length` must be smaller than `max_batch_tokens` when `auto_truncate` is disabled (32768 > 1024)
./target/release/text-embeddings-router --model-id ../Qwen3-Embedding-0.6B/ --pooling last-token --port 8080 --dtype float32 --max-batch-tokens 1024 --auto-truncate
2025-09-20T09:59:09.902213Z  INFO text_embeddings_router: router/src/main.rs:203: Args { model_id: "../Qwe**-*********-0.6B/", revision: None, tokenization_workers: None, dtype: Some(Float32), pooling: Some(LastToken), max_concurrent_requests: 512, max_batch_tokens: 1024, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, dense_path: None, hf_api_token: None, hf_token: None, hostname: "0.0.0.0", port: 8080, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: None, payload_limit: 2000000, api_key: None, json_output: false, disable_spans: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", prometheus_port: 9000, cors_allow_origin: None }
2025-09-20T09:59:10.231469Z  WARN text_embeddings_router: router/src/lib.rs:191: Could not find a Sentence Transformers config
2025-09-20T09:59:10.231513Z  WARN text_embeddings_router: router/src/lib.rs:205: Reduce `max_input_length` to `max_batch_tokens` (from 32768 to 1024)
2025-09-20T09:59:10.231517Z  INFO text_embeddings_router: router/src/lib.rs:215: Maximum number of tokens per request: 1024
2025-09-20T09:59:10.231673Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:38: Starting 8 tokenization workers
2025-09-20T09:59:10.534633Z  INFO text_embeddings_router: router/src/lib.rs:263: Starting model backend
2025-09-20T09:59:10.539197Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:305: Starting Qwen3 model on Cpu
2025-09-20T09:59:13.086429Z  INFO text_embeddings_router: router/src/lib.rs:281: Warming up model
2025-09-20T09:59:25.175351Z  WARN text_embeddings_router: router/src/lib.rs:290: Backend does not support a batch size > 4
2025-09-20T09:59:25.175381Z  WARN text_embeddings_router: router/src/lib.rs:291: forcing `max_batch_requests=4`
2025-09-20T09:59:25.176762Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1852: Starting HTTP server: 0.0.0.0:8080
2025-09-20T09:59:25.176786Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1853: Ready
2025-09-20T10:00:00.953783Z  INFO embed{total_time="11.896093678s" tokenization_time="31.117792ms" queue_time="407.883µs" inference_time="11.864449086s"}: text_embeddings_router::http::server: router/src/http/server.rs:733: Success

curl -vvvv -H "Content-Type: application/json" -d @tei_qwen_embed_0.6b_broken_input.json http://localhost:8080/embed
*   Trying 127.0.0.1:8080...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /embed HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 55126
> 

* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: application/json
< x-compute-type: gpu+optimized
< x-compute-time: 11896
< x-compute-characters: 55060
< x-compute-tokens: 1024
< x-total-time: 11896
< x-tokenization-time: 31
< x-queue-time: 0
< x-inference-time: 11864
< vary: origin, access-control-request-method, access-control-request-headers
< access-control-allow-origin: *
< content-length: 12759
< date: Sat, 20 Sep 2025 10:00:00 GMT

Note that despite the 55,126-byte payload, x-compute-tokens is capped at 1024: the input was truncated to max_batch_tokens instead of hanging the service.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil @alvarobartt

@kozistr changed the title from "Fix the infinite loop when max_input_length is bigger than max-batch-tokens ." to "Fix the infinite loop when max_input_length is bigger than max-batch-tokens" on Sep 20, 2025
@alvarobartt (Member) left a comment

Thanks @kozistr, I've included some wording suggestions whilst I review the rest and make sure it works as expected! 🤗

@alvarobartt (Member) commented

P.S. The cargo tests are failing with HTTP 401 Unauthorized, which is most likely related to the recently added HF_TOKEN requirement for the EmbeddingGemma tests. I'll look into that separately since it's unrelated to this PR per se, apologies for the inconvenience 🤗

kozistr and others added 2 commits September 26, 2025 01:40
Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
Co-authored-by: Alvaro Bartolome <36760800+alvarobartt@users.noreply.github.com>
@alvarobartt alvarobartt merged commit a593f66 into huggingface:main Sep 25, 2025
@kozistr kozistr deleted the fix/infinite-loop branch September 25, 2025 17:50
Successfully merging this pull request may close these issues:

  • Infinite Loop When max-batch-tokens < model max_input_length
  • Some inputs hang the whole embedding service on Qwen3-Embedding-0.6B