Description
System Info
cargo 1.85.1 (d73d2caf9 2024-12-31)

```json
{
  "model_id": "Qwen/Qwen3-Embedding-0.6B",
  "model_sha": null,
  "model_dtype": "float32",
  "model_type": {
    "embedding": {
      "pooling": "last_token"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 32768,
  "max_batch_tokens": 1000,
  "max_batch_requests": 4,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 32,
  "version": "1.8.2",
  "sha": "ff3969a9e55405dda42b6dd167bd0c5c6900c2b0",
  "docker_label": null
}
```
Linux hostname 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Hardware: CPU-only run
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
cargo run --release --features candle,ort,http --no-default-features -- --model-id Qwen/Qwen3-Embedding-0.6B --max-batch-tokens 1000
curl -vvvv -H "Content-Type: application/json" -d @tei_qwen_embed_0.6b_broken_input.json http://localhost:3000/embed
tei_qwen_embed_0.6b_broken_input.json
- from issue 694
- The request never completes; inference never actually starts (0% CPU usage)
Source code found:
text-embeddings-inference/core/src/queue.rs
Line 155 in ff3969a

```rust
if total_tokens > max_batch_tokens {
```
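A minimal sketch of why the request hangs (this is a simplified model, not the actual `queue.rs` implementation): the batcher only pops an entry when it fits within `max_batch_tokens`, so an entry whose token count alone exceeds that limit is never scheduled and waits in the queue forever.

```rust
// Simplified model of the batching condition: entries are popped only
// while they fit in the token budget. An oversized entry never fits,
// so it blocks the queue indefinitely.
fn next_batch(queue: &mut Vec<usize>, max_batch_tokens: usize) -> Vec<usize> {
    let mut batch = Vec::new();
    let mut total_tokens = 0;
    while let Some(&entry_tokens) = queue.first() {
        if total_tokens + entry_tokens > max_batch_tokens {
            // Entry does not fit. For an entry larger than
            // max_batch_tokens this branch is taken unconditionally.
            break;
        }
        total_tokens += entry_tokens;
        batch.push(queue.remove(0));
    }
    batch
}

fn main() {
    // max_batch_tokens = 1000, as in the reproduction above; one request
    // near max_input_length (32768 tokens) can never be scheduled.
    let mut queue = vec![32768];
    let batch = next_batch(&mut queue, 1000);
    assert!(batch.is_empty()); // nothing was scheduled
    assert_eq!(queue.len(), 1); // the request is stuck in the queue
    println!("batch: {:?}, still queued: {:?}", batch, queue);
}
```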
Expected behavior
Depends on configuration. If auto-truncate is set, truncate to min(max_batch_tokens, max_input_tokens). If not, reply with an error.
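The expected behavior could be sketched as an admission check performed before queuing (hypothetical function, not the library's API): truncate oversized requests when auto-truncate is on, otherwise reject them immediately instead of letting them hang.

```rust
// Hypothetical admission check, sketched for illustration: an oversized
// request is either truncated to min(max_batch_tokens, max_input_tokens)
// or rejected with an error, so it can never sit in the queue forever.
fn admit(
    tokens: usize,
    max_batch_tokens: usize,
    max_input_tokens: usize,
    auto_truncate: bool,
) -> Result<usize, String> {
    let limit = max_batch_tokens.min(max_input_tokens);
    if tokens <= limit {
        Ok(tokens)
    } else if auto_truncate {
        Ok(limit) // truncate to the schedulable limit
    } else {
        Err(format!(
            "input has {tokens} tokens but the batch limit is {limit}"
        ))
    }
}

fn main() {
    // With auto_truncate: the request is truncated and can be scheduled.
    assert_eq!(admit(32768, 1000, 32768, true), Ok(1000));
    // Without: the client gets an error instead of a hanging request.
    assert!(admit(32768, 1000, 32768, false).is_err());
    // Requests within the limit pass through unchanged.
    assert_eq!(admit(512, 1000, 32768, false), Ok(512));
}
```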