Infinite Loop When max-batch-tokens < model max_input_length #723

@andrey-chernykh

Description

System Info

cargo 1.85.1 (d73d2caf9 2024-12-31)

{
  "model_id": "Qwen/Qwen3-Embedding-0.6B",
  "model_sha": null,
  "model_dtype": "float32",
  "model_type": {
    "embedding": {
      "pooling": "last_token"
    }
  },
  "max_concurrent_requests": 512,
  "max_input_length": 32768,
  "max_batch_tokens": 1000,
  "max_batch_requests": 4,
  "max_client_batch_size": 32,
  "auto_truncate": false,
  "tokenization_workers": 32,
  "version": "1.8.2",
  "sha": "ff3969a9e55405dda42b6dd167bd0c5c6900c2b0",
  "docker_label": null
}

Linux hostname 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Hardware: CPU-only run

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. cargo run --release --features candle,ort,http --no-default-features -- --model-id Qwen/Qwen3-Embedding-0.6B --max-batch-tokens 1000
  2. curl -vvvv -H "Content-Type: application/json" -d @tei_qwen_embed_0.6b_broken_input.json http://localhost:3000/embed

Attachment: tei_qwen_embed_0.6b_broken_input.json (the request payload passed to curl above)

  3. Observe: the request never completes, and inference never actually starts (0% CPU usage).

Relevant source code (the batching check):

if total_tokens > max_batch_tokens {
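
For context, here is a minimal sketch of how a batching loop built around this check can strand a request. This is not the actual text-embeddings-inference queue code; the types and function below are illustrative assumptions. The point is that an entry whose token count exceeds max_batch_tokens never fits into any batch, so it sits at the head of the queue forever while every batch comes back empty:

```rust
use std::collections::VecDeque;

struct Entry {
    id: usize,
    tokens: usize,
}

// Pops as many queued entries as fit into the `max_batch_tokens` budget.
// If the entry at the head is already larger than the budget, the loop
// breaks immediately, the batch stays empty, and the entry is never
// removed from the queue.
fn next_batch(queue: &mut VecDeque<Entry>, max_batch_tokens: usize) -> Vec<Entry> {
    let mut batch = Vec::new();
    let mut total_tokens = 0;

    while let Some(entry) = queue.front() {
        if total_tokens + entry.tokens > max_batch_tokens {
            break; // oversized head entry: nothing is ever scheduled
        }
        total_tokens += entry.tokens;
        batch.push(queue.pop_front().unwrap());
    }
    batch
}

fn main() {
    // max_input_length (32768) admits a prompt far larger than
    // max_batch_tokens (1000), matching the reported configuration.
    let mut queue = VecDeque::from([Entry { id: 0, tokens: 5000 }]);

    let batch = next_batch(&mut queue, 1000);
    // The batch is empty and the entry is still queued; a real scheduler
    // would retry forever, which matches the observed hang (0% CPU).
    assert!(batch.is_empty());
    println!("request {} is stuck in the queue", queue[0].id);
}
```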

Expected behavior

Depends on configuration. If auto_truncate is set, truncate to min(max_batch_tokens, max_input_length). If not, reply with an error.
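
A hedged sketch of that expected behavior as an admission-time check. The function itself is hypothetical; only the configuration field names (auto_truncate, max_batch_tokens, max_input_length) come from the /info output above:

```rust
/// Hypothetical admission-time validation sketching the expected behavior.
/// Only the configuration field names mirror the /info output above.
fn validate_input(
    n_tokens: usize,
    max_batch_tokens: usize,
    max_input_length: usize,
    auto_truncate: bool,
) -> Result<usize, String> {
    // An input can never be scheduled if it exceeds either limit.
    let limit = max_batch_tokens.min(max_input_length);
    if n_tokens <= limit {
        Ok(n_tokens)
    } else if auto_truncate {
        // Truncate to min(max_batch_tokens, max_input_length).
        Ok(limit)
    } else {
        // Reject up front instead of queueing an unschedulable entry.
        Err(format!(
            "input has {n_tokens} tokens, which exceeds the schedulable limit of {limit}"
        ))
    }
}

fn main() {
    // Reported config: max_batch_tokens = 1000, max_input_length = 32768.
    assert_eq!(validate_input(500, 1000, 32768, false), Ok(500));
    assert_eq!(validate_input(5000, 1000, 32768, true), Ok(1000)); // truncated
    assert!(validate_input(5000, 1000, 32768, false).is_err()); // rejected
}
```

Validating before enqueueing keeps entries that can never fit into any batch out of the queue entirely, so the scheduler never spins on them.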
