Too many model backend threads destroy performance when running on CPU #405

Closed
askervin opened this issue Sep 11, 2024 · 0 comments · Fixed by #410

System Info

text-embeddings-inference:cpu-1.5

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run a container using the text-embeddings-inference:cpu-1.5 image so that cpuset.cpus is limited in cgroups. This can be done with docker run --cpuset-cpus ..., or in Kubernetes with NRI resource policies or the CPU manager.
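
For example, a minimal Docker reproduction restricted to four CPUs (the model id is a placeholder and the port mapping is arbitrary):

docker run --cpuset-cpus 0-3 -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.5 --model-id <model-id>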

For instance, in a system with 128 vCPUs / 64 physical CPU cores, the output of text-embeddings-router shows:
(The following excerpt is from the ChatQnA example application, kubectl logs chatqna-teirerank-...)

2024-09-09T11:54:19.994401Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "BAA*/***-********-*ase", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: true, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "chatqna-teirerank-7fd4d88d85-z2nzh", port: 2082, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }

---8<--- snip --->8---

2024-09-09T11:54:34.747212Z  INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
2024-09-09T11:54:34.758273Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 80, index: 0, mask: {1, 65, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T11:54:34.758288Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 84, index: 4, mask: {5, 69, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T11:54:34.758307Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 81, index: 1, mask: {2, 66, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T11:54:34.758353Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 83, index: 3, mask: {4, 68, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T11:54:34.758355Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 82, index: 2, mask: {3, 67, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
2024-09-09T11:54:34.758391Z  WARN ort::environment: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/ort-2.0.0-rc.2/src/environment.rs:266: pthread_setaffinity_np failed for thread: 85, index: 5, mask: {6, 70, }, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
...

That is, the model backend launches the wrong number of threads and tries to pin each thread to CPUs that are not allowed for this container.

Expected behavior

The model backend should align its number of threads with the number of CPUs actually available to it, and it should set the CPU affinity of its threads only to those available CPUs.
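
A minimal sketch of the expected behavior, assuming the backend derives its intra-op thread count from std::thread::available_parallelism() and passes it to the ONNX Runtime session explicitly (the actual fix in #410 may differ):

use std::thread::available_parallelism;

/// Number of backend threads to use: one per CPU the container is actually
/// allowed to run on, not one per host CPU.
fn backend_thread_count() -> usize {
    // On Linux, available_parallelism() reads the process affinity mask,
    // which is constrained by the container's cpuset, so it returns the
    // allowed CPU count instead of the host's 128 vCPUs.
    available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    let threads = backend_thread_count();
    // Hypothetical use: pass this count to the ONNX Runtime session builder,
    // e.g. ort's SessionBuilder::with_intra_threads(threads); specifying the
    // thread count explicitly also keeps ort from pinning threads to CPUs
    // outside the allowed cpuset.
    println!("model backend will use {threads} threads");
}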
