Support of Docker/Kubernetes CPU limit/reservation #170
Comments
Super interesting!

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - MKL_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```

?

TEI uses https://crates.io/crates/num_cpus internally, which correctly gets the number of CPUs. My guess is that only MKL is confused.
@OlivierDehaene thanks for your fast answer! Your configuration is working fine:
There is a slight difference but it doesn't mean anything (I kept the 11.03 value for the
Ok then I'm not sure there is a lot that can be done here besides adding some documentation to explain this issue in the README/docs.
Do you think TEI can set those environment variables at the beginning of its startup phase?
Setting these values correctly would be really hard since they are MKL/runtime specific. Plus, they would have to be set before execution, which implies adding a launcher script in front of the TEI binary. I think it's better to document the problem and let users find the best values for their specific environment.
So it would be great if the documentation could state a rule of thumb (that would be a sensible default), like this: for people running the CPU-based Docker image with Docker or Kubernetes CPU limits, set those environment variables and replace
TEI running on bare metal definitely doesn't require such a script but, since this repo includes the CPU-based Dockerfile, the CPU-based image would definitely be more user-friendly with a script :).
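As a rough illustration of what such a launcher could do (this is a sketch, not TEI code: it assumes cgroup v2, and the fallback behavior and variable handling are my own choices), one can read the container's CFS quota and export a matching thread count before starting the server:

```python
import math
import os

# Hypothetical launcher logic: read the cgroup v2 CPU quota
# ("<quota> <period>" in /sys/fs/cgroup/cpu.max) and derive the
# number of CPUs the container is actually allowed to use.
def cpu_limit(default: int = os.cpu_count() or 1) -> int:
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
    except OSError:
        # Not running under cgroup v2: fall back to the machine's count.
        return default
    if quota == "max":
        # No quota configured for this cgroup.
        return default
    return max(1, math.ceil(int(quota) / int(period)))

limit = cpu_limit()
# Export a matching thread count (keeping any value the user already set),
# then the real launcher would exec the TEI binary.
os.environ.setdefault("MKL_NUM_THREADS", str(limit))
print(limit, os.environ["MKL_NUM_THREADS"])
```

On a quota-limited container this would export a small, matching thread count instead of letting MKL spawn one thread per physical CPU.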
This issue causes serious performance problems when running on servers with many CPUs. Example: a multi-socket system has 256 CPUs, and a user wants to dedicate 8 CPUs to each text-embeddings-inference container. What happens now is that every container creates 2 x 256 threads. When those threads are squeezed onto 8 CPUs, the whole service runs so slowly that it looks broken. Worker thread pools need to be sized from the allowed CPUs, not all CPUs in the system. For logs, see issue opea-project/GenAIExamples#763
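The "allowed vs. all CPUs" distinction can be demonstrated with a short sketch (Linux-specific, and not TEI code): the machine-wide count is what naive pool sizing uses, while the affinity mask is what the process may actually run on.

```python
import os

# os.cpu_count() reports every logical CPU in the machine, even inside a
# container pinned to a subset of them. On Linux, os.sched_getaffinity(0)
# returns only the CPUs this process is allowed to run on, which is the
# right basis for sizing worker thread pools.
total = os.cpu_count()
allowed = len(os.sched_getaffinity(0))
print(f"total={total} allowed={allowed}")
```

One caveat: a CFS quota (`cpus: '2'`) does not shrink the affinity mask, only `cpuset` pinning does, so quota-limited containers still look like they own the whole machine here and need explicit thread-count variables.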
Feature request
Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.
Docker (swarm):
Kubernetes:
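For reference, a minimal sketch of both forms (the service/pod names and values are illustrative, not taken from this issue):

```yaml
# Docker Compose / Swarm: deploy.resources.limits.cpus
services:
  embeddings:
    deploy:
      resources:
        limits:
          cpus: '2'
---
# Kubernetes: resources.limits.cpu on the container
apiVersion: v1
kind: Pod
metadata:
  name: embeddings
spec:
  containers:
    - name: embeddings
      image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
      resources:
        limits:
          cpu: "2"
```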
However, for this to work optimally (see motivation below), the application in the container has to be aware of the limit and allocate its thread pools accordingly.
Motivation
If thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).
For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following apache bench command (a request containing a single 17 KB text to be processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):

You can see on line 3 that cpus=2 (without environment variables) performance is 6 times slower than with cpuset=0,1. The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.

You can see on line 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).

no cpu limit configuration:

cpuset=0,1 configuration:

cpus=2 configuration:

cpus=2 + env vars configuration:

Your contribution
I'm afraid I can't do much more than this.
Please note that I don't really know which of my environment variables actually have an impact on performance, since I am totally unaware of the internals of text-embeddings-inference.
The issue is known though:
https://danluu.com/cgroup-throttling/
https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb
Some ecosystems have begun to take this into account.
For instance, since Python 3.13 you can fool Python into thinking it has fewer CPUs using an environment variable:
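To illustrate (a sketch, assuming CPython 3.13 or later): Python 3.13 added the `PYTHON_CPU_COUNT` environment variable (and an equivalent `-X cpu_count` option), which overrides what `os.cpu_count()` reports.

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHON_CPU_COUNT set; on CPython >= 3.13
# os.cpu_count() in the child reports the overridden value instead of
# probing the machine. Older interpreters ignore the variable.
env = dict(os.environ, PYTHON_CPU_COUNT="2")
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.cpu_count())"],
    env=env, capture_output=True, text=True, check=True,
)
reported = int(result.stdout)
print(reported)
```

An application sizing its thread pools from `os.cpu_count()` then automatically respects the container's CPU limit when the deployment sets this variable.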
Java does that automatically since Java 15: