Support of Docker/Kubernetes CPU limit/reservation #170
Comments
Super interesting!

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024
      # interesting variables below
      - MKL_NUM_THREADS=1
      - MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
      - MKL_DYNAMIC="FALSE"
    deploy:
      resources:
        limits:
          cpus: '2'
```

?

TEI uses https://crates.io/crates/num_cpus internally, which correctly gets the number of CPUs. My guess is that only MKL is confused.
@OlivierDehaene thanks for your fast answer! Your configuration is working fine:
There is a slight difference but it doesn't mean anything (I kept the 11.03 value for the
Ok then I'm not sure there is a lot that can be done here besides adding some documentation to explain this issue in the README/docs.
Do you think TEI can set those environment variables at the beginning of its startup phase?
Setting these values correctly would be really hard since they are MKL/runtime specific. Plus, they would have to be set before execution, which implies adding a launcher script in front of the TEI binary. I think it's better to document the problem and let users find the best values for their specific environment.
So it would be great if the documentation could state a rule of thumb (that would be a sensible default), like this: for people running the CPU-based Docker image with Docker or Kubernetes CPU limits, set those environment variables and replace
TEI running on bare metal definitely doesn't require such a script but, since this repo includes the CPU-based Dockerfile, the CPU-based image would definitely be more user-friendly with a script :).
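As a rough illustration of what such a launcher could do (this is a sketch, not TEI code: it assumes cgroup v2, and the fallback behavior and variable handling are my own choices), one can read the container's CFS quota and export a matching thread count before starting the server:

```python
import math
import os

# Hypothetical launcher logic: read the cgroup v2 CPU quota
# ("<quota> <period>" in /sys/fs/cgroup/cpu.max) and derive the
# number of CPUs the container is actually allowed to use.
def cpu_limit(default: int = os.cpu_count() or 1) -> int:
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
    except OSError:
        # Not running under cgroup v2: fall back to the machine's count.
        return default
    if quota == "max":
        # No quota configured for this cgroup.
        return default
    return max(1, math.ceil(int(quota) / int(period)))

limit = cpu_limit()
# Export a matching thread count (keeping any value the user already set),
# then the real launcher would exec the TEI binary.
os.environ.setdefault("MKL_NUM_THREADS", str(limit))
print(limit, os.environ["MKL_NUM_THREADS"])
```

On a quota-limited container this would export a small, matching thread count instead of letting MKL spawn one thread per physical CPU.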
This issue causes serious performance problems when running on servers with many CPUs. Example: a multi-socket system has 256 CPUs, and a user wants to dedicate 8 CPUs to each text-embeddings-inference container. What happens now is that every container creates 2 x 256 threads. When those threads are squeezed onto 8 CPUs, the whole service runs so slowly that it looks broken. Worker thread pools need to be sized from the allowed CPUs, not all CPUs in the system. For logs, see issue opea-project/GenAIExamples#763
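The "allowed vs. all CPUs" distinction can be demonstrated with a short sketch (Linux-specific, and not TEI code): the machine-wide count is what naive pool sizing uses, while the affinity mask is what the process may actually run on.

```python
import os

# os.cpu_count() reports every logical CPU in the machine, even inside a
# container pinned to a subset of them. On Linux, os.sched_getaffinity(0)
# returns only the CPUs this process is allowed to run on, which is the
# right basis for sizing worker thread pools.
total = os.cpu_count()
allowed = len(os.sched_getaffinity(0))
print(f"total={total} allowed={allowed}")
```

One caveat: a CFS quota (`cpus: '2'`) does not shrink the affinity mask, only `cpuset` pinning does, so quota-limited containers still look like they own the whole machine here and need explicit thread-count variables.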
Feature request
Docker (swarm) and Kubernetes have a way to limit CPU usage of a container.
Docker (swarm):
Kubernetes:
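For reference, a minimal sketch of both forms (the service/pod names and values are illustrative, not taken from this issue):

```yaml
# Docker Compose / Swarm: deploy.resources.limits.cpus
services:
  embeddings:
    deploy:
      resources:
        limits:
          cpus: '2'
---
# Kubernetes: resources.limits.cpu on the container
apiVersion: v1
kind: Pod
metadata:
  name: embeddings
spec:
  containers:
    - name: embeddings
      image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
      resources:
        limits:
          cpu: "2"
```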
However, for this to work optimally (see motivation below), the application in the container has to be aware of the limit and allocate its thread pools accordingly.
Motivation
If thread pools don't match the CPU limit, the container is throttled and performance drops far below expectations (6 times slower in the example below).
For instance, on my Core i3-8300H (4 cores, 8 threads), I'm evaluating performance with the following apache bench command (a request containing a single 17 KB text to be processed with the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model):

You can see on line 3 that cpus=2 (without environment variables) performance is 6 times slower than with cpuset=0,1. The problem is that neither Kubernetes nor Docker Swarm allows the cpuset option.

You can see on line 4 that adding environment variables controlling the number of threads has a positive impact on performance (almost on par with cpuset=0,1).

no cpu limit configuration:

cpuset=0,1 configuration:

cpus=2 configuration:

cpus=2 + env vars configuration:

Your contribution
I'm afraid I can't do much more than this.
Please note that I don't really know which of my environment variables actually have an impact on performance, since I am totally unaware of the internals of text-embeddings-inference.
The issue is known though:
https://danluu.com/cgroup-throttling/
https://nemre.medium.com/is-your-go-application-really-using-the-correct-number-of-cpu-cores-20915d2b6ccb
Some ecosystems have begun to take this into account.
For instance, since Python 3.13 you can fool Python into thinking it has fewer CPUs using an environment variable:
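To illustrate (a sketch, assuming CPython 3.13 or later): Python 3.13 added the `PYTHON_CPU_COUNT` environment variable (and an equivalent `-X cpu_count` option), which overrides what `os.cpu_count()` reports.

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHON_CPU_COUNT set; on CPython >= 3.13
# os.cpu_count() in the child reports the overridden value instead of
# probing the machine. Older interpreters ignore the variable.
env = dict(os.environ, PYTHON_CPU_COUNT="2")
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.cpu_count())"],
    env=env, capture_output=True, text=True, check=True,
)
reported = int(result.stdout)
print(reported)
```

An application sizing its thread pools from `os.cpu_count()` then automatically respects the container's CPU limit when the deployment sets this variable.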
Java does that automatically since Java 15: