Loading a model with TGI consumes all available GPU memory #1824

Open · 2 of 4 tasks
IdleIdiot opened this issue Apr 28, 2024 · 1 comment

@IdleIdiot
System Info

Environments

>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
>> curl 127.0.0.1:8081/info | jq
{
  "model_id": "/data/llama2/llama2-chat-13b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 342560,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "1.4.4",
  "sha": "6c4496a1a30f119cebd3afbfedd847039325dbc9",
  "docker_label": "sha-6c4496a"
}
>> docker exec f4f ls -lh /data/llama2/llama2-chat-13b-hf
total 49G
-rw-r--r-- 1 root root  638 Feb  5 01:49 config.json
-rw-r--r-- 1 root root  111 Feb  5 01:49 generation_config.json
-rw-r--r-- 1 root root 4.7G Mar 27 07:11 model-00001-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00002-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00003-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00004-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00005-of-00006.safetensors
-rw-r--r-- 1 root root 1.2G Mar 27 07:12 model-00006-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00001-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00002-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00003-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00004-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00005-of-00006.bin
-rw-r--r-- 1 root root 1.2G Feb  5 01:50 pytorch_model-00006-of-00006.bin
-rw-r--r-- 1 root root  30K Feb  5 01:50 pytorch_model.bin.index.json
-rw-r--r-- 1 root root  414 Feb  5 01:48 special_tokens_map.json
-rw-r--r-- 1 root root 1.8M Feb  5 01:48 tokenizer.json
-rw-r--r-- 1 root root 489K Feb  5 01:48 tokenizer.model
-rw-r--r-- 1 root root  932 Feb  5 01:48 tokenizer_config.json
>> nvidia-smi 
Sun Apr 28 13:51:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0              35W / 250W |  37625MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   31C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |  37593MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    303244      C   /opt/conda/bin/python3.10                 37612MiB |
|    1   N/A  N/A    303245      C   /opt/conda/bin/python3.10                 37620MiB |
|    2   N/A  N/A    303248      C   /opt/conda/bin/python3.10                 37620MiB |
|    3   N/A  N/A    303252      C   /opt/conda/bin/python3.10                 37620MiB |
|    4   N/A  N/A    303251      C   /opt/conda/bin/python3.10                 37620MiB |
|    5   N/A  N/A    303256      C   /opt/conda/bin/python3.10                 37620MiB |
|    6   N/A  N/A    303254      C   /opt/conda/bin/python3.10                 37620MiB |
|    7   N/A  N/A    303260      C   /opt/conda/bin/python3.10                 37580MiB |
+---------------------------------------------------------------------------------------+
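
For reference: as far as I can tell, TGI pre-allocates its KV cache during warmup and, when --max-batch-total-tokens is not given, sizes it from whatever VRAM is still free. That would explain both the "max_batch_total_tokens": 342560 reported by /info and the ~37 GiB resident on every GPU while the server is idle. A minimal, untested sketch of capping that allocation with the launcher's --cuda-memory-fraction option (the 0.5 value is purely illustrative):

# Untested sketch: limit how much of each GPU's memory TGI may claim.
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    command: --model-id /data/llama2/llama2-chat-13b-hf --cuda-memory-fraction 0.5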

When I load the same model with plain transformers, GPU memory usage is clearly normal.

>> cat demo.py 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

model_path = "/data/wanghui01/models/llama2/llama2-chat-13b-hf/"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
# device_map="auto" spreads the bf16 weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_path)
# keep the process alive so GPU memory can be inspected with nvidia-smi
input("press any key to continue...")

>> python demo.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.03it/s]
press any key to continue...
>> nvidia-smi 
Sun Apr 28 13:56:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   3179MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   30C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   30C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |    741MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    338969      C   python                                     3166MiB |
|    1   N/A  N/A    338969      C   python                                     4070MiB |
|    2   N/A  N/A    338969      C   python                                     4070MiB |
|    3   N/A  N/A    338969      C   python                                     4070MiB |
|    4   N/A  N/A    338969      C   python                                     4070MiB |
|    5   N/A  N/A    338969      C   python                                     4070MiB |
|    6   N/A  N/A    338969      C   python                                     4070MiB |
|    7   N/A  N/A    338969      C   python                                      728MiB |
+---------------------------------------------------------------------------------------+

I'm confused about what could be causing this; can anyone give me some advice?
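
The transformers numbers above are close to the raw fp16 weight footprint (roughly 13B parameters × 2 bytes ≈ 26 GB, i.e. 3–4 GiB per GPU when split eight ways), so the extra ~34 GiB per GPU under TGI presumably comes from the pre-allocated KV cache rather than the weights themselves. In case it is useful, another untested sketch that shards the model over only two GPUs instead of all eight (assuming --num-shard behaves as documented), leaving the remaining cards free:

# Untested sketch: expose only two GPUs to the container and shard across them.
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf --num-shard 2
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1"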

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Create docker-compose.yml:
>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  2. Run the container and check the GPU memory:
>> docker compose up llama2_api -d 
[+] Running 1/1
 ✔ Container llama2_api  Started  

>> nvidia-smi

Expected behavior

  1. GPU memory usage should be similar to what loading the model with transformers shows above.
@canamika27

Facing the same issue. Any update on why this is happening?
