Loading a model with TGI consumes all available GPU memory #1824

Open · 2 of 4 tasks
IdleIdiot opened this issue Apr 28, 2024 · 1 comment

@IdleIdiot
System Info

Environments

>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
>> curl 127.0.0.1:8081/info | jq
{
  "model_id": "/data/llama2/llama2-chat-13b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 342560,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "1.4.4",
  "sha": "6c4496a1a30f119cebd3afbfedd847039325dbc9",
  "docker_label": "sha-6c4496a"
}
>> docker exec f4f ls -lh /data/llama2/llama2-chat-13b-hf
total 49G
-rw-r--r-- 1 root root  638 Feb  5 01:49 config.json
-rw-r--r-- 1 root root  111 Feb  5 01:49 generation_config.json
-rw-r--r-- 1 root root 4.7G Mar 27 07:11 model-00001-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00002-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00003-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00004-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00005-of-00006.safetensors
-rw-r--r-- 1 root root 1.2G Mar 27 07:12 model-00006-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00001-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00002-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00003-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00004-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00005-of-00006.bin
-rw-r--r-- 1 root root 1.2G Feb  5 01:50 pytorch_model-00006-of-00006.bin
-rw-r--r-- 1 root root  30K Feb  5 01:50 pytorch_model.bin.index.json
-rw-r--r-- 1 root root  414 Feb  5 01:48 special_tokens_map.json
-rw-r--r-- 1 root root 1.8M Feb  5 01:48 tokenizer.json
-rw-r--r-- 1 root root 489K Feb  5 01:48 tokenizer.model
-rw-r--r-- 1 root root  932 Feb  5 01:48 tokenizer_config.json
>> nvidia-smi 
Sun Apr 28 13:51:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0              35W / 250W |  37625MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   31C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |  37593MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    303244      C   /opt/conda/bin/python3.10                 37612MiB |
|    1   N/A  N/A    303245      C   /opt/conda/bin/python3.10                 37620MiB |
|    2   N/A  N/A    303248      C   /opt/conda/bin/python3.10                 37620MiB |
|    3   N/A  N/A    303252      C   /opt/conda/bin/python3.10                 37620MiB |
|    4   N/A  N/A    303251      C   /opt/conda/bin/python3.10                 37620MiB |
|    5   N/A  N/A    303256      C   /opt/conda/bin/python3.10                 37620MiB |
|    6   N/A  N/A    303254      C   /opt/conda/bin/python3.10                 37620MiB |
|    7   N/A  N/A    303260      C   /opt/conda/bin/python3.10                 37580MiB |
+---------------------------------------------------------------------------------------+
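
For reference: as far as I can tell, TGI pre-allocates its KV cache during warmup and, when --max-batch-total-tokens is not given, sizes it from whatever VRAM is still free. That would explain both the "max_batch_total_tokens": 342560 reported by /info and the ~37 GiB resident on every GPU while the server is idle. A minimal, untested sketch of capping that allocation with the launcher's --cuda-memory-fraction option (the 0.5 value is purely illustrative):

# Untested sketch: limit how much of each GPU's memory TGI may claim.
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    command: --model-id /data/llama2/llama2-chat-13b-hf --cuda-memory-fraction 0.5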

When I load the same model with plain transformers, GPU memory usage is clearly normal.

>> cat demo.py 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

model_path = "/data/wanghui01/models/llama2/llama2-chat-13b-hf/"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
# device_map="auto" spreads the bf16 weights across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_path)
# keep the process alive so GPU memory can be inspected with nvidia-smi
input("press any key to continue...")

>> python demo.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.03it/s]
press any key to continue...
>> nvidia-smi 
Sun Apr 28 13:56:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   3179MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   30C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   30C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |    741MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    338969      C   python                                     3166MiB |
|    1   N/A  N/A    338969      C   python                                     4070MiB |
|    2   N/A  N/A    338969      C   python                                     4070MiB |
|    3   N/A  N/A    338969      C   python                                     4070MiB |
|    4   N/A  N/A    338969      C   python                                     4070MiB |
|    5   N/A  N/A    338969      C   python                                     4070MiB |
|    6   N/A  N/A    338969      C   python                                     4070MiB |
|    7   N/A  N/A    338969      C   python                                      728MiB |
+---------------------------------------------------------------------------------------+

I'm confused about what could be causing this; can anyone give me some advice?
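
The transformers numbers above are close to the raw fp16 weight footprint (roughly 13B parameters × 2 bytes ≈ 26 GB, i.e. 3–4 GiB per GPU when split eight ways), so the extra ~34 GiB per GPU under TGI presumably comes from the pre-allocated KV cache rather than the weights themselves. In case it is useful, another untested sketch that shards the model over only two GPUs instead of all eight (assuming --num-shard behaves as documented), leaving the remaining cards free:

# Untested sketch: expose only two GPUs to the container and shard across them.
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf --num-shard 2
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1"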

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Create docker-compose.yml:
>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  2. Run the container and check the GPU memory:
>> docker compose up llama2_api -d 
[+] Running 1/1
 ✔ Container llama2_api  Started  

>> nvidia-smi

Expected behavior

  1. GPU memory usage should be similar to what loading the model with transformers shows above.
@canamika27

Facing the same issue. Any update on why this is happening?
