>> cat docker-compose.yml
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
>> curl 127.0.0.1:8081/info | jq
{
  "model_id": "/data/llama2/llama2-chat-13b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 342560,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "1.4.4",
  "sha": "6c4496a1a30f119cebd3afbfedd847039325dbc9",
  "docker_label": "sha-6c4496a"
}
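The same limits can also be read programmatically; a minimal sketch using requests (my addition for convenience, the field names are taken from the JSON above):

import requests

# Query the /info endpoint shown above and print the limits that relate
# to batching and the server's token budget.
info = requests.get("http://127.0.0.1:8081/info", timeout=10).json()
for key in ("max_input_length", "max_total_tokens",
            "max_batch_total_tokens", "max_concurrent_requests"):
    print(key, "=", info.get(key))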
>> docker exec f4f ls -lh /data/llama2/llama2-chat-13b-hf
total 49G
-rw-r--r-- 1 root root  638 Feb  5 01:49 config.json
-rw-r--r-- 1 root root  111 Feb  5 01:49 generation_config.json
-rw-r--r-- 1 root root 4.7G Mar 27 07:11 model-00001-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00002-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00003-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00004-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00005-of-00006.safetensors
-rw-r--r-- 1 root root 1.2G Mar 27 07:12 model-00006-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00001-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00002-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00003-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00004-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00005-of-00006.bin
-rw-r--r-- 1 root root 1.2G Feb  5 01:50 pytorch_model-00006-of-00006.bin
-rw-r--r-- 1 root root  30K Feb  5 01:50 pytorch_model.bin.index.json
-rw-r--r-- 1 root root  414 Feb  5 01:48 special_tokens_map.json
-rw-r--r-- 1 root root 1.8M Feb  5 01:48 tokenizer.json
-rw-r--r-- 1 root root 489K Feb  5 01:48 tokenizer.model
-rw-r--r-- 1 root root  932 Feb  5 01:48 tokenizer_config.json
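Worth noting: the 49G total appears to be two copies of the same checkpoint, the *.safetensors shards plus the legacy pytorch_model-*.bin shards. A quick tally of the sizes listed above (illustrative arithmetic only):

# GB values copied from the ls -lh output above.
safetensors_gb = [4.7, 4.7, 4.7, 4.6, 4.6, 1.2]
bin_gb = [4.7, 4.7, 4.7, 4.6, 4.6, 1.2]
print(sum(safetensors_gb))                 # ~24.5 GB of safetensors shards
print(sum(bin_gb))                         # ~24.5 GB of .bin shards
print(sum(safetensors_gb) + sum(bin_gb))   # ~49 GB, matching "total 49G"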
>> nvidia-smi
Sun Apr 28 13:51:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0              35W / 250W |  37625MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   31C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |  37593MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|========================================================================================|
|    0   N/A  N/A    303244      C   /opt/conda/bin/python3.10                  37612MiB |
|    1   N/A  N/A    303245      C   /opt/conda/bin/python3.10                  37620MiB |
|    2   N/A  N/A    303248      C   /opt/conda/bin/python3.10                  37620MiB |
|    3   N/A  N/A    303252      C   /opt/conda/bin/python3.10                  37620MiB |
|    4   N/A  N/A    303251      C   /opt/conda/bin/python3.10                  37620MiB |
|    5   N/A  N/A    303256      C   /opt/conda/bin/python3.10                  37620MiB |
|    6   N/A  N/A    303254      C   /opt/conda/bin/python3.10                  37620MiB |
|    7   N/A  N/A    303260      C   /opt/conda/bin/python3.10                  37580MiB |
+---------------------------------------------------------------------------------------+
When I load the model directly with transformers, the GPU memory usage looks normal (a rough sanity check on the numbers follows the second nvidia-smi listing below).
>> cat demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

model_path = "/data/wanghui01/models/llama2/llama2-chat-13b-hf/"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.generation_config = GenerationConfig.from_pretrained(model_path)

input("press any key to continue...")

>> python demo.py
Loading checkpoint shards: 100%|██████████| 6/6 [00:05<00:00, 1.03it/s]
press any key to continue...
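For comparison, the per-device usage can also be read from inside the process rather than via nvidia-smi; a minimal sketch (my addition, not part of the original demo.py) that could be appended before the input() call:

import torch

# Report what PyTorch itself has allocated/reserved on each GPU; nvidia-smi
# shows a somewhat larger figure because it also counts the CUDA context.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")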
>> nvidia-smi
Sun Apr 28 13:56:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   3179MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   30C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   30C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |    741MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|========================================================================================|
|    0   N/A  N/A    338969      C   python                                      3166MiB |
|    1   N/A  N/A    338969      C   python                                      4070MiB |
|    2   N/A  N/A    338969      C   python                                      4070MiB |
|    3   N/A  N/A    338969      C   python                                      4070MiB |
|    4   N/A  N/A    338969      C   python                                      4070MiB |
|    5   N/A  N/A    338969      C   python                                      4070MiB |
|    6   N/A  N/A    338969      C   python                                      4070MiB |
|    7   N/A  N/A    338969      C   python                                       728MiB |
+---------------------------------------------------------------------------------------+
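As a rough sanity check on those numbers (back-of-the-envelope arithmetic added here, not part of the original report): a 13B-parameter model in bf16 is about 24 GiB of weights, so sharding it across 8 GPUs with device_map="auto" should land at roughly 3 GiB per card, which is close to what nvidia-smi shows above.

# Back-of-the-envelope estimate of per-GPU weight memory for a 13B model
# loaded in bf16 and spread over 8 GPUs with device_map="auto".
params = 13e9            # approximate parameter count of Llama-2-13B
bytes_per_param = 2      # bf16 / fp16
num_gpus = 8

total_gib = params * bytes_per_param / 1024**3
print(f"total weights ~= {total_gib:.1f} GiB")              # ~24.2 GiB
print(f"per GPU       ~= {total_gib / num_gpus:.1f} GiB")   # ~3.0 GiB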
I'm confused about what could be causing this difference; can anyone give me some advice?
These are the steps I run to reproduce it (the nvidia-smi output after starting the container is the first listing above):

>> docker compose up llama2_api -d
[+] Running 1/1
 ✔ Container llama2_api  Started
>> nvidia-smi
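To confirm the server itself is responding once it is up, a minimal sketch against TGI's /generate route on the host port mapped in docker-compose.yml (my addition, not from the original report):

import requests

# Send a single generation request to the container started above.
resp = requests.post(
    "http://127.0.0.1:8081/generate",
    json={"inputs": "What is deep learning?",
          "parameters": {"max_new_tokens": 20}},
    timeout=60,
)
print(resp.json())  # e.g. {"generated_text": "..."}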
Facing the same issue. Any update on why this is happening?