Load Hugging Face `transformers` models across multiple GPUs with a custom `device_map`.
First, explore how `accelerate` computes `max_memory` (a mapping from each device to the maximum memory available on it), following https://github.com/huggingface/accelerate/blob/v1.0.0rc1/src/accelerate/utils/modeling.py#L842C37-L842C63

In [1]:
import torch

def print_gpu_memory():
    """Print the current CUDA device and the free/total memory of every visible GPU."""
    gib = 1024 ** 3
    print(f"The current device is {torch.cuda.current_device()}")
    for i in range(torch.cuda.device_count()):
        # mem_get_info returns (free_bytes, total_bytes); query it once per device.
        free, total = torch.cuda.mem_get_info(i)
        print(f"GPU {i}: {free / gib:.2f} GB free, {total / gib:.2f} GB total")

print_gpu_memory()

The current device is 0
GPU 0: 44.09 GB free, 44.35 GB total
GPU 1: 44.09 GB free, 44.35 GB total
GPU 2: 44.09 GB free, 44.35 GB total
GPU 3: 44.09 GB free, 44.35 GB total
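
The free-memory figures above are the kind of input a `max_memory` mapping can be built from. The following is a minimal sketch of that idea, not `accelerate`'s actual implementation: `build_max_memory`, `query_free_bytes`, and the `headroom` safety factor are hypothetical names introduced here for illustration.

```python
GIB = 1024 ** 3


def build_max_memory(free_bytes, headroom=0.9):
    """Map each GPU index to a memory cap string such as '39GiB'.

    free_bytes: free bytes per GPU, in device order.
    headroom:   hypothetical safety factor, the fraction of free memory to offer.
    """
    max_memory = {}
    for index, free in enumerate(free_bytes):
        usable_gib = int(free * headroom // GIB)  # round down to whole GiB
        max_memory[index] = f"{usable_gib}GiB"
    return max_memory


def query_free_bytes():
    """Return free bytes per visible GPU, or [] when CUDA is unavailable."""
    import torch  # imported lazily so build_max_memory works without torch

    if not torch.cuda.is_available():
        return []
    return [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]


# Made-up numbers: two GPUs with 40 GiB and 10 GiB free, offering half of each.
print(build_max_memory([40 * GIB, 10 * GIB], headroom=0.5))
# → {0: '20GiB', 1: '5GiB'}
```

A dict of this shape can be passed as the `max_memory` argument of `from_pretrained` together with a `device_map` string such as `"auto"`.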


In [2]:
from transformers import AutoModelForCausalLM
import gc


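def test_memory_leak(model_name="gpt2"):
    """Print GPU memory before loading, after loading, and after garbage collection."""
    print_gpu_memory()
    print("Loading model...")
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="balanced_low_0", cache_dir="/workspace/hf_cache")
    print_gpu_memory()
    print("Collecting garbage...")
    # `model` is still referenced at this point, so its weights cannot be freed yet.
    gc.collect()
    torch.cuda.empty_cache()
    print_gpu_memory()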

test_memory_leak(model_name="gpt2")

The current device is 0
GPU 0: 44.09 GB free, 44.35 GB total
GPU 1: 44.09 GB free, 44.35 GB total
GPU 2: 44.09 GB free, 44.35 GB total
GPU 3: 44.09 GB free, 44.35 GB total
Loading model...
The current device is 0
GPU 0: 44.09 GB free, 44.35 GB total
GPU 1: 43.84 GB free, 44.35 GB total
GPU 2: 43.85 GB free, 44.35 GB total
GPU 3: 44.05 GB free, 44.35 GB total
Collecting garbage...
The current device is 0
GPU 0: 44.09 GB free, 44.35 GB total
GPU 1: 43.84 GB free, 44.35 GB total
GPU 2: 43.85 GB free, 44.35 GB total
GPU 3: 44.05 GB free, 44.35 GB total
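
One likely reason the numbers do not change after "Collecting garbage..." is that the local variable `model` in `test_memory_leak` is still alive when `gc.collect()` runs, so nothing it owns can be reclaimed at that point. The sketch below illustrates that Python behaviour with a plain stand-in object; `FakeModel` and both helper functions are hypothetical and hold no GPU memory.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a loaded model; holds no GPU memory."""


def collected_while_referenced():
    model = FakeModel()
    probe = weakref.ref(model)  # weak reference does not keep the object alive
    gc.collect()                # `model` is still a live local here
    return probe() is None      # False: the object survived the collection


def collected_after_del():
    model = FakeModel()
    probe = weakref.ref(model)
    del model                   # drop the last strong reference first
    gc.collect()
    return probe() is None      # True: the object has been reclaimed


print(collected_while_referenced(), collected_after_del())
```

For a real model, the analogous sequence would be `del model`, then `gc.collect()`, then `torch.cuda.empty_cache()`. Even then, free memory may not return fully to its starting value, because the CUDA context created on each GPU keeps some memory of its own.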


In [3]:
# Log in to Hugging Face, prompting for a token with getpass if none is set
from getpass import getpass
import os

if not os.environ.get("HUGGINGFACE_TOKEN"):
    huggingface_token = getpass("Enter your HuggingFace token: ")
    os.environ["HUGGINGFACE_TOKEN"] = huggingface_token
    
!huggingface-cli login --token $HUGGINGFACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
test_memory_leak(model_name="meta-llama/Meta-Llama-3.1-70B-Instruct")

The current device is 0
GPU 0: 44.09 GB free, 44.35 GB total
GPU 1: 43.84 GB free, 44.35 GB total
GPU 2: 43.85 GB free, 44.35 GB total
GPU 3: 44.05 GB free, 44.35 GB total
Loading model...


Downloading shards: 100%|██████████| 30/30 [09:32<00:00, 19.09s/it]
Loading checkpoint shards: 100%|██████████| 30/30 [00:50<00:00,  1.69s/it]


The current device is 0
GPU 0: 5.11 GB free, 44.35 GB total
GPU 1: 2.65 GB free, 44.35 GB total
GPU 2: 2.65 GB free, 44.35 GB total
GPU 3: 2.65 GB free, 44.35 GB total
Collecting garbage...
The current device is 0
GPU 0: 5.11 GB free, 44.35 GB total
GPU 1: 2.65 GB free, 44.35 GB total
GPU 2: 2.65 GB free, 44.35 GB total
GPU 3: 2.65 GB free, 44.35 GB total
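
With `device_map="balanced_low_0"` the placement is chosen automatically. A custom `device_map` is instead a plain dict from module names to devices. The sketch below shows one way such a dict might be built by assigning contiguous blocks of decoder layers to each GPU. The helper `layerwise_device_map` is hypothetical, and the module names (`model.embed_tokens`, `model.layers.N`, `model.norm`, `lm_head`) are an assumption based on Llama-style naming; check them against `model.named_modules()` for the installed `transformers` version, since other versions may expose additional submodules that also need an entry.

```python
def layerwise_device_map(num_layers, devices, prefix="model.layers"):
    """Assign contiguous blocks of decoder layers to the given devices.

    The module names used here are assumed (Llama-style), not verified
    against any particular checkpoint.
    """
    if not devices:
        raise ValueError("need at least one device")
    per_device = -(-num_layers // len(devices))  # ceiling division
    device_map = {
        "model.embed_tokens": devices[0],  # input embeddings on the first device
        "model.norm": devices[-1],         # final norm next to the output head
        "lm_head": devices[-1],
    }
    for layer in range(num_layers):
        device_map[f"{prefix}.{layer}"] = devices[layer // per_device]
    return device_map


# Toy example: 4 layers split over 2 GPUs.
for name, device in layerwise_device_map(4, [0, 1]).items():
    print(name, "->", device)
```

The resulting dict can be passed as `device_map=` to `from_pretrained` in place of a string such as `"balanced_low_0"`. An even split by layer count ignores the extra memory taken by the embeddings and the output head, so the first and last devices may need fewer layers in practice.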

