CUDA OOM error when using Llama converted weights, but not repository weights #31648

Closed

matthewclso opened this issue Jun 26, 2024 · 0 comments
matthewclso commented Jun 26, 2024

System Info

  • transformers version: 4.41.2
  • Platform: Linux-6.5.0-1020-aws-x86_64-with-glibc2.35
  • Python version: 3.11.5
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: 8x L4 GPUs
  • Using distributed or parallel set-up in script?: DeepSpeed (ZeRO not enabled)

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import pandas as pd

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM
from transformers.integrations.deepspeed import HfDeepSpeedConfig
from peft import get_peft_model, LoraConfig, TaskType
import torch
import deepspeed
from tqdm import tqdm

num_epochs = 100

deepspeed_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 2e-4,
            "betas": [0.9, 0.95],
            "weight_decay": 0,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-4,
            "warmup_num_steps": 1000,
        },
    },
    "bf16": {
        "enabled": True,
    },
    "data_efficency": {
        "enabled": True,
        "data_sampling": {
            "enabled": True,
            "num_epochs": num_epochs,
        },
    },
}

dschf = HfDeepSpeedConfig(deepspeed_config)

model_name = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens(
    {
        "pad_token": "<|pad|>",
        "unk_token": "<|unk|>",
    }
)
tokenizer.padding_side = "left"
max_len = 1024

config = AutoConfig.from_pretrained(model_name)
config.vocab_size = len(tokenizer)
config.pad_token_id = tokenizer.pad_token_id
config.padding_idx = tokenizer.pad_token_id

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    config=config,
    ignore_mismatched_sizes=True,
)

model.resize_token_embeddings(len(tokenizer))

model, optimizer, train_dataloader, lr_scheduler = deepspeed.initialize(
    model=model,
    config=deepspeed_config,
)

device = torch.device("cuda")

test_tensor = torch.zeros((1, max_len), dtype=torch.int64).to(device)
output = model(test_tensor)

The above code runs without error on my L4 GPUs. However, when I download the weights from the Llama 3 repository and convert them using

./download.sh
...
python src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir Meta-Llama-3-8B --model_size 8B --llama_version 3 --output_dir weights/Meta-Llama-3-8B-HF

and then run this code:

model = AutoModelForCausalLM.from_pretrained(
    "weights/Meta-Llama-3-8B-HF",
    torch_dtype=torch.bfloat16,
    config=config,
    ignore_mismatched_sizes=True,
)
...
test_tensor = torch.zeros((1, max_len), dtype=torch.int64).to(device)
output = model(test_tensor)

I get a CUDA OOM error. I also tried the same code on an 8x V100 32 GB machine, with the same result. Enabling any stage of ZeRO optimization does not make a difference either.
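
For reference, here is a quick way to check whether the converted checkpoint on disk stores its tensors in a wider dtype than the Hub checkpoint, since a float32 checkpoint would roughly double memory during loading. This is only a sketch: it assumes the converter wrote safetensors shards to weights/Meta-Llama-3-8B-HF (adjust the glob if the output is .bin shards instead).

import glob
import json

from safetensors import safe_open

# Assumed output directory of convert_llama_weights_to_hf.py
ckpt_dir = "weights/Meta-Llama-3-8B-HF"

# Dtype declared in the saved config (may differ from the actual tensors)
with open(f"{ckpt_dir}/config.json") as f:
    print("config torch_dtype:", json.load(f).get("torch_dtype"))

# Inspect the actual dtype of one tensor per shard
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as sf:
        name = next(iter(sf.keys()))
        print(shard, name, sf.get_tensor(name).dtype)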

Expected behavior

Using downloaded and converted weights vs. using Hugging Face repository weights should not make a difference in terms of CUDA memory usage.
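
A minimal way to quantify the difference is to compare peak allocated CUDA memory for the two checkpoints. This sketch reuses the config object from the reproduction script above and runs each model standalone (no DeepSpeed); the helper name peak_memory_gib is just for illustration.

import torch
from transformers import AutoModelForCausalLM

def peak_memory_gib(repo_or_path, config, max_len=1024):
    """Load a checkpoint, run one forward pass, and report peak CUDA memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained(
        repo_or_path,
        torch_dtype=torch.bfloat16,
        config=config,
        ignore_mismatched_sizes=True,
    ).to("cuda")
    with torch.no_grad():
        model(torch.zeros((1, max_len), dtype=torch.int64, device="cuda"))
    peak = torch.cuda.max_memory_allocated() / 1024**3
    del model
    torch.cuda.empty_cache()
    return peak

print("hub:      ", peak_memory_gib("meta-llama/Meta-Llama-3-8B", config))
print("converted:", peak_memory_gib("weights/Meta-Llama-3-8B-HF", config))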

@matthewclso matthewclso closed this as not planned on Jun 26, 2024