CUDA RuntimeError: Unspecified Launch Failure during Training #30913

Closed
2 of 4 tasks
Hongjie1Chu opened this issue May 20, 2024 · 16 comments

@Hongjie1Chu

System Info

  • transformers version: 4.41.0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when passing a custom device_map? The map I wrote differs from the auto-generated map only in the device order. Why does this cause an error? Does the device order affect the execution results?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help="the name of the model", default='Llama2')
    parser.add_argument('--bs', type=int, help="the batch size", default=4)

    args = parser.parse_args()

    # Step 1: Define the model
    tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    device_map = {
        'model.embed_tokens': 6,
        'model.layers.0': 6,
        'model.layers.1': 4,
        'model.layers.2': 1,
        'model.layers.3': 1,
        'model.layers.4': 1,
        'model.layers.5': 0,
        'model.layers.6': 0,
        'model.layers.7': 0,
        'model.layers.8': 0,
        'model.layers.9': 0,
        'model.layers.10': 6,
        'model.layers.11': 5,
        'model.layers.12': 5,
        'model.layers.13': 5,
        'model.layers.14': 5,
        'model.layers.15': 5,
        'model.layers.16': 4,
        'model.layers.17': 4,
        'model.layers.18': 4,
        'model.layers.19': 4,
        'model.layers.20': 3,
        'model.layers.21': 3,
        'model.layers.22': 3,
        'model.layers.23': 3,
        'model.layers.24': 3,
        'model.layers.25': 2,
        'model.layers.26': 2,
        'model.layers.27': 2,
        'model.layers.28': 2,
        'model.layers.29': 2,
        'model.layers.30': 1,
        'model.layers.31': 1,
        "model.norm.weight": 1,
        "lm_head": 6,
    }

    model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

    print(model)
    print(model.hf_device_map)

    print("gpt start train")

    # Step 4: Load the dataset
    data_files = {
        'train': '/mnt/glue_mrpc/train.jsonl',
        'test': '/mnt/glue_mrpc/test.jsonl',
        'validation': '/mnt/glue_mrpc/validation.jsonl'
    }
    raw_datasets = load_dataset('json', data_files=data_files)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Step 5: Train the model
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=args.bs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    print('start train')
    trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.
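
As a reference point (not from the original report), one way to see exactly how a hand-written map differs from the one accelerate would choose is to infer the automatic map on an empty copy of the model and compare the two. A minimal sketch, assuming accelerate is installed and the Llama-style checkpoint above (so LlamaDecoderLayer is the no-split class):

from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('FlagAlpha/Atom-7B-Chat')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Keep each decoder layer on a single device, as device_map="auto" would.
auto_map = infer_auto_device_map(empty_model, no_split_module_classes=["LlamaDecoderLayer"])
print(auto_map)  # compare these keys and device ids against the hand-written map above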

@Hongjie1Chu
Author

And when I set:
device_map["model.embed_tokens"] = 0
device_map["model.norm.weight"] = 0

it does not error at the start, but it errors later during training:
[screenshot of the error]

@younesbelkada
Contributor

Hi @Hongjie1Chu!
In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or python xxx.py?
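
Side note (not from the thread): CUDA_LAUNCH_BLOCKING only takes effect if it is set before the CUDA context is created, so export it in the shell before launching, or set it at the very top of the script. A minimal sketch of the in-script variant:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # forces synchronous kernel launches so the traceback points at the failing op

import torch  # import torch only after the variable is set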

@Sharan1712

I too am facing a similar issue.
I haven't made any changes to my code, but all of a sudden it gives this error after training for about 30 steps.

@Sharan1712

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0, and it is working fine now.

@Hongjie1Chu
Author

Thanks for your answer!

@Sharan1712

Has there been a solution for this yet? I tried using the latest version of transformers and it still gives this error. I want to use some of the new quantization methods.

@younesbelkada
Contributor

Hi!
It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?

@tlangfor

I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge).

When I rerun with CUDA_LAUNCH_BLOCKING=1 I get:

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 20
      5 trainer = SFTTrainer(
      6     model=model,
      7     train_dataset=full_doc_dataset,
   (...)
     15     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer)  # Pass tokenizer here
     16 )
     18 model = accelerator.prepare(model)
---> 20 trainer.train()

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
    437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    438     self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
    442 # After training we make sure to retrieve back the original forward pass method
    443 # for the embedding layer by removing the forward post hook.
    444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
   3238     loss = self.compute_loss(model, inputs)
   3240 del inputs
-> 3241 torch.cuda.empty_cache()
   3243 if self.args.n_gpu > 1:
   3244     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
    151 r"""Release all unoccupied cached memory currently held by the caching
    152 allocator so that those can be used in other GPU application and visible in
    153 `nvidia-smi`.
   (...)
    159     more details about GPU memory management.
    160 """
    161 if is_initialized():
--> 162     torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@amyeroberts
Collaborator

cc @BenjaminBossan Are you the best person to ping for PEFT now?

@BenjaminBossan
Member

Hmm, I don't see how this is PEFT-related; there is no PEFT code being used here. Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it's not caused by transformers?

@amyeroberts
Collaborator

@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :)

Re SFTTrainer, perhaps @SunMarc is the best person here?

huggingface deleted a comment from github-actions bot Jul 26, 2024
huggingface deleted a comment from github-actions bot Aug 20, 2024
@amyeroberts
Collaborator

Gentle ping @SunMarc

@MekkCyber
Contributor

Hi @Hongjie1Chu, I tried running your code with the current transformers & accelerate versions, but I run into this error:

 File "~/miniconda3/envs/dev/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 211, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:6! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Can you try from your side?
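
For reference, a quick way to check for this kind of mismatch is to list every submodule whose parameters or buffers were left on cpu after loading. A minimal sketch, assuming model is the model loaded by the script above (an uncovered entry in a hand-written device_map is one plausible cause, not a confirmed diagnosis):

# Print any submodule that still has parameters or buffers on cpu after dispatch.
for name, module in model.named_modules():
    devices = {p.device for p in module.parameters(recurse=False)}
    devices |= {b.device for b in module.buffers(recurse=False)}
    if any(d.type == "cpu" for d in devices):
        print(name, sorted(str(d) for d in devices))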

@ArthurZucker
Collaborator

I think #33742 should fix it

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Nov 6, 2024