CUDA RuntimeError: Unspecified Launch Failure during Training #30913

Closed
2 of 4 tasks
Hongjie1Chu opened this issue May 20, 2024 · 16 comments

@Hongjie1Chu

System Info

  • transformers version: 4.41.0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when passing a custom device_map? The map I wrote differs from the auto-generated map only in the device order. Why does this cause an error? Does the device order affect the execution results?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help="the name of the model", default='Llama2')
    parser.add_argument('--bs', type=int, help="the batch size", default=4)

    args = parser.parse_args()

    # Step 1: Define the model
    tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    device_map = {
        'model.embed_tokens': 6,
        'model.layers.0': 6,
        'model.layers.1': 4,
        'model.layers.2': 1,
        'model.layers.3': 1,
        'model.layers.4': 1,
        'model.layers.5': 0,
        'model.layers.6': 0,
        'model.layers.7': 0,
        'model.layers.8': 0,
        'model.layers.9': 0,
        'model.layers.10': 6,
        'model.layers.11': 5,
        'model.layers.12': 5,
        'model.layers.13': 5,
        'model.layers.14': 5,
        'model.layers.15': 5,
        'model.layers.16': 4,
        'model.layers.17': 4,
        'model.layers.18': 4,
        'model.layers.19': 4,
        'model.layers.20': 3,
        'model.layers.21': 3,
        'model.layers.22': 3,
        'model.layers.23': 3,
        'model.layers.24': 3,
        'model.layers.25': 2,
        'model.layers.26': 2,
        'model.layers.27': 2,
        'model.layers.28': 2,
        'model.layers.29': 2,
        'model.layers.30': 1,
        'model.layers.31': 1,
        "model.norm.weight": 1,
        "lm_head": 6,
    }

    model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

    print(model)
    print(model.hf_device_map)

    print("gpt start train")

    # Step 4: Load the dataset
    data_files = {
        'train': '/mnt/glue_mrpc/train.jsonl',
        'test': '/mnt/glue_mrpc/test.jsonl',
        'validation': '/mnt/glue_mrpc/validation.jsonl'
    }
    raw_datasets = load_dataset('json', data_files=data_files)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Step 5: Train the model
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=args.bs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    print('start train')
    trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.
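
As a reference point (not from the original report), one way to see exactly how a hand-written map differs from the one accelerate would choose is to infer the automatic map on an empty copy of the model and compare the two. A minimal sketch, assuming accelerate is installed and the Llama-style checkpoint above (so LlamaDecoderLayer is the no-split class):

from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('FlagAlpha/Atom-7B-Chat')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Keep each decoder layer on a single device, as device_map="auto" would.
auto_map = infer_auto_device_map(empty_model, no_split_module_classes=["LlamaDecoderLayer"])
print(auto_map)  # compare these keys and device ids against the hand-written map above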

@Hongjie1Chu
Author

And when I set:
device_map["model.embed_tokens"] = 0
device_map["model.norm.weight"] = 0

it does not error at the start, but it errors later during training:
[screenshot of the error]

@younesbelkada
Contributor

Hi @Hongjie1Chu!
In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or python xxx.py?
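
Side note (not from the thread): CUDA_LAUNCH_BLOCKING only takes effect if it is set before the CUDA context is created, so export it in the shell before launching, or set it at the very top of the script. A minimal sketch of the in-script variant:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # forces synchronous kernel launches so the traceback points at the failing op

import torch  # import torch only after the variable is set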

@Sharan1712

I too am facing a similar issue.
I haven't made any changes to my code, but all of a sudden it gives this error after training for about 30 steps.

@Sharan1712

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0, and it is working fine now.

@Hongjie1Chu
Author

Thanks for your answer!

@Sharan1712

Has there been a solution for this yet? I tried using the latest version of transformers and it still gives this error. I want to use some of the new quantization methods.

@younesbelkada
Contributor

Hi!
It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?

@tlangfor

I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge).

When I rerun with CUDA_LAUNCH_BLOCKING=1 I get:

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 20
      5 trainer = SFTTrainer(
      6     model=model,
      7     train_dataset=full_doc_dataset,
   (...)
     15     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer)  # Pass tokenizer here
     16 )
     18 model = accelerator.prepare(model)
---> 20 trainer.train()

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
    437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    438     self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
    442 # After training we make sure to retrieve back the original forward pass method
    443 # for the embedding layer by removing the forward post hook.
    444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
   3238     loss = self.compute_loss(model, inputs)
   3240 del inputs
-> 3241 torch.cuda.empty_cache()
   3243 if self.args.n_gpu > 1:
   3244     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
    151 r"""Release all unoccupied cached memory currently held by the caching
    152 allocator so that those can be used in other GPU application and visible in
    153 `nvidia-smi`.
   (...)
    159     more details about GPU memory management.
    160 """
    161 if is_initialized():
--> 162     torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@amyeroberts
Collaborator

cc @BenjaminBossan Are you the best person to ping for PEFT now?

@BenjaminBossan
Member

Hmm, I don't see how this is PEFT-related; there is no PEFT code being used here. Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it's not caused by transformers?

@amyeroberts
Collaborator

@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :)

Re SFTTrainer, perhaps @SunMarc is the best person here?

huggingface deleted a comment from github-actions bot Jul 26, 2024
huggingface deleted a comment from github-actions bot Aug 20, 2024
@amyeroberts
Collaborator

Gentle ping @SunMarc

@MekkCyber
Contributor

Hi @Hongjie1Chu, I tried running your code with the current transformers & accelerate versions, but I run into this error:

 File "~/miniconda3/envs/dev/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 211, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:6! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Can you try from your side?
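
For reference, a quick way to check for this kind of mismatch is to list every submodule whose parameters or buffers were left on cpu after loading. A minimal sketch, assuming model is the model loaded by the script above (an uncovered entry in a hand-written device_map is one plausible cause, not a confirmed diagnosis):

# Print any submodule that still has parameters or buffers on cpu after dispatch.
for name, module in model.named_modules():
    devices = {p.device for p in module.parameters(recurse=False)}
    devices |= {b.device for b in module.buffers(recurse=False)}
    if any(d.type == "cpu" for d in devices):
        print(name, sorted(str(d) for d in devices))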

@ArthurZucker
Collaborator

I think #33742 should fix it

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Nov 6, 2024