CUDA RuntimeError: Unspecified Launch Failure during Training #30913
Comments
Hi @Hongjie1Chu!
I too am facing a similar issue.
Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0 and it is working fine now.
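A quick way to double-check which versions the environment actually resolves after such a downgrade (not from the thread, just a sanity check; the pins shown are the ones mentioned above, e.g. pip install "peft==0.10.0" "transformers==4.39.0"):

# Prints the installed versions of the packages involved in this issue.
import importlib.metadata as md

for pkg in ("peft", "transformers", "accelerate", "torch"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")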
Thanks for your answer!
Has there been a solution for this yet? I tried using the latest version of transformers and it still gave this issue. I want to use some of the new quantization methods.
Hi!
I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge). When I rerun, I get the following traceback:
RuntimeError Traceback (most recent call last)
Cell In[16], line 20
5 trainer = SFTTrainer(
6 model=model,
7 train_dataset=full_doc_dataset,
(...)
15 compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer) # Pass tokenizer here
16 )
18 model = accelerator.prepare(model)
---> 20 trainer.train()
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
438 self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
442 # After training we make sure to retrieve back the original forward pass method
443 # for the embedding layer by removing the forward post hook.
444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1883 hf_hub_utils.enable_progress_bars()
1884 else:
-> 1885 return inner_training_loop(
1886 args=args,
1887 resume_from_checkpoint=resume_from_checkpoint,
1888 trial=trial,
1889 ignore_keys_for_eval=ignore_keys_for_eval,
1890 )
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2213 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2215 with self.accelerator.accumulate(model):
-> 2216 tr_loss_step = self.training_step(model, inputs)
2218 if (
2219 args.logging_nan_inf_filter
2220 and not is_torch_xla_available()
2221 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2222 ):
2223 # if loss is nan or inf simply add the average of previous logged losses
2224 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
3238 loss = self.compute_loss(model, inputs)
3240 del inputs
-> 3241 torch.cuda.empty_cache()
3243 if self.args.n_gpu > 1:
3244 loss = loss.mean() # mean() to average on multi-gpu parallel training
File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
151 r"""Release all unoccupied cached memory currently held by the caching
152 allocator so that those can be used in other GPU application and visible in
153 `nvidia-smi`.
(...)
159 more details about GPU memory management.
160 """
161 if is_initialized():
--> 162 torch._C._cuda_emptyCache()
RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
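Since the failure surfaces at torch.cuda.empty_cache() rather than at the kernel that actually failed, one general way to localize it (a debugging sketch, not something proposed in the thread; the trainer setup is elided) is to force synchronous CUDA launches so the error is raised at the offending call:

# Debugging sketch: set CUDA_LAUNCH_BLOCKING before torch initializes CUDA
# (setting it at the very top of the training script is the safe option).
# With synchronous launches, "unspecified launch failure" is reported at the
# kernel that caused it instead of at a later call like empty_cache().
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the environment variable is set

# ... build the model, SFTTrainer, etc. as in the reproduction below, then:
# trainer.train()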
cc @BenjaminBossan Are you the best person to ping for PEFT now?
Hmm, I don't see how this is PEFT-related; there is no PEFT code being used. Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it isn't caused by transformers?
@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :) Re SFTTrainer, perhaps @SunMarc is the best person here?
Gentle ping @SunMarc
Hi @Hongjie1Chu, I tried running your code with the current transformers & accelerate versions, but I run into the error:
Can you try from your side?
I think #33742 should fix it.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.41.0

Who can help?
@ArthurZucker @younesbelkada @muellerzr
Why does this error occur when passing a custom device_map? The map I wrote only differs from the auto-generated map in device order. Why does this cause an error? Does the device order affect the execution results?
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset
def compute_metrics(eval_preds):
metric = load_metric("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
def tokenize_function(example):
return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
parser.add_argument('--modelName', type=str, help="the name of model", default='Llama2')
parser.add_argument('--bs', type=int, help="the name of bs", default=4)
Expected behavior
I want to know if the device order in the device_map affects the results.
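For context, here is a minimal sketch of the two ways a device_map can be supplied, which is the difference being asked about. The model id, memory limits, and the commented custom map are placeholders rather than the reporter's actual configuration, and whether reordering the devices is safe is exactly the open question here:

# Sketch: auto-generated vs. hand-written device_map.
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder (gated; needs HF auth)

# 1) Let accelerate compute a device_map from an empty (meta) model.
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)
auto_map = infer_auto_device_map(
    empty_model, max_memory={0: "10GiB", 1: "10GiB"}  # adjust to your GPUs
)
print(auto_map)

# 2) A hand-written map uses the same module names but may assign them to
#    devices in a different order, e.g. (hypothetical):
# custom_map = {"model.embed_tokens": 1, "model.layers.0": 1, ..., "lm_head": 0}

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=auto_map,        # or device_map=custom_map
    torch_dtype=torch.float16,
)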