
Multi-node fine-tuning getting RuntimeError: CUDA error: invalid device ordinal #90

Closed
1 of 2 tasks
qiuosier opened this issue Aug 3, 2023 · 4 comments


qiuosier commented Aug 3, 2023

System Info

PyTorch version: 2.0.1
CUDA used to build PyTorch: 11.7
GCC version: (Anaconda gcc) 11.2.0
Libc version: glibc-2.17
Python platform: Linux-5.4.17-2136.319.1.3.el7uek.x86_64-x86_64-with-glibc2.17
Python version: 3.9.16 (main, May 15 2023, 23:46:34) [GCC 11.2.0] (64-bit runtime)
CUDA_MODULE_LOADING set to: LAZY
CUDA runtime version: 11.7.99
Is CUDA available: True
Nvidia driver version: 510.108.03

Running on 3 nodes, each node has 2 A10 GPUs.

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

This line, torch.cuda.set_device(rank), should use local_rank instead of rank. Otherwise rank is an invalid device ordinal on every node except the first one (the only node where local_rank == rank).

When running on a single node, local_rank is the same as rank. However, when running on multiple nodes, rank ranges from zero to the total number of GPUs across all nodes minus one, so torch.cuda.set_device(rank) hits this error whenever rank is greater than or equal to the number of GPUs on the particular node.
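
For context, here is a minimal sketch (not the llama-recipes code) of how torchrun exposes the two ranks and why set_device needs the local one, assuming a torchrun launch across 3 nodes with 2 GPUs each:

# Minimal sketch, assuming a torchrun launch, which sets RANK, LOCAL_RANK and
# WORLD_SIZE in every worker's environment. Not the llama-recipes code.
import os

import torch
import torch.distributed as dist

def setup_device():
    rank = int(os.environ["RANK"])              # global rank: 0..5 for 3 nodes x 2 GPUs
    local_rank = int(os.environ["LOCAL_RANK"])  # per-node rank: 0 or 1 on every node

    dist.init_process_group("nccl")

    # torch.cuda.set_device(rank) would raise "invalid device ordinal" on the
    # second and third nodes, where rank is 2..5 but only devices 0 and 1 exist.
    torch.cuda.set_device(local_rank)
    return rank, local_rank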

Also, in this line, evaluation() should be called with local_rank when available:

eval_ppl, eval_epoch_loss = evaluation(model, train_config, eval_dataloader, local_rank if local_rank is not None else rank, tokenizer)
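
Note that the fallback uses an explicit None check rather than a truthiness test, since local_rank is 0 on the first GPU of every node. A tiny illustration (values hypothetical, for the first GPU of the second node with 2 GPUs per node):

# Hypothetical values: first GPU of the second node.
rank, local_rank = 2, 0

print(local_rank if local_rank else rank)               # 2 -- falsy 0 falls back to the global rank
print(local_rank if local_rank is not None else rank)   # 0 -- the intended device index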

The error for multi-node training goes away after I made the above changes.

The command I used to start the fine tuning:

torchrun llama_finetuning.py --enable_fsdp --use_peft --peft_method lora --pure_bf16 --model_name /home/user/llama --output_dir /home/user/outputs

Error logs

Traceback (most recent call last):
  File "/home/datascience/decompressed_artifact/code/llama_finetuning.py", line 237, in <module>
    fire.Fire(main)
  File "/home/datascience/conda/pytorch20_p39_gpu_v1/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/datascience/conda/pytorch20_p39_gpu_v1/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/datascience/conda/pytorch20_p39_gpu_v1/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/datascience/decompressed_artifact/code/llama_finetuning.py", line 86, in main
    torch.cuda.set_device(rank)
  File "/home/datascience/conda/pytorch20_p39_gpu_v1/lib/python3.9/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)

RuntimeError: CUDA error: invalid device ordinal

Expected behavior

There should be no "invalid device ordinal" error.

lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue Aug 3, 2023

qiuosier commented Aug 3, 2023

@lchu-ibm Thank you for putting in the fix.
Actually, I found another line that also needs to be fixed: evaluation() should be called with local_rank when available.

eval_ppl, eval_epoch_loss = evaluation(model, train_config, eval_dataloader, local_rank if local_rank is not None else rank, tokenizer)

Could you take a look at this also?


lchu-ibm commented Aug 3, 2023

@qiuosier Good catch! Yeah, I think that needs to be fixed as well.

cc @HamidShojanazeri

lchu-ibm added a commit to lchu-ibm/llama-recipes that referenced this issue Aug 3, 2023

lchu-ibm commented Aug 3, 2023

I can confirm I am able to run end-to-end with these two fixes on the ranks. cc @HamidShojanazeri

HamidShojanazeri commented

Thanks @qiuosier and @lchu-ibm, I will close this issue as it has been addressed in this PR.
