/content/llama# torchrun --nproc_per_node 2 example.py --ckpt_dir /content/drive/MyDrive/models/13B --tokenizer_path /content/drive/MyDrive/models/tokenizer.model
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
initializing model parallel with size 2
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "example.py", line 72, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example.py", line 58, in main
    local_rank, world_size = setup_model_parallel()
  File "example.py", line 25, in setup_model_parallel
    torch.cuda.set_device(local_rank)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Loading
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2077 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2078) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-03-02_13:56:42
  host      : 5fbe06fc63ef
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2078)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Had the same problem and found out that you need at least as many GPUs in your system as the number passed to --nproc_per_node, so in your case you need at least 2 GPUs.
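For anyone hitting this on a single-GPU Colab runtime, here is a minimal sketch (not the repo's actual code; the guard and its error message are only illustrative) of why rank 1 dies with "invalid device ordinal": torchrun starts --nproc_per_node workers and gives each one a LOCAL_RANK of 0, 1, ..., and torch.cuda.set_device(local_rank) fails as soon as that index does not correspond to a visible GPU.

```python
# Hypothetical guard, not part of example.py: fail with a clear message
# before torch.cuda.set_device() raises "invalid device ordinal".
import os

import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun per worker
num_gpus = torch.cuda.device_count()

if local_rank >= num_gpus:
    # With --nproc_per_node 2 on a 1-GPU machine, rank 1 ends up here.
    raise RuntimeError(
        f"LOCAL_RANK={local_rank}, but only {num_gpus} GPU(s) are visible; "
        f"launch with --nproc_per_node <= {num_gpus} or use a machine with more GPUs."
    )

torch.cuda.set_device(local_rank)
```

Checking torch.cuda.device_count() (or nvidia-smi) before launching tells you how many workers the machine can actually host.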
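About the "error_file: <N/A>" and traceback hint at the end of the log: torchelastic only captures the worker's Python traceback when the entrypoint is wrapped with its record decorator (documented at the linked elastic/errors page). A minimal sketch, assuming example.py's main takes the ckpt_dir and tokenizer_path arguments shown in the command:

```python
# Sketch: wrap the entrypoint with torchelastic's error recorder so the
# child's traceback is written to the error file instead of "<N/A>".
import fire
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main(ckpt_dir: str, tokenizer_path: str):
    ...  # existing body of example.py's main() would go here


if __name__ == "__main__":
    fire.Fire(main)
```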