Getting the below error:
torchrun --nproc_per_node=1 train.py --logdir=logs/sample/toy_example --config=projects/neuralangelo/configs/custom/toy_example.yaml --show_pbar
Traceback (most recent call last):
  File "train.py", line 104, in <module>
    main()
  File "train.py", line 46, in main
    set_affinity(args.local_rank)
  File "/data/imaginaire/utils/gpu_affinity.py", line 74, in set_affinity
    os.sched_setaffinity(0, dev.get_cpu_affinity())
  File "/data/imaginaire/utils/gpu_affinity.py", line 50, in get_cpu_affinity
    for j in pynvml.nvmlDeviceGetCpuAffinity(self.handle, Device._nvml_affinity_elements):
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 1745, in nvmlDeviceGetCpuAffinity
    _nvmlCheckReturn(ret)
  File "/usr/local/lib/python3.8/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 442) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+fe05266', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_16:29:36
  host      : c7c816135a1c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 442)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Running on Windows 11 with an RTX 4090, from WSL Ubuntu 22.04.2, with the --gpus all flag.
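For reference, the failing NVML call can be reproduced outside the training script. A minimal sketch, assuming pynvml is importable in the same environment (the mask sizing mirrors what gpu_affinity.py does):

```python
import math
import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# One 64-bit mask element per 64 logical CPUs, as in gpu_affinity.py.
cpu_set_size = math.ceil(os.cpu_count() / 64)
try:
    mask = pynvml.nvmlDeviceGetCpuAffinity(handle, cpu_set_size)
    print("CPU affinity masks:", [hex(m) for m in mask])
except pynvml.NVMLError as e:
    # On WSL this raises NVMLError_NotSupported, matching the traceback above.
    print("CPU affinity query failed:", e)
pynvml.nvmlShutdown()
```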
Hi @mitdave95, could you try commenting out this line? This is an optional function that sets the processor affinity. If this resolves your issue, I can push a hotfix. Thanks!
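If editing the code is an option, here is a minimal sketch of that workaround, guarding the call instead of deleting it. It assumes the import path imaginaire.utils.gpu_affinity implied by the traceback; the try/except guard is an illustration, not the repository's current code:

```python
import os

import pynvml
from imaginaire.utils.gpu_affinity import set_affinity

# torchrun exports LOCAL_RANK; train.py passes args.local_rank instead.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
try:
    # Optional optimization: pin the process to the CPUs nearest its GPU.
    set_affinity(local_rank)
except pynvml.NVMLError:
    # NVML cannot report CPU affinity on this platform (e.g. WSL);
    # affinity pinning is optional, so continue without it.
    print("set_affinity not supported on this platform; skipping.")
```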