
【Training stage 1 (VAE-GAN)】Cannot train with the data I've prepared. #3

Closed · clown6613 opened this issue Dec 16, 2021 · 5 comments

@clown6613

Hello.
I tried VAE-GAN training with my own data (256×256 px).
However, after running `./scripts/train.sh ./img_data/data1`, the following errors were displayed.

[Screenshot from 2021-12-16 20-16-50]

There are about 17,000 PNG images in the data1 folder.
I also changed `--batch` in latent_decoder_model/scripts/train.sh from 6 to 1.

The development environment is a Docker container provided by NVIDIA:
https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

The OS is Ubuntu 20.04 LTS and the GPU is an RTX 3080.
The PyTorch version specified by the repository did not work with this GPU, so I had to reinstall PyTorch:
`pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html`

I'm not a native English speaker, so my writing may be poor, but I hope you can help me find a solution.

@clown6613 (Author)

According to williamfzc/stagesepx#150, the skimage function name has changed.
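
For reference, here is a minimal sketch of the kind of rename that issue refers to, assuming the code uses skimage's SSIM (`compare_ssim` was removed in scikit-image 0.18 in favor of `skimage.metrics.structural_similarity`; the exact call site in this repo may differ):

```python
import numpy as np
# Old API (removed in scikit-image 0.18):
#   from skimage.measure import compare_ssim
# New API:
from skimage.metrics import structural_similarity

# Two dummy grayscale frames just to make the snippet runnable.
img_a = np.random.rand(256, 256)
img_b = np.random.rand(256, 256)

# data_range must be given explicitly for float inputs in recent scikit-image versions.
score = structural_similarity(img_a, img_b, data_range=1.0)
print(score)
```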

@clown6613 (Author)

After fixing that error, I faced the following one.
[Screenshot from 2021-12-20 20-37-55]

So I changed `torch.cuda.set_device(args.local_rank)` in main.py to `torch.cuda.set_device(0)`.
Then I faced the following error.
[Screenshot from 2021-12-20 20-47-02]

I have no idea how to fix this error. I would appreciate any suggestions.
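
For reference, the FutureWarning in the log below suggests reading the rank from the environment rather than hardcoding the device; a minimal sketch, assuming main.py can be adapted this way:

```python
import os

import torch

# Read the local rank set by torchrun / torch.distributed.launch (with --use_env),
# falling back to 0 for a single-GPU run, instead of hardcoding set_device(0).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
```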

@clown6613 (Author)

Second screenshot (transcribed):

```
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Loading extension module fused...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Traceback (most recent call last):
  File "main.py", line 426, in <module>
    synchronize()
  File "/workspace/DriveGAN_code/latent_decoder_model/distributed.py", line 63, in synchronize
    dist.barrier()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
(The same traceback was printed by each of the four launched processes; their outputs were interleaved.)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 916) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 917)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 918)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 919)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 916)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

```

@clown6613 (Author) commented Dec 29, 2021

If you train the prepared data with a single GPU, you need to change `--nproc_per_node=4` in DriveGAN_code/latent_decoder_model/scripts/train.sh to `--nproc_per_node=1`.

Also, the prepared dataset needs the following directory structure:

DriveGAN_code/latent_decoder_model/img_data/data1/episode (e.g. 0, 1, ...), where each episode folder contains the prepared frames (e.g. 0.png, 1.png, ..., 79.png) and an info.json (which holds information such as speed, weather, etc.).
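
As a quick sanity check of that layout, something like the following can be used (a minimal sketch; the dataset path and the exact contents of info.json are assumptions based on the description above, not an official spec):

```python
import json
import os

# Hypothetical dataset root, matching the structure described above.
root = "DriveGAN_code/latent_decoder_model/img_data/data1"

for episode in sorted(os.listdir(root)):
    ep_dir = os.path.join(root, episode)
    if not os.path.isdir(ep_dir):
        continue
    frames = sorted(f for f in os.listdir(ep_dir) if f.endswith(".png"))
    info_path = os.path.join(ep_dir, "info.json")
    assert frames, f"episode {episode} has no .png frames"
    assert os.path.isfile(info_path), f"episode {episode} is missing info.json"
    with open(info_path) as f:
        info = json.load(f)  # e.g. speed, weather, ...
    print(episode, len(frames), "frames,", len(info), "info entries")
```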

clown6613 reopened this Dec 29, 2021

@clown6613 (Author)

I solved this issue. Thank you.
