
【Training stage 1 (VAE-GAN)】Cannot train with the data I've prepared. #3

Closed · clown6613 opened this issue Dec 16, 2021 · 5 comments

@clown6613

Hello.
I tried VAE-GAN training with my own data (256×256 px).
However, after running `./scripts/train.sh ./img_data/data1`, the following errors were displayed.

[Screenshot from 2021-12-16 20-16-50]

There are about 17,000 PNG images in the data1 folder.
I also changed `--batch` in latent_decoder_model/scripts/train.sh from 6 to 1.

The development environment is a Docker container provided by NVIDIA:
https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

The OS is Ubuntu 20.04 LTS and the GPU is an RTX 3080.
The PyTorch version specified by the repository did not work with this GPU, so I had to reinstall PyTorch:
`pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html`

I'm not a native English speaker, so my writing may be poor, but I hope you can help me find a solution.

@clown6613 (Author)

According to williamfzc/stagesepx#150, the skimage function name has changed.
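
For reference, here is a minimal sketch of the kind of rename that issue refers to, assuming the code uses skimage's SSIM (`compare_ssim` was removed in scikit-image 0.18 in favor of `skimage.metrics.structural_similarity`; the exact call site in this repo may differ):

```python
import numpy as np
# Old API (removed in scikit-image 0.18):
#   from skimage.measure import compare_ssim
# New API:
from skimage.metrics import structural_similarity

# Two dummy grayscale frames just to make the snippet runnable.
img_a = np.random.rand(256, 256)
img_b = np.random.rand(256, 256)

# data_range must be given explicitly for float inputs in recent scikit-image versions.
score = structural_similarity(img_a, img_b, data_range=1.0)
print(score)
```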

@clown6613 (Author)

After fixing that error, I faced the following one.
[Screenshot from 2021-12-20 20-37-55]

So I changed `torch.cuda.set_device(args.local_rank)` in main.py to `torch.cuda.set_device(0)`.
Then I faced the following error.
[Screenshot from 2021-12-20 20-47-02]

I have no idea how to fix this error. I would appreciate any suggestions.
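
For reference, the FutureWarning in the log below suggests reading the rank from the environment rather than hardcoding the device; a minimal sketch, assuming main.py can be adapted this way:

```python
import os

import torch

# Read the local rank set by torchrun / torch.distributed.launch (with --use_env),
# falling back to 0 for a single-GPU run, instead of hardcoding set_device(0).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
```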

@clown6613 (Author)

Second screenshot (transcribed):

```
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Loading extension module fused...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Using /root/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu113/fused/build.ninja...
Building extension module fused...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused...
Traceback (most recent call last):
  File "main.py", line 426, in <module>
    synchronize()
  File "/workspace/DriveGAN_code/latent_decoder_model/distributed.py", line 63, in synchronize
    dist.barrier()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2709, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
(The same traceback was printed by each of the four launched processes; their outputs were interleaved.)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 916) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 917)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 918)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 919)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2021-12-20_11:39:01
host : d2dcfed2edd3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 916)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

```

@clown6613 (Author) commented Dec 29, 2021

If you train the prepared data with a single GPU, you need to change `--nproc_per_node=4` in DriveGAN_code/latent_decoder_model/scripts/train.sh to `--nproc_per_node=1`.

Also, the prepared dataset needs the following directory structure:

DriveGAN_code/latent_decoder_model/img_data/data1/episode (e.g. 0, 1, ...), where each episode folder contains the prepared frames (e.g. 0.png, 1.png, ..., 79.png) and an info.json (which holds information such as speed, weather, etc.).
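
As a quick sanity check of that layout, something like the following can be used (a minimal sketch; the dataset path and the exact contents of info.json are assumptions based on the description above, not an official spec):

```python
import json
import os

# Hypothetical dataset root, matching the structure described above.
root = "DriveGAN_code/latent_decoder_model/img_data/data1"

for episode in sorted(os.listdir(root)):
    ep_dir = os.path.join(root, episode)
    if not os.path.isdir(ep_dir):
        continue
    frames = sorted(f for f in os.listdir(ep_dir) if f.endswith(".png"))
    info_path = os.path.join(ep_dir, "info.json")
    assert frames, f"episode {episode} has no .png frames"
    assert os.path.isfile(info_path), f"episode {episode} is missing info.json"
    with open(info_path) as f:
        info = json.load(f)  # e.g. speed, weather, ...
    print(episode, len(frames), "frames,", len(info), "info entries")
```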

clown6613 reopened this Dec 29, 2021

@clown6613 (Author)

I solved this issue. Thank you.
