CUDA Out Of Memory in Distributed Training #14

Open
YouliangHUANG opened this issue Oct 31, 2022 · 0 comments

@YouliangHUANG

I have previously trained StyleGAN2-ADA and StyleGAN3 successfully on my machine. However, distributed training for SOAT fails with a CUDA out-of-memory error. I modified the code slightly (nothing that touches the training code), used Slurm to submit the training job to the server, and confirmed that the model was successfully distributed across the GPUs. The job aborts before the first epoch completes.
My training environment:
    CPU: Intel Xeon 6348
    GPU: NVIDIA A100 40GB PCIe ×8
    Script: python -m torch.distributed.launch --nproc_per_node=8 train.py --dataset=[my dataset (grayscale, 1024x1024, converted to RGB when loading)] --batch=X --size=1024 --iter=40000
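For context, the Slurm submission wrapping this command looks roughly like the sketch below; the job name, partition, CPU count, and time limit are placeholders rather than my exact values:

    #!/bin/bash
    # Rough sketch of the sbatch script; partition, CPU, and time values are placeholders.
    #SBATCH --job-name=soat-train
    #SBATCH --nodes=1                  # single node with 8 GPUs
    #SBATCH --ntasks-per-node=1        # torch.distributed.launch spawns the 8 worker processes
    #SBATCH --gres=gpu:8               # 8x NVIDIA A100 40GB PCIe
    #SBATCH --partition=gpu            # placeholder partition name
    #SBATCH --cpus-per-task=32         # placeholder
    #SBATCH --time=72:00:00            # placeholder

    python -m torch.distributed.launch --nproc_per_node=8 train.py \
        --dataset=[my dataset] --batch=X --size=1024 --iter=40000   # X as described below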
I tried batch sizes of 64, 32, and 16, and all of them abort. Training SOAT on a single GPU with a batch size of 8 succeeds.
Looking forward to your reply and to any possible solutions.