CUDA Out Of Memory in Distributed Training #14

Open
YouliangHUANG opened this issue Oct 31, 2022 · 0 comments

@YouliangHUANG

I have previously trained StyleGAN2-ADA and StyleGAN3 successfully on my machine. However, distributed training for SOAT fails with a CUDA out-of-memory error. I modified the code slightly (nothing that touches the training code), used Slurm to submit the training job to the server, and confirmed that the model was successfully distributed across the GPUs. The job aborts before the first epoch completes.
My training environment:
    CPU: Intel Xeon 6348
    GPU: NVIDIA A100 40GB PCIe ×8
    Script: python -m torch.distributed.launch --nproc_per_node=8 train.py --dataset=[my dataset (grayscale, 1024x1024, converted to RGB when loading)] --batch=X --size=1024 --iter=40000
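For context, the Slurm submission wrapping this command looks roughly like the sketch below; the job name, partition, CPU count, and time limit are placeholders rather than my exact values:

    #!/bin/bash
    # Rough sketch of the sbatch script; partition, CPU, and time values are placeholders.
    #SBATCH --job-name=soat-train
    #SBATCH --nodes=1                  # single node with 8 GPUs
    #SBATCH --ntasks-per-node=1        # torch.distributed.launch spawns the 8 worker processes
    #SBATCH --gres=gpu:8               # 8x NVIDIA A100 40GB PCIe
    #SBATCH --partition=gpu            # placeholder partition name
    #SBATCH --cpus-per-task=32         # placeholder
    #SBATCH --time=72:00:00            # placeholder

    python -m torch.distributed.launch --nproc_per_node=8 train.py \
        --dataset=[my dataset] --batch=X --size=1024 --iter=40000   # X as described below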
I tried batch sizes of 64, 32, and 16, and all of them abort. Training SOAT on a single GPU with a batch size of 8 succeeds.
Looking forward to your reply and to any possible solutions.