-
Hello, I'm encountering freezing behavior when training on CC3M and on my own dataset. I noticed that training freezes every `num_workers` iterations, and I was wondering if anyone has encountered this before. I'm using torch 2.0.1 with CUDA 11.7.0. The problem occurs regardless of the scale of the model and regardless of the dataset format (CSV or webdataset). Here is a wandb log of my system. I tried it on two different systems and encountered the same behavior. I am using 16 V100-32GB GPUs, and I also tried different releases of open_clip. I also found this log, where training is blazingly fast (128 epochs in 26 hrs). Is this what is expected for RN50? Are there any benchmarks on the speed and the hyperparameters used? Any help or insights would be very much appreciated.
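For anyone trying to reproduce the stall pattern outside of open_clip, here is a minimal sketch (hypothetical, not from the original post): with N parallel workers each building batches independently, batches tend to arrive in bursts of N, so a slow per-item pipeline shows up as a freeze roughly every `num_workers` iterations.

```python
# Hypothetical repro: a dataset with an artificially slow __getitem__.
# With 4 workers, iterations arrive in bursts of 4 with a stall between
# bursts -- the same "freeze every num_workers iterations" signature.
import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        time.sleep(0.25)  # simulate slow storage / decode
        return torch.randn(3, 224, 224)

if __name__ == "__main__":
    loader = DataLoader(SlowDataset(), batch_size=8, num_workers=4)
    last = time.time()
    for i, batch in enumerate(loader):
        now = time.time()
        print(f"iter {i:3d}: {now - last:.2f}s since previous batch")
        last = now
        if i >= 16:
            break
```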
-
@HaniItani there is an incredibly long wait on each new pass of the dataloader in your setup. It's hard to say exactly what causes it, but this is usually caused by slow disks; even with webdataset that can be a problem if the disk is struggling to keep up. Might be worth benchmarking your storage?
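A crude way to benchmark raw read throughput, independent of PyTorch (a sketch; the shard path is a placeholder for wherever the tars actually live):

```python
# Sequential-read benchmark against the shard storage (placeholder path).
# If the reported MB/s is far below the drive's rated throughput, the
# storage (or network filesystem) is the likely bottleneck.
import glob
import time

CHUNK = 16 * 1024 * 1024  # 16 MiB reads

total_bytes = 0
start = time.time()
for path in glob.glob("/path/to/shards/*.tar"):  # placeholder location
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total_bytes += len(chunk)
elapsed = time.time() - start
print(f"{total_bytes / 1e6:.0f} MB in {elapsed:.1f}s -> "
      f"{total_bytes / 1e6 / elapsed:.1f} MB/s")
```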
-
Also worth noting that we are using persistent workers for webdataset, so that impact (slow start on each epoch transition) shouldn't be significant. The only other thing happening in that transition is saving the checkpoint, so it could be a really slow write as well...
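For reference, persistent workers are a standard DataLoader flag; a minimal sketch of the settings in question (the dataset here is a stand-in, not open_clip's actual pipeline):

```python
# persistent_workers=True keeps worker processes alive across epochs, so the
# fork/spawn and dataset re-initialization cost isn't paid at every epoch start.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,
    persistent_workers=True,  # workers survive epoch transitions
    prefetch_factor=2,        # batches each worker keeps queued ahead
    pin_memory=True,
)
```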
-
@rwightman thank you very much for your prompt response. I'm a big fan of yours and of the work you do for the open source community! I tested the code with the Synthetic dataset class, which basically shouldn't be disk R/W bottlenecked, right? Here is a wandb log of the run with synthetic data. The issue persists. I can still work on getting some storage benchmarks if you think they're needed. Other things I've tried:

- Tested the code on a setup where the data is stored on an NVMe SSD, with the CSV dataset (CC3M-scale).
- Tried small batch sizes.
- Increased the `prefetch_factor` in the dataloader.

All to no avail. Your insights are very much appreciated.
-
Any chance something is up with your shared memory? The torch dataloader uses it to exchange data between the data loading workers and the training process; usually that lives in your RAM (as /dev/shm).
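A quick way to check how much shared memory the job actually sees (a sketch; SLURM allocations sometimes get a much smaller /dev/shm than the node's total RAM would suggest):

```python
# Check /dev/shm capacity and usage from inside the job. If it is small or
# nearly full, DataLoader workers can block when handing tensors to the
# training process.
import shutil

usage = shutil.disk_usage("/dev/shm")
gib = 1024 ** 3
print(f"/dev/shm: total={usage.total / gib:.1f} GiB, "
      f"used={usage.used / gib:.1f} GiB, free={usage.free / gib:.1f} GiB")
```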
-
Maybe something you can try, to eliminate one potential source of problems: write a very basic webdataset data loader and run it over your data in a loop, doing nothing but benchmarking the speed (for example with tqdm).
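Something along these lines (a sketch, not a drop-in script; the shard pattern and field names depend on how your tars were written):

```python
# Bare-bones webdataset read benchmark (hypothetical shard pattern and keys).
# No model, no GPU: if this alone is slow, the problem is storage or decode.
import webdataset as wds
from torch.utils.data import DataLoader
from tqdm import tqdm

dataset = (
    wds.WebDataset("/path/to/shards/{00000..00331}.tar")  # placeholder
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .batched(256)
)
loader = DataLoader(dataset, batch_size=None, num_workers=4)

for batch in tqdm(loader, unit="batch"):
    pass  # iterate only; tqdm reports batches/s
```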
-
@HaniItani I thought I responded to this again, but I guess I didn't hit comment. What kind of drive are the checkpoints being saved to, and did you completely disable checkpoint saving in any of your tests? The fact that the problem happens with both webdataset and csv (especially if at similar speeds) is odd and suggests it might not be the reading; the CSV dataset should be a lot worse on slow drives or network drives.
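One way to isolate checkpoint writes (a sketch; the path is a placeholder and the tensor just approximates a checkpoint-sized blob):

```python
# Time a checkpoint-sized write to the same filesystem the checkpoints use.
# If this takes many seconds, the per-epoch stall is likely the save itself.
import time
import torch

state = {"weights": torch.randn(100_000_000)}  # ~400 MB dummy payload
start = time.time()
torch.save(state, "/path/to/checkpoints/bench.pt")  # placeholder path
print(f"checkpoint-sized write took {time.time() - start:.2f}s")
```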
-
@rom1504 @rwightman thank you very much for your suggestions. I'm using a cluster with SLURM, and I usually allocate 6 CPUs/GPU and 64GB RAM/GPU. @rwightman yes, I disabled checkpoint saving for my previous tests. On my main system, the checkpoints are saved to a hard disk behind the WekaIO filesystem. I wrote a script to just iterate over the dataloader as suggested, to get some numbers: I'm getting 1.5-2 seconds/iteration from tqdm with webdataset, with one worker process and a batch size of 256, and for CSV it seems to be roughly the same. My dataset is around CC3M scale. Do you have any suggestions to mitigate this bottleneck?
-
Not many other ideas, but I'm pretty sure it's an issue with the system / hardware / environment. We've run this a LOT on slurm clusters at very significant scale, with high GPU power utilization and no significant epoch turnover delays. As a hail mary, you could try pytorch/pytorch#99625 in your Python install: `conda install 'llvm-openmp<16'` (pip install should work too), as I've run into this issue on PT 2.0.x and it kills performance by constraining all CPU activity onto 1-2 cores. Possibly worth trying an upgrade to PT 2.1 as well.
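A quick check for that failure mode (a sketch; run it inside the SLURM job, where any affinity restriction actually applies):

```python
# If the llvm-openmp issue (or a SLURM cpu-bind setting) is pinning the
# process to 1-2 cores, the affinity set printed here will be tiny even
# though the node has many CPUs.
import os
import torch

print("CPUs visible to this process:", len(os.sched_getaffinity(0)))
print("torch intra-op threads:", torch.get_num_threads())
```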
-
Okay, thank you very much for the insights and advice! I'm closing the issue as complete!
-
One other thing: if you ssh to your nodes on the cluster, especially rank 0, and run `nvidia-smi -q`, look for the power event section. If you've got low power usage and these events are active, there is a problem with the hardware, especially if the HW slowdowns are active. SW power cap and SW power slowdown aren't uncommon, especially in non-datacenter cooling environments; they're 'okay' as long as your utilization and power usage are high (but they would also be unexpected if average power draw is low).
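To poll the same information programmatically while training runs, a sketch using pynvml (the `nvidia-ml-py` package; constant names as in its public API):

```python
# Poll GPU throttle ("clocks event") reasons and power draw via NVML.
# Active HW slowdown bits alongside low power draw point at hardware/cooling.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; loop for multi-GPU
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W

flags = {
    "SW power cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "HW slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW thermal slowdown": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
}
print(f"power draw: {power_w:.0f} W")
for name, bit in flags.items():
    if reasons & bit:
        print("active:", name)
pynvml.nvmlShutdown()
```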