-
Hello, I'm encountering freezing behavior when training on CC3M and on my own dataset. I noticed that training freezes every `num_workers` iterations, and I was wondering if anyone has encountered this before. I'm using torch 2.0.1 with CUDA 11.7.0. The problem occurs regardless of the scale of the model and regardless of the dataset format (CSV or webdataset). Here is a wandb log of my system. I tried it on two different systems and encountered the same behavior. I am using 16 V100-32GB GPUs, and I also tried different releases of open_clip. I also found this log, where training is blazingly fast (128 epochs in 26 hrs). Is this what is expected for RN50? Are there any benchmarks on the speed and the hyperparameters used? Any help or insights would be very much appreciated.
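For anyone trying to reproduce the stall pattern outside of open_clip, here is a minimal sketch (hypothetical, not from the original post): with N parallel workers each building batches independently, batches tend to arrive in bursts of N, so a slow per-item pipeline shows up as a freeze roughly every `num_workers` iterations.

```python
# Hypothetical repro: a dataset with an artificially slow __getitem__.
# With 4 workers, iterations arrive in bursts of 4 with a stall between
# bursts -- the same "freeze every num_workers iterations" signature.
import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        time.sleep(0.25)  # simulate slow storage / decode
        return torch.randn(3, 224, 224)

if __name__ == "__main__":
    loader = DataLoader(SlowDataset(), batch_size=8, num_workers=4)
    last = time.time()
    for i, batch in enumerate(loader):
        now = time.time()
        print(f"iter {i:3d}: {now - last:.2f}s since previous batch")
        last = now
        if i >= 16:
            break
```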
-
@HaniItani there is an incredibly long wait on each new pass of the dataloader in your setup. It's hard to say exactly what causes it, but this is usually caused by slow disks; even with webdataset that can be a problem if the disk is struggling to keep up. Might be worth benchmarking your storage?
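A crude way to benchmark raw read throughput, independent of PyTorch (a sketch; the shard path is a placeholder for wherever the tars actually live):

```python
# Sequential-read benchmark against the shard storage (placeholder path).
# If the reported MB/s is far below the drive's rated throughput, the
# storage (or network filesystem) is the likely bottleneck.
import glob
import time

CHUNK = 16 * 1024 * 1024  # 16 MiB reads

total_bytes = 0
start = time.time()
for path in glob.glob("/path/to/shards/*.tar"):  # placeholder location
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total_bytes += len(chunk)
elapsed = time.time() - start
print(f"{total_bytes / 1e6:.0f} MB in {elapsed:.1f}s -> "
      f"{total_bytes / 1e6 / elapsed:.1f} MB/s")
```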
-
Also worth noting that we are using persistent workers for webdataset, so that impact (slow start on each epoch transition) shouldn't be significant. The only other thing happening in that transition is saving the checkpoint, so it could be a really slow write as well...
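For reference, persistent workers are a standard DataLoader flag; a minimal sketch of the settings in question (the dataset here is a stand-in, not open_clip's actual pipeline):

```python
# persistent_workers=True keeps worker processes alive across epochs, so the
# fork/spawn and dataset re-initialization cost isn't paid at every epoch start.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224))  # stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,
    persistent_workers=True,  # workers survive epoch transitions
    prefetch_factor=2,        # batches each worker keeps queued ahead
    pin_memory=True,
)
```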
-
@rwightman thank you very much for your prompt response. I'm a big fan of yours and of the work you do for the open source community! I tested the code with the Synthetic dataset class, which basically shouldn't be disk R/W bottlenecked, right? Here is a wandb log of the run with synthetic data. The issue persists. I can still work on getting some storage benchmarks if you think they're needed. Other things I've tried:

- Tested the code on a setup where the data is stored on an NVMe SSD, with the CSV dataset (CC3M-scale).
- Tried small batch sizes.
- Increased the `prefetch_factor` in the dataloader.

All to no avail. Your insights are very much appreciated.
-
Any chance something is up with your shared memory? The torch dataloader uses it to exchange data between the data loading workers and the training process; usually that lives in your RAM (as /dev/shm).
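A quick way to check how much shared memory the job actually sees (a sketch; SLURM allocations sometimes get a much smaller /dev/shm than the node's total RAM would suggest):

```python
# Check /dev/shm capacity and usage from inside the job. If it is small or
# nearly full, DataLoader workers can block when handing tensors to the
# training process.
import shutil

usage = shutil.disk_usage("/dev/shm")
gib = 1024 ** 3
print(f"/dev/shm: total={usage.total / gib:.1f} GiB, "
      f"used={usage.used / gib:.1f} GiB, free={usage.free / gib:.1f} GiB")
```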
-
Maybe something you can try, to eliminate one potential source of problems: write a very basic webdataset data loader and run it over your data in a loop, doing nothing but benchmarking the speed (for example with tqdm).
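Something along these lines (a sketch, not a drop-in script; the shard pattern and field names depend on how your tars were written):

```python
# Bare-bones webdataset read benchmark (hypothetical shard pattern and keys).
# No model, no GPU: if this alone is slow, the problem is storage or decode.
import webdataset as wds
from torch.utils.data import DataLoader
from tqdm import tqdm

dataset = (
    wds.WebDataset("/path/to/shards/{00000..00331}.tar")  # placeholder
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .batched(256)
)
loader = DataLoader(dataset, batch_size=None, num_workers=4)

for batch in tqdm(loader, unit="batch"):
    pass  # iterate only; tqdm reports batches/s
```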
-
@HaniItani I thought I responded to this again, but I guess I didn't hit comment. What kind of drive are the checkpoints being saved to, and did you completely disable checkpoint saving in any of your tests? The fact that the problem happens with both webdataset and csv (especially if at similar speeds) is odd and suggests it might not be the reading; the CSV dataset should be a lot worse on slow drives or network drives.
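One way to isolate checkpoint writes (a sketch; the path is a placeholder and the tensor just approximates a checkpoint-sized blob):

```python
# Time a checkpoint-sized write to the same filesystem the checkpoints use.
# If this takes many seconds, the per-epoch stall is likely the save itself.
import time
import torch

state = {"weights": torch.randn(100_000_000)}  # ~400 MB dummy payload
start = time.time()
torch.save(state, "/path/to/checkpoints/bench.pt")  # placeholder path
print(f"checkpoint-sized write took {time.time() - start:.2f}s")
```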
-
@rom1504 @rwightman thank you very much for your suggestions. I'm using a cluster with SLURM, and I usually allocate 6 CPUs/GPU and 64GB RAM/GPU. @rwightman yes, I disabled checkpoint saving for my previous tests. On my main system, the checkpoints are saved to a hard disk behind the WekaIO filesystem. I wrote a script to just iterate over the dataloader as suggested, to get some numbers: I'm getting 1.5-2 seconds/iteration from tqdm with webdataset, with one worker process and a batch size of 256, and for CSV it seems to be roughly the same. My dataset is around CC3M scale. Do you have any suggestions to mitigate this bottleneck?
-
Not many other ideas, but I'm pretty sure it's an issue with the system / hardware / environment. We've run this a LOT on slurm clusters at very significant scale, with high GPU power utilization and no significant epoch turnover delays. As a hail mary, you could try pytorch/pytorch#99625 in your Python install: `conda install 'llvm-openmp<16'` (pip install should work too), as I've run into this issue on PT 2.0.x and it kills performance by constraining all CPU activity onto 1-2 cores. Possibly worth trying an upgrade to PT 2.1 as well.
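A quick check for that failure mode (a sketch; run it inside the SLURM job, where any affinity restriction actually applies):

```python
# If the llvm-openmp issue (or a SLURM cpu-bind setting) is pinning the
# process to 1-2 cores, the affinity set printed here will be tiny even
# though the node has many CPUs.
import os
import torch

print("CPUs visible to this process:", len(os.sched_getaffinity(0)))
print("torch intra-op threads:", torch.get_num_threads())
```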
-
Okay, thank you very much for the insights and advice! I'm closing the issue as complete!
-
One other thing: if you ssh to your nodes on the cluster, especially rank 0, and run `nvidia-smi -q`, look for the power event section. If you've got low power usage and these events are active, there is a problem with the hardware, especially if the HW slowdowns are active. SW power cap and SW power slowdown aren't uncommon, especially in non-datacenter cooling environments; they're 'okay' as long as your utilization and power usage are high (but they would also be unexpected if average power draw is low).
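To poll the same information programmatically while training runs, a sketch using pynvml (the `nvidia-ml-py` package; constant names as in its public API):

```python
# Poll GPU throttle ("clocks event") reasons and power draw via NVML.
# Active HW slowdown bits alongside low power draw point at hardware/cooling.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; loop for multi-GPU
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W

flags = {
    "SW power cap": pynvml.nvmlClocksThrottleReasonSwPowerCap,
    "HW slowdown": pynvml.nvmlClocksThrottleReasonHwSlowdown,
    "SW thermal slowdown": pynvml.nvmlClocksThrottleReasonSwThermalSlowdown,
    "HW thermal slowdown": pynvml.nvmlClocksThrottleReasonHwThermalSlowdown,
}
print(f"power draw: {power_w:.0f} W")
for name, bit in flags.items():
    if reasons & bit:
        print("active:", name)
pynvml.nvmlShutdown()
```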