
Dynamic bucket selection rng sync #1341

Merged: 4 commits into master from dynamic-bucket-selection-rng-sync on Jun 5, 2024
Conversation

pzelasko (Collaborator)

Follow-up to #863 and #1309.

This version seems to work as intended: it consistently picks the same buckets on each DDP rank. It depends on good duration_bins initialization (i.e. they have to be estimated on the actual training data to fit the duration distribution well) and a large enough buffer_size, so that all buckets are filled enough to yield at least one mini-batch most of the time. If it hits a non-ready bucket, it retries with that bucket's neighbors.

I'm still determining what kind of speedup can be expected from this and need to add proper tests; if it turns out to be good enough, I'll probably make it the default.
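
Below is a rough, hypothetical sketch of the selection scheme described above (not the actual Lhotse implementation): every rank draws the bucket index from an identically seeded RNG advanced in lockstep, and falls back to neighboring buckets in a deterministic order when the drawn bucket cannot yield a batch yet. The Bucket.is_ready() helper is an assumed placeholder.

    import random

    def pick_ready_bucket(buckets, rng: random.Random):
        # Every DDP rank constructs `rng` with the same seed and advances it in
        # lockstep, so all ranks draw the same bucket index on each step.
        idx = rng.randrange(len(buckets))
        # Probe the drawn bucket first, then its neighbors, in a fixed order;
        # this keeps the ranks in sync even when the chosen bucket is not ready yet.
        order = [idx]
        for offset in range(1, len(buckets)):
            order += [idx - offset, idx + offset]
        for cand in order:
            if 0 <= cand < len(buckets) and buckets[cand].is_ready():  # is_ready(): assumed helper
                return buckets[cand]
        return None  # no bucket can yield a batch yet; the caller would buffer more data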

@pzelasko (Collaborator, Author)

@lifeiteng care to try this one?

@pzelasko pzelasko marked this pull request as ready for review May 22, 2024 21:26
@kobenaxie

I tried this with the config below, and training on LibriSpeech took the same time (21 minutes/epoch):

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=200,
        shuffle=True,
        num_buckets=30,
        buffer_size=10000,
        shuffle_buffer_size=10000,
        drop_last=True,
        rank=0,
        world_size=1,
    )

    dataset = IterableDatasetWrapper(dataset=dataset, sampler=sampler)

    dloader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=None,
        persistent_workers=False,
        num_workers=8,
        pin_memory=False,
        worker_init_fn=make_worker_init_fn(
            rank=global_rank,
            world_size=world_size,
        ),
        prefetch_factor=40,
    )

@pzelasko (Collaborator, Author)

Interesting. How many GPUs? Can you also try increasing the buffer size to 50k? Otherwise maybe the batch duration is too low to notice a difference.

I observed a 10% speedup on a 2 GPU setup but need to investigate further.

@kobenaxie

8 A100 GPUs.
I will try increasing the buffer size to 50k and report the result tomorrow.

@pzelasko (Collaborator, Author)

Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).
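
For context, quadratic_duration penalizes long cuts when counting them toward max_duration, so batches of long utterances come out smaller. A small illustration of the assumed formula (taken to be duration + duration²/quadratic_duration, as in Lhotse's time constraint; worth verifying against your Lhotse version):

    from typing import Optional

    def effective_duration(duration: float, quadratic_duration: Optional[float]) -> float:
        # Assumed cost that a single cut contributes toward max_duration.
        if quadratic_duration is None:
            return duration
        return duration + duration ** 2 / quadratic_duration

    # With quadratic_duration=15, a 30 s cut counts as 30 + 900/15 = 90 s,
    # so long utterances fill up the batch budget faster than short ones.
    print(effective_duration(30.0, 15.0))  # -> 90.0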

@kobenaxie

> Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).

FP32 now

@kobenaxie

> Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=300,
        shuffle=True,
        num_buckets=30,
        buffer_size=50000,
        shuffle_buffer_size=10000,
        quadratic_duration=15,
        drop_last=True,
        rank=0,
        world_size=1,
    )

It took 31 minutes per epoch with this config.

@pzelasko (Collaborator, Author)

Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, since it has to read more data into memory. Try precomputing the duration bins by running estimate_duration_bins on your data and passing the output to the sampler's duration_bins argument. You can then revert buffer_size to the original setting and compare with and without synced buffers again.
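
A minimal sketch of that suggestion, under a few assumptions: that estimate_duration_bins is importable from lhotse.dataset.sampling.dynamic_bucketing, that bucket-selection sync is toggled with a sync_buffers flag (the flag name is inferred from this thread), and a made-up manifest path. Check the details against your Lhotse version.

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler
    from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_bins

    cuts = CutSet.from_file("data/train_cuts.jsonl.gz")  # hypothetical path

    # Estimate bin edges once on the actual training data (can be done offline),
    # then pass them in so the sampler skips on-the-fly estimation at startup.
    duration_bins = estimate_duration_bins(cuts, num_buckets=30)

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=300,
        num_buckets=30,
        duration_bins=duration_bins,
        buffer_size=10000,       # back to the original buffer size
        shuffle=True,
        sync_buffers=True,       # flag name assumed from this discussion
    )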

@pzelasko (Collaborator, Author)

Just pushed a version that is better tested and supports both map-style and iterable-style datasets.

@kobenaxie

> Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, since it has to read more data into memory. Try precomputing the duration bins by running estimate_duration_bins on your data and passing the output to the sampler's duration_bins argument. You can then revert buffer_size to the original setting and compare with and without synced buffers again.

max_duration  bucket_buffer_size  sync_buffer  mins/epoch  estimate_duration_bins
200           10k                 False        21          False
200           10k                 True         21          False
200           10k                 False        21          True
200           10k                 True         21          True
300           50k                 False        18          False
300           50k                 True         17          False
300           50k                 False        16          True
300           50k                 True         16          True

@pzelasko (Collaborator, Author) commented Jun 5, 2024

I've tested this change more thoroughly and I'm now confident it helps with training speed. When training a NeMo FastConformer RNNT+CTC ASR model on a ~20k-hour dataset with 16 GPUs, I observed a 13% faster training step time when bucket selection is synchronized, everything else being equal. The validation WER of both runs is very close at the same number of steps, but convergence is actually a bit quicker when we consider validation WER vs. total training time.

In a separate experiment with 2 GPUs I observed an 8% speedup. I expect the speedup to grow with the size of the distributed training setup, as the probability of hitting the slowest bucket on each training step grows with the number of GPUs. The speedup is also likely model-dependent (the bigger the variance of processing time across sequence-length buckets, the greater the speedup).

[Image attached to the original comment.]

@pzelasko pzelasko merged commit cf6cde8 into master Jun 5, 2024
10 of 11 checks passed
@pzelasko pzelasko deleted the dynamic-bucket-selection-rng-sync branch June 5, 2024 18:58
@pzelasko pzelasko modified the milestone: v1.24.0 Jun 5, 2024