
Dynamic bucket selection rng sync #1341

Merged: 4 commits into master from dynamic-bucket-selection-rng-sync on Jun 5, 2024
Conversation

pzelasko (Collaborator)

Follow-up to #863 and #1309.

This version seems to work as intended: it consistently picks the same buckets on each DDP rank. It depends on good duration_bins initialization (i.e. they have to be estimated on the actual training data to fit the duration distribution well) and a large enough buffer_size, so that all buckets are filled enough to yield at least one mini-batch most of the time. If it hits a non-ready bucket, it retries with that bucket's neighbors.

I'm still determining what kind of speedup can be expected from this and need to add proper tests; if it turns out to be good enough, I'll probably make it the default.
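
Below is a rough, hypothetical sketch of the selection scheme described above (not the actual Lhotse implementation): every rank draws the bucket index from an identically seeded RNG advanced in lockstep, and falls back to neighboring buckets in a deterministic order when the drawn bucket cannot yield a batch yet. The Bucket.is_ready() helper is an assumed placeholder.

    import random

    def pick_ready_bucket(buckets, rng: random.Random):
        # Every DDP rank constructs `rng` with the same seed and advances it in
        # lockstep, so all ranks draw the same bucket index on each step.
        idx = rng.randrange(len(buckets))
        # Probe the drawn bucket first, then its neighbors, in a fixed order;
        # this keeps the ranks in sync even when the chosen bucket is not ready yet.
        order = [idx]
        for offset in range(1, len(buckets)):
            order += [idx - offset, idx + offset]
        for cand in order:
            if 0 <= cand < len(buckets) and buckets[cand].is_ready():  # is_ready(): assumed helper
                return buckets[cand]
        return None  # no bucket can yield a batch yet; the caller would buffer more data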

@pzelasko (Collaborator, Author)

@lifeiteng care to try this one?

@pzelasko pzelasko marked this pull request as ready for review May 22, 2024 21:26
@kobenaxie

I tried this with the config below, and training on LibriSpeech took the same time (21 minutes/epoch):

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=200,
        shuffle=True,
        num_buckets=30,
        buffer_size=10000,
        shuffle_buffer_size=10000,
        drop_last=True,
        rank=0,
        world_size=1,
    )

    dataset = IterableDatasetWrapper(dataset=dataset, sampler=sampler)

    dloader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=None,
        persistent_workers=False,
        num_workers=8,
        pin_memory=False,
        worker_init_fn=make_worker_init_fn(
            rank=global_rank,
            world_size=world_size,
        ),
        prefetch_factor=40,
    )

@pzelasko (Collaborator, Author)

Interesting. How many GPUs? Can you also try increasing the buffer size to 50k? Otherwise maybe the batch duration is too low to notice a difference.

I observed a 10% speedup on a 2 GPU setup but need to investigate further.

@kobenaxie

8 A100 GPUs.
I will try increasing the buffer size to 50k and report the result tomorrow.

@pzelasko (Collaborator, Author)

Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).
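
For context, quadratic_duration penalizes long cuts when counting them toward max_duration, so batches of long utterances come out smaller. A small illustration of the assumed formula (taken to be duration + duration²/quadratic_duration, as in Lhotse's time constraint; worth verifying against your Lhotse version):

    from typing import Optional

    def effective_duration(duration: float, quadratic_duration: Optional[float]) -> float:
        # Assumed cost that a single cut contributes toward max_duration.
        if quadratic_duration is None:
            return duration
        return duration + duration ** 2 / quadratic_duration

    # With quadratic_duration=15, a 30 s cut counts as 30 + 900/15 = 90 s,
    # so long utterances fill up the batch budget faster than short ones.
    print(effective_duration(30.0, 15.0))  # -> 90.0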

@kobenaxie

> Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).

FP32 now

@kobenaxie

> Aside from that, your max_duration seems low for an A100. Try adding quadratic_duration=15 to the sampler and you'll probably be able to increase max_duration by 100-200 (but I'd expect you to be able to set it to at least ~500-600 in the first place; are you using bf16/fp16?).

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=300,
        shuffle=True,
        num_buckets=30,
        buffer_size=50000,
        shuffle_buffer_size=10000,
        quadratic_duration=15,
        drop_last=True,
        rank=0,
        world_size=1,
    )

It took 31 minutes per epoch with this config.

@pzelasko (Collaborator, Author)

Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, since it has to read more data into memory. Try precomputing the duration bins by running estimate_duration_bins on your data and passing the output to the sampler's duration_bins argument. You can then revert buffer_size to the original setting and compare with and without synced buffers again.
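
A minimal sketch of that suggestion, under a few assumptions: that estimate_duration_bins is importable from lhotse.dataset.sampling.dynamic_bucketing, that bucket-selection sync is toggled with a sync_buffers flag (the flag name is inferred from this thread), and a made-up manifest path. Check the details against your Lhotse version.

    from lhotse import CutSet
    from lhotse.dataset import DynamicBucketingSampler
    from lhotse.dataset.sampling.dynamic_bucketing import estimate_duration_bins

    cuts = CutSet.from_file("data/train_cuts.jsonl.gz")  # hypothetical path

    # Estimate bin edges once on the actual training data (can be done offline),
    # then pass them in so the sampler skips on-the-fly estimation at startup.
    duration_bins = estimate_duration_bins(cuts, num_buckets=30)

    sampler = DynamicBucketingSampler(
        cuts,
        max_duration=300,
        num_buckets=30,
        duration_bins=duration_bins,
        buffer_size=10000,       # back to the original buffer size
        shuffle=True,
        sync_buffers=True,       # flag name assumed from this discussion
    )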

@pzelasko (Collaborator, Author)

Just pushed a version that is better tested and supports both map-style and iterable-style datasets.

@kobenaxie

> Mmm, it seems you are using WebDataset or Lhotse Shar, so when the batch size or buffer size grows, the initialization of the dataloader (on the first step of iteration) takes longer, since it has to read more data into memory. Try precomputing the duration bins by running estimate_duration_bins on your data and passing the output to the sampler's duration_bins argument. You can then revert buffer_size to the original setting and compare with and without synced buffers again.

max_duration  bucket_buffer_size  sync_buffer  mins/epoch  estimate_duration_bins
200           10k                 False        21          False
200           10k                 True         21          False
200           10k                 False        21          True
200           10k                 True         21          True
300           50k                 False        18          False
300           50k                 True         17          False
300           50k                 False        16          True
300           50k                 True         16          True

@pzelasko (Collaborator, Author) commented Jun 5, 2024

I've tested this change more thoroughly and I'm now confident it helps with training speed. When training a NeMo FastConformer RNNT+CTC ASR model on a ~20k-hour dataset with 16 GPUs, I observed a 13% faster training step time when bucket selection is synchronized, everything else being equal. The validation WER of both runs is very close at the same number of steps, but convergence is actually a bit quicker when we consider validation WER vs. total training time.

In a separate experiment with 2 GPUs I observed an 8% speedup. I expect the speedup to grow with the size of the distributed training setup, as the probability of hitting the slowest bucket on each training step grows with the number of GPUs. The speedup is also likely model-dependent (the bigger the variance of processing time across sequence-length buckets, the greater the speedup).

[Image attached to the original comment.]

@pzelasko pzelasko merged commit cf6cde8 into master Jun 5, 2024
10 of 11 checks passed
@pzelasko pzelasko deleted the dynamic-bucket-selection-rng-sync branch June 5, 2024 18:58
@pzelasko pzelasko modified the milestone: v1.24.0 Jun 5, 2024