collect_cuts_in_buckets_ slowness? #678
From looking at the output of py-spy dump --pid=<training-process>, I notice that the training process sometimes seems to be spending time in _collect_cuts_in_buckets. I had anticipated that this would happen in a separate data-loader worker, not in the main thread. Note, this is a run that was started with:

python3 ./pruned_transducer_stateless3/train.py --world-size 8 --num-epochs 40 --start-epoch 20 --full-libri 1 --exp-dir pruned_transducer_stateless3/exp --max-duration 300 --use-fp16 1 --lr-epochs 4 --num-workers 5 --giga-prob 0.5

This doesn't dominate the training (I see it only maybe 10% or 20% of the time), but it probably makes an appreciable difference. It would probably be better to have some more incremental way of doing this.
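For context, here is a rough sketch of the kind of work being profiled: placing each cut into a duration bucket. This is a guess at the shape of the logic, assuming buckets are chosen by bisecting each cut's duration against precomputed bin boundaries; it is not lhotse's actual implementation, and the names are illustrative.

```python
# Rough sketch of bucket collection: partition cuts into duration buckets
# by bisecting each cut's duration against precomputed bin boundaries.
# Illustrative only; not lhotse's actual implementation.
from bisect import bisect_right


def collect_cuts_in_buckets(cuts, duration_bins):
    """Partition `cuts` into len(duration_bins) + 1 buckets by duration."""
    buckets = [[] for _ in range(len(duration_bins) + 1)]
    for cut in cuts:
        buckets[bisect_right(duration_bins, cut.duration)].append(cut)
    return buckets
```

The point is that this is a single pass over many cuts, paid in the main process, which matches the stalls visible in the py-spy dumps.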
Comments

... one reason this has more of an effect on speed than you might expect is that different workers reach this point at different times, so with 8 training processes you spend 8x as long waiting for this to complete (each worker's stall is paid by all of them at the next gradient synchronization).
With map-style datasets, the sampler is placed in the same process as the main training loop. I'll think about whether there's an easy way to offload this to some background thread. The other option is to use …
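A minimal sketch of how that background-thread offload might look, assuming the expensive step is the bucket collection itself; the class and method names here are hypothetical, not lhotse API.

```python
# Hypothetical sketch: run the expensive bucket collection in a background
# thread so the training loop only blocks if the work is not done in time.
from concurrent.futures import ThreadPoolExecutor


class AsyncBucketCollector:
    def __init__(self, cuts):
        self.cuts = cuts
        self._executor = ThreadPoolExecutor(max_workers=1)
        # Kick off collection for the first epoch immediately.
        self._pending = self._executor.submit(self._collect)

    def _collect(self):
        # Stand-in for the real _collect_cuts_in_buckets work:
        # group cuts into buckets keyed by (integer) duration.
        buckets = {}
        for cut in self.cuts:
            buckets.setdefault(int(cut.duration), []).append(cut)
        return buckets

    def get_buckets(self):
        # Block only if the background work is unfinished, then
        # immediately schedule collection for the next epoch.
        buckets = self._pending.result()
        self._pending = self._executor.submit(self._collect)
        return buckets
```

One caveat: this is pure-Python, CPU-bound work, so under the GIL the background thread still competes with the main loop for the interpreter, although it can overlap with GIL-releasing operations such as GPU kernels and I/O. That may be the sort of problem with threads the next comment has in mind.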
BTW, if threads become a problem in future, I would have thought it would be possible to simply spread the sampler's work across multiple minibatches by wrapping it in a generator and stepping it on every minibatch. Admittedly this is a little ugly and seems like a hack to do the work of a thread.
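A minimal sketch of that generator idea, again using bucket collection as the work to be amortized; CHUNK and the function names are illustrative.

```python
# Illustrative sketch: wrap the bucket collection in a generator and step
# it once per minibatch, so the cost is spread out instead of paid at once.
CHUNK = 1000  # cuts processed per step; purely illustrative


def collect_incrementally(cuts, buckets):
    """Fill `buckets` in place, yielding control after every CHUNK cuts."""
    for i, cut in enumerate(cuts):
        buckets.setdefault(int(cut.duration), []).append(cut)
        if (i + 1) % CHUNK == 0:
            yield  # hand control back to the training loop


# In the training loop, do one slice of the sampler's work per minibatch:
#
#   buckets = {}
#   collector = collect_incrementally(cuts, buckets)
#   for batch in dataloader:
#       train_step(batch)
#       next(collector, None)  # no-op once the generator is exhausted
```

The wrinkle is that the buckets are only complete once the generator is exhausted, so the sampler would have to finish any remaining work eagerly if the buckets are needed before then.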