[trainer] new in pytorch: torch.optim._multi_tensor faster optimizers #9965
Comments
I did a quick benchmark, testing HF Trainer.
The change I did was:
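A minimal sketch of the kind of swap being benchmarked, assuming the replacement described in the issue body below (the exact diff may have differed):

```python
# Hypothetical sketch of the change: swap the stock AdamW for the
# multi-tensor variant. torch.optim._multi_tensor mirrors the torch.optim
# API, so the optimizer is constructed with the same arguments.
import torch
from torch.optim import _multi_tensor

model = torch.nn.Linear(10, 10)  # stand-in for the real model

# before:
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# after:
optimizer = _multi_tensor.AdamW(model.parameters(), lr=5e-5)
```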
and this is from pytorch-nightly from today.
You must have a really strange bottleneck in that test if neither the latest fairscale nor these optimizers change anything. These optimizers are measurably faster in isolation, and sure enough we see a difference in fairscale CI, even on a dummy job / small model (see, for instance, the two last jobs).
Testing with the same command, I see vastly varying throughput depending on
To share with others: @blefaudeux and his team recently made speed improvements in fairscale (master), which should have been quite visible, but a few days ago we tested this same script with
I will leave this issue open for now as an incentive to profile this script and identify the bottleneck.
@stas00 Do you think this should be revisited given the discussion in upstream PyTorch? |
Yes, I was just about to revisit it.

edit: I thought you might have wanted to work on that, but the pytorch team asks to run a profiler on it and all, so I will probably look into testing it out again.

--- original comment ---

Do you want to take the lead on this experiment, @jaketae? The new

You can use this tool for benchmarking, #14934, if it helps. I think it's pretty stable now; I will propose to PR it.
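For reference, a minimal sketch of profiling a few training steps with torch.profiler to hunt for the bottleneck; the model, sizes, and step count here are placeholders, not the actual benchmark script:

```python
# Minimal sketch: profile a few training steps to see whether
# optimizer.step() (or something else) dominates the step time.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
data = torch.randn(64, 512, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(10):
        loss = model(data).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Sort by self time to see where the training steps actually spend time.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```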
Back in September pytorch introduced `torch.optim._multi_tensor` (pytorch/pytorch#43507), which should be much more efficient for situations with lots of small feature tensors (transformers), and thus should show an appreciable speed up in training. If someone is interested in the progress of this project, here is the stack to track: pytorch/pytorch#48223.

This feature is currently at an alpha stage, so users can try it out by simply replacing `torch.optim` with `torch.optim._multi_tensor` in HF Trainer or their own trainer. Eventually it'll replace `torch.optim`, so there is nothing that we need to do otherwise.

@blefaudeux, who alerted me to this improvement, suggested it should show good speed ups for DDP/Sharded DDP training.
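As an illustration, a sketch of wiring the alpha AdamW into the HF Trainer through its `optimizers` argument rather than patching `torch.optim`; `my_model` and `my_dataset` are placeholders for a real model and dataset:

```python
# Sketch: pass the multi-tensor AdamW to the HF Trainer via `optimizers`.
# `my_model` and `my_dataset` are placeholders, not real objects.
import torch
from torch.optim import _multi_tensor
from transformers import Trainer, TrainingArguments

optimizer = _multi_tensor.AdamW(my_model.parameters(), lr=5e-5)

trainer = Trainer(
    model=my_model,
    args=TrainingArguments(output_dir="output"),
    train_dataset=my_dataset,
    optimizers=(optimizer, None),  # None: Trainer builds its default LR scheduler
)
trainer.train()
```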
If resources allow, it'd be good to run some benchmarks. Please feel free to beat me to it.
Thanks to @blefaudeux for the heads up, and @izdeby for working on this enhancement and clarifying where things are at.
Heads up to @sgugger and @patrickvonplaten: nothing else needs to be done.