MXNet: support broadcasting deferred initialization parameters in Gluon #915
Conversation
Force-pushed from 50154ba to 0ac89b1
@apeforest @eric-haibin-lin @ctcyang Please help review. I will update the README and the ImageNet example after I get some feedback from you.
Hi @yuxihu - maybe instead of DistributedOptimizer and DistributedInitializer, Gluon should just have a DistributedTrainer? That would make it much more sane, I think.
Does the Gluon trainer have a callback mechanism where, after all the variables are initialized, we could add the broadcasting step? It would be nice to keep the API the same across frameworks.
@alsrgv The broadcast needs to happen here, after the try-except and before the hybrid_forward call. We do not have a callback here; we have a callback after the forward call, which seems a little too late since the first forward pass would already be done. We would also need logic to make sure that we only broadcast once, because the callback is called after every forward pass. @eric-haibin-lin any opinion on this?
@ptrendx DistributedOptimizer is a well-adopted mechanism in Horovod across frameworks. DistributedInitializer is one option we came up with to solve the deferred-init parameter broadcasting issue. I think the current Gluon Trainer is designed to work well with distributed training using the kvstore (PS) approach. How would the DistributedTrainer differ from the Trainer? We are happy to hear more about what you think. @eric-haibin-lin is working on improving distributed training using MXNet; maybe we can have some discussions around it.
What about i.finish_deferred_init, could broadcast happen there? |
Whenever a parameter is initialized or re-initialized, the mx.initializer.Initializer will be called. This happens in various places, including finish_deferred_init, so a wrapper around the initializer that broadcasts any newly initialized parameter sounds reasonable.
There are hooks in Gluon before and after a block executes forward, but none of them executes a callback right after parameter initialization. On the Gluon side it probably does not make sense to add another hook-registration function just for parameter initialization.
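The wrapper idea can be sketched in plain Python. All names here are illustrative: `broadcast_` is a stand-in for Horovod's in-place broadcast, and the base initializer is any callable taking `(desc, arr)` in the spirit of `mx.init.Initializer.__call__`.

```python
# Sketch: wrap an initializer so every array it initializes is
# immediately broadcast from the root rank. Stand-in functions only;
# a real version would call Horovod's broadcast on an MXNet NDArray.
broadcast_log = []

def broadcast_(arr, root_rank, name):
    # Stand-in: a real implementation would sync `arr` across workers.
    broadcast_log.append((name, root_rank))

class BroadcastingInitializer:
    """Runs the base initializer, then broadcasts the fresh array."""
    def __init__(self, base_init, root_rank=0):
        self._base_init = base_init
        self._root_rank = root_rank

    def __call__(self, desc, arr):
        self._base_init(desc, arr)                    # normal (deferred) init
        broadcast_(arr, self._root_rank, name=desc)   # then sync all workers

# Usage: wrap any base initializer (here a trivial zeros init).
def zeros_init(desc, arr):
    for i in range(len(arr)):
        arr[i] = 0.0

init = BroadcastingInitializer(zeros_init)
weights = [1.0, 2.0]
init("dense0_weight", weights)
```

Because the wrapper runs wherever the initializer is invoked, it also covers the deferred-init path without any new Gluon hook.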
I'm also interested to hear @ptrendx's DistributedTrainer proposal.
What happens when you restore the model from a checkpoint? Normally, the approach is to reload the model on rank 0 and broadcast to everyone else.
Basically the problem is that what constitutes an "Optimizer" in TF and MXNet differs. In TF, the optimizer computes gradients and applies them (so it sits at a pretty high level, with knowledge about all the gradients, etc.), whereas the Optimizer in MXNet is a low-level class that just takes one or a few already-computed gradients and applies them. An additional problem with having Horovod as part of the optimizer in MXNet is that the previous distributed training strategy (kvstore) hooks in at different places, which makes it confusing and error-prone for users, who need to know when the actual reduction happens (in the kvstore case gradients are allreduced after calling trainer.allreduce_grads, while in the Horovod case they are not).
Sounds like a great fit. Could you provide an MNIST code example that would showcase DistributedTrainer?
@alsrgv When restoring from a checkpoint in Gluon, I think each worker needs to load parameters from file, and there should be no need to broadcast parameters. If we only load on rank 0, the rank 0 worker and the other workers would not share the same set of parameters because of deferred-initialized parameters, for which broadcast does not work. @eric-haibin-lin correct me if I missed anything here.
@ptrendx If we can integrate with Horovod at the Trainer level, the internal workflow between kvstore and Horovod will be more or less consistent. One complication with this proposal is that the Module API does not have a Trainer; the current integration at the optimizer level makes it easy to support both the Gluon and Module APIs. Even though we need to consider Module API support along the way, what you proposed is very interesting to explore, since it lets us take advantage of the improvements we are making at the Trainer level in Horovod (e.g. the AMP feature you are working on). Shall we set up a meeting to discuss further?
This is an idea of what the DistributedTrainer could look like.

The initial parameter broadcast can be addressed by injecting a Horovod broadcast into the … The MNIST script modifications with this are straightforward: one just needs to add the line … Edit: The wrapping of …
@romerojosh Thanks for demonstrating the idea of DistributedTrainer. I do see it solves two major issues: (1) broadcasting deferred-initialization parameters without adding a special callback in MXNet or a DistributedInitializer in Horovod; (2) giving training with kvstore and with Horovod the same internal logic under the Gluon API. One concern I have is that the user experience of training with the Gluon and Module APIs will differ. Although we have been encouraging users to adopt the Gluon API, we still see users running training jobs with the Module API. The Module API will continue to use DistributedOptimizer while the Gluon API would use DistributedTrainer, and we need to document the difference well to avoid confusion as much as possible. @alsrgv Does the DistributedTrainer look good to you? Besides the difference between Gluon and Module within MXNet, it will also introduce a slightly different user experience compared to TensorFlow and PyTorch. @eric-haibin-lin any thoughts about the DistributedTrainer proposal?
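For concreteness, the shape of the DistributedTrainer proposal can be sketched with stand-in classes. All names here are hypothetical: in real Gluon the hook point would be the Trainer's gradient-reduction step, and `allreduce_` stands in for Horovod's allreduce.

```python
# Sketch: a Trainer subclass that replaces the kvstore reduction step
# with a Horovod-style allreduce. Stand-in classes only; the real
# gluon.Trainer holds parameters, an optimizer, and a kvstore.
allreduce_log = []

def allreduce_(grad, name):
    # Stand-in: a real implementation would average `grad` across workers.
    allreduce_log.append(name)

class Trainer:
    """Minimal stand-in for gluon.Trainer."""
    def __init__(self, params):
        self._params = params

    def _allreduce_grads(self):
        pass  # the kvstore path would push/pull gradients here

    def step(self, batch_size):
        self._allreduce_grads()
        # ... the optimizer update would follow here ...

class DistributedTrainer(Trainer):
    """Swaps the kvstore reduction for a Horovod-style allreduce."""
    def _allreduce_grads(self):
        for name, grad in self._params.items():
            allreduce_(grad, name=name)

# Usage: a drop-in replacement for Trainer in the training loop.
trainer = DistributedTrainer({"dense0_weight": [0.1, 0.2]})
trainer.step(batch_size=32)
```

The appeal of this shape is that the user-facing training loop is unchanged; only the Trainer construction differs between single-worker and Horovod runs.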
Can we have code similar to this in …? This would allow us to keep APIs the same. I think it'd be really useful to do that to keep the portability. One exception to the portability rule I feel would be OK is a situation where all aspects (adding the optimizer, broadcasts, data partitioning, GPU pinning, etc.) are handled by a magical one-line high-level API. What do you guys think?
Force-pushed from 3c79e42 to 35ff369
@alsrgv I changed broadcast_parameters to inject the broadcast into init_impl, which solves the deferred initialization issue and keeps the user experience the same. I think @ptrendx and @romerojosh had some valid points regarding the idea of DistributedTrainer: the optimizer in MXNet is not equivalent to the one in TF, and a DistributedTrainer would let us take advantage of the improvements we are making in MXNet.
@apeforest @eric-haibin-lin please review
@alsrgv I think having an all-encapsulating API wrapper seems a bit overkill just to resolve the design inconsistency pointed out by @ptrendx and myself. The main reason for having the … If we remove the broadcast handling from the …
I think …
@yuxihu Are you okay with this idea, in particular the very small difference between Gluon and Module it introduces?
@romerojosh @eric-haibin-lin and I discussed the DistributedTrainer offline yesterday. We both think it makes sense for Gluon users. I will send out another PR for review soon.
Here is the DistributedTrainer PR.
Can we add a test for deferred initialization? Otherwise, LGTM.
@apeforest @eric-haibin-lin @alsrgv Added a unit test. Please help review and merge.
@yuxihu, could you rebase on the latest master with the CI fixes? Otherwise, LGTM.
Signed-off-by: Yuxi Hu <darrenyxhu@gmail.com>
Thanks for letting me know. Just rebased and triggered the CI.
@alsrgv Looks like all the tests have been queued but not running for two hours. Any issue with Travis?
@yuxihu, yeah, the Travis free edition hasn't been great recently. I'll try to migrate to Buildkite ASAP.
@apeforest @eric-haibin-lin are you OK to merge this, assuming the tests pass?
LGTM
…on (horovod#915) * Create DistributedInitializer to broadcast deferred-init param Signed-off-by: Yuxi Hu <darrenyxhu@gmail.com> * inject broadcast to init_impl Signed-off-by: Yuxi Hu <darrenyxhu@gmail.com> * add unit test Signed-off-by: Yuxi Hu <darrenyxhu@gmail.com> Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>
Fixes #895
In Gluon, we defer initialization for some parameters until their shapes are known during the first forward pass. This makes it difficult to keep parameters in sync among workers when each worker initializes with a different random seed.
To solve the issue, we inject a broadcast into the init_impl of deferred-initialized Gluon Parameters. Thanks @romerojosh for suggesting this idea.
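A minimal sketch of the injection idea, using stand-in classes rather than the real `gluon.Parameter` and Horovod APIs (all names are illustrative):

```python
# Sketch: patch a parameter's init hook so that its deferred
# initialization is followed immediately by a broadcast from the
# root rank. Stand-ins only; the real code wraps Gluon's internal
# Parameter init and calls Horovod's broadcast on the NDArray.
call_order = []

def broadcast_(data, root_rank):
    # Stand-in: a real implementation would sync `data` across workers.
    call_order.append("broadcast")

class Parameter:
    """Stand-in for gluon.Parameter with a deferred-style init hook."""
    def _init_impl(self, data):
        call_order.append("init")
        self._data = data

def inject_broadcast(param, root_rank=0):
    """Wrap the parameter's init hook so initialization triggers a broadcast."""
    orig_init_impl = param._init_impl
    def wrapped_init_impl(data):
        orig_init_impl(data)                 # run the deferred initialization
        broadcast_(param._data, root_rank)   # then sync from the root rank
    param._init_impl = wrapped_init_impl

p = Parameter()
inject_broadcast(p)
p._init_impl([1.0, 2.0])  # deferred init fires later, broadcast follows
```

Because the broadcast is attached to the init hook itself, it runs no matter when the deferred initialization finally happens, and it runs exactly once per initialization.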