Features/584 data parallel -- DASO #728

Merged 195 commits on Mar 24, 2021

coquelin77 (Member) commented Mar 1, 2021

Description

Implementation of the Distributed Asynchronous and Selective Optimization (DASO) method.
Other changes include a second ImageNet example that uses DALI on TFRecords with DASO, MPI sum functions for torch.half and torch.bfloat16, and a class that determines whether a metric is plateauing.
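As a rough illustration of why dedicated sum functions for 16-bit floats are useful: a reduction over a dtype that the communication layer does not handle natively can be written as a function that receives raw byte buffers, decodes them, adds element-wise, and re-encodes the result. The sketch below assumes IEEE 754 half precision in little-endian layout and uses only the Python standard library. The name `sum_half_buffers` is hypothetical and is not the function added in this PR; bfloat16 would need its own decoding step, which is not shown.

```python
import struct


def sum_half_buffers(a: bytes, b: bytes) -> bytes:
    """Element-wise sum of two raw buffers of IEEE 754 half-precision floats.

    Hypothetical helper for illustration only. The values are decoded to
    Python floats, added, and rounded back to half precision on encoding.
    """
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("buffers must have equal, even byte lengths")
    count = len(a) // 2  # each half-precision value occupies 2 bytes
    fmt = f"<{count}e"  # 'e' is the struct format code for binary16
    left = struct.unpack(fmt, a)
    right = struct.unpack(fmt, b)
    return struct.pack(fmt, *(x + y for x, y in zip(left, right)))


# Usage: pack two small vectors as half precision and reduce them.
buf_a = struct.pack("<2e", 1.5, 0.5)
buf_b = struct.pack("<2e", 2.25, 0.25)
summed = struct.unpack("<2e", sum_half_buffers(buf_a, buf_b))
```

In a real MPI setting, a function of this shape would typically be registered as a user-defined reduction operation and would operate on tensors rather than Python tuples.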

Issue/s resolved: N/A

Changes proposed:

  • DASO implementation
  • MPI sum functions for torch.half and torch.bfloat16 dtypes
  • nn.DataParallelMultiGPU: uses torch.distributed at the node-local level for the forward and backward steps of a neural network
  • optim.DetectMetricPlateau: detects when a metric has plateaued, a check that is often useful during training
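To make the plateau-detection idea concrete, here is a minimal sketch of one common approach: track the best value seen so far and report a plateau once the metric has failed to improve for more than a set number of consecutive checks. The class name `PlateauDetector` and its parameters (`patience`, `threshold`, `mode`) are assumptions made for this example and do not describe the interface of optim.DetectMetricPlateau.

```python
class PlateauDetector:
    """Hypothetical plateau check; not the class added in this PR."""

    def __init__(self, patience=3, threshold=1e-4, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.patience = patience    # non-improving checks tolerated
        self.threshold = threshold  # minimum change that counts as progress
        self.mode = mode
        self.best = None
        self.bad_steps = 0

    def _improved(self, value):
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.threshold
        return value > self.best + self.threshold

    def step(self, value):
        """Record a new metric value; return True if a plateau is detected."""
        if self._improved(value):
            self.best = value
            self.bad_steps = 0
            return False
        self.bad_steps += 1
        if self.bad_steps > self.patience:
            self.bad_steps = 0  # reset so the next plateau is counted afresh
            return True
        return False


# Usage: the loss stops improving after the second value.
detector = PlateauDetector(patience=2, threshold=0.0)
flags = [detector.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9]]
```

With `patience=2`, the detector tolerates two non-improving checks and flags the third.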

Type of change

  • New feature (non-breaking change which adds functionality)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

coquelin77 (Member, Author) commented:

rerun tests

coquelin77 changed the title from "Features/584 data parallel apex" to "Features/584 data parallel -- DASO" on Mar 2, 2021
Resolved review threads:

  • examples/nn/mnist.py (outdated)
  • heat/core/communication.py (outdated)
  • heat/core/communication.py (outdated)
  • heat/core/communication.py (outdated)
  • heat/nn/data_parallel.py
  • heat/optim/dp_optimizer.py (outdated)
  • heat/optim/dp_optimizer.py (outdated)
  • heat/optim/dp_optimizer.py
  • heat/optim/dp_optimizer.py
  • heat/optim/dp_optimizer.py (outdated)

coquelin77 merged commit ddec44f into master on Mar 24, 2021
coquelin77 deleted the features/584-DataParallel-apex branch on March 24, 2021 14:15
coquelin77 restored the features/584-DataParallel-apex branch on May 28, 2021 18:33

2 participants