
add partition optimizer status trainer #2951

Closed · wants to merge 7 commits

Conversation


@xinyual xinyual commented Jun 4, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.


github-actions bot commented Jun 4, 2021

Unit Test Results

438 files (+12) · 438 suites (+12) · 6h 20m 36s ⏱️ (+1h 1m 26s)
716 tests (+14): 614 ✔️ (+12) · 101 💤 (+1) · 1 failed (+1)
9 537 runs (+364): 6 450 ✔️ (+320) · 3 086 💤 (+43) · 1 failed (+1)

For more details on these failures, see this check.

Results for commit 65ae9af. ± Comparison against base commit f0c44d2.

This pull request removes 5 and adds 19 tests. Note that renamed tests count towards both.
test.integration.test_spark_keras.SparkKerasTests ‑ test_batch_generator_fn
test.integration.test_spark_keras.SparkKerasTests ‑ test_convert_custom_sparse_to_dense_bare_keras_fn
test.integration.test_spark_keras.SparkKerasTests ‑ test_prepare_data_bare_keras_fn
test.integration.test_spark_lightning.SparkLightningTests ‑ test_lr_schedualler_callback
test.single.test_ray ‑ test_ray_deprecation
test.integration.test_spark_lightning.SparkLightningTests ‑ test_lr_scheduler_callback
test.integration.test_spark_lightning.SparkLightningTests ‑ test_model_override_trainer_args
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_broadcast_inplace_cpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_broadcast_inplace_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_broadcast_inplace_multiple_cpu
test.parallel.test_torch.TorchTests ‑ test_global_barrier_op
test.parallel.test_torch.TorchTests ‑ test_process_set_barrier_op
test.single.test_ray_elastic_v2 ‑ test_both_num_workers_min_workers
test.single.test_ray_elastic_v2 ‑ test_fault_tolerance_hosts_added_and_removed
test.single.test_ray_elastic_v2 ‑ test_fault_tolerance_hosts_remove_and_add
…

♻️ This comment has been updated with latest results.

tgaddair (Collaborator) commented Jun 4, 2021

This is awesome @xinyual! Can you also add a test?

@tgaddair tgaddair requested review from leezu and romerojosh June 4, 2021 20:30
maxhgerlach (Collaborator) commented:

This sounds intriguing! I guess a ZeRO-like optimizer would benefit from a ReduceScatter operation in Horovod (#1496)?

leezu (Collaborator) commented Jun 8, 2021

@maxhgerlach yes, though this PR only implements optimizer state sharding and not yet gradient sharding. Do you have any plans to revive #1496?
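To make the distinction concrete, optimizer state sharding (what this PR implements, as opposed to gradient sharding) can be sketched in plain Python. The following is an illustrative single-process simulation with hypothetical names, not Horovod API:

```python
# ZeRO-1-style optimizer-state sharding, simulated without MPI: each "rank"
# keeps momentum state only for the parameters it owns. Gradients are still
# fully reduced everywhere; only the optimizer state is partitioned.

def partition(num_params, world_size):
    """Round-robin assignment: parameter index -> owning rank."""
    return {i: i % world_size for i in range(num_params)}

def sharded_sgd_momentum_step(params, grads, owner, momenta, lr=0.1, mu=0.9):
    """momenta maps rank -> {param_index: momentum}. Only the owning rank
    allocates and updates momentum for a parameter; in real Horovod the
    updated value would then be broadcast from the owner to all ranks."""
    for i, p in enumerate(params):
        rank = owner[i]                        # only this rank holds state for i
        m = mu * momenta[rank].get(i, 0.0) + grads[i]
        momenta[rank][i] = m                   # state lives on the owner only
        params[i] = p - lr * m                 # stands in for owner update + broadcast
    return params
```

The memory saving is that each rank stores roughly 1/world_size of the optimizer state, at the cost of a per-step broadcast of the updated parameters.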

maxhgerlach (Collaborator) commented:

@leezu I was thinking about having a go at #1496 to get Reducescatter, but it would take a while. So if you need it earlier, please go ahead. :)

tgaddair (Collaborator) commented:

@xinyual, can you rebase your PR off master? The most recent changes should fix the failing buildkite and docs tests.

romerojosh (Collaborator) left a comment:

Thanks for the PR. This looks very cool and would be a great addition to Horovod!

Left a few minor comments.


self._allreduce_grads()
self._update(ignore_stale_grad)
self.broadcast_params()

Is there a reason you've rewritten all of step() here instead of just calling step() of the parent class and adding just the additional call to broadcast_params()?
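The pattern the reviewer is suggesting can be sketched generically; the class and method names below are simplified stand-ins, not the actual MXNet/Horovod classes:

```python
# Override step() to delegate to the parent class and append only the extra
# broadcast, instead of duplicating the parent's allreduce/update logic.

class Trainer:
    """Stand-in for the parent trainer class."""
    def __init__(self):
        self.log = []

    def step(self, batch_size, ignore_stale_grad=False):
        self.log += ["allreduce_grads", "update"]   # parent's existing work

class PartitionedTrainer(Trainer):
    def step(self, batch_size, ignore_stale_grad=False):
        super().step(batch_size, ignore_stale_grad)  # reuse parent logic
        self.broadcast_params()                      # only the new call is added

    def broadcast_params(self):
        self.log.append("broadcast_params")
```

Besides being shorter, this keeps the subclass correct if the parent's step() logic changes later.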


def step(self, batch_size, ignore_stale_grad=False):
"""
inherit from trainer, only call boardcast to make sure all parameter are consistent

Typo: boardcast -> broadcast

from mxnet.gluon.parameter import Parameter


class POS_Trainer(mx.gluon.Trainer):

Can the name of this be more descriptive? Perhaps something like PartitionedParameterTrainer or similar?

Alternatively, do you think this parameter splitting code can be made a feature of the existing hvd.DistributedTrainer instead of creating a completely separate implementation?
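For illustration, one way such parameter-splitting logic could be factored into a reusable helper (whether in a new class or inside hvd.DistributedTrainer) is a greedy size-balanced assignment. This is a hypothetical sketch, not existing Horovod code:

```python
# Greedy bin-packing: assign each parameter to the rank with the smallest
# total element count so far, processing parameters largest-first so the
# per-rank optimizer-state sizes stay close to balanced.
import heapq

def partition_by_size(param_sizes, world_size):
    """Return rank -> list of parameter indices, balancing total size."""
    heap = [(0, rank) for rank in range(world_size)]   # (load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(world_size)}
    for idx in sorted(range(len(param_sizes)), key=lambda i: -param_sizes[i]):
        load, rank = heapq.heappop(heap)               # least-loaded rank
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + param_sizes[idx], rank))
    return assignment
```

A simple round-robin split also works but can badly unbalance state memory when parameter sizes vary (e.g. embedding tables vs. biases).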

self._gradient_predivide_factor = gradient_predivide_factor


def partition_parameters(self, params):

Add leading underscore to internal methods like this one. Also, add a brief comment describing what this does.




def broadcast_params(self):

This should also be renamed to something like _broadcast_partitioned_params to better differentiate it from the existing _broadcast_parameters method. Also, add a brief comment describing what it does.
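Conceptually, a method like the suggested _broadcast_partitioned_params broadcasts each parameter from the rank that owns (and therefore updated) it, so all replicas end each step in agreement. A pure-Python simulation of that invariant, with hypothetical names:

```python
# Simulated owner-rank broadcast: replicas maps rank -> {param_name: value},
# owner maps param_name -> owning rank. In real Horovod each assignment
# below would be a broadcast with root_rank = owner[name].
import copy

def broadcast_partitioned_params(replicas, owner):
    for name, root in owner.items():
        authoritative = replicas[root][name]      # value on the owning rank
        for rank in replicas:
            replicas[rank][name] = copy.deepcopy(authoritative)
    return replicas
```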


stale bot commented Aug 31, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 31, 2021
@stale stale bot closed this Sep 8, 2021
maxhgerlach (Collaborator) commented:

@xinyual, @leezu: Are you still interested in partitioned, distributed optimizer states? I've just posted a revived PR #3299 that adds a ReduceScatter op, which might be helpful in this context.

@maxhgerlach maxhgerlach reopened this Dec 2, 2021
@stale stale bot removed the wontfix label Dec 2, 2021

github-actions bot commented Dec 3, 2021

Unit Test Results (with flaky tests)

546 files (-47) · 546 suites (-47) · 8h 19m 25s ⏱️ (+2h 31m 32s)
716 tests (+14): 614 ✔️ (+12) · 101 💤 (+2) · 1 failed (±0)
12 207 runs (-479): 8 200 ✔️ (-1) · 4 000 💤 (-484) · 7 failed (+6)

For more details on these failures, see this check.

Results for commit 65ae9af. ± Comparison against base commit f0c44d2.

This pull request removes 5 and adds 19 tests (same list as in the earlier test results comment). Note that renamed tests count towards both.

xinyual (Author) commented Dec 10, 2021

> @xinyual, @leezu: Are you still interested in partitioned, distributed optimizer states? I've just posted a revived PR #3299 that adds a ReduceScatter op, which might be helpful in this context.

I saw that PR, but I don't think it is suitable here. The paper mentions reduce-scatter, but I believe that applies at the model level; for the parameters we may still need reduce/allreduce. I have opened a new PR (#3309) with a test case. Please take a look.
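To make the allreduce vs. reduce-scatter distinction in this exchange concrete (pure-Python illustration, not Horovod code): an allreduce leaves every rank with the full summed tensor, while a reduce-scatter leaves each rank with only its shard of the sum, which is exactly what a rank needs to update just its optimizer partition.

```python
# per_rank is a list of equal-length gradient vectors, one per rank.

def allreduce(per_rank):
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]       # every rank: the full sum

def reduce_scatter(per_rank):
    world = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // world
    return [total[r * shard:(r + 1) * shard] for r in range(world)]  # rank r: shard r
```

Reduce-scatter followed by an allgather of the shards is equivalent to an allreduce, which is why gradient sharding schemes can replace the allreduce without extra communication volume.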

@xinyual xinyual mentioned this pull request Dec 10, 2021
4 tasks

stale bot commented Feb 9, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 9, 2022
@stale stale bot closed this Feb 17, 2022