add partition optimizer status trainer #2951
Conversation
Unit Test Results: 438 files (+12), 438 suites (+12), 6h 20m 36s ⏱️ (+1h 1m 26s). For more details on these failures, see this check. Results for commit 65ae9af, compared against base commit f0c44d2. This pull request removes 5 and adds 19 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
This is awesome @xinyual! Can you also add a test?
This sounds intriguing! I guess a ZeRO-like optimizer would benefit from a ReduceScatter operation in Horovod (#1496)?
@maxhgerlach yes, though this PR only implements optimizer state sharding, not gradient sharding yet. Do you have any plans to revive #1496?
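To make the distinction concrete, here is a minimal single-process sketch of the optimizer-state-sharding idea being discussed (this is illustrative only, not Horovod's API or this PR's code): each worker owns a disjoint subset of the parameters, applies the optimizer update only to those, and the owners' values are then shared with everyone.

```python
# Hypothetical sketch of ZeRO-style optimizer state sharding, with the
# workers simulated in one process. Each worker updates only the
# parameters it "owns"; in a real setup the owner would then broadcast
# its updated values to the other workers.

def partition(params, num_workers):
    """Assign each parameter index to a worker round-robin."""
    return {i: i % num_workers for i in range(len(params))}

params = [1.0, 2.0, 3.0, 4.0]
grads = [0.1, 0.1, 0.1, 0.1]
num_workers = 2
owner = partition(params, num_workers)

lr = 0.5
# Each simulated worker updates only the parameters it owns ...
for rank in range(num_workers):
    for i in range(len(params)):
        if owner[i] == rank:
            params[i] -= lr * grads[i]
# ... the broadcast step is elided here, since all "workers" share one
# list in this single-process simulation.
print(params)  # → [0.95, 1.95, 2.95, 3.95]
```

The point of the sharding is that each worker only needs optimizer state (e.g. Adam moments) for its own partition, cutting per-worker memory roughly by the number of workers.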
@xinyual, can you rebase your PR off master? The most recent changes should fix the failing buildkite and docs tests.
Thanks for the PR. This looks very cool and would be a great addition to Horovod!
Left a few minor comments.
self._allreduce_grads()
self._update(ignore_stale_grad)
self.broadcast_params()
Is there a reason you've rewritten all of step() here instead of just calling step() of the parent class and adding just the additional call to broadcast_params()?
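The reviewer's suggestion can be sketched as follows (class and method names here are stand-ins, not Horovod's actual classes): override step() to delegate to the parent and append only the new call.

```python
# Hypothetical sketch of delegating to the parent's step() instead of
# duplicating its body. An events list records the call order so the
# resulting flow is visible.

events = []

class BaseTrainer:
    """Stand-in for the existing trainer class."""
    def step(self, batch_size, ignore_stale_grad=False):
        self._allreduce_grads()
        self._update(ignore_stale_grad)

    def _allreduce_grads(self):
        events.append("allreduce")

    def _update(self, ignore_stale_grad):
        events.append("update")

class PartitionedTrainer(BaseTrainer):
    def step(self, batch_size, ignore_stale_grad=False):
        # Reuse the parent's step() for the common logic ...
        super().step(batch_size, ignore_stale_grad)
        # ... and add only the partition-specific broadcast.
        self.broadcast_params()

    def broadcast_params(self):
        events.append("broadcast")

PartitionedTrainer().step(batch_size=32)
print(events)  # → ['allreduce', 'update', 'broadcast']
```

Besides being shorter, this keeps the subclass in sync if the parent's step() logic ever changes.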
def step(self, batch_size, ignore_stale_grad=False):
    """
    inherit from trainer, only call boardcast to make sure all parameter are consistent
Typo: boardcast -> broadcast
from mxnet.gluon.parameter import Parameter
class POS_Trainer(mx.gluon.Trainer): |
Can the name of this be more descriptive? Perhaps something like PartitionedParameterTrainer or similar? Alternatively, do you think this parameter splitting code can be made a feature of the existing hvd.DistributedTrainer instead of creating a completely separate implementation?
self._gradient_predivide_factor = gradient_predivide_factor
def partition_parameters(self, params): |
Add leading underscore to internal methods like this one. Also, add a brief comment describing what this does.
def broadcast_params(self): |
This should also be renamed to something like _broadcast_partitioned_params to better differentiate it from the existing _broadcast_parameters method. Also, add a brief comment describing what it does.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Unit Test Results (with flaky tests): 546 files (−47), 546 suites (−47), 8h 19m 25s ⏱️ (+2h 31m 32s). For more details on these failures, see this check. Results for commit 65ae9af, compared against base commit f0c44d2. This pull request removes 5 and adds 19 tests. Note that renamed tests count towards both.
I see that PR, but I think it is not suitable. The paper mentions reduce-scatter, but I believe that applies at the model level. For parameters, we may need to use reduce/allreduce. I have opened a new PR #3309 with a test case. Please see that.
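The difference between the two collectives under discussion can be shown with a tiny pure-Python model (no Horovod; this is a conceptual sketch only): allreduce leaves every rank with the full summed vector, while reduce-scatter leaves each rank with just its shard of the sum, which is why reduce-scatter pairs naturally with sharded state.

```python
# Conceptual comparison of allreduce vs reduce-scatter on per-rank
# gradient vectors. Real collectives move data between processes; here
# the ranks are just rows of a list.

grads_per_rank = [[1, 2, 3, 4], [10, 20, 30, 40]]

def allreduce(vectors):
    """Every rank ends up with the elementwise sum of all vectors."""
    return [sum(col) for col in zip(*vectors)]

def reduce_scatter(vectors):
    """Each rank ends up with only its own contiguous shard of the sum."""
    total = allreduce(vectors)
    n = len(vectors)
    shard = len(total) // n
    return [total[r * shard:(r + 1) * shard] for r in range(n)]

print(allreduce(grads_per_rank))       # → [11, 22, 33, 44] on every rank
print(reduce_scatter(grads_per_rank))  # → rank 0: [11, 22], rank 1: [33, 44]
```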