Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data type in PartialDistributedGradientTape #3985

Merged
merged 3 commits into from
Sep 18, 2023

Conversation

supercharleszhu
Copy link
Contributor

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

In horovod elastic mode we are seeing

[6]<stderr>:        gradients = tape.gradient(loss_value, self._model.trainable_weights)
[6]<stderr>:    File "/home/jobuser/.local/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 1089, in gradient  *
[6]<stderr>:        grad = tf.IndexedSlices(grad.values / horovod_size, grad.indices, grad.dense_shape)
[6]<stderr>:
[6]<stderr>:    TypeError: `x` and `y` must have the same dtype, got tf.float32 != tf.int32.

which is a type mismatch from size_op and the gradient values. This PR fixes this issue.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Signed-off-by: Chen Zhu <chzhu@linkedin.com>
@MrAta
Copy link
Contributor

MrAta commented Sep 13, 2023

Thanks for the fix @supercharleszhu, seems it's for the indexedslice branch.

@@ -182,7 +182,7 @@ def __filtered_reduce_grads(grads, vars):
rg.append(grad)

rg = self._allreduce_grads(rg, rv)
horovod_size = size_op(process_set_id=self.process_set.process_set_id) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else self.process_set.size()
horovod_size = tf.cast(size_op(process_set_id=self.process_set.process_set_id), tf.float32) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else self.process_set.size()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a size of type float is a bit surprising, given the else branch is still an int. I'd rather have float32(horovod_size) wherever division by horovod_size occurs, or introduce a variable horovod_size_float = float32(horovod_size) and use that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EnricoMi Agreed. I just updated to float(horovod_size) wherever division by horovod_size occurs. Thanks!

Signed-off-by: Chen Zhu <chzhu@linkedin.com>
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@github-actions
Copy link

github-actions bot commented Sep 15, 2023

Unit Test Results

     971 files  +   403       971 suites  +403   12h 11m 7s ⏱️ + 3h 12m 2s
     887 tests ±       0       771 ✔️ +     80     116 💤  -      79  0  - 1 
21 647 runs  +8 909  15 323 ✔️ +6 683  6 324 💤 +2 227  0  - 1 

Results for commit 22ff283. ± Comparison against base commit ca21d8d.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Sep 15, 2023

Unit Test Results (with flaky tests)

  1 117 files  +     487    1 117 suites  +487   12h 56m 16s ⏱️ + 3h 30m 21s
     887 tests ±         0       771 ✔️ +     80     116 💤  -      79  0  - 1 
25 255 runs  +11 445  17 399 ✔️ +8 073  7 856 💤 +3 375  0  - 3 

Results for commit 22ff283. ± Comparison against base commit ca21d8d.

♻️ This comment has been updated with latest results.

@maxhgerlach
Copy link
Collaborator

[re-triggered the pipeline: Build and Test macOS (test-cpu-openmpi-py3_7-tf1_15_5-keras2_2_4-torch1_6_0-mxnet1_5_1_p0-macos)]

@maxhgerlach maxhgerlach merged commit 7e4d993 into horovod:master Sep 18, 2023
41 of 45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants