ht.print can now print arrays distributed over n>1 GPUs #1170

Merged
merged 16 commits into from
Jul 24, 2023

Conversation

mrfh92
Collaborator

@mrfh92 mrfh92 commented Jun 23, 2023

This PR is intended to resolve #1121.

An alternative solution has been implemented in #1179.

@github-actions
Contributor

Thank you for the PR!

@mrfh92
Collaborator Author

mrfh92 commented Jun 23, 2023

The main problem is in printing.py, lines 269ff.: gathering tensors from several GPU devices via gather results in a list of tensors that are still on different devices and therefore cannot be concatenated; see #1171 for the details.
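
To illustrate the failure mode described above, here is a minimal sketch in plain PyTorch (no Heat, no MPI). It assumes the gathered result is a Python list of `torch.Tensor` chunks that may live on different devices, and moves every chunk to one common device before concatenating. The helper name `concat_on_common_device` and the choice of CPU as the target device are assumptions made for illustration only; this is not necessarily what the code in printing.py does.

```python
import torch


def concat_on_common_device(chunks, dim=0, device="cpu"):
    """Concatenate tensors that may live on different devices.

    torch.cat requires all inputs to be on the same device, so each chunk
    is first moved to `device` (a no-op for chunks that are already there).
    Hypothetical helper, not part of Heat.
    """
    moved = [chunk.to(device) for chunk in chunks]
    return torch.cat(moved, dim=dim)


# Stand-in for the list a gather might return. With n>1 visible GPUs the
# entries are spread over "cuda:0", "cuda:1", ...; without GPUs they all
# fall back to the CPU so the sketch still runs.
n_gpus = torch.cuda.device_count()
devices = [f"cuda:{i}" for i in range(n_gpus)] or ["cpu"]
chunks = [
    torch.arange(3, device=devices[i % len(devices)]) + 3 * i
    for i in range(4)
]

result = concat_on_common_device(chunks)
```

Calling `torch.cat(chunks)` directly on chunks that sit on different GPUs raises a device-mismatch error, which is the situation the comment describes. Moving everything to the CPU is a reasonable default for printing, since the data has to reach the host anyway.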

@codecov

codecov bot commented Jun 23, 2023

Codecov Report

Merging #1170 (7b4e466) into main (2b2ed0f) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1170   +/-   ##
=======================================
  Coverage   92.16%   92.16%           
=======================================
  Files          75       75           
  Lines       10703    10705    +2     
=======================================
+ Hits         9864     9866    +2     
  Misses        839      839           
Flag Coverage Δ
unit 92.16% <100.00%> (+<0.01%) ⬆️

Impacted Files Coverage Δ
heat/core/printing.py 97.56% <100.00%> (+0.06%) ⬆️

@mrfh92
Collaborator Author

mrfh92 commented Jun 26, 2023

Since the problem mentioned in #1171 seems to be due to CUDA-aware MPI, we decided in the PR talk to resolve this locally, i.e. in the printing functionality, instead of changing gather.

@mrfh92 mrfh92 marked this pull request as ready for review June 26, 2023 09:36
@mrfh92 mrfh92 requested a review from mtar June 26, 2023 13:26
@mrfh92
Collaborator Author

mrfh92 commented Jul 3, 2023

Or shall we use comm.Gather instead? (Just a difference w.r.t. code aesthetics.)

@mrfh92 mrfh92 marked this pull request as draft July 3, 2023 08:07
@mrfh92
Collaborator Author

mrfh92 commented Jul 3, 2023

Tried using Gather in a separate branch, but encountered problem #1174.

@mrfh92 mrfh92 changed the title Bug/1121 print fails on gpu Bug/1121 print fails on gpu (solution with gather) Jul 10, 2023
@mrfh92
Collaborator Author

mrfh92 commented Jul 10, 2023

Should be reviewed together with #1179

@mrfh92 mrfh92 marked this pull request as ready for review July 10, 2023 14:52
@mrfh92
Collaborator Author

mrfh92 commented Jul 12, 2023

#1179 will most likely only work once some refactoring of the communication wrappers (#341) has been done. Therefore, this PR is currently the only possible solution for resolving #1121.

@ClaudiaComito ClaudiaComito added the bug Something isn't working label Jul 24, 2023
@ClaudiaComito ClaudiaComito changed the title Bug/1121 print fails on gpu (solution with gather) ht.print can now print arrays distributed over n>1 GPUs Jul 24, 2023
Contributor

@ClaudiaComito ClaudiaComito left a comment

Thanks a lot @mrfh92

@mrfh92
Collaborator Author

mrfh92 commented Jul 24, 2023

ok

----------------------------------------------------------------------
Ran 21 tests in 0.305s

OK

Testing on HDFML with 2 nodes (8 GPUs) runs through

@mrfh92 mrfh92 merged commit 009a91a into main Jul 24, 2023
46 checks passed
@mtar mtar deleted the bug/1121-print-fails-on-gpu branch February 28, 2024 08:42
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[Bug]: heat.print fails if communication over GPUs is required
3 participants