[Bug]: QR fails on two processes when using CUDA #1317
Comments
Error message:
|
Branch bugs/1317-_Bug_QR_fails_on_one_process_when_using_CUDA created! |
The failure is on two processes. |
Branch bugs/1317-_Bug_QR_fails_on_two_processes_when_using_CUDA created! |
I can reproduce this error on my workstation (CUDA 11.4, PyTorch 2.0.0). Interestingly, the following error arises only for exactly 2 processes:
|
(copied from the PR) There were different errors arising:
All of these failures happen on 2 processes, with split=0 and 1 tile per process.
|
(copied from the PR)
|
we have decided:
|
Regarding the "MPI_ERR_TRUNCATE"-error:
Abstracted example:
yields the same error, whereas no problem appears for small shapes. |
Actually, the fix is replacing
by
in line 901 of QR. |
I had a look into the "wrong results" error and observed the following: if you compute QR in a configuration for which the "wrong results" error arises, you can do
However, |
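A "wrong results" failure in a QR factorization can be narrowed down by checking its defining properties one at a time: `Q @ R` should reproduce `A`, `Q` should have orthonormal columns, and `R` should be upper triangular. The sketch below is not the check used in this thread. It uses plain Python lists in place of distributed tensors so that it runs without MPI or CUDA, and the helper names (`matmul`, `max_abs_diff`, `qr_diagnostics`) are hypothetical.

```python
# Minimal sketch: separate the three properties a QR result must satisfy.
# Plain nested lists stand in for tensors; all names here are illustrative.

def matmul(x, y):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def max_abs_diff(x, y):
    """Largest absolute entrywise difference between two equally shaped matrices."""
    return max(abs(a - b) for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def qr_diagnostics(a, q, r):
    """Return the three error measures for a claimed factorization a = q r."""
    n = len(q[0])
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    q_t = [list(col) for col in zip(*q)]
    below_diag = (
        abs(r[i][j]) for i in range(len(r)) for j in range(min(i, len(r[0])))
    )
    return {
        # Does Q R reproduce A?
        "reconstruction": max_abs_diff(matmul(q, r), a),
        # Are the columns of Q orthonormal, i.e. Q^T Q = I?
        "orthogonality": max_abs_diff(matmul(q_t, q), identity),
        # Is R upper triangular (largest entry strictly below the diagonal)?
        "below_diagonal": max(below_diag, default=0.0),
    }


# Hand-made 2x2 example with a known exact factorization.
a = [[3.0, 0.0], [4.0, 5.0]]
q = [[0.6, -0.8], [0.8, 0.6]]
r = [[5.0, 4.0], [0.0, 3.0]]
good = qr_diagnostics(a, q, r)

# Perturbing one entry of Q breaks orthogonality and reconstruction.
q_bad = [[0.6, -0.8], [0.8, 0.7]]
bad = qr_diagnostics(a, q_bad, r)
```

Which of the three values is large can hint at where to look: a large `reconstruction` error with a small `orthogonality` error, for instance, suggests that `Q` and `R` are individually plausible but do not belong together.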
What happened?
Tests for QR fail on two processes for CUDA
Code snippet triggering the error
No response
Error message or erroneous outcome
No response
Version
main (development branch)
Python version
None
PyTorch version
None
MPI version
No response