
[Bug]: QR fails on two processes when using CUDA #1317

Closed
mrfh92 opened this issue Jan 8, 2024 · 14 comments
Labels: bug · High priority, urgent · linalg

Comments


mrfh92 commented Jan 8, 2024

What happened?

Tests for QR fail on two processes when using CUDA.

Code snippet triggering the error

No response

Error message or erroneous outcome

No response

Version

main (development branch)

Python version

None

PyTorch version

None

MPI version

No response

@mrfh92 mrfh92 added the bug Something isn't working label Jan 8, 2024

mrfh92 commented Jan 8, 2024

Error message:

>               self.assertTrue(ht.allclose(a_comp2, qr2.Q @ qr2.R, rtol=1e-5, atol=1e-5))
E               AssertionError: False is not true
heat/core/linalg/tests/test_qr.py:77: AssertionError

@mrfh92 mrfh92 added the linalg label Jan 8, 2024
@ClaudiaComito ClaudiaComito self-assigned this Jan 8, 2024

mtar commented Jan 9, 2024

The failure is on two processes.

@mtar mtar changed the title from "[Bug]: QR fails on one process when using CUDA" to "[Bug]: QR fails on two processes when using CUDA" Jan 9, 2024
@mtar mtar linked a pull request Jan 9, 2024 that will close this issue
@mrfh92 mrfh92 self-assigned this Jan 10, 2024

mrfh92 commented Jan 10, 2024

I can reproduce this error on my workstation (CUDA 11.4, PyTorch 2.0.0). Interestingly, the following error arises only for exactly 2 processes:

FAIL: test_qr (heat.core.linalg.tests.test_qr.TestQR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/hopp_fa/heat/heat/core/linalg/tests/test_qr.py", line 67, in test_qr
    self.assertTrue(ht.allclose((a_comp1 - (qr1.Q @ qr1.R)), 0, rtol=1e-5, atol=1e-5))
AssertionError: False is not true

----------------------------------------------------------------------

@mrfh92 mrfh92 added the High priority, urgent label Jan 11, 2024

mrfh92 commented Jan 16, 2024

(copied from the PR)
@ClaudiaComito @mtar I'm currently doing a random search (1000 samples) on my workstation with 1-8 processes, (20-40) x (20-40) matrices (float32 on GPU), split=0,1, and tiles_per_proc=1,2 in order to detect possible further problematic configurations.
tested_configs.txt

Different kinds of errors arose:

  • 13 failures in the sense that the program terminated, but with wrong results:
75: 32 x 34 matrix on 2 procs, split = 0 and 1 tiles per proc.
109: 36 x 30 matrix on 2 procs, split = 0 and 1 tiles per proc.
171: 37 x 29 matrix on 2 procs, split = 0 and 1 tiles per proc.
226: 37 x 28 matrix on 2 procs, split = 0 and 1 tiles per proc.
378: 35 x 35 matrix on 2 procs, split = 0 and 1 tiles per proc.
487: 40 x 40 matrix on 2 procs, split = 0 and 1 tiles per proc.
641: 36 x 24 matrix on 2 procs, split = 0 and 1 tiles per proc.
686: 34 x 36 matrix on 2 procs, split = 0 and 1 tiles per proc.
706: 32 x 38 matrix on 2 procs, split = 0 and 1 tiles per proc.
761: 39 x 22 matrix on 2 procs, split = 0 and 1 tiles per proc.
771: 39 x 39 matrix on 2 procs, split = 0 and 1 tiles per proc.
936: 34 x 40 matrix on 2 procs, split = 0 and 1 tiles per proc.
982: 32 x 30 matrix on 2 procs, split = 0 and 1 tiles per proc.

All of these failures happen on 2 processes, split=0 and 1 tiles per proc.

  • 4 additional failures in the sense that the program terminated with an error:

    Test 201 of 2000: 20 x 24 matrix on 6 procs, split = 0 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 354, in __split0_r_calc
        q1, r1 = torch.linalg.qr(base_tile, mode="complete")
    TypeError: linalg_qr(): argument 'A' (position 1) must be Tensor, not NoneType
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 392, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 4 is out of bounds for dimension 0 with size 4
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 453, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 4 is out of bounds for dimension 0 with size 4
    

    Test 337 of 2000: 40 x 30 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    

    Test 610 of 2000: 40 x 31 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    

    Test 972 of 2000: 40 x 30 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    


mrfh92 commented Jan 16, 2024

(copied from the PR)
I have repeated this test with 500 samples, for 1-8 processes, but only with split=0 and tiles_per_proc=1.
Again, there are two kinds of errors:

  • wrong result, but the program completes

  • program does not complete due to an error:

    Test 128 of 500: 29 x 33 matrix on 7 procs, split = 0 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 354, in __split0_r_calc
        q1, r1 = torch.linalg.qr(base_tile, mode="complete")
    TypeError: linalg_qr(): argument 'A' (position 1) must be Tensor, not NoneType
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 392, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 5 is out of bounds for dimension 0 with size 5
    


mrfh92 commented Jan 16, 2024

(copied from the PR)
Additional errors:

  • wrong result, but the program runs through:
44 x 54 matrix on 2 procs, split = 0 and 1 tiles per proc.
52 x 32 matrix on 3 procs, split = 0 and 1 tiles per proc.
  • program aborts with an error:
47 x 43 matrix on 2 procs, split = 1 and 1 tiles per proc.
38 x 31 matrix on 12 procs, split = 0 and 1 tiles per proc.


mrfh92 commented Jan 16, 2024

(copied from the PR)
Scripts used for finding the errors (the test script qrtests.py and a Ray-based driver):

import heat as ht 
import os
import argparse

parser = argparse.ArgumentParser(description="Test the QR implementation in HEAT")

# Add arguments
parser.add_argument("-m", type=int, help="number of rows for test matrix", required=True)
parser.add_argument("-n", type=int, help="number of columns for test matrix", required=True)
parser.add_argument("-split", type=int, help="split dimension for test matrix", required=True)
parser.add_argument("-tiles", type=int, help="tiles per process in the QR decomposition", required=True)

# Parse the arguments
args = parser.parse_args()


m = args.m
n = args.n
split = args.split
tiles_per_proc = args.tiles

dtype = ht.float32
device="gpu"

A = ht.random.randn(m,n,split=split,dtype=dtype,device=device)

if A.comm.rank == 0:
    normalfile = open('tested_configs.txt', 'a')
    print(f'{m} x {n} matrix on {ht.MPI_WORLD.size} procs, split = {split} and {tiles_per_proc} tiles per proc.',file=normalfile)
    normalfile.close()

Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)

# check the reconstruction Q @ R ≈ A and the orthogonality of Q
noerror = (
    ht.allclose(A.resplit(Q.split) - (Q @ R), 0, rtol=1e-5, atol=1e-5)
    and ht.allclose(Q.T @ Q, ht.eye(m, device=device, split=Q.T.split), rtol=1e-5, atol=1e-5)
    and ht.allclose(ht.eye(m, device=device, split=Q.split), Q @ Q.T, rtol=1e-5, atol=1e-5)
)

if A.comm.rank == 0:
    if not noerror: 
        errorfile = open('failures_config.txt', 'a')
        print(f'{m} x {n} matrix on {ht.MPI_WORLD.size} procs, split = {split} and {tiles_per_proc} tiles per proc.',file=errorfile)
        errorfile.close()
The second script drives the random search with Ray, launching each configuration as an mpirun subprocess:

import ray
import subprocess
import numpy as np

# Initialize Ray
ray.init()

# Define a remote function
@ray.remote(num_gpus=0.125)
def run_script(params):
    # Call your Python script with MPI and other parameters
    n_procs = params[0]
    n = params[1]
    m = params[2]
    split = params[3]
    tiles = params[4]
    subprocess.run(f"mpirun -n {n_procs} python qrtests.py -n {n} -m {m} -split {split} -tiles {tiles}", shell=True)

# Number of runs
num_runs = 5000
parameter_space = [[np.random.randint(1,9), np.random.randint(20,41), np.random.randint(20,41), np.random.randint(0,2), np.random.randint(1,3)] for _ in range(num_runs)] 

# Generate a random number of processes for each run and set parameters
futures = [run_script.remote(params) for params in parameter_space]

# Execute the script
# If your script returns results, you can gather them
results = ray.get(futures)
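
For reference, a single configuration from the failure lists above can be re-run by hand with the first script (assuming it is saved as qrtests.py, matching the mpirun call in the driver), e.g. the first failing case from the random search:

mpirun -n 2 python qrtests.py -m 32 -n 34 -split 0 -tiles 1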


mrfh92 commented Jan 17, 2024

We have decided:

  • @ClaudiaComito will look into the "wrong results"-error
  • @mrfh92 will look into the "MPI_ERR_TRUNCATE"-error
  • @mtar will look into the "indexing"-error


mrfh92 commented Jan 17, 2024

Regarding the "MPI_ERR_TRUNCATE"-error:

  • this does not depend on whether CPU or GPU is used
  • it also does not depend on the number of processes
  • it seems to be related to whether the buffers passed to Bcast are contiguous or not

Abstracted example:

import heat as ht 
import torch
from time import sleep

A = ht.random.randn(50,50,dtype=ht.float32,device="gpu",split=1)

shape = (25,25)
root = 0

if A.comm.rank == root:
    x = torch.ones(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x.clone(), root=root)
elif A.comm.rank > root:
    x = torch.zeros(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x, root=root)
else:
    x = torch.zeros(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x, root=root)
    sleep(1)

print(A.comm.rank, (x == torch.ones(shape, device=A.larray.device)).all())

yields the same error; for small shapes, however, no problem appears.


mrfh92 commented Jan 17, 2024

Actually, the fix is replacing

r_tiles.arr.comm.Bcast(q1.clone(), root=diag_process)

by

r_tiles.arr.comm.Bcast(q1.clone(memory_format=torch.contiguous_format), root=diag_process)

in line 901 of qr.py.
The fix is in PR #1325.
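
For context, a minimal PyTorch sketch (using a stand-in tensor, not the actual q1 from qr.py) of why a plain clone() is not sufficient: clone() defaults to memory_format=torch.preserve_format, so cloning a non-contiguous tensor such as a transposed view keeps its non-contiguous strides, whereas torch.contiguous_format forces a dense layout.

import torch

q1 = torch.randn(4, 6).T                  # transposed view: non-contiguous
print(q1.is_contiguous())                  # False
print(q1.clone().is_contiguous())          # False: preserve_format keeps the strides
print(q1.clone(memory_format=torch.contiguous_format).is_contiguous())  # True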


mrfh92 commented Jan 18, 2024

I had a look into the "wrong results" error and observed the following:

If you compute QR in a configuration for which the "wrong results" error arises, you can afterwards call print(R) without problems, but print(Q) fails with the following error:

Traceback (most recent call last):
  File "../PythonFiles/qrtests.py", line 16, in <module>
    print(Q)
  File "/home/hopp_fa/heat/heat/core/dndarray.py", line 1808, in __str__
    return printing.__str__(self)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

However, print(Q.larray.device) yields "cuda:0" on both processes (and there is only one GPU because I'm on my workstation...); moreover, it seems to be the case that (Q @ R - A).larray is of order eps on rank 0, but not on rank 1.
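
A small per-rank check along these lines (a sketch, reusing A, Q and R from the qrtests.py script above) makes the asymmetry between the two ranks visible:

# print, on every rank, the device of the local Q tile and the largest
# entry of the local part of the residual Q @ R - A
diff = (A.resplit(Q.split) - Q @ R).larray
print(A.comm.rank, Q.larray.device, diff.abs().max().item())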


mrfh92 commented Jan 22, 2024

This problem has been fixed in #1328 and #1325, respectively, for the moment.
More extensive refactoring of the QR code will be done within #1237.
