
[Bug]: QR fails on two processes when using CUDA #1317

Closed
mrfh92 opened this issue Jan 8, 2024 · 14 comments
Labels: bug · High priority, urgent · linalg

Comments


mrfh92 commented Jan 8, 2024

What happened?

Tests for QR fail on two processes when using CUDA.

Code snippet triggering the error

No response

Error message or erroneous outcome

No response

Version

main (development branch)

Python version

None

PyTorch version

None

MPI version

No response

@mrfh92 mrfh92 added the bug Something isn't working label Jan 8, 2024

mrfh92 commented Jan 8, 2024

Error message:

>               self.assertTrue(ht.allclose(a_comp2, qr2.Q @ qr2.R, rtol=1e-5, atol=1e-5))
E               AssertionError: False is not true
heat/core/linalg/tests/test_qr.py:77: AssertionError

@mrfh92 mrfh92 added the linalg label Jan 8, 2024
@ClaudiaComito ClaudiaComito self-assigned this Jan 8, 2024

mtar commented Jan 9, 2024

The failure is on two processes.

@mtar mtar changed the title from "[Bug]: QR fails on one process when using CUDA" to "[Bug]: QR fails on two processes when using CUDA" Jan 9, 2024
@mtar mtar linked a pull request Jan 9, 2024 that will close this issue
@mrfh92 mrfh92 self-assigned this Jan 10, 2024

mrfh92 commented Jan 10, 2024

I can reproduce this error on my workstation (CUDA 11.4, PyTorch 2.0.0). Interestingly, the following error arises only for exactly 2 processes:

FAIL: test_qr (heat.core.linalg.tests.test_qr.TestQR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/hopp_fa/heat/heat/core/linalg/tests/test_qr.py", line 67, in test_qr
    self.assertTrue(ht.allclose((a_comp1 - (qr1.Q @ qr1.R)), 0, rtol=1e-5, atol=1e-5))
AssertionError: False is not true

----------------------------------------------------------------------

@mrfh92 mrfh92 added the High priority, urgent label Jan 11, 2024

mrfh92 commented Jan 16, 2024

(copied from the PR)
@ClaudiaComito @mtar I'm currently doing a random search (1000 samples) on my workstation with 1-8 processes, (20-40) x (20-40) matrices (float32 on GPU), split=0,1, and tiles_per_proc=1,2 in order to detect possible further problematic configurations.
tested_configs.txt

Different kinds of errors arose:

  • 13 failures in the sense that the program terminated, but with wrong results:
75: 32 x 34 matrix on 2 procs, split = 0 and 1 tiles per proc.
109: 36 x 30 matrix on 2 procs, split = 0 and 1 tiles per proc.
171: 37 x 29 matrix on 2 procs, split = 0 and 1 tiles per proc.
226: 37 x 28 matrix on 2 procs, split = 0 and 1 tiles per proc.
378: 35 x 35 matrix on 2 procs, split = 0 and 1 tiles per proc.
487: 40 x 40 matrix on 2 procs, split = 0 and 1 tiles per proc.
641: 36 x 24 matrix on 2 procs, split = 0 and 1 tiles per proc.
686: 34 x 36 matrix on 2 procs, split = 0 and 1 tiles per proc.
706: 32 x 38 matrix on 2 procs, split = 0 and 1 tiles per proc.
761: 39 x 22 matrix on 2 procs, split = 0 and 1 tiles per proc.
771: 39 x 39 matrix on 2 procs, split = 0 and 1 tiles per proc.
936: 34 x 40 matrix on 2 procs, split = 0 and 1 tiles per proc.
982: 32 x 30 matrix on 2 procs, split = 0 and 1 tiles per proc.

All of these failures happen on 2 processes, split=0 and 1 tiles per proc.

  • 4 additional failures in the sense that the program terminated with an error:

    Test 201 of 2000: 20 x 24 matrix on 6 procs, split = 0 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 354, in __split0_r_calc
        q1, r1 = torch.linalg.qr(base_tile, mode="complete")
    TypeError: linalg_qr(): argument 'A' (position 1) must be Tensor, not NoneType
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 392, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 4 is out of bounds for dimension 0 with size 4
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 453, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 4 is out of bounds for dimension 0 with size 4
    

    Test 337 of 2000: 40 x 30 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    

    Test 610 of 2000: 40 x 31 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    

    Test 972 of 2000: 40 x 30 matrix on 2 procs, split = 1 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 170, in qr
        __split1_qr_loop(dcol=dcol, r_tiles=r_tiles, q0_tiles=q_tiles, calc_q=calc_q)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 929, in __split1_qr_loop
        r_tiles.arr.comm.Bcast(q1, root=diag_process)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 746, in Bcast
        ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
      File "/home/hopp_fa/heat/heat/core/communication.py", line 733, in __broadcast_like
        return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
      File "mpi4py/MPI/Comm.pyx", line 691, in mpi4py.MPI.Comm.Bcast
    mpi4py.MPI.Exception: MPI_ERR_TRUNCATE: message truncated
    


mrfh92 commented Jan 16, 2024

(copied from the PR)
I have repeated this test with 500 samples, for 1-8 processes, but only with split=0 and tiles_per_proc=1.
Again, there are two kinds of errors:

  • wrong result, but the program completes

  • program does not complete due to an error:

    Test 128 of 500: 29 x 33 matrix on 7 procs, split = 0 and 1 tiles per proc.

    No protocol specified
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 354, in __split0_r_calc
        q1, r1 = torch.linalg.qr(base_tile, mode="complete")
    TypeError: linalg_qr(): argument 'A' (position 1) must be Tensor, not NoneType
    Traceback (most recent call last):
      File "qrtests.py", line 20, in <module>
        Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 146, in qr
        __split0_r_calc(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 392, in __split0_r_calc
        __split0_merge_tile_rows(
      File "/home/hopp_fa/heat/heat/core/linalg/qr.py", line 524, in __split0_merge_tile_rows
        lower_inds = r_tiles.get_start_stop(key=(lower_row, column))
      File "/home/hopp_fa/heat/heat/core/tiling.py", line 853, in get_start_stop
        pr = self.tile_map[key][..., 2].unique()
    IndexError: index 5 is out of bounds for dimension 0 with size 5
    


mrfh92 commented Jan 16, 2024

(copied from the PR)
Additional errors:

  • wrong result, but the program runs through:
44 x 54 matrix on 2 procs, split = 0 and 1 tiles per proc.
52 x 32 matrix on 3 procs, split = 0 and 1 tiles per proc.
  • program aborts with an error:
47 x 43 matrix on 2 procs, split = 1 and 1 tiles per proc.
38 x 31 matrix on 12 procs, split = 0 and 1 tiles per proc.


mrfh92 commented Jan 16, 2024

(copied from the PR)
Scripts used for finding the errors (the test script qrtests.py and a Ray-based driver):

import heat as ht 
import os
import argparse

parser = argparse.ArgumentParser(description="Test the QR implementation in HEAT")

# Add arguments
parser.add_argument("-m", type=int, help="number of rows for test matrix", required=True)
parser.add_argument("-n", type=int, help="number of columns for test matrix", required=True)
parser.add_argument("-split", type=int, help="split dimension for test matrix", required=True)
parser.add_argument("-tiles", type=int, help="tiles per process in the QR decomposition", required=True)

# Parse the arguments
args = parser.parse_args()


m = args.m
n = args.n
split = args.split
tiles_per_proc = args.tiles

dtype = ht.float32
device="gpu"

A = ht.random.randn(m,n,split=split,dtype=dtype,device=device)

if A.comm.rank == 0:
    normalfile = open('tested_configs.txt', 'a')
    print(f'{m} x {n} matrix on {ht.MPI_WORLD.size} procs, split = {split} and {tiles_per_proc} tiles per proc.',file=normalfile)
    normalfile.close()

Q,R = ht.linalg.qr(A,tiles_per_proc=tiles_per_proc)

# check the reconstruction Q @ R ≈ A and the orthogonality of Q
noerror = (
    ht.allclose(A.resplit(Q.split) - (Q @ R), 0, rtol=1e-5, atol=1e-5)
    and ht.allclose(Q.T @ Q, ht.eye(m, device=device, split=Q.T.split), rtol=1e-5, atol=1e-5)
    and ht.allclose(ht.eye(m, device=device, split=Q.split), Q @ Q.T, rtol=1e-5, atol=1e-5)
)

if A.comm.rank == 0:
    if not noerror: 
        errorfile = open('failures_config.txt', 'a')
        print(f'{m} x {n} matrix on {ht.MPI_WORLD.size} procs, split = {split} and {tiles_per_proc} tiles per proc.',file=errorfile)
        errorfile.close()
The second script drives the random search with Ray, launching each configuration as an mpirun subprocess:

import ray
import subprocess
import numpy as np

# Initialize Ray
ray.init()

# Define a remote function
@ray.remote(num_gpus=0.125)
def run_script(params):
    # Call your Python script with MPI and other parameters
    n_procs = params[0]
    n = params[1]
    m = params[2]
    split = params[3]
    tiles = params[4]
    subprocess.run(f"mpirun -n {n_procs} python qrtests.py -n {n} -m {m} -split {split} -tiles {tiles}", shell=True)

# Number of runs
num_runs = 5000
parameter_space = [[np.random.randint(1,9), np.random.randint(20,41), np.random.randint(20,41), np.random.randint(0,2), np.random.randint(1,3)] for _ in range(num_runs)] 

# Generate a random number of processes for each run and set parameters
futures = [run_script.remote(params) for params in parameter_space]

# Execute the script
# If your script returns results, you can gather them
results = ray.get(futures)
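
For reference, a single configuration from the failure lists above can be re-run by hand with the first script (assuming it is saved as qrtests.py, matching the mpirun call in the driver), e.g. the first failing case from the random search:

mpirun -n 2 python qrtests.py -m 32 -n 34 -split 0 -tiles 1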


mrfh92 commented Jan 17, 2024

We have decided:

  • @ClaudiaComito will look into the "wrong results"-error
  • @mrfh92 will look into the "MPI_ERR_TRUNCATE"-error
  • @mtar will look into the "indexing"-error


mrfh92 commented Jan 17, 2024

Regarding the "MPI_ERR_TRUNCATE"-error:

  • this does not depend on whether CPU or GPU is used
  • it also does not depend on the number of processes
  • it seems to be related to whether the buffers passed to Bcast are contiguous or not

Abstracted example:

import heat as ht 
import torch
from time import sleep

A = ht.random.randn(50,50,dtype=ht.float32,device="gpu",split=1)

shape = (25,25)
root = 0

if A.comm.rank == root:
    x = torch.ones(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x.clone(), root=root)
elif A.comm.rank > root:
    x = torch.zeros(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x, root=root)
else:
    x = torch.zeros(shape, dtype=A.dtype.torch_type(), device=A.larray.device)
    print(A.comm.rank, x.shape, x.dtype, x.device, x.is_contiguous())
    A.comm.Bcast(x, root=root)
    sleep(1)

print(A.comm.rank, (x == torch.ones(shape, device=A.larray.device)).all())

yields the same error; for small shapes, however, no problem appears.


mrfh92 commented Jan 17, 2024

Actually, the fix is replacing

r_tiles.arr.comm.Bcast(q1.clone(), root=diag_process)

by

r_tiles.arr.comm.Bcast(q1.clone(memory_format=torch.contiguous_format), root=diag_process)

in line 901 of qr.py.
The fix is in PR #1325.
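
For context, a minimal PyTorch sketch (using a stand-in tensor, not the actual q1 from qr.py) of why a plain clone() is not sufficient: clone() defaults to memory_format=torch.preserve_format, so cloning a non-contiguous tensor such as a transposed view keeps its non-contiguous strides, whereas torch.contiguous_format forces a dense layout.

import torch

q1 = torch.randn(4, 6).T                  # transposed view: non-contiguous
print(q1.is_contiguous())                  # False
print(q1.clone().is_contiguous())          # False: preserve_format keeps the strides
print(q1.clone(memory_format=torch.contiguous_format).is_contiguous())  # True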


mrfh92 commented Jan 18, 2024

I had a look into the "wrong results" error and observed the following:

If you compute QR in a configuration for which the "wrong results" error arises, you can afterwards call print(R) without problems, but print(Q) fails with the following error:

Traceback (most recent call last):
  File "../PythonFiles/qrtests.py", line 16, in <module>
    print(Q)
  File "/home/hopp_fa/heat/heat/core/dndarray.py", line 1808, in __str__
    return printing.__str__(self)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 198, in __str__
    tensor_string = _tensor_str(dndarray, __INDENT + 1)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 285, in _tensor_str
    torch_data = _torch_data(dndarray, summarize)
  File "/home/hopp_fa/heat/heat/core/printing.py", line 252, in _torch_data
    data = torch.index_select(data, i, torch.arange(end))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

However, print(Q.larray.device) yields "cuda:0" on both processes (and there is only one GPU because I'm on my workstation...); moreover, it seems to be the case that (Q @ R - A).larray is of order eps on rank 0, but not on rank 1.
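
A small per-rank check along these lines (a sketch, reusing A, Q and R from the qrtests.py script above) makes the asymmetry between the two ranks visible:

# print, on every rank, the device of the local Q tile and the largest
# entry of the local part of the residual Q @ R - A
diff = (A.resplit(Q.split) - Q @ R).larray
print(A.comm.rank, Q.larray.device, diff.abs().max().item())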


mrfh92 commented Jan 22, 2024

This problem has been fixed in #1328 and #1325, respectively, for the moment.
More extensive refactoring of the QR code will be done within #1237.
