
Efficient error reporting in CUDA #79

Closed
RaulPPelaez opened this issue Jan 12, 2023 · 9 comments
Labels
enhancement New feature or request


@RaulPPelaez
Contributor

RaulPPelaez commented Jan 12, 2023

We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU; in particular, detecting when a particle has more neighbours than the maximum allowed, replacing this:

assert(i_pair < neighbors.size(1));

which currently leaves the CUDA context in an invalid state, requiring a full context reset.

Additionally, the strategy should be compatible with CUDA graphs.
This is related to PR #70.

The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.

I think we should handle this similarly to native CUDA error reporting: build an error-checking function into the interface that is allowed to synchronize and memcpy.

The class building the list, here

class Autograd : public Function<Autograd> {

could own a device array/value storing error states (maybe an enum, or a simple integer); the kernel building the neighbour list would atomically set this error state instead of triggering the assertion above.
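A minimal sketch of what the kernel side could look like under this scheme (hypothetical names; `error_flag` would be the device integer owned by the class):

```cuda
#include <cstdint>

// Hypothetical error codes; the class would reset the flag to NO_ERROR
// (e.g. with cudaMemsetAsync) before each rebuild of the list.
enum ErrorCode : int32_t { NO_ERROR = 0, TOO_MANY_NEIGHBORS = 1 };

// Called from the list-building kernel before writing pair i_pair.
__device__ inline bool reserveSlot(int i_pair, int max_num_neighbors,
                                   int32_t* error_flag) {
    if (i_pair < max_num_neighbors)
        return true;
    // Record the error instead of assert()-ing, which would leave the
    // CUDA context in an invalid state. atomicExch keeps the write
    // well-defined when many threads fail simultaneously.
    atomicExch(error_flag, TOO_MANY_NEIGHBORS);
    return false; // caller drops the pair; list is truncated but usable
}
```

Whether the flag is a single integer or a per-error enum is a design choice; a single atomically-written integer is the simplest thing that survives concurrent failures.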

Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to

static tensor_list forward(AutogradContext* ctx,
const Tensor& positions,
const Scalar& cutoff,
const Scalar& max_num_neighbors,
const Tensor& box_vectors) {

with increasing max_num_neighbors could be made to determine an upper bound for it. The graph would then be constructed in a way such that this error state is no longer automatically checked.

This of course has the downside that errors would go silent during a simulation, and the code would crash in an uncontrolled way.

@RaulPPelaez
Contributor Author

What is the intended usage of this function

def getNeighborPairs(positions: Tensor, cutoff: float, max_num_neighbors: int = -1, box_vectors: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]:

when the maximum number of neighbours is not known beforehand?

Is the user expected to find out this number outside of the interface, or to just be conservative?

I would expect the function to autocompute it by default, or to provide some way to report failure so the caller can quickly try again when the provided max_num_neighbors is not enough.

@peastman
Member

I believe this is the most efficient way of communicating an error state back to the host.

  • Allocate the flag in host memory with cudaHostAlloc().
  • Clear the flag before launching the kernel.
  • If an error occurs, the kernel sets the flag.
  • On the host, record an event after launching the kernel.
  • To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.
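The steps above could look roughly like this on the host (a sketch, not the repository's code; `buildListKernel` and the launch parameters are placeholders):

```cuda
int32_t* error_flag;            // pinned, device-visible host memory
cudaHostAlloc(&error_flag, sizeof(int32_t), cudaHostAllocMapped);
cudaEvent_t listDone;
cudaEventCreateWithFlags(&listDone, cudaEventDisableTiming);

*error_flag = 0;                // clear the flag before launching
buildListKernel<<<blocks, threads, 0, stream>>>(/* ..., */ error_flag);
cudaEventRecord(listDone, stream);  // record an event after the launch

// ... launch other independent kernels here so the GPU keeps working ...

cudaEventSynchronize(listDone); // delayed as long as possible
if (*error_flag != 0) {
    // grow max_num_neighbors and rebuild, or raise an exception
}
```

Note that on systems with unified virtual addressing the pointer returned by `cudaHostAlloc` can be passed to the kernel directly; otherwise the device-side alias from `cudaHostGetDevicePointer()` is needed.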

@peastman
Member

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.
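For illustration (not the library's exact layout), a fixed-shape format like the one described could look like this in plain Python, padding unused slots with -1; exceeding the capacity is exactly the failure mode discussed in this issue:

```python
def neighbor_matrix(positions, cutoff, max_num_neighbors):
    """Build an (n_atoms, max_num_neighbors) table padded with -1.
    Pure-Python sketch with hypothetical names; positions are 1D
    coordinates for simplicity."""
    n = len(positions)
    rows = [[-1] * max_num_neighbors for _ in range(n)]
    counts = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= cutoff:
                if counts[i] >= max_num_neighbors:
                    # the situation the error flag must report
                    raise ValueError("more neighbors than max_num_neighbors")
                rows[i][counts[i]] = j
                counts[i] += 1
    return rows

print(neighbor_matrix([0.0, 0.1, 5.0], cutoff=1.0, max_num_neighbors=2))
# → [[1, -1], [0, -1], [-1, -1]]
```

A variable-length format (e.g. a flat pair list plus per-atom offsets) would remove the fixed second axis entirely.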

@RaulPPelaez
Contributor Author

Sounds good.

I believe this is the most efficient way of communicating an error state back to the host.

* Allocate the flag in host memory with `cudaHostAlloc()`.

* Clear the flag before launching the kernel.

* If an error occurs, the kernel sets the flag.

If the flag is in managed memory, setting it on the host before the kernel launch should not incur a memcpy or a sync as long as the kernel never accesses the flag (no error), right?

* On the host, record an event after launching the kernel.

* To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.

There is then another question: when is the event synced?

  • At some point before leaving the function?
  • As the first step the next time the function is called?
  • The user must explicitly request the error state?

@RaulPPelaez
Contributor Author

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.

Is it an actual requirement to know the shape of the tensor before calling? If it is not, the getNeighborPairs function could just allocate whatever it needs. For instance, max_num_neighs=-2 could mean: figure out how many you need.
In that case we could just do internally:

while (not tryToBuildList(..., max_num_neighs))
    max_num_neighs += 32;
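The grow-and-retry idea in the loop above can be sketched in Python with a stand-in for the real builder (`try_to_build_list` returning a success flag is an assumption of this sketch):

```python
def build_with_autogrow(try_to_build_list, max_num_neighs=32, growth=32):
    """Retry the (hypothetical) list builder, growing the per-atom
    capacity by fixed steps until it succeeds; returns the capacity
    that finally worked."""
    while not try_to_build_list(max_num_neighs):
        max_num_neighs += growth
    return max_num_neighs

# Stand-in builder: succeeds once the capacity reaches the true
# maximum neighbour count (here 100). Steps: 32, 64, 96, 128.
print(build_with_autogrow(lambda cap: cap >= 100))  # → 128
```

Each failed attempt costs a rebuild, so growing by a coarse step (or doubling) keeps the number of retries small.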

@RaulPPelaez
Contributor Author

On the other hand, if the format can change wildly we could try a cell list; it is really fast to construct. To traverse it, instead of having a "private" nlist for each particle, you get the list of particles in each cell and go over the 27 neighbouring cells to find your neighbours.
This way you eliminate the need for max_num_neighs. The cell list is stored in three arrays: one of size N and two of size ncells = int(L/rc)^3, all of which are known in advance.
The code gets messier, though.
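A pure-Python sketch of that three-array layout (hypothetical names): `cell_of` has size N, while `cell_start` and `cell_count` have size ncells, so every allocation is known before any neighbour is found:

```python
def build_cell_list(positions, L, rc):
    """Bin 3D positions in a cubic box of side L into cells of side >= rc.
    Returns (cell_of, cell_start, cell_count); the neighbours of a
    particle are then found by scanning the 27 cells around its own."""
    m = int(L / rc)              # cells per side
    ncells = m ** 3

    def cell_index(p):
        ix, iy, iz = (min(int(c / L * m), m - 1) for c in p)
        return (iz * m + iy) * m + ix

    cell_of = [cell_index(p) for p in positions]     # size N
    cell_count = [0] * ncells                        # size ncells
    for c in cell_of:
        cell_count[c] += 1
    # Exclusive prefix sum gives where each cell's particles start
    # once the particles are sorted by cell index.
    cell_start = [0] * ncells                        # size ncells
    for c in range(1, ncells):
        cell_start[c] = cell_start[c - 1] + cell_count[c - 1]
    return cell_of, cell_start, cell_count
```

On the GPU the counts would be accumulated with atomics and the prefix sum done with a scan, but the memory footprint is the same and fixed in advance.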

@peastman
Member

The user must explicitly request the error state?

Users rarely check error states. If they have to take extra steps to see if there was an error, it's not a lot better than no error checking at all.

@RaulPPelaez
Contributor Author

#80

@raimis raimis added the enhancement New feature or request label Mar 2, 2023
@RaulPPelaez
Contributor Author

Closed as completed
