Add error checking to CUDA version of getNeighborPairs #80
Conversation
The Autograd class is not allowed to hold any state, right?
Can't one add a host node for that?

On Mon, 16 Jan 2023 at 17:38, Raimondas Galvelis commented on this pull request:

> In src/pytorch/neighbors/getNeighborPairsCUDA.cu (#80 (comment)):
>
>     @@ -151,6 +166,13 @@ public:
>                     get_accessor<scalar_t, 2>(deltas),
>                     get_accessor<scalar_t, 1>(distances),
>                     get_accessor<scalar_t, 2>(box_vectors));
>     +    cudaEventRecord(event, stream);
>     +    cudaEventSynchronize(event);
>     +    // Check the error flag
>     +    TORCH_CHECK(tooManyNeighborsErrorFlag == 0, "Some particle has too many neighbours, found " +
>
> This won't work with CUDA Graphs, because the CPU code will be executed only during a graph creation.
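For context, here is a minimal, self-contained sketch of the "host node" idea (illustrative only, not the NNPOps code; the kernel, flag, and function names are made up). A host function enqueued with cudaLaunchHostFunc runs on the stream and is captured as a host node if that stream is later recorded into a CUDA graph. The documented restriction that no CUDA API calls may be made inside such a function is what limits this approach, as noted further down.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical error flag in managed memory. Reading it from the host
// callback assumes a system with concurrentManagedAccess (e.g. Linux on
// Pascal or newer GPUs).
__managed__ int errorFlag = 0;

// Stand-in for the neighbor kernel: flag an error if the pair storage is too small.
__global__ void findPairsStub(int numAtoms, int maxPairs) {
    int pairs = numAtoms * (numAtoms - 1) / 2;
    if (pairs > maxPairs) atomicExch(&errorFlag, pairs);
}

// Host function: may inspect memory and print, but must not call the CUDA API.
void CUDART_CB checkFlagHost(void* userData) {
    const int* flag = static_cast<const int*>(userData);
    if (*flag != 0)
        std::fprintf(stderr, "Too many pairs found: %d\n", *flag);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    findPairsStub<<<1, 1, 0, stream>>>(4, 4);               // 4 atoms -> 6 pairs, storage for 4
    cudaLaunchHostFunc(stream, checkFlagHost, &errorFlag);  // becomes a host node under graph capture
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}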
What about moving the host-side checking of the flag into the backward pass? There will usually be a lot of other kernels launched between the two, so waiting on the event at that point won't hurt performance. The disadvantage is that the check will be missed if the backward pass is skipped, for example if someone computes only energy but not forces.
What if a user runs just the forward pass?
Right, that's the disadvantage. It would give people error checking in most of the common use cases without significantly hurting performance. But there are use cases where error checking would be skipped.
Since this is going to be part of a graph, the check should go after launching the graph, and whoever launches the graph should be the one checking for errors in its execution. Which function is going to build and launch the graph? An exception from inside a CUDA graph is problematic; one solution is to trigger a cudaError during graph execution. On the other hand, we could report the error in two ways: a direct call to forward can check the host flag after the launch and throw an exception, while a cudaError can be raised if the forward kernel is being launched as part of a CUDA graph execution. EDIT: one cannot call any CUDA API function from inside a callback, so I do not know how to raise a cudaError.
…true will force the function to synchronize and throw an exception if some error was found, so it can be caught. The default will throw the error asynchronously, which will crash the program. In both cases a meaningful message is printed.
This commit introduces a new optional bool flag, check_errors, to getNeighborPairs. The default (False) still checks for errors but throws asynchronously, printing a meaningful message and crashing the program. Inside a CUDA graph the False behavior is forced: the error is thrown asynchronously and the code crashes with a meaningful message. With check_errors=False, error checking is virtually free: there is no synchronization penalty, and since the error flag lives in managed memory there should be no memory-transfer overhead at all when no error occurs.
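To make the asynchronous path concrete, here is a standalone sketch of one way an uncatchable error with a readable message can be produced without any host-side synchronization. It uses a managed flag plus a device-side assert; this is only one possible mechanism, not necessarily what the commit implements, and all names are illustrative.

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__managed__ int tooManyNeighborsErrorFlag = 0;  // written atomically by the kernel

__global__ void countPairsStub(int numAtoms, int maxPairs) {
    int pairs = numAtoms * (numAtoms - 1) / 2;
    if (pairs > maxPairs) {
        atomicExch(&tooManyNeighborsErrorFlag, pairs);
        // Device-side assert (compiled without -DNDEBUG): prints file, line and
        // thread, then poisons the context so the program fails at the next CUDA
        // call with cudaErrorAssert. No host synchronization is needed to detect it.
        assert(pairs <= maxPairs && "Too many neighbor pairs found");
    }
}

int main() {
    countPairsStub<<<1, 1>>>(4, 4);  // 4 atoms -> 6 pairs, storage for 4
    // The launch returns immediately; the failure only surfaces asynchronously.
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}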
While this would make the error synchronous, it would not be catchable AFAIK. Do you know if PyTorch defines something that can be detected on the C++ side when using debug mode? That way I could make the exception synchronous AND catchable in debug mode only. Also, I believe the meaning of check_errors should be the opposite of what you wrote.
In both cases the kernel could write a special value (say -1 or NaN) to, for instance
I followed @raimis' suggestion and added a bit of my own, ending up with the following:

- Add a new optional flag, sync_exceptions, on top of the current check_errors.
- Three behaviors are possible:
  1. Default (both False). The operation is CUDA-graph compatible and an uncatchable exception is thrown if the number of pairs is too high.
  2. check_errors=True. The operation is CUDA-graph compatible. No exception is thrown and the number of found pairs is returned, which can be higher than max_number_pairs.
  3. check_errors=False and sync_exceptions=True. The operation is NOT CUDA-graph compatible. It synchronizes to check for errors and throws a catchable exception if necessary.

The current unit test for this functionality might help understand how it works:

def test_too_many_neighbors(device, dtype):
    if not pt.cuda.is_available() and device == 'cuda':
        pytest.skip('No GPU')
    # 4 points result in 6 pairs, but there is storage for just 4.
    positions = pt.zeros((4, 3,), device=device, dtype=dtype)
    with pytest.raises(RuntimeError):
        # check_errors=False will throw due to exceeding neighbours.
        # sync_exceptions=True makes this exception catchable at the
        # expense of performance (even when no error occurred).
        getNeighborPairs(positions, cutoff=1, max_num_neighbors=1, check_errors=False, sync_exceptions=True)
        pt.cuda.synchronize()
    # check_errors=True will never throw due to exceeding neighbours,
    # but will return the number of pairs found.
    # sync_exceptions is ignored in this case.
    neighbors, deltas, distances, number_found_pairs = getNeighborPairs(positions, cutoff=1, max_num_neighbors=1, check_errors=True)
    assert number_found_pairs == 6
I made some changes to make getNeighborPairs CUDA-graph compatible; now one can do something like:

device = 'cuda'
dtype = pt.float32
num_atoms = 100
positions = 10 * pt.randn((num_atoms, 3), device=device, dtype=dtype)
cutoff = 5
graph = pt.cuda.CUDAGraph()
with pt.cuda.graph(graph):
    neighbors, deltas, distances = getNeighborPairs(positions, cutoff=cutoff, max_num_neighbors=num_atoms*num_atoms)
graph.replay()
pt.cuda.synchronize()
We decided to change the interface so that the number of pairs is always returned, meaning the user can now easily check whether the maximum number of pairs was exceeded. This changes the restrictions of the original problem a bit. We wanted the user to be informed (via an exception, for instance) when the number of pairs found is larger than the maximum allowed. Alas, informing the user in a recoverable way requires synchronizing (slow and incompatible with CUDA graphs), so I believe it is sensible that this functionality is guarded behind a flag. The best we can do, AFAIK, is let the user choose between a fast, CUDA-graph-compatible call that cannot report the error in a recoverable way and a synchronizing call that raises a catchable exception.
Right now we can do that with only the check_errors flag. If you guys are ok with it, I will remove sync_exceptions.

The only option I see, if we really do not want the code to crash, is to let the results be silently wrong when check_errors is False, passing on to the user the responsibility to check the num_pairs return value. In my opinion this function should not let the code progress further if num_pairs > max_num_neighbors. The user is probably not going to bother checking, and the danger of being silently wrong is not tolerable.

For me, the ideal use case for this function would be something as follows:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors
while True:
    try:
        getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors, check_errors=True)
    except RuntimeError:
        max_num_neighbors += 32
        continue
    break
# Fast and CUDA-graph-compatible calls that should never raise, but will crash if an error occurs
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)

But we could also make it something like this:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors. This call will never raise, but may be silently wrong.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
if num_pairs > max_num_neighbors:
    max_num_neighbors = num_pairs + 32
# This will also never raise and may also be silently wrong. This call and the one above are fast and CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
# This will raise if necessary, but is not CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors, check_errors=True)

Let me know what you think!
I don't think a crash is ever a good way of reporting an error. I would vote for combining the flags so you have a single option.
Then check_errors=True would be the default. If you agree that the responsibility to check errors in CUDA graph mode should fall on the user, I will go ahead and implement @peastman's proposal.
Correct.
…with CUDA graphs. If check_errors=False (the default), getNeighborPairs does not check for errors and is compatible with graphs. If check_errors=True, the function raises if necessary, but it is incompatible with graphs.
This simplifies the logic greatly: no kernel-side atomic error flag is required, and the graph can be constructed without a host node.
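As a rough illustration of the simplified scheme, here is a self-contained sketch (illustrative names only, not the actual NNPOps kernel): the kernel always counts the pairs it finds with a plain atomic counter and stores only those that fit in the preallocated buffers, and the only synchronization is on the optional check_errors path, which is exactly the part that cannot be captured in a CUDA graph.

#include <cstdio>
#include <stdexcept>
#include <vector>
#include <cuda_runtime.h>

// One thread per candidate pair (i, j) with i < j, flattened into one index.
__global__ void findPairsStub(const float* pos, int numAtoms, float cutoff2,
                              int maxPairs, int2* pairs, int* numFoundPairs) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    int totalCandidates = numAtoms * (numAtoms - 1) / 2;
    if (p >= totalCandidates) return;
    int i = 0;
    while (p >= numAtoms - 1 - i) { p -= numAtoms - 1 - i; ++i; }
    int j = i + 1 + p;
    float dx = pos[3 * i] - pos[3 * j];
    float dy = pos[3 * i + 1] - pos[3 * j + 1];
    float dz = pos[3 * i + 2] - pos[3 * j + 2];
    if (dx * dx + dy * dy + dz * dz > cutoff2) return;
    int slot = atomicAdd(numFoundPairs, 1);              // the count is always correct
    if (slot < maxPairs) pairs[slot] = make_int2(i, j);  // store only what fits
}

int main() {
    const int numAtoms = 4, maxPairs = 4;            // 4 atoms -> 6 pairs, storage for 4
    const bool check_errors = true;                  // mimics the getNeighborPairs flag
    std::vector<float> hostPos(3 * numAtoms, 0.0f);  // all atoms at the origin
    float* pos; int2* pairs; int* numFoundPairs;
    cudaMalloc(&pos, hostPos.size() * sizeof(float));
    cudaMalloc(&pairs, maxPairs * sizeof(int2));
    cudaMallocManaged(&numFoundPairs, sizeof(int));
    *numFoundPairs = 0;
    cudaMemcpy(pos, hostPos.data(), hostPos.size() * sizeof(float), cudaMemcpyHostToDevice);

    findPairsStub<<<1, 64>>>(pos, numAtoms, 1.0f, maxPairs, pairs, numFoundPairs);

    try {
        if (check_errors) {
            // This synchronization is the only reason check_errors=True is not
            // CUDA-graph compatible; in the extension it would be a TORCH_CHECK.
            cudaDeviceSynchronize();
            if (*numFoundPairs > maxPairs)
                throw std::runtime_error("Too many neighbor pairs found");
        }
    } catch (const std::exception& e) {
        std::fprintf(stderr, "caught: %s\n", e.what());
    }

    // Without check_errors, the caller just receives numFoundPairs as an extra
    // return value and is responsible for checking it.
    cudaDeviceSynchronize();
    std::printf("found %d pairs (storage for %d)\n", *numFoundPairs, maxPairs);
    cudaFree(pos); cudaFree(pairs); cudaFree(numFoundPairs);
    return 0;
}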
Enforce that the found number of pairs is less than num_pairs
…enforced. Right now this does not pass, since the function allows an atom to have more neighbors than max_num_neighbors as long as num_found_pairs < num_atoms*max_num_neighbors.
… neighbors per particle) to max_num_pairs (maximum number of total pairs).
This is ready for review again.
Ping @peastman
This addresses all the issues I raised. Looks good to me now!
@RaulPPelaez I suppose this is ready to merge?
This is ready for merge.
I have added a flag in managed memory. It is written atomically if too many neighbors are found for some particle.
It is checked using an event just after the kernel executes.
All tests are passing (even the too_many_neighbors one, on the GPU).
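A minimal standalone sketch of this scheme, assuming illustrative names (this is not the actual NNPOps code): the kernel writes a managed flag atomically when a particle exceeds the available neighbor storage, and the host waits on an event recorded right after the launch and then inspects the flag. Reading the managed flag from the host after an event synchronization assumes a system with concurrentManagedAccess (e.g. Linux on Pascal or newer GPUs).

#include <cstdio>
#include <cuda_runtime.h>

__managed__ int tooManyNeighborsErrorFlag = 0;

// Stand-in for the real kernel: every atom "finds" numAtoms - 1 neighbors.
__global__ void findNeighborsStub(int numAtoms, int maxNeighborsPerAtom) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAtoms) return;
    int neighbors = numAtoms - 1;
    if (neighbors > maxNeighborsPerAtom)
        atomicExch(&tooManyNeighborsErrorFlag, neighbors);  // atomic write to the managed flag
}

int main() {
    cudaStream_t stream;
    cudaEvent_t event;
    cudaStreamCreate(&stream);
    cudaEventCreate(&event);

    findNeighborsStub<<<1, 32, 0, stream>>>(4, 1);  // 3 neighbors each, storage for 1

    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);  // wait only for the work recorded so far
    // In the PyTorch extension this check would be a TORCH_CHECK; here we just report it.
    if (tooManyNeighborsErrorFlag != 0)
        std::fprintf(stderr, "Some particle has too many neighbours, found %d\n",
                     tooManyNeighborsErrorFlag);

    cudaEventDestroy(event);
    cudaStreamDestroy(stream);
    return tooManyNeighborsErrorFlag == 0 ? 0 : 1;
}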