Add error checking to CUDA version of getNeighborPairs #80

Merged · 46 commits · Apr 14, 2023

Conversation

@RaulPPelaez
Contributor

RaulPPelaez commented Jan 16, 2023

I have added a flag in managed memory. It is atomically written if too many neighbors are found for some particle.
It is checked using an event just after execution of the kernel.

All tests are passing (even the too_many_neighbors one on the GPU).
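
A minimal PyTorch-level sketch of that pattern, assuming the flag is exposed as a device tensor (in the actual kernel it lives in CUDA managed memory, and the launch below is a hypothetical placeholder):

import torch as pt

# Hypothetical stand-in for the managed-memory error flag: the kernel
# atomically sets it when some particle exceeds the neighbor capacity.
flag = pt.zeros(1, dtype=pt.int32, device='cuda')
event = pt.cuda.Event()

# launch_neighbor_kernel(positions, flag)  # hypothetical kernel launch
event.record()       # record on the current stream, right after the kernel
event.synchronize()  # wait only for work queued up to the event
if flag.item() != 0:
    raise RuntimeError('Some particle found too many neighbors')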

@RaulPPelaez
Contributor Author

The Autograd class is not allowed to hold any state, right?
As of now, the forward function must create and delete an event every time it is called. Any ideas on how to move event creation elsewhere?

@RaulPPelaez
Contributor Author

RaulPPelaez commented Jan 16, 2023 via email

@peastman
Member

What about moving the host-side checking of the flag into the backward pass? There will usually be a lot of other kernels launched between the two, so waiting on the event at that point won't hurt performance. The disadvantage is that the check will be missed if the backward pass is skipped, for example if someone computes only the energy but not the forces.
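
A rough sketch of that idea as a PyTorch custom autograd Function; the kernel launcher, the flag check, and the gradient are hypothetical placeholders, not the actual NNPOps implementation:

import torch as pt

def launch_neighbor_kernel(positions):  # hypothetical stand-in
    return pt.empty(0, device=positions.device)

def check_error_flag():  # hypothetical: would raise if the flag was set
    pass

class GetNeighborPairs(pt.autograd.Function):
    @staticmethod
    def forward(ctx, positions):
        neighbors = launch_neighbor_kernel(positions)
        event = pt.cuda.Event()
        event.record()     # marks the point after which the flag is valid
        ctx.event = event  # ctx can carry per-call state between passes
        return neighbors

    @staticmethod
    def backward(ctx, grad_output):
        # By the time backward runs, many other kernels have usually been
        # launched in between, so waiting on the event is effectively free.
        ctx.event.synchronize()
        check_error_flag()
        return None  # the real gradient w.r.t. positions would go here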

@raimis
Contributor

raimis commented Jan 17, 2023

What if a user runs just the forward pass?

@peastman
Member

Right, that's the disadvantage. It would give people error checking in most of the common use cases without significantly hurting performance. But there exist use cases where error checking would be skipped.

@RaulPPelaez
Contributor Author

RaulPPelaez commented Jan 18, 2023

Since this is going to be part of a graph, the check should go after launching the graph, and the entity doing so should be the one checking for errors in its execution. What function is going to be building and launching the graph?

An exception from inside a CUDA graph is problematic. One solution is triggering a cudaError during graph execution, for instance setting cudaErrorInvalidValue by calling cudaMemcpy(NULL, NULL, 0, 0). The error would then appear the next time the CUDA error state is checked after launching the graph.

OTOH we could report the error in two ways: a direct call to forward can check the host flag and throw an exception, while a cudaError can be raised if the forward kernel is executed as part of a CUDA graph.

EDIT: One cannot call any CUDA API function from inside a callback, so I do not know how to raise a cuda error.

…true will force the function to synchronize and throw an exception if some error was found, so it can be caught.
The default will throw the error asynchronously, which will crash the program.
In both cases a meaningful message is printed.
@RaulPPelaez
Contributor Author

This commit introduces a new optional bool flag, check_errors, to getNeighborPairs. The default (False) will check for errors and throw asynchronously, printing a meaningful message but crashing the program.
Passing True will instead synchronize and raise the error synchronously, so that the exception can be handled. The same message is printed in both cases.

Inside a CUDA graph the False behavior is forced: the error is thrown asynchronously, crashing the code with a meaningful message.

With False, error checking is virtually free: there is no synchronization penalty, and since the error flag lives in managed memory there should be no memory-transfer footprint at all if the error did not happen.
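
Under the semantics described in this comment (they are revised later in the PR), usage would look roughly like this; a sketch, assuming NNPOps is installed and a CUDA device is available:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.zeros((4, 3), device='cuda')  # 4 coincident points -> 6 pairs
try:
    # check_errors=True synchronizes, so the overflow surfaces as a
    # catchable exception instead of an asynchronous crash.
    getNeighborPairs(positions, cutoff=1.0, max_num_neighbors=1, check_errors=True)
except RuntimeError as e:
    print('too many neighbors:', e)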

@raimis
Contributor

raimis commented Feb 6, 2023

  • Regular mode (check_errors = False): @RaulPPelaez's solution with cudaLaunchHostFunc is very good.
    • Pros
      • A user will get a meaningful error message
      • The overhead is insignificant
      • It is compatible with CUDA Graphs
    • Cons
      • The error will happen asynchronously
      • There is no way to catch the error
  • Advanced mode (check_errors = True): I think getNeighborPairs should just return the number of detected pairs, and it is up to the user to check if there was enough space. This eliminates the cons of the regular mode by passing the burden of error checking to the users. The current approach with synchronization is only good for debugging, not production. The same effect can be achieved with the debug mode of PyTorch, where it runs the kernel synchronously.

@RaulPPelaez
Contributor Author

  • The same effect can be achieved with the debug mode of PyTorch, where it runs the kernel synchronously.

While this would make the error synchronous, it would not be catchable AFAIK. Do you know if PyTorch defines something that can be detected C++-side when using debug mode? That way I could make the exception synchronous AND catchable in debug mode only.

Also, I believe the meaning of check_errors should be the opposite of what you wrote:

  • True should check and throw asynchronously.
  • False would not even check host-side.

In both cases the kernel could write a special value (say -1 or NaN) to, for instance, neighbors[1][max_pairs-1] when some particle has too many neighbors. At least with the current format and strategy, I think this can be implemented without overhead when there is no error.
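
A sketch of how a caller could detect such a sentinel; the -1 convention is the one proposed above, not a shipped API:

import torch as pt

def overflowed(neighbors: pt.Tensor) -> bool:
    # neighbors is the (2, max_pairs) pair-index tensor; under the proposal
    # the kernel writes -1 into the last slot when capacity is exceeded.
    return bool(neighbors[1, -1].item() == -1)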

RaulPPelaez mentioned this pull request Mar 3, 2023
- Add a new optional flag, sync_exceptions, on top of the current check_errors.
- Three behaviors are possible:
  1. Default (both false). Operation is CUDA-graph compatible and an uncatchable exception is thrown in case of the number of pairs being too high.
  2. check_errors=True. Operation is CUDA-graph compatible. No exception is thrown and the number of found pairs is returned, which can be higher than max_number_pairs.
  3. check_errors=False and sync_exceptions=True. Operation is NOT CUDA-graph compatible. The operation synchronizes to check for errors and throws a catchable exception if necessary.
@RaulPPelaez
Contributor Author

I followed @raimis' suggestion and added a bit of mine, ending up with the following.
Three behaviors are possible:

  1. Default (both false). Operation is CUDA-graph compatible and an uncatchable exception is thrown in case of the number of pairs being too high.
  2. check_errors=True. Operation is CUDA-graph compatible. No exception is thrown and the number of found pairs is returned as a fourth output argument, which can be higher than max_number_pairs.
  3. check_errors=False and sync_exceptions=True. Operation is NOT CUDA-graph compatible. The operation synchronizes to check for errors and throws a catchable exception if necessary.

Pros:
- Default behavior is the same as without this PR, but CUDA-graph compatible.
- Users can choose to make exceptions recoverable.
- Users can choose to manage errors themselves and keep things CUDA-graph compatible.

Cons:
- There are two parameters dealing with just error reporting.
- No unit test can be written for the default behavior, since it crashes the process in a way pytest cannot intercept.

The current unit test for this functionality might help understand how it works:

import pytest
import torch as pt
from NNPOps.neighbors import getNeighborPairs

# device and dtype are pytest fixtures/parameters in the full test file
def test_too_many_neighbors(device, dtype):
    if not pt.cuda.is_available() and device == 'cuda':
        pytest.skip('No GPU')
    # 4 points result in 6 pairs, but there is storage for only 4.
    positions = pt.zeros((4, 3), device=device, dtype=dtype)
    with pytest.raises(RuntimeError):
        # check_errors=False will throw due to exceeding neighbors.
        # sync_exceptions=True makes this exception catchable at the
        # expense of performance (even when no error occurred).
        getNeighborPairs(positions, cutoff=1, max_num_neighbors=1, check_errors=False, sync_exceptions=True)
        pt.cuda.synchronize()

    # check_errors=True will never throw due to exceeding neighbors,
    # but will return the number of pairs found.
    # sync_exceptions is ignored in this case.
    neighbors, deltas, distances, number_found_pairs = getNeighborPairs(positions, cutoff=1, max_num_neighbors=1, check_errors=True)
    assert number_found_pairs == 6

@RaulPPelaez
Contributor Author

I made some changes to make getNeighborPairs CUDA-graph compatible; now one can do something like:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

device = 'cuda'
dtype = pt.float32
num_atoms = 100
positions = 10 * pt.randn((num_atoms, 3), device=device, dtype=dtype)
cutoff = 5
graph = pt.cuda.CUDAGraph()
with pt.cuda.graph(graph):
    neighbors, deltas, distances = getNeighborPairs(positions, cutoff=cutoff, max_num_neighbors=num_atoms*num_atoms)

graph.replay()
pt.cuda.synchronize()

@RaulPPelaez
Contributor Author

@peastman @raimis , this is ready for review.

@RaulPPelaez
Contributor Author

We decided to change the interface so that the number of pairs is always returned, meaning the user can now easily check whether the maximum number of pairs was exceeded. This changes the restrictions of the original problem a bit.

We wanted the user to be informed (via an exception, for instance) when the number of pairs found is larger than the maximum allowed.

Alas, informing the user in a recoverable way requires synchronizing (slow and incompatible with CUDA graphs), so I believe it is sensible to guard this functionality behind a flag.

The best we can do, AFAIK, is let the user choose between:

  1. Fast, but (informatively) crashes in the event of an error.
  2. Slow, but you can recover from the error.

Right now we can do that with only the check_errors flag. If you guys are ok with it, I will remove sync_exceptions.

The only option I see, if we really do not want the code to crash, is to let the results be silently wrong when check_errors is false, passing the responsibility of checking the num_pairs return value onto the user.

In my opinion this function should not let the code progress further if num_pairs > max_num_neighbors. The user is probably not going to bother checking, and the danger of being silently wrong is not tolerable.
If the user wants to be able to do something about the exception (e.g. increase the maximum number of neighbors) they can simply pass check_errors=True. Since recovering must be a conscious choice, I believe this is a good behavior.

For me, the ideal use case for this function would be something as follows:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors
while True:
    try:
        getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors, check_errors=True)
    except RuntimeError:
        max_num_neighbors += 32
        continue
    break
# Fast and CUDA-graph compatible call that should never raise, but will crash if an error occurs
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)

But we could also make it something like this:

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = pt.tensor(...)
max_num_neighbors = 1
# Find the maximum number of neighbors. This call will never raise, but may be silently wrong.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
if num_pairs > max_num_neighbors:
    max_num_neighbors = num_pairs + 32
# This will also never raise and may also be silently wrong. This call and the one above are fast and CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
# This will raise if necessary, but is not CUDA-graph compatible.
neigh, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors, check_errors=True)

Let me know what you think!

@peastman
Member

I don't think a crash is ever a good way of reporting an error. I would vote for combining the flags so you have a single option.

  • check_errors=True: Slower but you get an exception if it exceeds the maximum number of neighbors.
  • check_errors=False: Faster but it's up to you to check the returned value to see if some neighbors were missed.

@RaulPPelaez
Contributor Author

Then check_errors=True would be the default.
In your proposal only check_errors=False would be compatible with CUDA graphs, which is ok with me. OTOH it would be really inconvenient for the user to both put getNeighborPairs into a larger CUDA graph and ensure correctness.
It is hard to check the number of pairs without synchronizing, which was part of the functionality added by this PR.

If you agree that the responsibility to check errors in CUDA graph mode should fall onto the user I will go ahead and implement @peastman 's proposal.
cc @raimis

@peastman
Member

  • Then check_errors=True would be the default.

check_errors=False would be the default.

  • In your proposal only check_errors=False would be compatible with CUDA graphs

Correct.

…with CUDA graphs.
If check_errors=False (the default), getNeighborPairs does not check for errors and is compatible with graphs.
If check_errors=True, the function raises if necessary but is incompatible with graphs.
@RaulPPelaez
Contributor Author

RaulPPelaez commented Mar 29, 2023

This simplifies the logic greatly: no kernel-side atomic error flag is required and the graph can be constructed without requiring a host node.
I removed the sync_exceptions flag and followed @peastman's proposal.
The function never crashes the program.
It is compatible with CUDA graphs by default, but can be silently wrong: the user must check that the returned num_pairs is lower than the provided max_num_neighbors.
If check_errors=True, the function is not CUDA-graph compatible because it requires copying the number of neighbors to the host, but it will raise if necessary.
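
A short sketch of the resulting interface, inferred from this discussion (the overflow check mirrors the examples above; exact capacity semantics may differ):

import torch as pt
from NNPOps.neighbors import getNeighborPairs

positions = 10 * pt.randn((100, 3), device='cuda')
max_num_neighbors = 64

# Default (check_errors=False): never raises and is CUDA-graph compatible;
# the caller is responsible for checking the returned pair count.
neighbors, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors)
if num_pairs > max_num_neighbors:
    max_num_neighbors = int(num_pairs)  # grow the storage and call again

# check_errors=True: copies the pair count to the host (not graph
# compatible) and raises a RuntimeError if the capacity was exceeded.
neighbors, deltas, distances, num_pairs = getNeighborPairs(positions, cutoff=3.0, max_num_neighbors=max_num_neighbors, check_errors=True)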

@RaulPPelaez
Contributor Author

@peastman @raimis this is ready for review again.

@RaulPPelaez
Contributor Author

This is ready for review again.

@raimis
Contributor

raimis commented Apr 11, 2023

Ping @peastman

@peastman
Member

This addresses all the issues I raised. Looks good to me now!

@raimis
Contributor

raimis commented Apr 14, 2023

@RaulPPelaez I suppose this is ready to merge?

@RaulPPelaez
Contributor Author

This is ready for merge.

@raimis raimis merged commit b27ec97 into openmm:master Apr 14, 2023