
Efficient error reporting in CUDA #79

Closed
RaulPPelaez opened this issue Jan 12, 2023 · 9 comments
Labels
enhancement New feature or request


@RaulPPelaez
Contributor

RaulPPelaez commented Jan 12, 2023

We need to design and integrate an efficient way to detect an error state in a CUDA kernel and capture it on the CPU; in particular, detecting when a particle has more neighbours than the maximum allowed, replacing this:

assert(i_pair < neighbors.size(1));

which currently leaves the CUDA context in an invalid state, requiring a full context reset.

Additionally, the strategy should be compatible with CUDA graphs.
This is related to PR #70.

The main difficulty here is that there is no way to communicate information between a kernel and the CPU that does not involve a synchronization barrier and a memory copy.

I think we should handle this similarly to native CUDA error reporting: build an error-checking function into the interface that is allowed to synchronize and memcpy.

The class building the list, here

class Autograd : public Function<Autograd> {

could own a device array/value storing error states (maybe an enum, or a simple integer); the kernel building the neighbour list would atomically set this error state instead of triggering the assertion above.
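A minimal sketch of what the kernel side could look like under this scheme (hypothetical names; `error_flag` would be the device integer owned by the class):

```cuda
#include <cstdint>

// Hypothetical error codes; the class would reset the flag to NO_ERROR
// (e.g. with cudaMemsetAsync) before each rebuild of the list.
enum ErrorCode : int32_t { NO_ERROR = 0, TOO_MANY_NEIGHBORS = 1 };

// Called from the list-building kernel before writing pair i_pair.
__device__ inline bool reserveSlot(int i_pair, int max_num_neighbors,
                                   int32_t* error_flag) {
    if (i_pair < max_num_neighbors)
        return true;
    // Record the error instead of assert()-ing, which would leave the
    // CUDA context in an invalid state. atomicExch keeps the write
    // well-defined when many threads fail simultaneously.
    atomicExch(error_flag, TOO_MANY_NEIGHBORS);
    return false; // caller drops the pair; list is truncated but usable
}
```

Whether the flag is a single integer or a per-error enum is a design choice; a single atomically-written integer is the simplest thing that survives concurrent failures.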

Then, checking this error state on the CPU should be delayed as much as possible. For instance, before constructing a CUDA graph, a series of error-checking calls to

static tensor_list forward(AutogradContext* ctx,
const Tensor& positions,
const Scalar& cutoff,
const Scalar& max_num_neighbors,
const Tensor& box_vectors) {

with increasing max_num_neighbors could be made to determine an upper bound for it. The graph would then be constructed in a way such that this error state is no longer automatically checked.

This of course has the downside that errors would go silent during a simulation, and the code would crash in an uncontrolled way.

@RaulPPelaez
Contributor Author

What is the intended usage of this function

def getNeighborPairs(positions: Tensor, cutoff: float, max_num_neighbors: int = -1, box_vectors: Optional[Tensor] = None) -> Tuple[Tensor, Tensor]:

when the maximum number of neighbours is not known beforehand?

Is the user expected to find out this number outside of the interface, or to just be conservative?

I would expect the function to autocompute it by default, or to provide some way to report failure so the caller can quickly try again when the provided max_num_neighbors is not enough.

@peastman
Member

I believe this is the most efficient way of communicating an error state back to the host.

  • Allocate the flag in host memory with cudaHostAlloc().
  • Clear the flag before launching the kernel.
  • If an error occurs, the kernel sets the flag.
  • On the host, record an event after launching the kernel.
  • To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.
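The steps above could look roughly like this on the host (a sketch, not the repository's code; `buildListKernel` and the launch parameters are placeholders):

```cuda
int32_t* error_flag;            // pinned, device-visible host memory
cudaHostAlloc(&error_flag, sizeof(int32_t), cudaHostAllocMapped);
cudaEvent_t listDone;
cudaEventCreateWithFlags(&listDone, cudaEventDisableTiming);

*error_flag = 0;                // clear the flag before launching
buildListKernel<<<blocks, threads, 0, stream>>>(/* ..., */ error_flag);
cudaEventRecord(listDone, stream);  // record an event after the launch

// ... launch other independent kernels here so the GPU keeps working ...

cudaEventSynchronize(listDone); // delayed as long as possible
if (*error_flag != 0) {
    // grow max_num_neighbors and rebuild, or raise an exception
}
```

Note that on systems with unified virtual addressing the pointer returned by `cudaHostAlloc` can be passed to the kernel directly; otherwise the device-side alias from `cudaHostGetDevicePointer()` is needed.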

@peastman
Member

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.
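For illustration (not the library's exact layout), a fixed-shape format like the one described could look like this in plain Python, padding unused slots with -1; exceeding the capacity is exactly the failure mode discussed in this issue:

```python
def neighbor_matrix(positions, cutoff, max_num_neighbors):
    """Build an (n_atoms, max_num_neighbors) table padded with -1.
    Pure-Python sketch with hypothetical names; positions are 1D
    coordinates for simplicity."""
    n = len(positions)
    rows = [[-1] * max_num_neighbors for _ in range(n)]
    counts = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= cutoff:
                if counts[i] >= max_num_neighbors:
                    # the situation the error flag must report
                    raise ValueError("more neighbors than max_num_neighbors")
                rows[i][counts[i]] = j
                counts[i] += 1
    return rows

print(neighbor_matrix([0.0, 0.1, 5.0], cutoff=1.0, max_num_neighbors=2))
# → [[1, -1], [0, -1], [-1, -1]]
```

A variable-length format (e.g. a flat pair list plus per-atom offsets) would remove the fixed second axis entirely.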

@RaulPPelaez
Contributor Author

Sounds good.

I believe this is the most efficient way of communicating an error state back to the host.

* Allocate the flag in host memory with `cudaHostAlloc()`.

* Clear the flag before launching the kernel.

* If an error occurs, the kernel sets the flag.

If the flag is in managed memory, setting it on the host before the kernel launch should not incur a memcpy or a sync as long as the kernel never accesses the flag (no error), right?

* On the host, record an event after launching the kernel.

* To check for an error on the host, wait on the event and then check the flag.

For efficiency, it's best to delay that last step as long as possible. If you can launch one or more other kernels before waiting on the event, that will allow the GPU to keep working.

There is then another question: when is the event synced?

  • At some point before leaving the function?
  • As the first step the next time the function is called?
  • The user must explicitly request the error state?

@RaulPPelaez
Contributor Author

Is the user expected to find out this number outside of the interface? be conservative?

It expects you to just know it, which really isn't ideal. This is inherent in the neighbor list format it uses: a 2D tensor where one axis corresponds to the neighbors of each atom. It would be great if we could support a second neighbor list format that doesn't require the maximum neighbors to be fixed.

Is it an actual requirement to know the shape of the tensor before calling? If it is not, the getNeighborPairs function could just allocate whatever it needs. For instance, max_num_neighs=-2 could mean: figure out how many you need.
In that case we could just do internally:

while (not tryToBuildList(..., max_num_neighs))
    max_num_neighs += 32;
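The grow-and-retry idea in the loop above can be sketched in Python with a stand-in for the real builder (`try_to_build_list` returning a success flag is an assumption of this sketch):

```python
def build_with_autogrow(try_to_build_list, max_num_neighs=32, growth=32):
    """Retry the (hypothetical) list builder, growing the per-atom
    capacity by fixed steps until it succeeds; returns the capacity
    that finally worked."""
    while not try_to_build_list(max_num_neighs):
        max_num_neighs += growth
    return max_num_neighs

# Stand-in builder: succeeds once the capacity reaches the true
# maximum neighbour count (here 100). Steps: 32, 64, 96, 128.
print(build_with_autogrow(lambda cap: cap >= 100))  # → 128
```

Each failed attempt costs a rebuild, so growing by a coarse step (or doubling) keeps the number of retries small.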

@RaulPPelaez
Contributor Author

On the other hand, if the format can change wildly we could try a cell list; it is really fast to construct. To traverse it, instead of having a "private" nlist for each particle, you get the list of particles in each cell and go over the 27 neighbouring cells to find your neighbours.
This way you eliminate the need for max_num_neighs. The cell list is stored in three arrays: one of size N and two of size ncells = int(L/rc)^3, all of which are known in advance.
The code gets messier, though.
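A pure-Python sketch of that three-array layout (hypothetical names): `cell_of` has size N, while `cell_start` and `cell_count` have size ncells, so every allocation is known before any neighbour is found:

```python
def build_cell_list(positions, L, rc):
    """Bin 3D positions in a cubic box of side L into cells of side >= rc.
    Returns (cell_of, cell_start, cell_count); the neighbours of a
    particle are then found by scanning the 27 cells around its own."""
    m = int(L / rc)              # cells per side
    ncells = m ** 3

    def cell_index(p):
        ix, iy, iz = (min(int(c / L * m), m - 1) for c in p)
        return (iz * m + iy) * m + ix

    cell_of = [cell_index(p) for p in positions]     # size N
    cell_count = [0] * ncells                        # size ncells
    for c in cell_of:
        cell_count[c] += 1
    # Exclusive prefix sum gives where each cell's particles start
    # once the particles are sorted by cell index.
    cell_start = [0] * ncells                        # size ncells
    for c in range(1, ncells):
        cell_start[c] = cell_start[c - 1] + cell_count[c - 1]
    return cell_of, cell_start, cell_count
```

On the GPU the counts would be accumulated with atomics and the prefix sum done with a scan, but the memory footprint is the same and fixed in advance.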

@peastman
Member

The user must explicitly request the error state?

Users rarely check error states. If they have to take extra steps to see if there was an error, it's not a lot better than no error checking at all.

@RaulPPelaez
Contributor Author

#80

@raimis raimis added the enhancement New feature or request label Mar 2, 2023
@RaulPPelaez
Contributor Author

Closed as completed
