[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC #66817

Merged: 1 commit into llvm:main on Sep 26, 2023

Conversation

jhuber6 (Contributor) commented Sep 19, 2023:

Summary:
The RPC server is responsible for providing host services from the GPU.
Generally, the client running on the GPU will spin in place until the
host checks the server. Inside the runtime, we elected to have the user
thread do this checking while it would be otherwise waiting for the
kernel to finish. However, for Nvidia this caused problems when
offloading to a target region that requires a copy back.

This is caused by the implementation of `dataRetrieve` on Nvidia. We
initialize an asynchronous copy-back on the same stream that the kernel
is running on. This creates an implicit sync on the kernel to finish
before we issue the D2H copy, which we then wait on. This implicit sync
happens inside of the CUDA runtime. This is problematic when running the
RPC server because we need someone to check the RPC server. If no one
checks the RPC server then the kernel will never finish, meaning that
the memcpy will never be issued and the program hangs. This patch adds
an explicit check for unfinished work on the stream and waits for it to
complete.
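
A minimal sketch of the resulting behavior, with a placeholder `checkRPCServer()` standing in for the plugin's actual RPC handling (the real patch goes through the plugin's own interface); this illustrates the technique, not the literal diff:

    #include <cuda.h>

    // Placeholder for the plugin's RPC handling; the real runtime services the
    // server through its own interface.
    void checkRPCServer();

    // Drain outstanding work on the stream while servicing the RPC server, then
    // issue the D2H copy. Blocking inside cuMemcpyDtoHAsync before the kernel
    // finishes is what previously caused the hang.
    CUresult retrieveData(void *HstPtr, CUdeviceptr TgtPtr, size_t Size,
                          CUstream Stream) {
      CUresult Res = cuStreamQuery(Stream);
      while (Res == CUDA_ERROR_NOT_READY) {
        // The kernel may be spinning on an RPC request; answer it so the kernel
        // can make progress and eventually retire from the stream.
        checkRPCServer();
        Res = cuStreamQuery(Stream);
      }
      if (Res != CUDA_SUCCESS)
        return Res;

      // The stream is idle now, so this call no longer blocks on the kernel.
      return cuMemcpyDtoHAsync(HstPtr, TgtPtr, Size, Stream);
    }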

Diff excerpt under review:

    CUresult Res = cuMemcpyDtoHAsync(HstPtr, (CUdeviceptr)TgtPtr, Size, Stream);
    // If there is already pending work on the stream it could be waiting for
    // someone to check the RPC server.
    CUresult Res = cuStreamQuery(Stream);
Member:

if (auto RPCServer = getRPCServer()) on the outside, not in the while loop.

Member:

^ (same comment as above)

Test excerpt from the diff:

    // CHECK: PASS
    #pragma omp target map(from : r)
    { r = fwrite("PASS\n", 1, sizeof("PASS\n") - 1, stdout); }
    assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
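
For context, a self-contained version of this test might look like the following; the includes, `main`, the declaration of `r`, and the RUN line are assumed here, since only the hunk above appears in the review:

    // RUN line assumed; libomptarget tests typically use a lit substitution
    // such as %libomptarget-compile-run-and-check-generic.
    #include <assert.h>
    #include <stdio.h>

    int main() {
      size_t r = 0;
      // fwrite executes on the device and is serviced by the host RPC server;
      // map(from : r) forces a D2H copy once the kernel finishes.
      // CHECK: PASS
    #pragma omp target map(from : r)
      { r = fwrite("PASS\n", 1, sizeof("PASS\n") - 1, stdout); }
      assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
      return 0;
    }
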
Member:

Could you explain what thread/action uses which stream here?
I'm not 100% sure the extra sync above is sufficient (e.g., what about H2D?) and necessary.
Maybe we should just have more streams involved, but I am not clear what streams we currently use (for the RPC parts).

jhuber6 (Contributor, Author):

So, as far as I understand it, for a target region with a copy-back, we simply call this function to retrieve the data. Currently, NVPTX does an implicit sync because we schedule both the kernel and the copy-back on the same stream. i.e. we cannot issue the memcpy until the kernel is done so we wait inside of the CUDA runtime. This is bad because it means that we aren't periodically checking the RPC server while waiting.
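
To illustrate the pre-fix path described here (a simplified sketch, not the literal plugin code):

    #include <cuda.h>

    // Pre-fix behavior, simplified: the copy-back is enqueued on the same
    // stream as the kernel, and per the driver's documented synchronization
    // rules for copies into pageable host memory (cited later in this thread)
    // the "async" call blocks until the preceding kernel completes. Since this
    // is the same host thread that would otherwise be checking the RPC server,
    // the kernel never gets its RPC request answered and the call never returns.
    CUresult retrieveDataPreFix(void *HstPtr, CUdeviceptr TgtPtr, size_t Size,
                                CUstream Stream) {
      return cuMemcpyDtoHAsync(HstPtr, TgtPtr, Size, Stream); // hangs here
    }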

jdoerfert (Member) commented Sep 23, 2023:

> we schedule both the kernel and the copy-back on the same stream.

Yes, that is correct.

> i.e. we cannot issue the memcpy until the kernel is done so we wait inside of the CUDA runtime.

That I don't get.

> This is bad because it means that we aren't periodically checking the RPC server while waiting.

The checking is done by the thread that issued the kernel, right? It does not wait for the stream with the kernel to finish before it checks the RPC stream, correct? I don't get why this interferes. I would assume it checks the RPC, then does a non-blocking check on the kernel, and so on?

jhuber6 (Contributor, Author):

If you go into a debugger, it waits forever inside of the D2H memcpy function. This is because CUDA is most likely waiting internally for the kernel to be done. That means no one on our side is checking the server.

Member:

> If you go into a debugger, it waits forever inside of the D2H memcpy function. This is because CUDA is most likely waiting internally for the kernel to be done. That means no one on our side is checking the server.

I don't disagree with this, but it does not answer my questions above. I can try to look into the current state next week myself.

Contributor:

The current behavior looks like all those async operations can only be "enqueued" at a certain point, but because the device is busy waiting for the RPC request, it never reaches that point. As a consequence, the host side is stuck at the "issuing" stage.

jhuber6 (Contributor, Author):

Yeah, this could be solved by forking a thread for the RPC server to run in the background, but it seemed better to re-use the waiting threads we already have. Spawning another "hidden" thread is an alternative if we wanted to change that.
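
A rough sketch of that alternative, for illustration only (a dedicated host thread servicing the RPC server; `checkRPCServer()` is again a placeholder, and this is not what the patch does):

    #include <atomic>
    #include <thread>

    // Placeholder for one round of RPC server handling.
    void checkRPCServer();

    // Alternative design: a "hidden" host thread that polls the RPC server for
    // the lifetime of the device, so no user thread ever has to do it.
    struct RPCServiceThread {
      std::atomic<bool> Running{true};
      std::thread Worker{[this] {
        while (Running.load(std::memory_order_relaxed))
          checkRPCServer();
      }};

      ~RPCServiceThread() {
        Running = false;
        Worker.join();
      }
    };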

Member:

So, it turns out the call is blocking, per rule 3 of the synchronous memcpy behavior described in https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior__memcpy-async.
The solution above seems sensible to me.

jdoerfert (Member) left a review:

See the one nit.

@jhuber6 merged commit 0f88be7 into llvm:main on Sep 26, 2023 (2 checks passed).
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request on Sep 29, 2023:
[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC (llvm#66817)
