[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC #66817

Merged: 1 commit into llvm:main on Sep 26, 2023

Conversation

jhuber6 (Contributor) commented Sep 19, 2023:

Summary:
The RPC server is responsible for providing host services from the GPU.
Generally, the client running on the GPU will spin in place until the
host checks the server. Inside the runtime, we elected to have the user
thread do this checking while it would be otherwise waiting for the
kernel to finish. However, for Nvidia this caused problems when
offloading to a target region that requires a copy back.

This is caused by the implementation of `dataRetrieve` on Nvidia. We
initialize an asynchronous copy-back on the same stream that the kernel
is running on. This creates an implicit sync on the kernel to finish
before we issue the D2H copy, which we then wait on. This implicit sync
happens inside of the CUDA runtime. This is problematic when running the
RPC server because we need someone to check the RPC server. If no one
checks the RPC server then the kernel will never finish, meaning that
the memcpy will never be issued and the program hangs. This patch adds
an explicit check for unfinished work on the stream and waits for it to
complete.
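
A minimal sketch of the resulting behavior, with a placeholder `checkRPCServer()` standing in for the plugin's actual RPC handling (the real patch goes through the plugin's own interface); this illustrates the technique, not the literal diff:

    #include <cuda.h>

    // Placeholder for the plugin's RPC handling; the real runtime services the
    // server through its own interface.
    void checkRPCServer();

    // Drain outstanding work on the stream while servicing the RPC server, then
    // issue the D2H copy. Blocking inside cuMemcpyDtoHAsync before the kernel
    // finishes is what previously caused the hang.
    CUresult retrieveData(void *HstPtr, CUdeviceptr TgtPtr, size_t Size,
                          CUstream Stream) {
      CUresult Res = cuStreamQuery(Stream);
      while (Res == CUDA_ERROR_NOT_READY) {
        // The kernel may be spinning on an RPC request; answer it so the kernel
        // can make progress and eventually retire from the stream.
        checkRPCServer();
        Res = cuStreamQuery(Stream);
      }
      if (Res != CUDA_SUCCESS)
        return Res;

      // The stream is idle now, so this call no longer blocks on the kernel.
      return cuMemcpyDtoHAsync(HstPtr, TgtPtr, Size, Stream);
    }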

Diff excerpt under review:

    CUresult Res = cuMemcpyDtoHAsync(HstPtr, (CUdeviceptr)TgtPtr, Size, Stream);
    // If there is already pending work on the stream it could be waiting for
    // someone to check the RPC server.
    CUresult Res = cuStreamQuery(Stream);
Member:

if (auto RPCServer = getRPCServer()) on the outside, not in the while loop.

Member:

^ (same comment as above)

Test excerpt from the diff:

    // CHECK: PASS
    #pragma omp target map(from : r)
    { r = fwrite("PASS\n", 1, sizeof("PASS\n") - 1, stdout); }
    assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
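
For context, a self-contained version of this test might look like the following; the includes, `main`, the declaration of `r`, and the RUN line are assumed here, since only the hunk above appears in the review:

    // RUN line assumed; libomptarget tests typically use a lit substitution
    // such as %libomptarget-compile-run-and-check-generic.
    #include <assert.h>
    #include <stdio.h>

    int main() {
      size_t r = 0;
      // fwrite executes on the device and is serviced by the host RPC server;
      // map(from : r) forces a D2H copy once the kernel finishes.
      // CHECK: PASS
    #pragma omp target map(from : r)
      { r = fwrite("PASS\n", 1, sizeof("PASS\n") - 1, stdout); }
      assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
      return 0;
    }
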
Member:

Could you explain what thread/action uses which stream here?
I'm not 100% sure the extra sync above is sufficient (e.g., what about H2D?) and necessary.
Maybe we should just have more streams involved, but I am not clear what streams we currently use (for the RPC parts).

jhuber6 (Contributor, Author):

So, as far as I understand it, for a target region with a copy-back, we simply call this function to retrieve the data. Currently, NVPTX does an implicit sync because we schedule both the kernel and the copy-back on the same stream. i.e. we cannot issue the memcpy until the kernel is done so we wait inside of the CUDA runtime. This is bad because it means that we aren't periodically checking the RPC server while waiting.
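
To illustrate the pre-fix path described here (a simplified sketch, not the literal plugin code):

    #include <cuda.h>

    // Pre-fix behavior, simplified: the copy-back is enqueued on the same
    // stream as the kernel, and per the driver's documented synchronization
    // rules for copies into pageable host memory (cited later in this thread)
    // the "async" call blocks until the preceding kernel completes. Since this
    // is the same host thread that would otherwise be checking the RPC server,
    // the kernel never gets its RPC request answered and the call never returns.
    CUresult retrieveDataPreFix(void *HstPtr, CUdeviceptr TgtPtr, size_t Size,
                                CUstream Stream) {
      return cuMemcpyDtoHAsync(HstPtr, TgtPtr, Size, Stream); // hangs here
    }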

jdoerfert (Member) commented Sep 23, 2023:

> we schedule both the kernel and the copy-back on the same stream.

Yes, that is correct.

> i.e. we cannot issue the memcpy until the kernel is done so we wait inside of the CUDA runtime.

That I don't get.

> This is bad because it means that we aren't periodically checking the RPC server while waiting.

The checking is done by the thread that issued the kernel, right? It does not wait for the stream with the kernel to finish before it checks the RPC stream, correct? I don't get why this interferes. I would assume it checks the RPC, then does a non-blocking check on the kernel, and so on?

jhuber6 (Contributor, Author):

If you go into a debugger, it waits forever inside of the D2H memcpy function. This is because CUDA is most likely waiting internally for the kernel to be done. That means no one on our side is checking the server.

Member:

> If you go into a debugger, it waits forever inside of the D2H memcpy function. This is because CUDA is most likely waiting internally for the kernel to be done. That means no one on our side is checking the server.

I don't disagree with this, but it does not answer my questions above. I can try to look into the current state next week myself.

Contributor:

The current behavior looks like all those async operations can only be "enqueued" at a certain point, but because the device is busy waiting for the RPC request, it never reaches that point. As a consequence, the host side is stuck at the "issuing" stage.

jhuber6 (Contributor, Author):

Yeah, this could be solved by forking a thread for the RPC server to run in the background, but it seemed better to re-use the waiting threads we already have. Spawning another "hidden" thread is an alternative if we wanted to change that.
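
A rough sketch of that alternative, for illustration only (a dedicated host thread servicing the RPC server; `checkRPCServer()` is again a placeholder, and this is not what the patch does):

    #include <atomic>
    #include <thread>

    // Placeholder for one round of RPC server handling.
    void checkRPCServer();

    // Alternative design: a "hidden" host thread that polls the RPC server for
    // the lifetime of the device, so no user thread ever has to do it.
    struct RPCServiceThread {
      std::atomic<bool> Running{true};
      std::thread Worker{[this] {
        while (Running.load(std::memory_order_relaxed))
          checkRPCServer();
      }};

      ~RPCServiceThread() {
        Running = false;
        Worker.join();
      }
    };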

Member:

So, it turns out the call is blocking, per rule 3 of the synchronous memcpy behavior described in https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior__memcpy-async.
The solution above seems sensible to me.

jdoerfert (Member) left a review:

See the one nit.

@jhuber6 merged commit 0f88be7 into llvm:main on Sep 26, 2023 (2 checks passed).
legrosbuffle pushed a commit to legrosbuffle/llvm-project that referenced this pull request on Sep 29, 2023:
[Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC (llvm#66817)
