
CUDA: Fix source location on kernel entry and enable breakpoints to be set on kernels by mangled name #6841

Merged
merged 7 commits into numba:master on Jun 23, 2021

Conversation

@gmarkall (Member) commented Mar 18, 2021

This PR improves a couple of aspects of debugging in CUDA:

- Breaking on any kernel launch, for example with `set cuda break_on_launch application`, now results in GDB being able to find the source line at the break location. This enables the backtrace to show the mangled function name, source file, and line number. It also enables the text user interface to display the source when the kernel launches. From this point, stepping through line-by-line is feasible with the `next` command.
- Breakpoints can be set on a specific function and argument types, by setting a breakpoint on the mangled name (see the snippet below).

(neither of these worked prior to this change).
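To illustrate the second point, a minimal cuda-gdb session might look like the following (the mangled name here is the one that appears in the DISubprogram metadata later in this thread; the exact name depends on the kernel and its argument types):

(cuda-gdb) break _ZN6cudapy8__main__10kernel$241E5ArrayIiLi1E1A7mutable7alignedE
(cuda-gdb) run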

This PR also separates the DIBuilder API from the Numba IR API in the first commit, to avoid building Numba IR for no reason (or strange reasons) when generating the kernel wrapper.

There's still a little API weirdness that bothers me, but I'm not sure I can see a better way; perhaps the reviewer can suggest an improvement (I'll comment on the diff).

Marking a subprogram, variable, or location in the DIBuilder API
requires passing in a `numba.core.ir.Loc` object. This makes sense when
all debuginfo generation comes from lowering, but creates weird
dependencies when generating code that doesn't relate to a specific
piece of IR - for example, the kernel wrapper generated in the CUDA
target. To mark locations in the kernel wrapper, one must import the
`Loc` class from `numba.core.ir`, and construct instances of it to pass
in, just to mark locations in generated code (or do something more odd
with a "pretend" `Loc` object).

Since the `Loc` instances are only used for the line number, this commit
changes the interface of the relevant `DIBuilder` methods to accept a
line number instead of a `Loc`, breaking the spurious coupling.
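
A rough sketch of the difference at a call site like the kernel wrapper (the `mark_subprogram` / `mark_location` calls follow the diff further down this thread; the `Loc` constructor arguments shown are illustrative):

# Before: a numba.core.ir.Loc had to be constructed just to carry a line number
# (constructor arguments shown here are an illustrative assumption)
from numba.core import ir
loc = ir.Loc(filename=filename, line=linenum)
debuginfo.mark_subprogram(wrapfn, kernel_name, loc)
debuginfo.mark_location(builder, loc)

# After: the DIBuilder methods accept the line number directly
debuginfo.mark_subprogram(wrapfn, kernel_name, linenum)
debuginfo.mark_location(builder, linenum)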
@gmarkall gmarkall added 2 - In Progress and CUDA labels Mar 18, 2021
Doing this provides a couple of improvements:

- Breaking on any kernel launch, for example with
  `set cuda break_on_launch application` now results in GDB being able
  to find the source line at the break location - this enables the
  backtrace to show the mangled function name, source file, and line
  number. It also enables the text user interface to display the source
  when the kernel launches. From this point, stepping through
  line-by-line is feasible with the `next` command.
- Breakpoints can be set on a specific function and argument types, by
  setting a breakpoint on the mangled name.

(neither of these worked prior to this change)

Issues that remain include:

- GDB doesn't demangle the names of functions, like it seems to in the
  CPU target examples. Oddly, the act of marking the subprogram seems to
  break demangling - prior to this commit, the location of the wrapper
  kernel isn't known, but it does demangle correctly.
- Most locals appear to be optimized out even with `opt=0`. This also
  needs further investigation.
@gmarkall gmarkall changed the title from "WIP: CUDA debug improvements" to "CUDA: Fix source location on kernel entry and enable breakpoints to be set on kernels by mangled name" Mar 18, 2021
@@ -479,8 +481,11 @@ def __init__(self, py_func, argtypes, link=None, debug=False, inline=False,
}

tgt_ctx = cres.target_context
filename = cres.type_annotation.filename
@gmarkall (Member Author)
It's weird that I have to get the source file name from the type annotation at this stage, but I think the function IR is now lost (it was never saved after the pipeline finished).

@stuartarchibald (Contributor)
It's probably in `self.py_func.__code__.co_filename` and `cres.fndesc.lookup_module.__file__` too, if those are preferable? IIRC the compile result doesn't hold a reference to the IR, as the IR could hold references to e.g. globals, some of which might be large (e.g. NumPy arrays), and this would create memory pressure etc.

@@ -479,8 +481,11 @@ def __init__(self, py_func, argtypes, link=None, debug=False, inline=False,
}

tgt_ctx = cres.target_context
filename = cres.type_annotation.filename
linenum = int(cres.type_annotation.linenum)
@gmarkall (Member Author)
Also, the line number being a string in the type annotation is a minor problem, as LLVM expects line numbers to be integers (e.g. `5`), not strings (`"5"`).

@stuartarchibald (Contributor)
`foo.py_func.__code__.co_firstlineno` might help, as might the `inspect` module?
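
A quick sketch of those two routes, assuming a plain Python function (standard CPython/stdlib behaviour; `kernel` is just a stand-in name):

import inspect

def kernel(an_array):
    pass

code = kernel.__code__
print(code.co_filename)       # path of the file the function was defined in
print(code.co_firstlineno)    # integer line number where the definition starts

# The inspect module gives equivalent information:
print(inspect.getsourcefile(kernel))
print(inspect.getsourcelines(kernel)[1])  # second element is the starting line number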

@gmarkall gmarkall added this to the Numba 0.54 RC milestone Mar 18, 2021
@gmarkall (Member Author)
Example of the text user interface showing source layout and backtrace (the start of the kernel is `repro.py` line 7; the wrapper function name begins with `_ZN6cudapy8`):

[screenshot of the cuda-gdb text user interface]

@gmarkall gmarkall marked this pull request as ready for review March 18, 2021 18:11
@gmarkall gmarkall added 3 - Ready for Review and Effort - medium labels and removed 2 - In Progress label Mar 18, 2021
@gmarkall (Member Author)
A couple more notes on the lack of demangling in cuda-gdb:

For the CUDA target, DICompileUnit metadata looks like:

!2 = distinct !DICompileUnit(emissionKind: 1, file: !1, isOptimized: true, language: DW_LANG_Python, producer: "Numba", runtimeVersion: 0)

whereas on the CPU target the emissionKind is FullDebug, i.e. like:

!2 = distinct !DICompileUnit(emissionKind: FullDebug, file: !1, isOptimized: true, language: DW_LANG_Python, producer: "Numba", runtimeVersion: 0)

At the moment the name and the linkage name are the same for each subprogram, e.g.:

!5 = distinct !DISubprogram(file: !1, isDefinition: true, isLocal: false, isOptimized: true, line: 29, linkageName: "_ZN6cudapy8__main__10kernel$241E5ArrayIiLi1E1A7mutable7alignedE", name: "_ZN6cudapy8__main__10kernel$241E5ArrayIiLi1E1A7mutable7alignedE", scope: !1, scopeLine: 29, type: !4, unit: !2)

instead of

!5 = distinct !DISubprogram(file: !1, isDefinition: true, isLocal: false, isOptimized: true, line: 29, linkageName: "_ZN6cudapy8__main__10kernel$241E5ArrayIiLi1E1A7mutable7alignedE", name: "kernel", scope: !1, scopeLine: 29, type: !4, unit: !2)

However, for device functions the name is correct, and these are also not demangled.

Fixing these alone doesn't seem to help (I've tried locally by applying):

diff --git a/numba/cuda/cudadrv/nvvm.py b/numba/cuda/cudadrv/nvvm.py
index c77ec456e..35ab0a260 100644
--- a/numba/cuda/cudadrv/nvvm.py
+++ b/numba/cuda/cudadrv/nvvm.py
@@ -776,6 +777,9 @@ def llvm100_to_70_ir(ir):
             attrs = ' '.join(a for a in attrs if a != 'willreturn')
             line = line.replace(m.group(1), attrs)
 
+        if '= distinct !DICompileUnit' in line:
+            line = line.replace('emissionKind: 1', 'emissionKind: FullDebug')
+
         buf.append(line)
 
     return '\n'.join(buf)
diff --git a/numba/cuda/target.py b/numba/cuda/target.py
index 090c58d34..82c21c794 100644
--- a/numba/cuda/target.py
+++ b/numba/cuda/target.py
@@ -145,12 +145,13 @@ class CUDATargetContext(BaseContext):
                                                 nvvm_options=nvvm_options,
                                                 max_registers=max_registers)
         library.add_linking_library(codelib)
-        wrapper = self.generate_kernel_wrapper(library, kernel_name, func_name,
+        wrapper = self.generate_kernel_wrapper(library, codelib.name,
+                                               kernel_name, func_name,
                                                argtypes, debug, filename,
                                                linenum)
         return library, wrapper
 
-    def generate_kernel_wrapper(self, library, kernel_name, func_name,
+    def generate_kernel_wrapper(self, library, py_name, kernel_name, func_name,
                                 argtypes, debug, filename, linenum):
         """
         Generate the kernel wrapper in the given ``library``.
@@ -172,7 +173,10 @@ class CUDATargetContext(BaseContext):
 
         if debug:
             debuginfo = self.DIBuilder(module=wrapper_module, filepath=filename)
-            debuginfo.mark_subprogram(wrapfn, kernel_name, linenum)
+            debuginfo.mark_subprogram(wrapfn, py_name, linenum)
             debuginfo.mark_location(builder, linenum)
 
         # Define error handling variables

Since it doesn't help, I've not added it to this PR; I'd prefer only to add things that clearly make improvements. Are there any other thoughts on what might be missing for demangling to work here?

@gmarkall (Member Author)
(Popping this back to "In progress" because I just noticed it breaks with NVVM 3.4, and needs #5311 fixing for it to work)

Includes:

- Calls `llvm_to_ptx` once for each IR module when compiling for debug.
- Don't adjust linkage of functions in linked modules when debugging,
  because we need device functions to be externally visible.
- Fixed setting of NVVM options when calling `compile_cuda` from kernel
  compilation and device function template compilation.
- Removes `debug_pubnames` patch.

Outcomes:

- The "Error: Debugging support cannot be enabled when number of debug
  compile units is more than 1" message is no longer produced with NVVM
  3.4.
- NVVM 7.0: Everything still seems to "work" as much as it did before.
  Stepping may be more stable, but this needs a bit more verification.

Fixes numba#5311.
@gmarkall (Member Author)
I've included a fix for #5311 in this branch now - this resolves the long-standing "Error: Debugging support cannot be enabled when number of debug compile units is more than 1" message.

@gmarkall (Member Author) commented Apr 7, 2021

I think this is ready for review - looks like I somehow forgot to mark it as such.

@stuartarchibald (Contributor)
Buildfarm ID: numba_smoketest_cuda_yaml_70. Just to see if there are any immediate issues with the patch.

@stuartarchibald (Contributor)

> Buildfarm ID: numba_smoketest_cuda_yaml_70. Just to see if there are any immediate issues with the patch.

Passed.

@stuartarchibald (Contributor)
@gmarkall please could you resolve the conflicts? Then I'll look to review this, thanks.

@stuartarchibald stuartarchibald added 4 - Waiting on author and removed 3 - Ready for Review labels Jun 16, 2021
@gmarkall gmarkall added 4 - Waiting on reviewer and removed 4 - Waiting on author labels Jun 16, 2021
@gmarkall (Member Author)
@stuartarchibald conflicts now resolved.

Comment on lines +147 to +148
filename: The source filename that the function is contained in.
linenum: The source line that the function is on.
@stuartarchibald (Contributor)
What happens if someone materializes a function from a string?

@gmarkall (Member Author)
`filename` is `<string>` and `linenum` is the line in the string on which the function began - I think this is normal / expected?
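
For reference, a quick illustration of what the code object reports for a function defined from a string (standard CPython behaviour; the function name and body are just stand-ins):

src = """
def from_string(x):
    return x + 1
"""

namespace = {}
exec(compile(src, "<string>", "exec"), namespace)
fn = namespace["from_string"]

print(fn.__code__.co_filename)     # <string>
print(fn.__code__.co_firstlineno)  # 2, the line within the compiled string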

@stuartarchibald (Contributor) left a comment
Thanks for the patch, I've taken a first look through it but would like to do some manual testing. There are a few minor things to look at and some questions/suggestions above. Thanks again!

(Two review comments on numba/cuda/tests/cudapy/test_debuginfo.py, now outdated and resolved.)
@stuartarchibald stuartarchibald added 4 - Waiting on author and removed 4 - Waiting on reviewer labels Jun 16, 2021
- Use `self.py_func.__code__` to get the filename and line number for
  the function for `prepare_cuda_kernel`, which seems a bit cleaner /
  neater than digging into the type annotation for it.
- Fix typo and explicitly link to an issue comment.
@gmarkall (Member Author)
@stuartarchibald Thanks for the review and suggestions! All comments should be addressed now.

@gmarkall gmarkall added 4 - Waiting on reviewer and removed 4 - Waiting on author labels Jun 21, 2021
@esc (Member) commented Jun 22, 2021

Build Farm ID: numba_smoketest_cuda_yaml_76

@stuartarchibald (Contributor) left a comment

Thanks for the patch and fixes.

@stuartarchibald stuartarchibald added 4 - Waiting on CI and Pending BuildFarm labels and removed 4 - Waiting on reviewer label Jun 22, 2021
@stuartarchibald (Contributor)
Note: I manually verified this patch:

$ cuda-gdb --args python add_one.py
<snip>
Reading symbols from python...
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) r
Starting program: bin/python add_one.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 7719]
[Detaching after fork from child process 7720]
[Detaching after fork from child process 7721]
[Detaching after fork from child process 7724]
[New Thread 0x7fffda69b700 (LWP 7731)]
[New Thread 0x7fffd9e9a700 (LWP 7732)]
[New Thread 0x7fffd9699700 (LWP 7733)]
numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (1) < 2 * SM count (26) will likely result in GPU under utilization due to low occupancy.
  warn(NumbaPerformanceWarning(msg))
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x0000555556b98c88 in _ZN6cudapy8__main__20increment_by_one$241E5ArrayIdLi1E1C7mutable7alignedE<<<(1,1,1),(1,1,1)>>> ()
    at add_one.py:4
4       @cuda.jit(debug=True, opt=0)
(cuda-gdb) l
1       import numpy as np
2       from numba import cuda
3
4       @cuda.jit(debug=True, opt=0)
5       def increment_by_one(an_array):
6           # Thread id in a 1D block
7           tx = cuda.threadIdx.x
8           # Block id in a 1D grid
9           ty = cuda.blockIdx.x
10          # Block width, i.e. number of threads per block
(cuda-gdb) s
0x0000555556b7f168 in _ZN8__main__20increment_by_one$241E5ArrayIdLi1E1C7mutable7alignedE () at add_one.py:4
4       @cuda.jit(debug=True, opt=0)
(cuda-gdb) s
7           tx = cuda.threadIdx.x
(cuda-gdb) s
13          pos = tx + ty * bw
(cuda-gdb) s
14          if pos < an_array.size:  # Check array boundaries

@esc esc added 5 - Ready to merge and BuildFarm Passed labels and removed 4 - Waiting on CI and Pending BuildFarm labels Jun 22, 2021
@stuartarchibald stuartarchibald merged commit bb50fa8 into numba:master Jun 23, 2021
Labels
5 - Ready to merge, BuildFarm Passed, CUDA, Effort - medium