Fix backtrace deduplication for Clang debug info (#4)
Conversation
Deduplicate stack frames by PC address to handle Clang's DWARF generation issues where multiple virtual frames are reported for the same program counter with incorrect source locations. Keep only the last frame for each unique PC as it typically represents the actual call site. Also add "object.h" to the filename filter list to exclude Python object header frames from backtraces.
Summary of Changes
Hello @Hzfengsy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the backtrace generation logic by introducing a frame deduplication strategy to mitigate issues with Clang's debug information, specifically when dealing with inlined functions. It also refines the backtrace filtering to exclude irrelevant Python internal frames, leading to more accurate and readable stack traces.
Code Review
This pull request introduces a fix for backtrace deduplication to handle issues with Clang's debug information, and adds object.h to the filename filter list. The changes look good, but I have one suggestion to improve the robustness of the deduplication logic. The current implementation assumes that frames with the same program counter are always consecutive, which could be brittle. My suggestion provides a more robust solution that handles non-consecutive duplicate frames as well.
std::vector<StackFrame> deduplicated_frames;
for (const auto& frame : reversed_frames) {
  if (frame.pc != 0 && !deduplicated_frames.empty() && deduplicated_frames.back().pc == frame.pc) {
    // Same PC as the previous frame - this is likely a spurious inline frame.
    // Replace with the current frame (which is typically more accurate).
    deduplicated_frames.back() = frame;
  } else {
    deduplicated_frames.push_back(frame);
  }
}
The current deduplication logic assumes that stack frames with the same program counter (PC) are always consecutive in the reversed_frames vector. While this might be true for the specific Clang DWARF issue this change is addressing, this assumption makes the implementation fragile. If frames with the same PC were to appear non-consecutively for any reason, this logic would fail to deduplicate them correctly.
A more robust approach would be to iterate through the frames from most-recent to least-recent and use a set to track seen PCs. This ensures that only the most recent frame for each unique PC is kept, regardless of its position in the stack trace. This would correctly handle both consecutive and non-consecutive duplicate frames.
Note: This change would require adding #include <unordered_set> at the top of the file.
std::vector<StackFrame> deduplicated_frames;
std::unordered_set<uintptr_t> seen_pcs;
// Iterate from most-recent to least-recent to keep the most recent frame for each PC.
for (auto it = reversed_frames.rbegin(); it != reversed_frames.rend(); ++it) {
const auto& frame = *it;
if (frame.pc == 0 || seen_pcs.find(frame.pc) == seen_pcs.end()) {
deduplicated_frames.push_back(frame);
if (frame.pc != 0) {
seen_pcs.insert(frame.pc);
}
}
}
// Reverse to get the correct order for display (least-recent to most-recent).
std::reverse(deduplicated_frames.begin(), deduplicated_frames.end());There was a problem hiding this comment.
Pull request overview
This PR addresses Clang's DWARF debug information generation issues by implementing PC-based deduplication of stack frames in backtraces, and adds Python's object.h to the filename filter list.
- Implements deduplication logic to handle multiple virtual frames reported for the same program counter
- Adds "object.h" to the backtrace filename filter to exclude Python C API header frames
for (const auto& frame : reversed_frames) {
  if (frame.pc != 0 && !deduplicated_frames.empty() && deduplicated_frames.back().pc == frame.pc) {
    // Same PC as the previous frame - this is likely a spurious inline frame.
    // Replace with the current frame (which is typically more accurate).
    deduplicated_frames.back() = frame;
  } else {
    deduplicated_frames.push_back(frame);
  }
}
The deduplication implementation only removes consecutive frames with the same PC address. While this may be sufficient for Clang's inline frame generation (which likely produces consecutive duplicates), the PR description states "Deduplicate stack frames by PC address" and "keep only the last frame for each unique PC", which suggests complete deduplication is intended. If non-consecutive frames can have the same PC, consider using an unordered_map to track the last frame for each unique PC value. Otherwise, clarify in the comments that only consecutive duplicates are handled.
// Deduplicate frames by PC address to handle Clang's debug info issues.
// When Clang generates DWARF info for inlined functions/templates, it may
// report multiple "virtual" frames for the same PC with incorrect source
// locations. We keep only the last frame for each unique PC, as that's
// typically the actual call site.
std::vector<StackFrame> deduplicated_frames;
for (const auto& frame : reversed_frames) {
  if (frame.pc != 0 && !deduplicated_frames.empty() && deduplicated_frames.back().pc == frame.pc) {
    // Same PC as the previous frame - this is likely a spurious inline frame.
    // Replace with the current frame (which is typically more accurate).
    deduplicated_frames.back() = frame;
  } else {
    deduplicated_frames.push_back(frame);
  }
}
The new PC-based deduplication logic lacks test coverage. Consider adding a test case that verifies frames with duplicate PC addresses are correctly deduplicated, especially to ensure the behavior works as intended for Clang's DWARF generation issues. This could be done by creating a mock scenario with duplicate PC values or by testing with actual Clang-compiled code that exhibits the issue.
"pybind11",     // Python binding layer
"__libc_",      // C library internals
"include/c++/", // C++ standard library
"object.h",     // Python object.h
The comment "Python object.h" could be more specific to clarify this filters out Python C API header frames, not a local header file. Consider changing to "Python C API object.h" or "CPython object.h" for clarity.
-    "object.h", // Python object.h
+    "object.h", // Python C API object.h (CPython header)
- Fix conftest docstring to say BEFORE_AND_AFTER (#1, hw-native-sys#5, hw-native-sys#8)
- Align InitMemRef pass property table and docs with code (hw-native-sys#2)
- Add INTERNAL_CHECK for null pass result in Pass::operator() (hw-native-sys#3)
- Switch SSA assignment tracking to pointer identity (hw-native-sys#4)
- Add null instrument checks in RunBeforePass/RunAfterPass (hw-native-sys#6)
- Add strict=True to xfail with descriptive reason (hw-native-sys#7)
- Fix pass_manager.md verification mode to BEFORE_AND_AFTER (hw-native-sys#9)
- Add ExitContext stack invariant check (hw-native-sys#10)
- Use memory_order_relaxed for atomic ID counter (comment hw-native-sys#3)
- Fix UniqueId doc: "stable for the lifetime of the process" (comment hw-native-sys#2)
- Clarify ir.pyi docstring: variable identity is part of hash (comment #1)
- Fix Type overload docstring: note enable_auto_mapping scope (comment hw-native-sys#4)
- Remove redundant static_cast in hash_var_identity (comment hw-native-sys#5)
- Optimize triple As<T> to single GetKind() check (comment hw-native-sys#6)
- Add tuple arity validation in Function binding (comments hw-native-sys#2/hw-native-sys#4)
- Add strict validation in param_directions deserialization (comments hw-native-sys#3/hw-native-sys#5)
- Fix closure variable merging order to follow lexical scoping (comment #1)
- Fix error.span isinstance check - span is ir.Span not dict (comment #6)
- Use forward-compatible AST field copying with ast.copy_location (comment #3)
- Add return handling in inline mode with early termination after return (comment #4)
- Prevent leaking inline function variables to caller scope (comments #2, #5, #7)
- Reorder __all__ to match import order (comment #1)
- Document all 5 system ops in en/zh-cn syntax docs (comment hw-native-sys#3)
- Fix frame_offset=2 in helper functions for accurate span capture (comment hw-native-sys#4)
- Exclude 'system' from unified dispatch to prevent misleading errors (comment hw-native-sys#5)
- Derive repo owner/name dynamically via gh repo view (hw-native-sys#2)
- Use comments(last: 1) to fetch latest comment per thread (hw-native-sys#4)
- Add pagination with pageInfo/endCursor for >100 threads (hw-native-sys#6)
- Extract run ID from check link field for gh run view (hw-native-sys#5, hw-native-sys#7)
- Document external check fallback (open link URL directly)
- Expanded GraphQL query for readability (gemini hw-native-sys#2)
- Added gh pr checks command to define $LINK variable (gemini #1, copilot hw-native-sys#7)
- Replaced grep -oP with portable sed for run ID extraction (gemini #1, copilot hw-native-sys#7)
- Fixed grep pattern to handle whitespace in JSON (copilot hw-native-sys#6)
- Replaced non-standard BRE with grep -E (copilot hw-native-sys#5)
- Added local reproduction fallback for CI logs (gemini hw-native-sys#3)
- Removed ellipsis placeholder, save output inline (copilot hw-native-sys#4)
- inline pass: extend DefVarCollector + VarSubstituteMutator to cover WhileStmt return_vars and ForStmt return_vars (Copilot hw-native-sys#5).
- inline pass: reject inline bodies with non-trailing ReturnStmts via NestedReturnCounter — silent miscompile if hand-built IR ever does early-return inside an If/For branch (Gemini #1).
- inline pass: drop the stale "Inline function as program entry: detected before splicing; raises" doc-comment line; the actual behaviour is silent removal in the cleanup phase (Copilot hw-native-sys#6).
- verifier: also flag *dangling* Calls — Calls whose callee GlobalVar isn't in program->functions_ — with error_code 2. The previous code short-circuited when no Inline functions survived, missing the case where the InlineFunctions pass dropped the function but left a Call behind (Copilot hw-native-sys#4).
- JIT _resolve_int: handle ast.Pow (only with non-negative int exponents) and ast.UAdd (Gemini hw-native-sys#2).
- docs/zh-cn/dev/passes/22-lower_pipeline_loops.md: narrow CanonicalizeIOOrder scope wording from "every SeqStmts in the whole program" to "SeqStmts within ForKind::Pipeline scope" (CodeRabbit hw-native-sys#9).
- examples/models/qwen3_jit/kernels/rmsnorm.py: cast resid chunks to FP32 in post_rmsnorm to mirror input_rmsnorm and avoid BF16 accumulation precision loss (CodeRabbit hw-native-sys#11).
CI fix (root cause of pypto-lib-model + system-tests failures):
- P6's call-site injector previously inlined `tensor.as_layout(arg, DN)` into the kernel-call args directly, which the orchestration codegen rejects with `Call to '<callee>' arg N is neither a variable nor a recognized constant literal`. Refactor `CallSiteAsLayoutInjector` to operate at the statement level: for every AssignStmt / EvalStmt / ReturnStmt whose RHS targets a promoted callee, emit one `bridged_<param> = tensor.as_layout(arg, DN)` AssignStmt immediately before the call statement and replace the inline Call arg with the bound Var. Net IR is SSA-form and matches what orchestration codegen consumes per arg slot (Var | const-literal).

Review comments addressed:
- gemini #1: codegen `tensor.as_layout` now special-cases the identity flip (target layout == source layout) and emits a plain `Tensor result = input;` alias instead of a spurious `.transpose()`. Simplify still folds these before codegen in the default pipeline, but the codegen is now robust against ad-hoc compile paths that skip Simplify.
- coderabbitai hw-native-sys#2 / hw-native-sys#3: drop the "next default pass" wording in en/zh-cn doc 17 — `MaterializeTensorStrides` runs later in the pipeline (after `CanonicalizeIOOrder`), not immediately after. The zh-cn doc's "17th pass" text is also clarified — the 17 is the docs/passes/ slot, not a literal pipeline call-count.
- coderabbitai hw-native-sys#4: `DeduceTensorAsLayoutType` now preserves the source TensorView's `valid_shape` (with trailing-pair swap on cross-layout flips) and `pad` through `tensor.as_layout`. Previously these fields were dropped, making the reinterpret silently lossy for sliced or fill-padded inputs.
- coderabbitai hw-native-sys#6: `MaterializeTensorStrides` direct-ctor rebuild path now forwards `op->attrs_`. The previous version preserved type and kwargs but dropped attrs, which would have silently discarded call metadata (arg_directions, manual_dep_edges) attached by earlier passes.
- coderabbitai hw-native-sys#7: update the stale comment block above `VisitExpr_` in `simplify_pass.cpp` — it still described the dropped shape-bearing `as_layout(x, shape, layout)` form and the never-implemented chain folding. New comment accurately describes the single identity-elimination rule and explains why chain folding is deferred.
- coderabbitai hw-native-sys#8 / hw-native-sys#9: the unit-test pattern that previously inspected an inline `tensor.as_layout` Call as a kernel-call arg no longer applies after the SSA refactor above. Tests now look up the bridge via `_find_assign_rhs(orch, var)` and guard `op is not None` before reading `op.name` (matching the defensive pattern already used in the B^T test).

Skipped (with reason):
- coderabbitai hw-native-sys#5: "Handle 3-arg `tile.load`". `tile.load` registers four mandatory args (tensor, offsets, shapes, valid_shapes) and the Python builder always materializes `valid_shapes` (defaults to `shapes` when the caller omits it). Once IR is constructed, every `tile.load` is 4-arg — the 3-arg form only exists at the DSL surface. The internal check stays as-is.