Add investigate-trace template and improve root cause analysis #251

Merged
Alan-Jowett merged 4 commits into microsoft:main from Alan-Jowett:improve-investigation-from-etl-feedback
Apr 21, 2026
@Alan-Jowett
Member

Summary

Adds a purpose-built `investigate-trace` template for ETW/telemetry/profiling trace analysis and improves the root cause analysis protocol with iterative deepening and cross-component causal chain analysis. Based on real-world feedback from an LLM that executed an assembled `investigate-bug` prompt for a Windows power/ETL trace investigation.

Motivation

The `investigate-bug` template is code-centric: it asks for file:line evidence, code-level fixes, and tests that would have caught the bug. When used for trace/telemetry analysis (e.g., analyzing an ETW capture for power usage), the executing LLM identified several gaps:

  1. Call stack analysis was an afterthought: module-level attribution shows only where; stacks show why
  2. Cross-process amplification was the most valuable finding, but the prompt never asked for it (e.g., OneDrive writes → Defender scans → EDR hashing → NIS inspection)
  3. Energy-vs-CPU divergence was missing: a process used 0.94% CPU but 12.3% of total energy
  4. Iterative deepening was needed: the best results came from layered analysis, not a single pass
  5. The Stack Lifetime Hazards taxonomy was irrelevant: it wasted context on C/C++ memory safety for a power trace task
  6. Operational constraints were code-centric: file counts don't apply to trace queries
  7. Epistemic labeling was heavyweight for authoritative machine telemetry

Changes

New template

  • `investigate-trace`: Purpose-built for ETW/telemetry/profiling trace analysis, with call stack analysis as the primary technique, energy-vs-metric divergence detection, cross-process amplification cascade analysis, tool-agnostic steps, and an iterative deepening workflow
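The energy-vs-metric divergence check can be sketched as a simple ratio test. This is a minimal illustration, not the template's actual wording: the `ProcessUsage` type, the `threshold` of 3.0, and the `min_cpu_pct` floor are assumptions made for the example. Only the 0.94% CPU / 12.3% energy figures come from the feedback above; the other two rows are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessUsage:
    name: str
    cpu_pct: float     # share of total CPU time, in percent
    energy_pct: float  # share of total energy, in percent


def energy_to_cpu_ratio(p: ProcessUsage, min_cpu_pct: float = 0.1) -> Optional[float]:
    """Return energy% / CPU%, or None when CPU% is too small for a stable ratio."""
    if p.cpu_pct < min_cpu_pct:
        return None  # low-CPU edge case: the ratio would be dominated by noise
    return p.energy_pct / p.cpu_pct


def find_divergent(processes, threshold=3.0, min_cpu_pct=0.1):
    """List (name, ratio) for processes whose energy share far exceeds their CPU share."""
    flagged = []
    for p in processes:
        ratio = energy_to_cpu_ratio(p, min_cpu_pct)
        if ratio is not None and ratio >= threshold:
            flagged.append((p.name, round(ratio, 1)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# proc_a uses the figures quoted in the Motivation section; proc_b and proc_c are hypothetical.
processes = [
    ProcessUsage("proc_a", cpu_pct=0.94, energy_pct=12.3),
    ProcessUsage("proc_b", cpu_pct=20.0, energy_pct=18.0),
    ProcessUsage("proc_c", cpu_pct=0.02, energy_pct=0.5),
]
divergent = find_divergent(processes)
```

With these inputs only `proc_a` is flagged, at a ratio of roughly 13. `proc_c` is skipped rather than flagged because its CPU share falls below the assumed floor.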

Protocol improvements

  • `root-cause-analysis`: Added Phase 3a (Iterative Deepening: broad survey → attribution → deep analysis → cross-component tracing) and Phase 4a (Cross-Component Causal Chains: trigger-response pairs, amplification cascades, leverage point identification). These are domain-agnostic improvements that benefit all investigation tasks.
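One way to picture the Phase 4a quantities is a small sketch over trigger-response pairs. It reuses the OneDrive → Defender → EDR → NIS chain from the Motivation section, assumes each stage triggers exactly the next one, and attaches made-up cost numbers (arbitrary units). The function names are hypothetical and not taken from the protocol.

```python
# Illustrative costs in arbitrary units; these numbers are invented for the example.
COST = {
    "OneDrive write": 10.0,
    "Defender scan": 40.0,
    "EDR hashing": 25.0,
    "NIS inspection": 15.0,
}

# Trigger-response pairs: (trigger, response).
PAIRS = [
    ("OneDrive write", "Defender scan"),
    ("Defender scan", "EDR hashing"),
    ("EDR hashing", "NIS inspection"),
]


def downstream_cost(component, pairs, cost):
    """Cost of a component plus everything it triggers (assumes no cycles)."""
    total = cost[component]
    for trigger, response in pairs:
        if trigger == component:
            total += downstream_cost(response, pairs, cost)
    return total


def amplification_factor(root, pairs, cost):
    """Total cascade cost divided by the cost of the originating trigger."""
    return downstream_cost(root, pairs, cost) / cost[root]


def leverage_points(pairs, cost):
    """Rank trigger-response links by the cost removed if the link were cut."""
    ranked = [(t, r, downstream_cost(r, pairs, cost)) for t, r in pairs]
    return sorted(ranked, key=lambda link: link[2], reverse=True)


factor = amplification_factor("OneDrive write", PAIRS, COST)
best_link = leverage_points(PAIRS, COST)[0]
```

Under these invented costs the cascade amplifies the original write by a factor of 9, and the link with the most leverage is the first one, since cutting it removes every downstream stage.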

Guardrail improvements

  • `anti-hallucination`: Scoped labeling relaxation. Direct observations from authoritative tool output (metrics, measurements, counters) get implicit KNOWN status; causal explanations and interpretations retain full labeling requirements
  • `operational-constraints`: Added data-driven scoping rules. Trace/telemetry analysis scopes by data categories and time ranges (not file counts), and structured query results use volume-aware retrieval
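A rough sketch of what scoping by data category and time range, combined with volume-aware retrieval, could look like. The `TraceScope` type, the event dictionary shape, and the `max_rows` cutoff are all assumptions for illustration; the guardrail itself is prose guidance and does not define an API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceScope:
    categories: frozenset  # data categories in scope, e.g. {"cpu"}
    start_s: float         # inclusive start of the time range, in seconds
    end_s: float           # exclusive end of the time range, in seconds

    def contains(self, event: dict) -> bool:
        return event["category"] in self.categories and self.start_s <= event["t"] < self.end_s


def retrieve(events, scope, max_rows=3):
    """Return all in-scope rows when few match; otherwise a count plus the top rows."""
    matched = [e for e in events if scope.contains(e)]
    if len(matched) <= max_rows:
        return {"mode": "full", "rows": matched}
    top = sorted(matched, key=lambda e: e["value"], reverse=True)[:max_rows]
    return {"mode": "summary", "total_rows": len(matched), "top_rows": top}


# Hypothetical events: category, timestamp in seconds, and a measured value.
events = [
    {"category": "cpu", "t": 1.0, "value": 5.0},
    {"category": "cpu", "t": 2.0, "value": 9.0},
    {"category": "disk", "t": 2.5, "value": 7.0},
    {"category": "cpu", "t": 3.0, "value": 1.0},
    {"category": "cpu", "t": 12.0, "value": 50.0},
]
scope = TraceScope(frozenset({"cpu"}), start_s=0.0, end_s=10.0)
small = retrieve(events, scope, max_rows=3)
large = retrieve(events, scope, max_rows=2)
```

The point of the design is that the budget is expressed in rows per scoped query rather than in files read: a small result comes back whole, while a large one is summarized before anything is pulled into context.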

Format and template improvements

  • `investigation-report`: Recognizes template-level full-format overrides (investigation templates can require full format regardless of finding count)
  • `investigate-bug`: Added explicit full-format override instruction
  • `bootstrap`: Added taxonomy relevance evaluation guidance (evaluate template-declared taxonomies for relevance before including them)

Validation

```
$ python tests/validate-manifest.py
OK: manifest.yaml protocols match all template frontmatter.
```
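The script itself is not shown in this PR. As a hedged illustration of the kind of check its output describes, the sketch below compares protocol lists declared in a manifest against those declared in template frontmatter. The function name, the dictionary shapes, and the protocol lists are all assumptions; the real `tests/validate-manifest.py` may work differently.

```python
def find_protocol_mismatches(manifest, frontmatter):
    """Compare protocol lists per template; return {template: problem} for each mismatch."""
    problems = {}
    for template, declared in manifest.items():
        actual = frontmatter.get(template)
        if actual is None:
            problems[template] = "no frontmatter found"
        elif set(declared) != set(actual):
            diff = sorted(set(declared) ^ set(actual))
            problems[template] = "differs on: " + ", ".join(diff)
    return problems


# Hypothetical inputs; these protocol lists are illustrative only.
manifest = {
    "investigate-trace": ["anti-hallucination", "operational-constraints", "root-cause-analysis"],
}
frontmatter_ok = {
    "investigate-trace": ["root-cause-analysis", "anti-hallucination", "operational-constraints"],
}
frontmatter_bad = {
    "investigate-trace": ["root-cause-analysis", "anti-hallucination"],
}

ok_result = find_protocol_mismatches(manifest, frontmatter_ok)
bad_result = find_protocol_mismatches(manifest, frontmatter_bad)
```

Ordering differences are ignored here on purpose, so only a missing or extra protocol counts as a mismatch.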

Design Decisions

  • Option B chosen: Keep `investigate-bug` code-focused, create separate `investigate-trace` for telemetry. Each template is sharp for its domain.
  • Taxonomy unchanged: `stack-lifetime-hazards` stays in `investigate-bug` (it's relevant for code bugs). `investigate-trace` has no default taxonomy; bootstrap suggests one contextually.
  • RCA phases are domain-agnostic: Iterative deepening and cross-component causal chains benefit all investigation types (memory leaks, CI failures, performance issues), not just ETW analysis.
  • Scoped labeling relaxation: Only raw measurements get implicit KNOWN — causal claims retain full labeling to prevent hallucination in interpretations.
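The scoped labeling rule can be pictured as a small lint over findings. This sketch assumes each finding is tagged as either a `measurement` or a `causal` claim, and it recognizes only the markers that appear in this PR (`[KNOWN]`, `[ASSUMPTION]`, `[UNKNOWN: <what is missing>]`). The protocol may define further markers, and the `needs_label` helper is hypothetical.

```python
import re

# Explicit epistemic markers mentioned in this PR; the protocol may define others.
LABEL = re.compile(r"^\[(KNOWN|ASSUMPTION|UNKNOWN: [^\]]+)\]\s")


def needs_label(statement: str, kind: str) -> bool:
    """True when the statement must carry an explicit label but does not."""
    if kind == "measurement":
        return False  # raw tool output is implicitly KNOWN
    return LABEL.match(statement) is None


findings = [
    ("A process used 0.94% CPU but 12.3% of total energy", "measurement"),
    ("Defender scans are triggered by OneDrive writes", "causal"),
    ("[ASSUMPTION] Defender scans are triggered by OneDrive writes", "causal"),
    ("[UNKNOWN: no call stacks captured] Why the scan repeats", "causal"),
]
unlabeled = [text for text, kind in findings if needs_label(text, kind)]
```

Only the bare causal claim is reported as unlabeled. The measurement passes without a marker, which is the relaxation; the interpretation does not, which is the part that stays strict.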

Add a purpose-built investigate-trace template for ETW/telemetry/profiling
trace analysis, based on real-world feedback from an LLM executing an
assembled investigate-bug prompt for Windows power trace analysis.

New template (investigate-trace):
- Call stack analysis as primary technique, not afterthought
- Energy-vs-metric divergence detection (CPU% vs energy%)
- Cross-process amplification cascade analysis
- Tool-agnostic analysis steps (not WPA-specific)
- Iterative deepening workflow: broad survey → module → stack → cross-process

Protocol improvements (root-cause-analysis):
- Phase 3a: Iterative Deepening — investigation proceeds in layers of
  increasing resolution; do not write report until deep analysis complete
- Phase 4a: Cross-Component Causal Chains — trace trigger-response pairs,
  map amplification cascades, quantify amplification factors, identify
  leverage points

Guardrail improvements:
- anti-hallucination: scoped labeling relaxation for direct observations
  from authoritative tool output; causal claims retain full labeling
- operational-constraints: data-driven scoping rules for trace/telemetry
  analysis (data categories and time ranges, not file counts)

Format and template improvements:
- investigation-report: recognize template-level full-format overrides
- investigate-bug: explicit full-format override for root cause tasks
- bootstrap: taxonomy relevance evaluation during assembly

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 20:13

Copilot AI left a comment

Pull request overview

Adds a new trace/telemetry-focused investigation template and updates shared investigation protocols/formats so root-cause workflows better support iterative deepening and cross-component causal chains.

Changes:

  • Added investigate-trace template tailored for ETW/ETL/telemetry investigations (stack-first analysis, energy/metric divergence, amplification cascades).
  • Enhanced root-cause-analysis with iterative deepening (Phase 3a) and cross-component causal chains (Phase 4a).
  • Updated guardrails, bootstrap guidance, and investigation-report format to better fit data-driven/trace investigations and full-format overrides.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| templates/investigate-trace.md | New template for systematic trace/telemetry investigations using investigation-report output. |
| templates/investigate-bug.md | Forces full investigation-report format for bug investigations. |
| protocols/reasoning/root-cause-analysis.md | Adds iterative deepening and cross-component causal-chain phases. |
| protocols/guardrails/operational-constraints.md | Adds scoping/retrieval constraints for traces/logs and structured query results. |
| protocols/guardrails/anti-hallucination.md | Relaxes explicit [KNOWN] labeling for authoritative tool/telemetry measurements while keeping inference labeling. |
| formats/investigation-report.md | Clarifies that some investigation templates always require full format. |
| bootstrap.md | Adds guidance to evaluate relevance of template-declared taxonomies before including them. |
| manifest.yaml | Registers the new investigate-trace template. |

Comment thread templates/investigate-trace.md Outdated
Comment thread templates/investigate-trace.md Outdated
Comment thread protocols/reasoning/root-cause-analysis.md Outdated
Comment thread protocols/reasoning/root-cause-analysis.md
Comment thread formats/investigation-report.md Outdated
- Align ASSUMED marker with anti-hallucination protocol ([ASSUMPTION])
- Soften 'at least top 5' to 'up to 5' with data-limitation escape hatch
  in both investigate-trace template and root-cause-analysis protocol
- Add investigate-trace to root-cause-analysis applicable_to list
- Remove root-cause-ci-failure from full-format example list (not applicable)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread templates/investigate-trace.md Outdated
Comment thread templates/investigate-trace.md Outdated
Comment thread templates/investigate-trace.md Outdated
Comment thread protocols/guardrails/anti-hallucination.md
Comment thread templates/investigate-trace.md
- Add low-CPU edge case handling for energy-to-CPU ratio threshold
- Qualify call stack requirement as 'when available' in quality checklist
- Align [UNKNOWN] marker with protocol's [UNKNOWN: <what is missing>]
- Fix remaining INFERRED/ASSUMED instance to use [ASSUMPTION] marker

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Alan-Jowett requested a review from Copilot April 21, 2026 22:12

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comment thread templates/investigate-trace.md Outdated
Comment thread templates/investigate-bug.md Outdated
Comment thread templates/investigate-trace.md
Comment thread protocols/reasoning/root-cause-analysis.md
Remove hardcoded '8 sections' from investigate-trace and investigate-bug
templates — the investigation-report format defines 9 sections (§1–§9).
Avoids drift by not embedding a count that the format owns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Alan-Jowett merged commit 0919e84 into microsoft:main Apr 21, 2026
8 checks passed
Alan-Jowett deleted the improve-investigation-from-etl-feedback branch April 21, 2026 23:23