TritonParse Release Notes v0.5.0 (43 commits)

@FindHao released this 10 Jun 03:36
  • Date range: 2026-04-22 — 2026-06-09
  • Scope: Major architectural release — Flexible Backend Support RFC Phase 1 (adapter-driven parser / analysis / derived-artifact / reproducer dispatch + frontend generalization), multi-process trace write corruption fix (PID + host filename suffix, cross-PID kernel-hash merge, pre-init rank attribution), CI migration to PyTorch self-hosted A10G + CUDA 13.0 + Python 3.14, AI-driven compat builder reliability loop, and zstd trace write support.

Highlights

  • 🧩 Flexible Backend Support RFC Phase 1 — Adapter-Driven Reader (#387, #394, #401, #402, #403, #404, #406, #408, #409, #411, #414): End-to-end refactor that moves every backend-specific decision in the reader (stage discovery, parser dispatch, analysis scheduling, derived artifacts, reproducer device/sync, frontend stage names, syntax highlighting) out of hardcoded branches and behind the CompilationPipelineAdapter contract. The trace now carries an ir_stages descriptor list per compilation event, so the React frontend has zero remaining hardcoded stage names. Adding a new backend now means subclassing the adapter and registering parsers/analyzers in __init__ — no shared-core edits required. Per-adapter registries are instance-isolated, eliminating cross-backend coupling and the previous three-way circular dependency between backend.py, ir_parser.py, and ir_analysis.py.

  • 🔀 Multi-Process Trace Write Corruption Fix: Each Triton process now writes to a dedicated ..._rank_{N|none}_pid_{PID}_host_{HOST}_.ndjson file, fixing silent NDJSON corruption when N Inductor compile workers shared one log file on filesystems without cross-process O_APPEND atomicity. parse_logs was reworked into a per-rank parse_single_rank batch API with cross-PID kernel-hash dedup and a three-pass (host, PID) → rank lookup that re-attributes pre-dist.init no-rank files into their owning rank by default. New CLI flag --no-pre-init-attribution is the escape hatch. host is keyed alongside PID so colliding PIDs across distributed hosts cannot mis-attribute. Backward compatibility with legacy filenames (no _pid_, _host_, or _rank_ suffix) is preserved end-to-end and exercised by tests/cpu/test_legacy_filename_compat.py.

  • 🚀 CI Migration to A10G + CUDA 13.0 + Python 3.14 + Triton-Nightly Wheels (#396): GPU CI moved from the GitHub-hosted T4 runner to the PyTorch self-hosted linux.g5.4xlarge.nvidia.gpu pool with the pytorch/almalinux-builder:cuda13.0 container. Triton install switched from a 30–50 min source build to the upstream Triton-Nightly wheel feed (~30 s warm). A companion script .ci/triton_nightly_pin.py (#400) works around PEP-440 local-version lexicographic sort so pip install -U --pre triton actually picks the newest commit instead of an arbitrary stale wheel. Includes Py 3.14 compatibility fixes and pip uninstall pytorch-triton triton reconciliation before the nightly install.

  • 🤖 AI-Driven Compat Builder Reliability Loop: The compat_builder module now runs a verify-and-retry loop (up to 5 attempts) where each AI fix is probe-tested against the incompatible LLVM and, on failure, the new build-error log is fed back into the next attempt. Build errors are persisted to log files rather than truncated inline, so note: / candidate: lines with correct API signatures survive into the next prompt. Adds stale-commit detection, a post-bisect single-commit re-verification probe (clean output instead of cross-contaminated bisect output), and a two-stage prompt rewrite (Analysis → Execution).

  • 🗜️ Zstd Compression Support for Trace Writes (#385): TRITON_TRACE_COMPRESSION=zstd now writes .bin.ndjson trace files directly. The read side already supported zstd via tools/compression.py; this closes the write side. A new generic compress_single_file(file_path, compression, verbose) helper in parse/common.py replaces gzip_single_file and supports both gzip and zstd.

  • 🎯 AutotuneListener Integration + Cross-PID Autotune Dedup: Consumes the new upstream knobs.autotuning.listener API in structured_logging.py to capture autotune events natively. _generate_autotune_analysis_events in event_diff.py now dedups compilations by hash before applying the ">= 2 configs = real benchmark session" gate, fixing inflated compilation_analysis.configs when two PIDs hit the same Triton cache for one compiled kernel.
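
The cross-PID autotune dedup described in the last highlight can be pictured with a small sketch. The event shape (`hash`, `pid` keys) and the helper names below are assumptions made for illustration; they are not the actual `_generate_autotune_analysis_events` code. The point is only the ordering: dedup by hash first, then apply the ">= 2 configs" gate.

```python
from collections import OrderedDict


def dedup_by_hash(compilations):
    """Keep the first compilation record seen for each kernel hash."""
    unique = OrderedDict()
    for comp in compilations:
        unique.setdefault(comp["hash"], comp)
    return list(unique.values())


def is_real_benchmark_session(compilations):
    """Apply the '>= 2 configs' gate only after deduplicating by hash."""
    return len(dedup_by_hash(compilations)) >= 2


# Hypothetical input: two PIDs hit the same Triton cache entry for one kernel.
same_kernel_two_pids = [
    {"hash": "abc123", "pid": 101},
    {"hash": "abc123", "pid": 202},
]

# Hypothetical input: two genuinely different configs were compiled.
two_configs = same_kernel_two_pids + [{"hash": "def456", "pid": 101}]
```

Without the dedup step, the first input would count as two configs and pass the gate even though only one kernel was compiled.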

Changes by Area

🧩 Flexible Backend Support (RFC Phase 1)

  • General Backend Infrastructure & Generic Parse Flow (#387): Introduces CompilationPipelineAdapter, NvidiaTritonAdapter, AmdTritonAdapter, IRStageDescriptor, and the dynamic stage loop in trace_processor.py. New backend.py (~310 LOC) is the centralized hub.
  • Parser Distribution Refactor (#394): Replaces hardcoded IR parser selection with ParserRegistry + adapter-driven dispatch. New metadata parameter threaded through parse_single_file / parse_single_rank.
  • Analysis Dispatch Refactor (#401): Introduces AnalyzerInfo, AnalysisRegistry, standardized analyzer wrapper functions, and layered registration for common vs. backend-specific analyzers. Centralizes TRITONPARSE_ANALYSIS env-var handling with case normalization and unknown-name filtering.
  • Stage Discovery Cleanup (#402): Removes the isinstance(backend_name, str) pre-check layer once backend_name was confirmed consistently present. Deletes ~76 lines of dead pre-dispatch logic in backend.py.
  • Derived Artifact Refactor (#403): Replaces hardcoded CUBIN → SASS dump in extract_file_content() with an adapter-driven DerivedArtifactInfo + DerivedArtifactRegistry. Adds TRITONPARSE_DERIVED_ARTIFACTS entry point; TRITONPARSE_DUMP_SASS remains as a compat shim with legacy fallback.
  • Reproducer Migration (#404): Final reader-side gap closed — device strings (cuda:0 / mps:0 / xpu:0) and sync calls (torch.cuda.synchronize() / torch.mps.synchronize() / torch.xpu.synchronize()) are now derived from adapter.pytorch_module. New SYNC_CALL_PLACEHOLDER and _replace_sync_call handler in placeholder_replacer.py. normalize_device_string() switches from no-op to generic prefix-based normalization.
  • Trace Data Foundation for Frontend (#406): Each compilation event payload now carries an ir_stages array with name, extension, display_name, display_order, is_text, supports_source_mapping, syntax_id per stage. Schema (compilation.schema.json) updated.
  • Frontend Generalization via ir_stages (#408): All hardcoded stage names removed from dataLoader.ts, irLanguage.ts, languageUtils.ts, CodeView.tsx, CodeComparisonView.tsx, SingleCodeViewer.tsx, FileDiffSession.tsx, FileDiffView.tsx. SourceMapping interface generalized with [key: string]: unknown index signature. New getDefaultPanels() and getGroupingAnchor() utilities derive panel selection and line-grouping anchor from ir_stages sorted by display_order. Legacy traces without ir_stages retain hardcoded fallback paths.
  • Registry Architecture Refactor (#409): Migrates ParserRegistry, AnalyzerInfo, AnalysisRegistry from ir_parser.py / ir_analysis.py into backend.py, eliminating the three-way circular dependency that previously required lazy in-function imports throughout. All three registries converted from class-level shared storage to per-adapter instance isolation, so NVIDIA and AMD registrations are fully independent.
  • AnalyzerContext Unification (#411): New AnalyzerContext dataclass replaces the (entry, procedure_checks) analyzer signature with (entry, ctx). Future per-call context fields (device_info, compile_options) extend AnalyzerContext without touching every analyzer. Also fixes a latent 4-arg-to-3-param bug in register_backend_analyzer and adds registry isolation tests for both register_backend_analyzer and register_backend_derived_artifact.
  • backend.py Internal Consistency Cleanup (#414): Three-commit cleanup — registry simplification, list_* API renaming, docstring/type-hint unification. backend.py net delta -3 LOC but ~370 lines reorganized.
  • Multi-Backend Integration Guide + Ascend Example (#413): New docs/10.-Adding-a-New-Backend.md (~393 lines) with full Ascend NPU integration walkthrough.
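
To make the per-adapter instance isolation from #409 concrete, here is a minimal sketch of the pattern. The classes `SketchParserRegistry`, `SketchAdapter`, `NvidiaLikeAdapter`, and `AmdLikeAdapter` are hypothetical stand-ins written for this note; they do not reproduce the signatures of `ParserRegistry` or `CompilationPipelineAdapter` in `backend.py`. See docs/10.-Adding-a-New-Backend.md for the real contract.

```python
from typing import Callable, Dict, List, Optional


class SketchParserRegistry:
    """Hypothetical registry: storage lives on the instance, not the class."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Callable[[str], dict]] = {}

    def register(self, stage: str, parser: Callable[[str], dict]) -> None:
        self._parsers[stage] = parser

    def get(self, stage: str) -> Optional[Callable[[str], dict]]:
        return self._parsers.get(stage)

    def stages(self) -> List[str]:
        return sorted(self._parsers)


class SketchAdapter:
    """Hypothetical base adapter: each instance owns its own registry."""

    pytorch_module = "cuda"

    def __init__(self) -> None:
        self.parsers = SketchParserRegistry()


def count_lines(text: str) -> dict:
    """Toy parser used only to have something to register."""
    return {"lines": len(text.splitlines())}


class NvidiaLikeAdapter(SketchAdapter):
    def __init__(self) -> None:
        super().__init__()
        self.parsers.register("ptx", count_lines)


class AmdLikeAdapter(SketchAdapter):
    def __init__(self) -> None:
        super().__init__()
        self.parsers.register("amdgcn", count_lines)


nvidia = NvidiaLikeAdapter()
amd = AmdLikeAdapter()
```

Because each registry is created in `__init__`, a registration made by one adapter is invisible to the other. With class-level storage, both adapters would have shared one dictionary.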

🔀 Multi-Process Trace Files

  • Repro Tests First: tests/cpu/test_multiprocess_write.py (~445 lines) + tests/gpu/test_multiprocess_write_inductor.py (~191 lines) landed as failing tests demonstrating the corruption bug.
  • PID + Host Filename Suffix: TritonTraceHandler.emit() adds pid_{PID}_host_{HOST}_ to the filename. The hostname is taken from socket.gethostname().split(".")[0], sanitized to [a-zA-Z0-9-], and cached for the process lifetime. Also adds os.write to bypass BufferedWriter and register_at_fork for additional safety.
  • parse_single_rank Batch API: Refactors parse_single_file into parse_single_rank(files_list, output_dir, ...) for cross-PID kernel-hash merge per rank. New _collect_and_bucket_files helper implements the three-pass (host, pid) → rank algorithm. New _PID_REGEX, _HOST_REGEX, _extract_pid_from_filename, _prescan_for_fake_compilations_multi.
  • Pre-init Rank Attribution On by Default: Flips enable_pre_init_attribution default to True. CLI flag --no-pre-init-attribution exposed through the parse pipeline.
  • Cross-PID Autotune Dedup: Fixes inflated compilation_analysis.configs from duplicate per-PID entries by deduping by hash in _generate_autotune_analysis_events.
  • Legacy Filename Compatibility (tests/cpu/test_legacy_filename_compat.py, ~486 lines): Comprehensive coverage of all four legacy filename formats — no-host, no-PID, no-rank-token combinations all continue to parse.
  • Docs Update: docs/02.-Usage-Guide.md and docs/06.-FAQ.md updated for the new multi-process behavior.
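
The (host, PID) → rank re-attribution can be sketched as follows. This is one reading of the behavior described above, not the actual `_collect_and_bucket_files` implementation: the regexes, the `trace_` filename prefix, and the function names are all assumptions, and the release notes do not spell out what each of the three passes does.

```python
import re
from collections import defaultdict

# Hypothetical patterns; the real _PID_REGEX / _HOST_REGEX may differ.
RANK_RE = re.compile(r"rank_(\d+|none)_")
PID_RE = re.compile(r"pid_(\d+)_")
HOST_RE = re.compile(r"host_([a-zA-Z0-9-]+)_")


def parse_name(name):
    """Return (rank, pid, host); any part missing from the name is None."""
    rank_m = RANK_RE.search(name)
    pid_m = PID_RE.search(name)
    host_m = HOST_RE.search(name)
    rank = None
    if rank_m and rank_m.group(1) != "none":
        rank = int(rank_m.group(1))
    pid = int(pid_m.group(1)) if pid_m else None
    host = host_m.group(1) if host_m else None
    return rank, pid, host


def bucket_by_rank(names):
    # Pass 1: parse every filename.
    parsed = [(name, *parse_name(name)) for name in names]
    # Pass 2: learn which rank each (host, pid) pair belongs to.
    owner = {}
    for _, rank, pid, host in parsed:
        if rank is not None and pid is not None:
            owner[(host, pid)] = rank
    # Pass 3: move no-rank files into the rank that owns their (host, pid).
    buckets = defaultdict(list)
    for name, rank, pid, host in parsed:
        if rank is None and pid is not None:
            rank = owner.get((host, pid))
        buckets[rank].append(name)
    return dict(buckets)


names = [
    "trace_rank_none_pid_100_host_nodeA_.ndjson",  # written before dist init
    "trace_rank_0_pid_100_host_nodeA_.ndjson",
    "trace_rank_1_pid_100_host_nodeB_.ndjson",     # same PID, different host
    "trace_rank_none_pid_999_host_nodeA_.ndjson",  # no ranked sibling
    "legacy_trace.ndjson",                          # legacy name, no suffixes
]
buckets = bucket_by_rank(names)
```

Keying on the host as well as the PID is what keeps the `nodeB` file from claiming the pre-init file written by PID 100 on `nodeA`.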

🤖 AI / Compat Builder

  • Verify-and-Retry Loop: fix_incompatibility() runs up to 5 attempts with stale-commit detection (_check_for_commit compares HEAD before/after). Build errors saved to log files (changes build_error: str to build_error_log: Path throughout). Post-bisect single-commit probe re-verifies the first-bad LLVM commit with clean output.
  • Prompt Rewrite: Two-stage (Analysis → Execution) structure; context reordered by causality (LLVM diff → error log → reference fix); removed inline source files section since AI reads files directly.
  • GCC Probe Workaround: Probe script and _ensure_llvm_repo() switched from conda clang++ to system gcc / g++, matching bisect_llvm.sh.
  • AI Client Debug Logging + pair_tester Bugfix: Adds debug logging to the Claude CLI client; fixes a missing header_skipped = True assignment in pair_tester.py.
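
The shape of the verify-and-retry loop can be sketched in a few lines. This is an illustration under assumptions, not `fix_incompatibility()` itself: the callables `propose_fix` and `probe_build`, the log filename pattern, and the return convention are all invented here, and stale-commit detection is left out.

```python
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 5  # matches the "up to 5 attempts" described above


def fix_with_retries(propose_fix, probe_build, log_dir, max_attempts=MAX_ATTEMPTS):
    """Return the attempt number that built cleanly, or None if all failed.

    propose_fix(previous_log) receives the path of the last build-error log
    (None on the first attempt); probe_build() returns (ok, error_text).
    """
    previous_log = None
    for attempt in range(1, max_attempts + 1):
        propose_fix(previous_log)
        ok, error_text = probe_build()
        if ok:
            return attempt
        # Persist the full error to a file instead of truncating it inline.
        previous_log = Path(log_dir) / f"build_error_attempt_{attempt}.log"
        previous_log.write_text(error_text)
    return None


def demo():
    """Fake fixer and builder: the build succeeds on the third attempt."""
    seen_logs = []
    state = {"builds": 0}

    def propose_fix(previous_log):
        seen_logs.append(previous_log.name if previous_log else None)

    def probe_build():
        state["builds"] += 1
        if state["builds"] < 3:
            return False, "error: no matching function\nnote: candidate function not viable"
        return True, ""

    with tempfile.TemporaryDirectory() as tmp:
        attempts = fix_with_retries(propose_fix, probe_build, tmp)
    return attempts, seen_logs
```

Passing a log path rather than an error string mirrors the `build_error: str` to `build_error_log: Path` change: the next attempt can read the whole log, including the `note:` lines.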

🚀 Infrastructure & CI

  • A10G + CUDA 13.0 + Python 3.14 Migration (#396): 4-core-ubuntu-gpu-t4 → linux.g5.4xlarge.nvidia.gpu (16 vCPU / A10G sm_86) with pytorch/almalinux-builder:cuda13.0 container. Triton-Nightly wheel install replaces source build. actions/cache@v5 enabled via direct container: directive.
  • Triton Nightly Wheel Pin (#400): .ci/triton_nightly_pin.py (~242 lines) resolves the actual newest commit from the upstream Triton-Nightly feed instead of relying on PEP-440 local-version lexicographic sort, which silently selected stale wheels.
  • pip / python Mismatch Fix (#395): .ci/setup.sh reconciles pip and python interpreters; the silent conda create ... || true failure-swallowing pattern was removed.
  • GPU Multiprocess Inductor Test Stabilization (#412): Sets TORCHINDUCTOR_COMPILE_THREADS=1 instead of the broken TORCHINDUCTOR_WORKER_START=spawn env var (Inductor caches its worker pool with lru_cache(1), freezing it before setUp() can switch). Sidesteps the upstream RuntimeError: 0 active drivers ([]) race.
  • Lazy torch Import in structured_logging.py: Six functions gain a function-local import torch to break a circular import path (import triton → tritonparse.structured_logging → import torch → import triton) that surfaced after the module-top import torch was added.
  • Stale PR Branch Cleanup Workflow: Monthly GitHub Actions workflow (.github/workflows/cleanup-stale-pr-branches.yml, runs at 00:00 UTC).
  • CodeQL Findings Fix (#386): Resolves three CodeQL security warnings in CI workflows and website/scripts/inline-html.js.
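
Why sorting on the local-version segment picks a stale wheel can be shown with a toy example. The version strings, hashes, and dates below are made up, and using the commit date as the sort key is an assumption about what "newest" means; the actual resolution logic lives in .ci/triton_nightly_pin.py. Plain string comparison stands in for PEP 440's lexicographic ordering of an alphanumeric local segment.

```python
from datetime import datetime, timezone

# Hypothetical nightly wheels: same public version, different git hashes.
candidates = [
    {
        "version": "3.7.0+gitf00dbabe",
        "committed": datetime(2026, 5, 1, tzinfo=timezone.utc),
    },
    {
        "version": "3.7.0+git0badcafe",
        "committed": datetime(2026, 6, 1, tzinfo=timezone.utc),
    },
]


def pick_by_local_version(wheels):
    """Mimic an installer: highest local-version string wins."""
    return max(wheels, key=lambda w: w["version"].split("+", 1)[1])


def pick_by_commit_date(wheels):
    """Pick the wheel built from the most recent commit."""
    return max(wheels, key=lambda w: w["committed"])
```

A git hash carries no ordering information, so the "highest" hash is effectively arbitrary. Here the string sort selects the May wheel even though the June one is newer.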

🛠️ Bisect & Tooling

  • Skip-on-Build-Failure Default: base_bisector.py now exits 125 (git bisect skip) on intermediate build failures instead of 128 (abort). New --build-failure-action flag ("skip" | "abort", default "skip"). Motivated by a Triton-main bisect that aborted on an unrelated mid-range build failure.
  • GCC Workaround in bisect_llvm.sh: Allows using gcc / g++ as the compiler instead of conda's clang++.
  • $HOME/.triton Cleanup During Build (#397): bisect_triton.sh cleans the llvm dir before each build.
  • Lazy FileCheck Detection: _get_filecheck_binary() wraps detection so the WARNING fires only on first run_filecheck() call, not at module import — eliminates spurious warnings in unrelated tests.
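
The skip-on-build-failure default relies on how `git bisect run` interprets exit codes: 0 marks a commit good, 125 skips it, other codes from 1 to 127 mark it bad, and anything above 127 aborts the bisect. A minimal sketch of that mapping follows; the function name and arguments are hypothetical and do not mirror base_bisector.py.

```python
GOOD, BAD, SKIP, ABORT = 0, 1, 125, 128


def bisect_exit_code(build_ok, test_ok, build_failure_action="skip"):
    """Translate a build/test outcome into a `git bisect run` exit code."""
    if build_failure_action not in ("skip", "abort"):
        raise ValueError(f"unknown build failure action: {build_failure_action!r}")
    if not build_ok:
        # An unbuildable commit says nothing about the regression itself.
        return SKIP if build_failure_action == "skip" else ABORT
    return GOOD if test_ok else BAD
```

With the "skip" default, an unrelated broken commit in the middle of the range no longer ends the whole bisect; git simply picks another commit to test.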

📊 Trace / Schema

  • Compilation Schema Fix: times field moved from payload.times to payload.metadata.times to match what structured_logging.py actually writes.
  • ir_stages in Compilation Schema (#406): Per-event stage descriptor list added to compilation.schema.json.
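
As an illustration of how a consumer might use the per-event descriptor list, the sketch below orders stages by `display_order` and keeps the text ones. The field names come from the list in #406; the concrete values (stage names, extensions, order numbers) and the helper function are invented for this example and are not taken from compilation.schema.json.

```python
# Hypothetical ir_stages array for one compilation event.
ir_stages = [
    {"name": "ptx", "extension": "ptx", "display_name": "PTX",
     "display_order": 30, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "ptx"},
    {"name": "ttir", "extension": "ttir", "display_name": "TTIR",
     "display_order": 10, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "mlir"},
    {"name": "cubin", "extension": "cubin", "display_name": "CUBIN",
     "display_order": 40, "is_text": False,
     "supports_source_mapping": False, "syntax_id": "plaintext"},
    {"name": "ttgir", "extension": "ttgir", "display_name": "TTGIR",
     "display_order": 20, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "mlir"},
]


def ordered_text_stages(stages):
    """Names of displayable text stages, sorted by display_order."""
    ordered = sorted(stages, key=lambda stage: stage["display_order"])
    return [stage["name"] for stage in ordered if stage["is_text"]]
```

A viewer that works this way needs no built-in knowledge of stage names, which is what lets a new backend's stages show up without frontend changes.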

🎨 Frontend / Website

  • Homepage Button Display Polish (#392): When no trace is loaded, only File Diff remains in the top-right; Kernel Overview / IR Code / IR Analysis tabs render disabled with not-allowed cursor and tooltip "Please load a trace file first".
  • Dependency Bumps: dompurify 3.3.2 → 3.4.0 in monaco-editor override; postcss 8.5.9 → 8.5.12 (#393); 12-package website group bump (#416) including React 19.2.5 → 19.2.7, Vite 8.0.10 → 8.0.16, Tailwind 4.2.4 → 4.3.0; 3-package bump (#398).

Compatibility Notes

  • No public API breakage: The CLI surface (unified_parse, oss_run, parse_single_file) is preserved. parse_single_file remains a thin wrapper over parse_single_rank for single-file callers.
  • Behavior change — default rank attribution: Pre-init kernel attribution is now ON by default. Traces produced after this release have no-rank pre-init files automatically re-attributed to their owning rank during parse. Use --no-pre-init-attribution to restore the previous unattributed view.
  • Behavior change — trace filenames: New traces are written with _pid_{PID}_host_{HOST}_ suffixes. Legacy filenames without these suffixes continue to parse unchanged.
  • Frontend stage handling: Backends that produce non-standard stage names are now first-class — the frontend reads ir_stages from the trace instead of hardcoding ttgir / ptx / sass. Old traces without ir_stages fall back to the original behavior.

Upgrade Guidance

  1. Standard upgrade:

    pip install --upgrade tritonparse

  2. Adding a new backend (Ascend NPU and other accelerators): Subclass CompilationPipelineAdapter and register backend-specific parsers / analyzers / derived artifacts in __init__. No shared-core code changes are required. Define pytorch_module (e.g., "mps", "xpu") so the reproducer device string and sync call are derived automatically. See the new walkthrough at docs/10.-Adding-a-New-Backend.md.

  3. Multi-process / distributed users: No action required for new traces. If you need the pre-refactor unattributed view (kernels compiled before dist.init_process_group() shown separately rather than re-attributed), use:

    tritonparse parse <source> --no-pre-init-attribution

  4. Enable zstd trace writes (smaller than gzip, comparable speed):

    export TRITON_TRACE_COMPRESSION=zstd

    Trace files will be written as .bin.ndjson and remain readable by the parser without extra flags.

  5. Pin to Triton-Nightly in your own CI (if you mirror our setup): The .ci/triton_nightly_pin.py resolver script can be reused to defeat the PEP-440 lexicographic-sort bug in the upstream Triton-Nightly feed. Falls back to pip install -U --pre triton on any resolver error.
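
For step 2, the way a reproducer's device string and sync call follow from `pytorch_module` can be sketched as below. The helper names, the placeholder token, and the template line are assumptions for illustration; only the resulting strings (`cuda:0`, `torch.xpu.synchronize()`, and so on) come from the notes above, and the real logic lives in placeholder_replacer.py.

```python
# Hypothetical placeholder token; the real SYNC_CALL_PLACEHOLDER value may differ.
SYNC_PLACEHOLDER = "{{SYNC_CALL}}"


def device_string(pytorch_module, index=0):
    """Build a device string such as 'cuda:0' from the adapter's module name."""
    return f"{pytorch_module}:{index}"


def sync_call(pytorch_module):
    """Build the matching synchronize call as source text."""
    return f"torch.{pytorch_module}.synchronize()"


def fill_sync_placeholder(template, pytorch_module):
    """Replace the placeholder in a reproducer template line."""
    return template.replace(SYNC_PLACEHOLDER, sync_call(pytorch_module))


line = fill_sync_placeholder("    {{SYNC_CALL}}  # wait for the kernel", "xpu")
```

Deriving both strings from one attribute means a new backend only has to declare its module name once.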