- Date range: 2026-04-22 — 2026-06-09
- Scope: Major architectural release — Flexible Backend Support RFC Phase 1 (adapter-driven parser / analysis / derived-artifact / reproducer dispatch + frontend generalization), multi-process trace write corruption fix (PID + host filename suffix, cross-PID kernel-hash merge, pre-init rank attribution), CI migration to PyTorch self-hosted A10G + CUDA 13.0 + Python 3.14, AI-driven compat builder reliability loop, and zstd trace write support.
Highlights
- 🧩 Flexible Backend Support RFC Phase 1, Adapter-Driven Reader (#387, #394, #401, #402, #403, #404, #406, #408, #409, #411, #414): End-to-end refactor that moves every backend-specific decision in the reader (stage discovery, parser dispatch, analysis scheduling, derived artifacts, reproducer device/sync, frontend stage names, syntax highlighting) out of hardcoded branches and behind the `CompilationPipelineAdapter` contract. The trace now carries an `ir_stages` descriptor list per compilation event, so the React frontend has zero remaining hardcoded stage names. Adding a new backend now means subclassing the adapter and registering parsers/analyzers in `__init__`; no shared-core edits are required. Per-adapter registries are instance-isolated, eliminating cross-backend coupling and the previous three-way circular dependency between `backend.py`, `ir_parser.py`, and `ir_analysis.py`.
- 🔀 Multi-Process Trace Write Corruption Fix: Each Triton process now writes to a dedicated `..._rank_{N|none}_pid_{PID}_host_{HOST}_.ndjson` file, fixing silent NDJSON corruption when N Inductor compile workers shared one log file on filesystems without cross-process `O_APPEND` atomicity. `parse_logs` was reworked into a per-rank `parse_single_rank` batch API with cross-PID kernel-hash dedup and a three-pass `(host, PID) → rank` lookup that re-attributes pre-`dist.init` no-rank files into their owning rank by default. New CLI flag `--no-pre-init-attribution` is the escape hatch. `host` is keyed alongside PID so colliding PIDs across distributed hosts cannot mis-attribute. Backward compatibility with legacy filenames (no `_pid_`, `_host_`, or `_rank_` suffix) is preserved end-to-end and exercised by `tests/cpu/test_legacy_filename_compat.py`.
- 🚀 CI Migration to A10G + CUDA 13.0 + Python 3.14 + Triton-Nightly Wheels (#396): GPU CI moved from the GitHub-hosted T4 runner to the PyTorch self-hosted `linux.g5.4xlarge.nvidia.gpu` pool with the `pytorch/almalinux-builder:cuda13.0` container. Triton install switched from a 30–50 min source build to the upstream Triton-Nightly wheel feed (~30 s warm). A companion script `.ci/triton_nightly_pin.py` (#400) works around PEP-440 local-version lexicographic sort so `pip install -U --pre triton` actually picks the newest commit instead of an arbitrary stale wheel. Includes Py 3.14 compatibility fixes and `pip uninstall pytorch-triton triton` reconciliation before the nightly install.
- 🤖 AI-Driven Compat Builder Reliability Loop: The `compat_builder` module now runs a verify-and-retry loop (up to 5 attempts) where each AI fix is probe-tested against the incompatible LLVM and, on failure, the new build-error log is fed back into the next attempt. Build errors are persisted to log files rather than truncated inline, so `note:`/`candidate:` lines with correct API signatures survive into the next prompt. Adds stale-commit detection, a post-bisect single-commit re-verification probe (clean output instead of cross-contaminated bisect output), and a two-stage prompt rewrite (Analysis → Execution).
- 🗜️ Zstd Compression Support for Trace Writes (#385): `TRITON_TRACE_COMPRESSION=zstd` now writes `.bin.ndjson` trace files directly. The read side already supported zstd via `tools/compression.py`; this closes the write side. A new generic `compress_single_file(file_path, compression, verbose)` helper in `parse/common.py` replaces `gzip_single_file` and supports both gzip and zstd.
- 🎯 AutotuneListener Integration + Cross-PID Autotune Dedup: Consumes the new upstream `knobs.autotuning.listener` API in `structured_logging.py` to capture autotune events natively. `_generate_autotune_analysis_events` in `event_diff.py` now dedups compilations by hash before applying the ">= 2 configs = real benchmark session" gate, fixing inflated `compilation_analysis.configs` when two PIDs hit the same Triton cache for one compiled kernel.
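The dedup-before-gate ordering described for the autotune analysis can be illustrated with a minimal sketch. The record shape (`hash` and `pid` keys) and both function names are hypothetical and are not tritonparse's actual data model; the point is only that collapsing duplicates by hash has to happen before counting configs.

```python
def dedup_by_hash(compilations):
    """Keep the first compilation record seen for each kernel hash."""
    seen = {}
    for comp in compilations:
        seen.setdefault(comp["hash"], comp)
    return list(seen.values())


def is_benchmark_session(compilations, min_configs=2):
    """Apply the '>= 2 configs' gate after collapsing cross-PID duplicates."""
    return len(dedup_by_hash(compilations)) >= min_configs


# Two PIDs hit the same Triton cache entry, so one kernel is logged twice.
same_kernel_two_pids = [
    {"hash": "abc123", "pid": 101},
    {"hash": "abc123", "pid": 202},
]
# A real autotune session compiles distinct configs with distinct hashes.
real_session = same_kernel_two_pids + [{"hash": "def456", "pid": 101}]
```

Counting before deduplication would treat `same_kernel_two_pids` as a two-config session, which is the inflation the fix addresses.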
Changes by Area
🧩 Flexible Backend Support (RFC Phase 1)
- General Backend Infrastructure & Generic Parse Flow (#387): Introduces `CompilationPipelineAdapter`, `NvidiaTritonAdapter`, `AmdTritonAdapter`, `IRStageDescriptor`, and the dynamic stage loop in `trace_processor.py`. New `backend.py` (~310 LOC) is the centralized hub.
- Parser Distribution Refactor (#394): Replaces hardcoded IR parser selection with `ParserRegistry` + adapter-driven dispatch. New `metadata` parameter threaded through `parse_single_file` / `parse_single_rank`.
- Analysis Dispatch Refactor (#401): Introduces `AnalyzerInfo`, `AnalysisRegistry`, standardized analyzer wrapper functions, and layered registration for common vs. backend-specific analyzers. Centralizes `TRITONPARSE_ANALYSIS` env-var handling with case normalization and unknown-name filtering.
- Stage Discovery Cleanup (#402): Removes the `isinstance(backend_name, str)` pre-check layer once `backend_name` was confirmed consistently present. Deletes ~76 lines of dead pre-dispatch logic in `backend.py`.
- Derived Artifact Refactor (#403): Replaces hardcoded CUBIN → SASS dump in `extract_file_content()` with an adapter-driven `DerivedArtifactInfo` + `DerivedArtifactRegistry`. Adds `TRITONPARSE_DERIVED_ARTIFACTS` entry point; `TRITONPARSE_DUMP_SASS` remains as a compat shim with legacy fallback.
- Reproducer Migration (#404): Final reader-side gap closed: device strings (`cuda:0` / `mps:0` / `xpu:0`) and sync calls (`torch.cuda.synchronize()` / `torch.mps.synchronize()` / `torch.xpu.synchronize()`) are now derived from `adapter.pytorch_module`. New `SYNC_CALL_PLACEHOLDER` and `_replace_sync_call` handler in `placeholder_replacer.py`. `normalize_device_string()` switches from no-op to generic prefix-based normalization.
- Trace Data Foundation for Frontend (#406): Each compilation event payload now carries an `ir_stages` array with `name`, `extension`, `display_name`, `display_order`, `is_text`, `supports_source_mapping`, `syntax_id` per stage. Schema (`compilation.schema.json`) updated.
- Frontend Generalization via `ir_stages` (#408): All hardcoded stage names removed from `dataLoader.ts`, `irLanguage.ts`, `languageUtils.ts`, `CodeView.tsx`, `CodeComparisonView.tsx`, `SingleCodeViewer.tsx`, `FileDiffSession.tsx`, `FileDiffView.tsx`. `SourceMapping` interface generalized with `[key: string]: unknown` index signature. New `getDefaultPanels()` and `getGroupingAnchor()` utilities derive panel selection and line-grouping anchor from `ir_stages` sorted by `display_order`. Legacy traces without `ir_stages` retain hardcoded fallback paths.
- Registry Architecture Refactor (#409): Migrates `ParserRegistry`, `AnalyzerInfo`, `AnalysisRegistry` from `ir_parser.py` / `ir_analysis.py` into `backend.py`, eliminating the three-way circular dependency that previously required lazy in-function imports throughout. All three registries converted from class-level shared storage to per-adapter instance isolation, so NVIDIA and AMD registrations are fully independent.
- AnalyzerContext Unification (#411): New `AnalyzerContext` dataclass replaces the `(entry, procedure_checks)` analyzer signature with `(entry, ctx)`. Future per-call context fields (`device_info`, `compile_options`) extend `AnalyzerContext` without touching every analyzer. Also fixes a latent 4-arg-to-3-param bug in `register_backend_analyzer` and adds registry isolation tests for both `register_backend_analyzer` and `register_backend_derived_artifact`.
- `backend.py` Internal Consistency Cleanup (#414): Three-commit cleanup: registry simplification, `list_*` API renaming, docstring/type-hint unification. `backend.py` net delta -3 LOC but ~370 lines reorganized.
- Multi-Backend Integration Guide + Ascend Example (#413): New `docs/10.-Adding-a-New-Backend.md` (~393 lines) with full Ascend NPU integration walkthrough.
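To make the per-adapter isolation and the `pytorch_module`-derived reproducer strings concrete, here is a minimal sketch. The class names, `register_parser`, `device_string`, and `sync_call` are hypothetical stand-ins chosen for illustration; the real `CompilationPipelineAdapter` interface may look quite different, so treat this as the shape of the idea rather than the actual API.

```python
from typing import Callable, Dict


class AdapterSketch:
    """Hypothetical stand-in for the adapter contract; not the real class."""

    pytorch_module = "cuda"

    def __init__(self) -> None:
        # Registries live on the instance, so one backend's registrations
        # are invisible to every other backend.
        self.parsers: Dict[str, Callable[[str], dict]] = {}

    def register_parser(self, stage: str, fn: Callable[[str], dict]) -> None:
        self.parsers[stage] = fn

    def device_string(self, index: int = 0) -> str:
        # Derived from pytorch_module instead of a hardcoded "cuda:0".
        return f"{self.pytorch_module}:{index}"

    def sync_call(self) -> str:
        return f"torch.{self.pytorch_module}.synchronize()"


class XpuAdapterSketch(AdapterSketch):
    """A new backend: subclass, set pytorch_module, register in __init__."""

    pytorch_module = "xpu"

    def __init__(self) -> None:
        super().__init__()
        self.register_parser("ttgir", lambda text: {"lines": text.count("\n") + 1})


nvidia = AdapterSketch()
xpu = XpuAdapterSketch()
```

With class-level storage, the `ttgir` parser registered by `XpuAdapterSketch` would also show up on `nvidia`; with instance-level storage, `nvidia.parsers` stays empty.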
🔀 Multi-Process Trace Files
- Repro Tests First: `tests/cpu/test_multiprocess_write.py` (~445 lines) + `tests/gpu/test_multiprocess_write_inductor.py` (~191 lines) landed as failing tests demonstrating the corruption bug.
- PID + Host Filename Suffix: `TritonTraceHandler.emit()` adds `pid_{PID}_host_{HOST}_` to the filename. Hostname via `socket.gethostname().split(".")[0]` sanitized to `[a-zA-Z0-9-]`, cached for process lifetime. Adds `os.write` to bypass `BufferedWriter` and `register_at_fork` for additional safety.
- `parse_single_rank` Batch API: Refactors `parse_single_file` into `parse_single_rank(files_list, output_dir, ...)` for cross-PID kernel-hash merge per rank. New `_collect_and_bucket_files` helper implements the three-pass `(host, pid) → rank` algorithm. New `_PID_REGEX`, `_HOST_REGEX`, `_extract_pid_from_filename`, `_prescan_for_fake_compilations_multi`.
- Pre-init Rank Attribution On by Default: Flips `enable_pre_init_attribution` default to `True`. CLI flag `--no-pre-init-attribution` exposed through the parse pipeline.
- Cross-PID Autotune Dedup: Fixes inflated `compilation_analysis.configs` from duplicate per-PID entries by deduping by hash in `_generate_autotune_analysis_events`.
- Legacy Filename Compatibility (`tests/cpu/test_legacy_filename_compat.py`, ~486 lines): Comprehensive coverage of all four legacy filename formats; no-host, no-PID, and no-rank-token combinations all continue to parse.
- Docs Update: `docs/02.-Usage-Guide.md` and `docs/06.-FAQ.md` updated for the new multi-process behavior.
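One plausible reading of the three-pass `(host, pid) → rank` attribution is sketched below. The regexes, the `trace_` filename prefix, and the function names are assumptions made for illustration; the real `_PID_REGEX`, `_HOST_REGEX`, and `_collect_and_bucket_files` may differ in both pattern and behavior.

```python
import re
from collections import defaultdict

# Hypothetical patterns; the real _PID_REGEX / _HOST_REGEX may differ.
RANK_RE = re.compile(r"_rank_(\d+|none)_")
PID_RE = re.compile(r"_pid_(\d+)_")
HOST_RE = re.compile(r"_host_([a-zA-Z0-9-]+)_")


def sanitize_hostname(raw):
    """Keep the short hostname and drop characters outside [a-zA-Z0-9-]."""
    return re.sub(r"[^a-zA-Z0-9-]", "", raw.split(".")[0])


def bucket_by_rank(filenames):
    """Group trace files by rank, re-attributing no-rank files via (host, pid)."""
    # Pass 1: extract (rank, pid, host) tokens from every filename.
    parsed = []
    for name in filenames:
        rank = RANK_RE.search(name)
        pid = PID_RE.search(name)
        host = HOST_RE.search(name)
        parsed.append((
            name,
            rank.group(1) if rank else "none",
            pid.group(1) if pid else None,
            host.group(1) if host else None,
        ))
    # Pass 2: learn which (host, pid) pairs belong to which rank.
    owner = {}
    for _, rank, pid, host in parsed:
        if rank != "none" and pid is not None:
            owner[(host, pid)] = rank
    # Pass 3: place each file, leaving it under "none" when no owner is known.
    buckets = defaultdict(list)
    for name, rank, pid, host in parsed:
        if rank == "none":
            rank = owner.get((host, pid), "none")
        buckets[rank].append(name)
    return dict(buckets)


files = [
    "trace_rank_0_pid_101_host_nodeA_.ndjson",
    "trace_rank_none_pid_101_host_nodeA_.ndjson",  # pre-init file, same process
    "trace_rank_1_pid_101_host_nodeB_.ndjson",     # same PID, different host
    "trace_rank_none_pid_999_host_nodeA_.ndjson",  # no ranked sibling
]
buckets = bucket_by_rank(files)
```

Keying on `(host, pid)` rather than `pid` alone is what keeps the `nodeB` file from capturing the `nodeA` pre-init file in this example.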
🤖 AI / Compat Builder
- Verify-and-Retry Loop: `fix_incompatibility()` runs up to 5 attempts with stale-commit detection (`_check_for_commit` compares HEAD before/after). Build errors saved to log files (changes `build_error: str` to `build_error_log: Path` throughout). Post-bisect single-commit probe re-verifies the first-bad LLVM commit with clean output.
- Prompt Rewrite: Two-stage (Analysis → Execution) structure; context reordered by causality (LLVM diff → error log → reference fix); removed inline source files section since AI reads files directly.
- GCC Probe Workaround: Probe script and `_ensure_llvm_repo()` switched from conda `clang++` to system `gcc`/`g++`, matching `bisect_llvm.sh`.
- AI Client Debug Logging + pair_tester Bugfix: Adds debug logging to the Claude CLI client; fixes a missing `header_skipped = True` assignment in `pair_tester.py`.
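The verify-and-retry loop with log-file feedback can be sketched as follows. `fix_with_retries`, `propose_fix`, and `probe_build` are hypothetical names, and the sketch omits stale-commit detection; it only shows how persisting each failed build's full output to a file lets the next attempt read it untruncated.

```python
import tempfile
from pathlib import Path


def fix_with_retries(propose_fix, probe_build, log_dir, max_attempts=5):
    """Retry a fix, feeding each failed build's log file into the next attempt.

    propose_fix(previous_log) and probe_build() are hypothetical callables;
    probe_build returns (ok, build_output).
    """
    log_dir = Path(log_dir)
    previous_log = None
    for attempt in range(1, max_attempts + 1):
        propose_fix(previous_log)
        ok, output = probe_build()
        if ok:
            return attempt
        # Persist the full output so 'note:' / 'candidate:' lines survive.
        previous_log = log_dir / f"build_error_attempt_{attempt}.log"
        previous_log.write_text(output)
    return None


# Simulated run: the probe fails twice, then succeeds on the third attempt.
seen_logs = []
outcomes = iter([
    (False, "error: no matching function\nnote: candidate: foo(int)"),
    (False, "error: still broken"),
    (True, ""),
])

with tempfile.TemporaryDirectory() as tmp:
    attempts_used = fix_with_retries(
        propose_fix=lambda log: seen_logs.append(log.read_text() if log else None),
        probe_build=lambda: next(outcomes),
        log_dir=tmp,
    )
```

Passing a `Path` instead of an inline string mirrors the `build_error: str` to `build_error_log: Path` change noted above.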
🚀 Infrastructure & CI
- A10G + CUDA 13.0 + Python 3.14 Migration (#396): `4-core-ubuntu-gpu-t4` → `linux.g5.4xlarge.nvidia.gpu` (16 vCPU / A10G sm_86) with `pytorch/almalinux-builder:cuda13.0` container. Triton-Nightly wheel install replaces source build. `actions/cache@v5` enabled via direct `container:` directive.
- Triton Nightly Wheel Pin (#400): `.ci/triton_nightly_pin.py` (~242 lines) resolves the actual newest commit from the upstream Triton-Nightly feed instead of relying on PEP-440 local-version lexicographic sort, which silently selected stale wheels.
- pip / python Mismatch Fix (#395): `.ci/setup.sh` reconciles pip and python interpreters; the silent `conda create ... || true` failure-swallowing pattern was removed.
- GPU Multiprocess Inductor Test Stabilization (#412): Sets `TORCHINDUCTOR_COMPILE_THREADS=1` instead of the broken `TORCHINDUCTOR_WORKER_START=spawn` env var (Inductor caches its worker pool with `lru_cache(1)`, freezing it before `setUp()` can switch). Sidesteps the upstream `RuntimeError: 0 active drivers ([])` race.
- Lazy `torch` Import in `structured_logging.py`: Six functions gain a function-local `import torch` to break a circular import path (`import triton` → `tritonparse.structured_logging` → `import torch` → `import triton`) that surfaced after the module-top `import torch` was added.
- Stale PR Branch Cleanup Workflow: Monthly GitHub Actions workflow (`.github/workflows/cleanup-stale-pr-branches.yml`, runs at 00:00 UTC).
- CodeQL Findings Fix (#386): Resolves three CodeQL security warnings in CI workflows and `website/scripts/inline-html.js`.
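The stale-wheel problem can be reproduced in miniature. Under PEP 440, when public versions tie, an alphanumeric local version segment compares lexicographically, so a git-hash suffix orders wheels by hash text rather than by commit age. The feed entries and upload dates below are invented for illustration, and picking by upload date is only an assumed stand-in for however `.ci/triton_nightly_pin.py` actually resolves the newest commit.

```python
from datetime import date

# Hypothetical feed entries: (version string, upload date). Not real wheels.
feed = [
    ("3.7.0+gitf00dbabe", date(2026, 5, 1)),
    ("3.7.0+git0badcafe", date(2026, 6, 1)),
    ("3.7.0+git9c0ffee1", date(2026, 5, 20)),
]


def local_segment(version):
    """Return the part after '+', which PEP 440 calls the local version label."""
    return version.split("+", 1)[1]


# What version ordering effectively does when the public versions tie:
# a single alphanumeric local segment is compared lexicographically.
picked_by_version = max(feed, key=lambda item: local_segment(item[0]))

# What a pin script can do instead: use an ordering that tracks commit age.
picked_by_date = max(feed, key=lambda item: item[1])
```

Here the lexicographic winner is the oldest upload, because `f` sorts after both `9` and `0`.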
🛠️ Bisect & Tooling
- Skip-on-Build-Failure Default: `base_bisector.py` now exits 125 (`git bisect skip`) on intermediate build failures instead of 128 (abort). New `--build-failure-action` flag (`"skip"` | `"abort"`, default `"skip"`). Motivated by a Triton-main bisect that aborted on an unrelated mid-range build failure.
- GCC Workaround in `bisect_llvm.sh`: Allows using `gcc`/`g++` as the compiler instead of conda's `clang++`.
- `$HOME/.triton` Cleanup During Build (#397): `bisect_triton.sh` cleans the llvm dir before each build.
- Lazy FileCheck Detection: `_get_filecheck_binary()` wraps detection so the WARNING fires only on the first `run_filecheck()` call, not at module import, which eliminates spurious warnings in unrelated tests.
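The exit-code convention behind the skip-by-default change comes from `git bisect run`: exit code 125 marks the commit as untestable and skips it, while codes above 127 abort the bisect. A minimal sketch of mapping the flag to an exit code follows; `exit_code_for_build_failure` and `parse_args` are hypothetical helpers, not functions from `base_bisector.py`.

```python
import argparse

BISECT_SKIP = 125   # git bisect run: commit cannot be tested, skip it
BISECT_ABORT = 128  # codes above 127 abort the whole bisect


def exit_code_for_build_failure(action):
    """Map the build-failure action to the exit code git bisect run expects."""
    if action == "skip":
        return BISECT_SKIP
    if action == "abort":
        return BISECT_ABORT
    raise ValueError(f"unknown build failure action: {action!r}")


def parse_args(argv):
    """Parse only the flag discussed above; other options are omitted."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--build-failure-action",
        choices=["skip", "abort"],
        default="skip",
    )
    return parser.parse_args(argv)


default_code = exit_code_for_build_failure(parse_args([]).build_failure_action)
abort_code = exit_code_for_build_failure(
    parse_args(["--build-failure-action", "abort"]).build_failure_action
)
```

Skipping lets the bisect route around an unrelated broken commit in the middle of the range instead of ending the whole run.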
📊 Trace / Schema
- Compilation Schema Fix: `times` field moved from `payload.times` to `payload.metadata.times` to match what `structured_logging.py` actually writes.
- `ir_stages` in Compilation Schema (#406): Per-event stage descriptor list added to `compilation.schema.json`.
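As an illustration of how a consumer might use the per-event `ir_stages` list, the sketch below sorts descriptors by `display_order` and picks default panels, with a fallback for legacy traces. The field names come from the schema description above, but every value, the `default_panels` function, and the `LEGACY_PANELS` fallback are invented for this example and do not reflect the frontend's actual `getDefaultPanels()` logic.

```python
# Hypothetical ir_stages payload; field names follow the notes, values are invented.
ir_stages = [
    {"name": "ptx", "extension": ".ptx", "display_name": "PTX",
     "display_order": 40, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "ptx"},
    {"name": "cubin", "extension": ".cubin", "display_name": "CUBIN",
     "display_order": 30, "is_text": False,
     "supports_source_mapping": False, "syntax_id": ""},
    {"name": "ttgir", "extension": ".ttgir", "display_name": "TTGIR",
     "display_order": 20, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "mlir"},
    {"name": "ttir", "extension": ".ttir", "display_name": "TTIR",
     "display_order": 10, "is_text": True,
     "supports_source_mapping": True, "syntax_id": "mlir"},
]

# Assumed fallback for traces written before ir_stages existed.
LEGACY_PANELS = ["ttgir", "ptx"]


def default_panels(stages, count=2):
    """Pick the first `count` text stages by display_order, or fall back."""
    if not stages:
        return list(LEGACY_PANELS)
    text_stages = sorted(
        (s for s in stages if s["is_text"]),
        key=lambda s: s["display_order"],
    )
    return [s["name"] for s in text_stages[:count]]
```

Because ordering and text-ness are read from the trace, a backend with different stage names needs no change in the consumer.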
🎨 Frontend / Website
- Homepage Button Display Polish (#392): When no trace is loaded, only File Diff remains in the top-right; Kernel Overview / IR Code / IR Analysis tabs render disabled with `not-allowed` cursor and tooltip "Please load a trace file first".
- Dependency Bumps: `dompurify` 3.3.2 → 3.4.0 in `monaco-editor` override; `postcss` 8.5.9 → 8.5.12 (#393); 12-package website group bump (#416) including React 19.2.5 → 19.2.7, Vite 8.0.10 → 8.0.16, Tailwind 4.2.4 → 4.3.0; 3-package bump (#398).
Compatibility Notes
- No public API breakage: The CLI surface (`unified_parse`, `oss_run`, `parse_single_file`) is preserved. `parse_single_file` remains a thin wrapper over `parse_single_rank` for single-file callers.
- Behavior change (default rank attribution): Pre-init kernel attribution is now ON by default. Traces produced after this release have no-rank pre-init files automatically re-attributed to their owning rank during parse. Use `--no-pre-init-attribution` to restore the previous unattributed view.
- Behavior change (trace filenames): New traces are written with `_pid_{PID}_host_{HOST}_` suffixes. Legacy filenames without these suffixes continue to parse unchanged.
- Frontend stage handling: Backends that produce non-standard stage names are now first-class: the frontend reads `ir_stages` from the trace instead of hardcoding `ttgir`/`ptx`/`sass`. Old traces without `ir_stages` fall back to the original behavior.
Upgrade Guidance
- Standard upgrade: `pip install --upgrade tritonparse`
- Adding a new backend (Ascend NPU and other accelerators): Subclass `CompilationPipelineAdapter` and register backend-specific parsers / analyzers / derived artifacts in `__init__`. No shared-core code changes are required. See the new walkthrough at `docs/10.-Adding-a-New-Backend.md`. Define `pytorch_module` (e.g., `"mps"`, `"xpu"`) so the reproducer device string and sync call are derived automatically.
- Multi-process / distributed users: No action required for new traces. If you need the pre-refactor unattributed view (kernels compiled before `dist.init_process_group()` shown separately rather than re-attributed), use: `tritonparse parse <source> --no-pre-init-attribution`
- Enable zstd trace writes (smaller than gzip, comparable speed): `export TRITON_TRACE_COMPRESSION=zstd`. Trace files will be written as `.bin.ndjson` and remain readable by the parser without extra flags.
- Pin to Triton-Nightly in your own CI (if you mirror our setup): The `.ci/triton_nightly_pin.py` resolver script can be reused to defeat the PEP-440 lexicographic-sort bug in the upstream Triton-Nightly feed. It falls back to `pip install -U --pre triton` on any resolver error.