TritonParse v0.4.1 Release
- Date range: 2026-01-22 to 2026-02-24
- Scope: Feature release. Adds a new diff CLI subcommand for kernel compilation comparison with tensor value analysis, autotune analysis visualization, profile-aware launch tracing, enhanced reproducer support, bisect auto-setup, and multi-format trace compression support.
Highlights
- Autotune Analysis: End-to-end autotune session tracking with frontend visualization. Automatically detects autotune sessions, tracks benchmark vs winner launches, displays configuration comparison tables, and shows winner run count statistics.
- New diff CLI Subcommand (Beta): Complete kernel compilation diff system for comparing two compilation events. Supports metadata analysis, source mapping comparison, IR statistics diff, and tensor value comparison with configurable tolerances (--tensor-values, --atol, --rtol). Output can be appended in-place or written to new files. Note: This feature is in beta; APIs and output formats may change in future releases.
- Profile-Aware Launch Tracing: Transparent integration with torch.profiler via TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1. Monkey-patches torch.profiler.schedule to trace launches only during the profiler's RECORD phase.
- Multi-Format Compression: Added CLP (Compressed Log Processor) support alongside existing gzip. Trace compression is now disabled by default (TRITON_TRACE_COMPRESSION=none). Magic number detection for transparent decompression.
- Bisect Auto-Setup: New --auto-env-setup flag for --llvm-only bisect mode. Automatically clones/updates Triton and LLVM repositories, creates conda environments.
- TMA Kernel Support: TensorDescriptor capture and reconstruction for TMA (Tensor Memory Accelerator) kernel reproducers.
Changes by Area
Autotune Analysis
New diff CLI Subcommand (Beta)
A complete kernel compilation comparison system (~1500 lines) with a layered architecture:
- Data Types (D1): CompilationDiffResult, DiffNote, DiffSummary, IRStats, IRStatsDiff, MetadataDiff, TensorArgDiff, TensorValueDiff
- Event Matching (D2): match_events_by_index(), match_events_by_kernel(), find_launch_for_compilation()
- Diff Engine (D3): Main DiffEngine class orchestrating all analyzers
- Metadata Analyzer (D4): Compares compilation metadata (num_warps, num_stages, etc.)
- Sourcemap Analyzer (D5): Compares source mappings between IRs
- Summary Generator (D6): Generates human-readable diff summaries
- Output Module (D7): ConsolidatedDiffWriter, append_diff_to_file(), format_summary()
- CLI Entry Point (D8): tritonparseoss diff command with --events, --kernel, --tensor-values flags
- Tensor Value Analyzer: Numeric tensor comparison with blob mode (full element-wise) and stats mode (min/max/mean/std fallback)
- Unit Tests: Phase 1 test coverage for core modules
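The two modes of the Tensor Value Analyzer can be illustrated with a small standalone sketch. This is not TritonParse's implementation: the helper names compare_blob, summarize, and compare_stats are hypothetical, tensors are represented as flat Python lists, and the tolerance rule |a - b| <= atol + rtol * |b| is an assumption borrowed from the allclose convention used by NumPy and PyTorch.

```python
import statistics


def compare_blob(a, b, atol=1e-5, rtol=1e-3):
    """Element-wise comparison (hypothetical helper, not the TritonParse API).

    Assumes the allclose-style rule: |x - y| <= atol + rtol * |y|.
    Returns the indices of elements that fall outside the tolerance.
    """
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    return [
        i
        for i, (x, y) in enumerate(zip(a, b))
        if abs(x - y) > atol + rtol * abs(y)
    ]


def summarize(values):
    """Summary statistics, for when full tensor contents are unavailable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
    }


def compare_stats(stats_a, stats_b, atol=1e-5, rtol=1e-3):
    """Compare two summaries field by field with the same tolerance rule."""
    return {
        key: abs(stats_a[key] - stats_b[key]) <= atol + rtol * abs(stats_b[key])
        for key in stats_a
    }


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0005, 3.5]
mismatches = compare_blob(a, b)  # only the last element exceeds the tolerance
stats_match = compare_stats(summarize(a), summarize(b))
print(mismatches)
print(stats_match)
```

The stats comparison is strictly weaker than the element-wise one: two tensors can share min, max, mean, and std while differing in individual elements, which is why it only makes sense as a fallback.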
CLI Usage:
# Compare compilations 0 and 1 in single file
tritonparseoss diff trace.ndjson --events 0,1
# Compare with tensor value analysis
tritonparseoss diff trace.ndjson --tensor-values --atol 1e-5 --rtol 1e-3
# List available compilations
tritonparseoss diff trace.ndjson --list
# Filter by kernel name
tritonparseoss diff trace.ndjson --kernel matmul --events 0,1
Profile-Aware Launch Tracing
- New environment variable TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1
- patch_profiler_schedule(): Monkey-patches torch.profiler.schedule
- enable_launch_tracing() / disable_launch_tracing() API
- Mutually exclusive with TRITON_TRACE_LAUNCH (validated at init)
- Unit tests for all three scenarios: no flag, trace all, profile-aware
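The idea of tracing launches only while the profiler is recording can be sketched without PyTorch. Everything below is a stand-in: Action mimics torch.profiler.ProfilerAction, make_schedule is a simplified single-cycle imitation of a schedule factory, and profile_aware and the tracing flag are hypothetical names rather than TritonParse internals. Treating RECORD_AND_SAVE as part of the RECORD phase is also an assumption.

```python
from enum import Enum


class Action(Enum):
    """Stand-in for torch.profiler.ProfilerAction (assumption for this sketch)."""

    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


def make_schedule(wait, warmup, active):
    """Simplified stand-in for a profiler schedule factory (one cycle only)."""

    def schedule_fn(step):
        if step < wait:
            return Action.NONE
        if step < wait + warmup:
            return Action.WARMUP
        if step < wait + warmup + active - 1:
            return Action.RECORD
        if step == wait + warmup + active - 1:
            return Action.RECORD_AND_SAVE
        return Action.NONE

    return schedule_fn


tracing = {"enabled": False}  # hypothetical launch-tracing switch


def profile_aware(factory):
    """Wrap a schedule factory so the tracing flag follows the recording steps."""

    def patched_factory(*args, **kwargs):
        inner = factory(*args, **kwargs)

        def schedule_fn(step):
            action = inner(step)
            tracing["enabled"] = action in (Action.RECORD, Action.RECORD_AND_SAVE)
            return action

        return schedule_fn

    return patched_factory


# "Monkey-patch": rebind the factory name to the wrapped version.
make_schedule = profile_aware(make_schedule)

schedule_fn = make_schedule(wait=1, warmup=1, active=2)
traced_steps = []
for step in range(5):
    schedule_fn(step)
    if tracing["enabled"]:
        traced_steps.append(step)
print(traced_steps)  # steps during which launches would be traced
```

Because the wrapper returns the inner schedule's action unchanged, the profiler itself behaves exactly as before; only the side effect on the tracing flag is added.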
Compression Module
- Magic number detection: detect_compression() for gzip/zstd/none
- Transparent reading: open_compressed_file() context manager
- CLP format support (#326): TRITON_TRACE_COMPRESSION="clp" for Compressed Log Processor format
- Default change: Compression disabled by default (was gzip)
- API functions: is_gzip_file(), is_zstd_file(), iter_lines()
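Magic number detection can be sketched as follows. The helpers sniff_compression and open_text are hypothetical and are not the module's detect_compression() or open_compressed_file(); the sample event line is made up. The sketch reads only gzip transparently, since the standard library has no zstd or CLP reader, but it recognizes the gzip and Zstandard magic bytes defined by their respective format specifications.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"          # gzip member header (RFC 1952)
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic 0xFD2FB528, little-endian


def sniff_compression(path):
    """Return "gzip", "zstd", or "none" based on the file's leading bytes.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"


def open_text(path):
    """Open a trace for line-by-line reading, decompressing gzip if detected."""
    if sniff_compression(path) == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "trace.ndjson")
    packed = os.path.join(tmp, "trace.ndjson.gz")
    sample = '{"event": "compilation"}\n'  # made-up sample line
    with open(plain, "w", encoding="utf-8") as f:
        f.write(sample)
    with gzip.open(packed, "wt", encoding="utf-8") as f:
        f.write(sample)

    detected = (sniff_compression(plain), sniff_compression(packed))
    with open_text(packed) as f:
        first_line = f.readline()

print(detected, first_line.strip())
```

Sniffing the leading bytes rather than trusting the file extension means a reader keeps working when a trace is renamed or when the compression default changes between releases.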
Bisect Enhancements
- EnvironmentManager (#329-#332):
- Auto-clone Triton and LLVM repositories from GitHub
- Create/verify conda environments
- --auto-env-setup CLI flag for --llvm-only mode
- Status checking and diagnostics
- Unit tests for all scenarios
Reproducer Enhancements
- TensorDescriptor support (#344): Captures base, shape, strides, block_shape, padding for TMA kernels
- preserve_autotune mode (#328): Preserve autotune configs in reproducer scripts
- Robustness improvements: Complex kernel handling, function reference detection in call arguments (#348)
- Verbose args print placeholder (#347): Placeholder for verbose argument printing
- WS kernel fix (#349): Correct num_warps handling for Warp Specialization kernels
- Better logging (#346): Improved logging when black/isort unavailable
Website UI Improvements
- KernelOverview page: New component for autotune analysis visualization (870 lines)
- WebSocket ArrayBuffer handling (#340): Direct trace ArrayBuffer via iframe messaging
- URL normalization (#324): Manifold Explorer and tritonparse URL handling
- Click-to-highlight tip: Added in CodeComparisonView
- Title navigation fix: TritonParse title returns to home
Infrastructure & API
- SASS parsing refactor: extract_sass_pc_mappings() for PC-offset-keyed source mapping (for CUTracer integration)
- Rank-less file support (#341, #342): --rank none for parsing files without rank suffix
- Launch without compilation (#336): Support launch events when compilation was cached
- log_dir parameter (#337): TritonParseManager(log_dir=...) API
- Auto-switch log file: When rank becomes available during execution
- Error message improvements (#339): Better diagnostics and bug fixes
- Meta copyright headers: Added to all scripts
- Dependabot prefix: [dependabot] prefix to PR titles
- Negative line support (#319): prettify_ndjson handles negative line numbers
Compatibility Notes
- Default Change: Trace compression is now disabled by default. Set TRITON_TRACE_COMPRESSION="gzip" to restore v0.4.0 behavior.
- New Feature (Beta): The diff subcommand is additive and doesn't affect existing workflows. It is in beta; APIs and output formats may change.
- New Feature: Autotune analysis events are automatically generated; frontend displays when available.
- Mutual Exclusivity: TRITON_TRACE_LAUNCH and TRITON_TRACE_LAUNCH_WITHIN_PROFILING cannot both be set.
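A launcher script can guard against the mutual-exclusivity rule before starting a run. The sketch below is an illustration, not TritonParse's own validation: the function name resolve_launch_tracing and its return values are hypothetical, and it assumes a variable counts as set only when its value is exactly "1".

```python
import os


def resolve_launch_tracing(env=None):
    """Return "all", "profile", or "off" (hypothetical helper and labels).

    Assumes a variable is "set" only when its value is exactly "1".
    """
    env = os.environ if env is None else env
    trace_all = env.get("TRITON_TRACE_LAUNCH") == "1"
    within_profiling = env.get("TRITON_TRACE_LAUNCH_WITHIN_PROFILING") == "1"
    if trace_all and within_profiling:
        raise ValueError(
            "TRITON_TRACE_LAUNCH and TRITON_TRACE_LAUNCH_WITHIN_PROFILING "
            "are mutually exclusive; set only one of them."
        )
    if trace_all:
        return "all"
    if within_profiling:
        return "profile"
    return "off"


# The three supported scenarios: no flag, trace all, profile-aware.
modes = [
    resolve_launch_tracing({}),
    resolve_launch_tracing({"TRITON_TRACE_LAUNCH": "1"}),
    resolve_launch_tracing({"TRITON_TRACE_LAUNCH_WITHIN_PROFILING": "1"}),
]
print(modes)
```

Passing the environment as a dictionary keeps the check easy to test without mutating os.environ.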
Upgrade Guidance
- Use diff for kernel comparison:
  # Basic diff
  tritonparseoss diff trace.ndjson --events 0,1
  # With tensor value comparison
  tritonparseoss diff trace.ndjson --tensor-values --kernel matmul
- Enable profile-aware tracing:
  TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1 python train.py
- Use CLP compression (if available):
  TRITON_TRACE_COMPRESSION="clp" python train.py
- Bisect with auto-setup:
  tritonparseoss bisect --llvm-only --auto-env-setup \
    --triton-dir ~/oss-triton \
    --good-llvm abc123 --bad-llvm def456 \
    --test-script test.py
- TMA kernel reproducers: Now work automatically when TensorDescriptor arguments are present.