TritonParse v0.4.1 Release 🎉

@FindHao FindHao released this 25 Feb 18:56
· 126 commits to main since this release
  • Date range: 2026-01-22 to 2026-02-24
  • Scope: Feature release - New diff CLI subcommand for kernel compilation comparison with tensor value analysis, autotune analysis visualization, profile-aware launch tracing, enhanced reproducer support, bisect auto-setup, and multi-format trace compression support.

Highlights

  • 📊 Autotune Analysis: End-to-end autotune session tracking with frontend visualization. Automatically detects autotune sessions, tracks benchmark vs winner launches, displays configuration comparison tables, and shows winner run count statistics.

  • 🔬 New diff CLI Subcommand (Beta): Complete kernel compilation diff system for comparing two compilation events. Supports metadata analysis, source mapping comparison, IR statistics diff, and tensor value comparison with configurable tolerances (--tensor-values, --atol, --rtol). Output can be appended in-place or written to new files. Note: This feature is in beta; APIs and output formats may change in future releases.

  • ⚡ Profile-Aware Launch Tracing: Transparent integration with torch.profiler via TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1. Monkey-patches torch.profiler.schedule to trace launches only during the profiler's RECORD phase.

  • 🗜️ Multi-Format Compression: Added CLP (Compressed Log Processor) support alongside existing gzip. Trace compression is now disabled by default (TRITON_TRACE_COMPRESSION=none). Compressed files are detected by magic number and decompressed transparently.

  • 🔧 Bisect Auto-Setup: New --auto-env-setup flag for --llvm-only bisect mode. Automatically clones/updates the Triton and LLVM repositories and creates conda environments.

  • 📦 TMA Kernel Support: TensorDescriptor capture and reconstruction for TMA (Tensor Memory Accelerator) kernel reproducers.

Changes by Area

📊 Autotune Analysis

⚑ Profile-Aware Launch Tracing

🔬 New diff CLI Subcommand (Beta)

A complete kernel compilation comparison system (~1500 lines) with a layered architecture:

  • Data Types (D1): CompilationDiffResult, DiffNote, DiffSummary, IRStats, IRStatsDiff, MetadataDiff, TensorArgDiff, TensorValueDiff
  • Event Matching (D2): match_events_by_index(), match_events_by_kernel(), find_launch_for_compilation()
  • Diff Engine (D3): Main DiffEngine class orchestrating all analyzers
  • Metadata Analyzer (D4): Compares compilation metadata (num_warps, num_stages, etc.)
  • Sourcemap Analyzer (D5): Compares source mappings between IRs
  • Summary Generator (D6): Generates human-readable diff summaries
  • Output Module (D7): ConsolidatedDiffWriter, append_diff_to_file(), format_summary()
  • CLI Entry Point (D8): tritonparseoss diff command with --events, --kernel, --tensor-values flags
  • Tensor Value Analyzer: Numeric tensor comparison with blob mode (full element-wise) and stats mode (min/max/mean/std fallback)
  • Unit Tests: Phase 1 test coverage for core modules
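
To illustrate what a metadata comparison layer does conceptually, the sketch below compares two metadata dictionaries and reports the keys whose values differ. It is not taken from the TritonParse code base: the function name `diff_metadata`, the `ABSENT` marker, and the sample values are hypothetical; only the field names `num_warps` and `num_stages` come from the list above.

```python
from typing import Any

# Hypothetical marker for a key that exists on only one side of the comparison.
ABSENT = "<absent>"


def diff_metadata(a: dict[str, Any], b: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    changed: dict[str, tuple[Any, Any]] = {}
    for key in sorted(a.keys() | b.keys()):
        va = a.get(key, ABSENT)
        vb = b.get(key, ABSENT)
        if va != vb:
            changed[key] = (va, vb)
    return changed


if __name__ == "__main__":
    # Sample values are made up for illustration.
    first = {"num_warps": 4, "num_stages": 3}
    second = {"num_warps": 8, "num_stages": 3}
    for key, (old, new) in diff_metadata(first, second).items():
        print(f"{key}: {old} -> {new}")
```

Reporting only the changed keys keeps the output short when two compilations share most of their settings.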

CLI Usage:

# Compare compilations 0 and 1 in single file
tritonparseoss diff trace.ndjson --events 0,1

# Compare with tensor value analysis
tritonparseoss diff trace.ndjson --tensor-values --atol 1e-5 --rtol 1e-3

# List available compilations
tritonparseoss diff trace.ndjson --list

# Filter by kernel name
tritonparseoss diff trace.ndjson --kernel matmul --events 0,1
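
For readers unfamiliar with absolute and relative tolerances, here is a minimal sketch of how `--atol` and `--rtol` style thresholds are commonly combined. It assumes the convention used by `numpy.isclose` and `torch.isclose`, where two values match when `|a - b| <= atol + rtol * |b|`; these notes do not state which rule TritonParse applies, so treat this as an assumption. The helper names are hypothetical, and the second function only mirrors the idea of a statistics fallback (min/max/mean/std) mentioned above.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def is_close(a: float, b: float, atol: float = 1e-5, rtol: float = 1e-3) -> bool:
    """Assumed rule (same as numpy.isclose): |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


def compare_values(xs: Sequence[float], ys: Sequence[float],
                   atol: float = 1e-5, rtol: float = 1e-3) -> dict:
    """Element-wise comparison of two flattened tensors of equal length."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same number of elements")
    mismatched = [i for i, (x, y) in enumerate(zip(xs, ys))
                  if not is_close(x, y, atol, rtol)]
    first: Optional[int] = mismatched[0] if mismatched else None
    return {
        "num_mismatched": len(mismatched),
        "first_mismatch": first,
        "max_abs_diff": max((abs(x - y) for x, y in zip(xs, ys)), default=0.0),
    }


def summary_stats(xs: Sequence[float]) -> dict:
    """Coarse statistics that can be compared when full values are not available."""
    return {"min": min(xs), "max": max(xs), "mean": mean(xs), "std": pstdev(xs)}


if __name__ == "__main__":
    print(compare_values([1.0, 2.0, 3.0], [1.0, 2.0005, 3.1]))
    print(summary_stats([1.0, 2.0, 3.0]))
```

Because the relative term scales with the magnitude of the reference value, a larger `--rtol` mainly loosens the check for large elements, while `--atol` dominates near zero.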

⚑ Profile-Aware Launch Tracing

  • New environment variable TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1
  • patch_profiler_schedule(): Monkey-patches torch.profiler.schedule
  • enable_launch_tracing() / disable_launch_tracing() API
  • Mutually exclusive with TRITON_TRACE_LAUNCH (validated at init)
  • Unit tests for all three scenarios: no flag, trace all, profile-aware
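
The idea behind wrapping a profiler schedule can be sketched without torch. The code below is not the TritonParse implementation: `wrap_schedule` and `check_launch_env` are hypothetical names, the `ProfilerAction` enum is a local stand-in that mirrors `torch.profiler.ProfilerAction`, and treating `RECORD_AND_SAVE` as part of the recording phase, as well as treating a variable as set only when its value is `"1"`, are assumptions.

```python
from enum import Enum
from typing import Callable, Mapping


class ProfilerAction(Enum):
    """Local stand-in for torch.profiler.ProfilerAction so the sketch runs without torch."""
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


def check_launch_env(env: Mapping[str, str]) -> None:
    """Reject configurations that enable both launch-tracing variables (assumed "1" = set)."""
    if (env.get("TRITON_TRACE_LAUNCH") == "1"
            and env.get("TRITON_TRACE_LAUNCH_WITHIN_PROFILING") == "1"):
        raise ValueError(
            "TRITON_TRACE_LAUNCH and TRITON_TRACE_LAUNCH_WITHIN_PROFILING "
            "cannot both be set"
        )


def wrap_schedule(schedule_fn: Callable[[int], ProfilerAction],
                  on_start: Callable[[], None],
                  on_stop: Callable[[], None]) -> Callable[[int], ProfilerAction]:
    """Wrap a step -> action schedule and fire callbacks at recording boundaries."""
    recording = False

    def wrapped(step: int) -> ProfilerAction:
        nonlocal recording
        action = schedule_fn(step)
        wants_record = action in (ProfilerAction.RECORD, ProfilerAction.RECORD_AND_SAVE)
        if wants_record and not recording:
            on_start()   # e.g. turn launch tracing on
        elif recording and not wants_record:
            on_stop()    # e.g. turn launch tracing off
        recording = wants_record
        return action

    return wrapped


if __name__ == "__main__":
    def toy_schedule(step: int) -> ProfilerAction:
        # Record on steps 2 and 3 only; everything else is idle.
        return ProfilerAction.RECORD if step in (2, 3) else ProfilerAction.NONE

    traced = wrap_schedule(toy_schedule,
                           lambda: print("tracing on"),
                           lambda: print("tracing off"))
    for step in range(5):
        traced(step)
```

The wrapper returns the original action unchanged, so the profiler's own behavior is unaffected; only the side effect of toggling tracing is added.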

🗜️ Compression Module

  • Magic number detection: detect_compression() for gzip/zstd/none
  • Transparent reading: open_compressed_file() context manager
  • CLP format support (#326): TRITON_TRACE_COMPRESSION="clp" for Compressed Log Processor format
  • Default change: Compression disabled by default (was gzip)
  • API functions: is_gzip_file(), is_zstd_file(), iter_lines()
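
Magic number detection means inspecting the first bytes of a file instead of trusting its extension. The sketch below shows the general technique using the published gzip (`1f 8b`) and Zstandard (`28 b5 2f fd`) frame signatures. The function names `sniff_compression` and `iter_text_lines` are hypothetical and do not describe the signatures of the TritonParse functions listed above; zstd decoding is left out because it would need a third-party package.

```python
import gzip
from typing import Iterator

GZIP_MAGIC = b"\x1f\x8b"          # RFC 1952 member header
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # Zstandard frame magic number (little-endian 0xFD2FB528)


def sniff_compression(head: bytes) -> str:
    """Classify a byte prefix as "gzip", "zstd", or "none"."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    return "none"


def iter_text_lines(data: bytes) -> Iterator[str]:
    """Yield text lines from raw bytes, decompressing gzip transparently."""
    kind = sniff_compression(data)
    if kind == "gzip":
        data = gzip.decompress(data)
    elif kind == "zstd":
        # Left out on purpose: the standard library used here has no zstd decoder.
        raise NotImplementedError("zstd decoding requires a third-party package")
    yield from data.decode("utf-8").splitlines()


if __name__ == "__main__":
    plain = b'{"event": "compilation"}\n{"event": "launch"}\n'
    for payload in (plain, gzip.compress(plain)):
        print(sniff_compression(payload), list(iter_text_lines(payload)))
```

Sniffing the header lets a reader handle traces written before and after the default changed without the caller having to know which setting produced them.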

🔧 Bisect Enhancements

  • EnvironmentManager (#329-#332):
    • Auto-clone Triton and LLVM repositories from GitHub
    • Create/verify conda environments
    • --auto-env-setup CLI flag for --llvm-only mode
    • Status checking and diagnostics
    • Unit tests for all scenarios

📦 Reproducer Enhancements

  • TensorDescriptor support (#344): Captures base, shape, strides, block_shape, padding for TMA kernels
  • preserve_autotune mode (#328): Preserve autotune configs in reproducer scripts
  • Robustness improvements: Complex kernel handling, function reference detection in call arguments (#348)
  • Verbose args print placeholder (#347): Placeholder for verbose argument printing
  • WS kernel fix (#349): Correct num_warps handling for Warp Specialization kernels
  • Better logging (#346): Improved logging when black/isort unavailable
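
As a rough illustration of what capturing a descriptor for later reconstruction could involve, the sketch below stores the five fields named above in a plain record and round-trips it through JSON. This is an assumption-heavy sketch, not the format TritonParse writes: the class name `CapturedDescriptor`, the `base_ref` field (a hypothetical key pointing at a separately captured tensor), and the `"zero"` padding default are all invented for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapturedDescriptor:
    """Hypothetical serializable record of a tensor descriptor's fields."""
    base_ref: str                  # hypothetical reference to the captured base tensor
    shape: tuple[int, ...]
    strides: tuple[int, ...]
    block_shape: tuple[int, ...]
    padding: str = "zero"          # assumed default, for illustration only

    def to_json(self) -> str:
        # Tuples become JSON arrays; from_json converts them back.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "CapturedDescriptor":
        raw = json.loads(text)
        return cls(
            base_ref=raw["base_ref"],
            shape=tuple(raw["shape"]),
            strides=tuple(raw["strides"]),
            block_shape=tuple(raw["block_shape"]),
            padding=raw["padding"],
        )


if __name__ == "__main__":
    desc = CapturedDescriptor("arg0", shape=(128, 64), strides=(64, 1), block_shape=(32, 32))
    restored = CapturedDescriptor.from_json(desc.to_json())
    print(restored == desc)
```

A reproducer script would then rebuild the real descriptor from such a record once the base tensor has been recreated.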

🌐 Website UI Improvements

  • KernelOverview page: New component for autotune analysis visualization (870 lines)
  • WebSocket ArrayBuffer handling (#340): Direct trace ArrayBuffer via iframe messaging
  • URL normalization (#324): Manifold Explorer and tritonparse URL handling
  • Click-to-highlight tip: Added in CodeComparisonView
  • Title navigation fix: TritonParse title returns to home

🏗️ Infrastructure & API

  • SASS parsing refactor: extract_sass_pc_mappings() for PC-offset-keyed source mapping (for CUTracer integration)
  • Rank-less file support (#341, #342): --rank none for parsing files without rank suffix
  • Launch without compilation (#336): Support launch events when compilation was cached
  • log_dir parameter (#337): TritonParseManager(log_dir=...) API
  • Auto-switch log file: When rank becomes available during execution
  • Error message improvements (#339): Better diagnostics and bug fixes
  • Meta copyright headers: Added to all scripts
  • Dependabot prefix: [dependabot] prefix to PR titles
  • Negative line support (#319): prettify_ndjson handles negative line numbers

Compatibility Notes

  • Default Change: Trace compression is now disabled by default. Set TRITON_TRACE_COMPRESSION="gzip" to restore v0.4.0 behavior.
  • New Feature (Beta): The diff subcommand is additive and doesn't affect existing workflows. It is in beta; APIs and output formats may change.
  • New Feature: Autotune analysis events are generated automatically; the frontend displays them when available.
  • Mutual Exclusivity: TRITON_TRACE_LAUNCH and TRITON_TRACE_LAUNCH_WITHIN_PROFILING cannot both be set.

Upgrade Guidance

  1. Use diff for kernel comparison:

    # Basic diff
    tritonparseoss diff trace.ndjson --events 0,1
    
    # With tensor value comparison
    tritonparseoss diff trace.ndjson --tensor-values --kernel matmul
  2. Enable profile-aware tracing:

    TRITON_TRACE_LAUNCH_WITHIN_PROFILING=1 python train.py
  3. Use CLP compression (if available):

    TRITON_TRACE_COMPRESSION="clp" python train.py
  4. Bisect with auto-setup:

    tritonparseoss bisect --llvm-only --auto-env-setup \
        --triton-dir ~/oss-triton \
        --good-llvm abc123 --bad-llvm def456 \
        --test-script test.py
  5. TMA kernel reproducers: Now work automatically when TensorDescriptor arguments are present.