TritonParse v0.4.2 Release 🎉

@FindHao FindHao released this 01 Apr 02:33
· 81 commits to main since this release

TritonParse Release Notes v0.4.2 (45 commits)

  • Date range: 2026-02-27 to 2026-03-30
  • Scope: Feature release - New ai module for LLM-powered analysis, whole-trace --trace diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.

Highlights

  • 🤖 New ai Module: LLM client abstraction layer with LLMClient ABC, ClaudeCodeClient for Claude Code CLI integration, MockClient for testing, and output parsers (extract_json, extract_code_block, extract_diff_patch). Foundation for AI-powered analysis features.

  • 🔬 Whole-Trace Diff (--trace mode): Compare all kernels across two trace files with a single command. Multi-strategy KernelMatcher engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. TraceDiffEngine orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.

  • 📋 FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces old BlockPingpongCategory with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) now displayed.

  • ⚡ orjson Performance + Free-Threading Fallback: New _json_compat.py compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. orjson>=3.9 and rich>=13.0 are now default dependencies.

  • 🔍 Torch Trace Kernel Attribution: New torch trace log parser extracts kernel_source_path → CompileInfo mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when pt_info is missing. Wired through parse pipeline and CLI via --torch-trace-dir.

  • ✅ JSON Schema Validation: New tritonparse/validation/ module with JSON schemas for compilation, launch, launch_diff, and ir_analysis event types. Lightweight validator checks types, required fields, enums, numeric constraints, and $ref resolution.

  • 🎛️ Kernel-Run-Level Tensor Blob Controls: New TRITONPARSE_TENSOR_SAVE_SKIP_RUNS and TRITONPARSE_TENSOR_SAVE_MAX_RUNS environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.

Changes by Area

🤖 New ai Module

A new tritonparse/ai/ module (~1,400 lines) providing LLM client abstractions:

  • LLM Client ABC (PR-1): LLMClient abstract base with chat() and chat_stream() interfaces; Message, Response, ToolCall dataclasses; MockClient for testing
  • ClaudeCodeClient (PR-2): Production client wrapping Claude Code CLI with temp file shell escaping, session resumption, model selection, retry logic, JSON/stream-JSON parsing
  • Output Parsers (PR-3): extract_json(), extract_code_block(), extract_diff_patch() fallback parsers for LLM text responses; format_messages(), truncate_context() utilities
  • Error Diagnostics: Improved error handling extracts actual error from stdout JSON "result" field instead of just stderr
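
As an illustration of what a fallback parser for LLM text responses can look like, here is a minimal sketch. It is not tritonparse's extract_json(); the name extract_json_sketch and its behavior (prefer fenced blocks, then scan the whole reply) are assumptions made for this example.

```python
import json
import re
from typing import Any, Optional

# Hypothetical helper for illustration; not the tritonparse.ai implementation.
FENCE = "`" * 3  # built at runtime so this listing contains no literal fence
_FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> Optional[Any]:
    """Return the first JSON object or array found in an LLM reply, else None."""
    # 1. Prefer the contents of fenced code blocks, in order of appearance.
    candidates = [m.group(1) for m in _FENCED_BLOCK.finditer(text)]
    # 2. Fall back to scanning the whole reply.
    candidates.append(text)

    decoder = json.JSONDecoder()
    for candidate in candidates:
        # Try to decode a JSON value starting at every "{" or "[".
        for match in re.finditer(r"[{\[]", candidate):
            try:
                value, _end = decoder.raw_decode(candidate, match.start())
                return value
            except json.JSONDecodeError:
                continue
    return None
```

A reply such as `The result is {"ok": true}, thanks` yields the embedded object, and text without any JSON yields None.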

🔬 Whole-Trace Diff (--trace mode)

A complete trace-level comparison system (~3,400 lines) with layered architecture:

  • Data Types: MatchMethod enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), KernelMatchResult, TraceDiffResult, TraceDiffSummary, TraceStats, DtypeMismatch
  • KernelMatcher (~505 lines): Three-phase group-aware matching engine:
    • Phase 0: Hash-based exact matching (highest priority, cross-name capable)
    • Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
    • Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
    • Bounded sampling (_MAX_GROUP_SAMPLES=5) for performance on large traces
  • TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
  • Output: TraceSummaryFormatter for human-readable output; extended ConsolidatedDiffWriter with add_trace_diff()
  • CLI: New --trace flag requiring exactly 2 input files
  • Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
  • Test Reorganization: Monolithic test_diff.py split into 7 focused files: test_cli.py, test_diff_engine.py, test_fixtures.py, test_kernel_matcher.py, test_tensor_value.py, test_trace_diff.py, test_trace_output.py

CLI Usage:

# Compare all kernels across two trace files
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value analysis
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3
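
The matching order described above (hash, then exact name, then source similarity, then fuzzy name) can be sketched as a cascade of strategies. This is a simplified, flat, greedy illustration under stated assumptions: the Kernel record, match_kernels, and the use of difflib for similarity are hypothetical, and the sketch omits the group-aware logic, config pairing, and bounded sampling of the real KernelMatcher. Only the two thresholds come from the notes.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Tuple

SOURCE_THRESHOLD = 0.75      # source-similarity threshold from the notes
FUZZY_NAME_THRESHOLD = 0.7   # fuzzy-name threshold from the notes


@dataclass(frozen=True)
class Kernel:  # hypothetical record, not a tritonparse type
    name: str
    hash: str
    source: str


def _ratio(a: str, b: str) -> float:
    # difflib is an assumption; the real similarity metric is not stated.
    return SequenceMatcher(None, a, b).ratio()


def match_kernels(
    left: List[Kernel], right: List[Kernel]
) -> Tuple[List[Tuple[Kernel, Kernel, str]], List[Kernel], List[Kernel]]:
    """Return (pairs, unmatched_left, unmatched_right)."""
    strategies = [
        ("HASH", lambda a, b: 1.0 if a.hash == b.hash else 0.0, 1.0),
        ("NAME", lambda a, b: 1.0 if a.name == b.name else 0.0, 1.0),
        ("SOURCE", lambda a, b: _ratio(a.source, b.source), SOURCE_THRESHOLD),
        ("FUZZY_NAME", lambda a, b: _ratio(a.name, b.name), FUZZY_NAME_THRESHOLD),
    ]
    unmatched_left, unmatched_right = list(left), list(right)
    pairs: List[Tuple[Kernel, Kernel, str]] = []
    # Each strategy only sees kernels that earlier strategies failed to pair.
    for method, score, threshold in strategies:
        still_unmatched = []
        for a in unmatched_left:
            best = max(unmatched_right, key=lambda b: score(a, b), default=None)
            if best is not None and score(a, best) >= threshold:
                pairs.append((a, best, method))
                unmatched_right.remove(best)
            else:
                still_unmatched.append(a)
        unmatched_left = still_unmatched
    return pairs, unmatched_left, unmatched_right
```

Running the strategies in priority order means a renamed kernel with an unchanged hash is paired before any name-based strategy gets a chance to mispair it.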

📋 FileCheck-Based Procedure Detection

Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:

  • FileCheck Integration: Auto-discovers FileCheck binary from Triton's bundled version, FILECHECK_PATH env var, or system PATH
  • JSON Configuration (default_procedure_checks.json): Declarative procedure definitions with pattern_checks (FileCheck patterns) and display_attributes (configurable extraction rules)
  • Attribute Extraction: Multiple sources (module_attrs, ir_content, computed) with rules (regex, count, dot_shape, tile_size_bits, pp_clusters)
  • Tile Size Display: New tile_m, tile_n, tile_k, tile_size_bits attributes
  • BlockPingpong Migration: Old BlockPingpongCategory enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
  • Website UI: Collapsible/foldable sections per procedure in IRAnalysis page
  • Streamlined Workflow: Procedure detection integrated into main tritonparse parse pipeline
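
A lookup across the three FileCheck sources listed above might look like the following sketch. The function name find_filecheck, its parameters, and the precedence order (bundled, then FILECHECK_PATH, then PATH) are assumptions for illustration; the notes do not say where Triton keeps its bundled binary, so the caller supplies that path here.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def _is_executable(path: Path) -> bool:
    return path.is_file() and os.access(path, os.X_OK)


def find_filecheck(
    bundled_candidate: Optional[Path] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return a path to a FileCheck executable, or None if none is found.

    Hypothetical helper: the precedence order is an assumption.
    """
    # 1. A FileCheck shipped with Triton (location supplied by the caller).
    if bundled_candidate is not None and _is_executable(bundled_candidate):
        return str(bundled_candidate)
    # 2. An explicit override through the FILECHECK_PATH environment variable.
    env_path = env.get("FILECHECK_PATH")
    if env_path and _is_executable(Path(env_path)):
        return env_path
    # 3. Whatever "FileCheck" resolves to on the system PATH (may be None).
    return shutil.which("FileCheck")
```

When this returns None, a caller would disable procedure detection and emit a warning, matching the behavior described under Compatibility Notes.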

⚡ orjson Performance + Free-Threading Fallback

  • _json_compat.py (new): Unified JSON compatibility layer that uses orjson when available and falls back to stdlib json
    • loads() accepts str | bytes | bytearray | memoryview
    • dumps() returns str with indent and sort_keys support
    • Non-string key coercion in fallback path (replicates orjson's OPT_NON_STR_KEYS)
  • Global Migration (#362): All 21 modules migrated from import json to from tritonparse._json_compat import loads, dumps, JSONDecodeError
  • Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
  • Default Dependencies (#366): orjson>=3.9 and rich>=13.0 added to pyproject.toml dependencies (previously zero dependencies)
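
The general shape of such a compatibility layer is sketched below. This is an illustration, not the contents of _json_compat.py: the boolean indent parameter, the _stringify_keys helper, and the use of str() for key coercion are assumptions. orjson itself formats some key types differently (for example booleans), so the fallback here is only an approximation of OPT_NON_STR_KEYS.

```python
import json as _stdlib_json
from typing import Any, Union

try:
    import orjson  # fast path when the package is installed
except ImportError:  # e.g. an interpreter build without an orjson wheel
    orjson = None

JSONInput = Union[str, bytes, bytearray, memoryview]


def loads(data: JSONInput) -> Any:
    if orjson is not None:
        return orjson.loads(data)
    if isinstance(data, memoryview):
        data = data.tobytes()  # stdlib json does not accept memoryview
    return _stdlib_json.loads(data)


def _stringify_keys(obj: Any) -> Any:
    """Recursively convert non-str dict keys to str (fallback path only)."""
    if isinstance(obj, dict):
        return {
            (k if isinstance(k, str) else str(k)): _stringify_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [_stringify_keys(v) for v in obj]
    return obj


def dumps(obj: Any, indent: bool = False, sort_keys: bool = False) -> str:
    if orjson is not None:
        option = orjson.OPT_NON_STR_KEYS
        if indent:
            option |= orjson.OPT_INDENT_2  # orjson only offers 2-space indent
        if sort_keys:
            option |= orjson.OPT_SORT_KEYS
        return orjson.dumps(obj, option=option).decode("utf-8")
    return _stdlib_json.dumps(
        _stringify_keys(obj),  # coerce first so mixed key types can be sorted
        indent=2 if indent else None,
        sort_keys=sort_keys,
        separators=(",", ": ") if indent else (",", ":"),
        ensure_ascii=False,
    )
```

Coercing keys before calling the stdlib encoder matters because sort_keys=True raises TypeError when a dict mixes int and str keys.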

🔍 Torch Trace Kernel Attribution

  • Torch Trace Parser (#353): New tritonparse/parse/torch_trace_parser.py (~212 lines) parsing inductor's glog-formatted torch trace logs to extract kernel_source_path → CompileInfo mappings from inductor_output_code events
  • Trace Processor Integration (#354): _build_kernel_attribution_map() and _apply_kernel_attribution() enrich compilation events with pt_info when missing (~126 lines)
  • CLI & Pipeline Wiring (#355): New --torch-trace-dir flag with auto-discovery of torch trace files from the same parent directory
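
The enrichment step can be pictured as a lookup keyed by kernel source path. The sketch below is illustrative only: the event fields event_type and kernel_source_path, the dict-based CompileInfo, and the function name are assumptions, and the real _apply_kernel_attribution() may differ in both shape and behavior.

```python
from typing import Any, Dict, List

# Hypothetical shapes; the actual event and CompileInfo layouts may differ.
CompileInfo = Dict[str, Any]
Event = Dict[str, Any]


def apply_kernel_attribution_sketch(
    events: List[Event], attribution: Dict[str, CompileInfo]
) -> int:
    """Fill in pt_info on compilation events that lack it; return the count."""
    patched = 0
    for event in events:
        if event.get("event_type") != "compilation":
            continue
        if event.get("pt_info"):
            continue  # never overwrite attribution that is already present
        info = attribution.get(event.get("kernel_source_path", ""))
        if info is not None:
            event["pt_info"] = dict(info)  # copy so events do not share state
            patched += 1
    return patched
```

Existing pt_info always wins in this sketch, which mirrors the notes' statement that attribution is applied only when pt_info is missing.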

✅ JSON Schema Validation

  • Schema Files (#356): Four JSON schemas for compilation, launch, launch_diff, ir_analysis event types
  • Lightweight Validator (json_validator.py, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), additionalProperties, array items, and $ref resolution
  • validate_trace_file(): Full NDJSON trace file validation with max_errors cap
  • Schema Loader: importlib.resources for PAR compatibility, lazy loading with caching
  • Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
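
To show the kind of checks a lightweight validator performs, here is a small sketch covering types, required fields, enums, minimum, nested properties, and array items. It is not json_validator.py: the validate function and its error-message format are assumptions, and $ref resolution, additionalProperties, and multi-type schemas are left out for brevity.

```python
from typing import Any, Dict, List

_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def validate(value: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("integer", "number")
        if not isinstance(value, _TYPES[expected]) or bool_as_number:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum {schema['enum']}")
    if "minimum" in schema and isinstance(value, (int, float)):
        if value < schema["minimum"]:
            errors.append(f"{path}: {value} < minimum {schema['minimum']}")
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}: missing required field '{name}'")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{index}]"))
    return errors
```

Validating an NDJSON trace then amounts to parsing each line, calling a function like this with the schema for that record's event type, and stopping once an error cap is reached.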

🎛️ Tensor Blob Save Controls

  • Skip/Max Runs Gating: New environment variables for fine-grained control:
    • TRITONPARSE_TENSOR_SAVE_SKIP_RUNS: Skip tensor blob saving for the first N kernel runs (default: 0)
    • TRITONPARSE_TENSOR_SAVE_MAX_RUNS: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
  • Python API: TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M) and init(tensor_save_skip_runs=N, tensor_save_max_runs=M)
  • Autotune-Aware: Benchmark launches during autotune are excluded from run counting
  • GPU Tests: End-to-end validation of skip/max runs gating
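
The skip/max semantics above reduce to a window over the run counter. The sketch below is one reading of those semantics, not tritonparse's code: the function name and the 0-based run_index are assumptions, and whether runs are counted per kernel or globally is not stated in these notes.

```python
def should_save_tensor_blobs(run_index: int, skip_runs: int = 0, max_runs: int = 0) -> bool:
    """Decide whether a kernel run gets a tensor blob snapshot.

    run_index is the 0-based count of earlier counted runs (benchmark launches
    during autotune are assumed to be excluded before this is called).
    """
    if run_index < skip_runs:
        return False  # still inside the skipped prefix
    if max_runs == 0:
        return True   # 0 means unlimited
    return run_index < skip_runs + max_runs


# With skip=100 and max=50, exactly runs 100..149 are saved.
saved = [i for i in range(200) if should_save_tensor_blobs(i, 100, 50)]
```

Under this reading the defaults (0 and 0) save every run, so existing workflows are unaffected.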

🔧 Reproducer Enhancements

  • CUDA Graph Capture Error (#359): Clear RuntimeError when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
  • Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when compilation hash is missing (Inductor kernels where JIT hook didn't fire)

🔧 Bisect Enhancements

  • --triton-repo Flag: Controls culprit commit URL prefix, either oai (triton-lang/triton, default) or meta (facebookexperimental/triton); state persisted and restored on resume
  • Rich as Default Dependency (#366): rich>=13.0 moved from optional to default, simplifying bisect UI code
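
The effect of --triton-repo on the reported culprit link can be pictured with a small mapping. The function name and the GitHub commit URL layout are assumptions for illustration; only the two repository slugs and the oai default come from the notes.

```python
# Repository slugs from the notes; the URL layout below is an assumption.
TRITON_REPOS = {
    "oai": "triton-lang/triton",
    "meta": "facebookexperimental/triton",
}


def culprit_commit_url(sha: str, triton_repo: str = "oai") -> str:
    """Build a commit link for the selected Triton repository (hypothetical helper)."""
    if triton_repo not in TRITON_REPOS:
        raise ValueError(f"unknown --triton-repo value: {triton_repo!r}")
    return f"https://github.com/{TRITON_REPOS[triton_repo]}/commit/{sha}"
```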

🏗️ Infrastructure & CI

  • GitHub Actions Update (#357): All actions updated to latest versions; Python test matrix changed from 3.11 to 3.13
  • MAST Compatibility: Handle both numeric and string state formats in MAST CLI JSON output
  • Internal Test Reorganization (#358): test_mast_compat.py moved to tests/fb/ for internal-only tests
  • Manifold Upload (#360): New TRITONPARSE_TRACE_MANIFOLD env var for automatic raw log upload on handler cleanup
  • Fix oss_run (#361): Fix missing procedure_checks parameter
  • Centralized Logging: diff/cli.py migrated from print() to centralized logger
  • Copyright Header: Added missing header to clp.py

🌐 Website & Dependencies

  • DOMPurify XSS Fix: npm overrides to fix DOMPurify vulnerability
  • Dependabot Bumps: flatted 3.3.3→3.4.2, rollup 4.52.5→4.59.0, minimatch security update

Compatibility Notes

  • Breaking Change (Dependencies): orjson>=3.9 and rich>=13.0 are now default dependencies. Users who installed tritonparse with zero dependencies will need to install these on upgrade. Environments where orjson cannot be installed (e.g., CPython 3.14 free-threading) will automatically fall back to stdlib json.
  • Python 3.13: CI now tests on Python 3.13 (was 3.11).
  • Procedure Detection: BlockPingpong detection behavior is preserved but the mechanism changed from hardcoded Python to FileCheck-based. Requires LLVM FileCheck binary (bundled with Triton, or set FILECHECK_PATH). Without FileCheck, procedure detection is disabled with a warning.
  • New Features: --trace diff mode, ai module, validation schemas, torch trace attribution, and tensor blob controls are all additive and don't affect existing workflows.
  • --triton-repo: Defaults to oai (triton-lang/triton), matching pre-existing behavior.

Upgrade Guidance

  1. Install new dependencies:

    pip install --upgrade tritonparse  # orjson and rich will be installed automatically
  2. Use trace-level diff:

    # Compare all kernels across two traces
    tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace
    
    # With tensor value comparison
    tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values
  3. Use torch trace attribution (for multi-process Triton JIT scenarios):

    tritonparseoss parse trace.ndjson --torch-trace-dir /path/to/torch_traces/
  4. Control tensor blob saving:

    # Skip first 100 runs, then save blobs for 50 runs
    TRITONPARSE_TENSOR_SAVE_SKIP_RUNS=100 TRITONPARSE_TENSOR_SAVE_MAX_RUNS=50 python train.py

    Or via Python API:

    with TritonParseManager(
        enable_tensor_blob_storage=True,
        tensor_save_skip_runs=100,
        tensor_save_max_runs=50,
    ) as tp:
        model(input)
  5. Validate trace files against schemas:

    from tritonparse.validation import json_validator
    result = json_validator.validate_trace_file("trace.ndjson")
    print(f"Valid: {result['valid']}, Records: {result['record_count']}")
  6. Bisect with Meta's Triton fork:

    tritonparseoss bisect --triton-repo meta --triton-dir ~/triton \
        --test-script test.py --good v2.0.0 --bad HEAD