Release TritonParse v0.4.2 Release 🎉 · meta-pytorch/tritonparse

TritonParse Release Notes v0.4.2 (45 commits)

Date range: 2026-02-27 — 2026-03-30
Scope: Feature release - New ai module for LLM-powered analysis, whole-trace --trace diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.

Highlights

🤖 New ai Module: LLM client abstraction layer with LLMClient ABC, ClaudeCodeClient for Claude Code CLI integration, MockClient for testing, and output parsers (extract_json, extract_code_block, extract_diff_patch). Foundation for AI-powered analysis features.
🔬 Whole-Trace Diff (--trace mode): Compare all kernels across two trace files with a single command. Multi-strategy KernelMatcher engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. TraceDiffEngine orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.
📋 FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces old BlockPingpongCategory with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) now displayed.
⚡ orjson Performance + Free-Threading Fallback: New _json_compat.py compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. orjson>=3.9 and rich>=13.0 are now default dependencies.
🔍 Torch Trace Kernel Attribution: New torch trace log parser extracts kernel_source_path → CompileInfo mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when pt_info is missing. Wired through parse pipeline and CLI via --torch-trace-dir.
✅ JSON Schema Validation: New tritonparse/validation/ module with JSON schemas for compilation, launch, launch_diff, and ir_analysis event types. Lightweight validator checks types, required fields, enums, numeric constraints, and $ref resolution.
🎛️ Kernel-Run-Level Tensor Blob Controls: New TRITONPARSE_TENSOR_SAVE_SKIP_RUNS and TRITONPARSE_TENSOR_SAVE_MAX_RUNS environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.

Changes by Area

🤖 New `ai` Module

A new tritonparse/ai/ module (~1,400 lines) providing LLM client abstractions:

LLM Client ABC (PR-1): LLMClient abstract base with chat() and chat_stream() interfaces; Message, Response, ToolCall dataclasses; MockClient for testing
ClaudeCodeClient (PR-2): Production client wrapping Claude Code CLI with temp file shell escaping, session resumption, model selection, retry logic, JSON/stream-JSON parsing
Output Parsers (PR-3): extract_json(), extract_code_block(), extract_diff_patch() fallback parsers for LLM text responses; format_messages(), truncate_context() utilities
Error Diagnostics: Improved error handling extracts actual error from stdout JSON "result" field instead of just stderr
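
The shape of this layer can be illustrated with a small, self-contained sketch. It is not the code in tritonparse/ai/: the class and function names are borrowed from the notes above, but every signature, field, and parsing rule below is an assumption made for illustration.

```python
import json
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass

FENCE = "`" * 3  # markdown code-fence marker, built indirectly to keep this block readable


@dataclass
class Message:
    role: str
    content: str


@dataclass
class Response:
    content: str


class LLMClient(ABC):
    """Abstract interface (assumed shape): concrete clients implement chat()."""

    @abstractmethod
    def chat(self, messages: list[Message]) -> Response: ...


class MockClient(LLMClient):
    """Returns a canned reply so callers can be tested without a real model."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def chat(self, messages: list[Message]) -> Response:
        return Response(content=self.reply)


def extract_json(text: str):
    """Best-effort JSON extraction: try a fenced block, then the outermost braces."""
    candidates = []
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start : end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # nothing parseable found
```

Keeping the client behind an abstract base means analysis code can be exercised against a mock in unit tests and only later pointed at a real backend.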

🔬 Whole-Trace Diff (`--trace` mode)

A complete trace-level comparison system (~3,400 lines) with layered architecture:

Data Types: MatchMethod enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), KernelMatchResult, TraceDiffResult, TraceDiffSummary, TraceStats, DtypeMismatch
KernelMatcher (~505 lines): Three-phase group-aware matching engine:
- Phase 0: Hash-based exact matching (highest priority, cross-name capable)
- Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
- Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
- Bounded sampling (_MAX_GROUP_SAMPLES=5) for performance on large traces
TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
Output: TraceSummaryFormatter for human-readable output; extended ConsolidatedDiffWriter with add_trace_diff()
CLI: New --trace flag requiring exactly 2 input files
Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
Test Reorganization: Monolithic test_diff.py split into 7 focused files: test_cli.py, test_diff_engine.py, test_fixtures.py, test_kernel_matcher.py, test_tensor_value.py, test_trace_diff.py, test_trace_output.py
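
The priority ordering of the matching strategies can be sketched as follows. This is a simplified illustration, not the KernelMatcher implementation: it covers only the hash, exact-name, and fuzzy-name strategies, omits source similarity, grouping, and config pairing, and assumes each kernel is a dict with hypothetical "hash" and "name" keys. Only the 0.7 fuzzy-name threshold is taken from the notes; the use of difflib for similarity is an assumption.

```python
from difflib import SequenceMatcher

FUZZY_NAME_THRESHOLD = 0.7  # fuzzy-name threshold quoted in the notes


def match_kernels(kernels_a, kernels_b):
    """Pair kernels by hash, then exact name, then fuzzy name.

    Returns (pairs, unmatched_a, unmatched_b), where each pair is
    (name_in_a, name_in_b, method).
    """
    pairs = []
    remaining_a = list(kernels_a)
    remaining_b = list(kernels_b)

    def take(method, score):
        # Greedily pair each remaining A kernel with its best-scoring B kernel.
        for a in list(remaining_a):
            best = max(remaining_b, key=lambda b: score(a, b), default=None)
            if best is not None and score(a, best) > 0:
                pairs.append((a["name"], best["name"], method))
                remaining_a.remove(a)
                remaining_b.remove(best)

    def fuzzy(a, b):
        ratio = SequenceMatcher(None, a["name"], b["name"]).ratio()
        return ratio if ratio >= FUZZY_NAME_THRESHOLD else 0.0

    # Strategies run in priority order; earlier matches are never revisited.
    take("HASH", lambda a, b: float(a["hash"] == b["hash"]))
    take("NAME", lambda a, b: float(a["name"] == b["name"]))
    take("FUZZY_NAME", fuzzy)
    return pairs, remaining_a, remaining_b
```

Running the exact strategies first means a renamed kernel with an unchanged hash is still paired correctly, and fuzzy matching only ever sees the leftovers.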

CLI Usage:

# Compare all kernels across two trace files
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value analysis
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3

📋 FileCheck-Based Procedure Detection

Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:

FileCheck Integration: Auto-discovers FileCheck binary from Triton's bundled version, FILECHECK_PATH env var, or system PATH
JSON Configuration (default_procedure_checks.json): Declarative procedure definitions with pattern_checks (FileCheck patterns) and display_attributes (configurable extraction rules)
Attribute Extraction: Multiple sources (module_attrs, ir_content, computed) with rules (regex, count, dot_shape, tile_size_bits, pp_clusters)
Tile Size Display: New tile_m, tile_n, tile_k, tile_size_bits attributes
BlockPingpong Migration: Old BlockPingpongCategory enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
Website UI: Collapsible/foldable sections per procedure in IRAnalysis page
Streamlined Workflow: Procedure detection integrated into main tritonparse parse pipeline
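
The binary discovery order described above might look roughly like the sketch below. The function name and the bundled_candidates parameter are hypothetical; the actual location of Triton's bundled FileCheck is not stated in these notes, so callers would have to supply candidate paths themselves.

```python
import os
import shutil


def find_filecheck(bundled_candidates=()):
    """Return a path to a FileCheck binary, or None if none is found.

    Lookup order mirrors the notes: bundled copy, FILECHECK_PATH, system PATH.
    """
    # 1. Caller-supplied locations of a bundled binary (paths are assumptions).
    for candidate in bundled_candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # 2. Explicit override through the FILECHECK_PATH environment variable.
    env_path = os.environ.get("FILECHECK_PATH")
    if env_path and os.path.isfile(env_path):
        return env_path
    # 3. Fall back to whatever is on the system PATH.
    return shutil.which("FileCheck")
```

A None result corresponds to the case where procedure detection would be disabled with a warning.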

⚡ orjson Performance + Free-Threading Fallback

_json_compat.py (new): Unified JSON compatibility layer — orjson when available, stdlib json fallback
- loads() accepts str | bytes | bytearray | memoryview
- dumps() returns str with indent and sort_keys support
- Non-string key coercion in fallback path (replicates orjson's OPT_NON_STR_KEYS)
Global Migration (#362): All 21 modules migrated from import json to from tritonparse._json_compat import loads, dumps, JSONDecodeError
Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
Default Dependencies (#366): orjson>=3.9 and rich>=13.0 added to pyproject.toml dependencies (previously zero dependencies)
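
A compatibility layer of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the contents of _json_compat.py: the boolean indent parameter and the key-coercion helper are guesses at one reasonable design.

```python
import json

try:
    import orjson  # fast path when a wheel is available
except ImportError:  # e.g. an interpreter build with no orjson wheel
    orjson = None

# orjson.JSONDecodeError subclasses json.JSONDecodeError, so one name covers both.
JSONDecodeError = json.JSONDecodeError


def loads(data):
    """Parse JSON from str, bytes, bytearray, or memoryview."""
    if orjson is not None:
        return orjson.loads(data)
    if isinstance(data, memoryview):
        data = data.tobytes()  # stdlib json.loads does not accept memoryview
    return json.loads(data)


def _stringify_keys(obj):
    """Recursively coerce dict keys to str so sort_keys never compares mixed types.

    Simplification: bool and None keys become "True"/"None" here, which differs
    from orjson's "true"/"null".
    """
    if isinstance(obj, dict):
        return {k if isinstance(k, str) else str(k): _stringify_keys(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_stringify_keys(v) for v in obj]
    return obj


def dumps(obj, indent=False, sort_keys=False):
    """Serialize to str (orjson itself returns bytes and only offers 2-space indent)."""
    if orjson is not None:
        option = orjson.OPT_NON_STR_KEYS
        if indent:
            option |= orjson.OPT_INDENT_2
        if sort_keys:
            option |= orjson.OPT_SORT_KEYS
        return orjson.dumps(obj, option=option).decode("utf-8")
    return json.dumps(_stringify_keys(obj), indent=2 if indent else None, sort_keys=sort_keys)
```

Because both paths return str and raise the same exception type, callers can swap import json for this module without caring which backend is active. Whitespace in the compact output may still differ between the two backends.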

🔍 Torch Trace Kernel Attribution

Torch Trace Parser (#353): New tritonparse/parse/torch_trace_parser.py (~212 lines) parsing inductor's glog-formatted torch trace logs to extract kernel_source_path → CompileInfo mappings from inductor_output_code events
Trace Processor Integration (#354): _build_kernel_attribution_map() and _apply_kernel_attribution() enrich compilation events with pt_info when missing (~126 lines)
CLI & Pipeline Wiring (#355): New --torch-trace-dir flag with auto-discovery of torch trace files from the same parent directory
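
The enrichment step amounts to a lookup keyed by kernel source path. The sketch below is illustrative only: the event layout ("event_type", "payload", "source_path") and the contents of the attribution values are hypothetical, and only the idea of filling in pt_info when it is missing comes from the notes.

```python
def apply_kernel_attribution(events, attribution):
    """Fill in missing pt_info on compilation events.

    `attribution` maps a kernel source path to compile-info fields
    (hypothetical layout). Returns the number of events enriched.
    """
    enriched = 0
    for event in events:
        if event.get("event_type") != "compilation":
            continue
        payload = event.setdefault("payload", {})
        if payload.get("pt_info"):
            continue  # never overwrite attribution that is already present
        info = attribution.get(payload.get("source_path"))
        if info is not None:
            payload["pt_info"] = dict(info)  # copy so events do not share state
            enriched += 1
    return enriched
```

Treating existing pt_info as authoritative keeps the torch trace data as a fallback rather than a second source of truth.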

✅ JSON Schema Validation

Schema Files (#356): Four JSON schemas for compilation, launch, launch_diff, ir_analysis event types
Lightweight Validator (json_validator.py, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), additionalProperties, array items, and $ref resolution
validate_trace_file(): Full NDJSON trace file validation with max_errors cap
Schema Loader: importlib.resources for PAR compatibility, lazy loading with caching
Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
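
To show what a lightweight validator of this kind involves, here is a minimal sketch covering types, required fields, enums, and minimum constraints. It is not json_validator.py: the function name, error format, and example schema are assumptions, and features such as $ref resolution and additionalProperties are left out.

```python
TYPE_MAP = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def validate(record, schema, path="$"):
    """Return a list of error strings; an empty list means the record is valid."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(record, TYPE_MAP[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(record, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(record).__name__}"]
    if "enum" in schema and record not in schema["enum"]:
        errors.append(f"{path}: {record!r} is not one of {schema['enum']}")
    if "minimum" in schema and isinstance(record, (int, float)) and record < schema["minimum"]:
        errors.append(f"{path}: {record} is below minimum {schema['minimum']}")
    if isinstance(record, dict):
        for key in schema.get("required", []):
            if key not in record:
                errors.append(f"{path}: missing required field '{key}'")
        # Recurse into declared properties that are actually present.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in record:
                errors.extend(validate(record[key], sub_schema, f"{path}.{key}"))
    return errors
```

Returning a list of path-qualified messages, rather than raising on the first problem, is what makes a max_errors style cap straightforward to add on top.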

🎛️ Tensor Blob Save Controls

Skip/Max Runs Gating: New environment variables for fine-grained control:
- TRITONPARSE_TENSOR_SAVE_SKIP_RUNS: Skip tensor blob saving for the first N kernel runs (default: 0)
- TRITONPARSE_TENSOR_SAVE_MAX_RUNS: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
Python API: TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M) and init(tensor_save_skip_runs=N, tensor_save_max_runs=M)
Autotune-Aware: Benchmark launches during autotune are excluded from run counting
GPU Tests: End-to-end validation of skip/max runs gating
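
The skip/max semantics reduce to a small window check. The helper below is a sketch with a hypothetical name and a 0-based run index, assuming that benchmark launches during autotune have already been excluded from the count.

```python
def should_save_tensor_blobs(run_index, skip_runs=0, max_runs=0):
    """Decide whether the run with this 0-based index gets a blob snapshot.

    skip_runs: number of initial runs to skip.
    max_runs: number of runs to save after skipping; 0 means unlimited.
    """
    if run_index < skip_runs:
        return False
    if max_runs == 0:
        return True
    return run_index < skip_runs + max_runs
```

With skip_runs=100 and max_runs=50, runs 100 through 149 are saved and everything before or after is skipped.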

🔧 Reproducer Enhancements

CUDA Graph Capture Error (#359): Clear RuntimeError when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when compilation hash is missing (Inductor kernels where JIT hook didn't fire)

🔧 Bisect Enhancements

--triton-repo Flag: Controls culprit commit URL prefix — oai (triton-lang/triton, default) or meta (facebookexperimental/triton); state persisted and restored on resume
Rich as Default Dependency (#366): rich>=13.0 moved from optional to default, simplifying bisect UI code

🏗️ Infrastructure & CI

GitHub Actions Update (#357): All actions updated to latest versions; Python test matrix changed from 3.11 to 3.13
MAST Compatibility: Handle both numeric and string state formats in MAST CLI JSON output
Internal Test Reorganization (#358): test_mast_compat.py moved to tests/fb/ for internal-only tests
Manifold Upload (#360): New TRITONPARSE_TRACE_MANIFOLD env var for automatic raw log upload on handler cleanup
Fix oss_run (#361): Fix missing procedure_checks parameter
Centralized Logging: diff/cli.py migrated from print() to centralized logger
Copyright Header: Added missing header to clp.py

🌐 Website & Dependencies

DOMPurify XSS Fix: npm overrides to fix DOMPurify vulnerability
Dependabot Bumps: flatted 3.3.3→3.4.2, rollup 4.52.5→4.59.0, minimatch security update

Compatibility Notes

Breaking Change (Dependencies): orjson>=3.9 and rich>=13.0 are now default dependencies. A normal pip upgrade installs them automatically; environments that relied on tritonparse having zero dependencies (for example, vendored or offline installs) must now provide both packages. Environments where orjson cannot be installed (e.g., CPython 3.14 free-threading) will automatically fall back to stdlib json.
Python 3.13: CI now tests on Python 3.13 (was 3.11).
Procedure Detection: BlockPingpong detection behavior is preserved, but the mechanism changed from hardcoded Python to FileCheck-based matching. This requires the LLVM FileCheck binary (bundled with Triton, or located via FILECHECK_PATH or the system PATH). Without FileCheck, procedure detection is disabled with a warning.
New Features: --trace diff mode, ai module, validation schemas, torch trace attribution, and tensor blob controls are all additive and don't affect existing workflows.
--triton-repo: Defaults to oai (triton-lang/triton), matching pre-existing behavior.

Upgrade Guidance

Install new dependencies:

pip install --upgrade tritonparse  # orjson and rich will be installed automatically

Use trace-level diff:

# Compare all kernels across two traces
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace

# With tensor value comparison
tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values

Use torch trace attribution (for multi-process Triton JIT scenarios):

tritonparseoss parse trace.ndjson --torch-trace-dir /path/to/torch_traces/

Control tensor blob saving:

# Skip first 100 runs, then save blobs for 50 runs
TRITONPARSE_TENSOR_SAVE_SKIP_RUNS=100 TRITONPARSE_TENSOR_SAVE_MAX_RUNS=50 python train.py

Or via Python API:

with TritonParseManager(
    enable_tensor_blob_storage=True,
    tensor_save_skip_runs=100,
    tensor_save_max_runs=50,
) as tp:
    model(input)

Validate trace files against schemas:

from tritonparse.validation import json_validator
result = json_validator.validate_trace_file("trace.ndjson")
print(f"Valid: {result['valid']}, Records: {result['record_count']}")

Bisect with Meta's Triton fork:

tritonparseoss bisect --triton-repo meta --triton-dir ~/triton \
    --test-script test.py --good v2.0.0 --bad HEAD
