TritonParse v0.4.2 Release
TritonParse Release Notes v0.4.2 (45 commits)
- Date range: 2026-02-27 to 2026-03-30
- Scope: Feature release. New `ai` module for LLM-powered analysis, whole-trace `--trace` diff mode with multi-strategy kernel matching, FileCheck-based procedure detection replacing hardcoded BlockPingpong, orjson performance optimization with free-threading fallback, torch trace kernel attribution, JSON schema validation, and kernel-run-level tensor blob save controls.
Highlights
- New `ai` Module: LLM client abstraction layer with `LLMClient` ABC, `ClaudeCodeClient` for Claude Code CLI integration, `MockClient` for testing, and output parsers (`extract_json`, `extract_code_block`, `extract_diff_patch`). Foundation for AI-powered analysis features.
- Whole-Trace Diff (`--trace` mode): Compare all kernels across two trace files with a single command. The multi-strategy `KernelMatcher` engine matches kernels by hash → name → source similarity → fuzzy name → config similarity. `TraceDiffEngine` orchestrates matching, per-pair diffing, and summary generation. Autotuning-aware: distinguishes truly absent kernels from unpaired autotuning compilations.
- FileCheck-Based Procedure Detection: Complete rewrite of IR analysis from hardcoded Python pattern matching to a JSON-driven, FileCheck-based system. Procedure definitions are declarative with configurable pattern checks and display attributes. Replaces the old `BlockPingpongCategory` with three configurable procedure configs (Small/Medium/Large). Tile size attributes (M, N, K, bits) are now displayed.
- orjson Performance + Free-Threading Fallback: New `_json_compat.py` compatibility layer uses orjson for performance and falls back to stdlib json for CPython 3.14 free-threading builds. All 21 modules migrated. `orjson>=3.9` and `rich>=13.0` are now default dependencies.
- Torch Trace Kernel Attribution: New torch trace log parser extracts `kernel_source_path → CompileInfo` mappings from inductor's output code events, enabling kernel-to-compilation-frame attribution when `pt_info` is missing. Wired through the parse pipeline and the CLI via `--torch-trace-dir`.
- JSON Schema Validation: New `tritonparse/validation/` module with JSON schemas for `compilation`, `launch`, `launch_diff`, and `ir_analysis` event types. Lightweight validator checks types, required fields, enums, numeric constraints, and `$ref` resolution.
- Kernel-Run-Level Tensor Blob Controls: New `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS` and `TRITONPARSE_TENSOR_SAVE_MAX_RUNS` environment variables (and Python API) for fine-grained control over which kernel runs get tensor blob snapshots.
Changes by Area
New `ai` Module
A new `tritonparse/ai/` module (~1,400 lines) providing LLM client abstractions:
- LLM Client ABC (PR-1): `LLMClient` abstract base with `chat()` and `chat_stream()` interfaces; `Message`, `Response`, `ToolCall` dataclasses; `MockClient` for testing
- ClaudeCodeClient (PR-2): Production client wrapping the Claude Code CLI with temp file shell escaping, session resumption, model selection, retry logic, JSON/stream-JSON parsing
- Output Parsers (PR-3): `extract_json()`, `extract_code_block()`, `extract_diff_patch()` fallback parsers for LLM text responses; `format_messages()`, `truncate_context()` utilities
- Error Diagnostics: Improved error handling extracts the actual error from the stdout JSON `"result"` field instead of relying on stderr alone
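As an illustration of what a fallback parser for LLM text responses can look like, here is a minimal sketch. It is not the `tritonparse.ai` implementation: the function name `extract_json_sketch` and its two-step strategy (fenced block first, then a raw scan of the text) are assumptions made for this example.

```python
import json
import re
from typing import Any, Optional

# Built at runtime so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block, optionally tagged "json", and captures its body.
_FENCED = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> Optional[Any]:
    """Return the first JSON value found in an LLM reply, or None (hypothetical helper)."""
    # 1) Prefer fenced blocks, since models often wrap structured output in them.
    for body in _FENCED.findall(text):
        try:
            return json.loads(body)
        except json.JSONDecodeError:
            continue
    # 2) Fall back to scanning for the first decodable object or array.
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

A parser of this shape lets callers accept both well-formatted replies and replies where the model surrounds the JSON with prose.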
Whole-Trace Diff (`--trace` mode)
A complete trace-level comparison system (~3,400 lines) with layered architecture:
- Data Types: `MatchMethod` enum (HASH/NAME/SOURCE/FUZZY_NAME/CONFIG), `KernelMatchResult`, `TraceDiffResult`, `TraceDiffSummary`, `TraceStats`, `DtypeMismatch`
- KernelMatcher (~505 lines): Three-phase group-aware matching engine:
  - Phase 0: Hash-based exact matching (highest priority, cross-name capable)
  - Phase 1: Group-level matching by exact name → source similarity (threshold 0.75) → fuzzy name (threshold 0.7)
  - Phase 2: Within-group config pairing by (num_stages, num_warps, shared memory) similarity
  - Bounded sampling (`_MAX_GROUP_SAMPLES=5`) for performance on large traces
- TraceDiffEngine (~355 lines): Orchestrator computing trace stats → kernel matching → per-pair DiffEngine → summary generation
- Output: `TraceSummaryFormatter` for human-readable output; extended `ConsolidatedDiffWriter` with `add_trace_diff()`
- CLI: New `--trace` flag requiring exactly 2 input files
- Dtype Mismatch Detection: Surfaces dtype mismatches in tensor value comparison when argument names don't overlap
- Test Reorganization: Monolithic `test_diff.py` split into 7 focused files: `test_cli.py`, `test_diff_engine.py`, `test_fixtures.py`, `test_kernel_matcher.py`, `test_tensor_value.py`, `test_trace_diff.py`, `test_trace_output.py`
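The strategy cascade behind the matcher can be sketched in a few lines. This is a simplified, hypothetical illustration, not the `KernelMatcher` code: the `Kernel` record and `match_kernel` function are invented here, `difflib.SequenceMatcher` stands in for whatever similarity metric the engine actually uses, and the group-level phases, config pairing, and bounded sampling are left out. Only the priority order and the 0.75 / 0.7 thresholds come from the notes above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel record, not tritonparse's data model."""
    name: str
    hash: str
    source: str


def _ratio(a: str, b: str) -> float:
    # Stand-in similarity metric (assumption): difflib's ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def match_kernel(
    k: Kernel, candidates: list[Kernel]
) -> tuple[Optional[Kernel], Optional[str]]:
    """Try each strategy in priority order and return (match, method)."""
    strategies = [
        ("HASH", lambda c: c.hash == k.hash),
        ("NAME", lambda c: c.name == k.name),
        ("SOURCE", lambda c: _ratio(c.source, k.source) >= 0.75),
        ("FUZZY_NAME", lambda c: _ratio(c.name, k.name) >= 0.7),
    ]
    for method, accept in strategies:
        for candidate in candidates:
            if accept(candidate):
                return candidate, method
    return None, None
```

Running the cheapest and most reliable check first means a renamed kernel with an unchanged hash is still paired before any fuzzy comparison is attempted.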
CLI Usage:

    # Compare all kernels across two trace files
    tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace
    # With tensor value analysis
    tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values --atol 1e-5 --rtol 1e-3

FileCheck-Based Procedure Detection
Complete rewrite of IR analysis (~2,200 lines) from hardcoded Python to JSON-driven FileCheck:
- FileCheck Integration: Auto-discovers the FileCheck binary from Triton's bundled version, the `FILECHECK_PATH` env var, or the system PATH
- JSON Configuration (`default_procedure_checks.json`): Declarative procedure definitions with `pattern_checks` (FileCheck patterns) and `display_attributes` (configurable extraction rules)
- Attribute Extraction: Multiple sources (`module_attrs`, `ir_content`, `computed`) with rules (`regex`, `count`, `dot_shape`, `tile_size_bits`, `pp_clusters`)
- Tile Size Display: New `tile_m`, `tile_n`, `tile_k`, `tile_size_bits` attributes
- BlockPingpong Migration: Old `BlockPingpongCategory` enum and ~254 lines of hardcoded Python replaced by three JSON-configured procedures (Small/Medium/Large)
- Website UI: Collapsible/foldable sections per procedure in the IRAnalysis page
- Streamlined Workflow: Procedure detection integrated into the main tritonparse parse pipeline
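The binary discovery described in the first bullet could look roughly like the sketch below. The function name `find_filecheck`, the caller-supplied `bundled_candidates` parameter, and the lookup order are assumptions for illustration; the notes do not state where Triton bundles the binary or which source takes priority.

```python
import os
import shutil
from typing import Iterable, Optional


def _is_executable(path: str) -> bool:
    return os.path.isfile(path) and os.access(path, os.X_OK)


def find_filecheck(bundled_candidates: Iterable[str] = ()) -> Optional[str]:
    """Return a path to an executable FileCheck binary, or None (hypothetical helper)."""
    # 1) Locations where a Triton install might bundle FileCheck.
    #    These paths are supplied by the caller; none are hardcoded here.
    for path in bundled_candidates:
        if _is_executable(path):
            return path
    # 2) Explicit override through the FILECHECK_PATH environment variable.
    env_path = os.environ.get("FILECHECK_PATH")
    if env_path and _is_executable(env_path):
        return env_path
    # 3) Last resort: search the system PATH.
    return shutil.which("FileCheck")
```

When every lookup fails the function returns `None`, which matches the documented behavior of disabling procedure detection with a warning rather than raising.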
orjson Performance + Free-Threading Fallback
- `_json_compat.py` (new): Unified JSON compatibility layer that uses orjson when available and falls back to stdlib json
  - `loads()` accepts `str | bytes | bytearray | memoryview`
  - `dumps()` returns `str` with `indent` and `sort_keys` support
  - Non-string key coercion in fallback path (replicates orjson's `OPT_NON_STR_KEYS`)
- Global Migration (#362): All 21 modules migrated from `import json` to `from tritonparse._json_compat import loads, dumps, JSONDecodeError`
- Free-Threading Support (#365): Automatic stdlib json fallback for CPython 3.14 free-threading builds where orjson is unavailable
- Default Dependencies (#366): `orjson>=3.9` and `rich>=13.0` added to `pyproject.toml` dependencies (previously zero dependencies)
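A compatibility layer of this kind can be sketched as follows. This is an assumption-laden illustration rather than the contents of `_json_compat.py`: it shows the import fallback and the `loads`/`dumps` surface listed above, routes unsupported indent widths to stdlib json (orjson only offers 2-space indentation), and omits the non-string key coercion in the fallback path.

```python
import json as _stdlib_json

try:  # orjson is optional in this sketch
    import orjson as _orjson
except ImportError:  # e.g. a build for which no orjson wheel is available
    _orjson = None

# orjson.JSONDecodeError subclasses json.JSONDecodeError, so one name covers both.
JSONDecodeError = _stdlib_json.JSONDecodeError


def loads(data):
    """Parse JSON from str, bytes, bytearray, or memoryview."""
    if _orjson is not None:
        return _orjson.loads(data)
    if isinstance(data, memoryview):
        data = data.tobytes()  # stdlib json does not accept memoryview
    return _stdlib_json.loads(data)


def dumps(obj, indent=None, sort_keys=False) -> str:
    """Serialize to str, using orjson only for layouts it supports."""
    if _orjson is not None and indent in (None, 2):
        option = _orjson.OPT_NON_STR_KEYS
        if indent == 2:
            option |= _orjson.OPT_INDENT_2
        if sort_keys:
            option |= _orjson.OPT_SORT_KEYS
        return _orjson.dumps(obj, option=option).decode("utf-8")
    return _stdlib_json.dumps(obj, indent=indent, sort_keys=sort_keys)
```

Note that the two backends differ in whitespace for compact output, so callers should compare parsed values rather than serialized strings.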
Torch Trace Kernel Attribution
- Torch Trace Parser (#353): New `tritonparse/parse/torch_trace_parser.py` (~212 lines) parsing inductor's glog-formatted torch trace logs to extract `kernel_source_path → CompileInfo` mappings from `inductor_output_code` events
- Trace Processor Integration (#354): `_build_kernel_attribution_map()` and `_apply_kernel_attribution()` enrich compilation events with `pt_info` when missing (~126 lines)
- CLI & Pipeline Wiring (#355): New `--torch-trace-dir` flag with auto-discovery of torch trace files from the same parent directory
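The enrichment step can be pictured with the sketch below. It is hypothetical throughout: the event layout (`event_type` and `kernel_source_path` as top-level keys), the plain-dict stand-in for `CompileInfo`, and the function name are assumptions, and the real `_apply_kernel_attribution()` may differ in all of them.

```python
from typing import Any


def apply_kernel_attribution_sketch(
    events: list[dict[str, Any]],
    attribution: dict[str, dict[str, Any]],
) -> int:
    """Fill in pt_info on compilation events that lack it; return how many were enriched."""
    enriched = 0
    for event in events:
        if event.get("event_type") != "compilation":
            continue  # only compilation events carry attribution
        if event.get("pt_info"):
            continue  # already attributed through the normal path
        info = attribution.get(event.get("kernel_source_path", ""))
        if info is not None:
            event["pt_info"] = dict(info)  # copy so events do not share one dict
            enriched += 1
    return enriched
```

Existing `pt_info` values are never overwritten, so the torch trace acts purely as a fallback source.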
JSON Schema Validation
- Schema Files (#356): Four JSON schemas for `compilation`, `launch`, `launch_diff`, `ir_analysis` event types
- Lightweight Validator (`json_validator.py`, ~287 lines): Validates required fields, types, enums, numeric constraints (min/max/exclusive), `additionalProperties`, array items, and `$ref` resolution
- `validate_trace_file()`: Full NDJSON trace file validation with `max_errors` cap
- Schema Loader: `importlib.resources` for PAR compatibility, lazy loading with caching
- Test Suite: Comprehensive tests (~652 lines) covering all validation scenarios
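To make the scope of a lightweight validator concrete, here is a minimal sketch covering a subset of the checks listed above (types, enums, `minimum`, required fields, nested properties). It is not `json_validator.py`: the `validate` function and the example schema used with it are invented for illustration, and `$ref`, `additionalProperties`, and array items are left out.

```python
from typing import Any

# JSON Schema type names mapped to Python types.
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of error strings; an empty list means the value conforms."""
    errors: list[str] = []
    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum")
    if "minimum" in schema and isinstance(value, (int, float)) and value < schema["minimum"]:
        errors.append(f"{path}: below minimum {schema['minimum']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field {key!r}")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    return errors
```

Collecting errors in a list instead of raising on the first failure is what makes a cap such as `max_errors` straightforward to apply at the file level.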
Tensor Blob Save Controls
- Skip/Max Runs Gating: New environment variables for fine-grained control:
  - `TRITONPARSE_TENSOR_SAVE_SKIP_RUNS`: Skip tensor blob saving for the first N kernel runs (default: 0)
  - `TRITONPARSE_TENSOR_SAVE_MAX_RUNS`: Save tensor blobs for at most N kernel runs after skipping (default: 0 = unlimited)
- Python API: `TritonParseManager(tensor_save_skip_runs=N, tensor_save_max_runs=M)` and `init(tensor_save_skip_runs=N, tensor_save_max_runs=M)`
- Autotune-Aware: Benchmark launches during autotune are excluded from run counting
- GPU Tests: End-to-end validation of skip/max runs gating
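The interaction of the two settings is easiest to see as a predicate over the run counter. The function below is a hypothetical sketch of the documented semantics (skip the first N runs, then save at most M, with 0 meaning unlimited); it assumes a 0-based `run_index` that already excludes autotune benchmark launches, and it is not taken from the tritonparse source.

```python
def should_save_tensor_blob(run_index: int, skip_runs: int = 0, max_runs: int = 0) -> bool:
    """Decide whether the kernel run at run_index (0-based) gets a blob snapshot."""
    if run_index < skip_runs:
        return False  # still inside the skipped prefix
    if max_runs == 0:
        return True  # 0 means no upper limit
    return run_index < skip_runs + max_runs
```

With `skip_runs=100` and `max_runs=50`, runs 100 through 149 are saved and everything before or after is skipped.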
Reproducer Enhancements
- CUDA Graph Capture Error (#359): Clear `RuntimeError` when reproducing kernels launched during CUDA graph capture, explaining that argument extraction was skipped
- Kernel Name Fallback: Reproducer/info now falls back to matching by kernel name when the compilation hash is missing (Inductor kernels where the JIT hook didn't fire)
Bisect Enhancements
- `--triton-repo` Flag: Controls the culprit commit URL prefix, either `oai` (triton-lang/triton, default) or `meta` (facebookexperimental/triton); the state is persisted and restored on resume
- Rich as Default Dependency (#366): `rich>=13.0` moved from optional to default, simplifying bisect UI code
Infrastructure & CI
- GitHub Actions Update (#357): All actions updated to latest versions; Python test matrix changed from 3.11 to 3.13
- MAST Compatibility: Handle both numeric and string state formats in MAST CLI JSON output
- Internal Test Reorganization (#358): `test_mast_compat.py` moved to `tests/fb/` for internal-only tests
- Manifold Upload (#360): New `TRITONPARSE_TRACE_MANIFOLD` env var for automatic raw log upload on handler cleanup
- Fix `oss_run` (#361): Fix missing `procedure_checks` parameter
- Centralized Logging: `diff/cli.py` migrated from `print()` to centralized logger
- Copyright Header: Added missing header to `clp.py`
Website & Dependencies
- DOMPurify XSS Fix: npm overrides to fix DOMPurify vulnerability
- Dependabot Bumps: flatted 3.3.3 → 3.4.2, rollup 4.52.5 → 4.59.0, minimatch security update
Compatibility Notes
- Breaking Change (Dependencies): `orjson>=3.9` and `rich>=13.0` are now default dependencies. Users who installed tritonparse with zero dependencies will need to install these on upgrade. Environments where orjson cannot be installed (e.g., CPython 3.14 free-threading) will automatically fall back to stdlib json.
- Python 3.13: CI now tests on Python 3.13 (was 3.11).
- Procedure Detection: BlockPingpong detection behavior is preserved, but the mechanism changed from hardcoded Python to FileCheck-based. Requires the LLVM FileCheck binary (bundled with Triton, or set `FILECHECK_PATH`). Without FileCheck, procedure detection is disabled with a warning.
- New Features: `--trace` diff mode, `ai` module, validation schemas, torch trace attribution, and tensor blob controls are all additive and don't affect existing workflows.
- `--triton-repo`: Defaults to `oai` (triton-lang/triton), matching pre-existing behavior.
Upgrade Guidance
- Install new dependencies:

      pip install --upgrade tritonparse  # orjson and rich will be installed automatically

- Use trace-level diff:

      # Compare all kernels across two traces
      tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace
      # With tensor value comparison
      tritonparseoss diff trace_a.ndjson trace_b.ndjson --trace --tensor-values

- Use torch trace attribution (for multi-process Triton JIT scenarios):

      tritonparseoss parse trace.ndjson --torch-trace-dir /path/to/torch_traces/

- Control tensor blob saving:

      # Skip first 100 runs, then save blobs for 50 runs
      TRITONPARSE_TENSOR_SAVE_SKIP_RUNS=100 TRITONPARSE_TENSOR_SAVE_MAX_RUNS=50 python train.py

  Or via Python API:

      with TritonParseManager(
          enable_tensor_blob_storage=True,
          tensor_save_skip_runs=100,
          tensor_save_max_runs=50,
      ) as tp:
          model(input)

- Validate trace files against schemas:

      from tritonparse.validation import json_validator

      result = json_validator.validate_trace_file("trace.ndjson")
      print(f"Valid: {result['valid']}, Records: {result['record_count']}")

- Bisect with Meta's Triton fork:

      tritonparseoss bisect --triton-repo meta --triton-dir ~/triton \
          --test-script test.py --good v2.0.0 --bad HEAD