Introduce KernelAgent <> BackendBench Integration #41
base: main
Conversation
- Fix ruff linting errors in eval.py (remove unused import, fix f-strings)
- Fix ruff linting errors in setup.py (remove unnecessary f-string prefixes)
- Add Python version marker (>=3.10) to the backendbench dependency in pyproject.toml
  - This allows core KernelAgent to support Python 3.8+ while the BackendBench integration requires 3.10+
- Update CI workflow to use venv activation instead of 'uv run' to avoid dependency resolution issues
# Show message if using default directory
if base_dir == "generated_kernels":
    print(f"ℹ️ Using default directory: {abs_base_dir}")
    print(" (Specify --base-dir to use a different location)\n")
nit: For the sake of keeping it minimal

    print(" (Specify --base-dir to use a different location)\n")
Jack-Khuu left a comment
Looks like a nice start, but I'm not sure how this goes towards testing the main offerings of KernelAgent if we're using different prompts
@@ -0,0 +1,130 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Can we just make the setup a simple pass thru?
E.g. scripts/setup_backendbench.sh:

    python -m BackendBench.scripts.setup_operator_directories "$@"
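If a Python entry point is preferred over a shell wrapper, the same pass-through idea could look like the sketch below; this is only an illustration of the suggestion above, not code from the PR.

# Hypothetical Python pass-through (same idea as the shell one-liner above):
# forward all CLI arguments to BackendBench's own setup script.
import runpy

if __name__ == "__main__":
    # alter_sys=True makes the target module see sys.argv as if run via "python -m".
    runpy.run_module(
        "BackendBench.scripts.setup_operator_directories",
        run_name="__main__",
        alter_sys=True,
    )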
@@ -0,0 +1,547 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Naming nit since it's one file: benchmark/backend_bench.py after moving setup (see comment)
| "python-dotenv", | ||
| "gradio", | ||
| "requests", | ||
|
|
Lel
logger.error("--evaluate-only requires --kernels-dir")
return 1

if args.generate_only and args.evaluate_only:
kernels_dir = args.kernels_dir

# Phase 1: Generate kernels
if not args.evaluate_only:
    kernels_dir = generate_kernels(
        suite_name=args.suite,
        num_operators=args.num_operators,
        num_workers=args.num_workers,
        max_rounds=args.max_rounds,
        workflow=args.workflows,
        verbose=args.verbose,
    )

if args.generate_only:
    logger.info(f"Generation complete. Kernels saved to: {kernels_dir}")
    return 0

# Phase 2: Evaluate kernels
if not args.generate_only:
    exit_code = evaluate_kernels(
        kernels_dir=kernels_dir,
        suite_name=args.suite,
        verbose=args.verbose,
    )

    if exit_code == 0:
        logger.info("Evaluation complete!")
        logger.info(f"Results saved to: {kernels_dir}")
    else:
        logger.error(f"Evaluation failed with exit code {exit_code}")

    return exit_code

return 0
Minor suggestion: remove the unnecessary conditionals.
Suggested change (replacing the block above):

kernels_dir = (
    args.kernels_dir
    if args.evaluate_only
    else generate_kernels(
        suite_name=args.suite,
        num_operators=args.num_operators,
        num_workers=args.num_workers,
        max_rounds=args.max_rounds,
        workflow=args.workflows,
        verbose=args.verbose,
    )
)
if args.generate_only:
    logger.info(f"Generation complete. Kernels saved to: {kernels_dir}")
    return 0

# Phase 2: Evaluate kernels
exit_code = evaluate_kernels(
    kernels_dir=kernels_dir,
    suite_name=args.suite,
    verbose=args.verbose,
)
if exit_code == 0:
    logger.info("Evaluation complete!")
    logger.info(f"Results saved to: {kernels_dir}")
else:
    logger.error(f"Evaluation failed with exit code {exit_code}")
return exit_code
    return result.returncode


def _create_problem_description_from_op(op, op_name: str) -> str:
Custom?
raise ValueError(f"Unknown suite: {suite_name}")

# Get operators to generate
operators = list(test_suite)
Might make sense for this to be a static artifact we refresh if backendbench isn't making huge updates
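A rough sketch of that "static artifact" idea, in case it helps: cache the suite's operator list in a checked-in file and only rebuild it on demand. The cache path and the assumption that operators stringify usefully are mine, not part of this PR.

# Hedged sketch of caching the operator list as a static artifact.
import json
from pathlib import Path

OPS_CACHE = Path("benchmark/BackendBench/torchbench_operators.json")  # hypothetical location

def load_operator_names(test_suite, refresh: bool = False) -> list[str]:
    """Return operator names from the cache, rebuilding from the suite when asked."""
    if refresh or not OPS_CACHE.exists():
        names = [str(op) for op in test_suite]  # assumes ops have a usable string form
        OPS_CACHE.write_text(json.dumps(names, indent=2))
        return names
    return json.loads(OPS_CACHE.read_text())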
cmd = [
    sys.executable,
    "-m",
    "BackendBench.scripts.main",
    "--backend",
    "directory",
    "--suite",
    suite_name,
    "--ops-directory",
    kernels_dir,
    "--log-dir",
    log_dir,  # Save evaluation results to separate log directory
]
@jiannanWang Is there an API we can call instead of a CLI?
No, currently there isn’t an API for the main entry point. You can check our expected usage of BackendBench here: https://github.com/meta-pytorch/BackendBench/tree/main?tab=readme-ov-file#llm-kernel-development-workflow.
Actually, if you think having an API would be helpful, please open an issue in BackendBench and I can implement it.
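For reference, the CLI invocation this PR relies on can be condensed to something like the sketch below (flags copied from the cmd list shown earlier); a future BackendBench Python API would replace this shell-out.

# Sketch of the subprocess-based invocation used in this PR.
import subprocess
import sys

def run_backendbench_cli(suite_name: str, kernels_dir: str, log_dir: str) -> int:
    cmd = [
        sys.executable, "-m", "BackendBench.scripts.main",
        "--backend", "directory",
        "--suite", suite_name,
        "--ops-directory", kernels_dir,
        "--log-dir", log_dir,
    ]
    result = subprocess.run(cmd)  # output streams to the console
    return result.returncode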
    return problem_description


def _create_test_code_from_backendbench(op, op_name: str, test_cases, logger) -> str:
Another thing we can probably make static
# Import the serialization utility
from BackendBench.utils import serialize_args

test_code = f'''import torch
Source? Or custom?
Summary
This PR introduces the integration between KernelAgent and BackendBench, enabling systematic evaluation of KernelAgent-generated Triton kernels.
Implementation Details
The integration follows a two-phase workflow:
Phase 1: Kernel Generation
- KernelAgent generates kernels and saves them to generated_kernels/<operator_name>/

Phase 2: Evaluation
- DirectoryBackend loads the generated kernels

Key Components
- benchmark/BackendBench/eval.py - Main evaluation script with:
  - generate_kernels() - Generate kernels using KernelAgent
  - evaluate_kernels() - Evaluate generated kernels with BackendBench
- benchmark/BackendBench/setup.py - Directory structure generator: op_map
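For orientation, here is a minimal sketch of driving both phases from Python instead of the CLI. The keyword arguments mirror the signatures shown in the diff above; the concrete values are placeholders, and importing eval.py directly assumes you run from benchmark/BackendBench.

# Minimal sketch, assuming the generate_kernels()/evaluate_kernels()
# signatures shown in the diff above; argument values are placeholders.
from eval import generate_kernels, evaluate_kernels  # benchmark/BackendBench/eval.py

kernels_dir = generate_kernels(
    suite_name="torchbench",
    num_operators=3,     # placeholder
    num_workers=4,       # placeholder
    max_rounds=10,       # placeholder
    workflow="default",  # placeholder: actual workflow names come from KernelAgent
    verbose=True,
)
exit_code = evaluate_kernels(
    kernels_dir=kernels_dir,
    suite_name="torchbench",
    verbose=True,
)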
Testing Instructions
1. Install BackendBench dependency
pip install -e ".[backendbench]"

2. Generate operator directory structure
cd benchmark/BackendBench
python setup.py --base-dir generated_kernels

3. Run evaluation
Option A: Generate and evaluate in one step
python eval.py --suite torchbench --num-operators 3

Option B: Generate only
python eval.py --suite torchbench --num-operators 3 --generate-only

Option C: Evaluate existing kernels
python eval.py --suite torchbench --evaluate-only --kernels-dir generated_kernels

4. View results
Results are saved to timestamped directories:
- agent_logs/run_<timestamp>/ - KernelAgent generation logs
- generated_kernels/ - operator directories storing the KernelAgent-generated kernels
- log_BackendBench/run_<timestamp>/ - Evaluation results:
  - OVERALL_SUMMARY.md - Human-readable summary
  - operator_summary.csv - Per-operator metrics
  - full_results.json - Complete test results
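If you want to poke at a run programmatically, something like the sketch below works; the file names come from the listing above, but the run directory is just an example and the CSV/JSON schemas are defined by BackendBench, not shown in this PR.

# Hedged sketch: load one run's result files. The run directory is an example;
# field/column names depend on BackendBench's output format.
import csv
import json
from pathlib import Path

run_dir = Path("log_BackendBench/run_20251107_001753")  # example run from the Eval Result below

results = json.loads((run_dir / "full_results.json").read_text())
print(f"full_results.json entries: {len(results)}")

with open(run_dir / "operator_summary.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one record per operator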
5. TODO

Eval Result
TritonKernelAgent - INFO - Starting kernel generation
TritonKernelAgent - INFO - Problem:
Implement a high-performance Triton kernel for the PyTorch operation: _adaptive_avg_pool2d.default
...
TritonKernelAgent - INFO - Using provided test code
INFO - - log_BackendBench/run_20251107_001753/OVERALL_SUMMARY.md
INFO - - log_BackendBench/run_20251107_001753/operator_summary.csv
INFO - - log_BackendBench/run_20251107_001753/full_results.json
DirectoryBackend loaded 1 kernels from generated_kernels/
correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): 0.14