Introduce KernelAgent <> BackendBench Integration #41
base: main
Conversation
- Fix ruff linting errors in eval.py (remove unused import, fix f-strings)
- Fix ruff linting errors in setup.py (remove unnecessary f-string prefixes)
- Add Python version marker (>=3.10) to the backendbench dependency in pyproject.toml
  - This allows core KernelAgent to support Python 3.8+ while the BackendBench integration requires 3.10+
- Update CI workflow to use venv activation instead of 'uv run' to avoid dependency resolution issues
# Show message if using default directory
if base_dir == "generated_kernels":
    print(f"ℹ️ Using default directory: {abs_base_dir}")
    print(" (Specify --base-dir to use a different location)\n")
nit: For the sake of keeping it minimal

    print(" (Specify --base-dir to use a different location)\n")
Jack-Khuu left a comment
Looks like a nice start, but I'm not sure how this goes towards testing the main offerings of KernelAgent if we're using different prompts
@@ -0,0 +1,130 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Can we just make the setup a simple pass thru?
E.g. scripts/setup_backendbench.sh:

    python -m BackendBench.scripts.setup_operator_directories "$@"
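If a Python entry point is preferred over a shell wrapper, the same pass-through idea could look like the sketch below; this is only an illustration of the suggestion above, not code from the PR.

# Hypothetical Python pass-through (same idea as the shell one-liner above):
# forward all CLI arguments to BackendBench's own setup script.
import runpy

if __name__ == "__main__":
    # alter_sys=True makes the target module see sys.argv as if run via "python -m".
    runpy.run_module(
        "BackendBench.scripts.setup_operator_directories",
        run_name="__main__",
        alter_sys=True,
    )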
@@ -0,0 +1,547 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
Naming nit since it's one file: benchmark/backend_bench.py after moving setup (see comment)
| "python-dotenv", | ||
| "gradio", | ||
| "requests", | ||
|
|
Lel
logger.error("--evaluate-only requires --kernels-dir")
return 1

if args.generate_only and args.evaluate_only:
kernels_dir = args.kernels_dir

# Phase 1: Generate kernels
if not args.evaluate_only:
    kernels_dir = generate_kernels(
        suite_name=args.suite,
        num_operators=args.num_operators,
        num_workers=args.num_workers,
        max_rounds=args.max_rounds,
        workflow=args.workflows,
        verbose=args.verbose,
    )

if args.generate_only:
    logger.info(f"Generation complete. Kernels saved to: {kernels_dir}")
    return 0

# Phase 2: Evaluate kernels
if not args.generate_only:
    exit_code = evaluate_kernels(
        kernels_dir=kernels_dir,
        suite_name=args.suite,
        verbose=args.verbose,
    )

    if exit_code == 0:
        logger.info("Evaluation complete!")
        logger.info(f"Results saved to: {kernels_dir}")
    else:
        logger.error(f"Evaluation failed with exit code {exit_code}")

    return exit_code

return 0
Minor suggestion: remove the unnecessary conditionals.
Suggested change (replacing the block above):

kernels_dir = (
    args.kernels_dir
    if args.evaluate_only
    else generate_kernels(
        suite_name=args.suite,
        num_operators=args.num_operators,
        num_workers=args.num_workers,
        max_rounds=args.max_rounds,
        workflow=args.workflows,
        verbose=args.verbose,
    )
)
if args.generate_only:
    logger.info(f"Generation complete. Kernels saved to: {kernels_dir}")
    return 0

# Phase 2: Evaluate kernels
exit_code = evaluate_kernels(
    kernels_dir=kernels_dir,
    suite_name=args.suite,
    verbose=args.verbose,
)
if exit_code == 0:
    logger.info("Evaluation complete!")
    logger.info(f"Results saved to: {kernels_dir}")
else:
    logger.error(f"Evaluation failed with exit code {exit_code}")
return exit_code
    return result.returncode


def _create_problem_description_from_op(op, op_name: str) -> str:
Custom?
raise ValueError(f"Unknown suite: {suite_name}")

# Get operators to generate
operators = list(test_suite)
Might make sense for this to be a static artifact we refresh if backendbench isn't making huge updates
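A rough sketch of that "static artifact" idea, in case it helps: cache the suite's operator list in a checked-in file and only rebuild it on demand. The cache path and the assumption that operators stringify usefully are mine, not part of this PR.

# Hedged sketch of caching the operator list as a static artifact.
import json
from pathlib import Path

OPS_CACHE = Path("benchmark/BackendBench/torchbench_operators.json")  # hypothetical location

def load_operator_names(test_suite, refresh: bool = False) -> list[str]:
    """Return operator names from the cache, rebuilding from the suite when asked."""
    if refresh or not OPS_CACHE.exists():
        names = [str(op) for op in test_suite]  # assumes ops have a usable string form
        OPS_CACHE.write_text(json.dumps(names, indent=2))
        return names
    return json.loads(OPS_CACHE.read_text())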
cmd = [
    sys.executable,
    "-m",
    "BackendBench.scripts.main",
    "--backend",
    "directory",
    "--suite",
    suite_name,
    "--ops-directory",
    kernels_dir,
    "--log-dir",
    log_dir,  # Save evaluation results to separate log directory
]
@jiannanWang Is there an API we can call instead of a CLI?
No, currently there isn’t an API for the main entry point. You can check our expected usage of BackendBench here: https://github.com/meta-pytorch/BackendBench/tree/main?tab=readme-ov-file#llm-kernel-development-workflow.
Actually, if you think having an API would be helpful, please open an issue in BackendBench and I can implement it.
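For reference, the CLI invocation this PR relies on can be condensed to something like the sketch below (flags copied from the cmd list shown earlier); a future BackendBench Python API would replace this shell-out.

# Sketch of the subprocess-based invocation used in this PR.
import subprocess
import sys

def run_backendbench_cli(suite_name: str, kernels_dir: str, log_dir: str) -> int:
    cmd = [
        sys.executable, "-m", "BackendBench.scripts.main",
        "--backend", "directory",
        "--suite", suite_name,
        "--ops-directory", kernels_dir,
        "--log-dir", log_dir,
    ]
    result = subprocess.run(cmd)  # output streams to the console
    return result.returncode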
    return problem_description


def _create_test_code_from_backendbench(op, op_name: str, test_cases, logger) -> str:
Another thing we can probably make static
# Import the serialization utility
from BackendBench.utils import serialize_args

test_code = f'''import torch
Source? Or custom?
Summary
This PR introduces the integration between KernelAgent and BackendBench, enabling systematic evaluation of KernelAgent-generated Triton kernels.
Implementation Details
The integration follows a two-phase workflow:
Phase 1: Kernel Generation
- KernelAgent generates kernels and saves them to generated_kernels/<operator_name>/

Phase 2: Evaluation
- DirectoryBackend loads the generated kernels

Key Components
- benchmark/BackendBench/eval.py - Main evaluation script with:
  - generate_kernels() - Generate kernels using KernelAgent
  - evaluate_kernels() - Evaluate generated kernels with BackendBench
- benchmark/BackendBench/setup.py - Directory structure generator: op_map
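For orientation, here is a minimal sketch of driving both phases from Python instead of the CLI. The keyword arguments mirror the signatures shown in the diff above; the concrete values are placeholders, and importing eval.py directly assumes you run from benchmark/BackendBench.

# Minimal sketch, assuming the generate_kernels()/evaluate_kernels()
# signatures shown in the diff above; argument values are placeholders.
from eval import generate_kernels, evaluate_kernels  # benchmark/BackendBench/eval.py

kernels_dir = generate_kernels(
    suite_name="torchbench",
    num_operators=3,     # placeholder
    num_workers=4,       # placeholder
    max_rounds=10,       # placeholder
    workflow="default",  # placeholder: actual workflow names come from KernelAgent
    verbose=True,
)
exit_code = evaluate_kernels(
    kernels_dir=kernels_dir,
    suite_name="torchbench",
    verbose=True,
)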
Testing Instructions
1. Install BackendBench dependency
pip install -e ".[backendbench]"

2. Generate operator directory structure
cd benchmark/BackendBench
python setup.py --base-dir generated_kernels

3. Run evaluation
Option A: Generate and evaluate in one step
python eval.py --suite torchbench --num-operators 3

Option B: Generate only
python eval.py --suite torchbench --num-operators 3 --generate-only

Option C: Evaluate existing kernels
python eval.py --suite torchbench --evaluate-only --kernels-dir generated_kernels

4. View results
Results are saved to timestamped directories:
- agent_logs/run_<timestamp>/ - KernelAgent generation logs
- generated_kernels/ - operator directories storing the KernelAgent-generated kernels
- log_BackendBench/run_<timestamp>/ - Evaluation results:
  - OVERALL_SUMMARY.md - Human-readable summary
  - operator_summary.csv - Per-operator metrics
  - full_results.json - Complete test results
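If you want to poke at a run programmatically, something like the sketch below works; the file names come from the listing above, but the run directory is just an example and the CSV/JSON schemas are defined by BackendBench, not shown in this PR.

# Hedged sketch: load one run's result files. The run directory is an example;
# field/column names depend on BackendBench's output format.
import csv
import json
from pathlib import Path

run_dir = Path("log_BackendBench/run_20251107_001753")  # example run from the Eval Result below

results = json.loads((run_dir / "full_results.json").read_text())
print(f"full_results.json entries: {len(results)}")

with open(run_dir / "operator_summary.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row)  # one record per operator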
5. TODO

Eval Result
TritonKernelAgent - INFO - Starting kernel generation
TritonKernelAgent - INFO - Problem:
Implement a high-performance Triton kernel for the PyTorch operation: _adaptive_avg_pool2d.default
...
TritonKernelAgent - INFO - Using provided test code
INFO - - log_BackendBench/run_20251107_001753/OVERALL_SUMMARY.md
INFO - - log_BackendBench/run_20251107_001753/operator_summary.csv
INFO - - log_BackendBench/run_20251107_001753/full_results.json
DirectoryBackend loaded 1 kernels from generated_kernels/
correctness score (mean pass rate over all operators): 1.00
performance score (geomean speedup over all operators): 0.14