Add xpu-kernels skill - Intel XPU Triton kernel development #547
Adds a new skill under kernel-builder/skills/xpu-kernels/, alongside the existing cuda-kernels and rocm-kernels skills, bringing Intel XPU support to kernel-builder. Target hardware is Intel Battlemage / Arc Pro B70 (Xe2) via the Intel XPU Backend for Triton (https://github.com/intel/intel-xpu-backend-for-triton). The skill packages the Xe-Forge (https://github.com/IntelLabs/Xe-Forge) workflow — an LLM-driven loop that transforms PyTorch code into optimized Triton kernels for Intel XPU — into the hf-kernels skill format. Xe-Forge has been used to produce measured speedups on KernelBench Level 2 fused kernels (bf16) and Flash Attention forward (fp16); full results live in that repo.
Hi @danielfleischer, thanks for your interest in contributing! This project requires that pull request authors are vouched, and you are not in the list of vouched users. This PR will be closed automatically. See https://github.com/huggingface/kernels/blob/main/CONTRIBUTING.md for more details.
* fix: remove existing test repo before upload (huggingface#519)
* fix: remove existing test repo before upload
* fix: add missing content type
* fix: prefer removing repos via hub library
* fix: use lib from nix shell on runner
* fix: disallow more than one instance of E2E running at once to avoid race conditions
* fix: prefer using ci token
* fix: update e2e to use trust_remote_code for the dummy user
* fix: prefer using latest kernels-data in test
* fix: update nix warns to throws (huggingface#540)
* feat: bump cute dsl/cutlass (huggingface#545)
* feat: add to vouched (huggingface#551)
* hook up skill in the cli and add docs.

---------

Co-authored-by: drbh <david.richard.holtz@gmail.com>
Co-authored-by: Copilot <copilot@github.com>
…#550)

* Update version bumping scripts with the `--major` option

  With this change the script supports both major and minor version bumping. For example:

  Codebase at `0.10.1.dev0`:

  ```
  (none)        -> 0.10.1
  --major       -> 0.11.0
  --dev         -> 0.10.1.dev1
  --dev --major -> 0.11.0.dev0
  ```

  Codebase at `0.10.1`:

  ```
  (none)  -> 0.10.2
  --major -> 0.11.0
  --dev   -> 0.10.2.dev0
  ```

  These are the typical version bumping workflows within the project.

* Sync .PHONY targets
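The bump rules in the commit message above can be sketched as one small function. This is only an illustration of the listed transitions, not the repo's actual script: the name `bump_version`, its keyword arguments, and the regex-based parsing are all hypothetical, and the result for `--dev --major` on a release version is an assumption, since the commit message does not list that case.

```python
import re


def bump_version(version: str, *, dev: bool = False, major: bool = False) -> str:
    """Hypothetical sketch of the bump rules listed in the commit message."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.dev(\d+))?", version)
    if m is None:
        raise ValueError(f"unsupported version string: {version!r}")
    x, y, z = int(m[1]), int(m[2]), int(m[3])
    dev_n = int(m[4]) if m[4] is not None else None

    if major:
        # `--major` jumps to X.(Y+1).0, optionally as a fresh dev version.
        # (`--dev --major` on a release is not listed in the commit message;
        # returning X.(Y+1).0.dev0 there is an assumption.)
        base = f"{x}.{y + 1}.0"
        return f"{base}.dev0" if dev else base

    if dev_n is not None:
        # On a dev version: `--dev` increments the dev counter,
        # no flag strips the suffix ahead of a release.
        return f"{x}.{y}.{z}.dev{dev_n + 1}" if dev else f"{x}.{y}.{z}"

    # On a release version: move to the next patch, optionally as dev0.
    base = f"{x}.{y}.{z + 1}"
    return f"{base}.dev0" if dev else base
```

For instance, `bump_version("0.10.1.dev0", dev=True, major=True)` follows the `--dev --major` row of the first table.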
#!/usr/bin/env python3
"""Bump all version strings in the repo.

Without ``--dev``: strip the development suffix ahead of a release.
This seems like an unrelated change?
from kernels import get_kernel, get_local_kernel

if is_local:
    kernel = get_local_kernel(Path(repo_id), "activation")
  .split('/')
  .next_back()
- .is_some_and(|n| n.starts_with("benchmark"))
+ .is_some_and(|n| n.starts_with("benchmark") && n.ends_with(".py"))
@@ -1,4 +1,4 @@
-.PHONY: style kernel-builder-cli-docs quality bump-dev bump-dev-dry-run bump-release bump-release-dry-run pin-actions
+.PHONY: style kernel-builder-cli-docs quality bump-dev bump-dev-dry-run bump-dev-major bump-dev-major-dry-run bump-release bump-release-dry-run bump-major bump-major-dry-run pin-actions

Should I not have cherry-picked main?

If we merge the upstream

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@danieldk could you review the changes in
Looks good! 👍 |
What's included
plus HF kernels / transformers integration examples.
Next steps / guidance welcome
The skill content has been developed and validated against Xe-Forge directly. Integration into this repo's Nix-based build is the remaining piece, and I'd appreciate pointers from maintainers on:
Happy to iterate on any of the above.