From 42f58aec6eec6cd0e8cf8c881c94d8360a8d756d Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Mon, 25 May 2026 17:45:08 +0800 Subject: [PATCH 1/6] Add winml-modelkit Claude Code skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds .claude/skills/winml-modelkit/SKILL.md — a Claude Code skill that teaches an agent to drive the `winml` CLI for the Build-Your-Own-Model (BYOM) pipeline across NPU/GPU/CPU execution providers. The body covers install (uv venv + uv pip install winml-cli), discovering the CLI via --help, the inspect -> export -> analyze -> optimize -> quantize -> compile -> perf pipeline, the config + build shortcut, hardware/EP mapping (QNN/OpenVINO/VitisAI/NvTensorRTRTX/DML/CPU), in/out scope (classic DL vs generative), and the published-output principle that each winml command exposes outputs via -o or stdout while caches and internal byproducts are not supported outputs. --- .claude/skills/winml-modelkit/SKILL.md | 190 +++++++++++++++++++++++++ 1 file changed, 190 insertions(+) create mode 100644 .claude/skills/winml-modelkit/SKILL.md diff --git a/.claude/skills/winml-modelkit/SKILL.md b/.claude/skills/winml-modelkit/SKILL.md new file mode 100644 index 000000000..27dc55a47 --- /dev/null +++ b/.claude/skills/winml-modelkit/SKILL.md @@ -0,0 +1,190 @@ +--- +name: winml-modelkit +description: Build, optimize, quantize, compile, and benchmark ONNX models for Windows ML using the `winml` CLI. Covers the Build-Your-Own-Model (BYOM) pipeline across NPU (Qualcomm QNN, Intel OpenVINO, AMD VitisAI), GPU, and CPU execution providers. Use this skill whenever the user wants to run a Hugging Face or ONNX model on a Windows AI PC, target an NPU, prepare a model for on-device inference, benchmark latency on Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI, or troubleshoot operator/EP compatibility — even when they don't say "ModelKit" or "winml" by name. 
If a user mentions running models on Windows hardware, NPU acceleration, or low-latency on-device inference, this skill applies. **Skip for generative models** — LLMs (GPT, LLaMA, Phi, Mistral), Stable Diffusion, Whisper, or any decoder-only / seq2seq architecture are out of scope (planned for late 2026). +--- + +# WinML ModelKit + +ModelKit ships a CLI called `winml` that turns a source model — a Hugging Face ID or a local ONNX file — into a portable, performant artifact that runs on any Windows execution provider. This skill teaches you the *shape* of that workflow. The CLI is the source of truth for current commands and flags. + +## Installing the CLI + +**Default behavior: lead any walkthrough with a brief install section.** Unless the user signals they already have `winml` working, include the install steps below (or a clear "prereq: install winml first" pointer to them) as the first thing in your response. First-timers shouldn't have to guess what they need. + +**Skip the install section only if the user clearly signals they're past install:** +- They quote a `winml <command>` they ran, with output or an error from it. +- They say they "already" / "previously" exported, built, optimized, etc. with winml. +- They share an artifact path that came out of an earlier winml run. +- They're asking a debugging or comparison question that presumes a working install. + +When in doubt, include it — a five-line prereq block is cheaper than a stuck user. + +ModelKit pins **Python 3.10 exactly** (`>=3.10,<3.11`) — use `uv` to create an isolated venv so you don't pollute system Python or land on a 3.11+ environment that won't resolve. + +**1. Create a Python 3.10 environment** + +```bash +uv venv --python 3.10 +``` + +Activate it: + +```bash +# Windows (PowerShell) +.venv\Scripts\activate + +# Windows (Git Bash / WSL) +source .venv/Scripts/activate +``` + +**2. Install `winml-cli` from PyPI** + +```bash +uv pip install winml-cli +``` + +**3. 
Verify** + +```bash +winml --help +winml sys --list-ep +``` + +`--help` should print the command list, and `sys --list-ep` should show the execution providers registered on this machine. + +## Discover the CLI before doing anything else + +The command set and flags evolve. Don't memorize them and don't guess them — read them from the tool itself: + +- **`winml --help`** — current top-level command list with one-line descriptions. +- **`winml <command> --help`** — current flags, arguments, and defaults for that command. +- **`winml sys --list-device --list-ep`** — what hardware and execution providers are actually present on this machine. + +Run these *before quoting any command to the user*, not after. "I'll check `--help` if anything looks off" is too late — the user has already copy-pasted a broken command and come back annoyed. If you're about to write `winml <command>` in your reply, run `winml <command> --help` first. + +**The CLI is flag-based, not positional.** Model IDs and paths go through `-m` / `--model`, not as bare positional arguments. `winml inspect microsoft/resnet-50` will error — you need `winml inspect -m microsoft/resnet-50`. This shape is stable across the toolkit; the specific flag spelling per command isn't, which is why you still read `--help`. + +Inventing plausible-sounding flags (a `--preset`, `--profile`, `--mode=fast`) is the most common way to waste the user's time — the command will reject them and the user has to come back. When in doubt, `--help`. + +## The mental model + +ModelKit organizes work as a pipeline. Each stage is its own primitive command, and the output of one stage feeds the next: + +``` +inspect → export → analyze → optimize → quantize → compile → perf +``` + +You don't have to run every stage. Enter wherever the user's input lives (already have an ONNX file? skip `export`) and exit when you have what you need (just want a latency number? stop at `perf`). 
Several stages are EP- or hardware-sensitive — `compile` is documented as requiring an NPU device (per README's Scope & Limitations: "winml compile requires an NPU device"). `winml compile --help` does expose `--device` and `--ep` values for CPU/GPU, but treat NPU as the assumed target unless the user says otherwise. + +Sitting on top of the primitives are two **shortcut commands** that wrap the whole pipeline: + +- A **config** command auto-detects every setting the pipeline needs and writes a JSON file. +- A **build** command reads that config and runs the stages in order. + +Together they replace the seven primitives with two. + +The names above (`inspect`, `export`, `analyze`, `optimize`, `quantize`, `compile`, `perf`, plus the config/build pair) are stable concepts — they map to subcommands of `winml`. Confirm exact spelling and current flags via `winml --help` before you write any command. + +## Outputs are explicit; cache is opaque + +Every `winml` command has a **published output** — what `-o` writes or what it prints to stdout — and that is the only thing you can hand to a user or downstream command. What a command does *internally* (cached intermediate builds, temp graphs, EP context blobs in a private folder) is not a supported output. Don't try to fish artifacts out of cache, don't assume one command's internal byproducts can feed another. + +This is the easiest way to pick the right command for a goal: **map the goal to a published output, not to a side effect.** + +- `inspect`, `sys`, `analyze`, `hub` print to stdout (no `-o` artifact). +- `export`, `optimize`, `quantize`, `compile`, `build` write transformed model artifacts to `-o`. +- `perf` writes a **metrics JSON** to `-o` — not a model. +- `config` writes the **config JSON** to `-o` — the input to `build`. + +When you find yourself thinking "command X probably builds something internally that I can grab" — stop. You've picked the wrong entry point. 
Use the command whose published output is the thing you actually want. + +One concrete failure this guards against: `perf` builds an artifact internally to benchmark it, but only the metrics JSON is published; the build lives in cache and is opaque. If a user wants both a deployable model *and* a latency number, `perf` alone won't get them the model, and chaining `perf → build → perf` just pays the build cost twice — the first `perf` produced nothing they could keep. The right shape is to enter at the command whose `-o` is the artifact (`build`), then run `perf` against that artifact for the number. + +The same logic applies wherever a goal doesn't match a command's published output: `inspect`'s JSON isn't a build input (use `config`); `analyze`'s linter output isn't a graph rewrite (use a different optim/quant config); and so on. + +## The golden rule: inspect first + +Before any other command, run the inspect subcommand on the user's model. Per `winml inspect --help`, it reads the model configuration *without downloading weights* and shows the loader, exporter, WinML inference class, I/O specs, and the build resolution the pipeline will use. Pass `-f json` for machine-readable output. + +Inspect tells you whether the toolkit knows how to handle the architecture. But **always cross-check against the scope section below** — a model that inspect accepts can still be out of scope. The scope rule overrides anything inspect prints; for example, an LLM may have a usable loader/exporter via TasksManager defaults but is still not a fit. + +Skipping inspect and jumping to export or build is the most common cause of confusing failures three stages in, because the cost of finding out a model is unsupported climbs at every later stage. + +## Choosing a path + +Once inspect passes, pick one of two paths based on what the user is trying to do. Default to **config + build** unless the user explicitly wants to fiddle with a single stage. 
+ +**Primitive commands — one stage at a time.** Right when the user is exploring, debugging a specific stage, or tweaking settings between runs. They get fine-grained control at the cost of running more commands. + +**Config + build — two commands for the whole thing.** Right when the user wants a clean, reproducible, end-to-end build for production, CI, or sharing with a teammate. The generated config is the single source of truth — they edit it to override defaults, version-control it, and replay deterministically. + +If the user is unsure, default to config + build unless they say "I want to try different settings" or "something failed and I need to debug a specific stage." + +## Mapping "I want to run X" to a command + +"I want to run resnet" / "can I try this model" / "let me use this on my NPU" is the most common ambiguous prompt. Apply the published-output rule from the previous section: figure out what the user actually wants, then pick the command whose `-o` is that thing. + +| User actually wants | Command whose `-o` is that thing | +|---|---| +| A latency / throughput number | `winml perf` (writes metrics JSON) | +| A deployable `.onnx` they can ship or load from code | `winml config` then `winml build -o ` (build writes the artifact) | +| Both | `winml build -o ` first, then `winml perf` against the built artifact | + +If the user's intent isn't clear from their prompt, ask one short question — "do you want a usable artifact, or just a latency number?" — before quoting commands. + +## Hardware and execution providers + +The right execution provider depends on the user's machine. 
Status as of 2026-05-20: + +| Hardware | Execution provider | Status | +|---|---|---| +| Qualcomm NPU (Snapdragon X Elite) | QNN | 🟢 Ready | +| Intel NPU (Meteor Lake / Lunar Lake / Core Ultra) | OpenVINO | 🟢 Ready | +| AMD NPU (Ryzen AI: Phoenix / Hawk Point / Strix) | VitisAI | 🟢 Ready | +| NVIDIA discrete GPU | NvTensorRTRTX | 🟢 Ready | +| Hardware-agnostic GPU | DirectML (Dml) | 🟢 Ready | +| AMD discrete GPU | MIGraphX | 🔶 Planned | +| CPU | CPU EP | ⚪ Always available | + +If the user has hardware whose EP is **Planned** (currently only MIGraphX for AMD discrete GPUs), say so — recommend CPU or DML as the working fallback rather than pretending the planned EP is ready. The README's Supported Hardware table may lag behind this status; trust `winml sys --list-ep` on the user's machine for what's actually registered. + +For the **current flag spelling, supported status, and device-selection options** (including any auto-pick mode), consult `winml --help` and `winml sys`. Don't hardcode flag values from this skill into your suggestions — read them live. + +If you don't know what hardware the user has, ask, or run `winml sys` and read the output. + +## Common patterns + +**"Just benchmark this model on my hardware."** A single perf invocation against the source model is enough — `winml perf` builds artifacts on the fly (see `--rebuild`, `--ignore-cache`, `--no-quantize` in `winml perf --help`). You don't have to chain primitives manually. For live NPU utilization during the run, look for the `--monitor` flag in `winml perf --help`. + +**"What's the latency on NPU vs CPU?"** Build once, then run perf twice — once against the EP-compiled artifact on the NPU, once against the optimized (pre-compile) artifact on CPU. Compiled artifacts are tied to the EP they were compiled for, so run the CPU comparison against the pre-compile optimized ONNX, not the compiled NPU artifact. + +**"Will this model work with my hardware?"** Inspect, then analyze. 
The analyzer's linter classifies every operator as supported / partial / unsupported per EP — that's the cheapest way to find out a build will succeed before paying the full export cost. + +**"My optimize/quantize step just blew up."** Most operator-pattern failures land at these stages even when export succeeded. Re-run analyze against the exported ONNX; the linter will usually name the offending op pattern. Don't hand-edit the ONNX graph — try a different optim or quantization configuration to dodge the unsupported pattern, or escalate to "this model isn't a fit for this EP." + +## Scope — what's in and what's out + +**In scope.** Classic deep learning models — CNNs, encoders, vision transformers, NLP classifiers, NER, object detection, segmentation. Concretely: ResNet, ViT, Swin, ConvNeXT, BERT, RoBERTa, Table Transformer, SegFormer families. If the user passes one of these, the pipeline is designed to handle it. + +**Out of scope.** Generative and decoder-only architectures: GPT, LLaMA, Phi, Mistral, Stable Diffusion, any seq2seq generator. If a user asks ModelKit to handle one of these, **stop and say so** — the pipeline will fail mid-way and the error won't always make the cause obvious. LLM support (with LoRA) is on the public roadmap for late 2026; don't pretend it works today. + +If you're genuinely unsure whether a model is in scope, the inspect command is the source of truth. Trust its verdict over your guess. + +## Things that catch people out + +- **Confirm the target EP is registered before compiling.** Run `winml sys --list-ep` first; if your `--ep ` isn't in the list, compile won't produce a usable artifact for that EP. Compile also runs validation by default (see `--validate / --no-validate` in `winml compile --help`). +- **Compile defaults to external EP-context storage.** Per `winml compile --help`, the default writes EP context to a `.bin` file co-located with the output `.onnx`; pass `--embed` to inline it instead. 
If you move a non-embedded artifact, move the `.bin` alongside. +- **CLI flags override the config file, not the other way around.** Every primitive that accepts `-c, --config` says so in its `--help`: "Provides defaults; explicit CLI options take precedence." For repeatable builds, edit the JSON; for one-off overrides, pass the flag at build time. +- **Output paths are explicit on the pipeline-building commands.** `export`, `optimize`, `quantize`, `compile`, `perf`, `config`, and `build` each take an `-o` / `--output` (or `--output-dir`). There's no implicit "current directory" convention — tell the user where files will land. `inspect`, `sys`, and `hub` print to stdout and don't require an output path. +- **EP-compiled models are tied to their target EP.** Don't try to perf a QNN-compiled artifact against the CPU EP — the result is at best meaningless. For cross-EP comparison, use the pre-compile optimized ONNX. +- **Don't fabricate flags.** If a flag isn't in `winml --help`, it doesn't exist. Find a real one or change approach. + +## When things go sideways + +Read the error before suggesting a next step. ModelKit error messages are usually specific (op name, EP, stage). When you don't know what to do: + +1. `winml --help` to confirm you used real flags. +2. `winml sys --list-ep` to confirm the EP is actually registered on this machine. +3. `winml inspect` and `winml analyze` to confirm the model is supported and the operator pattern is buildable. + +The CLI is self-documenting; lean on it before guessing. From 68535769edff20054298d4b64cc31ee75df00925 Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Mon, 25 May 2026 17:49:25 +0800 Subject: [PATCH 2/6] Bump pinned Python from 3.10 to 3.11 ModelKit now requires Python 3.11 exactly (>=3.11,<3.12). Updated the install section's venv command and the version pin language. 
--- .claude/skills/winml-modelkit/SKILL.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/.claude/skills/winml-modelkit/SKILL.md b/.claude/skills/winml-modelkit/SKILL.md index 27dc55a47..a3f5cef97 100644 --- a/.claude/skills/winml-modelkit/SKILL.md +++ b/.claude/skills/winml-modelkit/SKILL.md @@ -19,12 +19,12 @@ ModelKit ships a CLI called `winml` that turns a source model — a Hugging Face When in doubt, include it — a five-line prereq block is cheaper than a stuck user. -ModelKit pins **Python 3.10 exactly** (`>=3.10,<3.11`) — use `uv` to create an isolated venv so you don't pollute system Python or land on a 3.11+ environment that won't resolve. +ModelKit pins **Python 3.11 exactly** (`>=3.11,<3.12`) — use `uv` to create an isolated venv so you don't pollute system Python or land on a 3.12+ environment that won't resolve. -**1. Create a Python 3.10 environment** +**1. Create a Python 3.11 environment** ```bash -uv venv --python 3.10 +uv venv --python 3.11 ``` Activate it: From 3afdc1f838b82228a1ac2258e9f31cf2d1019ede Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Mon, 25 May 2026 18:09:13 +0800 Subject: [PATCH 3/6] Rename skill: winml-modelkit -> winml-cli Moves the skill to .claude/skills/winml-cli/ to match the published package name (winml-cli on PyPI). Updates SKILL.md frontmatter `name:` to `winml-cli`, the H1 title to `winml CLI`, and replaces all "ModelKit"-branded prose in the body with the CLI's actual name. 
--- .../{winml-modelkit => winml-cli}/SKILL.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) rename .claude/skills/{winml-modelkit => winml-cli}/SKILL.md (90%) diff --git a/.claude/skills/winml-modelkit/SKILL.md b/.claude/skills/winml-cli/SKILL.md similarity index 90% rename from .claude/skills/winml-modelkit/SKILL.md rename to .claude/skills/winml-cli/SKILL.md index a3f5cef97..661b84222 100644 --- a/.claude/skills/winml-modelkit/SKILL.md +++ b/.claude/skills/winml-cli/SKILL.md @@ -1,11 +1,11 @@ --- -name: winml-modelkit -description: Build, optimize, quantize, compile, and benchmark ONNX models for Windows ML using the `winml` CLI. Covers the Build-Your-Own-Model (BYOM) pipeline across NPU (Qualcomm QNN, Intel OpenVINO, AMD VitisAI), GPU, and CPU execution providers. Use this skill whenever the user wants to run a Hugging Face or ONNX model on a Windows AI PC, target an NPU, prepare a model for on-device inference, benchmark latency on Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI, or troubleshoot operator/EP compatibility — even when they don't say "ModelKit" or "winml" by name. If a user mentions running models on Windows hardware, NPU acceleration, or low-latency on-device inference, this skill applies. **Skip for generative models** — LLMs (GPT, LLaMA, Phi, Mistral), Stable Diffusion, Whisper, or any decoder-only / seq2seq architecture are out of scope (planned for late 2026). +name: winml-cli +description: Build, optimize, quantize, compile, and benchmark ONNX models for Windows ML using the `winml` CLI. Covers the Build-Your-Own-Model (BYOM) pipeline across NPU (Qualcomm QNN, Intel OpenVINO, AMD VitisAI), GPU, and CPU execution providers. 
Use this skill whenever the user wants to run a Hugging Face or ONNX model on a Windows AI PC, target an NPU, prepare a model for on-device inference, benchmark latency on Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI, or troubleshoot operator/EP compatibility — even when they don't say "winml" by name. If a user mentions running models on Windows hardware, NPU acceleration, or low-latency on-device inference, this skill applies. **Skip for generative models** — LLMs (GPT, LLaMA, Phi, Mistral), Stable Diffusion, Whisper, or any decoder-only / seq2seq architecture are out of scope (planned for late 2026). --- -# WinML ModelKit +# winml CLI -ModelKit ships a CLI called `winml` that turns a source model — a Hugging Face ID or a local ONNX file — into a portable, performant artifact that runs on any Windows execution provider. This skill teaches you the *shape* of that workflow. The CLI is the source of truth for current commands and flags. +`winml` is a CLI that turns a source model — a Hugging Face ID or a local ONNX file — into a portable, performant artifact that runs on any Windows execution provider. This skill teaches you the *shape* of that workflow. The CLI is the source of truth for current commands and flags. ## Installing the CLI @@ -19,7 +19,7 @@ ModelKit ships a CLI called `winml` that turns a source model — a Hugging Face When in doubt, include it — a five-line prereq block is cheaper than a stuck user. -ModelKit pins **Python 3.11 exactly** (`>=3.11,<3.12`) — use `uv` to create an isolated venv so you don't pollute system Python or land on a 3.12+ environment that won't resolve. +`winml-cli` pins **Python 3.11 exactly** (`>=3.11,<3.12`) — use `uv` to create an isolated venv so you don't pollute system Python or land on a 3.12+ environment that won't resolve. **1. 
Create a Python 3.11 environment** @@ -68,7 +68,7 @@ Inventing plausible-sounding flags (a `--preset`, `--profile`, `--mode=fast`) is ## The mental model -ModelKit organizes work as a pipeline. Each stage is its own primitive command, and the output of one stage feeds the next: +`winml` organizes work as a pipeline. Each stage is its own primitive command, and the output of one stage feeds the next: ``` inspect → export → analyze → optimize → quantize → compile → perf @@ -166,7 +166,7 @@ If you don't know what hardware the user has, ask, or run `winml sys` and read t **In scope.** Classic deep learning models — CNNs, encoders, vision transformers, NLP classifiers, NER, object detection, segmentation. Concretely: ResNet, ViT, Swin, ConvNeXT, BERT, RoBERTa, Table Transformer, SegFormer families. If the user passes one of these, the pipeline is designed to handle it. -**Out of scope.** Generative and decoder-only architectures: GPT, LLaMA, Phi, Mistral, Stable Diffusion, any seq2seq generator. If a user asks ModelKit to handle one of these, **stop and say so** — the pipeline will fail mid-way and the error won't always make the cause obvious. LLM support (with LoRA) is on the public roadmap for late 2026; don't pretend it works today. +**Out of scope.** Generative and decoder-only architectures: GPT, LLaMA, Phi, Mistral, Stable Diffusion, any seq2seq generator. If a user asks `winml` to handle one of these, **stop and say so** — the pipeline will fail mid-way and the error won't always make the cause obvious. LLM support (with LoRA) is on the public roadmap for late 2026; don't pretend it works today. If you're genuinely unsure whether a model is in scope, the inspect command is the source of truth. Trust its verdict over your guess. @@ -181,7 +181,7 @@ If you're genuinely unsure whether a model is in scope, the inspect command is t ## When things go sideways -Read the error before suggesting a next step. 
ModelKit error messages are usually specific (op name, EP, stage). When you don't know what to do: +Read the error before suggesting a next step. `winml` error messages are usually specific (op name, EP, stage). When you don't know what to do: 1. `winml --help` to confirm you used real flags. 2. `winml sys --list-ep` to confirm the EP is actually registered on this machine. From 4d32a74438e320d8b3f6f2bf7b450f3582bf9b88 Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Mon, 25 May 2026 19:33:49 +0800 Subject: [PATCH 4/6] Address review: fix hub -> catalog, drop stale EP status table MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit @timenick: - `hub` is not a real winml subcommand. The actual command that prints catalog/hub info to stdout is `catalog`. Fixed both references (one in the published-output principle list, one in the gotchas section). The "don't fabricate flags" rule applies to subcommand names too — trains agents to invent. - The Hardware/EP status table was already 5 days stale on PR day. Replaced with prose that points at the README's Supported Hardware table as the canonical source and emphasizes `winml sys --list-ep` as ground truth on each machine. Matches the skill's own "read it live" philosophy and removes a drift surface that would have silently rotted between releases. --- .claude/skills/winml-cli/SKILL.md | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/.claude/skills/winml-cli/SKILL.md b/.claude/skills/winml-cli/SKILL.md index 661b84222..99cd5fb86 100644 --- a/.claude/skills/winml-cli/SKILL.md +++ b/.claude/skills/winml-cli/SKILL.md @@ -91,7 +91,7 @@ Every `winml` command has a **published output** — what `-o` writes or what it This is the easiest way to pick the right command for a goal: **map the goal to a published output, not to a side effect.** -- `inspect`, `sys`, `analyze`, `hub` print to stdout (no `-o` artifact). 
+- `inspect`, `sys`, `analyze`, `catalog` print to stdout (no `-o` artifact). - `export`, `optimize`, `quantize`, `compile`, `build` write transformed model artifacts to `-o`. - `perf` writes a **metrics JSON** to `-o` — not a model. - `config` writes the **config JSON** to `-o` — the input to `build`. @@ -134,21 +134,11 @@ If the user's intent isn't clear from their prompt, ask one short question — " ## Hardware and execution providers -The right execution provider depends on the user's machine. Status as of 2026-05-20: +The right execution provider depends on the user's hardware. See the **README's Supported Hardware** table for the current hardware-to-EP matrix and Ready/Planned status — that's the canonical source, and it changes between releases. Don't reproduce the matrix here; it goes stale. -| Hardware | Execution provider | Status | -|---|---|---| -| Qualcomm NPU (Snapdragon X Elite) | QNN | 🟢 Ready | -| Intel NPU (Meteor Lake / Lunar Lake / Core Ultra) | OpenVINO | 🟢 Ready | -| AMD NPU (Ryzen AI: Phoenix / Hawk Point / Strix) | VitisAI | 🟢 Ready | -| NVIDIA discrete GPU | NvTensorRTRTX | 🟢 Ready | -| Hardware-agnostic GPU | DirectML (Dml) | 🟢 Ready | -| AMD discrete GPU | MIGraphX | 🔶 Planned | -| CPU | CPU EP | ⚪ Always available | +Ground truth for what's actually working on this machine is always `winml sys --list-ep`. Run it; trust its output over anything else. If the user's hardware needs an EP that isn't in the listed output (whether the README marks it Ready or Planned), recommend CPU or DML as the working fallback rather than pretending the missing EP works. -If the user has hardware whose EP is **Planned** (currently only MIGraphX for AMD discrete GPUs), say so — recommend CPU or DML as the working fallback rather than pretending the planned EP is ready. The README's Supported Hardware table may lag behind this status; trust `winml sys --list-ep` on the user's machine for what's actually registered. 
- -For the **current flag spelling, supported status, and device-selection options** (including any auto-pick mode), consult `winml --help` and `winml sys`. Don't hardcode flag values from this skill into your suggestions — read them live. +For exact flag spellings and device-selection options, consult `winml --help` and `winml sys`. Don't hardcode flag values from this skill — read them live. If you don't know what hardware the user has, ask, or run `winml sys` and read the output. @@ -175,7 +165,7 @@ If you're genuinely unsure whether a model is in scope, the inspect command is t - **Confirm the target EP is registered before compiling.** Run `winml sys --list-ep` first; if your `--ep ` isn't in the list, compile won't produce a usable artifact for that EP. Compile also runs validation by default (see `--validate / --no-validate` in `winml compile --help`). - **Compile defaults to external EP-context storage.** Per `winml compile --help`, the default writes EP context to a `.bin` file co-located with the output `.onnx`; pass `--embed` to inline it instead. If you move a non-embedded artifact, move the `.bin` alongside. - **CLI flags override the config file, not the other way around.** Every primitive that accepts `-c, --config` says so in its `--help`: "Provides defaults; explicit CLI options take precedence." For repeatable builds, edit the JSON; for one-off overrides, pass the flag at build time. -- **Output paths are explicit on the pipeline-building commands.** `export`, `optimize`, `quantize`, `compile`, `perf`, `config`, and `build` each take an `-o` / `--output` (or `--output-dir`). There's no implicit "current directory" convention — tell the user where files will land. `inspect`, `sys`, and `hub` print to stdout and don't require an output path. +- **Output paths are explicit on the pipeline-building commands.** `export`, `optimize`, `quantize`, `compile`, `perf`, `config`, and `build` each take an `-o` / `--output` (or `--output-dir`). 
There's no implicit "current directory" convention — tell the user where files will land. `inspect`, `sys`, `analyze`, and `catalog` print to stdout and don't require an output path. - **EP-compiled models are tied to their target EP.** Don't try to perf a QNN-compiled artifact against the CPU EP — the result is at best meaningless. For cross-EP comparison, use the pre-compile optimized ONNX. - **Don't fabricate flags.** If a flag isn't in `winml --help`, it doesn't exist. Find a real one or change approach. From 52266ac14ad12b29ac0361da3a1ea45ec8d7b52a Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Tue, 26 May 2026 10:50:05 +0800 Subject: [PATCH 5/6] Move skill out of .claude/ into top-level skills/ Addresses review: SKILL.md is a public artifact that ships with the winml-cli package, not project-private Claude Code config. Hosting it under .claude/ implied this was dev-only / hidden tooling. Moving to skills/winml-cli/SKILL.md makes its scope match its role. --- {.claude/skills => skills}/winml-cli/SKILL.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {.claude/skills => skills}/winml-cli/SKILL.md (100%) diff --git a/.claude/skills/winml-cli/SKILL.md b/skills/winml-cli/SKILL.md similarity index 100% rename from .claude/skills/winml-cli/SKILL.md rename to skills/winml-cli/SKILL.md From 76acb1d4f8b1cf17e28f3276049e51b029944deb Mon Sep 17 00:00:00 2001 From: "Shiyi Zheng (from Dev Box)" Date: Tue, 26 May 2026 12:05:54 +0800 Subject: [PATCH 6/6] Rename skill: winml-cli -> use-winml-cli The `winml-cli` slug collided with the published package name. Skills are imperative agent instructions ("use the winml CLI"), so prefixing the skill slug with `use-` disambiguates and makes the agent-facing intent obvious. Renamed both the directory and the frontmatter `name:` field. 
--- skills/{winml-cli => use-winml-cli}/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename skills/{winml-cli => use-winml-cli}/SKILL.md (99%) diff --git a/skills/winml-cli/SKILL.md b/skills/use-winml-cli/SKILL.md similarity index 99% rename from skills/winml-cli/SKILL.md rename to skills/use-winml-cli/SKILL.md index 99cd5fb86..548bf8cd5 100644 --- a/skills/winml-cli/SKILL.md +++ b/skills/use-winml-cli/SKILL.md @@ -1,5 +1,5 @@ --- -name: winml-cli +name: use-winml-cli description: Build, optimize, quantize, compile, and benchmark ONNX models for Windows ML using the `winml` CLI. Covers the Build-Your-Own-Model (BYOM) pipeline across NPU (Qualcomm QNN, Intel OpenVINO, AMD VitisAI), GPU, and CPU execution providers. Use this skill whenever the user wants to run a Hugging Face or ONNX model on a Windows AI PC, target an NPU, prepare a model for on-device inference, benchmark latency on Snapdragon X Elite / Intel Core Ultra / AMD Ryzen AI, or troubleshoot operator/EP compatibility — even when they don't say "winml" by name. If a user mentions running models on Windows hardware, NPU acceleration, or low-latency on-device inference, this skill applies. **Skip for generative models** — LLMs (GPT, LLaMA, Phi, Mistral), Stable Diffusion, Whisper, or any decoder-only / seq2seq architecture are out of scope (planned for late 2026). ---
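
The pre-compile check in the skill's gotchas list (confirm the target EP shows up in `winml sys --list-ep`) could be scripted roughly as below. This is a minimal sketch under stated assumptions: the helper name `ep_listed` is hypothetical, and it assumes each registered EP's name appears somewhere in the command's stdout, a layout the skill does not specify.

```bash
# ep_listed NAME: read an EP listing on stdin and succeed (exit 0) if NAME
# appears anywhere in it, ignoring case.
# Assumption: `winml sys --list-ep` prints each registered EP's name in its
# stdout. The exact layout is not documented in the skill, so this is a plain
# fixed-string substring match, not a real parser.
ep_listed() {
  grep -qiF -- "$1"
}

# Usage on a machine with winml installed (not executed here):
#   winml sys --list-ep | ep_listed QNN || echo "QNN not registered; fall back to CPU or DML"
```

Substring matching is deliberately loose, so a short name can also match inside a longer one; check the real listing on the target machine before relying on it.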