perf(inspect): dedupe AutoConfig + gate processor Auto* lookups (#543) by timenick · Pull Request #719 · microsoft/winml-cli

timenick · 2026-05-25T05:21:06Z

Summary

Follow-up to #718 (banner + spinner). The spinner removed the perceived-hang UX problem but the total wall time was still ~23 s. This PR cuts the resolver-side redundancy to bring that to ~18 s.

Profile (warm cache, `microsoft/resnet-50`)

Before this PR (per-call timings inside _inspect_model_v2):

0.71s  AutoConfig.from_pretrained   ← explicit fetch in inspect.py
0.38s  AutoConfig.from_pretrained   ← duplicate inside resolve_loader_config
0.72s  resolve_io_specs
0.38s  hf_hub_download              ← preprocessor_config.json
0.31s  hf_hub_download              ← tokenizer_config.json (404)
0.70s  AutoTokenizer.from_pretrained
0.71s  AutoImageProcessor.from_pretrained
3.53s  AutoProcessor.from_pretrained
0.38s  AutoConfig.from_pretrained
0.70s  AutoTokenizer.from_pretrained
0.69s  AutoImageProcessor.from_pretrained
4.20s  AutoFeatureExtractor.from_pretrained
─────
12.33s total inspect work

After this PR:

… (same fast portion) …
0.69s  AutoImageProcessor.from_pretrained
3.49s  AutoProcessor.from_pretrained
0.70s  AutoTokenizer.from_pretrained
─────
9.29s total inspect work

AutoConfig calls 5 → 4, AutoFeatureExtractor entirely eliminated, duplicate AutoImageProcessor skipped.

End-to-end wall time (uv run winml inspect -m microsoft/resnet-50 --format json):

Branch	run 1	run 2	run 3
main	23.4 s	23.2 s	23.0 s
this PR	17.6 s	17.7 s	18.8 s
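The "~23 s to ~18 s" summary can be checked against the three runs in the table. A small sketch that averages them; the variable names are illustrative only and the sample values are copied from the table:

```python
from statistics import mean

# Wall-time samples in seconds, copied from the table above.
main_runs = [23.4, 23.2, 23.0]
pr_runs = [17.6, 17.7, 18.8]

main_mean = mean(main_runs)
pr_mean = mean(pr_runs)
saved = main_mean - pr_mean          # absolute saving in seconds
saved_pct = 100 * saved / main_mean  # saving relative to main

print(f"main {main_mean:.1f} s, this PR {pr_mean:.1f} s, "
      f"saved {saved:.1f} s ({saved_pct:.0f}%)")
```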

Change

1. Dedupe `AutoConfig.from_pretrained`

_inspect_model_v2 already loads the parent hf_config (it needs the un-narrowed config for I/O introspection — resolve_loader_config may narrow it to a sub-config for multimodal models like CLIP). Then resolve_loader_config loaded it again.

Added an hf_config= kwarg to resolve_loader_config. When it is supplied, step 1's AutoConfig.from_pretrained is skipped; _inspect_model_v2 passes its already-loaded config through. Other callers default to hf_config=None, which preserves the original behaviour.
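A minimal sketch of the pass-through, not the actual resolver code. The `fetch_config` parameter is a hypothetical stand-in for `AutoConfig.from_pretrained` so the example runs without `transformers`; the real `resolve_loader_config` signature and return value are not shown in this description and may differ.

```python
from typing import Any, Callable, List, Optional


def resolve_loader_config(
    model_id: str,
    *,
    hf_config: Optional[Any] = None,
    fetch_config: Callable[[str], Any],  # hypothetical injection point
) -> Any:
    """Return the config for model_id, reusing hf_config when the caller has one."""
    if hf_config is None:
        # Original behaviour: no pre-loaded config, so fetch it here.
        hf_config = fetch_config(model_id)
    return hf_config


calls: List[str] = []


def fake_fetch(model_id: str) -> dict:
    # Stand-in for AutoConfig.from_pretrained; records every call.
    calls.append(model_id)
    return {"model_type": "resnet"}


# The caller fetches once, then hands the result in.
parent = fake_fetch("microsoft/resnet-50")
reused = resolve_loader_config(
    "microsoft/resnet-50", hf_config=parent, fetch_config=fake_fetch
)
```

With `hf_config=` supplied, `calls` holds only the caller's own fetch; omitting the kwarg triggers a second one, which is the duplicate this change removes.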

2. Gate processor Auto* lookups per-field

resolve_processor ran every Auto* class instantiation in Strategy 2, even when Strategies 0+1 had already populated the corresponding field. Added try_* kwargs to _resolve_processor_from_auto_classes and skip each Auto* whose field is already known.

For ResNet, Strategies 0+1 set image_processor_class from the HF registry, so try_image_processor=False skips the redundant AutoImageProcessor.from_pretrained. AutoProcessor succeeds and supplies feature_extractor_class as a side effect, so AutoFeatureExtractor (the 4.2s call) is skipped via the existing inner gate.

If Strategies 0+1 already populate all four fields (CLIP-like models), Strategy 2 is skipped entirely.
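The caller-side gate can be sketched as follows. The four field names come from the text above; the `ProcessorFields` container, the `strategy2_flags` helper, and the `try_tokenizer` / `try_feature_extractor` flag names are assumptions for illustration (only `try_processor` and `try_image_processor` are named explicitly in this PR).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProcessorFields:
    # Hypothetical container for what Strategies 0+1 have found so far.
    processor_class: Optional[str] = None
    tokenizer_class: Optional[str] = None
    image_processor_class: Optional[str] = None
    feature_extractor_class: Optional[str] = None


def strategy2_flags(fields: ProcessorFields) -> Optional[Dict[str, bool]]:
    """Return try_* flags for Strategy 2, or None when it can be skipped."""
    flags = {
        "try_processor": fields.processor_class is None,
        "try_tokenizer": fields.tokenizer_class is None,
        "try_image_processor": fields.image_processor_class is None,
        "try_feature_extractor": fields.feature_extractor_class is None,
    }
    # All four fields known: nothing is left for the Auto* classes to find.
    return flags if any(flags.values()) else None


# ResNet-like case: only the image processor is known after Strategies 0+1.
resnet_like = ProcessorFields(image_processor_class="SomeImageProcessor")
resnet_flags = strategy2_flags(resnet_like)
```

For the ResNet-like input only `try_image_processor` comes back `False`; a fully populated input returns `None`, matching the "Strategy 2 is skipped entirely" case.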

Why not further?

The remaining 9 s of resolver work is inside transformers' Auto* internals — AutoProcessor.from_pretrained does its own AutoConfig load and may construct sub-processors. Cutting deeper requires upstream restructuring, not a ModelKit-side fix.

Tests

test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch — pins the dedupe by patching AutoConfig.from_pretrained and asserting call_count == 0 when hf_config= is supplied.
test_resolve_processor_gating.py — three cases:
- All four fields filled by Strategy 1 → _resolve_processor_from_auto_classes is not called at all
- Partial fill → only the still-needed try_* flags are True
- Nothing filled → every try_* flag is True
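The dedupe test relies on a standard `unittest.mock` pattern: patch the fetch function and assert it was never called. A self-contained sketch of that pattern, using a local `AutoConfig` stand-in and a simplified `resolve_loader_config` (both hypothetical) rather than the real modules:

```python
from types import SimpleNamespace
from unittest import mock

# Hypothetical stand-in for transformers.AutoConfig so the sketch runs offline.
AutoConfig = SimpleNamespace(from_pretrained=lambda model_id: {"id": model_id})


def resolve_loader_config(model_id, hf_config=None):
    # Simplified version of the function under test.
    if hf_config is None:
        hf_config = AutoConfig.from_pretrained(model_id)
    return hf_config


def test_hf_config_kwarg_skips_autoconfig_fetch():
    preloaded = {"preloaded": True}
    with mock.patch.object(AutoConfig, "from_pretrained") as fetch:
        result = resolve_loader_config("microsoft/resnet-50", hf_config=preloaded)
    # The supplied config is returned untouched and no fetch happened.
    assert fetch.call_count == 0
    assert result is preloaded


def test_default_still_fetches():
    with mock.patch.object(AutoConfig, "from_pretrained") as fetch:
        resolve_loader_config("microsoft/resnet-50")
    # Without the kwarg, the original lookup path runs exactly once.
    assert fetch.call_count == 1
```

Pairing the "zero calls" assertion with a "one call by default" case guards against a patch target that is silently wrong.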

112 targeted tests pass.

Closes #543.

Two changes inside the inspect resolver path that shave ~5 s off the warm `winml inspect -m <id>` wall time:

1. Dedupe AutoConfig.from_pretrained

`_inspect_model_v2` already fetches the parent hf_config (needed un-narrowed for resolve_io_config), then `resolve_loader_config` fetched it again. Add an `hf_config=` kwarg to resolve_loader_config so the caller can hand its pre-loaded config in, and pass it through from inspect. Other callers are unaffected: the kwarg defaults to None and the original lookup path is preserved.

2. Gate _resolve_processor_from_auto_classes per-field

`resolve_processor` used to invoke every Auto* class (AutoProcessor, AutoTokenizer, AutoImageProcessor, AutoFeatureExtractor) unconditionally even when Strategies 0+1 had already populated the corresponding field. Add `try_*` kwargs and skip each Auto* call whose field is already known. For microsoft/resnet-50 this eliminates the AutoFeatureExtractor call entirely (~4.2 s warm) and the redundant AutoImageProcessor call.

Measurements (warm cache, microsoft/resnet-50, --format json):

main: ~23 s wall, AutoConfig×5, AutoFeatureExtractor×1
this PR: ~18 s wall, AutoConfig×4, AutoFeatureExtractor×0

Tests:

- test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch pins the dedupe by patching AutoConfig.from_pretrained and asserting it is never called when hf_config= is supplied.
- test_resolve_processor_gating.py covers the three gating shapes: all four fields filled → no Auto* call at all; partial fill → correct try_* flags; nothing filled → all four flags True.

Cutting further requires restructuring AutoProcessor's internals (it loads its own AutoConfig and may construct sub-processors). That's a transformers-upstream concern, not ModelKit's.

Closes #543.

DingmaomaoBJTU

Overall this is a well-motivated performance optimization with clean implementation and good test coverage. Two comments inline.

DingmaomaoBJTU

Well-motivated perf optimization with clean implementation and solid test coverage. Two inline comments.

Addresses review feedback on #719.

Previously _resolve_processor_from_auto_classes called AutoProcessor.from_pretrained (~3.5s warm) even when the caller only needed sub-pieces (tokenizer / image_processor / feature_extractor), paying the AutoProcessor cost just for side-effects.

Gate AutoProcessor on try_processor only. When the caller already knows processor_class from earlier strategies (e.g., from preprocessor_config.json), fall through to the cheaper standalone Auto* calls.

For ResNet the path is unchanged (processor_class still unknown after Strategy 0+1), but for models with processor_class in preprocessor_config.json but missing tokenizer_config.json, this saves the AutoProcessor round-trip.

Also drops the redundant inner OR-guard: the outer guard in resolve_processor already short-circuits Strategy 2 when all four need_* flags are False.

Test: test_try_processor_false_skips_autoprocessor pins the new gate by asserting AutoProcessor.from_pretrained is not called when try_processor=False.
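The callee-side gate can be pictured as below. This is a simplified sketch under stated assumptions: the `loaders` dict and `make_loader` helper are hypothetical stand-ins for the `Auto*.from_pretrained` calls, the `try_tokenizer` / `try_feature_extractor` names are inferred from the `try_*` pattern, and the real function also harvests sub-pieces from a successful `AutoProcessor` return, which is omitted here.

```python
from typing import Callable, Dict, List


def resolve_from_auto_classes(
    loaders: Dict[str, Callable[[], str]],
    *,
    try_processor: bool = True,
    try_tokenizer: bool = True,
    try_image_processor: bool = True,
    try_feature_extractor: bool = True,
) -> Dict[str, str]:
    """Call only the loaders whose try_* flag is set."""
    flags = {
        "processor": try_processor,  # gates the expensive AutoProcessor call
        "tokenizer": try_tokenizer,
        "image_processor": try_image_processor,
        "feature_extractor": try_feature_extractor,
    }
    return {name: loaders[name]() for name, wanted in flags.items() if wanted}


called: List[str] = []


def make_loader(name: str) -> Callable[[], str]:
    # Stand-in for an Auto*.from_pretrained call; records that it ran.
    def load() -> str:
        called.append(name)
        return "Fake" + name.title().replace("_", "")
    return load


loaders = {
    n: make_loader(n)
    for n in ("processor", "tokenizer", "image_processor", "feature_extractor")
}

# processor_class is already known; only the tokenizer is still missing.
found = resolve_from_auto_classes(
    loaders,
    try_processor=False,
    try_image_processor=False,
    try_feature_extractor=False,
)
```

Here only the tokenizer loader runs, so the AutoProcessor-equivalent cost is never paid for its side effects.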

…o e2e (#727)

## Summary

Closes #726.

After #708, `WinMLSession(device="auto")` resolves to a concrete EP via `resolve_device()` and force-binds it through `add_provider_for_devices`. On the Windows CI runner the WinML EP registry advertises phantom NPU/GPU EP devices even without real hardware; force-binding those EPs segfaults natively in `InferenceSession` creation, surfacing as `Process completed with exit code 1` with no pytest traceback.

The crash is non-deterministic (#719 happened to pass on a re-run while #717 failed on the same commit), so every PR is exposed to a random failure until main is fixed.

## Approach

I first considered rewriting the affected tests to `device="cpu"`. On audit, almost all of them duplicate existing CPU-explicit coverage in the same file: they used `device="auto"` as a convenience, not to exercise auto-resolution semantics. Drop the redundant ones instead.

| Test | Overlap |
|---|---|
| `test_run_uses_epcontext_after_compile` | `test_compile_is_idempotent` (compile()→COMPILED) |
| `test_basic_inference` | `test_explicit_cpu_provider` + 5 perf tests already call `run(sample)` on cpu |
| `test_inference_auto_compiles` | Implicit in every other test that calls `run()` without prior `compile()` |
| `test_state_transitions` | `test_ep_name_is_none_before_compile` + `test_ep_name_after_compile` cover state transitions |
| `test_reset_returns_to_initialized` | `test_reset_clears_error_state` exercises `reset()` |
| `test_providers_are_valid_and_include_fallback` | Asserted pre-#708 'auto falls back to CPU' behaviour that #708 intentionally removed; `test_cpu_provider_always_available` covers the CPU-explicit case |

Six tests deleted, one kept and converted:

- **`test_inference_with_torch_tensor`** → `device="cpu"`. Sole test covering `torch.Tensor` input → numpy conversion path.

## Restoring `device="auto"` runtime coverage

Added `test_auto_device_runtime_smoke` to [tests/e2e/test_session.py](tests/e2e/test_session.py) under the existing `@pytest.mark.e2e` class marker. End-to-end coverage of `resolve_device → add_provider_for_devices → InferenceSession` now lives where real hardware can be assumed.

## Verification

```
tests\unit\session\test_winml_session.py
=========== 33 passed, 6 skipped in 3.02s ===========
```

The 5 fewer-passing-than-before are exactly the deleted redundant tests; nothing else moved.

## Summary

Two extra Strategy-2 gating fixes on top of #719. Profile on `cardiffnlp/twitter-roberta-base-sentiment-latest` (text-only) showed the inspect command was still taking ~16 s warm, most of it spent on Auto* calls that didn't need to run.

Targeting **`release/v0.1.0`** since #717 / #718 / #719 are already on that release; this is the natural follow-up.

## Profile (warm cache, `cardiffnlp/twitter-roberta-base-sentiment-latest`)

**Before this PR**:

```
[0]  AutoConfig            0.74s  (parent_hf_config, already deduped by #719)
[1]  AutoProcessor         4.27s  returns RobertaTokenizerFast
[7]  AutoTokenizer         2.22s  ← redundant, AutoProcessor already returned the tokenizer
[11] AutoImageProcessor    1.39s  FAIL (text model has no preprocessor_config.json)
[12] AutoFeatureExtractor  0.64s  FAIL (same)
─────
~16 s total
```

**After this PR**:

```
AutoConfig            5 calls (vs 8)
AutoProcessor         1 call
AutoTokenizer         1 call  (vs 2; the redundant standalone load is gone)
AutoImageProcessor    0 calls (skipped, no preprocessor_config.json)
AutoFeatureExtractor  0 calls (skipped, same)
─────
~12 s total (~25% faster)
```

## Change

### 1. Detect when `AutoProcessor` returns a leaf class

For single-modality models, `AutoProcessor.from_pretrained` returns the leaf class directly, e.g. RoBERTa → `RobertaTokenizerFast`. Such a return has no `.tokenizer` wrapper attribute, so the old code couldn't populate `tokenizer_class` and fell through to a redundant `AutoTokenizer.from_pretrained` (~2 s warm).

Pattern-match the returned class name (`*Tokenizer` / `*TokenizerFast`, `*ImageProcessor` / `*ImageProcessorFast`, `*FeatureExtractor`) and populate the corresponding field. The `.tokenizer` / `.image_processor` / `.feature_extractor` attribute path still wins for genuine multimodal `ProcessorMixin` returns (CLIP, etc.); see the `test_autoprocessor_with_wrapped_pieces_uses_attributes` regression test.

### 2. `preprocessor_config.json` absence is authoritative

`_resolve_processor_from_hub_configs` already tries to download `preprocessor_config.json`. When the hub returns 404, the model has *no* image processor or feature extractor, period. Surface this as a `has_preprocessor_config` bool from the helper so the caller can skip the `AutoImageProcessor` / `AutoFeatureExtractor` round-trips (~2 s total wasted confirming 404s).

## Tests

`tests/unit/inspect/test_resolve_processor_gating.py`:

- `test_autoprocessor_returns_tokenizer_fills_tokenizer_class`: leaf-class detection populates `tokenizer_class` from class-name suffix and skips standalone `AutoTokenizer`
- `test_autoprocessor_returns_image_processor_fills_image_class`: same for `*ImageProcessor`
- `test_autoprocessor_returns_feature_extractor_fills_feature_class`: same for `*FeatureExtractor`
- `test_autoprocessor_with_wrapped_pieces_uses_attributes`: multimodal `ProcessorMixin` with `.tokenizer` attribute wins over name suffix
- `test_missing_preprocessor_config_skips_image_and_feature`: `has_preprocessor_config=False` skips `AutoImageProcessor` / `AutoFeatureExtractor`

55 targeted tests pass.
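The leaf-class detection in change 1 amounts to a suffix match on the returned class name. A minimal sketch, assuming a hypothetical helper named `field_for_leaf_class`; it covers only the name-matching branch and not the attribute path that takes priority for `ProcessorMixin` returns:

```python
from typing import Optional

# Suffix patterns from the description, mapped to the field each one fills.
_SUFFIX_TO_FIELD = (
    ("TokenizerFast", "tokenizer_class"),
    ("Tokenizer", "tokenizer_class"),
    ("ImageProcessorFast", "image_processor_class"),
    ("ImageProcessor", "image_processor_class"),
    ("FeatureExtractor", "feature_extractor_class"),
)


def field_for_leaf_class(class_name: str) -> Optional[str]:
    """Map a leaf class name to the field it should populate, if any."""
    for suffix, field in _SUFFIX_TO_FIELD:
        if class_name.endswith(suffix):
            return field
    # Wrapper processors such as CLIPProcessor match no leaf suffix.
    return None


roberta_field = field_for_leaf_class("RobertaTokenizerFast")
wrapper_field = field_for_leaf_class("CLIPProcessor")
```

Matching on the name rather than on `isinstance` keeps the check free of extra imports, at the cost of depending on the naming convention holding for every model.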

timenick requested a review from a team as a code owner May 25, 2026 05:21

Merge branch 'main' into zhiwang/inspect-resolver-dedupe

677ff67

DingmaomaoBJTU reviewed May 25, 2026


Comment thread src/winml/modelkit/inspect/resolver.py

Comment thread src/winml/modelkit/inspect/resolver.py

This was referenced May 25, 2026

[session] WinMLSession(device='auto') crashes on hardware-less Windows CI after #708 #726

Closed

test(session): drop redundant device='auto' tests + move auto smoke to e2e #727

Merged

DingmaomaoBJTU approved these changes May 25, 2026


timenick merged commit 3052423 into main May 25, 2026
9 checks passed

timenick deleted the zhiwang/inspect-resolver-dedupe branch May 25, 2026 09:16

This was referenced May 26, 2026

Validate model task in config. #723

Merged

perf(inspect): skip redundant Auto* calls for text-only models #746

Merged

timenick merged 3 commits into main from zhiwang/inspect-resolver-dedupe

2 participants

Conversation

timenick commented May 25, 2026

Summary

Profile (warm cache, microsoft/resnet-50)

Change

1. Dedupe AutoConfig.from_pretrained

2. Gate processor Auto* lookups per-field

Why not further?

Tests

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

DingmaomaoBJTU left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Profile (warm cache, `microsoft/resnet-50`)

1. Dedupe `AutoConfig.from_pretrained`