perf(inspect): dedupe AutoConfig + gate processor Auto* lookups (#543)#719
Merged
Conversation
Two changes inside the inspect resolver path that shave ~5 s off the
warm `winml inspect -m <id>` wall time:
1. Dedupe AutoConfig.from_pretrained
`_inspect_model_v2` already fetches the parent hf_config (needed
un-narrowed for resolve_io_config), then `resolve_loader_config`
fetched it again. Add an `hf_config=` kwarg to resolve_loader_config
so the caller can hand its pre-loaded config in, and pass it
through from inspect. Other callers are unaffected — the kwarg
defaults to None and the original lookup path is preserved.
2. Gate _resolve_processor_from_auto_classes per-field
`resolve_processor` used to invoke every Auto* class
(AutoProcessor, AutoTokenizer, AutoImageProcessor,
AutoFeatureExtractor) unconditionally even when Strategies 0+1 had
already populated the corresponding field. Add `try_*` kwargs and
skip each Auto* call whose field is already known. For
microsoft/resnet-50 this eliminates the AutoFeatureExtractor call
entirely (~4.2 s warm) and the redundant AutoImageProcessor call.
Measurements (warm cache, microsoft/resnet-50, --format json):
main: ~23 s wall, AutoConfig×5, AutoFeatureExtractor×1
this PR: ~18 s wall, AutoConfig×4, AutoFeatureExtractor×0
Tests:
- test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch
pins the dedupe — patching AutoConfig.from_pretrained and asserting
it is never called when hf_config= is supplied.
- test_resolve_processor_gating.py covers the three gating shapes:
all four fields filled → no Auto* call at all; partial fill →
correct try_* flags; nothing filled → all four flags True.
Cutting further requires restructuring AutoProcessor's internals (it
loads its own AutoConfig and may construct sub-processors). That's a
transformers-upstream concern, not ModelKit's.
Closes #543.
Collaborator
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
Overall this is a well-motivated performance optimization with clean implementation and good test coverage. Two comments inline.
Collaborator
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
Well-motivated perf optimization with clean implementation and solid test coverage. Two inline comments.
Addresses review feedback on #719. Previously _resolve_processor_from_auto_classes called AutoProcessor.from_pretrained (~3.5s warm) even when the caller only needed sub-pieces (tokenizer / image_processor / feature_extractor) — paying the AutoProcessor cost just for side-effects. Gate AutoProcessor on try_processor only. When the caller already knows processor_class from earlier strategies (e.g., from preprocessor_config.json), fall through to the cheaper standalone Auto* calls. For ResNet the path is unchanged (processor_class still unknown after Strategy 0+1), but for models with processor_class in preprocessor_config.json but missing tokenizer_config.json, this saves the AutoProcessor round-trip. Also drops the redundant inner OR-guard — the outer guard in resolve_processor already short-circuits Strategy 2 when all four need_* flags are False. Test: test_try_processor_false_skips_autoprocessor pins the new gate by asserting AutoProcessor.from_pretrained is not called when try_processor=False.
This was referenced May 25, 2026
timenick
added a commit
that referenced
this pull request
May 25, 2026
…o e2e (#727) ## Summary Closes #726. After #708, `WinMLSession(device="auto")` resolves to a concrete EP via `resolve_device()` and force-binds it through `add_provider_for_devices`. On the Windows CI runner the WinML EP registry advertises phantom NPU/GPU EP devices even without real hardware — force-binding those EPs segfaults natively in `InferenceSession` creation, surfacing as `Process completed with exit code 1` with no pytest traceback. The crash is non-deterministic (#719 happened to pass on a re-run while #717 failed on the same commit), so every PR is exposed to a random failure until main is fixed. ## Approach I first considered rewriting the affected tests to `device="cpu"`. On audit, almost all of them duplicate existing CPU-explicit coverage in the same file — they used `device="auto"` as a convenience, not to exercise auto-resolution semantics. Drop the redundant ones instead. | Test | Overlap | |---|---| | `test_run_uses_epcontext_after_compile` | `test_compile_is_idempotent` (compile()→COMPILED) | | `test_basic_inference` | `test_explicit_cpu_provider` + 5 perf tests already call `run(sample)` on cpu | | `test_inference_auto_compiles` | Implicit in every other test that calls `run()` without prior `compile()` | | `test_state_transitions` | `test_ep_name_is_none_before_compile` + `test_ep_name_after_compile` cover state transitions | | `test_reset_returns_to_initialized` | `test_reset_clears_error_state` exercises `reset()` | | `test_providers_are_valid_and_include_fallback` | Asserted pre-#708 'auto falls back to CPU' behaviour that #708 intentionally removed; `test_cpu_provider_always_available` covers the CPU-explicit case | Six tests deleted, one kept and converted: - **`test_inference_with_torch_tensor`** → `device="cpu"`. Sole test covering `torch.Tensor` input → numpy conversion path. ## Restoring `device="auto"` runtime coverage Added `test_auto_device_runtime_smoke` to [tests/e2e/test_session.py](tests/e2e/test_session.py) under the existing `@pytest.mark.e2e` class marker. End-to-end coverage of `resolve_device → add_provider_for_devices → InferenceSession` now lives where real hardware can be assumed. ## Verification ``` tests\unit\session\test_winml_session.py =========== 33 passed, 6 skipped in 3.02s =========== ``` The 5 fewer-passing-than-before are exactly the deleted redundant tests; nothing else moved.
DingmaomaoBJTU
approved these changes
May 25, 2026
This was referenced May 26, 2026
timenick
added a commit
that referenced
this pull request
May 26, 2026
## Summary Two extra Strategy-2 gating fixes on top of #719. Profile on `cardiffnlp/twitter-roberta-base-sentiment-latest` (text-only) showed the inspect command was still taking ~16 s warm — most of it spent on Auto* calls that didn't need to run. Targeting **`release/v0.1.0`** since #717 / #718 / #719 are already on that release; this is the natural follow-up. ## Profile (warm cache, `cardiffnlp/twitter-roberta-base-sentiment-latest`) **Before this PR**: ``` [0] AutoConfig 0.74s (parent_hf_config — already deduped by #719) [1] AutoProcessor 4.27s returns RobertaTokenizerFast [7] AutoTokenizer 2.22s ← redundant, AutoProcessor already returned the tokenizer [11] AutoImageProcessor 1.39s FAIL (text model has no preprocessor_config.json) [12] AutoFeatureExtractor 0.64s FAIL (same) ───── ~16 s total ``` **After this PR**: ``` AutoConfig 5 calls (vs 8) AutoProcessor 1 call AutoTokenizer 1 call (vs 2 — the redundant standalone load is gone) AutoImageProcessor 0 calls (skipped — no preprocessor_config.json) AutoFeatureExtractor 0 calls (skipped — same) ───── ~12 s total (~25% faster) ``` ## Change ### 1. Detect when `AutoProcessor` returns a leaf class For single-modality models, `AutoProcessor.from_pretrained` returns the leaf class directly — e.g. RoBERTa → `RobertaTokenizerFast`. Such a return has no `.tokenizer` wrapper attribute, so the old code couldn't populate `tokenizer_class` and fell through to a redundant `AutoTokenizer.from_pretrained` (~2 s warm). Pattern-match the returned class name (`*Tokenizer` / `*TokenizerFast`, `*ImageProcessor` / `*ImageProcessorFast`, `*FeatureExtractor`) and populate the corresponding field. The `.tokenizer` / `.image_processor` / `.feature_extractor` attribute path still wins for genuine multimodal `ProcessorMixin` returns (CLIP, etc.) — see the `test_autoprocessor_with_wrapped_pieces_uses_attributes` regression test. ### 2. `preprocessor_config.json` absence is authoritative `_resolve_processor_from_hub_configs` already tries to download `preprocessor_config.json`. When the hub returns 404, the model has *no* image processor or feature extractor, period. Surface this as a `has_preprocessor_config` bool from the helper so the caller can skip the `AutoImageProcessor` / `AutoFeatureExtractor` round-trips (~2 s total wasted confirming 404s). ## Tests `tests/unit/inspect/test_resolve_processor_gating.py`: - `test_autoprocessor_returns_tokenizer_fills_tokenizer_class` — leaf-class detection populates `tokenizer_class` from class-name suffix and skips standalone `AutoTokenizer` - `test_autoprocessor_returns_image_processor_fills_image_class` — same for `*ImageProcessor` - `test_autoprocessor_returns_feature_extractor_fills_feature_class` — same for `*FeatureExtractor` - `test_autoprocessor_with_wrapped_pieces_uses_attributes` — multimodal `ProcessorMixin` with `.tokenizer` attribute wins over name suffix - `test_missing_preprocessor_config_skips_image_and_feature` — `has_preprocessor_config=False` skips `AutoImageProcessor` / `AutoFeatureExtractor` 55 targeted tests pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #718 (banner + spinner). The spinner removed the perceived-hang UX problem but the total wall time was still ~23 s. This PR cuts the resolver-side redundancy to bring that to ~18 s.
Profile (warm cache,
microsoft/resnet-50)Before this PR (per-call counts inside
_inspect_model_v2):After this PR:
AutoConfigcalls 5 → 4,AutoFeatureExtractorentirely eliminated, duplicateAutoImageProcessorskipped.End-to-end wall time (
uv run winml inspect -m microsoft/resnet-50 --format json):Change
1. Dedupe
AutoConfig.from_pretrained_inspect_model_v2already loads the parenthf_config(it needs the un-narrowed config for I/O introspection —resolve_loader_configmay narrow it to a sub-config for multimodal models like CLIP). Thenresolve_loader_configloaded it again.Added an
hf_config=kwarg toresolve_loader_config— when supplied, step 1'sAutoConfig.from_pretrainedis skipped. Pass-through from_inspect_model_v2. Other callers default tohf_config=None, original behaviour preserved.2. Gate processor Auto* lookups per-field
resolve_processorran every Auto* class instantiation in Strategy 2, even when Strategies 0+1 had already populated the corresponding field. Addedtry_*kwargs to_resolve_processor_from_auto_classesand skip each Auto* whose field is already known.For ResNet, Strategies 0+1 set
image_processor_classfrom the HF registry, sotry_image_processor=Falseskips the redundantAutoImageProcessor.from_pretrained.AutoProcessorsucceeds and suppliesfeature_extractor_classas a side effect, soAutoFeatureExtractor(the 4.2s call) is skipped via the existing inner gate.If Strategies 0+1 already populate all four fields (CLIP-like models), Strategy 2 is skipped entirely.
Why not further?
The remaining 9 s of resolver work is inside transformers' Auto* internals —
AutoProcessor.from_pretraineddoes its own AutoConfig load and may construct sub-processors. Cutting deeper requires upstream restructuring, not a ModelKit-side fix.Tests
test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch— pins the dedupe by patchingAutoConfig.from_pretrainedand assertingcall_count == 0whenhf_config=is supplied.test_resolve_processor_gating.py— three cases:_resolve_processor_from_auto_classesis not called at alltry_*flags are Truetry_*flag is True112 targeted tests pass.
Closes #543.