perf(inspect): dedupe AutoConfig + gate processor Auto* lookups (#543) #719

Merged
timenick merged 3 commits into main from zhiwang/inspect-resolver-dedupe
May 25, 2026
@timenick
Collaborator

Summary

Follow-up to #718 (banner + spinner). The spinner removed the perceived-hang UX problem but the total wall time was still ~23 s. This PR cuts the resolver-side redundancy to bring that to ~18 s.

Profile (warm cache, microsoft/resnet-50)

Before this PR (per-call counts inside _inspect_model_v2):

0.71s  AutoConfig.from_pretrained   ← explicit fetch in inspect.py
0.38s  AutoConfig.from_pretrained   ← duplicate inside resolve_loader_config
0.72s  resolve_io_specs
0.38s  hf_hub_download              ← preprocessor_config.json
0.31s  hf_hub_download              ← tokenizer_config.json (404)
0.70s  AutoTokenizer.from_pretrained
0.71s  AutoImageProcessor.from_pretrained
3.53s  AutoProcessor.from_pretrained
0.38s  AutoConfig.from_pretrained
0.70s  AutoTokenizer.from_pretrained
0.69s  AutoImageProcessor.from_pretrained
4.20s  AutoFeatureExtractor.from_pretrained
─────
12.33s total inspect work

After this PR:

… (same fast portion) …
0.69s  AutoImageProcessor.from_pretrained
3.49s  AutoProcessor.from_pretrained
0.70s  AutoTokenizer.from_pretrained
─────
9.29s total inspect work

AutoConfig calls 5 → 4, AutoFeatureExtractor entirely eliminated, duplicate AutoImageProcessor skipped.

End-to-end wall time (uv run winml inspect -m microsoft/resnet-50 --format json):

| Branch | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| main | 23.4 s | 23.2 s | 23.0 s |
| this PR | 17.6 s | 17.7 s | 18.8 s |

Change

1. Dedupe AutoConfig.from_pretrained

_inspect_model_v2 already loads the parent hf_config (it needs the un-narrowed config for I/O introspection — resolve_loader_config may narrow it to a sub-config for multimodal models like CLIP). Then resolve_loader_config loaded it again.

Added an hf_config= kwarg to resolve_loader_config. When it is supplied, step 1's AutoConfig.from_pretrained is skipped. _inspect_model_v2 passes its pre-loaded config through; other callers default to hf_config=None, which preserves the original behaviour.
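A minimal sketch of the dedupe pattern, not the actual resolver code. The `load_config` parameter is a hypothetical injection point standing in for AutoConfig.from_pretrained so the example runs without network access; the real resolve_loader_config signature may differ.

```python
from typing import Any, Callable, List, Optional


def resolve_loader_config(
    model_id: str,
    *,
    hf_config: Optional[Any] = None,
    load_config: Callable[[str], Any],
) -> Any:
    """Return the config for `model_id`, reusing `hf_config` when given.

    `load_config` is an assumption of this sketch: it plays the role of
    the expensive config fetch.
    """
    if hf_config is None:
        # Original behaviour: nothing was pre-loaded, so fetch here.
        hf_config = load_config(model_id)
    return hf_config


calls: List[str] = []


def fake_load(model_id: str) -> dict:
    """Record every fetch so duplicate loads are visible."""
    calls.append(model_id)
    return {"model_type": "resnet"}


# The caller fetches once, then hands the result to the resolver.
parent = fake_load("microsoft/resnet-50")
resolved = resolve_loader_config(
    "microsoft/resnet-50", hf_config=parent, load_config=fake_load
)
```

With the kwarg supplied, `calls` holds a single entry (the caller's own fetch). Omitting `hf_config` falls back to the fetch inside the resolver, which is the path other callers keep using.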

2. Gate processor Auto* lookups per-field

resolve_processor ran every Auto* class instantiation in Strategy 2, even when Strategies 0+1 had already populated the corresponding field. Added try_* kwargs to _resolve_processor_from_auto_classes and skip each Auto* whose field is already known.

For ResNet, Strategies 0+1 set image_processor_class from the HF registry, so try_image_processor=False skips the redundant AutoImageProcessor.from_pretrained. AutoProcessor succeeds and supplies feature_extractor_class as a side effect, so AutoFeatureExtractor (the 4.2s call) is skipped via the existing inner gate.

If Strategies 0+1 already populate all four fields (CLIP-like models), Strategy 2 is skipped entirely.
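The per-field gate can be sketched roughly as follows. `ProcessorInfo`, `gating_flags`, and `needs_strategy_2` are hypothetical names for illustration; only the try_* flag names and the four field names come from this PR, and the class name stored in the example is a placeholder.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProcessorInfo:
    """Fields that Strategies 0+1 may or may not have populated."""

    processor_class: Optional[str] = None
    tokenizer_class: Optional[str] = None
    image_processor_class: Optional[str] = None
    feature_extractor_class: Optional[str] = None


def gating_flags(info: ProcessorInfo) -> Dict[str, bool]:
    """One try_* flag per field, True only while the field is unknown."""
    return {
        "try_processor": info.processor_class is None,
        "try_tokenizer": info.tokenizer_class is None,
        "try_image_processor": info.image_processor_class is None,
        "try_feature_extractor": info.feature_extractor_class is None,
    }


def needs_strategy_2(info: ProcessorInfo) -> bool:
    """Strategy 2 runs only if at least one field is still missing."""
    return any(gating_flags(info).values())


# ResNet-like case: only the image processor is known after Strategies 0+1.
resnet_like = ProcessorInfo(image_processor_class="SomeImageProcessor")
flags = gating_flags(resnet_like)

# CLIP-like case: all four fields known, so Strategy 2 is skipped.
clip_like = ProcessorInfo("P", "T", "I", "F")
```

Here `flags["try_image_processor"]` is False while the other three flags stay True, and `needs_strategy_2(clip_like)` is False.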

Why not further?

The remaining 9 s of resolver work is inside transformers' Auto* internals — AutoProcessor.from_pretrained does its own AutoConfig load and may construct sub-processors. Cutting deeper requires upstream restructuring, not a ModelKit-side fix.

Tests

  • test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch — pins the dedupe by patching AutoConfig.from_pretrained and asserting call_count == 0 when hf_config= is supplied.
  • test_resolve_processor_gating.py — three cases:
    • All four fields filled by Strategy 1 → _resolve_processor_from_auto_classes is not called at all
    • Partial fill → only the still-needed try_* flags are True
    • Nothing filled → every try_* flag is True

112 targeted tests pass.

Closes #543.

Two changes inside the inspect resolver path that shave ~5 s off the
warm `winml inspect -m <id>` wall time:

1. Dedupe AutoConfig.from_pretrained
   `_inspect_model_v2` already fetches the parent hf_config (needed
   un-narrowed for resolve_io_config), then `resolve_loader_config`
   fetched it again. Add an `hf_config=` kwarg to resolve_loader_config
   so the caller can hand its pre-loaded config in, and pass it
   through from inspect. Other callers are unaffected — the kwarg
   defaults to None and the original lookup path is preserved.

2. Gate _resolve_processor_from_auto_classes per-field
   `resolve_processor` used to invoke every Auto* class
   (AutoProcessor, AutoTokenizer, AutoImageProcessor,
   AutoFeatureExtractor) unconditionally even when Strategies 0+1 had
   already populated the corresponding field. Add `try_*` kwargs and
   skip each Auto* call whose field is already known. For
   microsoft/resnet-50 this eliminates the AutoFeatureExtractor call
   entirely (~4.2 s warm) and the redundant AutoImageProcessor call.

Measurements (warm cache, microsoft/resnet-50, --format json):
  main:      ~23 s wall, AutoConfig×5, AutoFeatureExtractor×1
  this PR:   ~18 s wall, AutoConfig×4, AutoFeatureExtractor×0

Tests:
  - test_resolve_loader_config.py::test_hf_config_kwarg_skips_autoconfig_fetch
    pins the dedupe — patching AutoConfig.from_pretrained and asserting
    it is never called when hf_config= is supplied.
  - test_resolve_processor_gating.py covers the three gating shapes:
    all four fields filled → no Auto* call at all; partial fill →
    correct try_* flags; nothing filled → all four flags True.

Cutting further requires restructuring AutoProcessor's internals (it
loads its own AutoConfig and may construct sub-processors). That's a
transformers-upstream concern, not ModelKit's.

Closes #543.
@timenick timenick requested a review from a team as a code owner May 25, 2026 05:21
Collaborator

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Overall this is a well-motivated performance optimization with clean implementation and good test coverage. Two comments inline.

Collaborator

@DingmaomaoBJTU DingmaomaoBJTU left a comment

Well-motivated perf optimization with clean implementation and solid test coverage. Two inline comments.

Comment thread src/winml/modelkit/inspect/resolver.py
Comment thread src/winml/modelkit/inspect/resolver.py
Addresses review feedback on #719. Previously _resolve_processor_from_auto_classes
called AutoProcessor.from_pretrained (~3.5s warm) even when the caller only
needed sub-pieces (tokenizer / image_processor / feature_extractor) — paying
the AutoProcessor cost just for side-effects.

Gate AutoProcessor on try_processor only. When the caller already knows
processor_class from earlier strategies (e.g., from preprocessor_config.json),
fall through to the cheaper standalone Auto* calls. For ResNet the path is
unchanged (processor_class is still unknown after Strategies 0+1), but for models
that declare processor_class in preprocessor_config.json and lack
tokenizer_config.json, this saves the AutoProcessor round-trip.

Also drops the redundant inner OR-guard — the outer guard in resolve_processor
already short-circuits Strategy 2 when all four need_* flags are False.

Test: test_try_processor_false_skips_autoprocessor pins the new gate by
asserting AutoProcessor.from_pretrained is not called when try_processor=False.
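
A simplified sketch of the consumer side of the gate, assuming each Auto* lookup is represented by a zero-argument callable. `resolve_from_auto_classes` and the `loaders` mapping are hypothetical; the real helper also harvests sub-pieces from a successful AutoProcessor return, which this sketch leaves out.

```python
from typing import Callable, Dict, List


def resolve_from_auto_classes(
    loaders: Dict[str, Callable[[], str]],
    *,
    try_processor: bool = True,
    try_tokenizer: bool = True,
    try_image_processor: bool = True,
    try_feature_extractor: bool = True,
) -> Dict[str, str]:
    """Run only the lookups whose gate is open."""
    gates = {
        "processor": try_processor,
        "tokenizer": try_tokenizer,
        "image_processor": try_image_processor,
        "feature_extractor": try_feature_extractor,
    }
    found: Dict[str, str] = {}
    for field, enabled in gates.items():
        if enabled:
            # Each loader stands in for one Auto*.from_pretrained call.
            found[field] = loaders[field]()
    return found


calls: List[str] = []


def make_loader(field: str) -> Callable[[], str]:
    def load() -> str:
        calls.append(field)
        return "Fake" + field.title()
    return load


loaders = {
    name: make_loader(name)
    for name in ("processor", "tokenizer", "image_processor", "feature_extractor")
}

# processor_class is already known, so only the tokenizer lookup runs.
found = resolve_from_auto_classes(
    loaders,
    try_processor=False,
    try_image_processor=False,
    try_feature_extractor=False,
)
```

In this run `calls` records only the tokenizer lookup: the expensive processor loader is never invoked just for its side effects.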
timenick added a commit that referenced this pull request May 25, 2026
…o e2e (#727)

## Summary

Closes #726.

After #708, `WinMLSession(device="auto")` resolves to a concrete EP via
`resolve_device()` and force-binds it through
`add_provider_for_devices`. On the Windows CI runner the WinML EP
registry advertises phantom NPU/GPU EP devices even without real
hardware — force-binding those EPs segfaults natively in
`InferenceSession` creation, surfacing as `Process completed with exit
code 1` with no pytest traceback.

The crash is non-deterministic (#719 happened to pass on a re-run while
#717 failed on the same commit), so every PR is exposed to a random
failure until main is fixed.

## Approach

I first considered rewriting the affected tests to `device="cpu"`. On
audit, almost all of them duplicate existing CPU-explicit coverage in
the same file — they used `device="auto"` as a convenience, not to
exercise auto-resolution semantics. Drop the redundant ones instead.

| Test | Overlap |
|---|---|
| `test_run_uses_epcontext_after_compile` | `test_compile_is_idempotent` (compile()→COMPILED) |
| `test_basic_inference` | `test_explicit_cpu_provider` + 5 perf tests already call `run(sample)` on cpu |
| `test_inference_auto_compiles` | Implicit in every other test that calls `run()` without prior `compile()` |
| `test_state_transitions` | `test_ep_name_is_none_before_compile` + `test_ep_name_after_compile` cover state transitions |
| `test_reset_returns_to_initialized` | `test_reset_clears_error_state` exercises `reset()` |
| `test_providers_are_valid_and_include_fallback` | Asserted pre-#708 'auto falls back to CPU' behaviour that #708 intentionally removed; `test_cpu_provider_always_available` covers the CPU-explicit case |

Six tests deleted, one kept and converted:

- **`test_inference_with_torch_tensor`** → `device="cpu"`. Sole test
covering `torch.Tensor` input → numpy conversion path.

## Restoring `device="auto"` runtime coverage

Added `test_auto_device_runtime_smoke` to
[tests/e2e/test_session.py](tests/e2e/test_session.py) under the
existing `@pytest.mark.e2e` class marker. End-to-end coverage of
`resolve_device → add_provider_for_devices → InferenceSession` now lives
where real hardware can be assumed.

## Verification

```
tests\unit\session\test_winml_session.py
=========== 33 passed, 6 skipped in 3.02s ===========
```

The 5 fewer-passing-than-before are exactly the deleted redundant tests;
nothing else moved.
@timenick timenick merged commit 3052423 into main May 25, 2026
9 checks passed
@timenick timenick deleted the zhiwang/inspect-resolver-dedupe branch May 25, 2026 09:16
timenick added a commit that referenced this pull request May 26, 2026
## Summary

Two extra Strategy-2 gating fixes on top of #719. Profile on
`cardiffnlp/twitter-roberta-base-sentiment-latest` (text-only) showed
the inspect command was still taking ~16 s warm — most of it spent on
Auto* calls that didn't need to run.

Targeting **`release/v0.1.0`** since #717 / #718 / #719 are already on
that release; this is the natural follow-up.

## Profile (warm cache,
`cardiffnlp/twitter-roberta-base-sentiment-latest`)

**Before this PR**:

```
[0]  AutoConfig             0.74s  (parent_hf_config — already deduped by #719)
[1]  AutoProcessor          4.27s  returns RobertaTokenizerFast
[7]  AutoTokenizer          2.22s  ← redundant, AutoProcessor already returned the tokenizer
[11] AutoImageProcessor     1.39s  FAIL (text model has no preprocessor_config.json)
[12] AutoFeatureExtractor   0.64s  FAIL (same)
─────
~16 s total
```

**After this PR**:

```
AutoConfig             5 calls (vs 8)
AutoProcessor          1 call
AutoTokenizer          1 call   (vs 2 — the redundant standalone load is gone)
AutoImageProcessor     0 calls  (skipped — no preprocessor_config.json)
AutoFeatureExtractor   0 calls  (skipped — same)
─────
~12 s total  (~25% faster)
```

## Change

### 1. Detect when `AutoProcessor` returns a leaf class

For single-modality models, `AutoProcessor.from_pretrained` returns the
leaf class directly — e.g. RoBERTa → `RobertaTokenizerFast`. Such a
return has no `.tokenizer` wrapper attribute, so the old code couldn't
populate `tokenizer_class` and fell through to a redundant
`AutoTokenizer.from_pretrained` (~2 s warm).

Pattern-match the returned class name (`*Tokenizer` / `*TokenizerFast`,
`*ImageProcessor` / `*ImageProcessorFast`, `*FeatureExtractor`) and
populate the corresponding field. The `.tokenizer` / `.image_processor`
/ `.feature_extractor` attribute path still wins for genuine multimodal
`ProcessorMixin` returns (CLIP, etc.) — see the
`test_autoprocessor_with_wrapped_pieces_uses_attributes` regression
test.
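
A rough sketch of the detection order described above, assuming plain attribute and class-name checks. `classify_processor_return` and the dummy classes are hypothetical and only illustrate the precedence: wrapped attributes first, name suffix second.

```python
from typing import Dict

# Wrapper attributes on a multimodal processor, mapped to result fields.
_WRAPPED = (
    ("tokenizer", "tokenizer_class"),
    ("image_processor", "image_processor_class"),
    ("feature_extractor", "feature_extractor_class"),
)

# Class-name suffixes for leaf returns, mapped to result fields.
_SUFFIXES = (
    ("TokenizerFast", "tokenizer_class"),
    ("Tokenizer", "tokenizer_class"),
    ("ImageProcessorFast", "image_processor_class"),
    ("ImageProcessor", "image_processor_class"),
    ("FeatureExtractor", "feature_extractor_class"),
)


def classify_processor_return(obj: object) -> Dict[str, str]:
    """Map whatever the processor lookup returned to known class fields."""
    fields: Dict[str, str] = {}
    for attr, field in _WRAPPED:
        piece = getattr(obj, attr, None)
        if piece is not None:
            fields[field] = type(piece).__name__
    if fields:
        # Genuine wrapper: the attribute path wins over the name suffix.
        return fields
    name = type(obj).__name__
    for suffix, field in _SUFFIXES:
        if name.endswith(suffix):
            return {field: name}
    return {}


# Dummy classes for illustration only; they mimic the naming convention.
class RobertaTokenizerFast:
    pass


class CLIPTokenizer:
    pass


class CLIPImageProcessor:
    pass


class CLIPProcessor:
    def __init__(self) -> None:
        self.tokenizer = CLIPTokenizer()
        self.image_processor = CLIPImageProcessor()


leaf = classify_processor_return(RobertaTokenizerFast())
wrapped = classify_processor_return(CLIPProcessor())
```

Once `tokenizer_class` is filled from the leaf return, the matching try_* gate closes and the standalone tokenizer load is skipped.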

### 2. `preprocessor_config.json` absence is authoritative

`_resolve_processor_from_hub_configs` already tries to download
`preprocessor_config.json`. When the hub returns 404, the model has *no*
image processor or feature extractor, period. Surface this as a
`has_preprocessor_config` bool from the helper so the caller can skip
the `AutoImageProcessor` / `AutoFeatureExtractor` round-trips (~2 s
total wasted confirming 404s).
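
One way this could look, as a sketch only: `FileMissing`, `read_preprocessor_config`, and the injected `download` callable are hypothetical stand-ins for the hub client and its not-found error, chosen so the example runs offline.

```python
from typing import Callable, Dict, Optional, Tuple


class FileMissing(Exception):
    """Hypothetical stand-in for the hub client's 404 error."""


def read_preprocessor_config(
    model_id: str,
    download: Callable[[str, str], dict],
) -> Tuple[Optional[dict], bool]:
    """Return (config, has_preprocessor_config) for `model_id`."""
    try:
        return download(model_id, "preprocessor_config.json"), True
    except FileMissing:
        # A 404 is treated as authoritative: no image or feature processor.
        return None, False


def image_and_feature_gates(has_preprocessor_config: bool) -> Dict[str, bool]:
    """Only try the image and feature lookups when the file exists."""
    return {
        "try_image_processor": has_preprocessor_config,
        "try_feature_extractor": has_preprocessor_config,
    }


def text_only_download(model_id: str, filename: str) -> dict:
    """Simulate a text-only repo that ships no preprocessor config."""
    raise FileMissing(filename)


def vision_download(model_id: str, filename: str) -> dict:
    """Simulate a vision repo whose preprocessor config exists."""
    return {"image_processor_type": "SomeImageProcessor"}


_, text_has_config = read_preprocessor_config("some/text-model", text_only_download)
_, vision_has_config = read_preprocessor_config("some/vision-model", vision_download)
text_gates = image_and_feature_gates(text_has_config)
```

For the text-only case both gates come back False, so neither lookup is attempted.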

## Tests

`tests/unit/inspect/test_resolve_processor_gating.py`:

- `test_autoprocessor_returns_tokenizer_fills_tokenizer_class` —
leaf-class detection populates `tokenizer_class` from class-name suffix
and skips standalone `AutoTokenizer`
- `test_autoprocessor_returns_image_processor_fills_image_class` — same
for `*ImageProcessor`
- `test_autoprocessor_returns_feature_extractor_fills_feature_class` —
same for `*FeatureExtractor`
- `test_autoprocessor_with_wrapped_pieces_uses_attributes` — multimodal
`ProcessorMixin` with `.tokenizer` attribute wins over name suffix
- `test_missing_preprocessor_config_skips_image_and_feature` —
`has_preprocessor_config=False` skips `AutoImageProcessor` /
`AutoFeatureExtractor`

55 targeted tests pass.
Development

Successfully merging this pull request may close these issues.

[winml inspect] [P0] winml inspect -m <hf_id> takes 24 s end-to-end; first user-visible output silent for ~14 s

2 participants