fix(winml): null EpCatalog handle after enumeration to prevent QNN NPU crash on exit #701

Merged
DingmaomaoBJTU merged 3 commits into main from qiowu/run_qnn_eval
May 22, 2026
Conversation

@DingmaomaoBJTU
Collaborator

Summary

WinMLEpCatalogRelease crashes with ACCESS_VIOLATION (0xC0000005) on some QNN NPU driver configurations during Python interpreter shutdown, causing every non-cached winml build to exit with STATUS_ACCESS_VIOLATION instead of 0.

Root Cause

Two independent singletons, WinML (in winml.py) and WinMLEPRegistry (in session/ep_registry.py), each create a windowsml.EpCatalog instance to enumerate EP library paths. After use, Python's garbage collector eventually calls EpCatalog.__del__ → close() → WinMLEpCatalogRelease(self._handle). On affected QNN NPU driver configurations this native call raises a Windows SEH exception (STATUS_ACCESS_VIOLATION), which Python's try/except Exception cannot catch.

The crash fires on a background thread during interpreter shutdown — after all build stages complete successfully — so exit code 3221225477 (0xC0000005) is observed even though quantized.onnx was written correctly. This explains why 19/22 non-cached models failed in the eval run (cache hits never initialize EpCatalog).

Stack trace captured by faulthandler:

Thread 0x0000bc98:
  File "windowsml/__init__.py", line 428 in close       # WinMLEpCatalogRelease(self._handle)
  File "windowsml/__init__.py", line 439 in __del__
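
For anyone reproducing the trace, a minimal sketch of enabling faulthandler so that a fatal native error dumps the Python stack of every thread. The helper name and log file location are hypothetical and not part of this PR; passing `-X faulthandler` to the interpreter is an alternative that needs no code change.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracebacks(log_path):
    """Hypothetical helper: dump all-thread tracebacks to log_path on a fatal error."""
    # The file object must stay open for as long as faulthandler is enabled.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Assumed location for the crash log; any writable path works.
log_path = os.path.join(tempfile.gettempdir(), "winml_faulthandler.log")
log_file = enable_crash_tracebacks(log_path)
```

Writing to a dedicated file rather than stderr keeps the trace even when the build's console output is captured or discarded.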

Fix

After find_all_providers() returns, all EP library paths have been extracted into a plain Python dict, so the EpCatalog handle is no longer needed for the rest of the process lifetime. Setting self._catalog._handle = None makes EpCatalog.close() a no-op (it guards on if self._handle:), preventing the crash regardless of when, or on which thread, __del__ runs. The OS reclaims the native resources on process exit.

Applied to both EpCatalog-holding singletons:

  • WinML.__init__ in winml.py
  • WinMLEPRegistry._load_ep_catalog in session/ep_registry.py
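
The workaround pattern can be sketched as follows. This is an illustration under stated assumptions, not the code from the diff: `FakeEpCatalog` is a stand-in for windowsml.EpCatalog that mimics only the `if self._handle:` guard described above, and the return shape of `find_all_providers()`, the EP name, and the DLL path are all hypothetical.

```python
class FakeEpCatalog:
    """Stand-in for windowsml.EpCatalog (assumption: only the guard is modelled)."""

    def __init__(self):
        self._handle = 1          # hypothetical non-null native handle
        self.release_calls = 0    # counts simulated WinMLEpCatalogRelease calls

    def find_all_providers(self):
        # Assumed return shape: (provider name, library path) pairs.
        return [("QNNExecutionProvider", r"C:\hypothetical\qnn_ep.dll")]

    def close(self):
        if self._handle:
            # The real class would call WinMLEpCatalogRelease(self._handle) here.
            self.release_calls += 1
            self._handle = None

    def __del__(self):
        self.close()


def load_ep_paths(catalog):
    """Copy EP paths into a plain dict, then drop the native handle."""
    paths = dict(catalog.find_all_providers())
    # Workaround: null the handle so close()/__del__ never reach the native release.
    catalog._handle = None
    return paths


catalog = FakeEpCatalog()
ep_paths = load_ep_paths(catalog)
catalog.close()  # no-op: the guard sees a None handle
```

The trade-off is a deliberate leak of one native handle per singleton, which is acceptable here only because both objects live until process exit anyway.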

Verification

Ran Intel/dpt-hybrid-midas NPU build (the model that crashed most reliably):

  • Before fix: 2/2 runs exit with code 3221225477 (0xC0000005)
  • After fix: 2/2 runs exit with code 0
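
When scripting this check across an eval run, the exit code may show up either as the unsigned value above or as its signed 32-bit form, depending on the tool that reports it. A small sketch for normalising both; the helper names are hypothetical.

```python
STATUS_ACCESS_VIOLATION = 0xC0000005  # NTSTATUS value reported in this PR


def as_unsigned32(code):
    """Map a signed or unsigned 32-bit exit code to its unsigned form."""
    return code & 0xFFFFFFFF


def is_access_violation(code):
    """True if the exit code is STATUS_ACCESS_VIOLATION in either representation."""
    return as_unsigned32(code) == STATUS_ACCESS_VIOLATION


# 3221225477 is the decimal form of 0xC0000005.
crashed = is_access_violation(3221225477)
clean = is_access_violation(0)
```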

🤖 Generated with Claude Code

…U crash on exit

WinMLEpCatalogRelease crashes with ACCESS_VIOLATION (0xC0000005) on some
QNN NPU driver configurations during process cleanup.  The crash is a
Windows SEH exception that Python's try/except cannot catch, causing
every non-cached winml build to exit with STATUS_ACCESS_VIOLATION
instead of 0.

Two independent singletons each create an EpCatalog and hold its native
handle live until interpreter shutdown — WinML (winml.py) and
WinMLEPRegistry (ep_registry.py).  Both are initialised during the
Optimize stage and their __del__ methods call WinMLEpCatalogRelease at
process exit, which crashes on affected systems.

Fix: null out _handle on both EpCatalog instances immediately after
find_all_providers() returns.  All EP library paths have been extracted
by that point, so the handle is no longer needed.  EpCatalog.close()
checks `if self._handle` before calling WinMLEpCatalogRelease, so the
call becomes a no-op for the rest of the process lifetime regardless of
when or which thread triggers cleanup.  The OS reclaims native resources
when the process exits.

Verified: Intel/dpt-hybrid-midas NPU build previously crashed 2/2 times
at exit; passes 2/2 times after this fix.
@DingmaomaoBJTU DingmaomaoBJTU requested a review from a team as a code owner May 22, 2026 01:17
Comment thread src/winml/modelkit/winml.py
Add Workaround:/TODO: markers to both workaround sites so reviewers
and future maintainers can easily identify and remove the code once
windowsml fixes WinMLEpCatalogRelease upstream.
@DingmaomaoBJTU DingmaomaoBJTU enabled auto-merge (squash) May 22, 2026 05:38
@DingmaomaoBJTU DingmaomaoBJTU merged commit e00839d into main May 22, 2026
9 checks passed
@DingmaomaoBJTU DingmaomaoBJTU deleted the qiowu/run_qnn_eval branch May 22, 2026 05:42
DingmaomaoBJTU added a commit that referenced this pull request May 22, 2026
…U crash on exit (#701)
