
Add EP and hardware device type to Windows ML telemetry #28477

Merged
angelser merged 3 commits into microsoft:main from angelser:angelser/winml-ep-device-telemetry
May 19, 2026


@angelser
Contributor

Problem

Windows ML engineers need telemetry that answers: "Which Execution Providers and hardware device types are apps using for inference, and how much?"

Today, the inference telemetry has these gaps:

| Event | Gap |
| --- | --- |
| SessionCreation | No hardware device type (CPU/GPU/NPU), no vendor ID. Fires once, so it falls out of the 24h pipeline join window for long-lived sessions. |
| RuntimePerf | No EP type, no hardware device. Carries only session_id, so it requires a join back to SessionCreation. |
| ExecutionProviderEvent | Only fires for DML. Irrelevant for QNN/OpenVINO/etc. |

The new Windows ML EP plugin platform (OrtEpDevice / OrtEpFactory / OrtHardwareDevice) already has all the hardware metadata we need; we just weren't surfacing it.

What this PR does

1. New EpDeviceUsage ETW event

Emitted once per (EP, hardware device) tuple at session init and on every RuntimePerf heartbeat (plus a destructor flush). Each event is self-contained:

| Field | Example |
| --- | --- |
| executionProviderType | QNNExecutionProvider |
| hardwareDeviceType | NPU / GPU / CPU / FPGA / UNKNOWN |
| hardwareVendorId / hardwareDeviceId | 0x5143 / 0x0901 (PCI IDs) |
| hardwareVendor | Qualcomm |
| epVendor | Qualcomm |
| assignedNodeCount | 89 (count after graph partitioning) |
| totalRunsSinceLast / totalRunDurationSinceLast | session-level run counters |

This gives downstream consumers a trivial GROUP BY executionProviderType, hardwareDeviceType without needing to join back to SessionCreation. It also works for long-lived sessions that run past the 24h pipeline join window.
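As a rough illustration of the join-free aggregation described above, the sketch below groups events by (executionProviderType, hardwareDeviceType) in memory. `EpDeviceUsageEvent`, `UsageTotals`, and `AggregateUsage` are hypothetical names, and the field types and duration unit are assumptions; only the field names come from this PR.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mirror of a subset of the EpDeviceUsage payload.
// Field names follow the PR description; the types are assumptions.
struct EpDeviceUsageEvent {
  std::string execution_provider_type;
  std::string hardware_device_type;
  uint32_t total_runs_since_last = 0;
  int64_t total_run_duration_since_last = 0;  // unit is an assumption
};

struct UsageTotals {
  uint64_t runs = 0;
  int64_t duration = 0;
};

using UsageKey = std::pair<std::string, std::string>;

// Equivalent of GROUP BY executionProviderType, hardwareDeviceType:
// every event carries both keys, so no lookup of SessionCreation is needed.
std::map<UsageKey, UsageTotals> AggregateUsage(
    const std::vector<EpDeviceUsageEvent>& events) {
  std::map<UsageKey, UsageTotals> totals;
  for (const auto& e : events) {
    UsageTotals& t =
        totals[UsageKey(e.execution_provider_type, e.hardware_device_type)];
    t.runs += e.total_runs_since_last;
    t.duration += e.total_run_duration_since_last;
  }
  return totals;
}
```

Because the run counters are session-level, a session that spans several (EP, device) tuples reports the same counters on each tuple; a real consumer would probably need to account for that before summing across groups.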

2. SessionCreation enrichment

Added hardwareDeviceTypes and hardwareVendorIds (comma-separated, positionally aligned with the existing executionProviderIds). Bumped schemaVersion 0 -> 1.
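To make "positionally aligned" concrete, here is a minimal sketch of building the comma-separated fields from one list of per-EP entries, so that index i in each string refers to the same EP. `EpEntry` and `JoinField` are hypothetical helpers for illustration, not the code in this PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-EP record; one entry per registered EP, in order.
struct EpEntry {
  std::string ep_id;
  std::string device_type;
  std::string vendor_id;
};

// Joins one member of every entry with commas. Iterating the same vector
// in the same order for each field keeps the resulting strings aligned.
std::string JoinField(const std::vector<EpEntry>& entries,
                      std::string EpEntry::*field) {
  std::string out;
  for (std::size_t i = 0; i < entries.size(); ++i) {
    if (i > 0) out += ',';
    out += entries[i].*field;  // an empty value still occupies its slot
  }
  return out;
}
```

With this shape, `JoinField(entries, &EpEntry::ep_id)`, `JoinField(entries, &EpEntry::device_type)`, and `JoinField(entries, &EpEntry::vendor_id)` would produce three strings whose n-th items describe the same EP.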

Implementation notes

  • LogEpDeviceUsage added to the Telemetry interface with a no-op default; WindowsTelemetry implements it via TraceLogging under the existing Microsoft.ML.ONNXRuntime provider (no new provider GUID).
  • InferenceSession::PopulateEpDeviceInfo runs after graph partitioning. For EPs created via the V2 path (AppendExecutionProvider_V2 / SetEpSelectionPolicy / RegisterExecutionProviderLibrary) it pulls full hardware metadata from IExecutionProvider::GetEpDevices(). For legacy EPs it falls back to IExecutionProvider::GetDevice() (OrtDevice type + vendor ID; no PCI device ID).
  • Heartbeat block in Run() and destructor flush in ~InferenceSession both emit LogEpDeviceUsage per entry.
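The V2-versus-legacy fallback in the notes above can be sketched roughly as follows. All types here (`V2DeviceMetadata`, `LegacyDevice`, `EpStandIn`, `EpDeviceInfoSketch`) are simplified stand-ins invented for illustration; they do not reproduce the real OrtEpDevice, OrtDevice, or IExecutionProvider signatures.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the metadata available on the V2 path (assumed shape).
struct V2DeviceMetadata {
  std::string device_type;
  uint32_t vendor_id;
  uint32_t device_id;
};

// Stand-in for what a legacy EP exposes: device type and vendor ID only.
struct LegacyDevice {
  std::string device_type;
  uint32_t vendor_id;
};

// Stand-in for an execution provider; v2_devices is empty for legacy EPs.
struct EpStandIn {
  std::string type;
  std::vector<V2DeviceMetadata> v2_devices;
  LegacyDevice legacy_device;
};

struct EpDeviceInfoSketch {
  std::string ep_type;
  std::string device_type;
  uint32_t vendor_id = 0;
  std::optional<uint32_t> device_id;  // absent on the legacy path
};

// One entry per (EP, hardware device) tuple; legacy EPs yield one entry
// without a PCI device ID.
std::vector<EpDeviceInfoSketch> PopulateEpDeviceInfoSketch(
    const std::vector<EpStandIn>& eps) {
  std::vector<EpDeviceInfoSketch> infos;
  for (const auto& ep : eps) {
    if (!ep.v2_devices.empty()) {
      for (const auto& d : ep.v2_devices) {
        infos.push_back({ep.type, d.device_type, d.vendor_id, d.device_id});
      }
    } else {
      infos.push_back({ep.type, ep.legacy_device.device_type,
                       ep.legacy_device.vendor_id, std::nullopt});
    }
  }
  return infos;
}
```

Modeling the device ID as `std::optional` is a design choice of this sketch; it keeps "no PCI device ID" distinguishable from a device ID of zero.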

Testing

  • Debug build with Ninja: clean build (1636 targets)
  • onnxruntime_test_all (full suite): 1571 passed, 0 failed, 3 skipped (CUDA-EP-gated, environment)
  • No memory leaks reported

Compatibility

  • No public C API surface changes.
  • Telemetry::LogSessionCreation virtual gains two const std::string& parameters — all in-tree overrides are updated.
  • LogEpDeviceUsage has a no-op default, so non-Windows platforms are unaffected.
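The no-op default mentioned above follows a common pattern, sketched here with hypothetical classes (`TelemetrySketch`, `RecordingTelemetrySketch`) rather than the real Telemetry and WindowsTelemetry types: a new virtual with an empty body leaves existing subclasses compiling unchanged, while a platform implementation overrides it.

```cpp
#include <string>

// Hypothetical payload; the real event carries more fields.
struct EpDeviceUsageSketch {
  std::string execution_provider_type;
  std::string hardware_device_type;
};

class TelemetrySketch {
 public:
  virtual ~TelemetrySketch() = default;
  // No-op default: platforms without an override inherit this and emit nothing.
  virtual void LogEpDeviceUsage(const EpDeviceUsageSketch& /*usage*/) {}
};

// Stand-in for a platform implementation; it records calls instead of
// writing an ETW event.
class RecordingTelemetrySketch : public TelemetrySketch {
 public:
  void LogEpDeviceUsage(const EpDeviceUsageSketch& usage) override {
    ++events_logged;
    last_ep = usage.execution_provider_type;
  }

  int events_logged = 0;
  std::string last_ep;
};
```

By contrast, adding parameters to an existing virtual such as LogSessionCreation changes its signature, which is why the in-tree overrides had to be updated in the same change.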

angelserMS and others added 2 commits May 12, 2026 11:10
Closes gaps in inference telemetry for the new Windows ML EP plugin
platform so we can answer "which Execution Providers and hardware
device types are apps using for inference, and how much?"

Two prongs:

1. New EpDeviceUsage ETW event emitted once per (EP, hardware device)
   tuple at session init and on every RuntimePerf heartbeat (and the
   final destructor flush). The event is self-contained — it carries
   EP type, hardware device type (CPU/GPU/NPU), PCI vendor and device
   IDs, EP vendor, assigned node count, and session-level run counters —
   so downstream consumers can GROUP BY executionProviderType,
   hardwareDeviceType without joining back to SessionCreation. This
   matters for long-lived sessions that span past the telemetry
   pipeline's 24-hour join window.

2. SessionCreation and SessionCreation_CaptureState now also emit
   hardwareDeviceTypes and hardwareVendorIds (comma-separated,
   positionally aligned with executionProviderIds). Bumped
   schemaVersion 0 -> 1.

Implementation:

* Added LogEpDeviceUsage to the Telemetry interface (no-op default)
  and WindowsTelemetry (TraceLogging under the existing
  Microsoft.ML.ONNXRuntime provider — no new provider GUID).
* Added EpDeviceInfo to InferenceSession::Telemetry plus a
  pre-formatted summary for the SessionCreation enrichment.
* InferenceSession::PopulateEpDeviceInfo runs after graph
  partitioning. For EPs created via the V2 OrtEpDevice path
  (AppendExecutionProvider_V2 / SetEpSelectionPolicy /
  RegisterExecutionProviderLibrary) it pulls full hardware metadata
  from IExecutionProvider::GetEpDevices(). For legacy EPs it falls
  back to IExecutionProvider::GetDevice() (OrtDevice type +
  vendor ID; no PCI device ID).
* Heartbeat block in `Run()` and the destructor flush in
  `~InferenceSession` now also emit LogEpDeviceUsage for each entry.

No public C API surface changes. Telemetry interface signatures gain
two `const std::string&` parameters on LogSessionCreation and one
new virtual LogEpDeviceUsage with a no-op default for non-Windows
platforms.

Tested: full `onnxruntime_test_all` Debug suite — 1571 passed, 0
failed, 3 skipped (CUDA EP, environment-gated). No memory leaks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dabhattimsft previously approved these changes May 13, 2026
Contributor

@dabhattimsft dabhattimsft left a comment


:shipit:

Addresses cpplint build/include_what_you_use warning from the Optional
Lint C++ job: telemetry_.duration_per_batch_size_ is std::unordered_map
and was being used without an explicit include.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

@ashrit-ms ashrit-ms left a comment


:shipit:

@angelser angelser merged commit 4d1dce8 into microsoft:main May 19, 2026
88 checks passed
