feat: add native CPU kernel for SequenceMap (opset 17) by Rishi-Dave · Pull Request #28813 · microsoft/onnxruntime

Rishi-Dave · 2026-06-05T11:34:42Z

Summary

Implements a native CPU kernel for SequenceMap (opset 17), eliminating the ONNX function-body fallback that expands into a Loop over SequenceInsert and produces O(n^2) memory traffic.
Mirrors the canonical control-flow kernel pattern from Loop / If / Scan: derives from IControlFlowKernel, prepares feeds/fetches via FeedsFetchesManager, and invokes the body subgraph through utils::ExecuteSubgraph.
Adds unit tests covering identity, scalar broadcast via an additional tensor input, and multi-output body graphs.

Motivation

Fixes #23024. The ONNX spec defines SequenceMap via a context-dependent function body that decomposes the op into a Loop whose accumulator is grown by SequenceInsert on every iteration. Each SequenceInsert copies the accumulated sequence, so processing an n-element input requires O(n^2) memory traffic. Workloads that map per-element transforms over long sequences hit this quadratic behaviour and are forced to avoid the operator entirely.

A native kernel iterates the input in O(n), forwards the i-th element of each sequence-typed input plus passthrough tensor inputs to the body, and appends each body output to the appropriate output TensorSeq without per-iteration copies.
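The difference between the two strategies can be illustrated with a toy model. The sketch below is not ONNX Runtime code: `CopiesWithCopyOnInsert` and `CopiesWithInPlaceAppend` are hypothetical helpers that only count element copies over a `std::vector<int>`, assuming each insert in the fallback path copies the whole accumulator as described above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of the function-body fallback (hypothetical, not ORT code):
// every insert builds a fresh sequence, so all accumulated elements are
// copied again. Returns the number of element copies performed.
std::size_t CopiesWithCopyOnInsert(const std::vector<int>& input) {
  std::vector<int> acc;
  std::size_t copies = 0;
  for (int x : input) {
    std::vector<int> next(acc);  // copy everything accumulated so far
    copies += acc.size();
    next.push_back(x);           // plus the new element
    ++copies;
    acc = std::move(next);
  }
  return copies;
}

// Toy model of the native kernel: reserve once, append in place.
std::size_t CopiesWithInPlaceAppend(const std::vector<int>& input) {
  std::vector<int> out;
  out.reserve(input.size());
  std::size_t copies = 0;
  for (int x : input) {
    out.push_back(x);
    ++copies;
  }
  return copies;
}
```

Under this model an n-element input costs n(n+1)/2 copies with copy-on-insert and n copies with in-place append, so four elements cost 10 copies versus 4.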

Changes

onnxruntime/core/providers/cpu/sequence/sequence_ops.h: declares SequenceMap as an IControlFlowKernel with a FeedsFetchesManager member.
onnxruntime/core/providers/cpu/sequence/sequence_ops.cc: implements SetupSubgraphExecutionInfo and Compute, registers the kernel for opset 17, validates length parity for sequence-typed additional_inputs, and assembles per-output TensorSeq results.
onnxruntime/core/providers/cpu/cpu_execution_provider.cc: adds the forward declaration and BuildKernelCreateInfo entry for the new kernel alongside the other opset-17 sequence ops.
onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc: adds SequenceMap_Identity, SequenceMap_AddScalar, and SequenceMap_TwoOutputs covering single-input identity, sequence + tensor broadcast, and dual-output body graphs.
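As a rough illustration of the control flow described for `Compute`, here is a minimal standalone sketch. It does not use ONNX Runtime types: `Tensor`, `TensorSeq`, `Input`, `Body`, and `ToySequenceMap` are all hypothetical stand-ins, and the body subgraph is modelled as a plain `std::function`. It shows only the length-parity check, the per-element feed construction, and the per-output append.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <variant>
#include <vector>

using Tensor = std::vector<float>;      // toy tensor: flat values only
using TensorSeq = std::vector<Tensor>;  // toy sequence of tensors
using Input = std::variant<TensorSeq, Tensor>;
using Body = std::function<std::vector<Tensor>(const std::vector<Tensor>&)>;

// Hypothetical model of the kernel loop; one output sequence per body output.
std::vector<TensorSeq> ToySequenceMap(const TensorSeq& input_seq,
                                      const std::vector<Input>& additional_inputs,
                                      std::size_t num_outputs,
                                      const Body& body) {
  const std::size_t n = input_seq.size();

  // Sequence-typed additional inputs must match the input sequence length.
  for (const Input& in : additional_inputs) {
    if (const auto* seq = std::get_if<TensorSeq>(&in)) {
      if (seq->size() != n) throw std::invalid_argument("sequence length mismatch");
    }
  }

  std::vector<TensorSeq> outputs(num_outputs);
  for (TensorSeq& out : outputs) out.reserve(n);

  std::vector<Tensor> feeds;
  feeds.reserve(1 + additional_inputs.size());
  for (std::size_t i = 0; i < n; ++i) {
    feeds.clear();
    feeds.push_back(input_seq[i]);
    for (const Input& in : additional_inputs) {
      if (const auto* seq = std::get_if<TensorSeq>(&in)) {
        feeds.push_back((*seq)[i]);             // sequence input: element i
      } else {
        feeds.push_back(std::get<Tensor>(in));  // tensor input: passed through
      }
    }
    std::vector<Tensor> fetches = body(feeds);
    if (fetches.size() != num_outputs) throw std::runtime_error("unexpected body output count");
    for (std::size_t j = 0; j < num_outputs; ++j) {
      outputs[j].push_back(std::move(fetches[j]));
    }
  }
  return outputs;
}
```

The real kernel routes feeds and fetches through `FeedsFetchesManager` and `utils::ExecuteSubgraph`; the sketch replaces all of that with a direct function call.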

Test Plan

onnxruntime_test_all --gtest_filter='SequenceOpsTest.SequenceMap*' — exercises the three new tests.
onnxruntime_test_all --gtest_filter='SequenceOpsTest.*' — confirms no regression in sibling sequence ops.
ONNX backend tests sequence_map_identity_*, sequence_map_add_*, and sequence_map_extract_shapes continue to be excluded for TensorRT EP only; the CPU EP now executes them via the native kernel rather than the function-body fallback.

Issue Resolution

Fixes #23024.

Replace the broad ORT_USE_CPUINFO macro (with negated platform exclusions) with inline defined(CPUINFO_SUPPORTED) && defined(__linux__) guards at each point of use. Since __APPLE__ and __linux__ are mutually exclusive, the previous negation-based condition collapses to simply defined(__linux__). Drop the intermediate ORT_USE_CPUINFO macro in favour of direct guards.
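To see how the old and new conditions evaluate on a given toolchain, a small probe like the following can be compiled on the target. The constant and function names (`kOldGuard`, `kNewGuard`, `kIsLinux`, `kIsAndroid`, `GuardsAgree`) are hypothetical and exist only for this sketch; the two preprocessor conditions are taken from the env.cc diff.

```cpp
// Old condition, as it appeared before this change.
#if defined(CPUINFO_SUPPORTED) && !defined(__APPLE__) && !defined(__ANDROID__) && !defined(__wasm__) && !defined(_AIX)
constexpr bool kOldGuard = true;
#else
constexpr bool kOldGuard = false;
#endif

// New condition, used inline at each point of use.
#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
constexpr bool kNewGuard = true;
#else
constexpr bool kNewGuard = false;
#endif

#if defined(__linux__)
constexpr bool kIsLinux = true;
#else
constexpr bool kIsLinux = false;
#endif

#if defined(__ANDROID__)
constexpr bool kIsAndroid = true;
#else
constexpr bool kIsAndroid = false;
#endif

// The new guard can only be active on a Linux target.
static_assert(!kNewGuard || kIsLinux, "new guard implies __linux__");

// True when both conditions select the same code path on this toolchain.
constexpr bool GuardsAgree() { return kOldGuard == kNewGuard; }
```

One case worth checking with such a probe is a target that defines both `__ANDROID__` and `__linux__`: there the old condition is false while the new one depends only on `CPUINFO_SUPPORTED`. Whether that combination occurs in the supported builds is not stated in this PR.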

Without a dedicated kernel, SequenceMap falls back to the ONNX context-dependent function body, which expands the op into a Loop over SequenceInsert calls. Each SequenceInsert copies the accumulator sequence, producing O(n^2) memory traffic for an n-element input.

This adds a native CPU kernel that:

- Derives from IControlFlowKernel and sets up the body subgraph via the standard FeedsFetchesManager flow used by Loop, If, and Scan.
- Iterates the input sequence sequentially in O(n), forwarding the i-th element of each sequence-typed input and passing tensor-typed additional_inputs through unchanged.
- Validates that sequence-typed additional_inputs share the input sequence length.
- Assembles one TensorSeq per body output and appends fetched tensors per iteration without intermediate copies.

Adds unit tests for the identity body, an add-scalar body with a tensor additional input, and a body that emits two outputs to cover the multi-output path.

Fixes microsoft#23024

Copilot

Pull request overview

This PR adds a native CPU Execution Provider kernel for ONNX SequenceMap (opset 17) to avoid the quadratic Loop + SequenceInsert function-body fallback, executing the body subgraph directly per sequence element via the control-flow kernel infrastructure.

Changes:

Introduces SequenceMap as a CPU control-flow kernel using FeedsFetchesManager + utils::ExecuteSubgraph.
Registers the new kernel in the CPU EP opset-17 registry.
Adds unit tests that construct minimal body subgraphs (identity, add-with-extra-input, two outputs) and validate results.
Also changes cpuinfo usage gating in PosixEnv to Linux-only.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cpu/sequence/sequence_ops.h | Declares the new `SequenceMap` control-flow kernel and its `FeedsFetchesManager` member. |
| onnxruntime/core/providers/cpu/sequence/sequence_ops.cc | Implements `SetupSubgraphExecutionInfo`/`Compute` and registers the opset-17 CPU kernel. |
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers `SequenceMap` in the CPU EP kernel registry. |
| onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc | Adds new `SequenceMap` unit tests with constructed `body` subgraphs. |
| onnxruntime/core/platform/posix/env.cc | Narrows cpuinfo integration to Linux-only in `PosixEnv`. |


+  const auto& subgraph_map = subgraph_session_state.GetOrtValueNameIdxMap();
+  std::unique_ptr<FeedsFetchesManager> ffm;
+  ORT_RETURN_IF_ERROR(FeedsFetchesManager::Create(feed_names, fetch_names, subgraph_map, ffm));
+  ORT_RETURN_IF_ERROR(utils::InitializeFeedFetchCopyInfo(subgraph_session_state, *ffm));


+  std::vector<std::string> outer_feed_names;
+  outer_feed_names.reserve(node_inputs.size());
+  for (const auto* input_def : node_inputs) {
+    outer_feed_names.push_back(input_def->Name());
+  }


+    std::vector<OrtValue> feeds;
+    feeds.reserve(static_cast<size_t>(num_outer_inputs));
+
+    // Build feeds: sequence inputs -> element i; tensor inputs -> pass-through OrtValue.
+    for (int k = 0; k < num_outer_inputs; ++k) {
+      const auto* seq_k = (k == 0) ? input_seq : ctx->Input<TensorSeq>(k);
+      if (seq_k != nullptr) {
+        feeds.push_back(seq_k->GetAt(i));
+      } else {
+        // Tensor input: shallow-copy the OrtValue (shared_ptr, safe) from the kernel context.
+        const auto* input_val = ctx_internal->GetInputMLValue(k);
+        ORT_ENFORCE(input_val != nullptr, "SequenceMap: input ", k, " is neither a sequence nor a tensor.");
+        feeds.push_back(*input_val);
+      }
+    }


+  for (int j = 0; j < num_outputs; ++j) {
+    output_seqs[j] = ctx->Output<TensorSeq>(j);
+    ORT_ENFORCE(output_seqs[j] != nullptr, "SequenceMap: failed to get output TensorSeq slot ", j);
+    output_seqs[j]->Reserve(seq_len);
+  }


+      if (i == 0) {
+        output_seqs[j]->SetType(fetches[j].Get<Tensor>().DataType());
+      }


+  TypeProto float_tensor;
+  float_tensor.mutable_tensor_type()->set_elem_type(TensorProto_DataType_FLOAT);
+  float_tensor.mutable_tensor_type()->mutable_shape()->add_dim();
+
+  auto& x_arg = graph.GetOrCreateNodeArg("x", &float_tensor);
+  auto& scalar_arg = graph.GetOrCreateNodeArg("scalar_in", &float_tensor);
+  auto& out_arg = graph.GetOrCreateNodeArg("add_out", &float_tensor);


+  // additional_inputs is a tensor (passed through to every iteration)
+  test.AddInput<float>("additional_inputs", {3}, {100.0f, 100.0f, 100.0f});


+// We can not use CPUINFO if it is not supported and we do not want to use
 // it on certain platforms because of the binary size increase.
 // We could use it to find out the number of physical cores for certain supported platforms
-#if defined(CPUINFO_SUPPORTED) && !defined(__APPLE__) && !defined(__ANDROID__) && !defined(__wasm__) && !defined(_AIX)
+#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
 #include <cpuinfo.h>
-#define ORT_USE_CPUINFO
 #endif


+  for (size_t i = 0; i < seq_len; ++i) {
+    std::vector<OrtValue> feeds;
+    feeds.reserve(static_cast<size_t>(num_outer_inputs));


- Include implicit inputs in subgraph feed setup and Compute
- Initialize output TensorSeq element type before iteration loop so empty input sequences produce correctly-typed outputs
- Remove redundant per-iteration SetType
- Hoist feeds/fetches allocations outside the iteration loop
- Fix scalar broadcasting test: use rank-0 TypeProto and scalar value
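The scalar-broadcasting test fix can be pictured with a toy Add. In the sketch below, `ToyTensor`, `AddPath`, `AddResult`, and `ToyAdd` are hypothetical names, and the function supports only two cases (rank-0 right-hand side, or identical shapes), which is much narrower than real ONNX broadcasting. It shows why a `{3}`-shaped extra input never reaches the scalar path while a rank-0 input does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy tensor: an empty shape means rank 0 (a scalar).
struct ToyTensor {
  std::vector<std::size_t> shape;
  std::vector<float> data;
};

enum class AddPath { kElementwise, kScalarBroadcast };

struct AddResult {
  ToyTensor value;
  AddPath path;  // which code path produced the result
};

// Hypothetical, heavily simplified Add: rank-0 rhs or identical shapes only.
AddResult ToyAdd(const ToyTensor& x, const ToyTensor& y) {
  ToyTensor out = x;
  if (y.shape.empty()) {
    // Rank-0 rhs: the single value is broadcast to every element of x.
    for (float& v : out.data) v += y.data.at(0);
    return {out, AddPath::kScalarBroadcast};
  }
  if (y.shape == x.shape) {
    // Same shape: plain elementwise add, no broadcasting involved.
    for (std::size_t i = 0; i < out.data.size(); ++i) out.data[i] += y.data[i];
    return {out, AddPath::kElementwise};
  }
  throw std::invalid_argument("shape combination not supported by this sketch");
}
```

With `x` of shape `{3}`, a `{3}`-shaped input of `100.0f` values takes the elementwise path, whereas a rank-0 input holding `10.0f` takes the broadcast path.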

Rishi-Dave · 2026-06-06T11:24:36Z

Thanks for the review. Addressed in 62a1bea:

SetupSubgraphExecutionInfo: appended Node().ImplicitInputDefs() to both feed_names and the outer feed list driving FindDevicesForValues, so feed_locations stays length-consistent with feed_names going into FinalizeFeedFetchCopyInfo.
Compute: appended ctx_internal->GetImplicitInputs() in ImplicitInputDefs() order, matching the setup order.
Hoisted feeds/fetches out of the per-iteration loop with reserve(num_outer_inputs + num_implicit) and clear() at the top of each iteration.
Output TensorSeq::elem_type_ is now initialized from ctx->OutputType(j)->AsSequenceTensorType()->GetElementType() before the loop, so a zero-length input sequence yields a correctly-typed empty output. Dropped the redundant per-iteration SetType.
Test: BuildAddBody's scalar_in is now a rank-0 TypeProto, and SequenceMap_AddScalar passes {} with {10.0f}, so the test actually exercises scalar broadcasting in the body Add.
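The empty-sequence point and the buffer hoisting can be shown with a toy model. The names `ElemType`, `ToyOutputSeq`, `MapWithTypeOnFirstIteration`, and `MapWithTypeBeforeLoop` are hypothetical and only mimic the behaviour described in the list above; the body is replaced by a trivial copy.

```cpp
#include <cstddef>
#include <vector>

enum class ElemType { kUnknown, kFloat };

// Toy output sequence with an element type that must be set explicitly.
struct ToyOutputSeq {
  ElemType elem_type = ElemType::kUnknown;
  std::vector<float> items;
};

// Earlier approach: the type is set from the first fetched element, so an
// empty input never sets it.
ToyOutputSeq MapWithTypeOnFirstIteration(const std::vector<float>& input) {
  ToyOutputSeq out;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (i == 0) out.elem_type = ElemType::kFloat;
    out.items.push_back(input[i]);
  }
  return out;
}

// Revised approach: the declared type is applied before the loop, and the
// scratch buffer is allocated once and cleared at the top of each iteration.
ToyOutputSeq MapWithTypeBeforeLoop(const std::vector<float>& input, ElemType declared_type) {
  ToyOutputSeq out;
  out.elem_type = declared_type;
  out.items.reserve(input.size());

  std::vector<float> feeds;  // hoisted scratch buffer
  feeds.reserve(1);
  for (float x : input) {
    feeds.clear();
    feeds.push_back(x);
    out.items.push_back(feeds[0]);  // stand-in for running the body
  }
  return out;
}
```

For an empty input the first variant leaves `elem_type` at `kUnknown`, while the second returns an empty sequence that still carries the declared type.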

On the posix/env.cc cpuinfo guard — it's a pre-existing fix needed for the Linux CI to build this PR (cpuinfo isn't available everywhere on non-x86 Linux paths), kept here to avoid a separate dependent PR. Happy to split it out if you'd prefer.

Rishi-Dave added 2 commits May 6, 2026 11:07

xadupre requested a review from Copilot June 5, 2026 12:35

Copilot started reviewing on behalf of xadupre June 5, 2026 12:35

Copilot AI reviewed Jun 5, 2026

Rishi-Dave wants to merge 3 commits into microsoft:main from Rishi-Dave:rishidave/feat/sequence-map-cpu-kernel
