feat: add native CPU kernel for SequenceMap (opset 17) #28813

Open
Rishi-Dave wants to merge 3 commits into microsoft:main from Rishi-Dave:rishidave/feat/sequence-map-cpu-kernel

Conversation

@Rishi-Dave
Contributor

Summary

  • Implements a native CPU kernel for SequenceMap (opset 17), eliminating the ONNX function-body fallback that expands into a Loop over SequenceInsert and produces O(n^2) memory traffic.
  • Mirrors the canonical control-flow kernel pattern from Loop / If / Scan: derives from IControlFlowKernel, prepares feeds/fetches via FeedsFetchesManager, and invokes the body subgraph through utils::ExecuteSubgraph.
  • Adds unit tests covering identity, scalar broadcast via an additional tensor input, and multi-output body graphs.

Motivation

Fixes #23024. The ONNX spec defines SequenceMap via a context-dependent function body that decomposes the op into a Loop whose accumulator is grown by SequenceInsert on every iteration. Each SequenceInsert copies the accumulated sequence, so processing an n-element input requires O(n^2) memory traffic. Workloads that map per-element transforms over long sequences hit this quadratic behaviour and are forced to avoid the operator entirely.
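
To make the quadratic cost concrete, here is a minimal sketch that counts how many already-accumulated elements get copied under each strategy. It is a standalone model built on `std::vector`, not ONNX Runtime code; `Element`, `Sequence`, and both function names are hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-ins for a tensor and a tensor sequence (assumption: the real kernel
// works with OrtValue / TensorSeq, which are not modelled here).
using Element = std::vector<float>;
using Sequence = std::vector<Element>;

// Models the function-body fallback: each "insert" builds a new sequence by
// copying the accumulator and then adding one element.
// Returns how many previously accumulated elements were copied in total.
std::size_t CopiesWithInsertPerIteration(const Sequence& input) {
  std::size_t copied = 0;
  Sequence acc;
  for (const Element& e : input) {
    Sequence next = acc;  // copies every element accumulated so far
    copied += acc.size();
    next.push_back(e);
    acc = std::move(next);
  }
  return copied;
}

// Models appending in place to a pre-reserved output.
// Returns how many previously accumulated elements were moved by reallocation.
std::size_t CopiesWithInPlaceAppend(const Sequence& input) {
  std::size_t copied = 0;
  Sequence out;
  out.reserve(input.size());  // capacity is sufficient, so no reallocation
  for (const Element& e : input) {
    const std::size_t cap_before = out.capacity();
    const std::size_t size_before = out.size();
    out.push_back(e);
    if (out.capacity() != cap_before) copied += size_before;
  }
  return copied;
}
```

For an n-element input the first function copies 0 + 1 + ... + (n - 1) = n(n - 1)/2 accumulated elements, while the second copies none.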

A native kernel iterates the input in O(n), forwards the i-th element of each sequence-typed input plus passthrough tensor inputs to the body, and appends each body output to the appropriate output TensorSeq without per-iteration copies.
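
The iteration described above can be sketched as a small standalone model. This is an illustration under stated assumptions, not the kernel in this PR: `Tensor`, `TensorSequence`, `Input`, `Body`, and `SequenceMapModel` are hypothetical names, the body subgraph is replaced by a `std::function`, and tensors are copied into the feeds where the real kernel passes `OrtValue` handles.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;           // stand-in for a tensor
using TensorSequence = std::vector<Tensor>;  // stand-in for TensorSeq

struct Input {
  bool is_sequence;
  Tensor tensor;            // used when !is_sequence (passthrough input)
  TensorSequence sequence;  // used when is_sequence
};

// Stand-in for executing the body subgraph: feeds in, fetches out.
using Body = std::function<std::vector<Tensor>(const std::vector<Tensor>&)>;

std::vector<TensorSequence> SequenceMapModel(const std::vector<Input>& inputs,
                                             const Body& body,
                                             std::size_t num_outputs) {
  if (inputs.empty() || !inputs[0].is_sequence) {
    throw std::invalid_argument("first input must be a sequence");
  }
  const std::size_t seq_len = inputs[0].sequence.size();

  // Sequence-typed additional inputs must match the primary sequence length.
  for (const Input& in : inputs) {
    if (in.is_sequence && in.sequence.size() != seq_len) {
      throw std::invalid_argument("sequence length mismatch");
    }
  }

  std::vector<TensorSequence> outputs(num_outputs);
  for (TensorSequence& out : outputs) out.reserve(seq_len);

  std::vector<Tensor> feeds;
  feeds.reserve(inputs.size());
  for (std::size_t i = 0; i < seq_len; ++i) {
    feeds.clear();
    for (const Input& in : inputs) {
      // Sequence input: forward element i. Tensor input: forward unchanged.
      feeds.push_back(in.is_sequence ? in.sequence[i] : in.tensor);
    }
    std::vector<Tensor> fetches = body(feeds);
    if (fetches.size() != num_outputs) {
      throw std::runtime_error("unexpected number of body outputs");
    }
    for (std::size_t j = 0; j < num_outputs; ++j) {
      outputs[j].push_back(std::move(fetches[j]));  // append, no re-copy of outputs[j]
    }
  }
  return outputs;
}
```

Each output sequence is reserved once and appended to in place, so the loop does a constant amount of bookkeeping per element on top of the body call.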

Changes

  • onnxruntime/core/providers/cpu/sequence/sequence_ops.h: declares SequenceMap as an IControlFlowKernel with a FeedsFetchesManager member.
  • onnxruntime/core/providers/cpu/sequence/sequence_ops.cc: implements SetupSubgraphExecutionInfo and Compute, registers the kernel for opset 17, validates length parity for sequence-typed additional_inputs, and assembles per-output TensorSeq results.
  • onnxruntime/core/providers/cpu/cpu_execution_provider.cc: adds the forward declaration and BuildKernelCreateInfo entry for the new kernel alongside the other opset-17 sequence ops.
  • onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc: adds SequenceMap_Identity, SequenceMap_AddScalar, and SequenceMap_TwoOutputs covering single-input identity, sequence + tensor broadcast, and dual-output body graphs.

Test Plan

  • onnxruntime_test_all --gtest_filter='SequenceOpsTest.SequenceMap*' — exercises the three new tests.
  • onnxruntime_test_all --gtest_filter='SequenceOpsTest.*' — confirms no regression in sibling sequence ops.
  • ONNX backend tests sequence_map_identity_*, sequence_map_add_*, and sequence_map_extract_shapes continue to be excluded for TensorRT EP only; the CPU EP now executes them via the native kernel rather than the function-body fallback.

Issue Resolution

Fixes #23024.

Rishi-Dave added 2 commits May 6, 2026 11:07
Replace the broad ORT_USE_CPUINFO macro (with negated platform exclusions)
with inline defined(CPUINFO_SUPPORTED) && defined(__linux__) guards at each
point of use. Since __APPLE__ and __linux__ are mutually exclusive, the
previous negation-based condition collapses to simply defined(__linux__).
Drop the intermediate ORT_USE_CPUINFO macro in favour of direct guards.

Without a dedicated kernel, SequenceMap falls back to the ONNX
context-dependent function body, which expands the op into a Loop over
SequenceInsert calls. Each SequenceInsert copies the accumulator
sequence, producing O(n^2) memory traffic for an n-element input.

This adds a native CPU kernel that:
- Derives from IControlFlowKernel and sets up the body subgraph via the
  standard FeedsFetchesManager flow used by Loop, If, and Scan.
- Iterates the input sequence sequentially in O(n), forwarding the i-th
  element of each sequence-typed input and passing tensor-typed
  additional_inputs through unchanged.
- Validates that sequence-typed additional_inputs share the input
  sequence length.
- Assembles one TensorSeq per body output and appends fetched tensors
  per iteration without intermediate copies.

Adds unit tests for the identity body, an add-scalar body with a
tensor additional input, and a body that emits two outputs to cover
the multi-output path.

Fixes microsoft#23024
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a native CPU Execution Provider kernel for ONNX SequenceMap (opset 17) to avoid the quadratic Loop + SequenceInsert function-body fallback, executing the body subgraph directly per sequence element via the control-flow kernel infrastructure.

Changes:

  • Introduces SequenceMap as a CPU control-flow kernel using FeedsFetchesManager + utils::ExecuteSubgraph.
  • Registers the new kernel in the CPU EP opset-17 registry.
  • Adds unit tests that construct minimal body subgraphs (identity, add-with-extra-input, two outputs) and validate results.
  • Also changes cpuinfo usage gating in PosixEnv to Linux-only.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/cpu/sequence/sequence_ops.h | Declares the new SequenceMap control-flow kernel and its FeedsFetchesManager member. |
| onnxruntime/core/providers/cpu/sequence/sequence_ops.cc | Implements SetupSubgraphExecutionInfo/Compute and registers the opset-17 CPU kernel. |
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers SequenceMap in the CPU EP kernel registry. |
| onnxruntime/test/providers/cpu/sequence/sequence_ops_test.cc | Adds new SequenceMap unit tests with constructed body subgraphs. |
| onnxruntime/core/platform/posix/env.cc | Narrows cpuinfo integration to Linux-only in PosixEnv. |

Comment on lines +624 to +627
const auto& subgraph_map = subgraph_session_state.GetOrtValueNameIdxMap();
std::unique_ptr<FeedsFetchesManager> ffm;
ORT_RETURN_IF_ERROR(FeedsFetchesManager::Create(feed_names, fetch_names, subgraph_map, ffm));
ORT_RETURN_IF_ERROR(utils::InitializeFeedFetchCopyInfo(subgraph_session_state, *ffm));
Comment on lines +631 to +635
std::vector<std::string> outer_feed_names;
outer_feed_names.reserve(node_inputs.size());
for (const auto* input_def : node_inputs) {
  outer_feed_names.push_back(input_def->Name());
}
Comment on lines +691 to +705
std::vector<OrtValue> feeds;
feeds.reserve(static_cast<size_t>(num_outer_inputs));

// Build feeds: sequence inputs -> element i; tensor inputs -> pass-through OrtValue.
for (int k = 0; k < num_outer_inputs; ++k) {
  const auto* seq_k = (k == 0) ? input_seq : ctx->Input<TensorSeq>(k);
  if (seq_k != nullptr) {
    feeds.push_back(seq_k->GetAt(i));
  } else {
    // Tensor input: shallow-copy the OrtValue (shared_ptr, safe) from the kernel context.
    const auto* input_val = ctx_internal->GetInputMLValue(k);
    ORT_ENFORCE(input_val != nullptr, "SequenceMap: input ", k, " is neither a sequence nor a tensor.");
    feeds.push_back(*input_val);
  }
}
Comment on lines +673 to +677
for (int j = 0; j < num_outputs; ++j) {
  output_seqs[j] = ctx->Output<TensorSeq>(j);
  ORT_ENFORCE(output_seqs[j] != nullptr, "SequenceMap: failed to get output TensorSeq slot ", j);
  output_seqs[j]->Reserve(seq_len);
}
Comment on lines +723 to +725
if (i == 0) {
  output_seqs[j]->SetType(fetches[j].Get<Tensor>().DataType());
}
Comment on lines +590 to +596
TypeProto float_tensor;
float_tensor.mutable_tensor_type()->set_elem_type(TensorProto_DataType_FLOAT);
float_tensor.mutable_tensor_type()->mutable_shape()->add_dim();

auto& x_arg = graph.GetOrCreateNodeArg("x", &float_tensor);
auto& scalar_arg = graph.GetOrCreateNodeArg("scalar_in", &float_tensor);
auto& out_arg = graph.GetOrCreateNodeArg("add_out", &float_tensor);
Comment on lines +674 to +675
// additional_inputs is a tensor (passed through to every iteration)
test.AddInput<float>("additional_inputs", {3}, {100.0f, 100.0f, 100.0f});
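
The excerpt above feeds a rank-1, three-element tensor as additional_inputs. Assuming the sequence elements in that test also have three entries (an assumption; the excerpt does not show them), the body Add would be a plain elementwise add with no broadcasting involved. The sketch below uses a hypothetical `SimpleTensor` type and a deliberately reduced rank-0/rank-1 model, not the ONNX Runtime Add kernel, to show the difference between the two shapes.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct SimpleTensor {
  std::vector<std::size_t> shape;  // {} means rank 0 (a scalar)
  std::vector<float> data;
};

// Adds b to a rank-1 tensor a, where b is rank 0 or rank 1.
// Sets used_broadcast to true only when a single value of b is stretched
// across several elements of a.
SimpleTensor AddRank0Or1(const SimpleTensor& a, const SimpleTensor& b, bool& used_broadcast) {
  if (a.shape.size() != 1) throw std::invalid_argument("a must be rank 1");
  if (b.shape.size() > 1) throw std::invalid_argument("b must be rank 0 or 1");
  const std::size_t n = a.data.size();
  const bool b_is_single = b.data.size() == 1;
  if (!b_is_single && b.data.size() != n) {
    throw std::invalid_argument("shapes do not broadcast");
  }
  used_broadcast = b_is_single && n != 1;
  SimpleTensor out{a.shape, a.data};
  for (std::size_t i = 0; i < n; ++i) {
    out.data[i] += b.data[b_is_single ? 0 : i];
  }
  return out;
}
```

With a `{3}`-shaped second operand the call reports no broadcasting; with a rank-0 operand (empty shape, one value) it does.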
Comment on lines +41 to 46
// We can not use CPUINFO if it is not supported and we do not want to use
// it on certain platforms because of the binary size increase.
// We could use it to find out the number of physical cores for certain supported platforms
-#if defined(CPUINFO_SUPPORTED) && !defined(__APPLE__) && !defined(__ANDROID__) && !defined(__wasm__) && !defined(_AIX)
+#if defined(CPUINFO_SUPPORTED) && defined(__linux__)
#include <cpuinfo.h>
#define ORT_USE_CPUINFO
#endif
Comment on lines +690 to +692
for (size_t i = 0; i < seq_len; ++i) {
  std::vector<OrtValue> feeds;
  feeds.reserve(static_cast<size_t>(num_outer_inputs));

- Include implicit inputs in subgraph feed setup and Compute
- Initialize output TensorSeq element type before iteration loop
  so empty input sequences produce correctly-typed outputs
- Remove redundant per-iteration SetType
- Hoist feeds/fetches allocations outside the iteration loop
- Fix scalar broadcasting test: use rank-0 TypeProto and scalar value
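
The empty-sequence point in the list above can be illustrated with a small model. This is a sketch with hypothetical types (`ElemType`, `TypedTensor`, `TypedSequence`) and an identity body; it does not use the real TensorSeq API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ElemType { kUnset, kFloat, kInt64 };

struct TypedTensor {
  ElemType type;
  std::vector<float> data;
};

struct TypedSequence {
  ElemType elem_type = ElemType::kUnset;  // stays unset until assigned
  std::vector<TypedTensor> tensors;
};

// Type taken from the first fetched tensor: never runs for an empty input.
TypedSequence MapSetTypeOnFirstIteration(const std::vector<TypedTensor>& input) {
  TypedSequence out;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (i == 0) out.elem_type = input[i].type;
    out.tensors.push_back(input[i]);
  }
  return out;
}

// Declared output type applied before the loop: also covers the empty input.
TypedSequence MapSetTypeBeforeLoop(const std::vector<TypedTensor>& input, ElemType declared) {
  TypedSequence out;
  out.elem_type = declared;
  out.tensors.reserve(input.size());
  for (const TypedTensor& t : input) out.tensors.push_back(t);
  return out;
}
```

With a zero-length input, the first variant returns a sequence whose element type is still unset, while the second returns an empty sequence carrying the declared type.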
@Rishi-Dave
Contributor Author

Thanks for the review. Addressed in 62a1bea:

  • SetupSubgraphExecutionInfo: appended Node().ImplicitInputDefs() to both feed_names and the outer feed list driving FindDevicesForValues, so feed_locations stays length-consistent with feed_names going into FinalizeFeedFetchCopyInfo.
  • Compute: appended ctx_internal->GetImplicitInputs() in ImplicitInputDefs() order, matching the setup order.
  • Hoisted feeds/fetches out of the per-iteration loop with reserve(num_outer_inputs + num_implicit) and clear() at the top of each iteration.
  • Output TensorSeq::elem_type_ is now initialized from ctx->OutputType(j)->AsSequenceTensorType()->GetElementType() before the loop, so a zero-length input sequence yields a correctly-typed empty output. Dropped the redundant per-iteration SetType.
  • Test: BuildAddBody's scalar_in is now a rank-0 TypeProto, and SequenceMap_AddScalar passes {} with {10.0f}, so the test actually exercises scalar broadcasting in the body Add.
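
As a rough illustration of the first three points, the sketch below builds the feed names once (explicit inputs first, then implicit ones) and refills a single hoisted feed buffer in the same order on every iteration. It is a standalone model with hypothetical names (`BuildFeedNames`, `LoopStats`, `RunWithHoistedFeeds`) and uses `int` values in place of `OrtValue`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Setup step: one name per feed, explicit inputs followed by implicit inputs.
std::vector<std::string> BuildFeedNames(const std::vector<std::string>& explicit_names,
                                        const std::vector<std::string>& implicit_names) {
  std::vector<std::string> names;
  names.reserve(explicit_names.size() + implicit_names.size());
  names.insert(names.end(), explicit_names.begin(), explicit_names.end());
  names.insert(names.end(), implicit_names.begin(), implicit_names.end());
  return names;
}

struct LoopStats {
  std::size_t reallocations = 0;    // times the feed buffer grew inside the loop
  std::size_t last_feed_count = 0;  // feeds passed on the final iteration
};

// Compute step: the buffer is reserved once, then cleared and refilled in the
// same explicit-then-implicit order on each iteration.
LoopStats RunWithHoistedFeeds(std::size_t seq_len,
                              const std::vector<int>& explicit_values,
                              const std::vector<int>& implicit_values) {
  LoopStats stats;
  std::vector<int> feeds;
  feeds.reserve(explicit_values.size() + implicit_values.size());
  const std::size_t reserved = feeds.capacity();
  for (std::size_t i = 0; i < seq_len; ++i) {
    feeds.clear();  // drops the elements but keeps the allocation
    feeds.insert(feeds.end(), explicit_values.begin(), explicit_values.end());
    feeds.insert(feeds.end(), implicit_values.begin(), implicit_values.end());
    if (feeds.capacity() != reserved) ++stats.reallocations;
    stats.last_feed_count = feeds.size();
  }
  return stats;
}
```

Keeping the name list and the value list in the same order and of the same length is what lets the per-feed metadata computed at setup line up with the values supplied at run time.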

On the posix/env.cc cpuinfo guard — it's a pre-existing fix needed for the Linux CI to build this PR (cpuinfo isn't available everywhere on non-x86 Linux paths), kept here to avoid a separate dependent PR. Happy to split it out if you'd prefer.

Development

Successfully merging this pull request may close these issues.

[Performance] Quadratic complexity with SequenceMap and Scan

2 participants