Fix string tensor deserialization in ORT format models#28133

Merged
tianleiwu merged 6 commits into main from copilot/fix-onnxruntime-string-tensor-deserialization on Apr 23, 2026

Conversation

Contributor

Copilot AI commented Apr 19, 2026

Description

ConvertInitializersIntoOrtValues() replaces initializer TensorProtos with ones pointing to in-memory raw buffers via TensorToTensorProto(..., use_tensor_buffer=true). For string tensors exceeding 127 bytes, this stores a pointer to the underlying std::string objects as "external data", but those objects contain heap pointers, not serializable character content. The string_data field ends up empty, so saving to ORT format loses all string data. On reload, the shape says N elements while string_data_size() is 0, and deserialization fails.

Changes:

  • tensorprotoutils.cc: Add !tensor.IsDataTypeString() guard in TensorToTensorProto so string tensors always populate string_data rather than taking the external-data-in-memory path
  • graph.cc: Skip string tensors in ConvertInitializersIntoOrtValues() since the raw-buffer optimization is fundamentally incompatible with string data
  • graph_test.cc: Add regression test creating a 20-element string initializer, calling ConvertInitializersIntoOrtValues(), and verifying string data survives
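The combined effect of the two guards can be sketched with simplified stand-in types (this is not the real ORT code; the struct, function name, and 127-byte threshold are illustrative assumptions based on the description above):

```cpp
#include <cstddef>

// Simplified stand-in for onnxruntime::Tensor; the real class differs.
struct FakeTensor {
  bool is_string;
  std::size_t size_in_bytes;
};

constexpr std::size_t kExternalDataThreshold = 127;  // assumed threshold

// Sketch of the decision both fixes enforce: string tensors never take the
// external-data-in-memory path, regardless of their apparent byte size,
// because their buffer holds std::string objects, not serializable bytes.
bool uses_in_memory_external_data(const FakeTensor& t) {
  if (t.is_string) return false;  // the string guard
  return t.size_in_bytes > kExternalDataThreshold;
}
```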

Motivation and Context

Since onnxruntime 1.23.0, loading ORT format models with string tensor initializers fails with:

INVALID_ARGUMENT: Deserialize tensor failed. UnpackTensor: the pre-allocate size does not match the size in proto

Reproduction: any model with a string initializer (e.g. Gather over a string array) saved via optimized_model_filepath with .ort extension, then reloaded.

Copilot AI changed the title from "[WIP] Fix string tensor deserialization issue in onnxruntime 1.23.0" to "Fix string tensor deserialization in ORT format models" on Apr 19, 2026
Copilot AI requested a review from tianleiwu April 19, 2026 22:09
The ConvertInitializersIntoOrtValuesSkipsStringTensors test was using
ModelProto::default_instance() which returns a const singleton and lacks
the required ir_version and opset_import fields. This caused Model::Load
to fail with 'Missing opset in the model' across all CI platforms.

Use a proper ModelProto with ir_version and opset_import set, matching
the pattern used by adjacent tests.
Contributor

@tianleiwu tianleiwu left a comment


Review Summary

Well-targeted fix for a regression where ConvertInitializersIntoOrtValues() incorrectly replaced string tensor initializer TensorProtos with raw-buffer-backed external data references, causing ORT format serialization to lose string data. The dual-layer defense (graph.cc skip + tensorprotoutils.cc guard) is the right approach.

Positives:

  • utils::HasString() check in graph.cc is correctly placed before GetSizeInBytesFromTensorProto<0>, avoiding meaningless sizeof(std::string) * N byte-size comparisons.
  • Defense-in-depth guard in tensorprotoutils.cc protects other callers of TensorToTensorProto with use_tensor_buffer=true (constant folding, Initializer::ToProtoWithOrtValue, graph_utils, optimizer passes).
  • Regression test directly exercises the root cause with assertions on both the negative (HasExternalDataInMemory is false) and positive (string content preserved) conditions.

One suggestion below regarding test coverage.

Comment thread onnxruntime/test/ir/graph_test.cc
Contributor

Copilot AI left a comment


Pull request overview

Fixes loss of string initializer data when converting large initializers to “external data in-memory” during ORT-format model serialization, which previously produced invalid string TensorProtos and caused reload-time deserialization failures.

Changes:

  • Prevent TensorToTensorProto(..., use_tensor_buffer=true) from using the in-memory external-data path for string tensors.
  • Skip string tensor initializers in Graph::ConvertInitializersIntoOrtValues() to avoid the incompatible raw-buffer optimization.
  • Add a regression test ensuring string initializer string_data remains intact after ConvertInitializersIntoOrtValues().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • onnxruntime/core/framework/tensorprotoutils.cc: Blocks the in-memory external-data optimization for string tensors so string_data is preserved.
  • onnxruntime/core/graph/graph.cc: Skips string initializers during conversion to avoid producing non-serializable in-memory references.
  • onnxruntime/test/ir/graph_test.cc: Adds regression coverage for preserving string initializer content across conversion.


Comment thread onnxruntime/test/ir/graph_test.cc
Extend the ConvertInitializersIntoOrtValues string tensor regression test
to also verify end-to-end ORT format save/load round-trip. This ensures
string initializer data survives serialization to flatbuffer format and
deserialization back, covering the full failure path from the bug report.
@tianleiwu tianleiwu marked this pull request as ready for review April 20, 2026 16:20
Member

@yuslepukhin yuslepukhin left a comment


Gaps / suggestions:

No test for the TensorToTensorProto defense-in-depth fix. The test only exercises the graph.cc guard. If someone later removes the graph.cc guard, the test would still pass (the tensorprotoutils.cc guard catches it), but the test wouldn't verify the tensorprotoutils.cc change independently. Consider adding a targeted test that directly calls TensorToTensorProto(string_tensor, name, /*use_tensor_buffer=*/true) and asserts the result has string_data populated and no external data.

InjectExternalInitializedTensors has the same latent vulnerability. In graph.cc:4102, InjectExternalInitializedTensors calls TensorToTensorProto with use_tensor_buffer=true without a string check. This is now protected by the tensorprotoutils.cc defense-in-depth, but the function's SizeInBytes() > threshold check itself would still evaluate sizeof(std::string)*N, and the resulting tensor proto (while now correct due to the fallback) would not have HasExternalDataInMemory set — so the ORT_ENFORCE(utils::HasExternalDataInMemory(tensor_proto)) at line 4107 would fire and throw for string tensors. This path is only reachable via the public API (AddExternalInitializers), and users are unlikely to pass string tensors there, but it's a latent bug with the same root cause.

graph_utils::AddInitializerWithOrtValue (graph_utils.cc line 444) also calls TensorToTensorProto(tensor, name, true) unconditionally. Now protected by the defense-in-depth, but callers creating string initializers through graph optimizer utility functions would silently get a non-external TensorProto where an external one was expected.

No negative test verifying that the original bug's error message ("the pre-allocate size does not match the size in proto") is no longer produced.

Minor style notes:

  • The test comment says "sizeof(std::string) is typically 32 bytes"; this is correct for MSVC/libstdc++ on 64-bit but varies (libc++ uses 24 bytes). The comment's logic still holds since 20 × 24 = 480 > 127. Not a problem, just noting it's implementation-dependent.
  • The new includes in graph_test.cc (ort_format_version.h, ort.fbs.h, graph_flatbuffers_utils.h, ort_format_load_options.h) are all needed for the ORT format round-trip test portion.
Verdict
The PR is correct and safe to merge for the stated bug fix. The defense-in-depth in TensorToTensorProto is a good addition. The main recommendation is to also guard InjectExternalInitializedTensors (same pattern as the graph.cc fix — continue before the SizeInBytes check for string tensors), since with the current code it would ORT_ENFORCE-throw if a user ever passed a string tensor via AddExternalInitializers.
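The recommended guard shape (continue before the byte-size check) can be sketched with illustrative names; the real loop in graph.cc and its types differ:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the tensors a user passes in.
struct UserTensor {
  bool is_string;
  std::size_t size_in_bytes;
};

constexpr std::size_t kBufferThreshold = 127;  // assumed threshold

// Counts how many tensors would take the raw-buffer path. String tensors
// are skipped before the byte-size check, mirroring the suggested fix, so
// sizeof(std::string)*N is never compared against the threshold.
std::size_t count_convertible(const std::vector<UserTensor>& tensors) {
  std::size_t n = 0;
  for (const auto& t : tensors) {
    if (t.is_string) continue;  // skip before the size check
    if (t.size_in_bytes > kBufferThreshold) ++n;
  }
  return n;
}
```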


@yuslepukhin yuslepukhin requested a review from Copilot April 21, 2026 22:45
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Comment thread onnxruntime/core/framework/tensorprotoutils.cc
Comment thread onnxruntime/test/ir/graph_test.cc
Comment thread onnxruntime/test/ir/graph_test.cc Outdated
…sors, add targeted TensorToTensorProto test

- Add string tensor guard to InjectExternalInitializedTensors to prevent
  ORT_ENFORCE crash when string tensors are passed via AddExternalInitializers
- Add targeted test for TensorToTensorProto defense-in-depth that directly
  verifies string tensors produce string_data even with use_tensor_buffer=true
- Clarify sizeof(std::string) comment to note platform variance
@tianleiwu
Contributor

Addressed review feedback in 0cde15a:

  1. InjectExternalInitializedTensors guard — Added !user_tensor.IsDataTypeString() check to the SizeInBytes condition, preventing the ORT_ENFORCE(HasExternalDataInMemory) crash for string tensors passed via AddExternalInitializers.

  2. Targeted TensorToTensorProto test — New test TensorToTensorProtoStringTensorDefenseInDepth directly calls TensorToTensorProto(string_tensor, name, /*use_tensor_buffer=*/true) and asserts string_data is populated with no external data, independently verifying the defense-in-depth.

  3. sizeof(std::string) comment — Updated to note platform variance (32 bytes MSVC/libstdc++, 24 bytes libc++) and that 20 elements exceed threshold on all platforms.

Re: graph_utils::AddInitializerWithOrtValue — This is now safe via the TensorToTensorProto defense-in-depth (produces string_data instead of external data). The downstream AddInitializerWithOrtValue(graph, tensor_proto, tensor) overload gracefully handles the non-external case (empty OrtValue, data inline in TensorProto). No crash path. Could tighten with an explicit skip in a follow-up if desired.

Re: negative test — The existing ConvertInitializersIntoOrtValuesSkipsStringTensors test covers the full round-trip (including ORT format save/load), which was the original failure path. The new targeted test verifies the defense-in-depth independently.

@tianleiwu tianleiwu enabled auto-merge (squash) April 22, 2026 23:17
@tianleiwu tianleiwu merged commit c5e6bd8 into main Apr 23, 2026
88 of 89 checks passed
@tianleiwu tianleiwu deleted the copilot/fix-onnxruntime-string-tensor-deserialization branch April 23, 2026 01:51


Development

Successfully merging this pull request may close these issues.

Error: onnxruntime 1.23.0+ breaks string tensor deserialization

4 participants