Fix string tensor deserialization in ORT format models#28133

Merged
tianleiwu merged 6 commits into main from copilot/fix-onnxruntime-string-tensor-deserialization on Apr 23, 2026

Conversation

Contributor

Copilot AI commented Apr 19, 2026

Description

ConvertInitializersIntoOrtValues() replaces initializer TensorProtos with ones pointing to in-memory raw buffers via TensorToTensorProto(..., use_tensor_buffer=true). For string tensors exceeding 127 bytes, this stores a pointer to the underlying std::string objects as "external data", but those objects contain heap pointers, not serializable character content. The string_data field ends up empty, so saving to ORT format loses all string data. On reload, the shape says N elements while string_data_size() is 0, and deserialization fails.

Changes:

  • tensorprotoutils.cc: Add !tensor.IsDataTypeString() guard in TensorToTensorProto so string tensors always populate string_data rather than taking the external-data-in-memory path
  • graph.cc: Skip string tensors in ConvertInitializersIntoOrtValues() since the raw-buffer optimization is fundamentally incompatible with string data
  • graph_test.cc: Add regression test creating a 20-element string initializer, calling ConvertInitializersIntoOrtValues(), and verifying string data survives
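The combined effect of the two guards can be sketched with simplified stand-in types (this is not the real ORT code; the struct, function name, and 127-byte threshold are illustrative assumptions based on the description above):

```cpp
#include <cstddef>

// Simplified stand-in for onnxruntime::Tensor; the real class differs.
struct FakeTensor {
  bool is_string;
  std::size_t size_in_bytes;
};

constexpr std::size_t kExternalDataThreshold = 127;  // assumed threshold

// Sketch of the decision both fixes enforce: string tensors never take the
// external-data-in-memory path, regardless of their apparent byte size,
// because their buffer holds std::string objects, not serializable bytes.
bool uses_in_memory_external_data(const FakeTensor& t) {
  if (t.is_string) return false;  // the string guard
  return t.size_in_bytes > kExternalDataThreshold;
}
```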

Motivation and Context

Since onnxruntime 1.23.0, loading ORT format models with string tensor initializers fails with:

INVALID_ARGUMENT: Deserialize tensor failed. UnpackTensor: the pre-allocate size does not match the size in proto

Reproduction: any model with a string initializer (e.g. Gather over a string array) saved via optimized_model_filepath with .ort extension, then reloaded.

Copilot AI changed the title from "[WIP] Fix string tensor deserialization issue in onnxruntime 1.23.0" to "Fix string tensor deserialization in ORT format models" on Apr 19, 2026
Copilot AI requested a review from tianleiwu April 19, 2026 22:09
The ConvertInitializersIntoOrtValuesSkipsStringTensors test was using
ModelProto::default_instance() which returns a const singleton and lacks
the required ir_version and opset_import fields. This caused Model::Load
to fail with 'Missing opset in the model' across all CI platforms.

Use a proper ModelProto with ir_version and opset_import set, matching
the pattern used by adjacent tests.
Contributor

@tianleiwu tianleiwu left a comment


Review Summary

Well-targeted fix for a regression where ConvertInitializersIntoOrtValues() incorrectly replaced string tensor initializer TensorProtos with raw-buffer-backed external data references, causing ORT format serialization to lose string data. The dual-layer defense (graph.cc skip + tensorprotoutils.cc guard) is the right approach.

Positives:

  • utils::HasString() check in graph.cc is correctly placed before GetSizeInBytesFromTensorProto<0>, avoiding meaningless sizeof(std::string) * N byte-size comparisons.
  • Defense-in-depth guard in tensorprotoutils.cc protects other callers of TensorToTensorProto with use_tensor_buffer=true (constant folding, Initializer::ToProtoWithOrtValue, graph_utils, optimizer passes).
  • Regression test directly exercises the root cause with assertions on both the negative (HasExternalDataInMemory is false) and positive (string content preserved) conditions.

One suggestion below regarding test coverage.

Comment thread onnxruntime/test/ir/graph_test.cc
Contributor

Copilot AI left a comment


Pull request overview

Fixes loss of string initializer data when converting large initializers to “external data in-memory” during ORT-format model serialization, which previously produced invalid string TensorProtos and caused reload-time deserialization failures.

Changes:

  • Prevent TensorToTensorProto(..., use_tensor_buffer=true) from using the in-memory external-data path for string tensors.
  • Skip string tensor initializers in Graph::ConvertInitializersIntoOrtValues() to avoid the incompatible raw-buffer optimization.
  • Add a regression test ensuring string initializer string_data remains intact after ConvertInitializersIntoOrtValues().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • onnxruntime/core/framework/tensorprotoutils.cc: Blocks the in-memory external-data optimization for string tensors so string_data is preserved.
  • onnxruntime/core/graph/graph.cc: Skips string initializers during conversion to avoid producing non-serializable in-memory references.
  • onnxruntime/test/ir/graph_test.cc: Adds regression coverage for preserving string initializer content across conversion.


Comment thread onnxruntime/test/ir/graph_test.cc
Extend the ConvertInitializersIntoOrtValues string tensor regression test
to also verify end-to-end ORT format save/load round-trip. This ensures
string initializer data survives serialization to flatbuffer format and
deserialization back, covering the full failure path from the bug report.
@tianleiwu tianleiwu marked this pull request as ready for review April 20, 2026 16:20
Member

@yuslepukhin yuslepukhin left a comment


Gaps / suggestions:

No test for the TensorToTensorProto defense-in-depth fix. The test only exercises the graph.cc guard. If someone later removes the graph.cc guard, the test would still pass (the tensorprotoutils.cc guard catches it), but the test wouldn't verify the tensorprotoutils.cc change independently. Consider adding a targeted test that directly calls TensorToTensorProto(string_tensor, name, /*use_tensor_buffer=*/true) and asserts the result has string_data populated and no external data.

InjectExternalInitializedTensors has the same latent vulnerability. In graph.cc:4102, InjectExternalInitializedTensors calls TensorToTensorProto with use_tensor_buffer=true without a string check. This is now protected by the tensorprotoutils.cc defense-in-depth, but the function's SizeInBytes() > threshold check itself would still evaluate sizeof(std::string)*N, and the resulting tensor proto (while now correct due to the fallback) would not have HasExternalDataInMemory set — so the ORT_ENFORCE(utils::HasExternalDataInMemory(tensor_proto)) at line 4107 would fire and throw for string tensors. This path is only reachable via the public API (AddExternalInitializers), and users are unlikely to pass string tensors there, but it's a latent bug with the same root cause.

graph_utils::AddInitializerWithOrtValue (graph_utils.cc line 444) also calls TensorToTensorProto(tensor, name, true) unconditionally. Now protected by the defense-in-depth, but callers creating string initializers through graph optimizer utility functions would silently get a non-external TensorProto where an external one was expected.

No negative test verifying that the original bug's error message ("the pre-allocate size does not match the size in proto") is no longer produced.

Minor style notes:

  • The test comment says "sizeof(std::string) is typically 32 bytes"; this is correct for MSVC/libstdc++ on 64-bit but varies (libc++ uses 24 bytes). The comment's logic still holds since 20 × 24 = 480 > 127. Not a problem, just noting it's implementation-dependent.
  • The new includes in graph_test.cc (ort_format_version.h, ort.fbs.h, graph_flatbuffers_utils.h, ort_format_load_options.h) are all needed for the ORT format round-trip test portion.
Verdict
The PR is correct and safe to merge for the stated bug fix. The defense-in-depth in TensorToTensorProto is a good addition. The main recommendation is to also guard InjectExternalInitializedTensors (same pattern as the graph.cc fix — continue before the SizeInBytes check for string tensors), since with the current code it would ORT_ENFORCE-throw if a user ever passed a string tensor via AddExternalInitializers.
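The recommended guard shape (continue before the byte-size check) can be sketched with illustrative names; the real loop in graph.cc and its types differ:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the tensors a user passes in.
struct UserTensor {
  bool is_string;
  std::size_t size_in_bytes;
};

constexpr std::size_t kBufferThreshold = 127;  // assumed threshold

// Counts how many tensors would take the raw-buffer path. String tensors
// are skipped before the byte-size check, mirroring the suggested fix, so
// sizeof(std::string)*N is never compared against the threshold.
std::size_t count_convertible(const std::vector<UserTensor>& tensors) {
  std::size_t n = 0;
  for (const auto& t : tensors) {
    if (t.is_string) continue;  // skip before the size check
    if (t.size_in_bytes > kBufferThreshold) ++n;
  }
  return n;
}
```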


@yuslepukhin yuslepukhin requested a review from Copilot April 21, 2026 22:45
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Comment thread onnxruntime/core/framework/tensorprotoutils.cc
Comment thread onnxruntime/test/ir/graph_test.cc
Comment thread onnxruntime/test/ir/graph_test.cc Outdated
…sors, add targeted TensorToTensorProto test

- Add string tensor guard to InjectExternalInitializedTensors to prevent
  ORT_ENFORCE crash when string tensors are passed via AddExternalInitializers
- Add targeted test for TensorToTensorProto defense-in-depth that directly
  verifies string tensors produce string_data even with use_tensor_buffer=true
- Clarify sizeof(std::string) comment to note platform variance
@tianleiwu
Contributor

Addressed review feedback in 0cde15a:

  1. InjectExternalInitializedTensors guard — Added !user_tensor.IsDataTypeString() check to the SizeInBytes condition, preventing the ORT_ENFORCE(HasExternalDataInMemory) crash for string tensors passed via AddExternalInitializers.

  2. Targeted TensorToTensorProto test — New test TensorToTensorProtoStringTensorDefenseInDepth directly calls TensorToTensorProto(string_tensor, name, /*use_tensor_buffer=*/true) and asserts string_data is populated with no external data, independently verifying the defense-in-depth.

  3. sizeof(std::string) comment — Updated to note platform variance (32 bytes MSVC/libstdc++, 24 bytes libc++) and that 20 elements exceed threshold on all platforms.

Re: graph_utils::AddInitializerWithOrtValue — This is now safe via the TensorToTensorProto defense-in-depth (produces string_data instead of external data). The downstream AddInitializerWithOrtValue(graph, tensor_proto, tensor) overload gracefully handles the non-external case (empty OrtValue, data inline in TensorProto). No crash path. Could tighten with an explicit skip in a follow-up if desired.

Re: negative test — The existing ConvertInitializersIntoOrtValuesSkipsStringTensors test covers the full round-trip (including ORT format save/load), which was the original failure path. The new targeted test verifies the defense-in-depth independently.

@tianleiwu tianleiwu enabled auto-merge (squash) April 22, 2026 23:17
@tianleiwu tianleiwu merged commit c5e6bd8 into main Apr 23, 2026
88 of 89 checks passed
@tianleiwu tianleiwu deleted the copilot/fix-onnxruntime-string-tensor-deserialization branch April 23, 2026 01:51


Development

Successfully merging this pull request may close these issues.

Error: onnxruntime 1.23.0+ breaks string tensor deserialization

4 participants