Fix string tensor deserialization in ORT format models #28133
Conversation
Agent-Logs-Url: https://github.com/microsoft/onnxruntime/sessions/ce57c3d2-7ee1-4b02-ad3d-7ec106435fc3 Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
The `ConvertInitializersIntoOrtValuesSkipsStringTensors` test was using `ModelProto::default_instance()`, which returns a const singleton and lacks the required `ir_version` and `opset_import` fields. This caused `Model::Load` to fail with "Missing opset in the model" across all CI platforms. Use a proper `ModelProto` with `ir_version` and `opset_import` set, matching the pattern used by adjacent tests.
tianleiwu
left a comment
Review Summary
Well-targeted fix for a regression where `ConvertInitializersIntoOrtValues()` incorrectly replaced string tensor initializer TensorProtos with raw-buffer-backed external data references, causing ORT format serialization to lose string data. The dual-layer defense (`graph.cc` skip + `tensorprotoutils.cc` guard) is the right approach.
Positives:
- The `utils::HasString()` check in `graph.cc` is correctly placed before `GetSizeInBytesFromTensorProto<0>`, avoiding meaningless `sizeof(std::string) * N` byte-size comparisons.
- Defense-in-depth guard in `tensorprotoutils.cc` protects other callers of `TensorToTensorProto` with `use_tensor_buffer=true` (constant folding, `Initializer::ToProtoWithOrtValue`, `graph_utils`, optimizer passes).
- Regression test directly exercises the root cause with assertions on both the negative (`HasExternalDataInMemory` is false) and positive (string content preserved) conditions.
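The guard ordering praised above can be sketched with mock types (hypothetical names, not the actual ORT code): the string check runs before any byte-size computation, so `sizeof(std::string) * N` is never compared against the external-data threshold.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an initializer: either string or numeric data.
struct MockInitializer {
  std::vector<std::string> string_data;  // populated for string tensors
  std::vector<unsigned char> raw_data;   // populated for numeric tensors
  bool has_string() const { return !string_data.empty(); }
};

// Threshold from the PR description: initializers over 127 bytes take the
// in-memory external-data path.
constexpr std::size_t kThreshold = 127;

// Returns true when the raw-buffer conversion should apply. String tensors
// are rejected up front, before any size arithmetic is performed.
bool ShouldUseRawBuffer(const MockInitializer& init) {
  if (init.has_string()) return false;  // the graph.cc-style skip
  return init.raw_data.size() > kThreshold;
}
```

This mirrors the review's point: placing the check first avoids both the incorrect conversion and the meaningless size comparison.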
One suggestion below regarding test coverage.
Pull request overview
Fixes loss of string initializer data when converting large initializers to “external data in-memory” during ORT-format model serialization, which previously produced invalid string TensorProtos and caused reload-time deserialization failures.
Changes:
- Prevent `TensorToTensorProto(..., use_tensor_buffer=true)` from using the in-memory external-data path for string tensors.
- Skip string tensor initializers in `Graph::ConvertInitializersIntoOrtValues()` to avoid the incompatible raw-buffer optimization.
- Add a regression test ensuring string initializer `string_data` remains intact after `ConvertInitializersIntoOrtValues()`.
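The incompatibility these changes address can be demonstrated standalone: a long `std::string` stores its characters on the heap, so recording the object's own bytes (which is effectively what the raw-buffer external-data path would do) captures pointers and size fields, not the content. A minimal C++ check, independent of ORT:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Returns true if `content` appears verbatim inside the raw object
// representation of `s` -- i.e. whether memcpy'ing the std::string object
// itself would capture the characters.
bool ObjectBytesContainContent(const std::string& s, const std::string& content) {
  const unsigned char* obj = reinterpret_cast<const unsigned char*>(&s);
  for (std::size_t i = 0; i + content.size() <= sizeof(std::string); ++i) {
    if (std::memcmp(obj + i, content.data(), content.size()) == 0) return true;
  }
  return false;
}
```

A short string may sit inline via the small-string optimization, so the check is only meaningful for strings longer than `sizeof(std::string)`, where the characters necessarily live on the heap and cannot appear in the object's bytes.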
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| onnxruntime/core/framework/tensorprotoutils.cc | Blocks in-memory external-data optimization for string tensors so string_data is preserved. |
| onnxruntime/core/graph/graph.cc | Skips string initializers during conversion to avoid producing non-serializable in-memory references. |
| onnxruntime/test/ir/graph_test.cc | Adds regression coverage for preserving string initializer content across conversion. |
Extend the `ConvertInitializersIntoOrtValues` string tensor regression test to also verify an end-to-end ORT format save/load round-trip. This ensures string initializer data survives serialization to flatbuffer format and deserialization back, covering the full failure path from the bug report.
yuslepukhin
left a comment
Gaps / suggestions:
- No test for the `TensorToTensorProto` defense-in-depth fix. The test only exercises the `graph.cc` guard. If someone later removes the `graph.cc` guard, the test would still pass (the `tensorprotoutils.cc` guard catches it), but the test wouldn't verify the `tensorprotoutils.cc` change independently. Consider adding a targeted test that directly calls `TensorToTensorProto(string_tensor, name, /*use_tensor_buffer=*/true)` and asserts the result has `string_data` populated and no external data.
- `InjectExternalInitializedTensors` has the same latent vulnerability. At `graph.cc:4102`, `InjectExternalInitializedTensors` calls `TensorToTensorProto` with `use_tensor_buffer=true` without a string check. This is now protected by the `tensorprotoutils.cc` defense-in-depth, but the function's `SizeInBytes() > threshold` check itself would still evaluate `sizeof(std::string) * N`, and the resulting tensor proto (while now correct due to the fallback) would not have `HasExternalDataInMemory` set, so the `ORT_ENFORCE(utils::HasExternalDataInMemory(tensor_proto))` at line 4107 would fire and throw for string tensors. This path is only reachable via the public API (`AddExternalInitializers`), and users are unlikely to pass string tensors there, but it's a latent bug with the same root cause.
- `graph_utils::AddInitializerWithOrtValue` (`graph_utils.cc` line 444) also calls `TensorToTensorProto(tensor, name, true)` unconditionally. Now protected by the defense-in-depth, but callers creating string initializers through graph optimizer utility functions would silently get a non-external TensorProto where an external one was expected.
- No negative test verifying that the original bug's error message ("the pre-allocate size does not match the size in proto") is no longer produced.
Minor Style Notes
The test comment says "`sizeof(std::string)` is typically 32 bytes" — this is correct for MSVC/libstdc++ on 64-bit but varies (libc++ uses 24 bytes). The comment's logic still holds since 20 × 24 = 480 > 127. Not a problem, just noting it's implementation-dependent.
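That arithmetic can be sanity-checked with a few `constexpr` lines (element count and threshold taken from the PR description; this is a standalone sketch, not ORT code). The assertion holds for any mainstream `std::string` layout, 24-byte or 32-byte.

```cpp
#include <cstddef>
#include <string>

// 20-element string tensor, as in the regression test.
constexpr std::size_t kNumElements = 20;
// External-data threshold from the PR description.
constexpr std::size_t kThreshold = 127;

// The byte size the now-removed comparison would have computed: note it
// measures the size of the std::string *objects*, not their contents.
constexpr std::size_t kComputedSize = kNumElements * sizeof(std::string);

// 20 * 24 = 480 (libc++) or 20 * 32 = 640 (MSVC/libstdc++); both exceed 127.
static_assert(kComputedSize > kThreshold,
              "20 string objects exceed the external-data threshold");
```

Since `sizeof(std::string)` is implementation-defined, the `static_assert` documents the assumption rather than a portable guarantee, but it passes on all major 64-bit toolchains.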
The new includes in `graph_test.cc` (`ort_format_version.h`, `ort.fbs.h`, `graph_flatbuffers_utils.h`, `ort_format_load_options.h`) are all needed for the ORT format round-trip test portion.
Verdict
The PR is correct and safe to merge for the stated bug fix. The defense-in-depth in `TensorToTensorProto` is a good addition. The main recommendation is to also guard `InjectExternalInitializedTensors` (same pattern as the `graph.cc` fix: `continue` before the `SizeInBytes` check for string tensors), since with the current code it would `ORT_ENFORCE`-throw if a user ever passed a string tensor via `AddExternalInitializers`.
Claude Opus 4.6
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
…sors, add targeted TensorToTensorProto test

- Add string tensor guard to `InjectExternalInitializedTensors` to prevent `ORT_ENFORCE` crash when string tensors are passed via `AddExternalInitializers`
- Add targeted test for `TensorToTensorProto` defense-in-depth that directly verifies string tensors produce `string_data` even with `use_tensor_buffer=true`
- Clarify `sizeof(std::string)` comment to note platform variance
Addressed review feedback in 0cde15a:

- Re: `graph_utils::AddInitializerWithOrtValue` — this is now safe via the
- Re: negative test — the existing
Description
`ConvertInitializersIntoOrtValues()` replaces initializer TensorProtos with ones pointing to in-memory raw buffers via `TensorToTensorProto(..., use_tensor_buffer=true)`. For string tensors exceeding 127 bytes, this stores a pointer to the `std::string` C++ objects as "external data" — but those objects contain heap pointers, not serializable content. The `string_data` field ends up empty, so ORT format save loses all string data. On reload, the shape says N elements but `string_data_size()` is 0, and deserialization fails.

Changes:
- `tensorprotoutils.cc`: Add a `!tensor.IsDataTypeString()` guard in `TensorToTensorProto` so string tensors always populate `string_data` rather than taking the external-data-in-memory path.
- `graph.cc`: Skip string tensors in `ConvertInitializersIntoOrtValues()` since the raw-buffer optimization is fundamentally incompatible with string data.
- `graph_test.cc`: Add a regression test creating a 20-element string initializer, calling `ConvertInitializersIntoOrtValues()`, and verifying the string data survives.

Motivation and Context
Since onnxruntime 1.23.0, loading ORT format models with string tensor initializers fails with:
Reproduction: any model with a string initializer (e.g. `Gather` over a string array) saved via `optimized_model_filepath` with a `.ort` extension, then reloaded.