
S390x test fixes#27404

Merged
tianleiwu merged 30 commits into microsoft:main from AlekseiNikiforovIBM:s390x_test_fixes
Mar 31, 2026

Conversation

@AlekseiNikiforovIBM
Contributor

Description

This PR contains fixes for various big-endian support issues in onnxruntime, in both the libraries and the tests.

Motivation and Context

Currently, some tests in the onnxruntime test suite fail on s390x.
This change fixes all failing tests when onnxruntime is built without training support.
It also includes a fix for a linking issue.

The following tests are fixed on s390x:
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO
SparseTensorConversionTests.TestConstantNodeConversion
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestDenseToSparseConversion
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel

@AlekseiNikiforovIBM
Contributor Author

I've added fixes for the tests enabled with training. Although the tests are for training, many of the fixes are actually in common code.

@AlekseiNikiforovIBM
Contributor Author

@tianleiwu @amarin16 @baijumeswani, could you please take a look?

@tianleiwu
Contributor

tianleiwu commented Mar 3, 2026

1. onnxruntime/core/graph/graph_flatbuffers_utils.cc (Changes Requested)

In SaveInitializerOrtFormat, the handling of HasExternalData(be_copy) for big-endian machines introduces a bug where external data will be saved into the ORT Flatbuffer as Big-Endian bytes instead of Little-Endian.

Concrete byte-level trace of the bug:

For the inline data path (correct), using int32 value 1 as an example:

  1. be_copy.raw_data() starts as LE bytes from the ONNX proto: [01, 00, 00, 00]
  2. ConvertRawDataInTensorProto(be_copy) swaps in-place → [00, 00, 00, 01] (now BE)
  3. UnpackInitializerData → has raw_data → UnpackTensorWithRawDataReadLittleEndian:
    interprets [00, 00, 00, 01] as LE, swaps to native BE → output = [01, 00, 00, 00]
  4. Result: unpacked_tensor = [01, 00, 00, 00] = LE bytes ✅

For the external data path (buggy):

  1. ConvertRawDataInTensorProto(be_copy) → no-op (external data, no inline data to swap)
  2. TensorProtoWithExternalDataToTensorProto(be_copy_external_data, {}, be_copy) — loads LE bytes from disk into a brand new TensorProto result (see tensorprotoutils.cc:296), which only copies name/data_type/dims but NOT external_data or data_location. So be_copy becomes an inline-data TensorProto with raw_data = [01, 00, 00, 00] (LE from disk).
  3. UnpackInitializerData(be_copy) → HasExternalData now returns false, so it takes the non-external path → CASE_UNPACK → UnpackTensorWithRawDataReadLittleEndian:
    interprets [01, 00, 00, 00] as LE, swaps to native BE → output = [00, 00, 00, 01]
  4. Result: unpacked_tensor = [00, 00, 00, 01] = BE bytes ❌ — flatbuffer stores BE data.

Root cause: The inline data path relies on a double-swap (step 2 + step 3) that cancels out, producing LE. The external data path only has a single swap (step 3), producing BE.

Proposed fix — add ConvertRawDataInTensorProto after loading external data to match the inline path's double-swap:

      if (onnxruntime::utils::HasExternalData(be_copy)) {
        auto be_copy_external_data{be_copy};
        ORT_RETURN_IF_ERROR(onnxruntime::utils::TensorProtoWithExternalDataToTensorProto(be_copy_external_data, {}, be_copy));
        // Swap the newly loaded LE raw_data to BE, matching what ConvertRawDataInTensorProto
        // would have done for inline data. UnpackInitializerData's ReadLittleEndian will then
        // swap it back to LE, producing the correct result.
        onnxruntime::utils::ConvertRawDataInTensorProto(be_copy);
      }

Contributor

Copilot AI left a comment


Pull request overview

This PR addresses big-endian (s390x) correctness issues in ORT by standardizing raw tensor data handling (writing/reading in little-endian form where required) and adjusting affected tests/build wiring so the test suite passes without training enabled.

Changes:

  • Replace direct TensorProto::set_raw_data(...) usage with onnxruntime::utils::SetRawDataInTensorProto(...) across multiple components/tests to centralize endianness handling.
  • Improve test robustness on big-endian by unpacking tensor proto data via ORT utilities (e.g., UnpackTensor, ConvertRawDataInTensorProto) instead of memcpy/reinterpretation.
  • Update ORT-format/flatbuffer initializer handling and unit-test build configuration to support big-endian scenarios and non-training builds.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
winml/adapter/winml_adapter_model.cpp Use ORT helper to set TensorProto raw data with endian-awareness.
orttraining/orttraining/training_api/checkpoint.cc Remove little-endian-only guard for training checkpoints.
orttraining/orttraining/test/graph/optimizer_graph_builder_test.cc Read raw_data via UnpackTensor to be endian-correct.
orttraining/orttraining/core/optimizer/shape_optimizer.cc Use endian-aware raw data setter for constant initializers.
orttraining/orttraining/core/optimizer/megatron_transformer.cc Use endian-aware raw data setter for partitioned initializers.
orttraining/orttraining/core/optimizer/conv1d_replacement.cc Use endian-aware raw data setter for initializer creation.
orttraining/orttraining/core/framework/checkpointing.cc Remove big-endian “not implemented” restriction in checkpoint saving path.
onnxruntime/test/providers/nv_tensorrt_rtx/test_nv_trt_rtx_ep_util.cc Use endian-aware raw data setter in test model building utilities.
onnxruntime/test/framework/sparse_kernels_test.cc Convert/check raw_data in a big-endian-safe way (copy-by-value + conversion).
onnxruntime/test/framework/int2_test.cc Use endian-aware raw data setter in Int2 round-trip test.
onnxruntime/test/framework/endian_test.cc Use endian-aware raw data setter and add stronger assertions about conversion effects.
onnxruntime/test/flatbuffers/flatbuffer_utils_test.cc Remove manual conversion now handled elsewhere.
onnxruntime/core/optimizer/qdq_transformer/where_dummy_dq.cc Use endian-aware raw data setter for dummy initializer scalars.
onnxruntime/core/graph/graph_flatbuffers_utils.cc Add big-endian handling for saving/loading ORT-format initializers and tensor dims.
onnxruntime/core/graph/graph.cc Remove prior sparse-constant endian workaround; use endian-aware raw data setter for editor API.
onnxruntime/core/framework/tensorprotoutils.cc Enhance raw-data setter and conversion logic; ensure sparse raw bytes are little-endian.
onnxruntime/core/framework/data_transfer_utils.h Add byte-swapping to CopyTensorDataToByteSpan on big-endian.
cmake/onnxruntime_unittests.cmake Ensure endian_utils is linked into tests when training is disabled.


@AlekseiNikiforovIBM
Contributor Author

Thanks for the review; I'll rework that change. It is likely that the byteswapping is excessive in that place but missing in some other place.

Build command:
./build.sh --config Debug --parallel 0 --enable_pybind --build_wheel --allow_running_as_root

Later this data is narrowed:
*p_data++ = static_cast<T>(*data_iter);

If, for example, BE int32_t data is 4 bytes:
0x00 0x00 0x00 0x01
then after byteswapping it becomes:
0x01 0x00 0x00 0x00
and after narrowing to int16_t, the two rightmost bytes are used on big endian, giving 0x00 0x00.

If we instead byteswap it as two shorts, the byteswapping result is:
0x00 0x00 0x01 0x00
and the narrowing result is 0x01 0x00, which is the correct LE representation of that number.

This change fixes the following test on s390x:
FlatbufferUtilsTest.ExternalWriteReadWithLoadInitializers
Raw data is expected to be in LE.

This change fixes the following tests:
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices64
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices32
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices16
SparseTensorConversionTests.SparseTensorProtoToDense_Rank1Indices8
SparseTensorConversionTests.SparseTensorProtoToDense_Rank2Indices_COO

This change fixes the following tests:
OrtModelOnlyTests.SparseInitializerHandling
SparseTensorConversionTests.TestConstantNodeConversion

This change fixes the following tests on s390x:
ExecutionFrameTestInit.SparseInitializerAsOutput
CApiTest.SparseOutputModel

This change will allow assessing and fixing big-endian-specific issues in training-related code.
It fixes approximately 40 tests.

This change fixes the test CheckpointingTest.SaveAndLoad on s390x.

…ronTransformer class

This change fixes the following tests on s390x:
GraphTransformationTests.MegatronMLPPartitionRank0
GraphTransformationTests.MegatronMLPPartitionRank1
GraphTransformationTests.MegatronSelfAttentionPartitionRank0
GraphTransformationTests.MegatronSelfAttentionPartitionRank1

…roto

This should fix a lot of potential endianness issues on s390x.
This change fixes the following tests on s390x:
OptimizerGraphBuilderTest.LoadOptimState_FullPrecision_Adam
OptimizerGraphBuilderTest.LoadOptimState_FullPrecision_Lamb
In-memory data is in native endian format, while on-disk data should already be in little-endian format.

Move a part of the ConvertRawDataInTensorProto function out into a separate one for convenience.

This change fixes the test
OrtModelOnlyTests.ValidateOrtFormatModelDoesNotRunOptimizersInFullBuild
on s390x.

This change fixes the following tests on s390x:
SaveWithExternalInitializers.Mnist
SaveWithExternalInitializers.ModelWithOriginalExternalData
SaveWithExternalInitializers.ModelWithOriginalExternalDataAlignOffset

…unction

ReadExternalDataForTensor already returns data in native endian format.

Also remove the ConvertEndianessForVector function. It does a const_cast and unexpectedly modifies the original data. When needed, use WriteLittleEndian instead.

These changes fix the following tests on s390x:
TensorProtoUtilsTest.UnpackTensorWithExternalData
TensorProtoUtilsTest.ConstantTensorProtoWithExternalData
@AlekseiNikiforovIBM
Contributor Author

Your finding is entirely correct if the external data is in a file. However, in-memory external data is actually in native endian format, i.e. big endian on big-endian systems:

if (use_tensor_buffer && tensor.SizeInBytes() > kSmallTensorExternalDataThreshold) {
  // https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/graph/graph_flatbuffers_utils.cc#L302
  const auto* raw_data = tensor.DataRaw();
  ORT_ENFORCE(raw_data, "Missing raw data for tensor proto. Invalid tensor.");
  static_assert(sizeof(void*) <= sizeof(ExternalDataInfo::OFFSET_TYPE));
  // we reinterpret_cast this back to void* in tensorprotoutils.cc:GetExtDataFromTensorProto.
  // use intptr_t as OFFSET_TYPE is signed. in theory you could get a weird looking value if the address uses the
  // high bit, but that should be unlikely in a scenario where we care about memory usage enough to use this path.
  auto offset = narrow<ExternalDataInfo::OFFSET_TYPE>(reinterpret_cast<intptr_t>(raw_data));
  ExternalDataInfo::SetExternalLocationToProto(onnxruntime::utils::kTensorProtoMemoryAddressTag,
                                               offset, tensor.SizeInBytes(), tensor_proto);

It seems like a bad idea to byteswap this in-memory data in advance, so I've instead added byteswapping of data loaded from a file, right after reading it. This also revealed a couple of additional byteswapping issues, which I investigated and fixed.

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

tianleiwu previously approved these changes Mar 26, 2026
@tianleiwu tianleiwu enabled auto-merge (squash) March 26, 2026 03:06
auto-merge was automatically disabled March 26, 2026 15:30

Head branch was pushed to by a user without write access

@AlekseiNikiforovIBM
Contributor Author

AlekseiNikiforovIBM commented Mar 26, 2026

I've added a change to include endian headers in a couple of files, since a failing pipeline indicated this issue in one of the build configurations.

Edit: failing pipeline: https://github.com/microsoft/onnxruntime/actions/runs/23539560419

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu
Contributor

@AlekseiNikiforovIBM, there are still build errors in the DirectML builds (please also check the other failed builds):

D:\a\_work\onnxruntime\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(128,31): error C2653: 'endian': is not a class or namespace name [D:\a\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_dml.vcxproj]
Error: D:\a\_work\onnxruntime\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(128,39): error C2065: 'native': undeclared identifier [D:\a\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_dml.vcxproj]
Error: D:\a\_work\onnxruntime\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(128,49): error C2653: 'endian': is not a class or namespace name [D:\a\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_dml.vcxproj]
Error: D:\a\_work\onnxruntime\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(128,57): error C2065: 'little': undeclared identifier [D:\a\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_dml.vcxproj]
Error: D:\a\_work\onnxruntime\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(134,63): error C2664: 'size_t onnxruntime::utils::GetElementSizeOfTensor(onnx::TensorProto_DataType)': cannot convert argument 1 from 'int32_t' to 'onnx::TensorProto_DataType' [D:\a\_work\_temp\build\RelWithDebInfo\onnxruntime_providers_dml.vcxproj]

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

tianleiwu previously approved these changes Mar 30, 2026
@tianleiwu tianleiwu enabled auto-merge (squash) March 30, 2026 08:15
auto-merge was automatically disabled March 30, 2026 14:33

Head branch was pushed to by a user without write access

@tianleiwu
Contributor

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@tianleiwu tianleiwu enabled auto-merge (squash) March 31, 2026 23:30
@tianleiwu tianleiwu merged commit f2c28e2 into microsoft:main Mar 31, 2026
105 of 181 checks passed