Support direct usage of ORT format model flatbuffer for initializers #12465
Conversation
…ers by leveraging the TensorProto external data infrastructure.
```cpp
Tensor& tensor, OrtCallback& ext_data_deleter) {
  ORT_ENFORCE(utils::HasExternalData(tensor_proto));
  ORT_ENFORCE(!proto_path.empty());
  // TODO: Do we need a check here? It's internal only so I don't think having an empty proto_path when loading
```
On Windows we report just the filename. On Linux the failure would occur at a point where there's an invalid file descriptor, so no filename or path is available.
Added something to GetExtDataFromTensorProto to ensure the full path is reported if there's an error.
LGTM
The std::vector is a member of InferenceSession and is providing the storage. Refers to: onnxruntime/core/session/inference_session.cc:993 in e30ebad.
edgchen1
left a comment
looks good to me overall
Could we ask customers to use the AddExternalInitializers API instead?
Not sure that helps. Looks like the API takes an OrtValue that we convert to a TensorProto (and I assume back to an OrtValue during session state finalization). Due to that you would still have 2x the memory usage for each initializer. Refers to: onnxruntime/core/graph/graph.cc, lines 2866 to 2873 in 8a86b34.
Yeah, the approach in the PR looks fine to me. Just some minor comments.
Description:
An ORT format model contains initializer data that we currently copy into a `TensorProto` during model load, and copy again into an `OrtValue<Tensor>` during session state finalization. We can do some optimizations to try and keep peak memory usage from these steps to roughly 2x the original size of the initializers, but that is still inefficient in a mobile scenario.

There is no way to populate the raw_data field of a `TensorProto` using an existing buffer. The `OrtValue<Tensor>`, however, does support the `Tensor` being constructed from an existing buffer with optional ownership transfer.

There is the capability for a `TensorProto` to point to external data. Typically the external data is stored in a separate file to the model, and the `TensorProto` contains the filename, offset and size of the data. We can leverage this mechanism to point to external data that is already resident in memory (from the ORT format model flatbuffer) by using a special tag for the filename and storing the memory address in the 'offset' field.

The existing code to create an `OrtValue<Tensor>` from a `TensorProto` containing external data supports the copy-free approach of passing along a pointer with optional transfer of ownership to the OrtValue, as we normally mmap the file containing the external data and use the address of that buffer.

This PR contains the small set of changes necessary to implement this approach to gather feedback. The usage is limited to an ORT format model where the caller provides a buffer containing the pre-loaded bytes for the model and sets a flag specifying not to copy the bytes (signifying that memory usage is important to them). An additional flag allows specifying that we may also use the buffer directly for initializers, as that creates a new requirement that the buffer remain valid for the entire duration of the InferenceSession (vs. currently, where it is only required to be valid until InferenceSession initialization completes).
Motivation and Context
We have production mobile scenarios that require a reduction in peak memory usage.
Test output from a potential production model is below. The ORT format model being tested is 13.6 MB.

Peak Working Set Size in bytes:
- Pre InferenceSession::Load
- Pre InferenceSession::Initialize
NOTE: Pre-packing was disabled for this testing. If we are using the user-provided buffer directly for the initializers, pre-packing causes an additional copy of the initializer data when creating the pre-packed `OrtValue<Tensor>`, and we can't free the original initializer data as that is within the user-provided buffer. If that buffer were mutable we could potentially do in-place pre-packing (pre-pack to a temporary buffer, then replace the original data) to avoid that copy. This is a separate problem to solve if pre-packing is also required in the production scenario.