
Support direct usage of ORT format model flatbuffer for initializers #12465

Merged

skottmckay merged 6 commits into main from
skottmckay/Support_direct_usage_of_ORT_format_model_flatbuffer_for_initializers
Aug 12, 2022

Conversation

@skottmckay (Contributor)

Description:
An ORT format model contains initializer data that we currently copy into a TensorProto during model load, and copy again into an OrtValue<Tensor> during session state finalization. We can do some optimizations to keep peak memory usage from these steps at roughly 2x the original size of the initializers, but that is still inefficient in a mobile scenario.

There is no way to populate the raw_data field of a TensorProto using an existing buffer. The OrtValue<Tensor> however does support the Tensor being constructed from an existing buffer with optional ownership transfer.

There is the capability for a TensorProto to point to external data. Typically the external data is stored in a separate file to the model, and the TensorProto contains the filename, offset and size of the data. We can leverage this mechanism to point to external data that is already resident in memory (from the ORT format model flatbuffer) by using a special tag for the filename and storing the memory address in the 'offset' field.

The existing code to create an OrtValue<Tensor> from a TensorProto containing external data supports the copy-free approach of passing along a pointer with optional transfer of ownership to the OrtValue, as we normally mmap the file containing the external data and use the address of that buffer.

This PR contains the small set of changes necessary to implement this approach to gather feedback. The usage is limited to an ORT format model where the caller provides a buffer containing the pre-loaded bytes for the model and they set a flag specifying not to copy the bytes (signifying that memory usage is important to them). An additional flag is provided to allow specifying that we may also use the buffer directly for initializers, as that creates a new requirement that the buffer remain valid for the entire duration of the InferenceSession (vs. currently where it is only required to be valid until InferenceSession initialization completes).

Motivation and Context
We have production mobile scenarios that require a reduction in peak memory usage.

Test output from a potential production model is below. The ORT format model being tested is 13.6MB.

Peak Working Set Size in bytes:

| Stage | Original | New | Notes |
|---|---|---|---|
| Pre-load of model into buffer | 7,737,344 | 7,663,616 | Baseline with overhead from onnxruntime_test_all size |
| Model loaded. Pre InferenceSession::Load | 23,093,248 | 23,040,000 | Roughly equal as expected |
| Post InferenceSession::Load. Pre InferenceSession::Initialize | 42,946,560 | 29,179,904 | 13MB+ reduction |
| Post InferenceSession::Initialize | 61,435,904 | 34,676,736 | 25MB+ reduction |

NOTE: Pre-packing was disabled for this testing. If we are using the user-provided buffer directly for the initializers, the pre-packing causes an additional copy of the initializer data when creating the pre-packed OrtValue<Tensor>, and we can't free the original initializer data as that is within the user-provided buffer. If that buffer was mutable we could potentially do in-place pre-packing (pre-pack to temporary buffer, replace original data) to avoid that copy. This is a separate problem to solve if pre-packing is also required in the production scenario.

                             Tensor& tensor, OrtCallback& ext_data_deleter) {
  ORT_ENFORCE(utils::HasExternalData(tensor_proto));
  ORT_ENFORCE(!proto_path.empty());
  // TODO: Do we need a check here? It's internal only so I don't think having an empty proto_path when loading
Member
> Do we need a check here? It's internal only

I think we can remove it, and make sure that the error message on failure includes that empty path.

Contributor Author

On Windows we report just the filename. On Linux the failure would occur at a point where there's an invalid file descriptor, so no filename or path.

Added something to GetExtDataFromTensorProto to ensure the full path is reported if there's an error.

@yuslepukhin (Member)

                            std::vector<uint8_t>& bytes_data_holder) {

Should this be uint8_t[]?


Refers to: onnxruntime/core/session/inference_session.cc:993 in e30ebad.

@yuslepukhin (Member)

LGTM

@yuslepukhin previously approved these changes Aug 4, 2022

:shipit:

@skottmckay (Contributor Author)

                            std::vector<uint8_t>& bytes_data_holder) {

The std::vector is a member of InferenceSession and is providing the storage.


Refers to: onnxruntime/core/session/inference_session.cc:993 in e30ebad.

@edgchen1 (Contributor) left a comment
looks good to me overall

@pranavsharma (Contributor)

Could we ask customers to use the AddExternalInitializers API instead?

@skottmckay (Contributor Author)

> Could we ask customers to use the AddExternalInitializers API instead?

Not sure that helps. It looks like the API takes an OrtValue that we convert to a TensorProto (and, I assume, back to an OrtValue during session state finalization). Due to that, wouldn't you still have 2x the memory usage for each initializer?

Status Graph::InjectExternalInitializedTensors(const InlinedHashMap<std::string, OrtValue>& external_initializers) {
  for (const auto& e : external_initializers) {
    const auto& name = e.first;
    const OrtValue& ort_value = e.second;
    auto tensor_proto = utils::TensorToTensorProto(ort_value.Get<Tensor>(), name);
    ORT_RETURN_IF_ERROR(ReplaceInitializedTensorImpl(std::move(tensor_proto), true));
    LOGS(logger_, INFO) << "Replaced external initializer: " << name;
  }

@pranavsharma (Contributor)

Yeah, the approach in the PR looks fine to me. Just some minor comments.

@skottmckay changed the title from "RFC: support direct usage of ORT format model flatbuffer for initializers" to "Support direct usage of ORT format model flatbuffer for initializers" Aug 9, 2022
@skottmckay skottmckay merged commit 0b0c51e into main Aug 12, 2022
@skottmckay skottmckay deleted the skottmckay/Support_direct_usage_of_ORT_format_model_flatbuffer_for_initializers branch August 12, 2022 08:31