Improve marshalling of strings in CreateStringTensor #15520

Closed
wants to merge 2 commits

Conversation

tannergooding
Member

This should resolve dotnet/machinelearning#6620

The previous logic was correct, but it created tensor.Length managed arrays plus tensor.Length GCHandles on top of that to ensure pinning. Disposing GCHandles can be quite expensive, and all of those short-lived array allocations increase GC pressure.

This new approach switches to MarshaledString and MarshaledStringArray helpers. These make native allocations instead, which are implicitly pinned, don't add GC pressure, and don't require costly GCHandles.
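
As an illustration only (the actual helper in this PR may differ in shape and naming), here is a minimal sketch of a MarshaledString-style helper: it transcodes to UTF-8 into unmanaged memory, which the GC never moves and therefore never needs pinning:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Illustrative sketch only; not the actual helper from this PR.
internal struct MarshaledStringSketch : IDisposable
{
    private IntPtr _buffer;

    public MarshaledStringSketch(string value)
    {
        // Transcode to UTF-8 and copy into a native allocation. Native
        // memory is invisible to the GC, so no GCHandle is needed to pin it.
        byte[] utf8 = Encoding.UTF8.GetBytes(value);
        _buffer = Marshal.AllocHGlobal(utf8.Length + 1);
        Marshal.Copy(utf8, 0, _buffer, utf8.Length);
        Marshal.WriteByte(_buffer, utf8.Length, 0); // NUL terminator
    }

    public IntPtr Value => _buffer;

    public void Dispose()
    {
        if (_buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_buffer);
            _buffer = IntPtr.Zero;
        }
    }
}
```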

From local testing, @michaelgsharp saw that this was between 2-5x faster.

This isn't necessarily the "best" it could be. Depending on exactly what's being done, it could be more optimal to make one big native allocation and convert all the strings into it sequentially. We could likewise transcode the data directly into native memory rather than encoding it in managed memory and then copying.

It looks like OrtFillStringTensor basically just copies all the individual strings into a linearized buffer, so it is effectively already doing the "one big native allocation"; some other approaches may be even better, but as is this should significantly help the perf of the scenario in question.
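
To make the "one big native allocation" idea concrete, a hypothetical sketch (the names and shape here are assumptions, not code from this PR) that sums the UTF-8 byte counts up front, allocates once, and transcodes each string straight into its slice of the native buffer:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class StringTensorSketch
{
    // Hypothetical illustration only. The caller owns the returned buffer
    // and must release it with Marshal.FreeHGlobal.
    internal static unsafe IntPtr LinearizeUtf8(string[] values, out int[] offsets)
    {
        int total = 0;
        offsets = new int[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            offsets[i] = total;
            total += Encoding.UTF8.GetByteCount(values[i]);
        }

        IntPtr buffer = Marshal.AllocHGlobal(total);
        byte* start = (byte*)buffer;
        byte* dest = start;
        for (int i = 0; i < values.Length; i++)
        {
            fixed (char* src = values[i])
            {
                // Encode managed UTF-16 straight into native memory,
                // skipping the intermediate managed byte[] per string.
                dest += Encoding.UTF8.GetBytes(src, values[i].Length, dest, total - (int)(dest - start));
            }
        }
        return buffer;
    }
}
```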

There may be other places that are doing similar things with marshalling that could be switched over.

@tannergooding
Member Author

It looks like there are up to 70 instances of DisposableList being used. If this PR is taken, a follow-up PR should replace all of these with MarshaledStringArray or similar functionality.

Each one should provide similar improvements to perf and a reduction in GC pressure, particularly where GCHandles are being used behind the scenes.

@yuslepukhin
Member

I have some more changes in progress; I can merge those as well to avoid parallel efforts.

@yuslepukhin
Member

yuslepukhin commented Apr 14, 2023

> It looks like there are up to 70 instances of DisposableList being used. If this PR is taken, a follow-up PR should replace all of these with MarshaledStringArray or similar functionality.
>
> Each one should provide similar improvements to perf and a reduction in GC pressure, particularly where GCHandles are being used behind the scenes.

We only copy strings; everything else is read directly from pinned managed memory, or from native memory on the way back.

GCHandles are being eliminated where they are not necessary.

try
{
    marshaledStrings = new MarshaledStringArray(tensor);
    ortValue = new OrtValue(valueHandle);
Member

> ortValue = new OrtValue(valueHandle);

I've always been curious why this `new` is moved inside the try, so that we then have to check for null in the finally/catch?

Member Author

It avoids an extra allocation in the case where an earlier step fails, and it handles the case where the constructor itself throws. It's not strictly necessary, just generally considered good practice to handle it that way.
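
A minimal, self-contained sketch of the pattern under discussion (using stand-in disposables rather than the types from the diff): construct inside the try so that a failure part-way through still cleans up what was created, and null-check in the finally because a later object may never have been constructed at all:

```csharp
using System;
using System.IO;

internal static class TryFinallySketch
{
    internal static void Demo()
    {
        IDisposable first = null;
        IDisposable second = null;
        try
        {
            first = new MemoryStream();  // may throw
            second = new MemoryStream(); // if this throws, 'first' is still released
            // ... use both objects ...
        }
        finally
        {
            // Null checks are needed precisely because construction happens
            // inside the try: an earlier failure means a later object is null.
            second?.Dispose();
            first?.Dispose();
        }
    }
}
```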

@@ -1824,4 +1826,105 @@ internal class NativeLib

#endregion
} //class NativeMethods

internal static class EmptyArray<T>
Member

> internal static class EmptyArray<T>

Can this be moved to utilities?

Member Author

Certainly. I'm not overly familiar with the codebase, so I wasn't sure where the best place to put various things would be.
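
For reference, this is presumably the classic cached-empty-array pattern (the member name below is a guess, not the actual field from the diff); on newer frameworks `Array.Empty<T>()` serves the same purpose:

```csharp
// Sketch of a cached-empty-array helper; the member name is assumed.
internal static class EmptyArray<T>
{
    public static readonly T[] Instance = new T[0];
}
```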

@tannergooding
Member Author

Closing in favor of #15545 which includes these changes + others

yuslepukhin added a commit that referenced this pull request Apr 20, 2023
### Description

Reduce the number of auxiliary objects created, to reduce GC pressure.
Eliminate GCHandle-based memory pinning in most places. Improve string
marshalling by allocating unmanaged memory, which does not require
pinning. Change native methods from `IntPtr` to `byte[]` (the
marshaller's pinning is more efficient).
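
To illustrate that last point (the function name below is a placeholder, not a real onnxruntime export): with a `byte[]` parameter the interop marshaller pins the array just for the duration of the call, instead of the caller managing an `IntPtr` plus a `GCHandle`:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeMethodsSketch
{
    // Placeholder imports, not real onnxruntime exports.
    // Before: the caller had to allocate or pin memory itself and pass a pointer.
    [DllImport("onnxruntime", EntryPoint = "OrtExampleApi")]
    internal static extern IntPtr OrtExampleApiPtr(IntPtr utf8Name);

    // After: the marshaller pins the byte[] only for the duration of the
    // call, which is cheaper than creating and freeing a GCHandle.
    [DllImport("onnxruntime", EntryPoint = "OrtExampleApi")]
    internal static extern IntPtr OrtExampleApiBytes(byte[] utf8Name);
}
```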

Allocate input/output UTF-8 names on the unmanaged heap for the lifetime
of the InferenceSession, so we do not keep converting and pinning them on
every Run.

Introduce a new native API that allows allocating and converting/copying
strings directly into a native tensor.

The PR delivers around a 50% latency improvement and fewer GC pauses.

Inspired by: #15520

### Motivation and Context
Clients experience GC pressure and performance degradation when dealing
with string tensors.


Co-Authored-By: @tannergooding
Successfully merging this pull request may close these issues.

High amount GC gen2 delays with ONNX models converted to ML.Net