Improve marshalling of strings in CreateStringTensor #15520
Conversation
It looks like there are up to 70 instances of this pattern. Each one should provide similar improvements to perf and a reduction in GC pressure, particularly where GCHandles are being used behind the scenes.
I have some more changes in progress; I can merge those as well to avoid parallel efforts.
We only copy strings; everything else is read directly from either pinned managed memory or native memory on the way back. GC handles are being eliminated as unnecessary.
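To make that contrast concrete, here is an illustrative C# comparison, not taken from the PR (the data and method names are made up for the example): GCHandle-based pinning versus reading directly from managed memory pinned with a `fixed` block.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class PinningSketch
{
    // Old style: allocate a GCHandle to pin the array, then free it.
    // Allocating and freeing the handle has a real cost on hot paths.
    internal static unsafe long SumViaGCHandle(int[] data)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            int* p = (int*)handle.AddrOfPinnedObject();
            long sum = 0;
            for (int i = 0; i < data.Length; i++) sum += p[i];
            return sum;
        }
        finally
        {
            handle.Free();
        }
    }

    // New style: a `fixed` block pins the array in place only for the
    // duration of the block, with no handle to allocate or free.
    internal static unsafe long SumViaFixed(int[] data)
    {
        fixed (int* p = data)
        {
            long sum = 0;
            for (int i = 0; i < data.Length; i++) sum += p[i];
            return sum;
        }
    }
}
```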
```csharp
try
{
    marshaledStrings = new MarshaledStringArray(tensor);
    ortValue = new OrtValue(valueHandle);
```
It avoids an extra allocation in the case of failure, or if the constructor itself fails. It's not strictly necessary; it's just generally considered "good practice" to handle it that way.
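A minimal, self-contained sketch of that pattern (the names here are illustrative, not the actual onnxruntime types): the resource variable is declared before the `try`, so a failure partway through can release exactly what was created so far without allocating anything extra on the failure path.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class CleanupSketch
{
    internal static IntPtr CreateNativeBuffer(int size)
    {
        IntPtr buffer = IntPtr.Zero; // declared (not created) ahead of the try
        try
        {
            buffer = Marshal.AllocHGlobal(size);
            // ... further initialization that may throw ...
            return buffer;
        }
        catch
        {
            // Release only what was actually created before rethrowing.
            if (buffer != IntPtr.Zero)
                Marshal.FreeHGlobal(buffer);
            throw;
        }
    }
}
```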
```
@@ -1824,4 +1826,105 @@ internal class NativeLib

    #endregion
} //class NativeMethods

internal static class EmptyArray<T>
```
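The excerpt cuts off before the body of `EmptyArray<T>`, but it is presumably the classic cached-empty-array pattern (the equivalent of `Array.Empty<T>()` on newer frameworks); a hedged sketch:

```csharp
// Presumed shape only; the actual body is not shown in the diff excerpt.
internal static class EmptyArray<T>
{
    // One shared zero-length array per element type avoids allocating
    // a fresh empty array at every call site.
    public static readonly T[] Value = new T[0];
}
```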
Certainly. I'm not overly familiar with the codebase, so I wasn't sure where the best place to put various things would be.
Closing in favor of #15545, which includes these changes plus others.
### Description
Reduce the number of auxiliary objects created, to reduce GC pressure. Eliminate the GCHandle type of memory pinning in most places. Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native methods from `IntPtr` to `byte[]` (marshalling pinning is more efficient). Allocate input/output UTF-8 names in the unmanaged heap for the lifetime of the InferenceSession, so we do not keep converting and pinning them on every Run. Introduce a new native API that allows strings to be allocated and converted/copied directly into a native tensor.

The PR delivers around 50% latency improvements and fewer GC pauses.

Inspired by: #15520

### Motivation and Context
Clients experience GC pressure and performance degradation when dealing with string tensors.

Co-Authored-By: @tannergooding
This should resolve dotnet/machinelearning#6620
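For the `IntPtr` to `byte[]` signature change mentioned in the description above, a hedged sketch (the library name, entry point, and parameters are illustrative, not the real onnxruntime native API): declaring the parameter as `byte[]` lets the interop marshaller pin the array just for the duration of the call, with no GCHandle needed in managed code.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeSketch
{
    // Before: the caller pins the buffer itself (e.g. via GCHandle)
    // and passes the raw address.
    [DllImport("example", EntryPoint = "Consume")]
    internal static extern int ConsumePtr(IntPtr buffer, int length);

    // After: the marshaller pins `buffer` only while the call is in flight.
    [DllImport("example", EntryPoint = "Consume")]
    internal static extern int ConsumeBytes(byte[] buffer, int length);
}
```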
The previous logic was correct, but it was creating `tensor.Length` managed arrays and `tensor.Length` GCHandles on top of that to ensure pinning. Disposal of GCHandles can be quite expensive, as can having all of these short-lived array allocations, since they increase GC pressure.

This new approach changes to using a `MarshaledString` and `MarshaledStringArray` helper. This ends up doing native allocations instead, which means the memory is implicitly pinned, GC pressure doesn't increase, and no costly GCHandles are required.

From local testing, @michaelgsharp saw that this was between 2-5x faster.
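A minimal sketch of what a `MarshaledStringArray`-style helper could look like (hedged: the real onnxruntime type may differ; this assumes a .NET Core target where `Marshal.StringToCoTaskMemUTF8` is available). Each string is copied once into unmanaged memory, so nothing managed needs to stay pinned and no GCHandles are involved:

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative only; not the actual MarshaledStringArray implementation.
internal struct NativeUtf8StringArray : IDisposable
{
    private IntPtr[] _pointers;

    public NativeUtf8StringArray(string[] strings)
    {
        _pointers = new IntPtr[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            // Allocates unmanaged memory and writes a NUL-terminated
            // UTF-8 copy of the string into it.
            _pointers[i] = Marshal.StringToCoTaskMemUTF8(strings[i]);
        }
    }

    public IntPtr[] Pointers => _pointers;

    public void Dispose()
    {
        if (_pointers == null) return;
        foreach (IntPtr p in _pointers)
        {
            if (p != IntPtr.Zero)
                Marshal.FreeCoTaskMem(p);
        }
        _pointers = null;
    }
}
```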
This isn't necessarily the "best" it could be. Depending on exactly what's being done, it could be more optimal to do one big native allocation and convert all the strings into it sequentially; a sketch of that variant follows below. We could likewise transcode the data directly to the native memory rather than doing it in managed memory and then copying.
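A hedged sketch of that "one big allocation" idea (the names are illustrative; this is not code from the PR): measure every string first, allocate a single unmanaged block, then transcode each string into its slice of the block.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class PackSketch
{
    // Returns one unmanaged block holding all strings back to back;
    // offsets[i]..offsets[i+1] is the UTF-8 byte range of strings[i].
    // The caller frees the block with Marshal.FreeHGlobal.
    internal static unsafe IntPtr PackUtf8Strings(string[] strings, out int[] offsets)
    {
        offsets = new int[strings.Length + 1];
        for (int i = 0; i < strings.Length; i++)
            offsets[i + 1] = offsets[i] + Encoding.UTF8.GetByteCount(strings[i]);

        IntPtr block = Marshal.AllocHGlobal(offsets[strings.Length]);
        byte* dest = (byte*)block;
        for (int i = 0; i < strings.Length; i++)
        {
            fixed (char* src = strings[i])
            {
                // Transcode directly into this string's slice of the block.
                Encoding.UTF8.GetBytes(src, strings[i].Length,
                                       dest + offsets[i], offsets[i + 1] - offsets[i]);
            }
        }
        return block;
    }
}
```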
It looks like `OrtFillStringTensor` is basically just copying all the individual strings into a linearized buffer, so it is effectively doing the "1x big native allocation" already; some other approaches may be "even better", but as is, this should significantly help the perf of the scenario in question.

There may be other places doing similar things with marshalling that could be switched over.