[C# ] Improve string marshalling and reduce GC pressure #15545
Conversation
Eliminate GCHandle-style memory pinning in most places. Improve string marshalling by allocating unmanaged memory that does not require pinning on our side (marshalling pinning is more efficient). Pin input/output names once for the lifetime of InferenceSession. Co-Authored-By: @tannergooding
var namePin = new Memory<byte>(utf8).Pin();
_namesMemoryHandles.Add(namePin);
This creates a GCHandle per entry, and while allocating the handle isn't terribly expensive (most of the time), it can have a large negative downstream impact on the GC: it limits the GC's ability to compact memory and forces it to work around every pin.
In .NET Core, the POH (pinned object heap) was introduced as a way to have long-term pinned data without negatively impacting the normal heap.
It would likely be better to allocate such data in native memory and track a pointer. Disposal would then call Free on the allocations instead. This avoids the additional GC pressure and GC pessimizations caused by long-term pinning of many objects.
The same goes for all the other Memory<T>.Pin() occurrences. #Resolved
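A minimal sketch of the suggested pattern (the class and member names are illustrative, not the actual ORT C# internals): each UTF-8 name is copied into unmanaged memory once, handed to native code as a raw pointer, and freed on Dispose, so nothing on the managed heap has to stay pinned.

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

internal sealed class NativeNameCache : IDisposable
{
    private readonly List<IntPtr> _nameAllocations = new List<IntPtr>();

    // Converts a managed string to a NUL-terminated UTF-8 buffer in unmanaged memory.
    // The pointer never moves, so it can be passed to native code without a GCHandle.
    public IntPtr AddName(string name)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(name);
        IntPtr buffer = Marshal.AllocHGlobal(utf8.Length + 1);
        Marshal.Copy(utf8, 0, buffer, utf8.Length);
        Marshal.WriteByte(buffer, utf8.Length, 0); // NUL terminator
        _nameAllocations.Add(buffer);
        return buffer;
    }

    public void Dispose()
    {
        foreach (IntPtr p in _nameAllocations)
        {
            Marshal.FreeHGlobal(p);
        }
        _nameAllocations.Clear();
    }
}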
Good to know about native memory; however, it is only available since .NET 7.
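For reference, a rough sketch of how the allocation call could be routed per target framework (the helper name and the #if symbol are illustrative; the package's actual multi-targeting may differ): NativeMemory on frameworks that have it, Marshal.AllocHGlobal everywhere else.

using System;
using System.Runtime.InteropServices;

internal static class UnmanagedAlloc
{
    public static unsafe IntPtr Alloc(int byteCount)
    {
#if NET6_0_OR_GREATER
        // Avoids the back-compat overhead of the Marshal layer on modern .NET.
        return (IntPtr)NativeMemory.Alloc((nuint)byteCount);
#else
        // Works on .NET Standard / .NET Framework.
        return Marshal.AllocHGlobal(byteCount);
#endif
    }

    public static unsafe void Free(IntPtr ptr)
    {
#if NET6_0_OR_GREATER
        NativeMemory.Free((void*)ptr);
#else
        Marshal.FreeHGlobal(ptr);
#endif
    }
}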
// Put the key/value pair into the dictionary
_customMetadataMap[key] = value;
ortAllocationKeys.Add(new OrtMemoryAllocation(allocator, Marshal.ReadIntPtr(customMetadataMapKeysHandle, IntPtr.Size * i), 0));
Most of the Marshal.* APIs are very slow or pessimize some behavior due to back-compat. In .NET Core, most of these APIs have been "replaced" with other recommended APIs. For example, rather than Marshal.AllocHGlobal it is recommended you use NativeMemory.Alloc.
In the case of APIs like Marshal.ReadIntPtr, it is better to use System.Runtime.CompilerServices.Unsafe.ReadUnaligned or to do a simple pointer dereference. Either of these options works on .NET Standard/.NET Framework as well (Unsafe is available via a NuGet package; pointers just require you to cast and dereference).
String conversion functions are often better replaced with the System.Text.Encoding.* APIs.
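A short sketch contrasting the approaches described above (the class and method names are placeholders): reading the i-th pointer out of a native array of pointers, and converting a NUL-terminated UTF-8 buffer to a managed string.

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Text;

internal static class NativeReadExamples
{
    // Back-compat path: works everywhere but goes through the slower Marshal layer.
    public static IntPtr ReadWithMarshal(IntPtr arrayBase, int i) =>
        Marshal.ReadIntPtr(arrayBase, IntPtr.Size * i);

    // Unsafe.ReadUnaligned over the raw address.
    public static unsafe IntPtr ReadWithUnsafe(IntPtr arrayBase, int i) =>
        Unsafe.ReadUnaligned<IntPtr>((byte*)arrayBase + IntPtr.Size * i);

    // Or simply cast and dereference.
    public static unsafe IntPtr ReadWithPointer(IntPtr arrayBase, int i) =>
        ((IntPtr*)arrayBase)[i];

    // UTF-8 to managed string via System.Text.Encoding instead of Marshal helpers.
    public static unsafe string Utf8ToString(IntPtr utf8, int byteLength) =>
        Encoding.UTF8.GetString((byte*)utf8, byteLength);
}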
Overall LGTM. Left callouts in two places where things could be further improved.
Per our online chat, I am going to introduce a new API to try out. In reply to: 1513466146
/// </summary>
/// <param name="index"></param>
/// <param name="buffer_length"></param>
/// <returns></returns>
nit: please complete documentation #Pending
We have validated on the server; these changes fix both the GC issues and the latency issue.
Add ProviderOptionsUpdater
Is it possible to get numbers at fixed QPS to compare? I see throughput is higher but so is latency in some places.
We generated fixed QPS from the client side for the above tests. The numbers here are server-side numbers.
Sorry, not quite following. Are you saying that in your testing the client sends fixed QPS, and these are the server-side numbers where you're maxing out throughput, so server-side QPS is less than what the client is attempting to send? If so, can the client send a fixed QPS below the max so that server-side QPS matches, in order to check latency in that scenario? e.g. use a QPS of 350.
private string GetOutputName(ulong index, out byte[] utf8)
private string GetOutputName(ulong index, out IntPtr utf8)
What's the reason we're going back to IntPtr here?
We are not going back. We are caching UTF-8 strings in unmanaged memory, so I reworked the function to copy directly into unmanaged memory rather than going through an intermediate byte[].
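Roughly the shape of that rework (this is a hypothetical helper, not the actual GetOutputName; obtaining the native UTF-8 name from the ORT API is elided): the native name is copied into an unmanaged allocation we own for the lifetime of the session, and the managed string is produced without a byte[] intermediate.

using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class NameInterop
{
    public static unsafe string CopyNativeUtf8(IntPtr nativeUtf8, out IntPtr utf8Copy)
    {
        // Find the NUL terminator to get the byte length.
        byte* p = (byte*)nativeUtf8;
        int length = 0;
        while (p[length] != 0)
        {
            length++;
        }

        // One unmanaged allocation (including the terminator) that we keep and free on Dispose.
        utf8Copy = Marshal.AllocHGlobal(length + 1);
        Buffer.MemoryCopy(p, (void*)utf8Copy, length + 1, length + 1);

        // Managed string for the user-facing API.
        return Encoding.UTF8.GetString(p, length);
    }
}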
@@ -172,13 +170,15 @@ public void TesEnvWithCustomLogger()
var model = TestDataLoader.LoadModelFromEmbeddedResource("squeezenet.onnx");
// Trigger some logging
using (var session = new InferenceSession(model)) ;
// Empty stmt intentional
nit: better if comments explain 'why' rather than just 'what'. I assume it's because what you're testing just requires session initialization, so a comment to that effect would help the next developer more.
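For example, something along these lines (wording is only a suggestion):

// Creating (and disposing) the session is all this test needs: constructing it
// triggers the logging being verified, so the empty statement is intentional.
using (var session = new InferenceSession(model)) ;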
@skottmckay the existing models and optimized models that @sanketshahMS mentioned are not a good basis for comparing the performance of the code changes in this PR: they compare the performance of this PR AND the changes in their model, which are drastic enough that they are hard to use for direct comparison here. The existing model passes in a single string, i.e. a string tensor with a dimension of 1, which is then split up/parsed/tokenized inside the ONNX model. The test case I have been running uses the new model, so a variable-length string tensor (the test data has a minimum of 10 values per tensor and a maximum of 1000 values per tensor). It has 140,000 rows and a total of 55 million string values that get passed in. Using the new model with the original ORT code took 23.6 seconds. Using that exact same model/data with the new ORT code from this PR gives us a runtime of 4 seconds. Across the board, latency/GC/etc. are much, much lower with this new code.
Description
Reduce the number of auxiliary objects created, to reduce GC pressure.
Eliminate GCHandle-style memory pinning in most places.
Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native methods from IntPtr to byte[] (marshalling pinning is more efficient). Allocate input/output UTF-8 names in the unmanaged heap for the lifetime of InferenceSession, so we do not keep converting and pinning them on every Run.
Introduce a new native API that allows allocating and converting/copying strings directly into a native tensor.
The PR delivers around a 50% latency improvement and fewer GC pauses.
Inspired by: #15520
Motivation and Context
Clients experience GC pressure and performance degradation when dealing with string tensors.
Co-Authored-By: @tannergooding