[C#] Improve string marshalling and reduce GC pressure #15545

Merged · 15 commits · Apr 20, 2023

Conversation

@yuslepukhin (Member) commented Apr 17, 2023

Description

Reduce the number of auxiliary objects created in order to reduce GC pressure.
Eliminate GCHandle-based memory pinning in most places.
Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native method signatures from IntPtr to byte[], since the marshaller's own pinning is more efficient (see the sketch below).
Allocate input/output UTF-8 names on the unmanaged heap for the lifetime of the InferenceSession, so we do not keep converting and pinning them on every Run.
Introduce a new native API that allows allocating and converting/copying strings directly into a native tensor.
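As a minimal, hedged sketch of why the IntPtr → byte[] signature change helps (not the actual ONNX Runtime bindings; the `example_native` library and `LookupByName` entry point are made up): with a byte[] parameter the interop marshaller pins the array only for the duration of the call, so no manual GCHandle is needed.

```csharp
using System.Runtime.InteropServices;
using System.Text;

internal static class NativeMethodsSketch
{
    // Hypothetical native entry point taking a NUL-terminated UTF-8 name.
    // The marshaller pins the byte[] just for the duration of this call.
    [DllImport("example_native")]
    internal static extern int LookupByName(byte[] utf8Name);

    internal static int Lookup(string name)
    {
        // Encode to UTF-8 with an explicit NUL terminator for the C side.
        byte[] utf8 = Encoding.UTF8.GetBytes(name + "\0");
        return LookupByName(utf8);
    }
}
```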

The PR delivers around a 50% latency improvement and fewer GC pauses.

Inspired by: #15520

Motivation and Context

Clients experience GC pressure and performance degradation when dealing with string tensors.

Co-Authored-By: @tannergooding

 Eliminate GCHandle-based memory pinning in most places.
 Improve string marshalling by allocating unmanaged memory that does not
 require pinning on our side (the marshaller's own pinning is more efficient).
 Pin input/output names once for the lifetime of the InferenceSession.

Co-Authored-By: @tannergooding
Comment on lines 953 to 954

```csharp
var namePin = new Memory<byte>(utf8).Pin();
_namesMemoryHandles.Add(namePin);
```

@tannergooding (Member) commented Apr 18, 2023:


This creates a GCHandle per entry, and while allocating the handle isn't terribly expensive (most of the time), it can have a large negative downstream impact on the GC: it limits the GC's ability to compact memory and forces it to work around every pin.

In .NET Core, the POH (pinned object heap) was introduced as a way to have long-term pinned data without negatively impacting the normal heap.

It would likely be better to allocate such data in native and track a pointer. Disposal would then call Free on the allocations instead. This will avoid the additional GC pressure and GC pessimizations caused by long term pinning of many objects.

The same goes for all the other Memory<T>.Pin() occurrences #Resolved
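For context, a hedged sketch of the two alternatives mentioned here (the `LongLivedBufferSketch` helper is hypothetical, and both APIs only exist on newer .NET runtimes, as the reply below notes); this is illustration only, not the code adopted in the PR:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class LongLivedBufferSketch
{
    // Option 1 (.NET 5+): allocate on the pinned object heap. The array never
    // moves, so the GC does not have to work around an explicit pin for it.
    internal static byte[] AllocateOnPoh(int length) =>
        GC.AllocateArray<byte>(length, pinned: true);

    // Option 2 (.NET 6+): allocate unmanaged memory and track the pointer.
    // The owner must call NativeMemory.Free when disposing.
    internal static unsafe void* AllocateNative(nuint length) =>
        NativeMemory.Alloc(length);

    internal static unsafe void FreeNative(void* p) => NativeMemory.Free(p);
}
```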

@yuslepukhin (Member Author) replied:

Good to know about native memory; however, it is only available since .NET 7.


```csharp
// Put the key/value pair into the dictionary
_customMetadataMap[key] = value;
ortAllocationKeys.Add(new OrtMemoryAllocation(allocator, Marshal.ReadIntPtr(customMetadataMapKeysHandle, IntPtr.Size * i), 0));
```

@tannergooding (Member) commented Apr 18, 2023:


Most of the Marshal.* APIs are very slow or pessimize some behavior due to back-compat.

In .NET Core, most of the APIs have been "replaced" with other recommended APIs. For example, rather than Marshal.AllocHGlobal it is recommended you use NativeMemory.Alloc.

In the case of APIs like Marshal.ReadIntPtr, it is better to use System.Runtime.CompilerServices.Unsafe.ReadUnaligned or to do a simple pointer dereference. Either of these options works on .NET Standard/.NET Framework as well (Unsafe is available via a NuGet package; pointers just require you to cast and dereference).

String conversion functions are often better replaced with System.Text.Encoding.* APIs.
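To illustrate the alternatives named above, a hedged sketch (the `PointerReadSketch` class is hypothetical) of reading the i-th pointer out of a native array of pointers, such as the metadata keys returned by a native call:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class PointerReadSketch
{
    internal static unsafe IntPtr ReadEntry(IntPtr baseAddress, int i)
    {
        // Back-compat path: works everywhere, but is comparatively slow.
        IntPtr viaMarshal = Marshal.ReadIntPtr(baseAddress, IntPtr.Size * i);

        // Unsafe.ReadUnaligned: available down-level via the
        // System.Runtime.CompilerServices.Unsafe NuGet package.
        IntPtr viaUnsafe =
            Unsafe.ReadUnaligned<IntPtr>((byte*)baseAddress + IntPtr.Size * i);

        // Plain pointer dereference: cast and index.
        IntPtr viaPointer = ((IntPtr*)baseAddress)[i];

        // All three read the same value; the latter two avoid Marshal overhead.
        System.Diagnostics.Debug.Assert(viaMarshal == viaUnsafe && viaUnsafe == viaPointer);
        return viaPointer;
    }
}
```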

@tannergooding (Member) previously approved these changes on Apr 18, 2023 and left a comment:


Overall LGTM. Left callouts on two places where things could be further improved.

@yuslepukhin (Member Author) commented Apr 18, 2023:

Per our online chat, I am going to introduce a new API to try out.


In reply to: 1513466146

include/onnxruntime/core/session/onnxruntime_cxx_api.h (outdated review comment, resolved)
Comment on lines 1195 to 1198

```csharp
/// </summary>
/// <param name="index"></param>
/// <param name="buffer_length"></param>
/// <returns></returns>
```

@skottmckay (Contributor) commented Apr 19, 2023:


nit: please complete documentation #Pending

onnxruntime/core/session/onnxruntime_c_api.cc (review comment resolved)
@sanketshahMS

We have validated on the server; these changes fix both the GC issues and the latency issue.

| Model type | QPS | CPU % | P95 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- |
| Existing models with string [1] input feature | 437 | 94 | 36 | 41 | 56 |
| Existing models with string [1] input feature + DLLs from this PR | 493 | 86 | 30 | 45 | 68 |
| New optimized models with string [?] (dynamic vector size) input feature | 380 | 89 | 30 | 272 | 560 |
| New optimized model with string [?] (dynamic vector size) + DLLs from this PR | 544 | 87 | 43 | 53 | 67 |

@skottmckay (Contributor) commented Apr 19, 2023

> We have validated on the server; these changes fix both the GC issues and the latency issue.

Is it possible to get numbers at a fixed QPS to compare? I see throughput is higher, but so is latency in some places.

@sanketshahMS

> Is it possible to get numbers at a fixed QPS to compare? I see throughput is higher, but so is latency in some places.

We created a fixed QPS from the client side for the above tests; the numbers here are server-side numbers. We are optimizing for throughput, so we cannot have tests with a fixed QPS.

@skottmckay (Contributor)

> We created a fixed QPS from the client side for the above tests; the numbers here are server-side numbers. We are optimizing for throughput, so we cannot have tests with a fixed QPS.

Sorry, not quite following. Are you saying that in your testing the client sends a fixed QPS and these are the server-side numbers, where you're maxing out throughput so the server-side QPS is less than what the client is attempting to send?

If so, can the client send a fixed QPS less than the max, so that the server-side QPS matches, in order to check the latency in that scenario? E.g. use a QPS of 350.

Comment on lines -994 to +987

```diff
-private string GetOutputName(ulong index, out byte[] utf8)
+private string GetOutputName(ulong index, out IntPtr utf8)
```

Contributor:

What's the reason we're going back to IntPtr here?

@yuslepukhin (Member Author) replied:

We are not going back; we are caching the UTF-8 strings in unmanaged memory, so I reworked the function to copy directly into unmanaged memory rather than going through an intermediate byte[].
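To make that concrete, a hedged sketch of encoding a name directly into an unmanaged buffer with no intermediate byte[] (the `Utf8NameCacheSketch` helper is hypothetical, and the span-based Encoding overload assumes .NET Core 2.1+ / .NET Standard 2.1; it is not necessarily the exact code in this PR):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class Utf8NameCacheSketch
{
    // Encode a name straight into unmanaged memory, returning a pointer that a
    // session could cache for its lifetime and free with Marshal.FreeHGlobal on Dispose.
    internal static unsafe IntPtr ToUnmanagedUtf8(string name, out int byteCount)
    {
        byteCount = Encoding.UTF8.GetByteCount(name);
        IntPtr buffer = Marshal.AllocHGlobal(byteCount + 1);
        var destination = new Span<byte>((void*)buffer, byteCount + 1);
        Encoding.UTF8.GetBytes(name, destination); // write directly into native memory
        destination[byteCount] = 0;                // NUL terminator for the C API
        return buffer;
    }
}
```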

```
@@ -172,13 +170,15 @@ public void TesEnvWithCustomLogger()

var model = TestDataLoader.LoadModelFromEmbeddedResource("squeezenet.onnx");
// Trigger some logging
using (var session = new InferenceSession(model)) ;
// Empty stmt intentional
```
Contributor:

nit: comments are better if they explain 'why' rather than just 'what'. I assume it's because what you're testing only requires session initialization, so a comment to that effect would help the next developer more.

@michaelgsharp (Member) commented:

@skottmckay the existing models and the optimized models that @sanketshahMS mentions are not a good comparison of the performance of the code changes in this PR. They compare the performance of this PR AND the changes in their model, and those model changes are drastic enough that they are hard to use for a direct comparison here.

The existing model passes in a single string, i.e. a string tensor with a dimension of 1, which is then split up/parsed/tokenized inside the ONNX model.
The new models do the splitting/parsing before the ONNX model, which means the string tensor that is passed in has dimensions much larger than 1, so it really hits the string issues here.

The test case I have been running uses the new model, i.e. a variable-length string tensor (the test data has a minimum of 10 values per tensor and a maximum of 1000 values per tensor). It has 140,000 rows and a total of 55 million string values that get passed in.

Using the new model with the original ONNX Runtime code took 23.6 seconds. Using the exact same model/data with the new ORT code from this PR gives a runtime of 4 seconds. Across the board, latency/GC/etc. are much, much lower with this new code.
