[C#] Improve string marshalling and reduce GC pressure #15545

Merged · 15 commits · Apr 20, 2023

Conversation

@yuslepukhin (Member) commented Apr 17, 2023

Description

Reduce the number of auxiliary objects created in order to reduce GC pressure.
Eliminate GCHandle-based memory pinning in most places.
Improve string marshalling by allocating unmanaged memory that does not require pinning. Change native method signatures from IntPtr to byte[], since the marshaller's own pinning is more efficient (see the sketch below).
Allocate input/output UTF-8 names on the unmanaged heap for the lifetime of the InferenceSession, so we do not keep converting and pinning them on every Run.
Introduce a new native API that allows allocating and converting/copying strings directly into a native tensor.
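As a minimal, hedged sketch of why the IntPtr → byte[] signature change helps (not the actual ONNX Runtime bindings; the `example_native` library and `LookupByName` entry point are made up): with a byte[] parameter the interop marshaller pins the array only for the duration of the call, so no manual GCHandle is needed.

```csharp
using System.Runtime.InteropServices;
using System.Text;

internal static class NativeMethodsSketch
{
    // Hypothetical native entry point taking a NUL-terminated UTF-8 name.
    // The marshaller pins the byte[] just for the duration of this call.
    [DllImport("example_native")]
    internal static extern int LookupByName(byte[] utf8Name);

    internal static int Lookup(string name)
    {
        // Encode to UTF-8 with an explicit NUL terminator for the C side.
        byte[] utf8 = Encoding.UTF8.GetBytes(name + "\0");
        return LookupByName(utf8);
    }
}
```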

The PR delivers around a 50% latency improvement and fewer GC pauses.

Inspired by: #15520

Motivation and Context

Clients experience GC pressure and performance degradation when dealing with string tensors.

Co-Authored-By: @tannergooding

 Eliminate GCHandle-based memory pinning in most places.
 Improve string marshalling by allocating unmanaged memory that does not
 require pinning on our side (the marshaller's own pinning is more efficient).
 Pin input/output names once for the lifetime of the InferenceSession.

Co-Authored-By: @tannergooding
Comment on lines 953 to 954

```csharp
var namePin = new Memory<byte>(utf8).Pin();
_namesMemoryHandles.Add(namePin);
```

@tannergooding (Member) commented Apr 18, 2023:


This creates a GCHandle per entry, and while allocating the handle isn't terribly expensive (most of the time), it can have a large negative downstream impact on the GC: it limits the GC's ability to compact memory and forces it to work around every pin.

In .NET Core, the POH (pinned object heap) was introduced as a way to have long-term pinned data without negatively impacting the normal heap.

It would likely be better to allocate such data in native and track a pointer. Disposal would then call Free on the allocations instead. This will avoid the additional GC pressure and GC pessimizations caused by long term pinning of many objects.

The same goes for all the other Memory<T>.Pin() occurrences #Resolved
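For context, a hedged sketch of the two alternatives mentioned here (the `LongLivedBufferSketch` helper is hypothetical, and both APIs only exist on newer .NET runtimes, as the reply below notes); this is illustration only, not the code adopted in the PR:

```csharp
using System;
using System.Runtime.InteropServices;

internal static class LongLivedBufferSketch
{
    // Option 1 (.NET 5+): allocate on the pinned object heap. The array never
    // moves, so the GC does not have to work around an explicit pin for it.
    internal static byte[] AllocateOnPoh(int length) =>
        GC.AllocateArray<byte>(length, pinned: true);

    // Option 2 (.NET 6+): allocate unmanaged memory and track the pointer.
    // The owner must call NativeMemory.Free when disposing.
    internal static unsafe void* AllocateNative(nuint length) =>
        NativeMemory.Alloc(length);

    internal static unsafe void FreeNative(void* p) => NativeMemory.Free(p);
}
```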

@yuslepukhin (Member Author) replied:

Good to know about native memory; however, it is only available since .NET 7.


```csharp
// Put the key/value pair into the dictionary
_customMetadataMap[key] = value;
ortAllocationKeys.Add(new OrtMemoryAllocation(allocator, Marshal.ReadIntPtr(customMetadataMapKeysHandle, IntPtr.Size * i), 0));
```

@tannergooding (Member) commented Apr 18, 2023:


Most of the Marshal.* APIs are very slow or pessimize some behavior due to back-compat.

In .NET Core, most of the APIs have been "replaced" with other recommended APIs. For example, rather than Marshal.AllocHGlobal it is recommended you use NativeMemory.Alloc.

In the case of APIs like Marshal.ReadIntPtr, it is better to use System.Runtime.CompilerServices.Unsafe.ReadUnaligned or to do a simple pointer dereference. Either of these options works on .NET Standard/.NET Framework as well (Unsafe is available via a NuGet package; pointers just require you to cast and dereference).

String conversion functions are often better replaced with System.Text.Encoding.* APIs.
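To illustrate the alternatives named above, a hedged sketch (the `PointerReadSketch` class is hypothetical) of reading the i-th pointer out of a native array of pointers, such as the metadata keys returned by a native call:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

internal static class PointerReadSketch
{
    internal static unsafe IntPtr ReadEntry(IntPtr baseAddress, int i)
    {
        // Back-compat path: works everywhere, but is comparatively slow.
        IntPtr viaMarshal = Marshal.ReadIntPtr(baseAddress, IntPtr.Size * i);

        // Unsafe.ReadUnaligned: available down-level via the
        // System.Runtime.CompilerServices.Unsafe NuGet package.
        IntPtr viaUnsafe =
            Unsafe.ReadUnaligned<IntPtr>((byte*)baseAddress + IntPtr.Size * i);

        // Plain pointer dereference: cast and index.
        IntPtr viaPointer = ((IntPtr*)baseAddress)[i];

        // All three read the same value; the latter two avoid Marshal overhead.
        System.Diagnostics.Debug.Assert(viaMarshal == viaUnsafe && viaUnsafe == viaPointer);
        return viaPointer;
    }
}
```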

@tannergooding (Member) previously approved these changes on Apr 18, 2023 and left a comment:


Overall LGTM. Left callouts on two places where things could be further improved.

@yuslepukhin (Member Author) commented Apr 18, 2023:

Per our online chat, I am going to introduce a new API to try out.


In reply to: 1513466146

include/onnxruntime/core/session/onnxruntime_cxx_api.h (outdated review comment, resolved)
Comment on lines 1195 to 1198

```csharp
/// </summary>
/// <param name="index"></param>
/// <param name="buffer_length"></param>
/// <returns></returns>
```

@skottmckay (Contributor) commented Apr 19, 2023:


nit: please complete documentation #Pending

onnxruntime/core/session/onnxruntime_c_api.cc (review comment resolved)
@sanketshahMS

We have validated on the server; these changes fix both the GC issues and the latency issue.

| Model type | QPS | CPU % | P95 latency (ms) | P99 latency (ms) | P99.9 latency (ms) |
| --- | --- | --- | --- | --- | --- |
| Existing models with string [1] input feature | 437 | 94 | 36 | 41 | 56 |
| Existing models with string [1] input feature + DLLs from this PR | 493 | 86 | 30 | 45 | 68 |
| New optimized models with string [?] (dynamic vector size) input feature | 380 | 89 | 30 | 272 | 560 |
| New optimized model with string [?] (dynamic vector size) + DLLs from this PR | 544 | 87 | 43 | 53 | 67 |

@skottmckay (Contributor) commented Apr 19, 2023

> We have validated on the server; these changes fix both the GC issues and the latency issue.

Is it possible to get numbers at a fixed QPS to compare? I see throughput is higher, but so is latency in some places.

@sanketshahMS

> Is it possible to get numbers at a fixed QPS to compare? I see throughput is higher, but so is latency in some places.

We created a fixed QPS from the client side for the above tests; the numbers here are server-side numbers. We are optimizing for throughput, so we cannot have tests with a fixed QPS.

@skottmckay (Contributor)

> We created a fixed QPS from the client side for the above tests; the numbers here are server-side numbers. We are optimizing for throughput, so we cannot have tests with a fixed QPS.

Sorry, not quite following. Are you saying that in your testing the client sends a fixed QPS and these are the server-side numbers, where you're maxing out throughput so the server-side QPS is less than what the client is attempting to send?

If so, can the client send a fixed QPS less than the max, so that the server-side QPS matches, in order to check the latency in that scenario? E.g. use a QPS of 350.

Comment on lines -994 to +987

```diff
-private string GetOutputName(ulong index, out byte[] utf8)
+private string GetOutputName(ulong index, out IntPtr utf8)
```

Contributor:

What's the reason we're going back to IntPtr here?

@yuslepukhin (Member Author) replied:

We are not going back; we are caching the UTF-8 strings in unmanaged memory, so I reworked the function to copy directly into unmanaged memory rather than going through an intermediate byte[].
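To make that concrete, a hedged sketch of encoding a name directly into an unmanaged buffer with no intermediate byte[] (the `Utf8NameCacheSketch` helper is hypothetical, and the span-based Encoding overload assumes .NET Core 2.1+ / .NET Standard 2.1; it is not necessarily the exact code in this PR):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class Utf8NameCacheSketch
{
    // Encode a name straight into unmanaged memory, returning a pointer that a
    // session could cache for its lifetime and free with Marshal.FreeHGlobal on Dispose.
    internal static unsafe IntPtr ToUnmanagedUtf8(string name, out int byteCount)
    {
        byteCount = Encoding.UTF8.GetByteCount(name);
        IntPtr buffer = Marshal.AllocHGlobal(byteCount + 1);
        var destination = new Span<byte>((void*)buffer, byteCount + 1);
        Encoding.UTF8.GetBytes(name, destination); // write directly into native memory
        destination[byteCount] = 0;                // NUL terminator for the C API
        return buffer;
    }
}
```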

```
@@ -172,13 +170,15 @@ public void TesEnvWithCustomLogger()

var model = TestDataLoader.LoadModelFromEmbeddedResource("squeezenet.onnx");
// Trigger some logging
using (var session = new InferenceSession(model)) ;
// Empty stmt intentional
```
Contributor:

nit: comments are better if they explain 'why' rather than just 'what'. I assume it's because what you're testing only requires session initialization, so a comment to that effect would help the next developer more.

@michaelgsharp (Member) commented:

@skottmckay the existing models and the optimized models that @sanketshahMS mentions are not a good comparison of the performance of the code changes in this PR. They compare the performance of this PR AND the changes in their model, and those model changes are drastic enough that they are hard to use for a direct comparison here.

The existing model passes in a single string, i.e. a string tensor with a dimension of 1, which is then split up/parsed/tokenized inside the ONNX model.
The new models do the splitting/parsing before the ONNX model, which means the string tensor that is passed in has dimensions much larger than 1, so it really hits the string issues here.

The test case I have been running uses the new model, i.e. a variable-length string tensor (the test data has a minimum of 10 values per tensor and a maximum of 1000 values per tensor). It has 140,000 rows and a total of 55 million string values that get passed in.

Using the new model with the original ONNX Runtime code took 23.6 seconds. Using the exact same model/data with the new ORT code from this PR gives a runtime of 4 seconds. Across the board, latency/GC/etc. are much, much lower with this new code.
