
Observe io.netty.allocator.type for custom Netty allocator #10292

Closed
jaredbwasserman opened this issue Jun 20, 2023 · 8 comments · Fixed by #10543

@jaredbwasserman

I modified code that depends on grpc-java to use the unpooled allocator type for netty. In order to accomplish this I had to do the following (a sketch of setting both properties follows this list):

  1. Pass in unpooled for the io.netty.allocator.type property. See https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L78
  2. Pass in false for the io.grpc.netty.useCustomAllocator property. I am using v1.32.1 but this also applies to the latest version v1.56.x. See https://github.com/grpc/grpc-java/blob/v1.32.x/netty/src/main/java/io/grpc/netty/Utils.java#L126 and https://github.com/grpc/grpc-java/blob/v1.56.x/netty/src/main/java/io/grpc/netty/Utils.java#L136
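
For reference, a minimal sketch of setting both properties programmatically (equivalent to passing them as `-D` JVM flags; they must be set before any Netty or gRPC classes are initialized):

```java
public final class UnpooledAllocatorConfig {
    public static void main(String[] args) {
        // Netty: make ByteBufAllocator.DEFAULT the unpooled allocator.
        System.setProperty("io.netty.allocator.type", "unpooled");
        // grpc-netty: skip the custom allocator and fall back to ByteBufAllocator.DEFAULT.
        System.setProperty("io.grpc.netty.useCustomAllocator", "false");

        // ... start the rest of the application only after these are set ...
    }
}
```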

Why is the default value for io.grpc.netty.useCustomAllocator true? In order for the code to use the ByteBufAllocator.DEFAULT the io.grpc.netty.useCustomAllocator property must be overridden to false, which seems to contradict the idea of a default. Would it make more sense for io.grpc.netty.useCustomAllocator to default to false so that ByteBufAllocator.DEFAULT is used by default?

I am asking because I had to do some debugging to figure out why only supplying io.netty.allocator.type=unpooled did not change the allocator type as I had expected. Thanks!

@temawi
Contributor

temawi commented Jun 20, 2023

Sorry you had to jump through hoops to get your allocator enabled.

A custom allocator was implemented to have lower memory usage than the default allocator could provide. Maybe it's a bit counterintuitive, but for gRPC purposes we consider our custom allocator to be "the default".

@ejona86
Member

ejona86 commented Jun 20, 2023

It isn't all that well documented because people haven't really needed it. I think I might have heard of one person using NettyChannelBuilder.withOption(ChannelOption.ALLOCATOR, yourAlloc). We made our own custom allocator to reduce memory usage (especially direct memory), and it seems it has been working.
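
For anyone who wants that per-channel override instead of flipping global system properties, a minimal sketch (the target address is hypothetical; with grpc-netty-shaded the builder and Netty classes come from the shaded packages):

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.buffer.UnpooledByteBufAllocator;
import io.netty.channel.ChannelOption;

// Build a channel that uses an unpooled allocator for its Netty pipeline only.
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
        .withOption(ChannelOption.ALLOCATOR, UnpooledByteBufAllocator.DEFAULT)
        .build();
```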

What problem did you have such that you wanted unpooled buffers?

@jaredbwasserman
Author

Thanks for the information.

TL;DR: There is a direct memory leak that appears to be caused by the netty pooled allocator type. Switching to the netty unpooled allocator type fixes the leak.

Here is a high-level view of the system:

  • Top level is a Flink application running Flink version 1.13.5
  • Flink is using Google Pub/Sub as an input source. We are depending on com.google.cloud:libraries-bom:8.1.0
  • The BOM pulls in com.google.cloud:google-cloud-pubsub:jar:1.108.0
  • Pubsub in turn depends on io.grpc:grpc-netty-shaded:jar:1.30.2 (looks like I had the version incorrect in the initial question)

There are conditions where the system's direct memory can leak. Typically, when the Flink job restarts, we observe an increase in direct memory usage. So far, I have isolated the direct memory increase to the netty pooled allocator. This is because when I switch to using the unpooled allocator type (using the steps detailed in the initial question), there is no longer a direct memory leak.

Unfortunately, I do not fully understand everything that is going on and why the memory leak is occurring. I suspect that during a Flink restart something related to the netty pools is not properly cleaned up, but I need to do more investigation on my side to get a more detailed picture of what is occurring. There is a chance we are running into this netty issue netty/netty#12067 but I cannot prove it.

@ejona86
Member

ejona86 commented Jun 22, 2023

If the allocator itself is leaking, and it isn't something simple like channels/servers being leaked or not shut down, then I'd suspect the ClassLoader is being leaked and that's the real problem. Assuming it isn't something easy like a gRPC thread still running, I'd suspect ThreadLocals. Any thread that calls Netty (which includes a thread issuing an RPC) can retain a ThreadLocal which holds onto the ClassLoader. The only sure way to clean those up is to shut down the threads (see netty/netty#6891 (comment)). This is mostly a JDK problem but partly a Netty problem.
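
As a concrete sketch of "shut down the threads" when the application owns the channel and any executor it handed to gRPC (names and timeouts are illustrative):

```java
import io.grpc.ManagedChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// ThreadLocals (and the InternalThreadLocalMap they hold) are only released once the
// threads themselves terminate, so terminate everything that may have touched Netty.
void shutdownGrpc(ManagedChannel channel, ExecutorService appExecutor) throws InterruptedException {
    channel.shutdown();
    channel.awaitTermination(5, TimeUnit.SECONDS);
    appExecutor.shutdown();
    appExecutor.awaitTermination(5, TimeUnit.SECONDS);
}
```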

@ejona86
Member

ejona86 commented Jun 28, 2023

io.grpc.netty.useCustomAllocator is about whether gRPC should use its custom allocator vs. Netty's default allocator, and that custom allocator is used by default. So I think the current name makes sense. We also aren't going to "optimize" the value too much, since we really would hope it doesn't need wide use.

It would be fine to observe io.netty.allocator.type when creating our allocators. That would be an easy change, and I think for most of the other values we are already observing the system properties (via observing Netty's defaults).

But the most important part of this issue is the leak. The other things really don't matter, because they are just workarounds. But the leak is also quite bothersome to debug. I think the best thing to do is to shut down the threads used by Netty/gRPC so the ClassLoader can be released, and see if the leak is addressed.

We could improve the behavior of FastThreadLocal.destroy(), so that the ClassLoader might be eventually GC'ed, but it requires creating new ThreadLocals to trigger and so is probably too slow/non-deterministic. That's why I didn't change it before. ThreadLocalMap (in Java) would need to use a ReferenceQueue instead of the slow cleanSomeSlots(). But that doesn't make much sense for the JDK because almost no ThreadLocals will ever be GCed (because of cycles like this one, which requires an explicit API to break). And if you are doing that, "you might as well shut down the Threads."

@ejona86 ejona86 changed the title Confusing behavior for useCustomAllocator Observe io.netty.allocator.type for custom Netty allocator Jun 30, 2023
@ejona86 ejona86 added this to the Next milestone Jun 30, 2023
@ejona86
Member

ejona86 commented Jun 30, 2023

Let's observe io.netty.allocator.type as part of Utils when we create our allocator. This issue will track that change. (Implementation note: if Netty is unpooled, we can just use their allocator since all the interesting settings other than heap/direct (which we don't change) are for pooled.)
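
A rough sketch of that idea (not the actual change that landed in #10543; `chooseAllocator` and `forceHeapBuffer` are illustrative names):

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

// If Netty was told to go unpooled, its default allocator is already what we want;
// otherwise keep building a pooled allocator (stand-in for gRPC's tuned construction in Utils).
static ByteBufAllocator chooseAllocator(boolean forceHeapBuffer) {
    if ("unpooled".equalsIgnoreCase(System.getProperty("io.netty.allocator.type"))) {
        return ByteBufAllocator.DEFAULT; // UnpooledByteBufAllocator under this setting
    }
    return new PooledByteBufAllocator(/* preferDirect= */ !forceHeapBuffer);
}
```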

@jaredbwasserman
Author

Thanks @ejona86 for the information. I had some time to dig into the application's behavior a bit more on my side.

With respect to your suggestion to shut down the threads used by Netty/gRPC so the ClassLoader can be released (and see if the leak is addressed): it appears the gRPC worker threads are already being shut down and recreated as part of the restart process our Flink application goes through. I profiled locally and found sets of threads beginning with grpc-nio-worker-ELG- which are periodically destroyed and recreated. It is possible I misunderstood which threads you meant, but these seemed the most likely based on your description.

I also did some debugging to better understand the cleanup process these threads go through. In all cases the remove() function inside InternalThreadLocalMap hit the case where the thread is an instance of FastThreadLocalThread, and therefore set the threadLocalMap to null (https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/InternalThreadLocalMap.java#L141). The call to remove() seems to come from FastThreadLocal.removeAll() in one of two places:

  1. The doStartThread method inside SingleThreadEventExecutor (https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/concurrent/SingleThreadEventExecutor.java#L1050)
  2. The run method inside FastThreadLocalRunnable (https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/concurrent/FastThreadLocalRunnable.java#L32)

This use of remove() differs from calling destroy() because the slowThreadLocalMap is not modified in the cases I observed, whereas destroy() explicitly modifies slowThreadLocalMap (https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/InternalThreadLocalMap.java#L148).

Given the observations I made locally, I am not sure if the memory leak scenario you described in netty/netty#6891 (comment) would apply for this case (it is very possible I am missing some understanding though). Can the ClassLoader still leak in the case that the grpc-nio-worker-ELG- threads are periodically recreated?

Circling back to the allocator type: under the assumption that the direct memory leak is somehow related to FastThreadLocal, UnpooledByteBufAllocator avoids the leak by not relying on FastThreadLocal. The PoolThreadLocalCache inside PooledByteBufAllocator is a FastThreadLocal (https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/PooledByteBufAllocator.java#L507), while UnpooledByteBufAllocator does not utilize FastThreadLocal in any way.

@ejona86
Member

ejona86 commented Jul 12, 2023

I was not talking about grpc-nio-worker-ELG-; I assumed those threads were already getting shut down. If not, then it means you'd be leaking a Channel or Server.

The problem I was describing occurs when you call gRPC from a long-lived thread. gRPC will encode a message using the allocator, and Netty may then use a ThreadLocal to store a cache. Regardless of what information is being stored, this will leak, because the value will be an InternalThreadLocalMap which has a reference to Netty's ClassLoader and keeps it alive for the remainder of the thread's life. InternalThreadLocalMap isn't really special in this issue; storing new Object() {} as a ThreadLocal value would cause the same problem. The only way to avoid the issue is to store only JDK-defined types in the ThreadLocal.
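
To illustrate the retention chain (the class and field names below are illustrative, not Netty's actual internals):

```java
// Thread -> threadLocals map -> Entry.value -> value.getClass() -> defining ClassLoader.
// Any non-JDK value stored in a ThreadLocal keeps that ClassLoader reachable for as long
// as the long-lived thread stays alive and the entry is never removed.
final class LeakIllustration {
    private static final ThreadLocal<Object> CACHE = new ThreadLocal<>();

    static void touchFromLongLivedThread() {
        CACHE.set(new Object() {}); // the anonymous class pins the ClassLoader that defined it
    }
}
```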

Any usage of Netty from a thread risks FastThreadLocal being used. There are two ways unpooled could help:

  1. FastThreadLocal is never used from another thread. This would allow the ClassLoader to garbage collect.
  2. The ClassLoader still leaks, but at least it no longer holds onto large pooled resources.

Note that destroy() does not do what it claims. It used to set slowThreadLocalMap = null, which given enough activity on a thread may eventually GC the class loader. The current call to remove() is mostly pointless as it only impacts the current thread.

larry-safran added a commit to larry-safran/grpc-java that referenced this issue Sep 7, 2023
DNVindhya pushed a commit to DNVindhya/grpc-java that referenced this issue Oct 5, 2023
@ejona86 ejona86 modified the milestones: Next, 1.59 Nov 13, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2024