Large number of PoolThreadCache garbage objects #5671

Open

lokeshj1703 opened this issue May 5, 2019 · 18 comments

@lokeshj1703

We are using grpc-java version 1.16.0 in Apache Ratis for RPC communication. Apache Ratis is a Java implementation of the Raft consensus protocol and uses gRPC for server-to-server and client-to-server communication. In this scenario, ~10 KB requests arrive gradually from the client to the server, thousands of them, and the server replicates those requests to the other servers. There is also other small-sized metadata-related communication.

We are currently seeing a lot of GCs being triggered and a lot of garbage PoolThreadCache objects. Here is a snippet of the major unreachable heap objects.

org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache self 126Kb (< 0.1%), 2,021 object(s)
	.tinySubPageDirectCaches ↘ 173,584Kb (24.5%), 2,021 reference(s)
		org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$MemoryRegionCache[] self 284Kb (< 0.1%), 2,021 object(s)
			{array elements} ↘ 173,300Kb (24.4%), 64,672 reference(s)
				org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache self 2,021Kb (0.3%), 64,672 object(s)
					.queue ↘ 171,279Kb (24.1%), 64,672 reference(s)
						org.apache.ratis.thirdparty.io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue self 40,925Kb (5.8%), 64,672 object(s)
							.buffer ↘ 130,354Kb (18.4%), 64,672 reference(s)
								Object[] self 130,354Kb (18.4%), 64,672 object(s)

	.tinySubPageHeapCaches ↘ 173,584Kb (24.5%), 2,021 reference(s)
		org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$MemoryRegionCache[] self 284Kb (< 0.1%), 2,021 object(s)
			{array elements} ↘ 173,300Kb (24.4%), 64,672 reference(s)
				org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache self 2,021Kb (0.3%), 64,672 object(s)
					.queue ↘ 171,279Kb (24.1%), 64,672 reference(s)
						org.apache.ratis.thirdparty.io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue self 40,925Kb (5.8%), 64,672 object(s)
							.buffer ↘ 130,354Kb (18.4%), 64,672 reference(s)
								Object[] self 130,354Kb (18.4%), 64,672 object(s)

	.smallSubPageDirectCaches ↘ 13,641Kb (1.9%), 2,021 reference(s)
		org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$MemoryRegionCache[] self 63Kb (< 0.1%), 2,021 object(s)
			{array elements} ↘ 13,578Kb (1.9%), 8,084 reference(s)
				org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache self 252Kb (< 0.1%), 8,084 object(s)
					.queue ↘ 13,325Kb (1.9%), 8,084 reference(s)
						org.apache.ratis.thirdparty.io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue self 5,115Kb (0.7%), 8,084 object(s)
							.buffer ↘ 8,210Kb (1.2%), 8,084 reference(s)
								Object[] self 8,210Kb (1.2%), 8,084 object(s)

	.smallSubPageHeapCaches ↘ 13,641Kb (1.9%), 2,021 reference(s)
		org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$MemoryRegionCache[] self 63Kb (< 0.1%), 2,021 object(s)
			{array elements} ↘ 13,578Kb (1.9%), 8,084 reference(s)
				org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache self 252Kb (< 0.1%), 8,084 object(s)
					.queue ↘ 13,325Kb (1.9%), 8,084 reference(s)
						org.apache.ratis.thirdparty.io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue self 5,115Kb (0.7%), 8,084 object(s)
							.buffer ↘ 8,210Kb (1.2%), 8,084 reference(s)
								Object[] self 8,210Kb (1.2%), 8,084 object(s)

	.normalDirectCaches ↘ 5,699Kb (0.8%), 2,021 reference(s)
	
	.normalHeapCaches ↘ 5,699Kb (0.8%), 2,021 reference(s)

	.freed ↘ 31Kb (< 0.1%), 2,021 reference(s)

It seems like a lot of garbage is being generated via the thread caches.

@lokeshj1703
Author

I did some logging of the pool arenas in the PooledByteBufAllocator.DEFAULT instance. It seems like lots of thread caches are being created and then GC'ed.
Thread caches are being created even for heap arenas where no allocations are happening. Before GC each arena has ~444 thread caches, and after GC that number drops to ~39.
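For reference, logging like this can be done against Netty's allocator metrics API; a minimal sketch, assuming the unshaded io.netty class names (in Ratis the same classes live under org.apache.ratis.thirdparty.io.netty):

import io.netty.buffer.PoolArenaMetric;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public final class AllocatorMetricLogger {  // hypothetical helper class
    public static void logArenaStats() {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        System.out.println(metric);  // one-line summary like the one shown below
        for (PoolArenaMetric arena : metric.heapArenas()) {
            System.out.printf("heap arena: threadCaches=%d active=%d total=%d dealloc=%d%n",
                    arena.numThreadCaches(), arena.numActiveAllocations(),
                    arena.numAllocations(), arena.numDeallocations());
        }
        for (PoolArenaMetric arena : metric.directArenas()) {
            System.out.printf("direct arena: threadCaches=%d active=%d total=%d dealloc=%d%n",
                    arena.numThreadCaches(), arena.numActiveAllocations(),
                    arena.numAllocations(), arena.numDeallocations());
        }
    }
}
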

Before GC

PooledByteBufAllocatorMetric(usedHeapMemory: 0; usedDirectMemory: 67108864; numHeapArenas: 4; numDirectArenas: 4; tinyCacheSize: 512; smallCacheSize: 256; normalCacheSize: 64; numThreadLocalCaches: 1773; chunkSize: 16777216)
ACTIVE allocations: 0 tiny 0 small 0 normal 0 huge 0
TOTAL 0 tiny 0 small 0 normal 0 huge 0
DEALLOCATIONS 0 tiny 0 small 0 normal 0 huge 0
total caches: 444 subpages tiny 32 small 4
... for 3 other heap arenas similar stats

ACTIVE allocations: 651 tiny 35 small 526 normal 90 huge 0
TOTAL 599390 tiny 321 small 570115 normal 28954 huge 0
DEALLOCATIONS 598739 tiny 286 small 569589 normal 28864 huge 0
total caches: 444 subpages tiny 32 small 4

ACTIVE allocations: 590 tiny 69 small 478 normal 43 huge 0
TOTAL 737544 tiny 3538 small 568056 normal 165950 huge 0
DEALLOCATIONS 736954 tiny 3469 small 567578 normal 165907 huge 0
total caches: 443 subpages tiny 32 small 4

ACTIVE allocations: 658 tiny 141 small 476 normal 41 huge 0
TOTAL 582933 tiny 3113 small 563923 normal 15897 huge 0
DEALLOCATIONS 582275 tiny 2972 small 563447 normal 15856 huge 0
total caches: 443 subpages tiny 32 small 4

ACTIVE allocations: 587 tiny 38 small 486 normal 63 huge 0
TOTAL 601430 tiny 254 small 566369 normal 34807 huge 0
DEALLOCATIONS 600843 tiny 216 small 565883 normal 34744 huge 0
total caches: 443 subpages tiny 32 small 4

2019-05-05 11:45:02,375 DEBUG org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache: Freed {} thread-local buffer(s) from thread: {}
2019-05-05 11:45:02,376 DEBUG org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache: Freed {} thread-local buffer(s) from thread: {}
2019-05-05 11:45:02,376 DEBUG org.apache.ratis.thirdparty.io.netty.buffer.PoolThreadCache: Freed {} thread-local buffer(s) from thread: {}
...

After GC

ACTIVE allocations: 0 tiny 0 small 0 normal 0 huge 0
TOTAL 0 tiny 0 small 0 normal 0 huge 0
DEALLOCATIONS 0 tiny 0 small 0 normal 0 huge 0
total caches: 39 subpages tiny 32 small 4
... for 3 other heap arenas similar stats
ACTIVE allocations: 238 tiny 35 small 113 normal 90 huge 0
TOTAL 599423 tiny 321 small 570140 normal 28962 huge 0
DEALLOCATIONS 599185 tiny 286 small 570027 normal 28872 huge 0
total caches: 39 subpages tiny 32 small 4

ACTIVE allocations: 181 tiny 69 small 69 normal 43 huge 0
TOTAL 737574 tiny 3538 small 568086 normal 165950 huge 0
DEALLOCATIONS 737393 tiny 3469 small 568017 normal 165907 huge 0
total caches: 39 subpages tiny 32 small 4

ACTIVE allocations: 247 tiny 143 small 63 normal 41 huge 0
TOTAL 582966 tiny 3115 small 563948 normal 15903 huge 0
DEALLOCATIONS 582719 tiny 2972 small 563885 normal 15862 huge 0
total caches: 39 subpages tiny 32 small 4

ACTIVE allocations: 175 tiny 38 small 74 normal 63 huge 0
TOTAL 601460 tiny 254 small 566393 normal 34813 huge 0
DEALLOCATIONS 601285 tiny 216 small 566319 normal 34750 huge 0
total caches: 38 subpages tiny 32 small 4
{Heap before GC invocations=1116 (full 43):
 PSYoungGen      total 1150976K, used 1071616K [0x0000000760000000, 0x00000007bfe80000, 0x00000007c0000000)
  eden space 747520K, 100% used [0x0000000760000000,0x000000078da00000,0x000000078da00000)
  from space 403456K, 80% used [0x000000078da00000,0x00000007a1680000,0x00000007a6400000)
  to   space 396800K, 0% used [0x00000007a7b00000,0x00000007a7b00000,0x00000007bfe80000)
 ParOldGen       total 1572864K, used 682200K [0x0000000700000000, 0x0000000760000000, 0x0000000760000000)
  object space 1572864K, 43% used [0x0000000700000000,0x0000000729a362b8,0x0000000760000000)
 Metaspace       used 58169K, capacity 59914K, committed 60416K, reserved 1101824K
  class space    used 6994K, capacity 7287K, committed 7424K, reserved 1048576K
2019-05-05T11:45:01.798+0000: 6267.116: [GC (Allocation Failure) [PSYoungGen: 1071616K->349120K(1169408K)] 1753816K->1032800K(2742272K), 0.3386167 secs] [Times: user=0.36 sys=0.00, real=0.34 secs]
Heap after GC invocations=1116 (full 43):
 PSYoungGen      total 1169408K, used 349120K [0x0000000760000000, 0x00000007bff80000, 0x00000007c0000000)
  eden space 772608K, 0% used [0x0000000760000000,0x0000000760000000,0x000000078f280000)
  from space 396800K, 87% used [0x00000007a7b00000,0x00000007bcff0000,0x00000007bfe80000)
  to   space 399872K, 0% used [0x000000078f280000,0x000000078f280000,0x00000007a7900000)
 ParOldGen       total 1572864K, used 683680K [0x0000000700000000, 0x0000000760000000, 0x0000000760000000)
  object space 1572864K, 43% used [0x0000000700000000,0x0000000729ba82b8,0x0000000760000000)
 Metaspace       used 58169K, capacity 59914K, committed 60416K, reserved 1101824K
  class space    used 6994K, capacity 7287K, committed 7424K, reserved 1048576K
}

@lokeshj1703 changed the title from "Large number of PooledThreadCache garbage objects" to "Large number of PoolThreadCache garbage objects" on May 6, 2019
@ejona86
Member

ejona86 commented May 6, 2019

As an experiment, try setting -Dorg.apache.ratis.thirdparty.io.netty.allocator.useCacheForAllThreads=false when running your binary. This will disable caches outside of Netty's threads.

This may be related to #4317.
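(For illustration, a minimal sketch of setting that property programmatically; the class name is hypothetical, and the property only takes effect if it is set before the shaded PooledByteBufAllocator class is first initialized, so passing the -D flag on the command line is the safer route.)

public final class Launcher {  // hypothetical entry point
    public static void main(String[] args) {
        // Must run before the (shaded) allocator class is loaded, otherwise the
        // default has already been read; the -D JVM flag avoids this ordering concern.
        System.setProperty(
                "org.apache.ratis.thirdparty.io.netty.allocator.useCacheForAllThreads",
                "false");
        // ... start the Ratis / gRPC server here ...
    }
}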

@lokeshj1703
Author

@ejona86 Thanks for the parameter. It significantly reduces the GC pressure: the total GC time dropped from ~3 minutes to ~37 seconds.

There are still lots of thread caches getting associated with every pool. I could see 773 thread caches in every pool, which after GC went down to 109. While building the Netty server I am currently using separate executors for RPC processing and the worker event group. Which threads would these thread caches be associated with?

@ejona86
Member

ejona86 commented May 9, 2019

@lokeshj1703, with that parameter, Netty will still create a cache for every event loop thread. By default Netty creates (2 x cores) threads. So if you are creating two NioEventLoopGroups and have ~27 cores, then it would be about as expected. If by the RPC processing you mean ServerBuilder.executor() and threads created by a ThreadPoolExecutor, then those shouldn't count; only "Netty threads" should count. Event loop threads are created lazily, so not all of them may exist by the time you are monitoring.

You may be interested in the other "system properties" mentioned in https://netty.io/4.1/api/io/netty/buffer/PooledByteBufAllocator.html . The documentation is a bit terse; there are a few more comments sprinkled in the code. Remember to prefix them with org.apache.ratis.thirdparty. because of the shading.
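As a rough sketch of what counts as a "Netty thread" on the server side, here is a server built with an explicitly sized worker group, using the unshaded grpc-netty API (Ratis would use the relocated org.apache.ratis.thirdparty classes; the class and pool sizes are illustrative assumptions). The event loop group size, not the ServerBuilder.executor() pool, determines how many allocator thread caches can exist:

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import java.util.concurrent.Executors;

public final class SizedEventLoopServer {  // hypothetical helper
    public static Server build(int port) {
        // One thread cache per event loop thread; bounding the group bounds the caches.
        NioEventLoopGroup boss = new NioEventLoopGroup(1);
        NioEventLoopGroup worker = new NioEventLoopGroup(4);
        return NettyServerBuilder.forPort(port)
                .bossEventLoopGroup(boss)
                .workerEventLoopGroup(worker)
                .channelType(NioServerSocketChannel.class)
                // RPC processing threads; these do not get allocator caches
                // once useCacheForAllThreads=false is set.
                .executor(Executors.newFixedThreadPool(8))
                .build();
    }
}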

@ejona86
Member

ejona86 commented May 9, 2019

When re-reading "While building a netty server I am currently using separate executors for RPC processing and worker event group", it now seems obvious what you meant.

Only the worker event group would count there. If you use any ManagedChannels, then they could have their own event loop (we have a default one that is used if you don't provide one). The biggest question in determining whether the 109 number makes sense is how many cores you have.
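And on the client side, a sketch (again with unshaded grpc-netty names assumed) of giving a ManagedChannel an explicitly sized event loop group instead of relying on gRPC's shared default one, which makes the Netty thread count on that side predictable:

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public final class SizedEventLoopChannel {  // hypothetical helper
    public static ManagedChannel build(String target) {
        // A small, explicitly sized group instead of the default (2 x cores) threads.
        NioEventLoopGroup group = new NioEventLoopGroup(2);
        return NettyChannelBuilder.forTarget(target)
                .eventLoopGroup(group)
                .channelType(NioSocketChannel.class)
                .usePlaintext()
                .build();
    }
}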

@lokeshj1703
Author

lokeshj1703 commented May 9, 2019

@ejona86 Thanks a lot for the information!
We are using ManagedChannel as well, and the default event loop group is used there. The machine has 32 cores. That would make 64 event loop threads in the managed channel plus the remaining ~30 threads in the worker group for the Netty server.
But there are 4 direct memory arenas. Each arena after GC has 109 thread caches, so there are 109 x 4 = 436 thread caches in total after GC.
Before GC there were 773 thread caches per direct memory arena, i.e. 773 x 4 = 3092 thread caches in total. I think the total thread caches do not add up to the number of event loop threads?
Also, within 10-15 seconds the thread caches in each arena increase from 84 to 773 and then fall back to 109. That is another point of concern. Ideally the number of thread caches per direct memory arena should not vary to such an extent, right?

@ejona86
Member

ejona86 commented May 9, 2019

4 direct memory arenas... I'd expect two arenas from PooledByteBufAllocator.DEFAULT: one for direct memory and one for heap memory. I'd only expect more if there were more shaded/unshaded copies of Netty in your classpath (under a different package from org.apache.ratis.thirdparty.io.netty). Or if someone else is using Netty and calling new PooledByteBufAllocator(...).

How long-lived are the ManagedChannels? When there are no usages of our default event loops, we release them after 1 second. If there is at least one ManagedChannel alive at all times, then those threads wouldn't be re-created. It looks like Ratis's GrpcClientProtocolClient holds a ManagedChannel, so that would count. Which means you probably do have one of those that are long-lived...

Given you have set useCacheForAllThreads=false, I'd expect the number of Netty threads to remain constant, and thus the number of caches to remain constant.

I poked at Ratis. I'm sort of amazed. They properly shaded gRPC (at least) because they were using unstable APIs. Props to them.

@lokeshj1703
Author

In our case the default number of arenas turns out to be correct.

final int defaultMinNumArena = NettyRuntime.availableProcessors() * 2;
final int defaultChunkSize = DEFAULT_PAGE_SIZE << DEFAULT_MAX_ORDER;
DEFAULT_NUM_HEAP_ARENA = Math.max(0,
        SystemPropertyUtil.getInt(
                "org.apache.ratis.thirdparty.io.netty.allocator.numHeapArenas",
                (int) Math.min(
                        defaultMinNumArena,
                        runtime.maxMemory() / defaultChunkSize / 2 / 3)));

NettyRuntime.availableProcessors() = 2, runtime.maxMemory() = 4096 * 1048576 and defaultChunkSize = 16 * 1048576. Therefore the number of heap arenas turns out to be 4. Am I missing something?
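(A quick standalone check of that arithmetic, using the values reported above:)

public final class ArenaCountCheck {  // throwaway check, values as reported above
    public static void main(String[] args) {
        long maxMemory = 4096L * 1048576;   // runtime.maxMemory(), 4 GiB
        long chunkSize = 16L * 1048576;     // DEFAULT_PAGE_SIZE << DEFAULT_MAX_ORDER, 16 MiB
        int defaultMinNumArena = 2 * 2;     // NettyRuntime.availableProcessors() * 2
        long memoryBound = maxMemory / chunkSize / 2 / 3;  // = 42
        int numHeapArenas = Math.max(0, (int) Math.min(defaultMinNumArena, memoryBound));
        System.out.println(numHeapArenas);  // prints 4
    }
}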

It makes sense that the number of thread caches should stay constant. I'll look into it. Please let me know if you have any pointers on where to start.

@ejona86
Member

ejona86 commented May 10, 2019

@lokeshj1703, you're right. I was thinking a different concept was the arena.

I would start by looking at Netty thread creation/death, since that can be pretty easy to notice with just two thread dumps. I am pretty sure the Netty threads will by default all have unique names. That won't be true when someone specifies a ThreadFactory that chooses its own name, though.
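A simple way to watch that is to snapshot live thread names at two points in time and compare; a plain-JDK sketch (the name filters are assumptions, adjust them to whatever prefixes an actual thread dump shows):

public final class NettyThreadLister {  // hypothetical helper
    public static void printEventLoopThreads() {
        Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                // Assumed name patterns; Netty's DefaultThreadFactory typically
                // produces names like "nioEventLoopGroup-2-1".
                .filter(name -> name.contains("EventLoop") || name.contains("grpc"))
                .sorted()
                .forEach(System.out::println);
    }
}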

@ejona86 self-assigned this on May 28, 2019
@zhangkun83 added this to the Next milestone on Jun 12, 2019
@zandegran

We are also affected by this memory leak: large amounts of io.netty.util.internal.shaded.org.jctools.queues.MpscArrayQueue.buffer objects, up to 30% of the heap.

@zhoufeilongjava

netty/netty@94d7557
I am also affected by a memory leak via WeakOrderQueue.

@yosezky

yosezky commented Sep 22, 2020

@ejona86 Hi. I wonder if you could help me out with a possibly related problem that I've encountered.
A lot of objects of the same type are accumulating in memory without being GC'ed:
[screenshot]
we're using:
implementation "io.grpc:grpc-netty:1.27.0"
implementation "io.grpc:grpc-netty-shaded:1.27.0"

Maybe I can try to use:
-Dio.grpc.netty.shaded.io.netty.allocator.useCacheForAllThreads=false
or something like this?

@ejona86
Member

ejona86 commented Sep 22, 2020

@yosezky, if I'm reading that right, then "32-512 items" and "350 KB" would not be "a lot of objects" for most people. I don't see any obvious problem demonstrated in your screenshot.

@yosezky

yosezky commented Sep 23, 2020

@ejona86 Thanks for your reply!
The screenshot indeed doesn't look problematic, as you said. The problem is the number of these objects that we see in that screenshot. The following screenshot shows it better (different dump):
[screenshot]
I think that many of these Object[] instances are just arrays of null values, and they are also related to these caches.
I've just now realised that the service didn't run long enough to really demonstrate the memory leak. I'll monitor it for several more days.

@MajorHe1

> I've just now realised that the service didn't run long enough to really demonstrate the memory leak. I'll monitor it for several more days.

Hello, is there any progress on solving this problem?

@ThoSap

ThoSap commented May 27, 2022

I have the same issue.

P.S. Setting the system property -Dio.netty.allocator.useCacheForAllThreads=false fixed the issue.

@maybeyj

maybeyj commented Dec 21, 2023

Why? I have the same issue.
[screenshot]

@qsLI

qsLI commented Jan 18, 2024

> As an experiment, try setting -Dorg.apache.ratis.thirdparty.io.netty.allocator.useCacheForAllThreads=false when running your binary. This will disable caches outside of Netty's threads.
>
> This may be related to #4317.

What is the side effect? Will contention become more intense during direct ByteBuffer allocation?

Why don't we allocate direct byte buffers in the event loop thread, like Lettuce does?
