High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871] #20055

dreis2211 · 2021-12-09T14:09:02Z

Hi,

since the upgrade from Hazelcast 3 to Hazelcast 4 one of our applications is showing a steady increase in CPU over time.

The drops in CPU are caused by restarts (mostly because of deployments) but also to not let the CPU grow indefinitely. The seemingly lower values on the end are just caused by us restarting the thing after several unsuccessful tries to get to the bottom of the problem. The CPU is already rising again.

After some investigation I found out that several collections are not drained and seem to constantly grow and thus more and more time is spent on them.

1. OperationParkerImpl

2. InvocationMonitor / InvocationRegistry

Furthermore, the latency on all operations seems to be fairly high, but for ReadManyOperations it's 16 years(!) - for a server that was only running for a couple of hours at the time of checking the heap dump. And the maxMicros seems to be at Integer.MAX - I'm not sure what happens here (int-max is half an hour i think).

I can explain that ReadManyOperations has likely the biggest numbers here, because we use quite some ReliableTopics (multiple thousands) and thus eventually (Reliable)MessageRunner is doing ringbuffer.readManyAsync at some point. (Also likely the reason why RingBufferService is the biggest entry in our dump)

Our setup consists of just two nodes with 8 cores and 10GB RAM each. As you can see in the first graph on the left that was plenty enough for our application.

Unfortunately, I'm not able to share the application nor reproduction steps, but I hope I could show some neuralgic points in the dumps and flamegraphs. Please help out, because I'm out of ideas at the moment.

Cheers,
Christoph

The text was updated successfully, but these errors were encountered:

dreis2211 · 2021-12-09T22:29:23Z

Maybe some helpful additional configs we specify in various places:

RingbufferConfig ringbufferConfig = new RingbufferConfig("*");
ringbufferConfig.setCapacity(1000);
ringbufferConfig.setTimeToLiveSeconds(60);
hazelcast.getConfig().addRingBufferConfig(ringbufferConfig);

ReliableTopicConfig topicConfig = new ReliableTopicConfig("*");
topicConfig.setReadBatchSize(100);
topicConfig.setTopicOverloadPolicy(TopicOverloadPolicy.DISCARD_OLDEST);
hazelcast.getConfig().addReliableTopicConfig(topicConfig);

And

MapConfig cacheConfig = new MapConfig(name);
cacheConfig.setTimeToLiveSeconds(6 * 60);
cacheConfig.getEvictionConfig().setEvictionPolicy(EvictionPolicy.LRU);
cacheConfig.setBackupCount(1);
cacheConfig.setAsyncBackupCount(0);

NearCacheConfig nearCacheConfig = new NearCacheConfig();
nearCacheConfig.setMaxIdleSeconds(300);
nearCacheConfig.setInvalidateOnChange(true);

cacheConfig.setNearCacheConfig(nearCacheConfig);

hazelcast.getConfig().addMapConfig(cacheConfig);

dreis2211 · 2021-12-21T22:26:21Z

Interestingly, also the network graphs differ since the upgrade to 4.1.x...

Is there maybe some batching that should happen but isn't?

dreis2211 · 2022-01-05T17:41:24Z

@ramizdundar Any insights already by any chance? This is getting to a point where our application becomes unresponsive now...Is this maybe related to #19630?

ramizdundar · 2022-01-06T10:04:31Z

Unfortunately, no. Because BETA release is close, I believe I won't have much time to look into this. I will try to dive deeper next week.

dreis2211 · 2022-01-07T22:19:01Z

I've tried the latest 4.2.x with the same result, so I changed the title accordingly.

vbekiaris · 2022-01-20T15:37:35Z

@dreis2211 can you post the exact 4.x versions you tested with? (I am particularly interested if your issue Is reproducible with the initial 4.0 release?)

Buildup of Invocations in InvocationRegistry#invocations implies operations are submitted but not completed, so their result is still pending. Do you by any chance run your Hazelcast members with diagnostics turned on? Could you share diagnostics?
Also, are the Hazelcast members embedded in your app or is your app using reliable topics/maps/etc via a Hazelcast client?

dreis2211 · 2022-01-20T16:01:46Z

@vbekiaris As it's essentially a Spring-Boot application that was upgraded to (at that point) 2.5.5 from 2.3.x the initial broken version is using 4.1.5 - see https://github.com/spring-projects/spring-boot/blob/34677508a76d0decad4ba2ced46ccfcf6804fb4a/spring-boot-project/spring-boot-dependencies/build.gradle#L390. We are now on 4.2.4 without any noticeable impact. The application keeps getting unresponsive after the CPU has grown too much. So we were never on 4.0.

We're using a relative vanilla setup with everything autoconfigured by SB, so diagnostics are likely disabled. We're also not using any client as we are building up instances via Hazelcast#newHazelcastInstance. Hope that helps you.

I'm still wondering if this is somehow connected to #19630 which never got backported to 4.x (and its connected PRs). Especially because OperationParkerImpl.waitSetMap is also relatively large as noted in the first message... The invocations are PartitionInvocations with op pointing to ReadManyOperation mostly

github-actions · 2022-02-01T11:01:37Z

Internal Jira issue: HZ-871

pveentjer · 2022-02-02T07:41:16Z

There are several thousand topics. If there is a single ReadManyOperations per topic, there should be several thousand ReadManyOperations within the OperationParker.

But since there are nearly 100K ReadManyOperations, there must be many concurrent invocations per topic!

I'm not sure what the cause is of this isue.

If it is just a problematic usage of the API, we can't do much about it.

If there is a memory leak that causes ReadManyOperations the be retained, it must be addressed by us.

Can you provide some information how many concurrent requests you have per topic?

dreis2211 · 2022-02-02T07:43:06Z

@pveentjer How would I get that information?

If it would be problematic usage of the API why would this work with 3.x then? I would expect that there is no major regression if the changelog and migration guides don't really mention anything particular in that regards.

pveentjer · 2022-02-02T07:51:54Z

Do you know the maximum number of concurrent read requests you have over all topics? Is that a few thousand or close to 100K?

pveentjer · 2022-02-02T07:55:36Z

What kind of profiler are you using? Could you run with async profiler and see what is going on? It will give a good impression of where time is spent. The ratio between GC and non GC activities is also very useful.

I can help you set it up.

dreis2211 · 2022-02-02T07:57:08Z

@pveentjer Please look at the screenshots above. They are from async-profiler.

dreis2211 · 2022-02-02T07:58:13Z

@pveentjer I'd like to ask again - how would I get the information about topics and concurrent requests per topic?

dreis2211 · 2022-02-02T18:29:24Z

flamegraph2.html.zip

Keep in mind that this is still very early on in terms of rising CPU levels. More data will follow tomorrow, but I thought I do one today as well.

vbekiaris · 2022-02-03T09:28:08Z

Apparently, this has been fixed in 5.x with #19462 but was again not backported to 4.x. Any chance this could happen?

Yes, we will prepare backports for 4.2.z and 4.1.z

dreis2211 · 2022-02-03T10:16:05Z

Not really new stuff, but here is a profiling from this morning with -XX:+DebugNonSafepoints where the CPU has shifted again to handle operations and invocations mostly:
flamegraph3.html.zip

Regarding the big libjvm block we're trying to get debug symbols up and running to get further insights, but I assume the operation and invocation handling is also pressuring the GC a bit

dreis2211 · 2022-02-04T07:42:52Z

As we're not having debug symbols yet, I used jstat to see if we spent too much time in GCs.

host:~$ jstat -gcutil <pid> 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
  0,00 100,00  52,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  62,86  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  72,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  86,67  88,20  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00   2,56  88,97  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  18,80  89,06  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  35,90  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  47,01  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  55,56  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  64,96  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100

I can't see too much GC activity here. Yes, eden space is filling up in let's say 10% on average in every second and we have a young GC every ~6-8 seconds, but that doesn't seem too bad. Especially if one can trust the timings at the end there's not much added in the 10 seconds interval (60ms)

dreis2211 · 2022-02-09T16:10:48Z

Although not a very big block, I just saw that InvocationMonitor$MonitorInvocationsTask.detectAndHandleTimeout is fairly new and impacts the invocation monitoring badly - more specifically the comparison of addresses.

This seems to have been introduced with #17734, but has also been backported already to 3.12.13. We were on Hazelcast 3.12.11 before the upgrade, so we were never impacted by this change. I wonder if this could be optimized again - especially since Invocation.Context says the following:

* The {@link Context} contains all 'static' dependencies for an Invocation; dependencies that don't
* change between invocations....

Shouldn't it be safe then to have a local field on Invocation that is only filled once?

dreis2211 · 2022-02-10T07:44:18Z

I might miss something, but I wrote the following test on our end to imitate the behaviour of MessageRunner and this test timeouts. Isn't this what MessageRunner is doing internally basically when calling next (apart from using whenCompleteAsync)? The test below works fine if minSize is 0 instead of 1 when calling readManyAsync. On 3.12.11 it doesn't timeout (you have to change toCompletableFuture().get() to just get() though).

@Test
public void imitateMessageRunnerBehaviour() throws Exception {
	RingbufferConfig ringbufferConfig = new RingbufferConfig("test");
	ringbufferConfig.setCapacity(1000);
	ringbufferConfig.setTimeToLiveSeconds(60);
	hazelcastInstance.getConfig().addRingBufferConfig(ringbufferConfig);
	Ringbuffer<String> ringbuffer = hazelcastInstance.getRingbuffer("test");

	long sequence = ringbuffer.tailSequence() + 1;
	System.out.println("Initial sequence: " + sequence);
	ringbuffer.add("a");

	ReadResultSet<String> result = ringbuffer.readManyAsync(sequence, 1, 10, null).toCompletableFuture().get();
	assertThat(result).containsOnly("a");
	System.out.println("=======");
	System.out.println("Current sequence: " + sequence);
	System.out.println("Read count: " + result.readCount());
	System.out.println("Next sequence: " + result.getNextSequenceToReadFrom());
	sequence = result.getNextSequenceToReadFrom();

	long nextSequence = sequence;
	assertTimeoutPreemptively(Duration.ofSeconds(10), () -> {
		ReadResultSet<String> r = ringbuffer.readManyAsync(nextSequence, 1, 10, null).toCompletableFuture().get();
		System.out.println("=======");
		System.out.println("Sequence: " + nextSequence);
		System.out.println("Read count: " + r.readCount());
		System.out.println("Next sequence: " + r.getNextSequenceToReadFrom());
		assertThat(r).containsOnly("a");
	});
}

As background: We're using ReliableTopicProxy.(add|remove)MessageListener and I wonder if they cause this massive amounts of ReadManyOperation (and also the high latency when the test above is not bogus...).

EDIT: I see weird things in the heap dump that forced me to investigate this in the first place. I see ReliableMessageRunner instances with the cancelled flag set to true, while their internal runnersMap shows a ReliableMessageRunner (maybe the same one(?)) with cancelled on false. I wonder if that's just caused by different threads not seeing the data modifications or if there is another problem.

pveentjer · 2022-02-11T09:03:27Z

As we're not having debug symbols yet, I used jstat to see if we spent too much time in GCs.

host:~$ jstat -gcutil <pid> 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
  0,00 100,00  52,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  62,86  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  72,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  86,67  88,20  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00   2,56  88,97  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  18,80  89,06  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  35,90  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  47,01  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  55,56  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  64,96  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100

I can't see too much GC activity here. Yes, eden space is filling up in let's say 10% on average in every second and we have a young GC every ~6-8 seconds, but that doesn't seem too bad. Especially if one can trust the timings at the end there's not much added in the 10 seconds interval (60ms)

Based on the async profiler HTML file, it seems that in 38% of the collected CPU samples, GC is the cause. It would be good to have the debug symbols, but pretty sure this is GC.

dreis2211 · 2022-02-11T09:07:50Z

Well, this is just libjvm stuff - not necessarily GC. It could be method compilation etc (also unlikely since jstat is not showing this either)... It is likely GC, but since jstat is not reflecting this, I'm at least still skeptical until we have proof.

dreis2211 · 2022-02-11T09:11:46Z

The problem described above is btw similar to #17061 . We end up with invocations that never really vanish under certain circumstances even though the MessageRunner might be cancelled already.

pveentjer · 2022-02-11T09:16:02Z

Method compilation should be be out of the way pretty quickly. So I would be surprised if that would be the cause. I typically see the GC activities as red blocks next to the application; hence I'm pretty sure it is GC related. But.. best to get the the debug symbols so we know for sure.

Let me check the ReliableTopicProxy.(add|remove)MessageListener path.

I had a look. So what happens is that listener is unregistered. Once the result of its pending invocation is returned, no new invocation is submitted.

So if you would create/destroy listeners, and they have a pending invocation to get data from the ringbuffer, you could end up with an accumulation of pending calls.

pveentjer · 2022-02-11T09:32:46Z

EDIT: I see weird things in the heap dump that forced me to investigate this in the first place. I see ReliableMessageRunner instances with the cancelled flag set to true, while their internal runnersMap shows a ReliableMessageRunner (maybe the same one(?)) with cancelled on false. I wonder if that's just caused by different threads not seeing the data modifications or if there is another problem.

The runnersMap is not internal to the ReliableMessageRunner, the runnersMap is internal to the ReliableTopicProxy. As soon as the ReliableMessageRunner is cancelled (when the topic listener is removed), it is removed from the runnersMap. Although one could be able to shortly see a cancelled ReliableMessageRunner in the runnersMap since the cancelledFlag is set to true before it is removed from the RunnersMap.

The cancelledFlag is volatile and therefore excluded from any visibility problems.

dreis2211 · 2022-02-11T09:34:49Z

@pveentjer Is there maybe a way to cancel these invocations when the message listener is cancelled? Or - as #17061 suggests to have a timeout mechanism there that eventually will expire the invocation on its own?

pveentjer · 2022-02-11T09:35:08Z

Perhaps the following could happen:

You create a message listener on some topic. It causes an invocation on the remote system waiting for data.

You remove the listener.

But the invocation of the listener isn't canceled; since it is just waiting for data that doesn't arrive. And as a consequence, the canceled flag isn't seen. If data would be found, the invocation would complete and the canceled flag would prevent a new invocation to be sent.

And over time you get an accumulation of useless invocation waiting for data that never arrives.

dreis2211 · 2022-02-11T10:52:22Z

@pveentjer I'd like to repeat my question. Do you see any way forward for those invocations? E.g. cancelling them when the message listener is removed and/or the runner is cancelled? Or passing timeouts to them, so they eventually get cleaned up?

pveentjer · 2022-02-11T12:10:57Z

AFAIK there currently is no way for them to be canceled. Cancellation of the future (which could be obtained internally) won't solve the problem. we could remove the invocation from the invocation service, but that only solves the 'local' side of the problem and not the remote side. The end result would be the same.

I think the most sensible approach would be to add a new method to the readManyAsync method of the Ringbuffer with an explicit timeout (which defaults to Long.MAX_VALUE). This way the call is going to return to the MessageRunner when there is data or when there is a timeout. And in the latter case, the canceled status could be checked before resubmitting.

This won't cause any wire-level changes.

Ideally, there would be cancellation logic; so that operation can explicitly be canceled on the remote side. But this is going to lead to wire-level changes.

How often do you register and unregister listeners? Because if you have so many, then this must happen very frequently.

dreis2211 · 2022-02-11T13:16:21Z

This happens indeed fairly frequently. For every topic there will be at least one register and unregister. And we create many topics during the lifecycle of the app (it's essentially a chat with many channels and users that come and go that drives the amount of topics). Which brings me to the question if there's an easy way to get the amount of topics in a Hazelcast cluster?

In fact there is a pooling mechanism to don't create too many topics where I would likely put our own metrics if there are none from Hazelcast itself.

@Service
@Slf4j
public class HazelcastTopicPool {

	private final HazelcastInstance instance;

	private final IQueue<String> pool;

	private final IMap<String, String> map;

	public HazelcastTopicPool(HazelcastInstance instance) {
		this.instance = instance;
		this.pool = instance.getQueue("topics-pool");
		this.map = instance.getMap("topics");
		this.map.addLocalEntryListener(new CleanupListener(this));
	}

	public <T> ITopic<T> getOrCreateTopic(String topic) {
		// Check if topic already exists - if yes just return.
		return this.<T>getTopic(topic)
			.orElseGet(() -> {

				// Try to receive a free hazelcast topic from the pool - otherwise generate a new.
				String freeTopic = pool.poll();
				if (freeTopic == null) {
                                          // Has been com.hazelcast.util.UuidUtil.newUnsecureUUID() on version 3.x
					freeTopic = UUID.randomUUID().toString();
				}

				// Mark the free topic as used.
				String oldValue = map.putIfAbsent(topic, freeTopic, Duration.ofMinutes(2).toMillis(), TimeUnit.MILLISECONDS);
				if (oldValue != null) {
					pool.add(freeTopic);
					return instance.getReliableTopic(oldValue);
				} else {
					return instance.getReliableTopic(freeTopic);
				}
			});
	}

	public <T> Optional<ITopic<T>> getTopic(String topic) {
		for (int i = 0; i < 100; i++) {
			String existing = map.get(topic);
			if (existing == null) {
				return Optional.empty();
			}
			if (map.replace(topic, existing, existing)) {
				return Optional.of(instance.getReliableTopic(existing));
			}
		}

		throw new IllegalStateException("Cannot retrieve topic!");
	}

	<T> boolean refreshTopic(String topic, ITopic<T> iTopic) {
		String name = iTopic.getName();
		return map.replace(topic, name, name);
	}

	private void cleanupTopic(String topic, String oldTopic) {
		map.remove(topic);
		pool.add(oldTopic);
	}

	private static class CleanupListener implements EntryEvictedListener<String, String>, EntryExpiredListener<String, String> {

		private final HazelcastTopicPool pool;

		private CleanupListener(HazelcastTopicPool pool) {
			this.pool = pool;
		}

		@Override
		public void entryEvicted(EntryEvent<String, String> event) {
			log.info("Evicted topic: {}", event.getKey());
			pool.cleanupTopic(event.getKey(), event.getOldValue());
		}

		@Override
		public void entryExpired(EntryEvent<String, String> event) {
			log.info("Expired topic: {}", event.getKey());
			pool.cleanupTopic(event.getKey(), event.getOldValue());
		}
	}

}

In fact there was a bug in there that we didn't cleanup the topic (names) anymore and return them to the pool as we only used EntryEvictedListener, which broke the expiry handling with the upgrade from 3 to 4. Btw. this wasn't really mentioned in upgrade notes or release notes prominently enough. But this is fixed already and didn't show any noticeable impact unfortunately.

Long.MAX_VALUE should be way too many years to be a sensible default, don't you think? This would be as good as having no timeout. This new method is the way forward I think, but it should be also configurable what timeout is passed.

pveentjer · 2022-02-11T13:24:05Z

It is a sensible default for blocking calls without a timeout since there is no timeout and Long.MAX_VALUE is a good approximation of infinite.

In case of ITopic that is another question. So I guess a few seconds timout by default should be good enough for most cases without causing excessive load on the system unless there are a huge number of topic listeners. And I agree that having a configurable timeout makes sense.

dreis2211 · 2022-02-15T17:10:27Z

I checked a couple of things because I was still surprised what the difference between Hazelcast 3 and 4 was. In fact, the stuck invocations can happen both on 3 & 4.

After more digging I found another behaviour change that broke the HazelcastTopicPool posted above in yet another way. Calling IMap.replace will essentially remove the previously set TTL. E.g. consider the following stripped down example:

HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance();
IMap<String, String> topics = hazelcastInstance.getMap("topics");
topics.addLocalEntryListener(new EntryExpiredListener<String, String>() {
	@Override
	public void entryExpired(EntryEvent<String, String> event) {
		System.out.println("Expired: " + event);
	}
});
String key = "key";
topics.putIfAbsent(key, "value", 2000, TimeUnit.MILLISECONDS);
EntryView ev = topics.getEntryView(key);
System.out.println("====== FIRST ENTRY VIEW ======");
System.out.println("Expiry: " + new Date(ev.getExpirationTime()));
System.out.println("Update: " + new Date(ev.getLastUpdateTime()));
System.out.println("TTL: " + ev.getTtl());

String existingValue = topics.get(key);
System.out.println(existingValue);
Thread.sleep(1000);
if (existingValue != null) {
	topics.replace(key, existingValue, existingValue);
	System.out.println("====== SECOND ENTRY VIEW ======");
	ev = topics.getEntryView(key);
	System.out.println("Expiry: " + new Date(ev.getExpirationTime()));
	System.out.println("Update: " + new Date(ev.getLastUpdateTime()));
	System.out.println("TTL: " + ev.getTtl());
}

System.out.println("===== AFTERMATH =====");
Thread.sleep(5000);
System.out.println("Value:" + topics.get(key));

On Hazelcast 4.2.4 this code will output:

====== FIRST ENTRY VIEW ======
Expiry: Tue Feb 15 17:34:25 CET 2022
Update: Thu Jan 01 00:59:59 CET 1970
TTL: 2000
value
====== SECOND ENTRY VIEW ======
Expiry: Sun Aug 17 08:12:55 CET 292278994 // This is wrong
Update: Thu Jan 01 00:59:59 CET 1970
TTL: 9223372036854775807
===== AFTERMATH =====
Value:value

On Hazelcast 3.12.11 however this is shown:

====== FIRST ENTRY VIEW ======
Expiry: Tue Feb 15 17:45:12 CET 2022
Update: Tue Feb 15 17:45:10 CET 2022
TTL: 2000
value
====== SECOND ENTRY VIEW ======
Expiry: Tue Feb 15 17:45:13 CET 2022
Update: Tue Feb 15 17:45:11 CET 2022
TTL: 2000
===== AFTERMATH =====
Expired: EntryEvent{entryEventType=EXPIRED ..., name='topics', key=key, oldValue=value, value=null, mergingValue=null}
Value:null

As you can see on Hazelcast 4.2.4 the time-to-live is basically reset to a value that never really expires. (Not speaking of the last update time being weird.)

My workaround for Hazelcast 4.2.4 would be to add a MapConfig with setTimeToLiveSeconds for the topics map rather than the individual timeout on putIfAbsent. I'd appreciate if you could say that this is safe to use.

The effect of this bug is that the above mentioned HazelcastTopicPool will never receive expiry events when refreshTopic is called at anytime before the initial timeout and thus the pooling of topic names will never work. As the underlying IQueue.poll will always return null, a new random topic UUID is created. Eventually leading to more ReliableTopicProxy instances and more ringbuffers and more invocations etc....

I didn't find anything in the migration guides nor in the release notes that would describe that behaviour change - if it was intended after all. With the EntryEvictedListener change mentioned in an earlier comment not handling expiry events anymore, this makes it already two things that are not reflected in any migration guide and create an incredibly frustrating upgrade experience, if you don't mind me saying. Admittedly, our pooling is somewhat unique, but adding a local map listener or having TTLs isn't.

The workaround of adding a MapConfig is not yet deployed, so I can't confirm yet that this fixes our issues. But I wanted to let you know already.

Cheers,
Christoph

dreis2211 · 2022-02-26T16:34:45Z

The fix to configure a MapConfig with setTimeToLiveSeconds for the topics map inside our HazelcastTopicPool rather than the individual timeout on putIfAbsent seems to work.

I've also applied an additional publish of a special empty message after we call ReliableTopicProxy.removeMessageListener to workaround the discussed problem of invocations waiting for incoming data that never arrives.

dreis2211 · 2022-03-04T09:14:11Z

I should note that during this investigations two PRs were created:

Holmistr · 2022-05-31T09:33:24Z

@dreis2211 Thanks for your continous work on this. Are you still hitting this issue?

dreis2211 · 2022-06-01T08:31:39Z

@Holmistr No, after I applied the mentioned workarounds we don't hit this anymore. But especially the special publish is quite hacky. But it works to stop invocations waiting for incoming data that never arrives.

The changes to the map expiry are somewhat surprising and you have to tell me if they are a bug or not.

dreis2211 added the Type: Defect label Dec 9, 2021

dreis2211 mentioned this issue Dec 10, 2021

Do not register the JmxPublisher by default, especially if jmx is not enabled #20018

Open

ramizdundar self-assigned this Dec 31, 2021

mmedenjak added Module: Invocation System Source: Community PR or issue was opened by a community user Team: Core Team: Performance Type: Perf. Defect labels Jan 3, 2022

dreis2211 changed the title ~~High latency and steady CPU increase since migration from 3.12.x to 4.1.x~~ High latency and steady CPU increase since migration from 3.12.x to 4.x Jan 7, 2022

ramizdundar removed their assignment Jan 21, 2022

AyberkSorgun added this to the 5.1 milestone Feb 1, 2022

AyberkSorgun assigned AyberkSorgun and pveentjer and unassigned AyberkSorgun Feb 1, 2022

AyberkSorgun added the to-jira label Feb 1, 2022

github-actions bot changed the title ~~High latency and steady CPU increase since migration from 3.12.x to 4.x~~ High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871] Feb 1, 2022

Holmistr modified the milestones: 5.1, 5.2 Feb 15, 2022

dreis2211 mentioned this issue Feb 21, 2022

Reduce allocations from InvocationMonitor #20772

Merged

6 tasks

Holmistr unassigned pveentjer May 10, 2022

AyberkSorgun modified the milestones: 5.2, 5.2 Backlog May 10, 2022

AyberkSorgun closed this as completed Jun 22, 2022

AyberkSorgun removed this from the 5.2 Backlog milestone Oct 12, 2022

High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871] #20055

High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871] #20055

Comments

dreis2211 commented Dec 9, 2021 • edited

dreis2211 commented Dec 9, 2021

dreis2211 commented Dec 21, 2021

dreis2211 commented Jan 5, 2022 • edited

ramizdundar commented Jan 6, 2022

dreis2211 commented Jan 7, 2022

vbekiaris commented Jan 20, 2022

dreis2211 commented Jan 20, 2022 • edited

github-actions bot commented Feb 1, 2022

pveentjer commented Feb 2, 2022

dreis2211 commented Feb 2, 2022 • edited

pveentjer commented Feb 2, 2022

pveentjer commented Feb 2, 2022

dreis2211 commented Feb 2, 2022

dreis2211 commented Feb 2, 2022

dreis2211 commented Feb 2, 2022

vbekiaris commented Feb 3, 2022

dreis2211 commented Feb 3, 2022

dreis2211 commented Feb 4, 2022 • edited

dreis2211 commented Feb 9, 2022 • edited

dreis2211 commented Feb 10, 2022 • edited

pveentjer commented Feb 11, 2022

dreis2211 commented Feb 11, 2022 • edited

dreis2211 commented Feb 11, 2022

pveentjer commented Feb 11, 2022 • edited

pveentjer commented Feb 11, 2022

dreis2211 commented Feb 11, 2022 • edited

pveentjer commented Feb 11, 2022 • edited

dreis2211 commented Feb 11, 2022 • edited

pveentjer commented Feb 11, 2022 • edited

dreis2211 commented Feb 11, 2022

pveentjer commented Feb 11, 2022 • edited

dreis2211 commented Feb 15, 2022 • edited

dreis2211 commented Feb 26, 2022

dreis2211 commented Mar 4, 2022

Holmistr commented May 31, 2022

dreis2211 commented Jun 1, 2022

dreis2211 commented Dec 9, 2021 •

edited

dreis2211 commented Jan 5, 2022 •

edited

dreis2211 commented Jan 20, 2022 •

edited

dreis2211 commented Feb 2, 2022 •

edited

dreis2211 commented Feb 4, 2022 •

edited

dreis2211 commented Feb 9, 2022 •

edited

dreis2211 commented Feb 10, 2022 •

edited

dreis2211 commented Feb 11, 2022 •

edited

pveentjer commented Feb 11, 2022 •

edited

dreis2211 commented Feb 11, 2022 •

edited

pveentjer commented Feb 11, 2022 •

edited

dreis2211 commented Feb 11, 2022 •

edited

pveentjer commented Feb 11, 2022 •

edited

pveentjer commented Feb 11, 2022 •

edited

dreis2211 commented Feb 15, 2022 •

edited