
High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871] #20055

Closed
dreis2211 opened this issue Dec 9, 2021 · 49 comments


@dreis2211
Contributor

dreis2211 commented Dec 9, 2021

Hi,

Since the upgrade from Hazelcast 3 to Hazelcast 4, one of our applications has been showing a steady increase in CPU over time.

[screenshot]

The drops in CPU are caused by restarts (mostly due to deployments), but we also restart to keep the CPU from growing indefinitely. The seemingly lower values at the end are just us restarting the application after several unsuccessful attempts to get to the bottom of the problem. The CPU is already rising again.

After some investigation I found that several collections are not drained and seem to grow constantly, so more and more time is spent on them.

1. OperationParkerImpl
[screenshot]
[screenshot]

2. InvocationMonitor / InvocationRegistry
[screenshot]
[screenshot]

Furthermore, the latency of all operations seems fairly high, but for ReadManyOperation it is 16 years(!) - on a server that had only been running for a couple of hours at the time the heap dump was taken. And maxMicros seems to be at Integer.MAX_VALUE - I'm not sure what happens here (Integer.MAX_VALUE microseconds is roughly half an hour, I think).
[screenshot]

It makes sense that ReadManyOperation has the biggest numbers here, because we use quite a few ReliableTopics (multiple thousands), so eventually (Reliable)MessageRunner calls ringbuffer.readManyAsync. (That is also likely the reason why RingbufferService is the biggest entry in our dump.)
[screenshot]
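
For reference, our usage of each topic roughly boils down to the following (a simplified sketch; handle(...) is just a placeholder for our own message handling):

ITopic<String> topic = hazelcast.getReliableTopic("some-channel");
// Registering a listener starts a (Reliable)MessageRunner internally, which keeps
// polling the backing ringbuffer via readManyAsync until the listener is removed.
UUID registrationId = topic.addMessageListener(message -> handle(message.getMessageObject()));
// ...later, when the channel is no longer needed:
topic.removeMessageListener(registrationId);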

Our setup consists of just two nodes with 8 cores and 10 GB RAM each. As you can see on the left of the first graph, that was plenty for our application.

Unfortunately, I'm not able to share the application or reproduction steps, but I hope the dumps and flame graphs point to some hot spots. Please help out, because I'm out of ideas at the moment.

Cheers,
Christoph

@dreis2211
Contributor Author

Some additional configs we specify in various places that may be helpful:

RingbufferConfig ringbufferConfig = new RingbufferConfig("*");
ringbufferConfig.setCapacity(1000);
ringbufferConfig.setTimeToLiveSeconds(60);
hazelcast.getConfig().addRingBufferConfig(ringbufferConfig);

ReliableTopicConfig topicConfig = new ReliableTopicConfig("*");
topicConfig.setReadBatchSize(100);
topicConfig.setTopicOverloadPolicy(TopicOverloadPolicy.DISCARD_OLDEST);
hazelcast.getConfig().addReliableTopicConfig(topicConfig);

And

MapConfig cacheConfig = new MapConfig(name);
cacheConfig.setTimeToLiveSeconds(6 * 60);
cacheConfig.getEvictionConfig().setEvictionPolicy(EvictionPolicy.LRU);
cacheConfig.setBackupCount(1);
cacheConfig.setAsyncBackupCount(0);

NearCacheConfig nearCacheConfig = new NearCacheConfig();
nearCacheConfig.setMaxIdleSeconds(300);
nearCacheConfig.setInvalidateOnChange(true);

cacheConfig.setNearCacheConfig(nearCacheConfig);

hazelcast.getConfig().addMapConfig(cacheConfig);

@dreis2211
Contributor Author

Interestingly, the network graphs also differ since the upgrade to 4.1.x...
[screenshot]
Is there maybe some batching that should happen but isn't?

@dreis2211
Contributor Author

dreis2211 commented Jan 5, 2022

@ramizdundar Any insights yet, by any chance? This is getting to the point where our application becomes unresponsive... Could this be related to #19630?

@ramizdundar
Contributor

Unfortunately, no. Because the BETA release is close, I believe I won't have much time to look into this. I will try to dive deeper next week.

@dreis2211 dreis2211 changed the title from "High latency and steady CPU increase since migration from 3.12.x to 4.1.x" to "High latency and steady CPU increase since migration from 3.12.x to 4.x" on Jan 7, 2022
@dreis2211
Contributor Author

I've tried the latest 4.2.x with the same result, so I changed the title accordingly.

@vbekiaris
Collaborator

@dreis2211 Can you post the exact 4.x versions you tested with? (I am particularly interested in whether your issue is reproducible with the initial 4.0 release.)

The buildup of Invocations in InvocationRegistry#invocations implies operations are submitted but not completed, so their results are still pending. Do you by any chance run your Hazelcast members with diagnostics turned on? If so, could you share the diagnostics?
Also, are the Hazelcast members embedded in your app, or does your app use reliable topics/maps/etc. via a Hazelcast client?
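
If diagnostics are not enabled yet, they can be turned on via the hazelcast.diagnostics.* properties - a minimal sketch for an embedded member (the directory value is just an example; setting -Dhazelcast.diagnostics.enabled=true on the JVM command line works as well):

Config config = new Config();
// Enables the diagnostics subsystem; plain-text diagnostics files are written periodically.
config.setProperty("hazelcast.diagnostics.enabled", "true");
// Optional: where the diagnostics files should be written.
config.setProperty("hazelcast.diagnostics.directory", "/tmp/hz-diagnostics");
HazelcastInstance member = Hazelcast.newHazelcastInstance(config);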

@dreis2211
Contributor Author

dreis2211 commented Jan 20, 2022

@vbekiaris As it's essentially a Spring Boot application that was upgraded from 2.3.x to (at that point) 2.5.5, the initially broken version was using 4.1.5 - see https://github.com/spring-projects/spring-boot/blob/34677508a76d0decad4ba2ced46ccfcf6804fb4a/spring-boot-project/spring-boot-dependencies/build.gradle#L390. We are now on 4.2.4 without any noticeable improvement. The application keeps becoming unresponsive after the CPU has grown too much. So we were never on 4.0.

We're using a relatively vanilla setup with everything auto-configured by Spring Boot, so diagnostics are likely disabled. We're also not using any client, as we create embedded instances via Hazelcast#newHazelcastInstance. Hope that helps.

I'm still wondering if this is somehow connected to #19630 (and its related PRs), which never got backported to 4.x - especially because OperationParkerImpl.waitSetMap is also relatively large, as noted in the first message. The invocations are mostly PartitionInvocations with op pointing to ReadManyOperation.

@ramizdundar ramizdundar removed their assignment Jan 21, 2022
@AyberkSorgun AyberkSorgun added this to the 5.1 milestone Feb 1, 2022
@github-actions github-actions bot changed the title from "High latency and steady CPU increase since migration from 3.12.x to 4.x" to "High latency and steady CPU increase since migration from 3.12.x to 4.x [HZ-871]" on Feb 1, 2022
@github-actions
Contributor

github-actions bot commented Feb 1, 2022

Internal Jira issue: HZ-871

@pveentjer
Contributor

There are several thousand topics. If there is a single ReadManyOperation per topic, there should be several thousand ReadManyOperations within the OperationParker.

But since there are nearly 100K ReadManyOperations, there must be many concurrent invocations per topic!

I'm not sure what the cause of this issue is.

If it is just problematic usage of the API, we can't do much about it.

If there is a memory leak that causes ReadManyOperations to be retained, it must be addressed by us.

Can you provide some information about how many concurrent requests you have per topic?

@dreis2211
Contributor Author

dreis2211 commented Feb 2, 2022

@pveentjer How would I get that information?

If it were problematic usage of the API, why would this work with 3.x? I would expect no major regression, given that the changelog and migration guides don't mention anything in particular in that regard.

@pveentjer
Contributor

Do you know the maximum number of concurrent read requests you have across all topics? Is it a few thousand or close to 100K?

@pveentjer
Contributor

What kind of profiler are you using? Could you run with async-profiler and see what is going on? It will give a good impression of where time is spent. The ratio between GC and non-GC activity is also very useful.

I can help you set it up.

@dreis2211
Contributor Author

@pveentjer Please look at the screenshots above. They are from async-profiler.

@dreis2211
Contributor Author

@pveentjer I'd like to ask again - how would I get the information about topics and concurrent requests per topic?

@dreis2211
Contributor Author

flamegraph2.html.zip

Keep in mind that this is still very early in terms of rising CPU levels. More data will follow tomorrow, but I thought I'd do one today as well.

@vbekiaris
Collaborator

> Apparently, this has been fixed in 5.x with #19462 but was again not backported to 4.x. Any chance this could happen?

Yes, we will prepare backports for 4.2.z and 4.1.z.

@dreis2211
Contributor Author

Not really new stuff, but here is a profile from this morning with -XX:+DebugNonSafepoints, where the CPU has shifted again to mostly handling operations and invocations:
flamegraph3.html.zip

Regarding the big libjvm block: we're trying to get debug symbols up and running for further insight, but I assume the operation and invocation handling is also pressuring the GC a bit.

@dreis2211
Contributor Author

dreis2211 commented Feb 4, 2022

As we don't have debug symbols yet, I used jstat to see whether we spend too much time in GC.

host:~$ jstat -gcutil <pid> 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
  0,00 100,00  52,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  62,86  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  72,38  88,10  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00  86,67  88,20  95,60  85,42  14155 1202,111     0    0,000  8700   85,929 1288,040
  0,00 100,00   2,56  88,97  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  18,80  89,06  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  35,90  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  47,01  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  55,56  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100
  0,00 100,00  64,96  90,36  95,60  85,42  14156 1202,171     0    0,000  8700   85,929 1288,100

I can't see too much GC activity here. Yes, eden space fills up by roughly 10% per second on average and we have a young GC every ~6-8 seconds, but that doesn't seem too bad. Especially if one can trust the timings at the end, not much is added over the 10-second interval (60 ms).

@dreis2211
Contributor Author

dreis2211 commented Feb 9, 2022

Although not a very big block, I just saw that InvocationMonitor$MonitorInvocationsTask.detectAndHandleTimeout is fairly new and impacts the invocation monitoring badly - more specifically the comparison of addresses.
[screenshot]

This seems to have been introduced with #17734, but it has also already been backported to 3.12.13. We were on Hazelcast 3.12.11 before the upgrade, so we were never affected by this change. I wonder if this could be optimized again - especially since Invocation.Context says the following:

* The {@link Context} contains all 'static' dependencies for an Invocation; dependencies that don't
* change between invocations....

Shouldn't it then be safe to have a local field on Invocation that is only filled once?
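
To illustrate the idea (a purely hypothetical sketch - the field and parameter names are made up and this is not the actual Invocation code):

// Hypothetical: memoize a comparison whose inputs don't change for a given invocation,
// instead of re-evaluating it on every monitoring pass.
private Boolean targetIsLocal; // computed at most once

boolean isTargetLocal(Address thisAddress, Address targetAddress) {
	if (targetIsLocal == null) {
		targetIsLocal = thisAddress.equals(targetAddress);
	}
	return targetIsLocal;
}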

@dreis2211
Contributor Author

dreis2211 commented Feb 10, 2022

I might be missing something, but I wrote the following test on our end to imitate the behaviour of MessageRunner, and it times out. Isn't this basically what MessageRunner does internally when calling next (apart from using whenCompleteAsync)? The test below works fine if minSize is 0 instead of 1 when calling readManyAsync. On 3.12.11 it doesn't time out (you have to change toCompletableFuture().get() to just get(), though).

@Test
public void imitateMessageRunnerBehaviour() throws Exception {
	RingbufferConfig ringbufferConfig = new RingbufferConfig("test");
	ringbufferConfig.setCapacity(1000);
	ringbufferConfig.setTimeToLiveSeconds(60);
	hazelcastInstance.getConfig().addRingBufferConfig(ringbufferConfig);
	Ringbuffer<String> ringbuffer = hazelcastInstance.getRingbuffer("test");

	long sequence = ringbuffer.tailSequence() + 1;
	System.out.println("Initial sequence: " + sequence);
	ringbuffer.add("a");

	ReadResultSet<String> result = ringbuffer.readManyAsync(sequence, 1, 10, null).toCompletableFuture().get();
	assertThat(result).containsOnly("a");
	System.out.println("=======");
	System.out.println("Current sequence: " + sequence);
	System.out.println("Read count: " + result.readCount());
	System.out.println("Next sequence: " + result.getNextSequenceToReadFrom());
	sequence = result.getNextSequenceToReadFrom();

	long nextSequence = sequence;
	assertTimeoutPreemptively(Duration.ofSeconds(10), () -> {
		ReadResultSet<String> r = ringbuffer.readManyAsync(nextSequence, 1, 10, null).toCompletableFuture().get();
		System.out.println("=======");
		System.out.println("Sequence: " + nextSequence);
		System.out.println("Read count: " + r.readCount());
		System.out.println("Next sequence: " + r.getNextSequenceToReadFrom());
		assertThat(r).containsOnly("a");
	});
}

As background: we're using ReliableTopicProxy.(add|remove)MessageListener, and I wonder if they cause these massive amounts of ReadManyOperation (and also the high latency, if the test above is not bogus...).

EDIT: I see weird things in the heap dump that forced me to investigate this in the first place. I see ReliableMessageRunner instances with the cancelled flag set to true, while their internal runnersMap shows a ReliableMessageRunner (maybe the same one?) with cancelled set to false. I wonder if that's just caused by different threads not seeing the data modifications or if there is another problem.
[screenshot]

@pveentjer
Contributor

> As we're not having debug symbols yet, I used jstat to see if we spent too much time in GCs. [...] I can't see too much GC activity here.

Based on the async-profiler HTML file, it seems that GC is the cause in 38% of the collected CPU samples. It would be good to have the debug symbols, but I'm pretty sure this is GC.

@dreis2211
Contributor Author

dreis2211 commented Feb 11, 2022

Well, this is just libjvm stuff - not necessarily GC. It could be method compilation etc. (also unlikely, since jstat isn't showing that either)... It is likely GC, but since jstat doesn't reflect it, I'm at least still skeptical until we have proof.

@dreis2211
Contributor Author

The problem described above is, by the way, similar to #17061. Under certain circumstances we end up with invocations that never really vanish, even though the MessageRunner might already be cancelled.

@pveentjer
Contributor

pveentjer commented Feb 11, 2022

Method compilation should be out of the way pretty quickly, so I would be surprised if that were the cause. I typically see GC activity as red blocks next to the application; hence I'm pretty sure it is GC-related. But best to get the debug symbols so we know for sure.

Let me check the ReliableTopicProxy.(add|remove)MessageListener path.

I had a look. What happens is that the listener is unregistered, and once the result of its pending invocation is returned, no new invocation is submitted.

So if you create/destroy listeners and they have a pending invocation to get data from the ringbuffer, you can end up with an accumulation of pending calls.

@pveentjer
Contributor

> EDIT: I see weird things in the heap dump that forced me to investigate this in the first place. [...]

The runnersMap is not internal to the ReliableMessageRunner; it is internal to the ReliableTopicProxy. As soon as the ReliableMessageRunner is cancelled (when the topic listener is removed), it is removed from the runnersMap. One could briefly see a cancelled ReliableMessageRunner in the runnersMap, though, since the cancelled flag is set to true before the runner is removed from the runnersMap.

The cancelled flag is volatile and therefore not subject to visibility problems.

@dreis2211
Contributor Author

dreis2211 commented Feb 11, 2022

@pveentjer Is there maybe a way to cancel these invocations when the message listener is cancelled? Or - as #17061 suggests - a timeout mechanism that eventually expires the invocation on its own?

@pveentjer
Contributor

pveentjer commented Feb 11, 2022

Perhaps the following could happen:

You create a message listener on some topic. It causes an invocation on the remote system that waits for data.

You remove the listener.

But the invocation of the listener isn't canceled, since it is just waiting for data that doesn't arrive, and as a consequence the canceled flag isn't seen. If data were found, the invocation would complete and the canceled flag would prevent a new invocation from being sent.

And over time you get an accumulation of useless invocations waiting for data that never arrives.

@dreis2211
Contributor Author

dreis2211 commented Feb 11, 2022

@pveentjer I'd like to repeat my question: do you see any way forward for those invocations? E.g. cancelling them when the message listener is removed and/or the runner is cancelled? Or passing timeouts to them so they eventually get cleaned up?

@pveentjer
Contributor

pveentjer commented Feb 11, 2022

AFAIK there is currently no way for them to be canceled. Cancellation of the future (which could be obtained internally) won't solve the problem. We could remove the invocation from the invocation service, but that only solves the 'local' side of the problem, not the remote side; the end result would be the same.

I think the most sensible approach would be to add a new overload of readManyAsync on the Ringbuffer with an explicit timeout (defaulting to Long.MAX_VALUE). This way the call returns to the MessageRunner either when there is data or when the timeout expires, and in the latter case the canceled status could be checked before resubmitting.

This won't cause any wire-level changes.

Ideally, there would be cancellation logic so that the operation could explicitly be canceled on the remote side, but that would lead to wire-level changes.
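
To sketch the idea (a hypothetical signature only - this overload does not exist today):

// Hypothetical addition to the Ringbuffer interface: same semantics as the existing
// readManyAsync, but the returned future also completes once the timeout expires.
CompletionStage<ReadResultSet<E>> readManyAsync(long startSequence, int minCount, int maxCount,
		IFunction<E, Boolean> filter, long timeout, TimeUnit unit);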

How often do you register and unregister listeners? Because if you have so many, this must happen very frequently.

@dreis2211
Contributor Author

This indeed happens fairly frequently. For every topic there is at least one register and one unregister, and we create many topics during the lifecycle of the app (it's essentially a chat with many channels, and users coming and going drive the number of topics). Which brings me to the question: is there an easy way to get the number of topics in a Hazelcast cluster?

In fact, there is a pooling mechanism so that we don't create too many topics, which is where I would likely put our own metrics if there are none from Hazelcast itself.

@Service
@Slf4j
public class HazelcastTopicPool {

	private final HazelcastInstance instance;

	private final IQueue<String> pool;

	private final IMap<String, String> map;

	public HazelcastTopicPool(HazelcastInstance instance) {
		this.instance = instance;
		this.pool = instance.getQueue("topics-pool");
		this.map = instance.getMap("topics");
		this.map.addLocalEntryListener(new CleanupListener(this));
	}

	public <T> ITopic<T> getOrCreateTopic(String topic) {
		// Check if topic already exists - if yes just return.
		return this.<T>getTopic(topic)
			.orElseGet(() -> {

				// Try to receive a free hazelcast topic from the pool - otherwise generate a new.
				String freeTopic = pool.poll();
				if (freeTopic == null) {
					// Has been com.hazelcast.util.UuidUtil.newUnsecureUUID() on version 3.x
					freeTopic = UUID.randomUUID().toString();
				}

				// Mark the free topic as used.
				String oldValue = map.putIfAbsent(topic, freeTopic, Duration.ofMinutes(2).toMillis(), TimeUnit.MILLISECONDS);
				if (oldValue != null) {
					pool.add(freeTopic);
					return instance.getReliableTopic(oldValue);
				} else {
					return instance.getReliableTopic(freeTopic);
				}
			});
	}

	public <T> Optional<ITopic<T>> getTopic(String topic) {
		for (int i = 0; i < 100; i++) {
			String existing = map.get(topic);
			if (existing == null) {
				return Optional.empty();
			}
			if (map.replace(topic, existing, existing)) {
				return Optional.of(instance.getReliableTopic(existing));
			}
		}

		throw new IllegalStateException("Cannot retrieve topic!");
	}

	<T> boolean refreshTopic(String topic, ITopic<T> iTopic) {
		String name = iTopic.getName();
		return map.replace(topic, name, name);
	}

	private void cleanupTopic(String topic, String oldTopic) {
		map.remove(topic);
		pool.add(oldTopic);
	}

	private static class CleanupListener implements EntryEvictedListener<String, String>, EntryExpiredListener<String, String> {

		private final HazelcastTopicPool pool;

		private CleanupListener(HazelcastTopicPool pool) {
			this.pool = pool;
		}

		@Override
		public void entryEvicted(EntryEvent<String, String> event) {
			log.info("Evicted topic: {}", event.getKey());
			pool.cleanupTopic(event.getKey(), event.getOldValue());
		}

		@Override
		public void entryExpired(EntryEvent<String, String> event) {
			log.info("Expired topic: {}", event.getKey());
			pool.cleanupTopic(event.getKey(), event.getOldValue());
		}
	}

} 

In fact, there was a bug in there: we didn't clean up the topic names and return them to the pool anymore, because we only used EntryEvictedListener, which stopped handling expiry events with the upgrade from 3 to 4. By the way, this wasn't mentioned prominently enough in the upgrade or release notes. But this is fixed already and unfortunately didn't show any noticeable impact.

Long.MAX_VALUE would be way too many years to be a sensible default, don't you think? That would be as good as having no timeout. This new method is the way forward, I think, but the timeout that is passed should also be configurable.

@pveentjer
Contributor

pveentjer commented Feb 11, 2022

It is a sensible default for blocking calls without a timeout, since there is no timeout and Long.MAX_VALUE is a good approximation of infinity.

In the case of ITopic, that is another question. I guess a default timeout of a few seconds should be good enough for most cases without causing excessive load on the system, unless there is a huge number of topic listeners. And I agree that having a configurable timeout makes sense.

@Holmistr Holmistr modified the milestones: 5.1, 5.2 Feb 15, 2022
@dreis2211
Contributor Author

dreis2211 commented Feb 15, 2022

I checked a couple of things because I was still surprised by the difference between Hazelcast 3 and 4. In fact, the stuck invocations can happen on both 3 and 4.

After more digging I found another behaviour change that breaks the HazelcastTopicPool posted above in yet another way: calling IMap.replace essentially removes the previously set TTL. Consider the following stripped-down example:

HazelcastInstance hazelcastInstance = Hazelcast.newHazelcastInstance();
IMap<String, String> topics = hazelcastInstance.getMap("topics");
topics.addLocalEntryListener(new EntryExpiredListener<String, String>() {
	@Override
	public void entryExpired(EntryEvent<String, String> event) {
		System.out.println("Expired: " + event);
	}
});
String key = "key";
topics.putIfAbsent(key, "value", 2000, TimeUnit.MILLISECONDS);
EntryView ev = topics.getEntryView(key);
System.out.println("====== FIRST ENTRY VIEW ======");
System.out.println("Expiry: " + new Date(ev.getExpirationTime()));
System.out.println("Update: " + new Date(ev.getLastUpdateTime()));
System.out.println("TTL: " + ev.getTtl());

String existingValue = topics.get(key);
System.out.println(existingValue);
Thread.sleep(1000);
if (existingValue != null) {
	topics.replace(key, existingValue, existingValue);
	System.out.println("====== SECOND ENTRY VIEW ======");
	ev = topics.getEntryView(key);
	System.out.println("Expiry: " + new Date(ev.getExpirationTime()));
	System.out.println("Update: " + new Date(ev.getLastUpdateTime()));
	System.out.println("TTL: " + ev.getTtl());
}

System.out.println("===== AFTERMATH =====");
Thread.sleep(5000);
System.out.println("Value:" + topics.get(key));

On Hazelcast 4.2.4 this code will output:

====== FIRST ENTRY VIEW ======
Expiry: Tue Feb 15 17:34:25 CET 2022
Update: Thu Jan 01 00:59:59 CET 1970
TTL: 2000
value
====== SECOND ENTRY VIEW ======
Expiry: Sun Aug 17 08:12:55 CET 292278994 // This is wrong
Update: Thu Jan 01 00:59:59 CET 1970
TTL: 9223372036854775807
===== AFTERMATH =====
Value:value

On Hazelcast 3.12.11, however, this is shown:

====== FIRST ENTRY VIEW ======
Expiry: Tue Feb 15 17:45:12 CET 2022
Update: Tue Feb 15 17:45:10 CET 2022
TTL: 2000
value
====== SECOND ENTRY VIEW ======
Expiry: Tue Feb 15 17:45:13 CET 2022
Update: Tue Feb 15 17:45:11 CET 2022
TTL: 2000
===== AFTERMATH =====
Expired: EntryEvent{entryEventType=EXPIRED ..., name='topics', key=key, oldValue=value, value=null, mergingValue=null}
Value:null

As you can see, on Hazelcast 4.2.4 the time-to-live is basically reset to a value that never expires. (Not to mention the weird last-update time.)

My workaround for Hazelcast 4.2.4 would be to add a MapConfig with setTimeToLiveSeconds for the topics map rather than passing the individual timeout to putIfAbsent. I'd appreciate it if you could confirm that this is safe to use.
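
Concretely, the workaround would look roughly like this (a sketch; the 120 seconds mirror the 2-minute TTL we previously passed to putIfAbsent):

MapConfig topicsMapConfig = new MapConfig("topics");
// Let entries in the topics map expire via the map-level TTL instead of per-entry TTLs.
topicsMapConfig.setTimeToLiveSeconds(120);
hazelcast.getConfig().addMapConfig(topicsMapConfig);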

The effect of this bug is that the above-mentioned HazelcastTopicPool never receives expiry events when refreshTopic is called at any time before the initial timeout, and thus the pooling of topic names never works. As the underlying IQueue.poll always returns null, a new random topic UUID is created - eventually leading to more ReliableTopicProxy instances, more ringbuffers, more invocations, etc.

I didn't find anything in the migration guides or release notes describing this behaviour change - if it was intended at all. Together with the EntryEvictedListener change mentioned in an earlier comment (it no longer handles expiry events), that makes two things that are not reflected in any migration guide and that create an incredibly frustrating upgrade experience, if you don't mind me saying. Admittedly, our pooling is somewhat unique, but adding a local map listener or using TTLs isn't.

The workaround of adding a MapConfig is not deployed yet, so I can't confirm that it fixes our issues, but I wanted to let you know already.

Cheers,
Christoph

@dreis2211
Contributor Author

The fix of configuring a MapConfig with setTimeToLiveSeconds for the topics map inside our HazelcastTopicPool, rather than passing the individual timeout to putIfAbsent, seems to work.

I've also added an additional publish of a special empty message after we call ReliableTopicProxy.removeMessageListener, to work around the discussed problem of invocations waiting for incoming data that never arrives.
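
In code, that second workaround is roughly the following (a sketch; WAKE_UP is our own sentinel value that all listeners simply ignore):

topic.removeMessageListener(registrationId);
// Publish a throwaway message so the parked readManyAsync invocation completes;
// the cancelled runner then sees the cancelled flag and does not resubmit.
topic.publish(WAKE_UP);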

@dreis2211
Contributor Author

I should note that during this investigation two PRs were created:

@AyberkSorgun AyberkSorgun modified the milestones: 5.2, 5.2 Backlog May 10, 2022
@Holmistr
Contributor

@dreis2211 Thanks for your continuous work on this. Are you still hitting this issue?

@dreis2211
Contributor Author

@Holmistr No, after applying the mentioned workarounds we don't hit this anymore. The special publish in particular is quite hacky, but it works to stop invocations from waiting for incoming data that never arrives.

The changes to map expiry are somewhat surprising, and you'll have to tell me whether they are a bug or not.

@AyberkSorgun AyberkSorgun removed this from the 5.2 Backlog milestone Oct 12, 2022