ISPN-14782 Unable to reenable rebalance after cluster scale up #10834

jabolina · 2023-04-18T18:09:52Z

https://issues.redhat.com/browse/ISPN-14782
https://issues.redhat.com/browse/ISPN-14793

We no longer broadcast the topology as stable if it has not been restored yet.
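A minimal sketch of the guard described above. It assumes a simplified model in which a cache is either restored from persistent state or not. The class and method names (`StableTopologyBroadcaster`, `maybeBroadcastStable`, `markRestored`) are hypothetical and do not reflect the actual `ClusterCacheStatus` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not Infinispan's ClusterCacheStatus implementation.
final class StableTopologyBroadcaster {
    private boolean restored;
    private final List<Integer> broadcastIds = new ArrayList<>();

    StableTopologyBroadcaster(boolean restored) {
        this.restored = restored;
    }

    // Called once the cache state has been recovered from persistent storage.
    void markRestored() {
        restored = true;
    }

    // Broadcasts the topology as stable only after the cache has been restored.
    // Returns true if the topology was broadcast, false if it was held back.
    boolean maybeBroadcastStable(int topologyId) {
        if (!restored) {
            return false;
        }
        broadcastIds.add(topologyId);
        return true;
    }

    List<Integer> broadcastIds() {
        return broadcastIds;
    }
}
```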

infinispanrelease · 2023-04-18T18:27:24Z

Image pushed for Jenkins build #1:

quay.io/infinispan-test/server:PR-10834

jabolina · 2023-04-19T00:55:24Z

Replayed CI.

infinispanrelease · 2023-04-19T01:09:08Z

Image pushed for Jenkins build #2:

quay.io/infinispan-test/server:PR-10834

infinispanrelease · 2023-04-19T08:44:55Z

Image pushed for Jenkins build #3:

quay.io/infinispan-test/server:PR-10834

core/src/main/java/org/infinispan/topology/ClusterCacheStatus.java

infinispanrelease · 2023-04-20T17:59:20Z

Image pushed for Jenkins build #4:

quay.io/infinispan-test/server:PR-10834

jabolina · 2023-04-20T18:04:11Z

Added a commit for https://issues.redhat.com/browse/ISPN-14793

This slightly changes how nodes handle the CacheStatusRequest, which the coordinator sends after a view MERGE. If a node has a cache that needs to be recreated from the persistent state, it sends a join request to the coordinator. The coordinator handles this request only after it has received the states from all nodes. If the cluster is partitioned again in the meantime, the existing join mechanism should kick in.
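The ordering constraint described above can be sketched as a simplified, single-threaded model. The sketch makes several assumptions: nodes and caches are plain strings, and join requests that arrive before every node has answered the CacheStatusRequest are queued. It does not model the re-partition case covered by the existing join mechanism. `RecoveringNode` and `MergeRecoveryCoordinator` are hypothetical names, not Infinispan classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical node-side helper: picks the caches that need a join request.
final class RecoveringNode {
    static List<String> cachesToJoin(Map<String, Boolean> needsRecoveryByCache) {
        List<String> result = new ArrayList<>();
        needsRecoveryByCache.forEach((cache, needsRecovery) -> {
            if (needsRecovery) {
                result.add(cache);
            }
        });
        return result;
    }
}

// Hypothetical coordinator-side model: joins wait until all states are received.
final class MergeRecoveryCoordinator {
    private final Set<String> awaitingStatus;
    private final Queue<String> deferredJoins = new ArrayDeque<>();
    private final List<String> handledJoins = new ArrayList<>();

    MergeRecoveryCoordinator(Set<String> members) {
        this.awaitingStatus = new HashSet<>(members);
    }

    // A node answered the CacheStatusRequest sent after the view merge.
    void onStatusResponse(String node) {
        awaitingStatus.remove(node);
        if (awaitingStatus.isEmpty()) {
            // Every state is in, so the queued join requests can now be handled.
            while (!deferredJoins.isEmpty()) {
                handledJoins.add(deferredJoins.poll());
            }
        }
    }

    // A node asked to join a cache that must be recreated from persistent state.
    void onJoinRequest(String node, String cache) {
        String join = node + ":" + cache;
        if (awaitingStatus.isEmpty()) {
            handledJoins.add(join);
        } else {
            deferredJoins.add(join);
        }
    }

    List<String> handledJoins() {
        return handledJoins;
    }
}
```

Queuing early joins, instead of rejecting them, means a node does not need to retry if its join request overtakes another node's status response.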

ryanemerson

LGTM just two minor points.

I've backported to speed things up while you're on PTO: #10849

I removed the System.out comment, but I haven't changed the AggregateCompletionStage usage.

ryanemerson · 2023-04-21T07:06:57Z

core/src/main/java/org/infinispan/topology/LocalTopologyManagerImpl.java

@@ -295,6 +309,7 @@ public CompletionStage<ManagerStatusResponse> handleStatusRequest(int viewId) {
      // As long as we have an older view, we can still process topologies from the old coordinator
      return withView(viewId, getGlobalTimeout(), MILLISECONDS).thenApply(ignored -> {
         Map<String, CacheStatusResponse> caches = new HashMap<>();
+         AggregateCompletionStage<Void> joins = CompletionStages.aggregateCompletionStage();


Why do we need an Aggregate here? AFAICT we only depend on a single CompletionStage.

Sorry, this was leftover from the first solution. Updating to remove it.
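The distinction raised in this thread can be illustrated with plain JDK futures. An aggregate stage is only useful when there are several independent stages to wait for. With a single stage, it is simpler to chain on that stage directly. In this sketch, `CompletableFuture.allOf` stands in for Infinispan's `AggregateCompletionStage`, and `StageComposition` and its methods are hypothetical helpers for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

final class StageComposition {
    // Several independent stages: the aggregate completes once all of them complete.
    static CompletableFuture<Void> waitForAll(List<CompletableFuture<Void>> stages) {
        return CompletableFuture.allOf(stages.toArray(new CompletableFuture<?>[0]));
    }

    // A single stage: chain on it directly; wrapping it in an aggregate adds nothing.
    static <T> CompletionStage<T> afterSingle(CompletionStage<Void> single, Supplier<T> result) {
        return single.thenApply(ignored -> result.get());
    }
}
```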

ryanemerson · 2023-04-21T07:07:26Z

core/src/main/java/org/infinispan/topology/LocalTopologyManagerImpl.java

@@ -535,6 +564,7 @@ private CompletionStage<Void> doHandleStableTopologyUpdate(String cacheName, Cac
         CacheTopology stableTopology = cacheStatus.getStableTopology();
         if (stableTopology == null || stableTopology.getTopologyId() < newStableTopology.getTopologyId()) {
            log.tracef("Updating stable topology for cache %s: %s", cacheName, newStableTopology);
+            //System.out.printf("[%s] Updating stable topology for cache %s: %s\n", transport.getAddress(), cacheName, newStableTopology);


Can be removed
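The condition in the `doHandleStableTopologyUpdate` hunk accepts a new stable topology only in two cases: no stable topology is known yet, or the incoming topology id is strictly higher. Stale or duplicate updates are therefore ignored. The following is a minimal sketch of that rule, reduced to topology ids only. `StableTopologyTracker` is a hypothetical class, not Infinispan code.

```java
// Hypothetical, id-only model of the stable-topology guard; not Infinispan code.
final class StableTopologyTracker {
    private Integer stableTopologyId; // null until the first stable topology is seen

    // Same condition as the hunk: accept if none is known or the new id is strictly higher.
    boolean update(int newStableTopologyId) {
        if (stableTopologyId == null || stableTopologyId < newStableTopologyId) {
            stableTopologyId = newStableTopologyId;
            return true;
        }
        return false;
    }

    Integer stableTopologyId() {
        return stableTopologyId;
    }
}
```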

jabolina · 2023-04-25T00:22:48Z

Updated with suggestions.

infinispanrelease · 2023-04-25T00:37:49Z

Image pushed for Jenkins build #5:

quay.io/infinispan-test/server:PR-10834

jabolina added the Image Required label Apr 18, 2023 (set this label in order for a server image to be built with the PR changes and pushed to quay.io)

ryanemerson reviewed Apr 19, 2023

core/src/main/java/org/infinispan/topology/ClusterCacheStatus.java (resolved)

jabolina force-pushed the ISPN-14782 branch from 0408cdd to 697379f on April 20, 2023 17:39

ryanemerson approved these changes Apr 21, 2023

jabolina added 2 commits April 24, 2023 21:08

ISPN-14782 Unable to reenable rebalance after cluster scale up

a394c6b

ISPN-14793 Join recovering caches after view merge

204c3fb

* After receiving a CacheStatusRequest from the coordinator, the nodes will send a join request for the caches which need to be recovered from the persistent state.

jabolina force-pushed the ISPN-14782 branch from 697379f to 204c3fb on April 25, 2023 00:22

ryanemerson merged commit 9f35812 into infinispan:main Apr 25, 2023
2 of 4 checks passed

jabolina deleted the ISPN-14782 branch April 25, 2023 11:45
