JMX connections and decommissioning support #38
Conversation
…mmission support for Cassandra pods.
LGTM.
if (EnumSet.of(CassandraConnection.Status.OperationMode.NORMAL, CassandraConnection.Status.OperationMode.DECOMMISSIONED).equals(podCassandraOperationModes.keySet())) {
It seems that if someone tries to scale the statefulset down to 0, they can't, because this conditional check only passes for a mixed set of NORMAL and DECOMMISSIONED nodes. (Scaling to 0 might be an invalid ask; aside from the bug I mentioned before, I can't think of any other scenario that needs it.)
Some additional notes: Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), then the controller will get "stuck" after it has scaled down the statefulset, as it currently isn't notified when the pod has finished being deleted. I'm hoping that a watch on the statefulset objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission. For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying a random-named field, it'll trigger a reconcile.

It also looks like deletion needs to be improved. It turns out the logic of scaling down a statefulset before deleting it is required (oops) -- I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission. Otherwise, if a scale down and then a scale up is attempted, C* will fail to start, as the pod gets a PV with a data directory of an already decommissioned node.
Regarding PV & PVC deletion: should a PV be reattached to a new pod if there is one? A node/pod failure scenario would benefit from this. If I understand correctly, PVs can't be attached to a new pod because of some metadata conflict introduced by the previous node? If a scale down and then a scale up of a statefulset is attempted, the new pod will get the same IP as the deleted pod, so IP might not be a barrier for this.
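To illustrate the scale-to-0 concern above: `Set.equals` demands the observed modes be exactly {NORMAL, DECOMMISSIONED}, so a set containing only DECOMMISSIONED (every node draining on the way to 0) never matches, whereas a `containsAll` check would accept any subset. A minimal sketch of the difference, using a stand-in `OperationMode` enum rather than the actual `CassandraConnection.Status.OperationMode`; both guard methods are illustrative, not the operator's code:

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-in for CassandraConnection.Status.OperationMode (illustrative only).
enum OperationMode { NORMAL, JOINING, LEAVING, DECOMMISSIONED }

class ScaleGuard {
    // Mirrors the check under review: equals() requires the observed set to be
    // exactly {NORMAL, DECOMMISSIONED}, so all-DECOMMISSIONED never passes.
    static boolean strictGuard(Set<OperationMode> observed) {
        return EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED).equals(observed);
    }

    // A looser alternative: every observed mode must be NORMAL or DECOMMISSIONED,
    // which also covers the scale-to-0 case where only DECOMMISSIONED remains.
    static boolean looseGuard(Set<OperationMode> observed) {
        return !observed.isEmpty()
            && EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED).containsAll(observed);
    }
}
```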
.build(new CacheLoader<InetAddress, CassandraConnection>() {
    @Override
    public CassandraConnection load(@Nonnull final InetAddress key) throws Exception {
        final JMXServiceURL serviceURL = new JMXServiceURL(String.format("service:jmx:rmi:///jndi/rmi://%s:7199/jmxrmi", InetAddresses.toAddrString(key)));
The port is hard-coded; let's make it configurable if possible.
It probably should be discovered from the Service, if possible.
OK, let's address this when we make JMX configurable, either via ConfigMaps or the CRD (that way it's set on the Service at creation).
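The call site only needs the literal 7199 threaded through as a parameter for this to work; where the value comes from (a ConfigMap entry, a CRD field, or a named port on the Service) can be decided later. A hypothetical sketch, where `serviceUrlFor` and `jmxPort` are illustrative names, not the operator's API:

```java
import javax.management.remote.JMXServiceURL;
import java.net.MalformedURLException;

class JmxUrls {
    // Cassandra's conventional JMX port; used only as a fallback default.
    static final int DEFAULT_JMX_PORT = 7199;

    // Same URL shape as the code under review, with the port made a parameter.
    static JMXServiceURL serviceUrlFor(String address, int jmxPort) throws MalformedURLException {
        return new JMXServiceURL(
            String.format("service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi", address, jmxPort));
    }
}
```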
LGTM
}

final JMXConnectionNotification connectionNotification = (JMXConnectionNotification) notification;
switch (connectionNotification.getType()) {
Cassandra will lose notifications (especially decommission and repair) like there's no tomorrow; we'll want to log this, though it's probably not relevant for the cache.
True, but this listener is purely for observing the JMX connection closing/failing so that it can be expunged from the cache.

The current design doesn't rely on receiving decommission notifications. The CassandraHealthCheckService periodically polls the operation mode (NORMAL, JOINING, DECOMM.., etc.) of every node and broadcasts an event if this mode changes. This broadcast allows the ControllerService to reconcile when the node switches to DECOMMISSIONED.

We could listen for decommission notifications as a way to reduce polling (the current interval is 1 min) and be more "reactive", as the reconcile could occur as soon as the node switches to DECOMMISSIONED, falling back to polling if we miss notifications.
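The listener's cache-eviction role can be sketched with JDK types only: react to the standard `JMXConnectionNotification` CLOSED/FAILED types by dropping the cached connection, and ignore everything else (including NOTIFS_LOST, since decommission progress is detected by polling). The `ConnectionCache` class and its plain-map cache are hypothetical stand-ins for the Guava cache in the actual code:

```java
import javax.management.remote.JMXConnectionNotification;
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: expunge a cached connection when its JMX connection closes or fails.
class ConnectionCache {
    final ConcurrentMap<InetAddress, Object> cache = new ConcurrentHashMap<>();

    void onNotification(InetAddress key, String notificationType) {
        switch (notificationType) {
            case JMXConnectionNotification.CLOSED:
            case JMXConnectionNotification.FAILED:
                cache.remove(key); // stale connection: drop it so the next lookup reconnects
                break;
            default:
                // OPENED, NOTIFS_LOST, etc. are irrelevant to cache lifetime;
                // decommission progress is observed by polling, not notifications.
                break;
        }
    }
}
```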
OK, that makes sense. I think polling only is fine: it's less likely to get into a weird state and less complicated than listening for notifications and then falling back.
LGTM.
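The polling approach described above reduces to: on each tick, compare every node's operation mode against the last observed value and fire an event on change. A minimal sketch with illustrative names (the actual CassandraHealthCheckService reads the mode over JMX and publishes to an event bus):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of the poll-and-diff loop; modes are plain strings here.
class OperationModePoller {
    private final Map<String, String> lastModes = new HashMap<>();

    void poll(Map<String, String> currentModes, BiConsumer<String, String> onChange) {
        currentModes.forEach((node, mode) -> {
            String previous = lastModes.put(node, mode);
            if (previous != null && !previous.equals(mode)) {
                // e.g. LEAVING -> DECOMMISSIONED is the change the controller reconciles on
                onChange.accept(node, mode);
            }
        });
    }
}
```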
// scale down

// decommission the newest pod
cassandraConnectionFactory.connectionForPod(pods.get(0)).decommission();
IIRC, decommission will fail when you call it on the last node left; Cassandra will throw an exception.
Thinking it through a little further, it doesn't make sense to decommission when we are trying to get to zero just to delete the cluster, as the last few nodes will have a ton of data (it will also be slow).
We should probably have a special path for going from N -> 0.
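The branching being suggested can be made concrete: partial scale-downs decommission one node at a time, while N -> 0 takes a separate teardown path, since decommission throws on the last node and streaming all remaining data away would be pointless anyway. A hypothetical sketch; the `Action` names are illustrative, not the operator's API:

```java
class ScaleDownPlanner {
    enum Action { NONE, DECOMMISSION_NEWEST, DELETE_ALL }

    static Action plan(int currentReplicas, int desiredReplicas) {
        if (desiredReplicas >= currentReplicas) {
            return Action.NONE;              // not a scale-down
        }
        if (desiredReplicas == 0) {
            return Action.DELETE_ALL;        // N -> 0: skip decommission, just tear down
        }
        return Action.DECOMMISSION_NEWEST;   // shrink one node at a time
    }
}
```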
I think you mentioned we don't handle the delete case at the moment anyway (it leaves the PVs around)?
Do we care about scale -> 0? Currently the CassandraDataCenter CRD has a minimum value of 1 for the replicas field, meaning there always has to be one node.

True deletion of a DC is handled separately -- when the DC object is deleted (e.g. kubectl delete cdc test-datacenter-1). In this case we nuke everything -- services, default config maps, generated secrets, etc.

In contrast, StatefulSets do allow a replicas value of 0, which results in 0 pods being created.

Handling scale -> 0 wouldn't be difficult, I just wonder if it's necessary.
If full deletion of the CDC resource is handled separately, then this is less of an issue. The minimum value of 1 for replicas also works for me.
LGTM.
A few comments; my main concern is around the scale-down-to-0 workflow.
@liwang-pivotal Pod failure is handled by the StatefulSet machinery, which will correctly persist the PV between Pod replacements. The current issue only occurs when we purposefully delete a Pod because the number of replicas has changed. For example, if you have 6 replicas/pods/nodes and one fails, the replacement should, if possible, get the same PV so that Cassandra doesn't need to re-stream data. In the case of scaling 6 -> 3, we no longer care about the 3 removed pods and their PVs, and hence they should be deleted. What happens currently is that a 6 -> 3 -> 6 scale operation results in 3 new pods being created, but they get the old PVs attached, and C* complains that they belong to decommissioned nodes -- rightfully so! Instead, these 3 new pods should get new PVs.
I'm happy with the change set. See #40 to track PV clean-up and the operator getting stuck due to not having a statefulset watch.
* Adds support for cached JMX connections to Cassandra nodes. Adds decommission support for Cassandra pods.
* Disables `WatchService` (pending watch refactor)
* Completely disabled `WatchService` and other bits (pending watch refactor)
Changes of note:

* Adds the `-XX:+UseCGroupMemoryLimitForHeap` JVM flag to automatically set the operator JVM heap size to the container limits.
* Adds `CassandraConnectionFactory` for getting JMX connections to Cassandra, wrapped in the `CassandraConnection` class. The factory also caches JMX connections for 10 minutes (currently hardcoded, could be a config?) and watches for them to be closed, etc.
* Adds `CassandraConnection` for issuing JMX operations and queries to C*, behind a facade for easy use + future ability to handle version-specific differences between C* JMX interfaces.
* `CassandraHealthCheckService` now tracks the operation mode of C* nodes and fires events to the bus when these change, i.e., we now get an event when a node goes `LEAVING` -> `DECOMMISSIONED`.
* `ControllerService` has been updated to call decommission via JMX before scaling the statefulset (which is done as a reconcile once the node status changes to `DECOMMISSIONED`). Also some refactoring.