JMX connections and decommissioning support #38

Merged
merged 3 commits into instaclustr:master on May 23, 2018

Conversation


@zegelin zegelin commented May 22, 2018

Changes of note:

  • Operator container now uses the -XX:+UseCGroupMemoryLimitForHeap JVM flag to automatically set the operator JVM heap size to the container limits.
  • CassandraConnectionFactory for getting JMX connections to Cassandra, wrapped in the CassandraConnection class. The factory also caches JMX connections for 10 minutes (currently hardcoded, could be a config?) and watches for them to be closed, etc. (see the sketch after this list).
  • CassandraConnection for issuing JMX operations and queries to C*, behind a facade for easy use + future ability to handle version-specific differences between C* JMX interfaces.
  • CassandraHealthCheckService now tracks the operation mode of C* nodes and fires events to the bus when these change; i.e., we now get an event when a node goes LEAVING -> DECOMMISSIONED.
  • The ControllerService has been updated to call decommission via JMX before scaling the StatefulSet (the scale-down is done as a reconcile once the node status changes to DECOMMISSIONED). Also some refactoring.
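
As a rough illustration of the factory described above, here is a minimal sketch of a Guava-cached JMX connection factory. The class and method names are assumptions for illustration only (the PR's actual CassandraConnectionFactory wraps connections in CassandraConnection); the 10-minute expiry and port 7199 come from the discussion in this thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.net.InetAddresses;

// Hypothetical sketch -- not the PR's actual code.
public final class CachedJmxConnectionFactory {
    // Close evicted connections so expiry doesn't leak sockets.
    private static final RemovalListener<InetAddress, JMXConnector> CLOSE_ON_EVICT = removal -> {
        try {
            if (removal.getValue() != null) {
                removal.getValue().close();
            }
        } catch (final IOException e) {
            // ignore: the connection may already be closed
        }
    };

    private final LoadingCache<InetAddress, JMXConnector> connections = CacheBuilder.newBuilder()
            .expireAfterAccess(10, TimeUnit.MINUTES) // currently hardcoded, per the note above
            .removalListener(CLOSE_ON_EVICT)
            .build(new CacheLoader<InetAddress, JMXConnector>() {
                @Override
                public JMXConnector load(final InetAddress address) throws IOException {
                    final JMXServiceURL url = new JMXServiceURL(String.format(
                            "service:jmx:rmi:///jndi/rmi://%s:7199/jmxrmi",
                            InetAddresses.toAddrString(address)));
                    return JMXConnectorFactory.connect(url);
                }
            });

    public JMXConnector connectionFor(final InetAddress address) throws ExecutionException {
        return connections.get(address);
    }
}
```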

@liwang0513 liwang0513 left a comment

LGTM.

});


if (EnumSet.of(CassandraConnection.Status.OperationMode.NORMAL, CassandraConnection.Status.OperationMode.DECOMMISSIONED).equals(podCassandraOperationModes.keySet())) {
@liwang0513 liwang0513 May 22, 2018

It seems that if someone tries to scale the StatefulSet down to 0, they can't, because this conditional check only passes for a mixed set of NORMAL and DECOMMISSIONED nodes (this might be an invalid ask; apart from the bug I mentioned before, I can't think of any other scenario that needs it).
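
To make the concern concrete: EnumSet equality is exact set equality, so once every remaining node is DECOMMISSIONED (the scale-to-zero case) the key set no longer matches. A tiny standalone illustration (the enum here stands in for CassandraConnection.Status.OperationMode):

```java
import java.util.EnumSet;

enum OperationMode { NORMAL, JOINING, LEAVING, DECOMMISSIONED }

public class ScaleToZeroCheck {
    public static void main(String[] args) {
        final EnumSet<OperationMode> expected =
                EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED);

        // Mixed scale-down (e.g. 6 -> 3): some nodes NORMAL, one DECOMMISSIONED -- passes.
        System.out.println(expected.equals(
                EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED))); // true

        // Scale to zero: the last node is DECOMMISSIONED with no NORMAL peers -- fails.
        System.out.println(expected.equals(
                EnumSet.of(OperationMode.DECOMMISSIONED))); // false
    }
}
```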


zegelin commented May 23, 2018

Some additional notes:

Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), the controller will get "stuck" after it has scaled down the StatefulSet, as it currently isn't notified when the pod has finished being deleted. I'm hoping that a watch on the StatefulSet objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission.

For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying an arbitrarily named field, it'll trigger a reconcile (e.g., create field spec.xyz with value 1, then change it to 2 and re-apply).

It also looks like deletion needs to be improved. It turns out the logic of scaling down a StatefulSet before deleting it is required (oops) -- I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission; otherwise, if a scale down then scale up is attempted, C* will fail to start, as the pod gets a PV with a data directory of an already decommissioned node.
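
As a rough sketch of the watch idea above, assuming the fabric8 Kubernetes client purely for illustration (the operator's actual client and the reconcile wiring are not shown here):

```java
import io.fabric8.kubernetes.api.model.apps.StatefulSet;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

public class StatefulSetWatchSketch {
    public static void main(String[] args) {
        final KubernetesClient client = new DefaultKubernetesClient();

        client.apps().statefulSets().inNamespace("default").watch(new Watcher<StatefulSet>() {
            @Override
            public void eventReceived(final Action action, final StatefulSet statefulSet) {
                // status.currentReplicas drops once the deleted pod is gone;
                // a mismatch with spec.replicas is the cue to reconcile and
                // kick off the next decommission.
                final Integer current = statefulSet.getStatus().getCurrentReplicas();
                final Integer desired = statefulSet.getSpec().getReplicas();
                if (current != null && !current.equals(desired)) {
                    // trigger reconcile (placeholder)
                }
            }

            @Override
            public void onClose(final KubernetesClientException cause) {
                // re-establish the watch on failure (placeholder)
            }
        });
    }
}
```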

@liwang0513

liwang0513 commented May 23, 2018

Regarding PV & PVC deletion: wouldn't it be desirable for a PV to be reattached to a new pod where possible? A node/pod failure scenario would benefit from this. If I understand correctly, PVs can't be attached to a new pod because of a metadata conflict introduced by the previous node? If a scale down then scale up is attempted on a StatefulSet, the new pod will get the same IP as the deleted pod, so IP might not be a barrier here.

.build(new CacheLoader<InetAddress, CassandraConnection>() {
    @Override
    public CassandraConnection load(@Nonnull final InetAddress key) throws Exception {
        final JMXServiceURL serviceURL = new JMXServiceURL(String.format("service:jmx:rmi:///jndi/rmi://%s:7199/jmxrmi", InetAddresses.toAddrString(key)));
Collaborator

The port is hard-coded; let's make that configurable if possible.

@zegelin zegelin May 23, 2018

It probably should be discovered from the Service, if possible.

Collaborator

OK, let's address this when we make JMX configurable, either via ConfigMaps or the CRD (that way it's set on the Service at creation).

LGTM
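
A hedged sketch of the direction agreed here: the port read from configuration with 7199 as the fallback. The JMX_PORT environment variable is an assumption for illustration; per the comment above it would ultimately come from the CRD or a ConfigMap (or be discovered from the Service):

```java
import java.net.InetAddress;

import javax.management.remote.JMXServiceURL;

import com.google.common.net.InetAddresses;

public class JmxUrlSketch {
    // Hypothetical: surfaced from the CRD/ConfigMap as an env var.
    private static final int JMX_PORT =
            Integer.parseInt(System.getenv().getOrDefault("JMX_PORT", "7199"));

    static JMXServiceURL serviceUrlFor(final InetAddress address) throws Exception {
        return new JMXServiceURL(String.format(
                "service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi",
                InetAddresses.toAddrString(address), JMX_PORT));
    }
}
```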

}

final JMXConnectionNotification connectionNotification = (JMXConnectionNotification) notification;
switch (connectionNotification.getType()) {
Collaborator

Cassandra will lose notifications (especially decommission and repair) like no tomorrow; we'll want to log this, though it's probably not relevant for the cache.

Contributor Author

True, but this listener is purely for observing the JMX connection closing/failing so that it can be expunged from the cache.

The current design doesn't rely on receiving decommission notifications. The CassandraHealthCheckService periodically polls the operation mode (NORMAL, JOINING, DECOMMISSIONED, etc.) of every node and will broadcast an event if this mode changes. This broadcast allows the ControllerService to reconcile when the node switches to DECOMMISSIONED.

We could listen for decom notifications as a way to reduce polling (the current interval is 1 min) and be more "reactive", as the reconcile could occur as soon as the node switches to decommissioned, falling back to polling if we miss notifications.
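
For context, a minimal sketch of the expunge-on-close listener being described (the cache field and key are placeholders; the notification type constants are the standard javax.management.remote ones):

```java
import java.net.InetAddress;

import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.remote.JMXConnectionNotification;

import com.google.common.cache.Cache;

// Hypothetical sketch: drop a cached connection once it closes or fails.
public class ExpungeOnCloseListener implements NotificationListener {
    private final Cache<InetAddress, ?> cache;
    private final InetAddress key;

    public ExpungeOnCloseListener(final Cache<InetAddress, ?> cache, final InetAddress key) {
        this.cache = cache;
        this.key = key;
    }

    @Override
    public void handleNotification(final Notification notification, final Object handback) {
        if (!(notification instanceof JMXConnectionNotification)) {
            return;
        }
        switch (((JMXConnectionNotification) notification).getType()) {
            case JMXConnectionNotification.CLOSED:
            case JMXConnectionNotification.FAILED:
                cache.invalidate(key); // expunge the dead connection
                break;
            default:
                // NOTIFS_LOST etc. don't affect cache validity
                break;
        }
    }
}
```

The listener would be registered via JMXConnector.addConnectionNotificationListener when the connection is first cached.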

Collaborator

OK, that makes sense. I think polling only is fine and less likely to get into a weird state / less complicated than listening for notifications and then falling back.

LGTM.
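
For reference, the poll itself can be a single JMX attribute read; the MBean name and the OperationMode attribute are Cassandra's standard StorageService JMX surface, while the method name here is just for illustration:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class OperationModePollSketch {
    // Reads the node's operation mode (NORMAL, JOINING, LEAVING,
    // DECOMMISSIONED, ...) from Cassandra's StorageService MBean.
    static String operationMode(final MBeanServerConnection connection) throws Exception {
        return (String) connection.getAttribute(
                new ObjectName("org.apache.cassandra.db:type=StorageService"),
                "OperationMode");
    }
}
```

The health-check service would call something like this once per node per polling interval and broadcast an event whenever the value differs from the last one observed.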

// scale down

// decommission the newest pod
cassandraConnectionFactory.connectionForPod(pods.get(0)).decommission();
@benbromhead benbromhead May 23, 2018

iirc decom will fail when you call it on the last node left. Cassandra will throw an exception

Thinking it through a little further, it doesn't make sense to decommission when we are trying to get to zero just to delete the cluster... as the last few nodes will have a ton of data (it will also be slow).

We should probably have a special path for going from N -> 0
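
Were such a special path wanted (the thread below settles on a minimum replica count of 1 instead), it could be a simple guard around the per-node decommission. The invoke call shows how decommission is issued against Cassandra's StorageService MBean; the surrounding names are placeholders:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class DecommissionSketch {
    // Issues a decommission via Cassandra's StorageService MBean
    // (the no-argument Cassandra 3.x operation).
    static void decommission(final MBeanServerConnection connection) throws Exception {
        connection.invoke(
                new ObjectName("org.apache.cassandra.db:type=StorageService"),
                "decommission", new Object[]{}, new String[]{});
    }

    // Hypothetical guard: when scaling to zero, skip decommission entirely --
    // the last node can't decommission, and streaming its data is pointless
    // if the cluster is about to be deleted.
    static boolean shouldDecommission(final int desiredReplicas) {
        return desiredReplicas > 0;
    }
}
```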

Collaborator

I think you mentioned we don't handle the delete case at the moment anyway (it leaves the PVs around)?

@zegelin zegelin May 23, 2018

Do we care about scale -> 0? Currently the CassandraDataCenter CRD has a min value for the replicas field of 1, meaning there always has to be one node.

True deletion of a DC is handled separately -- when the DC object is deleted (e.g., kubectl delete cdc test-datacenter-1). In this case we nuke everything -- services, default config maps, generated secrets, etc.

In contrast, StatefulSets do allow a replicas value of 0, which results in 0 pods being created.

Handling scale -> 0 wouldn't be difficult, I just wonder if it's necessary.

Collaborator

If full deletion of the CDC resource is handled separately, then this is less of an issue. The min value of 1 for replicas also works for me.

LGTM.

@benbromhead benbromhead left a comment

A few comments, main concern around scale down to 0 workflow.



zegelin commented May 23, 2018

@liwang-pivotal Pod failure is handled by the StatefulSet stuff, which will correctly persist the PV between Pod replacements.

The current issue only occurs when we purposefully delete a Pod because the number of replicas has changed.

For example, if you have 6 replicas/pods/nodes, and one fails, the replacement should, if possible, get the same PV so that Cassandra doesn't need to restream data.

In the case of scaling 6 -> 3, we no longer care about the 3 pods and their PVs, and hence they should be deleted.

What happens currently is that a 6 -> 3 -> 6 scale operation results in 3 new pods being created, but they get the old PVs attached and C* complains that they belong to decommissioned nodes -- rightfully so! Instead, these 3 new pods should get new PVs.
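
To make the cleanup step concrete, a hedged sketch of deleting a removed pod's PVC, again assuming the fabric8 client purely for illustration. The data-volume claim-template name is a placeholder; StatefulSet PVCs follow the <claim-template>-<pod-name> convention:

```java
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PvcCleanupSketch {
    // After a pod is decommissioned and removed by the scale-down, delete its
    // PVC so a later scale-up gets a fresh volume rather than the data
    // directory of an already-decommissioned node.
    static void deletePvcForPod(final KubernetesClient client,
                                final String namespace, final String podName) {
        final String pvcName = "data-volume-" + podName; // placeholder template name
        client.persistentVolumeClaims()
                .inNamespace(namespace)
                .withName(pvcName)
                .delete();
        // With a Delete reclaim policy the PV is removed along with the claim;
        // with Retain, the PV needs separate deletion.
    }
}
```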

@benbromhead

I'm happy with the change set. See #40 to track PV cleanup and the operator getting stuck due to not having a StatefulSet watch.

@benbromhead benbromhead merged commit 93b4778 into instaclustr:master May 23, 2018
@benbromhead benbromhead mentioned this pull request May 23, 2018
liwang0513 pushed a commit to liwang0513/cassandra-operator that referenced this pull request May 24, 2018
* Adds support for cached JMX connections to Cassandra nodes. Adds decommission support for Cassandra pods.

* Disables `WatchService` (pending watch refactor)

* Completely disabled `WatchService` and other bits (pending watch refactor)