JMX connections and decommissioning support #38
Conversation
…mmission support for Cassandra pods.
LGTM.
if (EnumSet.of(CassandraConnection.Status.OperationMode.NORMAL, CassandraConnection.Status.OperationMode.DECOMMISSIONED).equals(podCassandraOperationModes.keySet())) {
It seems that if someone tries to scale the statefulset down to 0, they can't, because this conditional check only passes for a mixed set of NORMAL and DECOMMISSIONED nodes. (Scaling to 0 might be an invalid ask; aside from the bug I mentioned before, I can't think of any other scenario that needs it.)
Some additional notes: Currently, if the DC is scaled down by more than one node (e.g., 6 -> 3 nodes), then the controller will get "stuck" after it has scaled down the statefulset, as it currently isn't notified when the pod has finished being deleted. I'm hoping that a watch on the statefulset objects will be enough for the controller to observe that the pod has been deleted (statefulset.status.currentReplicas will change) and trigger a reconcile to kick off the next decommission. For now, if you re-apply the DC definition after the node has decommissioned, keeping the same size but modifying a random-named field, it'll trigger a reconcile.

It also looks like deletion needs to be improved. It turns out the logic of scaling down a statefulset before deleting it is required (oops) -- I'll add it back in. But we also need to remove PVs and PV claims. The PVs also need to be removed on decommission. Otherwise, if a scale down and then a scale up is attempted, C* will fail to start, as the pod gets a PV with a data directory of an already decommissioned node.
Regarding PV & PVC deletion: should a PV be reattached to a new pod if there is one? A node/pod failure scenario would benefit from this. If I understand correctly, PVs can't be attached to a new pod because of some metadata conflict introduced by the previous node? If a scale down and then a scale up of a statefulset is attempted, the new pod will get the same IP as the deleted pod, so IP might not be a barrier for this.
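To illustrate the scale-to-0 concern above: `Set.equals` demands the observed modes be exactly {NORMAL, DECOMMISSIONED}, so a set containing only DECOMMISSIONED (every node draining on the way to 0) never matches, whereas a `containsAll` check would accept any subset. A minimal sketch of the difference, using a stand-in `OperationMode` enum rather than the actual `CassandraConnection.Status.OperationMode`; both guard methods are illustrative, not the operator's code:

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-in for CassandraConnection.Status.OperationMode (illustrative only).
enum OperationMode { NORMAL, JOINING, LEAVING, DECOMMISSIONED }

class ScaleGuard {
    // Mirrors the check under review: equals() requires the observed set to be
    // exactly {NORMAL, DECOMMISSIONED}, so all-DECOMMISSIONED never passes.
    static boolean strictGuard(Set<OperationMode> observed) {
        return EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED).equals(observed);
    }

    // A looser alternative: every observed mode must be NORMAL or DECOMMISSIONED,
    // which also covers the scale-to-0 case where only DECOMMISSIONED remains.
    static boolean looseGuard(Set<OperationMode> observed) {
        return !observed.isEmpty()
            && EnumSet.of(OperationMode.NORMAL, OperationMode.DECOMMISSIONED).containsAll(observed);
    }
}
```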
.build(new CacheLoader<InetAddress, CassandraConnection>() {
    @Override
    public CassandraConnection load(@Nonnull final InetAddress key) throws Exception {
        final JMXServiceURL serviceURL = new JMXServiceURL(String.format("service:jmx:rmi:///jndi/rmi://%s:7199/jmxrmi", InetAddresses.toAddrString(key)));
The port is hard-coded; let's make it configurable if possible.
It probably should be discovered from the Service, if possible.
OK, let's address this when we make JMX configurable, either via ConfigMaps or the CRD (that way it's set on the Service at creation).
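The call site only needs the literal 7199 threaded through as a parameter for this to work; where the value comes from (a ConfigMap entry, a CRD field, or a named port on the Service) can be decided later. A hypothetical sketch, where `serviceUrlFor` and `jmxPort` are illustrative names, not the operator's API:

```java
import javax.management.remote.JMXServiceURL;
import java.net.MalformedURLException;

class JmxUrls {
    // Cassandra's conventional JMX port; used only as a fallback default.
    static final int DEFAULT_JMX_PORT = 7199;

    // Same URL shape as the code under review, with the port made a parameter.
    static JMXServiceURL serviceUrlFor(String address, int jmxPort) throws MalformedURLException {
        return new JMXServiceURL(
            String.format("service:jmx:rmi:///jndi/rmi://%s:%d/jmxrmi", address, jmxPort));
    }
}
```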
LGTM
}

final JMXConnectionNotification connectionNotification = (JMXConnectionNotification) notification;
switch (connectionNotification.getType()) {
Cassandra will lose notifications (especially decommission and repair) like there's no tomorrow; we'll want to log this, though it's probably not relevant for the cache.
True, but this listener is purely for observing the JMX connection closing/failing so that it can be expunged from the cache.

The current design doesn't rely on receiving decommission notifications. The CassandraHealthCheckService periodically polls the operation mode (NORMAL, JOINING, DECOMM.., etc.) of every node and broadcasts an event if this mode changes. This broadcast allows the ControllerService to reconcile when the node switches to DECOMMISSIONED.

We could listen for decommission notifications as a way to reduce polling (the current interval is 1 min) and be more "reactive", as the reconcile could occur as soon as the node switches to DECOMMISSIONED, falling back to polling if we miss notifications.
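The listener's cache-eviction role can be sketched with JDK types only: react to the standard `JMXConnectionNotification` CLOSED/FAILED types by dropping the cached connection, and ignore everything else (including NOTIFS_LOST, since decommission progress is detected by polling). The `ConnectionCache` class and its plain-map cache are hypothetical stand-ins for the Guava cache in the actual code:

```java
import javax.management.remote.JMXConnectionNotification;
import java.net.InetAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: expunge a cached connection when its JMX connection closes or fails.
class ConnectionCache {
    final ConcurrentMap<InetAddress, Object> cache = new ConcurrentHashMap<>();

    void onNotification(InetAddress key, String notificationType) {
        switch (notificationType) {
            case JMXConnectionNotification.CLOSED:
            case JMXConnectionNotification.FAILED:
                cache.remove(key); // stale connection: drop it so the next lookup reconnects
                break;
            default:
                // OPENED, NOTIFS_LOST, etc. are irrelevant to cache lifetime;
                // decommission progress is observed by polling, not notifications.
                break;
        }
    }
}
```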
OK, that makes sense. I think polling only is fine: it's less likely to get into a weird state and less complicated than listening for notifications and then falling back.
LGTM.
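The polling approach described above reduces to: on each tick, compare every node's operation mode against the last observed value and fire an event on change. A minimal sketch with illustrative names (the actual CassandraHealthCheckService reads the mode over JMX and publishes to an event bus):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of the poll-and-diff loop; modes are plain strings here.
class OperationModePoller {
    private final Map<String, String> lastModes = new HashMap<>();

    void poll(Map<String, String> currentModes, BiConsumer<String, String> onChange) {
        currentModes.forEach((node, mode) -> {
            String previous = lastModes.put(node, mode);
            if (previous != null && !previous.equals(mode)) {
                // e.g. LEAVING -> DECOMMISSIONED is the change the controller reconciles on
                onChange.accept(node, mode);
            }
        });
    }
}
```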
// scale down

// decommission the newest pod
cassandraConnectionFactory.connectionForPod(pods.get(0)).decommission();
IIRC, decommission will fail when you call it on the last node left; Cassandra will throw an exception.
Thinking it through a little further, it doesn't make sense to decommission when we are trying to get to zero just to delete the cluster, as the last few nodes will have a ton of data (it will also be slow).
We should probably have a special path for going from N -> 0.
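The branching being suggested can be made concrete: partial scale-downs decommission one node at a time, while N -> 0 takes a separate teardown path, since decommission throws on the last node and streaming all remaining data away would be pointless anyway. A hypothetical sketch; the `Action` names are illustrative, not the operator's API:

```java
class ScaleDownPlanner {
    enum Action { NONE, DECOMMISSION_NEWEST, DELETE_ALL }

    static Action plan(int currentReplicas, int desiredReplicas) {
        if (desiredReplicas >= currentReplicas) {
            return Action.NONE;              // not a scale-down
        }
        if (desiredReplicas == 0) {
            return Action.DELETE_ALL;        // N -> 0: skip decommission, just tear down
        }
        return Action.DECOMMISSION_NEWEST;   // shrink one node at a time
    }
}
```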
I think you mentioned we don't handle the delete case at the moment anyway (it leaves the PVs around)?
Do we care about scale -> 0? Currently the CassandraDataCenter CRD has a minimum value of 1 for the replicas field, meaning there always has to be one node.

True deletion of a DC is handled separately -- when the DC object is deleted (e.g. kubectl delete cdc test-datacenter-1). In this case we nuke everything -- services, default config maps, generated secrets, etc.

In contrast, StatefulSets do allow a replicas value of 0, which results in 0 pods being created.

Handling scale -> 0 wouldn't be difficult, I just wonder if it's necessary.
If full deletion of the CDC resource is handled separately, then this is less of an issue. The minimum value of 1 for replicas also works for me.
LGTM.
A few comments; my main concern is around the scale-down-to-0 workflow.
@liwang-pivotal Pod failure is handled by the StatefulSet machinery, which will correctly persist the PV between Pod replacements. The current issue only occurs when we purposefully delete a Pod because the number of replicas has changed. For example, if you have 6 replicas/pods/nodes and one fails, the replacement should, if possible, get the same PV so that Cassandra doesn't need to re-stream data. In the case of scaling 6 -> 3, we no longer care about the 3 removed pods and their PVs, and hence they should be deleted. What happens currently is that a 6 -> 3 -> 6 scale operation results in 3 new pods being created, but they get the old PVs attached, and C* complains that they belong to decommissioned nodes -- rightfully so! Instead, these 3 new pods should get new PVs.
I'm happy with the change set. See #40 to track PV clean-up and the operator getting stuck due to not having a statefulset watch.
* Adds support for cached JMX connections to Cassandra nodes. Adds decommission support for Cassandra pods.
* Disables `WatchService` (pending watch refactor)
* Completely disabled `WatchService` and other bits (pending watch refactor)
Changes of note:

* Adds the `-XX:+UseCGroupMemoryLimitForHeap` JVM flag to automatically set the operator JVM heap size to the container limits.
* Adds `CassandraConnectionFactory` for getting JMX connections to Cassandra, wrapped in the `CassandraConnection` class. The factory also caches JMX connections for 10 minutes (currently hardcoded, could be a config?) and watches for them to be closed, etc.
* Adds `CassandraConnection` for issuing JMX operations and queries to C*, behind a facade for easy use + future ability to handle version-specific differences between C* JMX interfaces.
* `CassandraHealthCheckService` now tracks the operation mode of C* nodes and fires events to the bus when these change, i.e., we now get an event when a node goes `LEAVING` -> `DECOMMISSIONED`.
* `ControllerService` has been updated to call decommission via JMX before scaling the statefulset (which is done as a reconcile once the node status changes to `DECOMMISSIONED`). Also some refactoring.