Operator upgrade should not cause a rolling restart #566
Comments
Tentatively supported. I'll look forward to watching your discussion, but it sounds like you've thought this through. I do think perhaps we should log errors or something if the user has done 1-2 consecutive upgrades without a rolling restart, but I concur that they should always be in control of the timing of the restart itself.
@burmanm , since this ticket is sized L, could you split it up in multiple subtasks? We'll make this one an epic.
These two sound closely related. To keep things simple, maybe we could use a single annotation with different possible values, such as once or always.
We could; the "once" value would be removed after applying it, but "always" would stay set in that case.
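The "once"/"always" semantics discussed above could be sketched as follows. This is a minimal illustration, not the actual cass-operator code; the annotation key is hypothetical, since the thread does not settle on a name.

```go
package main

import "fmt"

// Hypothetical annotation key; the real name is not settled in this discussion.
const allowUpdateAnnotation = "cassandra.datastax.com/allow-updates"

// shouldApplyUpdates reports whether the operator may push new specs, and
// whether the annotation should be cleared afterwards ("once" semantics).
func shouldApplyUpdates(annotations map[string]string) (apply bool, clearAfter bool) {
	switch annotations[allowUpdateAnnotation] {
	case "once":
		return true, true // apply updates, then remove the annotation
	case "always":
		return true, false // apply updates on every reconcile
	default:
		return false, false // no spec updates without an explicit opt-in
	}
}

func main() {
	apply, clear := shouldApplyUpdates(map[string]string{allowUpdateAnnotation: "once"})
	fmt.Println(apply, clear)
}
```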
From the other operator's side: if k8ssandra-operator has upgrades and cass-operator has upgrades, we might need only k8ssandra-operator to apply a new CassandraDatacenter. But we might also have a case where cass-operator has updates and k8ssandra-operator doesn't. So our K8ssandraTask needs more logic than the CassandraTask.
What is missing?
When updating the operator, there are two cases where we will actually cause a rolling restart of all Cassandra clusters managed by the same operator.
Now, while both are usually good updates, they shouldn't happen if the user makes no changes. We would hope that users upgrade to the newest builds of server-system-logger, for example, to get bug fixes and CVE-free base builds. And if we upgrade Cassandra to use better settings, we usually do so for stability purposes. However, neither should happen without the user's knowledge, and neither should be an uncontrolled side effect of updating the operator.
Instead, we should note in the release notes if we think the user should do a rolling restart to get newer images or other improvements.
I would propose we modify the processing slightly in cass-operator. If the Generation of the CassandraDatacenter is equal to the ObservedGeneration, we would stop applying updates to any spec we deploy, but keep doing the other parts of the processing (such as restarting nodes). Thus, this check can't happen at the beginning of the Reconcile function, but instead at the points where something would be updated. All decommissions, status updates to CassandraDatacenter/status, and other maintenance should continue to work as before.
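The gating condition above can be sketched as a small predicate. The field names follow the usual Kubernetes convention (metadata.generation bumped on spec changes, status.observedGeneration written after reconcile); this is a simplified stand-in type, not the real CassandraDatacenter API object.

```go
package main

import "fmt"

// datacenter is a stand-in for the relevant CassandraDatacenter fields.
type datacenter struct {
	Generation         int64 // metadata.generation, bumped when the user changes the spec
	ObservedGeneration int64 // status.observedGeneration, recorded after a reconcile
}

// specUpdatesAllowed reports whether the operator should push changes to the
// deployed specs. When the generations match, the user has made no spec change
// since the last reconcile, so an operator upgrade alone must not rewrite the
// deployed objects.
func specUpdatesAllowed(dc datacenter) bool {
	return dc.Generation != dc.ObservedGeneration
}

func main() {
	fmt.Println(specUpdatesAllowed(datacenter{Generation: 3, ObservedGeneration: 3}))
	fmt.Println(specUpdatesAllowed(datacenter{Generation: 4, ObservedGeneration: 3}))
}
```

Note that with this scheme, maintenance work (decommissions, status updates) would bypass the predicate entirely; only spec writes are gated.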
A rolling restart should cause the upgrade to the StatefulSet specs and the other changes we detect. While this is a side effect, there's an additional benefit: self-healing. At the moment, a user can modify the StatefulSet and we will not revert the change, which could let users make the cluster unstable while the operator does nothing to fix the situation. The current rolling restart is done by modifying a PodTemplateSpec annotation on the StatefulSet that stores the restart time (like kubectl's rollout restart does), so we need to catch even this small change. What we have to prevent is rolling restarting twice: first for the annotation change and then for the cass-operator change. I'm not sure how to do that in every case, but that should be the goal.
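One way to avoid the double restart described above is to fold the deferred spec changes and the restart-timestamp bump into a single pod-template update, so the StatefulSet controller sees one change and rolls the pods once. The sketch below uses the annotation key that kubectl rollout restart writes; whether cass-operator uses the same key, and the map-based stand-in for the pod template, are assumptions for illustration.

```go
package main

import "fmt"

// kubectl rollout restart writes this pod-template annotation; the issue says
// cass-operator's rolling restart works the same way (the exact key it uses
// is an assumption here).
const restartedAtAnnotation = "kubectl.kubernetes.io/restartedAt"

// templateChange is a stand-in for one deferred pod-template update
// (e.g. a newer server-system-logger image held back since the operator upgrade).
type templateChange struct{ key, value string }

// rollOnce applies the deferred changes and the restart-timestamp bump to the
// pod template in one shot. Because the StatefulSet receives a single update,
// its controller performs only one rolling restart.
func rollOnce(template map[string]string, pending []templateChange, now string) map[string]string {
	out := make(map[string]string, len(template)+len(pending)+1)
	for k, v := range template {
		out[k] = v
	}
	for _, c := range pending {
		out[c.key] = c.value // deferred operator-upgrade changes go out now
	}
	out[restartedAtAnnotation] = now // the same write that triggers the roll
	return out
}

func main() {
	tpl := rollOnce(
		map[string]string{"image": "logger:v1"},
		[]templateChange{{"image", "logger:v2"}},
		"2024-01-01T00:00:00Z",
	)
	fmt.Println(tpl["image"], tpl[restartedAtAnnotation])
}
```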
We will need a test for all of this using the "upgrade_operator" test (and we need to make it work with a newer Cassandra version as well) to detect cases where an operator upgrade would do a rolling restart, and fail the build in those cases.
Possible issues:
Why is this needed?
Updating the operator should be safe for the cluster, always. We should not make people run old versions of cass-operator because they're afraid of updating or are unable to restart all of their clusters. Multiple clusters managed by the same cass-operator would be especially problematic, since restarting all of them would cause heavy load on the underlying Kubernetes cluster.
Tasks