Fix upgrade from OSC 1.4.1 #366
Conversation
The monitor pods require a custom SELinux policy to be installed. The monitor daemon set (DS) should thus be created after that, in order to avoid polluting the logs with transient errors. The current code base does this in a convoluted and slightly bogus way: (1) it is assumed that the SELinux policy results from the creation of the runtime class, and (2) the DS creation code is located in a place where only an extra check allows knowing whether the runtime class exists. (1) is wrong, since the SELinux policy actually results from the creation of the security context constraints (SCC). As already suggested in a comment, the creation of the DS should happen just after the SCC is created, which addresses (2). This removes a user of KataConfigStatus::RuntimeClass, which will be renamed in a subsequent patch. Signed-off-by: Greg Kurz <groug@kaod.org>
The list of nodes where kata is deployed is part of the status of the KataConfig. The controller is expected to ensure this list is up-to-date during all reconcile runs. Changes to this list usually happen during installation/uninstallation of kata on some nodes. But it is also possible that reconcile is called while the status is lacking the list of nodes for some reason, e.g. the KataConfig was edited externally or it predates PR openshift#329. Stop restricting the regeneration of this list to only happen when kata is being deployed or removed. This might result in some more traffic between the client and server, but idempotency should prevail. Signed-off-by: Greg Kurz <groug@kaod.org>
PR openshift#344 changed the type of KataConfigStatus::RuntimeClass from string to array of string. This kind of change in the CRD isn't supported and prevents upgrades from an older operator if a KataConfig is present. The new CSV stays in the `Pending` state forever and the following error is reported by the install plan: message: 'error validating existing CRs against new CRD''s schema for "kataconfigs.kataconfiguration.openshift.io": error validating custom resource against new schema for KataConfig /example-kataconfig: [].status.runtimeClass: Invalid value: "string": status.runtimeClass in body must be of type array: "string"' Simply rename the field to avoid the issue. Signed-off-by: Greg Kurz <groug@kaod.org>
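For illustration, the incompatible schema change looks roughly like this. This is a simplified sketch of the OpenAPI schema fragments, not the actual CRD contents:

```yaml
# Old CRD (1.4.1): a single runtime class name.
status:
  properties:
    runtimeClass:
      type: string

# New CRD (PR openshift#344): a list of names under the same field name.
status:
  properties:
    runtimeClass:
      type: array
      items:
        type: string

# An existing KataConfig whose status.runtimeClass holds a plain string
# fails validation against the new schema, which blocks the CSV upgrade.
# Renaming the field sidesteps the validation of existing CRs.
```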
LGTM
Tested successfully on OCP 4.14
Thanks @gkurz !
```go
if r.getInProgressConditionValue() != corev1.ConditionTrue {
	return nil
}
```
Regarding the commit message, why does just editing KataConfig externally trigger this problem? I'd say predating PR #329 should be the only scenario.
It is possible to remove the node list with something like:
oc patch --type=merge --subresource=status --patch='{"status":{"kataNodes":null}}' kataconfig/my-kataconfig
Of course, people shouldn't do that but it doesn't mean we shouldn't be able to recover 😉
What I mean is, to my understanding predating PR #329 is the only actual condition. If that's fulfilled then any store will cause problems, right? A store can happen in a number of ways and the user editing the CR is in no way special among them.
Nope. Install 1.5.0 on a pristine cluster, deploy kata and do the `oc patch` above; you'll see in the controller logs that reconcile is called but the node list isn't rebuilt.
Is un-rebuilding nodes a blocker?
Is un-rebuilding nodes a blocker?
Not really, as it doesn't prevent the operator from being functional.
Oh, I suspect there are quite a lot of things a user could do to sabotage the controller from which it wouldn't recover. ;-) But that's a fact independent of the idea of this PR - in fact, this has always been true and continues to be true even after this PR, I believe.
My idea was not to mix independent facts in the message, and not to present them in the same context as if they were related, since that could confuse a future reader. I'm not insisting though.
Ah this is merged already... never mind.
Pre-merge testing: see KATA-2593
/override ci/prow/check
@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@gkurz: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- Description of the problem which is fixed/What is the use case
This fixes KATA-2593 (Kata Operator upgrade failed 1.4.1 to 1.5.0).
- What I did
Renamed the offending field in KataConfigStatus and improved the idempotency of the controller.
- How to verify it
The following scenarios should be checked:
I believe this requires first uninstalling OSC completely, i.e.
oc delete ns openshift-sandboxed-containers-operator
[*] quay.io/openshift_sandboxed_containers/openshift-sandboxed-containers-operator-catalog:1.4.1-10
[**] quay.io/rhgkurz/openshift-sandboxed-containers-operator-catalog:v1.5.1
- Description for the changelog