Fix upgrade from OSC 1.4.1 #366
Conversation
The monitor pods require a custom SELinux policy to be installed. The monitor daemon set (DS) should thus be created after that, in order to avoid polluting the logs with transient errors. The current code base does this in a convoluted and slightly bogus way: (1) it is assumed that the SELinux policy results from the creation of the runtime class, and (2) the DS creation code is located in a place where only an extra check allows knowing whether the runtime class exists. (1) is wrong, since the SELinux policy actually results from the creation of the security context constraints (SCC). As already suggested in a comment, the creation of the DS should happen just after the SCC is created, which addresses (2). This removes a user of KataConfigStatus::RuntimeClass, which will be renamed in a subsequent patch. Signed-off-by: Greg Kurz <groug@kaod.org>
The list of nodes where kata is deployed is part of the status of the KataConfig. The controller is expected to ensure this list is up-to-date during all reconcile runs. Changes to this list usually happen during installation/uninstallation of kata on some nodes. But it is also possible that reconcile is called while the status is lacking the list of nodes for some reason, e.g. the KataConfig was edited externally or it predates PR openshift#329. Stop restricting the regeneration of this list to only happen when kata is being deployed or removed. This might result in some more traffic between the client and server, but idempotency should prevail. Signed-off-by: Greg Kurz <groug@kaod.org>
PR openshift#344 changed the type of KataConfigStatus::RuntimeClass from string to array of string. This kind of change in the CRD isn't supported and prevents upgrades from an older operator if a KataConfig is present. The new CSV stays in the `Pending` state forever and the following error is reported by the install plan: message: 'error validating existing CRs against new CRD''s schema for "kataconfigs.kataconfiguration.openshift.io": error validating custom resource against new schema for KataConfig /example-kataconfig: [].status.runtimeClass: Invalid value: "string": status.runtimeClass in body must be of type array: "string"' Simply rename the field to avoid the issue. Signed-off-by: Greg Kurz <groug@kaod.org>
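For illustration, the incompatible schema change looks roughly like this. This is a simplified sketch of the OpenAPI schema fragments, not the actual CRD contents:

```yaml
# Old CRD (1.4.1): a single runtime class name.
status:
  properties:
    runtimeClass:
      type: string

# New CRD (PR openshift#344): a list of names under the same field name.
status:
  properties:
    runtimeClass:
      type: array
      items:
        type: string

# An existing KataConfig whose status.runtimeClass holds a plain string
# fails validation against the new schema, which blocks the CSV upgrade.
# Renaming the field sidesteps the validation of existing CRs.
```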
LGTM
Tested successfully on OCP 4.14
Thanks @gkurz !
```go
if r.getInProgressConditionValue() != corev1.ConditionTrue {
	return nil
}
```
Regarding the commit message, why does just editing KataConfig externally trigger this problem? I'd say predating PR #329 should be the only scenario.
It is possible to remove the node list with something like:
oc patch --type=merge --subresource=status --patch='{"status":{"kataNodes":null}}' kataconfig/my-kataconfig
Of course, people shouldn't do that but it doesn't mean we shouldn't be able to recover 😉
What I mean is, to my understanding predating PR #329 is the only actual condition. If that's fulfilled then any store will cause problems, right? A store can happen in a number of ways and the user editing the CR is in no way special among them.
Nope. Install 1.5.0 on a pristine cluster, deploy kata and do the `oc patch` above; you'll see in the controller logs that reconcile is called but the node list isn't rebuilt.
Is un-rebuilding nodes a blocker?
Is un-rebuilding nodes a blocker?
Not really, as it doesn't prevent the operator from being functional.
Oh, I suspect there are quite a lot of things a user could do to sabotage the controller from which it wouldn't recover. ;-) But that's a fact independent of the idea of this PR - in fact, this has always been true and continues to be true even after this PR, I believe.
My idea was not to mix independent facts in the message, and not to present them in the same context as if they were related, since that could confuse a future reader. I'm not insisting though.
Ah this is merged already... never mind.
Pre-merge testing: see KATA-2593
/override ci/prow/check
@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@gkurz: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
- Description of the problem which is fixed/What is the use case
This fixes KATA-2593 (Kata Operator upgrade failed 1.4.1 to 1.5.0).
- What I did
Renamed the offending field in KataConfigStatus and improved the idempotency of the controller.
- How to verify it
The following scenarios should be checked:
I believe this requires first uninstalling OSC completely, i.e.
oc delete ns openshift-sandboxed-containers-operator
[*] quay.io/openshift_sandboxed_containers/openshift-sandboxed-containers-operator-catalog:1.4.1-10
[**] quay.io/rhgkurz/openshift-sandboxed-containers-operator-catalog:v1.5.1
- Description for the changelog