
Fix upgrade from OSC 1.4.1 #366

Merged
merged 3 commits into openshift:devel on Dec 8, 2023

Conversation

@gkurz (Member) commented Dec 7, 2023

- Description of the problem which is fixed/What is the use case

This fixes KATA-2593 (Kata Operator upgrade failed 1.4.1 to 1.5.0).

- What I did

Renamed the offending field in KataConfigStatus and improved idempotency of the controller.

- How to verify it

  1. Install OSC 1.4.1 with automatic upgrade enabled [use catalog image at *]
  2. Create a KataConfig
  3. Set up a catalog source with this PR [use catalog image at **]
  4. Patch the subscription of the operator to select the new source
     oc patch subscription -n openshift-sandboxed-containers-operator sandboxed-containers-operator -p '{"spec":{"source":"my-operator-catalog", "startingCSV":"sandboxed-containers-operator.v1.5.1", "channel":"candidate"}}' --type merge
  5. Upgrade should start and complete successfully

The following scenarios should be checked:

  • User has 1.4.1 installed (described above)
  • User has 1.4.1/1.5.0 in pending state (KATA-2593).
    I believe this requires first uninstalling OSC completely, i.e. oc delete ns openshift-sandboxed-containers-operator
  • User has 1.5.0 installed

[*] quay.io/openshift_sandboxed_containers/openshift-sandboxed-containers-operator-catalog:1.4.1-10
[**] quay.io/rhgkurz/openshift-sandboxed-containers-operator-catalog:v1.5.1

- Description for the changelog

The monitor pods require a custom SELinux policy to be installed. The
monitor daemon set (DS) should thus be created after that, in order to
avoid polluting the logs with transient errors.

The current code base does this in a somewhat convoluted and slightly
bogus way:
1) it is assumed that the SELinux policy results from the creation of
   the runtime class
2) the DS creation code is located in a place where only an extra
   check reveals whether the runtime class exists

[1] is wrong since the SELinux policy results from the creation of
the security context constraints (SCC). As already suggested in
a comment, it seems that the creation of the DS should happen
just after the SCC is created to address [2].

This removes a user of KataConfigStatus::RuntimeClass that will be
renamed in the subsequent patch.

Signed-off-by: Greg Kurz <groug@kaod.org>
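
A minimal sketch of the reordering this commit describes; all identifiers here (monitorReconciler, createScc, createMonitorDaemonSet) are illustrative assumptions, not the operator's actual names:

package controllers

// Illustrative sketch only: names are assumed, not taken from the
// actual openshift-sandboxed-containers-operator code base.
type monitorReconciler struct{}

// createScc creates the security context constraints (SCC); this is
// what actually installs the custom SELinux policy the monitor pods need.
func (r *monitorReconciler) createScc() error { return nil }

// createMonitorDaemonSet creates the monitor daemon set (DS).
func (r *monitorReconciler) createMonitorDaemonSet() error { return nil }

// reconcileMonitors creates the DS right after the SCC rather than
// tying it to the runtime class, so the SELinux policy is already in
// place and the monitor pods don't log transient startup errors.
func (r *monitorReconciler) reconcileMonitors() error {
	if err := r.createScc(); err != nil {
		return err
	}
	return r.createMonitorDaemonSet()
}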
The list of nodes where kata is deployed is part of the status of
the KataConfig. It is expected that the controller ensures this
list is up-to-date during all reconcile runs. Changes to this
list usually happen during installation/uninstallation of kata
on some nodes. But it is also possible that reconcile is called
and the status is lacking the list of nodes for some reason, e.g.
KataConfig was edited externally or KataConfig predates PR openshift#329.

Stop restricting the regeneration of this list to only happen when
kata is being deployed or removed. This might result in some more
traffic between the client and the server, but idempotency should prevail.

Signed-off-by: Greg Kurz <groug@kaod.org>
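
A hedged sketch of the idempotency change; the status types and the updateNodeList helper are assumptions for illustration, not the operator's real API:

package controllers

// Illustrative status types; the real KataConfig API differs in detail.
type KataNodesStatus struct {
	Installed []string
}

type KataConfigStatus struct {
	KataNodes KataNodesStatus
}

// updateNodeList rebuilds the node list on every reconcile run instead
// of only while kata is being deployed or removed. A status that was
// wiped externally (or that predates PR #329) therefore converges back
// to the actual cluster state, at the cost of a few extra API calls.
func updateNodeList(status *KataConfigStatus, listKataNodes func() ([]string, error)) error {
	nodes, err := listKataNodes()
	if err != nil {
		return err
	}
	status.KataNodes.Installed = nodes // unconditional update
	return nil
}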
PR openshift#344 changed the type of KataConfigStatus::RuntimeClass from string
to array of string. This change in the CRD isn't supported and prevents
upgrades from an older operator if a KataConfig is present. The new CSV
will stay in the `Pending` state forever and the following error is
reported by the install plan:

message: 'error validating existing CRs against new CRD''s schema for "kataconfigs.kataconfiguration.openshift.io":
      error validating custom resource against new schema for KataConfig /example-kataconfig:
      [].status.runtimeClass: Invalid value: "string": status.runtimeClass in body
      must be of type array: "string"'

Simply rename the field to avoid the issue.

Signed-off-by: Greg Kurz <groug@kaod.org>
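
A sketch of the rename in the status type; the new field name shown here (RuntimeClasses) is an assumption, and the actual name chosen by the PR may differ:

package v1

// KataConfigStatus sketch. Changing an existing field's type
// (string -> []string) breaks OLM's validation of pre-existing CRs
// against the new CRD schema, whereas introducing the array under a
// new name does not: the old string-typed runtimeClass simply stops
// being part of the schema.
type KataConfigStatus struct {
	// Old 1.4.1 field, removed rather than retyped:
	//   RuntimeClass string `json:"runtimeClass,omitempty"`

	// Assumed new name for the array-typed field added by PR #344.
	RuntimeClasses []string `json:"runtimeClasses,omitempty"`
}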
@littlejawa (Contributor) left a comment:


LGTM
Tested successfully on OCP 4.14
Thanks @gkurz!

if r.getInProgressConditionValue() != corev1.ConditionTrue {
return nil
}

Contributor commented:

Regarding the commit message, why does just editing KataConfig externally trigger this problem? I'd say predating PR #329 should be the only scenario.

@gkurz (Member Author) commented:

It is possible to remove the node list with something like:

oc patch --type=merge --subresource=status --patch='{"status":{"kataNodes":null}}' kataconfig/my-kataconfig

Of course, people shouldn't do that but it doesn't mean we shouldn't be able to recover 😉

Contributor commented:

What I mean is, to my understanding predating PR #329 is the only actual condition. If that's fulfilled then any store will cause problems, right? A store can happen in a number of ways and the user editing the CR is in no way special among them.

@gkurz (Member Author) commented:

Nope. Install 1.5.0 on a pristine cluster, deploy kata, and do the oc patch above; you'll see in the controller logs that reconcile is called but the node list isn't rebuilt.

Contributor commented:

Is un-rebuilding nodes a blocker?

@gkurz (Member Author) commented:

Is un-rebuilding nodes a blocker?

Not really, as it doesn't prevent the operator from being functional.

Contributor commented:

Oh, I suspect there are quite a lot of things a user could do to sabotage the controller that it wouldn't recover from. ;-) But that's a fact independent of the idea of this PR - in fact, this has always been true and continues to be true even after this PR, I believe.

My idea was not to mix independent facts in the message, nor to put them in the same context as if they were related, since that could confuse a future reader. I'm not insisting though.

Contributor commented:

Ah this is merged already... never mind.

@gkurz requested review from tbuskey and removed the request for tbuskey on December 8, 2023 at 15:30
@tbuskey (Contributor) commented Dec 8, 2023

premerge testing: see KATA-2593

@gkurz (Member Author) commented Dec 8, 2023

/override ci/prow/check

openshift-ci bot commented Dec 8, 2023

@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check

In response to this:

/override ci/prow/check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot commented Dec 8, 2023

@gkurz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/sandboxed-containers-operator-e2e
Commit: 3e6f490
Required: false
Rerun command: /test sandboxed-containers-operator-e2e


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gkurz merged commit e60ed4b into openshift:devel on Dec 8, 2023
3 of 4 checks passed
@gkurz mentioned this pull request on Dec 8, 2023