Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge to main for 1.5 #357

Merged
merged 62 commits into from Nov 14, 2023
Merged

Merge to main for 1.5 #357

merged 62 commits into from Nov 14, 2023

Conversation

gkurz
Copy link
Member

@gkurz gkurz commented Nov 14, 2023

- Description of the problem which is fixed/What is the use case

This is merging the devel branch into main in order to prepare the v1.5.0 release.

- What I did

  1. Cloned the main branch locally as merge-to-main-for-1.5
  2. Reverted Fix kataconfig status handling to support installation updates (port to main) #339 from merge-to-main-for-1.5 (see below for justification)
  3. Merge the devel branch into merge-to-main-for-1.5

- How to verify it

  • git diff ${clone_of_this_PR} devel should be empty.
  • Run QE tests

- Description for the changelog

Goal of this PR is to merge devel into main. This required some care as plain git merge devel was failing to start with. This was the consequence of git merge not being able to correlate cherry-picks in #339 with the original commits on devel. This got workaround by reverting #339 (backport of #327) and proceeding with the merge, which brought back the original changes from #327.

pmores and others added 30 commits May 17, 2023 13:00
This fixes an omission in PR openshift#300 where the special case of KataConfig
deletion while there are no kata nodes on a cluster wasn't handled
properly and uninstallation got stuck.  This happened because the
uninstallation flow assumed incorrectly that it will always be
necessary for the MCO to reconciliate which is not the case.

Installation flow (PR openshift#291) got this right and while the idea of PR openshift#300
was basically to make uninstallation flow analogous to installation, this
aspect was omitted by mistake.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Merge main into devel after v1.4.0 release
The DaemonSet launched by the controller doesn't put tolerations on the
monitor pods. This prevents the monitor pods to run on tainted nodes and
break metrics for sandboxed containers.

The monitor pods must run on every node where kata is deployed, no
matter any taint the user might have set. Enforce a toleration that
matches all possible taints.

Fixes https://issues.redhat.com/browse/KATA-2121

Signed-off-by: Greg Kurz <groug@kaod.org>
This is a simpler and more transparent alternative to getMcp(). It will be
useful in a subsequent commit.

Signed-off-by: Pavel Mores <pmores@redhat.com>
If the MCO considers a Node Done, we check if its current and target
MachineConfigs are the same.

If they are then the Node is actually done and we append it either
to the installation completed list or uninstallation completed list based
on whether installation or uninstallation is in progress and whether
the Node is a member of "worker" or "kata-oc".

If they aren't it means that the Node is waiting for its MC to be changed
as part of this ongoing installation or uninstallation.  We don't append
such Nodes to any list since there doesn't seem to be a good place for
them in the current KataConfig.status structure.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Working and Degraded Nodes are handled structurally similarly so they are
handled by the same function.

Working Nodes are put on an InProgress list, Degraded Nodes are put on
a Failed list.

If uninstallation is in progress Nodes go to uninstallation lists.
If installation is in progress they go to installation lists on converged
clusters or if they are labeled for kata, otherwise they go to
uninstallation lists (this handles the case when kata is removed from
a Node as part of kata installation modification on a cluster).

Signed-off-by: Pavel Mores <pmores@redhat.com>
This puts together functionality added by the previous commits.

The approach to KataConfig.status updating introduced here fundamentally
differs from the existing updateStatus() function.

The existing function only handled kata installation to and uninstallation
from a cluster with no support for modifications (adding/removing
kata-enabled Nodes) in the meantime.  Thus it made sense to split status
updating at the top level into two largely independent branches, one for
installation and the other for uninstallation, each of which then iterated
over Nodes and updated status based on the Nodes' MCO 'state' and other
information.

This model however breaks when kata installation modifications are
introduced since they mean that kata may actually be uninstalled from Nodes
as part of an operation that is nominally an installation.

The approach introduced here kind of turns the existing one inside out.
It iterates over *Nodes* at the top level and only then it figures out
for each Node individually whether kata is being installed on it,
uninstalled from it, or if the Node is left alone by the ongoing operation.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Any errors from the new status updating are just logged, status updating
thus cannot mess up installation/uninstallation flow anymore.  This is
appropriate as status updating now does just that, update status, so any
failures should not divert the main flow.

This also removes the last opportunity for races like the recent
uninstallation wedging due to updateStatus() failing to retrieve the
"kata-oc" MCP which has already been deleted the uninstallation flow,
and thus forcing the uninstallation flow to return early with an error and
never to be able to finish.

Signed-off-by: Pavel Mores <pmores@redhat.com>
With removal of the legacy updateStatus(), updateStatusNew() can be renamed
to the original name.

With removal of the legacy updateStatus() infrastructure getMcp() is left
with no callers and can be removed.

With removal of getMcp() the last remaining caller of
getMcpNameIfMcpExists() is removed so this function can finally be removed
too.  This concludes a refactor started in commit 3adfbd5.

Signed-off-by: Pavel Mores <pmores@redhat.com>
All predicates take the same arguments describing a node and figure out
if kata on the node is Installed, Installing, WaitingToInstall,
FailedToInstall, NotInstalled, Uninstalling, WaitingToUninstall or
FailedToUninstall.

Signed-off-by: Pavel Mores <pmores@redhat.com>
This is a preparation for the next commit where this functionality will
be needed.

Signed-off-by: Pavel Mores <pmores@redhat.com>
The first part computes inputs for the kata status predicates, the rest
runs the predicates.  For now, the result of classification is just logged.

Signed-off-by: Pavel Mores <pmores@redhat.com>
A list of nodes is added to KataConfig.status for each of Installed,
Installing, WaitingToInstall, FailedToInstall, Uninstalling,
WaitingToInstall and FailedToInstall kata installation statuses.  There's
no list for NotInstalled as that's left implicit (the NotInstalled status
is still logged though).

putNodeOnStatusList() is also changed to actually fill the lists.

Signed-off-by: Pavel Mores <pmores@redhat.com>
This is basically a counterpart of putNodeOnStatusList(), to be called
before the lists can start being populated by putNodeOnStatusList().

Signed-off-by: Pavel Mores <pmores@redhat.com>
The new functionality runs alongside of the existing status reporting
for the time being.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Despite 1.4.0 was released about a month ago, there are still some mentions
of 1.3.x in the tree :

$ git grep -e '1\.3\.[0123]' -- ':!go.*'
config/manager/kustomization.yaml:  newTag: 1.3.1
config/samples/deploy.yaml: image:  quay.io/openshift_sandboxed_containers/openshift-sandboxed-containers-operator-catalog:v1.3.0
config/samples/deploy.yaml:  startingCSV: sandboxed-containers-operator.v1.3.0

This is essentially because we don't do automated upstream builds.

Let's bump the versions to latest public release : 1.4.0.

Signed-off-by: Greg Kurz <groug@kaod.org>
The data structures and semantics largely correspond to recommendations in
k8s api-conventions document.  The handling added is just what will be
needed, with no attempt to implement a complete set of conventional list
operations (i.e. no remove).

Signed-off-by: Pavel Mores <pmores@redhat.com>
A getter for InProgress Condition status is added along with a bunch of
setters to ease setting the Condition to various states.

Signed-off-by: Pavel Mores <pmores@redhat.com>
In uninstallation the InProgress Condition handling corresponds to the
legacy UnInstallationStatus.InProgress.IsInProgress boolean handling
pretty much one-to-one.

In installation the handling diverges slightly since InProgress offers
a bit more granularity, making a difference between an initial installation
and a later update, so the handling has to be a tiny bit different.
Resetting however again corresponds to the legacy
InstallationStatus.IsInProgress boolean handling.

The Failed state is detected and reported from KataConfig.status updating
code as before.

Signed-off-by: Pavel Mores <pmores@redhat.com>
This effectively moves the legacy KataConfig.status.totalNodesCount field
to a more suitable place created by addition of KataNodesStatus.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Bump outdated versions of OSC components
The config file name is configuration-remote.toml. Hence renaming
kata-remote.conf file to configuration.toml.

Also the butane config used to create machineconfig yaml to apply
the Kata configuration is added.

Example butane run to create the machineconfig
```
butane -d ./local mc-remote-butane.conf > mc-40-kata-remote-config.yaml
```

Fixes: #KATA-2268

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
This updates the machineconfig to apply the new Kata configuration
which enables the following annotations

- default_vcpus
- default_memory
- machine_type

Fixes: #KATA-2268

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Update machineconfig to enable required annotations for flexible instance types
snir911 and others added 14 commits October 17, 2023 14:43
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
and update peer-pods-cm ConfigMap with the created image value.
also sets BOOT_FIPS=true if controller is running on FIPS enabled system

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
caa repo as it's too fragile

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
podvm: run image jobs from the controller automatically
…ogress

When KataConfig deletion processing is allowed to start while a previous
installation or update is still running both processes will interfere with
each other, resulting in a variety of possible cluster states, most of
them undesirable (most often, uninstallation blocks indefinitely).

This commit alleviates the situation by not allowing uninstallation to
start while installation or an update is still underway.  Once the current
operation finishes uninstallation should start immediately.

Signed-off-by: Pavel Mores <pmores@redhat.com>
We usually fetching cloud-provider name from either the peer-pods CM CLOUD_PROVIDER value
or from the infra object, as cloud provider is not expected to be changed at runtime
this commit moves its fetching into the image-generator initialization.
Additionally fetching the cloud-provider solely from the infra Object to avoid relying on
user input (from peer-pods CM)

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
in peer-pods ConfigMap or Secret
it's assumed essential values will not be modified once set and
kataConfig was created

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
peer-pods: validate CM and Secret are set
As a preparatory step for the upcoming release, this bumps the version
to 1.5.0 in Makefile, the CSV and images.

Note that the updated images aren't available at this stage. They
will be added later. Until then the image links aren't valid.

Signed-off-by: Greg Kurz <groug@kaod.org>
This reverts commit 5617c1b, reversing
changes made to 1a660d8.

Signed-off-by: Greg Kurz <groug@kaod.org>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2023
Copy link

openshift-ci bot commented Nov 14, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

@pmores pmores left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @gkurz !

@gkurz gkurz marked this pull request as ready for review November 14, 2023 12:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 14, 2023
@gkurz
Copy link
Member Author

gkurz commented Nov 14, 2023

/test

Copy link

openshift-ci bot commented Nov 14, 2023

@gkurz: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test check
  • /test ci-index-openshift-sandboxed-containers-operator-bundle
  • /test images

The following commands are available to trigger optional jobs:

  • /test sandboxed-containers-operator-e2e

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gkurz
Copy link
Member Author

gkurz commented Nov 14, 2023

make test is broken as explained in #310 (comment).

/override ci/prow/check

Copy link
Contributor

@littlejawa littlejawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
Thanks @gkurz !

Copy link

openshift-ci bot commented Nov 14, 2023

@gkurz: Overrode contexts on behalf of gkurz: ci/prow/check

In response to this:

make test is broken as explained in #310 (comment).

/override ci/prow/check

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

openshift-ci bot commented Nov 14, 2023

@gkurz: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/sandboxed-containers-operator-e2e 22c683d link false /test sandboxed-containers-operator-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Member

@beraldoleal beraldoleal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executed the git diff and tests, lgtm.

@gkurz gkurz merged commit 67adbe1 into openshift:main Nov 14, 2023
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants