
Remove legacy kata config status parts #333

Merged
merged 7 commits into openshift:devel from remove-legacy-KataConfig-status-parts on Aug 21, 2023

Conversation

pmores
Contributor

@pmores pmores commented Jul 11, 2023

This PR removes the parts of KataConfig.status obsoleted by #330 (and #329 before it): totalNodesCount, installationStatus, unInstallationStatus (and the stub of upgradeStatus).

As it turns out, without this it would be rather hard (likely impossible)
to keep backwards compatibility for the "Completed" AdditionalPrinterColumn
in the KataConfig CRD.  The semantics of readyNodeCount are simple
(literally just the length of status.kataNodes.installed), but the syntax of
crd.spec.versions.additionalPrinterColumns only allows a jsonPath to define
the value of a column, and the Kubernetes JSONPath implementation doesn't
seem to support taking the length of an array.

(The go-template output format of kubectl/oc would make this trivial but
go-template doesn't seem to be supported by CRD at all.)

Signed-off-by: Pavel Mores <pmores@redhat.com>
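For illustration, a rough sketch of the kind of kubebuilder printer-column markers this implies. The JSONPaths below are assumptions pieced together from the discussion in this PR, not copied from its diff; the authoritative definitions are in the PR's kataconfig_types.go changes.

// JSONPath can filter the conditions list for the InProgress column ...
// +kubebuilder:printcolumn:name="InProgress",type=string,JSONPath=`.status.conditions[?(@.type=="InProgress")].status`
//
// ... but it cannot take the length of .status.kataNodes.installed, which is
// why readyNodeCount is kept as an explicit status field for the Completed
// column to point at:
// +kubebuilder:printcolumn:name="Completed",type=string,JSONPath=`.status.kataNodes.readyNodeCount`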
This implicitly improves the behaviour of the InProgress and Completed
columns, which ignored uninstallation so far.

Signed-off-by: Pavel Mores <pmores@redhat.com>
Handling of uninstallation being blocked by running kata pods turned out
to fit the InProgress Condition neatly by introducing a new
"BlockedByExistingKataPods" reason for _not_ being in-progress.

The rest of the migration is pretty straightforward - setting the legacy
IsInProgress boolean to True corresponds to
setInProgressConditionToUninstalling(), setting it to False corresponds to
resetInProgressCondition().

One place that set the boolean to True was spurious (it was already True),
so it's removed without replacement.

Signed-off-by: Pavel Mores <pmores@redhat.com>
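Roughly, the new reason maps onto the standard apimachinery condition helpers like this. A minimal sketch only: the package, helper name and message text are illustrative and not taken from the PR; the Type, Status and Reason values are the ones discussed above.

package controllers

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative helper: mark the InProgress condition False with the new
// reason instead of touching the removed unInstallationStatus fields.
func setInProgressBlockedByExistingKataPods(conditions *[]metav1.Condition) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    "InProgress",
		Status:  metav1.ConditionFalse,
		Reason:  "BlockedByExistingKataPods",
		Message: "uninstallation is blocked while kata pods are still running",
	})
}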
While the idea is simple and the same as with the uninstallation flow
in the previous commit, installation actually did use
InstallationStatus.IsInProgress in its implementation, so the replacement
is a bit trickier.

It's solved by observing that the legacy field was only set if the MCO
was updating "kata-oc" and/or "worker".  We can do the same for a newly
introduced isInstallationInProgress variable whose semantics are basically
the same as InstallationStatus.IsInProgress', just without persisting it
in the KataConfig resource.

Signed-off-by: Pavel Mores <pmores@redhat.com>
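In essence, the check behind that variable is a look at the MCO's pool conditions. A minimal sketch under stated assumptions: the import path (corev1 is k8s.io/api/core/v1), the function name and the idea of passing the pool in directly are illustrative, not the PR's actual code.

package controllers

import (
	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	corev1 "k8s.io/api/core/v1"
)

// isMcpUpdating is an illustrative stand-in for isInstallationInProgress:
// true while the MCO reports the given pool ("kata-oc" or "worker") as
// Updating.  Nothing gets persisted into KataConfig.status.
func isMcpUpdating(pool *mcfgv1.MachineConfigPool) bool {
	for _, c := range pool.Status.Conditions {
		if c.Type == mcfgv1.MachineConfigPoolUpdating && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}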
@openshift-ci openshift-ci bot requested review from bpradipt and jensfr July 11, 2023 15:03
@tbuskey
Contributor

tbuskey commented Jul 11, 2023

Looks good with kata/peer-pods workloads

Contributor

@snir911 snir911 left a comment


LGTM thanks!

Signed-off-by: Pavel Mores <pmores@redhat.com>
The "is in progress" check at the top of updateStatus() is now simplified,
and the kata-enabled node count can now be assigned directly to
Status.KataNodes.NodeCount.  Updating the legacy structures is simply turned
off, and proper error handling is added to the putNodeOnStatusList() call
as this function now bears the brunt of updateStatus().

Signed-off-by: Pavel Mores <pmores@redhat.com>
They are not used by anything anymore.

Signed-off-by: Pavel Mores <pmores@redhat.com>
@pmores pmores force-pushed the remove-legacy-KataConfig-status-parts branch from 8e75a72 to 6a0a6de on July 12, 2023 11:41
@pmores
Contributor Author

pmores commented Jul 12, 2023

Force-push to add a commit switching envtest to the new KataConfig.status (thanks @tbuskey for pointing this out)
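A switched-over envtest assertion could look roughly like this. This is a sketch assuming the usual envtest/Ginkgo suite globals (k8sClient, ctx), dot-imported Gomega, a kataconfigurationv1 package alias, and Go field names matching the new status YAML (KataNodes.NodeCount, KataNodes.ReadyNodeCount); the repo's actual test differs.

// Illustrative envtest-style check against the new status layout.
Eventually(func(g Gomega) {
	kc := &kataconfigurationv1.KataConfig{}
	g.Expect(k8sClient.Get(ctx, types.NamespacedName{Name: "example-kataconfig"}, kc)).To(Succeed())
	// New style: look at the InProgress condition instead of
	// installationStatus/unInstallationStatus.
	g.Expect(meta.IsStatusConditionFalse(kc.Status.Conditions, "InProgress")).To(BeTrue())
	// New style: node counts come from status.kataNodes.
	g.Expect(kc.Status.KataNodes.ReadyNodeCount).To(Equal(kc.Status.KataNodes.NodeCount))
}, "10m", "10s").Should(Succeed())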

}
e := r.putNodeOnStatusList(&node)
if e != nil {
err = e
Contributor


In case of multiple errors, we're keeping only the last one right?
Can we at least log the error here before we potentially lose it?
Or as an alternative: if we meet an error, is it useful to continue the loop?

Contributor Author

@pmores pmores Jul 19, 2023


Yes, a comment at the top of the function makes that point.

As for error handling, yes, my idea was that it's still useful to continue. The way I see it, if there's an error our status will end up imprecise for the moment anyway, but it would be even more imprecise if we broke off at the first error. That said, I'm not sure at the moment how rigorous we should be here; I'm guessing some practical experience will help. IIRC the legacy status update behaved the same way, and I don't remember any practical problems with that.

Errors that can come up here will likely tend to be transient.

Contributor


Sorry, I missed the comment at the top.
Thanks for the answer. Makes sense
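The pattern under discussion, roughly: iterate over all nodes, log and remember the error, and keep going so the status ends up as complete as possible. A sketch only; apart from putNodeOnStatusList() itself, the receiver type, wrapper name and Log field are assumptions, not the PR's actual code (corev1 is k8s.io/api/core/v1).

// Illustrative shape of the loop around the putNodeOnStatusList() call.
func (r *KataConfigOpenShiftReconciler) putNodesOnStatusLists(nodes []corev1.Node) error {
	var err error
	for i := range nodes {
		if e := r.putNodeOnStatusList(&nodes[i]); e != nil {
			// Errors here tend to be transient; log them so they aren't
			// lost, keep only the last one and continue the loop.
			r.Log.Info("putNodeOnStatusList failed, continuing", "node", nodes[i].Name, "error", e)
			err = e
		}
	}
	return err
}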

@@ -85,9 +70,9 @@ type KataConfigStatus struct {
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:path=kataconfigs,scope=Cluster
// +kubebuilder:printcolumn:name="InProgress",type=string,JSONPath=".status.installationStatus.IsInProgress",description="Status of Kata runtime installation"
Contributor


Why not mark these fields as deprecated and leave them in? We don't know what tools people have written that rely on these fields, i.e. the .status section to me is part of the API.

We could even add logging for the case that any of the fields are non-empty (unlikely that another software has the rights to update it, but possible). After a while we can bump the version and remove the fields. What do you think?

Contributor Author


Yes, that's generally the correct way to handle this.

Admittedly I'd still tend to remove the fields now. I suspect there's little automation in use that relies on the old status. The combined status with both old and new parts is fairly confusing. Migration from the old to the new status is easy and mechanical (migrations so far were apparently easy and problem-free), the migration guide is included in the #330 cover letter, and it's been a couple of weeks since the need to migrate was announced.

If folks feel more comfortable taking the slow path I wouldn't block it though.

Contributor


I think I can change our automation to check both.
-o=jsonpath='{.status.conditions[?(@.type=="InProgress")].status}' is the new one
-o=jsonpath={.status.installationStatus.IsInProgress}{.status.unInstallationStatus.inProgress.status} is the current one

As long as ToLower() != false, I know things are not in progress

Contributor Author


@tbuskey I think that checking the new one should be enough by now.

Contributor

@jensfr jensfr Jul 31, 2023


Sorry for the delay. I think we don't know who's using it and in what way (we were surprised a few times already). We may break users that we are not even aware of. To be honest, I think we should take the slow path. What do others think? @bpradipt @snir911 @cpmeadors?

Contributor


I'm considering these aspects when providing my view on this PR.
We are adding/improving new features (peer-pods, CoCo) which will increase adoption and thereby make this sort of change relatively harder to make later.
Further, this change will only be available in newer versions, and it only affects the install/uninstall of the operator with no impact on the workloads. Also, it'll be documented so that users relying on the earlier status fields can make suitable changes to use the newer fields.

Given the above aspects I'm in favour of getting this merged.

Contributor


@pmores I don't have any comments on the code.
lgtm on the code front.

Contributor Author


On a more general note, although it seems that in this case we could opt to keep the old and new fields running alongside each other, this doesn't seem to be possible in general. Consider the upcoming change of KataConfig.status.runtimeClass to an array to support runtimeclasses other than plain kata. How would we go about it, add the array as runtimeClassNew first and then rename it back to runtimeClass? What about users who'd ignore the runtimeClassNew addition and continue to use the single-value runtimeClass?

I'm afraid there's no avoiding breaking users who don't pay attention, unfortunately. Users who do pay attention have had a month now to adapt to the change.

Also, I think we should be pragmatic. As @bpradipt says, changes will be harder when we have a lot of users. If we already know we need to change something, we should consider doing it as fast as possible, since it will only get harder in the future, I'm afraid.

Member


I tend to agree with @pmores and @bpradipt. Better to get rid of this now while we don't have too many users and migration is easy. The gain in code maintenance outweighs the inconvenience for users, IMHO.

@tbuskey
Contributor

tbuskey commented Aug 2, 2023

QE automation has already been changed to look for .status.conditions and will use it instead if it exists

Member

@gkurz gkurz left a comment


/lgtm

Thanks @pmores !

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 10, 2023
@tbuskey
Contributor

tbuskey commented Aug 16, 2023

I was able to unsubscribe & resubscribe while running a kata workload.
The output of oc get kataconfig example-kataconfig -o yaml transitioned to:

apiVersion: kataconfiguration.openshift.io/v1
kind: KataConfig
metadata:
  creationTimestamp: "2023-08-16T13:50:04Z"
  finalizers:
  - kataconfiguration.openshift.io/finalizer
  generation: 2
  name: example-kataconfig
  resourceVersion: "146822"
  uid: c22a6445-8c02-47e7-80b6-eae10d2b93a3
spec:
  checkNodeEligibility: false
  enablePeerPods: false
  kataConfigPoolSelector: null
  logLevel: info
status:
  conditions:
  - lastTransitionTime: "2023-08-16T18:21:44Z"
    message: ""
    reason: ""
    status: "False"
    type: InProgress
  kataNodes:
    nodeCount: 0
    readyNodeCount: 0
  runtimeClass: kata
  waitingForMcoToStart: false

The kataNodes counts should be 3, given these worker nodes:

tbuskey-230816-1-bjff5-worker-eastus1-2v54t   Ready    kata-oc,worker   5h35m   v1.27.3+4aaeaec
tbuskey-230816-1-bjff5-worker-eastus2-hnkcr   Ready    kata-oc,worker   5h36m   v1.27.3+4aaeaec
tbuskey-230816-1-bjff5-worker-eastus3-mc77d   Ready    kata-oc,worker   5h35m   v1.27.3+4aaeaec

Uninstalling the kataconfig did not update the node counts, but the conditions: and the kataNodes: uninstalling and waitingToInstall lists did get updated.
I will test peer-pods & comment if it is different

@tbuskey
Contributor

tbuskey commented Aug 16, 2023

When running with enablePeerPods: true, I was also able to unsub/sub while a peer-pods workload was running.
The kataconfig yaml transitioned to this:

apiVersion: kataconfiguration.openshift.io/v1
kind: KataConfig
metadata:
  creationTimestamp: "2023-08-16T18:57:12Z"
  finalizers:
  - kataconfiguration.openshift.io/finalizer
  generation: 2
  name: example-kataconfig
  resourceVersion: "170584"
  uid: 8d3ee88b-ff2f-48ff-befb-16487f2a2acf
spec:
  checkNodeEligibility: false
  enablePeerPods: true
  kataConfigPoolSelector: null
  logLevel: info
status:
  runtimeClass: kata
  waitingForMcoToStart: false

Deleting the kataconfig puts the deletionTimestamp in the metadata but the deletion doesn't proceed, even after 15 minutes.

@pmores pmores merged commit 79c4f52 into openshift:devel Aug 21, 2023