Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

📖 Proposal: Improving status in CAPI resources #10897

Conversation

fabriziopandini
Copy link
Member

@fabriziopandini fabriziopandini commented Jul 17, 2024

What this PR does / why we need it:
This is a proposal about how can we improve status in v1Beta2 Cluster API resources, addressing several feedback in our backlog (see #10852 for a great wrap up) + making an important step towards v1 API

@enxebre, @vincepri, @sbueringer, @chrischdi, @killianmuldoon what I'm looking for at this stage is a feedback about general direction

/area api

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/api Issues or PRs related to the APIs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 17, 2024
Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
i can take some AI / PR changes when / if needed.

@sbueringer sbueringer mentioned this pull request Jul 18, 2024
55 tasks
@enxebre
Copy link
Member

enxebre commented Jul 18, 2024

I did a first pass. Thanks a lot @fabriziopandini this is an excellent write up!

@enxebre
Copy link
Member

enxebre commented Jul 18, 2024

I might have missed but didn't see anything related to meaning of absence of condition? Do we want to state that our core conditions must always be set either true/false/unknown meaning absence indicate a controller operational issue? thoughts?

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First off, @fabriziopandini I gotta say this is FANTASTIC and AMAZING work, and thank you so much for starting this.

Seriously, it shows care for our users, and the proposal is a great read.

docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
@fabriziopandini
Copy link
Member Author

@enxebre

I might have missed but didn't see anything related to meaning of absence of condition? Do we want to state that our core conditions must always be set either true/false/unknown meaning absence indicate a controller operational issue? thoughts?

K8s guidelines are stating that:

Controllers should apply their conditions to a resource the first time they visit the resource, even if the status is Unknown. This allows other components in the system to know that the condition exists and the controller is making progress on reconciling that resource.

Not all controllers will observe the previous advice about reporting "Unknown" or "False" values. For known conditions, the absence of a condition status should be interpreted the same as Unknown, and typically indicates that reconciliation has not yet finished (or that the resource state may not yet be observable).

I personally think that when we will abide to first guideline above, the absense of conditions will provide a signal that a controller is not reconciling for the first time an object, but this won't be a signal to surface controller's operational issues happening after an object is reconciled for the first time (ObservedGeneration is probably a better signal for this).

@fabriziopandini
Copy link
Member Author

@enxebre @JoelSpeed @vincepri, thanks for the first round of comments, that's awesome.

I will give some more time to let other's comments to flow in and keep some discussion thread going async, then we should probably have a discussion (might be in the office hours) to address a few key points

Co-authored-by: Stefan Büringer buringerst@vmware.com
Co-authored-by: Stefan Büringer buringerst@vmware.com
@fabriziopandini fabriziopandini added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 4, 2024
@fabriziopandini
Copy link
Member Author

fabriziopandini commented Sep 4, 2024

@enxebre, @vincepri, @sbueringer, @chrischdi, @killianmuldoon, @neolit123, @JoelSpeed, @peterochodo, @zjs, @nrb + everyone who provided feedback to this proposal:

Thanks!
All your awesome contributions really helped in improving the initial draft!

The doc is ready for a final pass, looking forward to start implementing this!

@fabriziopandini
Copy link
Member Author

fabriziopandini commented Sep 5, 2024

As discussed in Sep 4th office hour meeting, lazy consensus deadline set for Sep 13th

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally +1 to go ahead, don't think I've left any blocking comments but have left a few nits that can be sorted during implementation.

Did have one question about the readiness gates though, wondering why we haven't allowed negative polarity conditions there

Comment on lines +287 to +291
// Conditions represent the observations of a Machine's current state.
// +optional
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits that don't necessarily need to be fixed here.

  • Idiomatically, in Kube API types we always put the Conditions as the first field in the struct, may be nice to follow this convention
  • It's also typical to explain the known condition types, when we implement this, could we add that list? Known condition types are Foo, Bar and Baz. kind of thing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
noted in the tracking issue for the implementation

Comment on lines +370 to +373
- `HealthCheckSucceeded` and `OwnerRemediated` (or `ExternalRemediationRequestAvailable`) conditions are set by the
MachineHealthCheck controller in case a MachineHealthCheck targets the machine.
- KubeadmControlPlane also adds additional conditions to Machines, but those conditions are not included in the table above
for sake of simplicity (however they are documented in the KubeadmControlPlane paragraph).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we thought about what might happen if an MHC adds this condition, but then is modified/removed and no longer targets the Machine? Is there something to GC these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is MHC behaviour not impacted by this proposal, but I think it is not an issue.

  • If MHC is removed after marking a machine as unhealthy, the machine gets remediated
  • If MHC is removed after marking a machine healthy, the machine it considered healthy until another MHC instance take over

Notes:
- Both `MinReadySeconds` and `ReadinessGates` should be treated as other in-place propagated fields (changing them should not trigger rollouts).
- Similarly to Pod's `ReadinessGates`, also Machine's `ReadinessGates` accept only conditions with positive polarity;
The Cluster API project might revisit this in the future to stay aligned with Kubernetes or if there are use cases justifying this change.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A machine is ready when it is not paused, cannot currently be represented then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding support for negative polarity conditions is tracked in #11105 as a follow up - after we got a first implementation up and running

Comment on lines +857 to +861
// A Cluster is available if:
// * Cluster's `RemoteConnectionProbe` and `TopologyReconciled` conditions are true and
// * the control plane `Available` condition is true and
// * all worker resource's `Available` conditions are true and
// * all conditions defined in AvailabilityGates are true as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, formatting like this wont translate across when you generate it into a CRD schema, and will look awkward when you use kubectl explain. Better to use complete sentences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tracked in #11105, so we can make a pass on all the fields (might be relevant also for #10852)

docs/proposals/improve-status-in-CAPI-resources.md Outdated Show resolved Hide resolved
@sbueringer
Copy link
Member

Great work, thx!!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 9, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: c908bd77c697fa157becaaa374e6376ad7ce0cf3

@JoelSpeed
Copy link
Contributor

/lgtm

@chrischdi
Copy link
Member

/lgtm

🎉 Thanks!

@sbueringer
Copy link
Member

/lgtm
/approve

/hold
@fabriziopandini up to you to merge

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 16, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 16, 2024
@fabriziopandini
Copy link
Member Author

Lazy consensus deadline passed
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 16, 2024
@k8s-ci-robot k8s-ci-robot merged commit ff79510 into kubernetes-sigs:main Sep 16, 2024
19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.9 milestone Sep 16, 2024
@vincepri
Copy link
Member

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/api Issues or PRs related to the APIs cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.