Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱Add Machine and KCP conditions to KCP controller #3674

Closed
wants to merge 12 commits into from

Conversation

sedefsavas
Copy link

@sedefsavas sedefsavas commented Sep 22, 2020

What this PR does / why we need it:
This PR adds Pod-related conditions to control-plane machines by KubeadmControlPlane controller.

Also, adds EtcdClusterHealthy condition to KubeadmControlPlane object, which shows etcd cluster health in general.

This PR is WIP, after agreeing on the conditions, tests will be added.

Some details about the conditions are here (some of them have changed): https://docs.google.com/document/d/1pEMf8AuPgUF_ClTM4uCEiDkb-iyUgwcxinwArtpa45c/edit

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3138 with #3670

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 22, 2020
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 22, 2020
@sedefsavas
Copy link
Author

cc @fabriziopandini @vincepri

@vincepri
Copy link
Member

/assign

Going to take a deeper look later today, if you're ready for a full review @sedefsavas

@fabriziopandini
Copy link
Member

As per discussion, implementing a global HealthTrackerInstance introduces a critical component, mostly due to shared memory management.
So, let's try to revisit this PR by adding health conditions to the existing controlPlane struct which is scoped for each reconciliation.

@sedefsavas sedefsavas marked this pull request as ready for review September 25, 2020 05:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2020
@sedefsavas
Copy link
Author

Fixed the comments, PTAL.

controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
@sedefsavas
Copy link
Author

Rebased to release-03. We will forward-port it to the master.

@sedefsavas sedefsavas changed the base branch from master to release-0.3 October 9, 2020 15:43
@sedefsavas
Copy link
Author

/test pull-cluster-api-e2e-release-0-3

1 similar comment
@vincepri
Copy link
Member

vincepri commented Oct 9, 2020

/test pull-cluster-api-e2e-release-0-3

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vincepri after the PR has been reviewed.
You can assign the PR to them by writing /assign @vincepri in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sedefsavas
Copy link
Author

This PR is ready PTAL.
Latest changes were changing the way to check quorum and adding health check in reconcileDelete().
cc @fabriziopandini

Comment on lines +160 to +191
// Check etcd cluster alarms
etcdClient, err := w.etcdClientGenerator.forNodes(ctx, controlPlaneNodes.Items)
if err != nil {
conditions.MarkFalse(controlPlane.KCP, controlplanev1.EtcdClusterHealthy, controlplanev1.EtcdClusterUnhealthyReason, clusterv1.ConditionSeverityWarning, "failed to get etcd client.")
return response, err
}

defer etcdClient.Close()
alarmList, err := etcdClient.Alarms(ctx)
if len(alarmList) > 0 || err != nil {
conditions.MarkFalse(controlPlane.KCP, controlplanev1.EtcdClusterHealthy, controlplanev1.EtcdClusterUnhealthyReason, clusterv1.ConditionSeverityWarning, "etcd cluster has alarms.")
return response, errors.Errorf("etcd cluster has %d alarms", len(alarmList))
}

members, err := etcdClient.Members(ctx)
if err != nil {
conditions.MarkFalse(controlPlane.KCP, controlplanev1.EtcdClusterHealthy, controlplanev1.EtcdClusterUnhealthyReason, clusterv1.ConditionSeverityWarning, "failed to get etcd members.")
return response, err
}

healthyMembers := 0
for _, m := range members {
if val, ok := response[m.Name]; ok {
if val == nil {
healthyMembers++
}
} else {
// There are members in etcd cluster that is not part of controlplane nodes.
conditions.MarkFalse(controlPlane.KCP, controlplanev1.EtcdClusterHealthy, controlplanev1.EtcdClusterUnhealthyReason, clusterv1.ConditionSeverityWarning, "unknown etcd member that is not part of control plane nodes.")
return response, err
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that should go in a different PR?

@@ -372,13 +377,42 @@ func (r *KubeadmControlPlaneReconciler) reconcile(ctx context.Context, cluster *
return ctrl.Result{}, nil
}

func (r *KubeadmControlPlaneReconciler) createControlPlane(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane) (*internal.ControlPlane, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is a little misleading, createControlPlane made me think that we were creating (read initializing) a new control plane, rather than the struct.

If the internal control plane struct requires more information, can we enrich internal.NewControlPlane to do it for us instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting the owned machines and then create ControlPlane, added because we will do the same thing in both reconcile and reconcileDelete. Although ControlPlane struct is enough as it is, ownedMachines are also calculated here. I think this is a good abstraction, but maybe bad naming?

@vincepri
Copy link
Member

@sedefsavas Let's try to find a way to split this PR in multiple ones, it seems there might be lots of unrelated or semi-related changes that could be going separately

@sedefsavas
Copy link
Author

@vincepri There were some additions to the KCP logic to surface the new conditions, I will create a PR that adds only those changes.
But I'd like to wait for @fabriziopandini to comment on the last 2 changes (how etcd quorum check is changed and adding conditions to reconcileDelete) so that we don't lose those 2 changes as the rest is tested by him.

@sedefsavas
Copy link
Author

/test pull-cluster-api-test-release-0-3

@k8s-ci-robot
Copy link
Contributor

@sedefsavas: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-e2e-full 4169caf link /test pull-cluster-api-e2e-full
pull-cluster-api-e2e-main f984dba link /test pull-cluster-api-e2e-main
pull-cluster-api-test-release-0-3 c512676 link /test pull-cluster-api-test-release-0-3

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sedefsavas
Copy link
Author

Divided this PR into 3 PRs.

  1. KCP health refactoring: 🌱 Add etcd endpoint health check to KCP #3810
  2. etcd quorum checkL 🌱Refactor controlplane health check in KCP #3806

Will open the Conditions PR after these are merged.
Lets follow up on open PRs as comments in this PR are already addressed.

cc @fabriziopandini

@sedefsavas sedefsavas closed this Oct 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Machines conditions should provide more visibility to the underlying Node
4 participants