
Add cluster failure threshold #1829

Merged: 1 commit merged into karmada-io:master on May 24, 2022
Conversation

@dddddai (Member) commented May 18, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it:
Currently, the cluster status is set to NotReady immediately if a health check fails.
To make this more robust, this PR adds a cluster failure threshold (defaulting to 30s), which is the duration that checks must keep failing before the cluster is considered unhealthy.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Should the flag cluster-failure-threshold be a number of failures (defaulting to 3) or a duration (defaulting to 30s)? Which is better?

Does this PR introduce a user-facing change?:

`karmada-controller-manager/karmada-agent`: Introduced `--cluster-failure-threshold` flag to specify cluster failure threshold.
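
For illustration, a minimal sketch of how such a flag might be registered (the flag name and the 30s default come from this PR; the Options type, field name, and usage text are assumptions, not the actual karmada code):

package options

import (
	"time"

	"github.com/spf13/pflag"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Options is a hypothetical subset of the controller options for this sketch.
type Options struct {
	// ClusterFailureThreshold is how long health checks must keep failing
	// before the cluster's Ready condition is set to False.
	ClusterFailureThreshold metav1.Duration
}

// AddFlags registers the cluster-failure-threshold flag on the given flag set.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	fs.DurationVar(&o.ClusterFailureThreshold.Duration, "cluster-failure-threshold",
		30*time.Second,
		"The duration of failure for the cluster to be considered unhealthy.")
}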

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label May 18, 2022
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 18, 2022
@RainbowMango (Member) commented:

Should the flag cluster-failure-threshold be a number of failures (defaulting to 3) or a duration (defaulting to 30s)? Which is better?

What's your opinion?

@dddddai (Member, Author) commented May 19, 2022

Either is fine with me; time is probably more intuitive, but users would need to keep cluster-failure-threshold a multiple of cluster-status-update-frequency.

Kubefed uses the number: https://github.com/kubernetes-sigs/kubefed/blob/v0.9.2/charts/kubefed/charts/controllermanager/templates/kubefedconfig.yaml#L19
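
(For example, assuming cluster-status-update-frequency=10s, a 30s time threshold and a failure count of 3 come to roughly the same thing: about three consecutive failed status probes before the cluster is marked NotReady.)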

@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 19, 2022
@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from 5da23b8 to a291bcb Compare May 19, 2022 10:16
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 20, 2022
@dddddai (Member, Author) commented May 20, 2022

Thank you, I've addressed your comments

@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from 7be1cc0 to 3ffa906 Compare May 20, 2022 03:36
@RainbowMango (Member) commented:

I'd really like something like ClusterData, which caches each cluster's last probe time and status. What do you think?

@RainbowMango (Member) commented:

By the way, we should clean up the data when deleting the cluster.

// The resource may no longer exist, in which case we stop the informer.
if apierrors.IsNotFound(err) {
c.InformerManager.Stop(req.NamespacedName.Name)
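// (per the comment above, the per-cluster condition cache would presumably also be cleaned up here)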
return controllerruntime.Result{}, nil
}

@dddddai (Member, Author) commented May 20, 2022

I'd really like something like ClusterData, which caches each cluster's last probe time and status. What do you think?

For "last prob time", kubefed has its own Condition but we are using metav1.Condition where we don't have "last prob time"

For "status", kubefed stores it because it will restore the old status, we just return and retry in the next health check, I think they take the same effect

So I don't see where we are gonna use it? Please correct me if I'm wrong

@RainbowMango (Member) commented:

What I mean is something like ClusterData: learn from the idea, not the whole struct.
And "last probe time" means the last success or last failure time; the time changes only when the condition changes.
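
For reference, a minimal sketch of what such a per-cluster cache could look like, based only on the field names that appear in the test snippet later in this thread (clusterDataMap, failureThreshold, readyCondition, thresholdStartTime); the comments and the duration type of failureThreshold are assumptions:

package status

import (
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterData caches the readiness state most recently observed for one cluster.
type clusterData struct {
	// readyCondition is the raw probe result, ignoring the failure threshold.
	readyCondition metav1.ConditionStatus

	// thresholdStartTime records when the current streak of this status began;
	// it changes only when readyCondition changes.
	thresholdStartTime time.Time
}

// clusterConditionStore keeps one clusterData entry per cluster.
type clusterConditionStore struct {
	// clusterDataMap maps a cluster name to its *clusterData entry.
	clusterDataMap sync.Map

	// failureThreshold is how long probes must keep failing before the
	// Ready condition reported on the Cluster object flips to False.
	failureThreshold time.Duration
}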

@dddddai (Member, Author) commented May 21, 2022

OK, added

pkg/controllers/status/cluster_status_controller.go (outdated, resolved)
@@ -195,6 +201,30 @@ func (c *ClusterStatusController) syncClusterStatus(cluster *clusterv1alpha1.Clu
return c.updateStatusIfNeeded(cluster, currentClusterStatus)
}

func (c *ClusterStatusController) retryOnFailure(online, healthy bool, cluster string) bool {
Member:

I can't tell exactly what, but something doesn't feel right here.
Let me think about it.

Member (Author):

I've modified it as you suggested.

@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from e08a0a1 to a97bc9a Compare May 23, 2022 13:23
Comment on lines 80 to 142
{
// the first retry
online: false,
healthy: false,
expectedReady: metav1.ConditionTrue,
},
{
// the second retry
online: false,
healthy: false,
expectedReady: metav1.ConditionTrue,
},
{
// the third retry
// cluster is still unhealthy after the threshold, set cluster status to not ready
online: false,
healthy: false,
expectedReady: metav1.ConditionFalse,
},
Member:

It seems these test cases interact with each other?

@dddddai (Member, Author), May 24, 2022:

Yes, they are observed statuses ordered by time, because thresholdReadyCondition depends on the previous statuses.

Member:

In this test we are testing thresholdAdjustedReadyCondition; we should set up the cache, the current condition, and the observed condition for each case. The cases should not depend on each other.

@dddddai (Member, Author), May 24, 2022:

That would be kind of tricky... How do we set up the cache (clusterDataMap and thresholdStartTime)?

Member (Author):

We are testing clusterConditionStore.thresholdAdjustedReadyCondition, where clusterConditionStore is stateful.
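
To make the statefulness concrete, here is a rough sketch of the behaviour the tests exercise, inferred only from the test cases quoted in this thread; it builds on the struct sketch earlier in the thread and assumes the usual time/meta/metav1/clusterv1alpha1 imports. The method body, the get helper, and the ClusterConditionReady constant are assumptions, not the actual karmada implementation:

func (c *clusterConditionStore) thresholdAdjustedReadyCondition(cluster *clusterv1alpha1.Cluster, observed *metav1.Condition) *metav1.Condition {
	saved := c.get(cluster.Name) // hypothetical accessor over clusterDataMap
	if saved == nil || saved.readyCondition != observed.Status {
		// First observation for this cluster, or the observed status changed:
		// remember the new status and (re)start the threshold timer.
		saved = &clusterData{readyCondition: observed.Status, thresholdStartTime: time.Now()}
		c.update(cluster.Name, saved)
	}

	current := meta.FindStatusCondition(cluster.Status.Conditions, clusterv1alpha1.ClusterConditionReady)
	if current == nil || observed.Status == metav1.ConditionTrue {
		// No previously reported condition to hold on to, or the cluster is
		// healthy again: report exactly what was observed.
		return observed
	}

	if time.Since(saved.thresholdStartTime) < c.failureThreshold {
		// Failing, but not yet for failureThreshold: keep the current condition.
		return current
	}
	// The failure has persisted past the threshold: report NotReady.
	return observed
}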

@RainbowMango (Member) commented:

// Note: to run this snippet on its own it would need roughly the following
// imports (the karmada import path is assumed):
//
//	import (
//		"sync"
//		"testing"
//		"time"
//
//		"k8s.io/apimachinery/pkg/api/meta"
//		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//
//		clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
//	)
func TestThresholdAdjustedReadyCondition2(t *testing.T) {
	tests := []struct {
		name              string
		clusterData       *clusterData
		currentCondition  *metav1.Condition
		observedCondition *metav1.Condition
		expectedCondition *metav1.Condition
	}{
		{
			name:              "cluster just joined in ready state",
			clusterData:       nil, // no cache yet
			currentCondition:  nil, // no condition was set on cluster object yet
			observedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
		},
		{
			name:              "cluster just joined in not-ready state",
			clusterData:       nil, // no cache yet
			currentCondition:  nil, // no condition was set on cluster object yet
			observedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
		},
		{
			name: "cluster becomes not ready but still not reach threshold",
			clusterData: &clusterData{
				readyCondition:     metav1.ConditionFalse,
				thresholdStartTime: time.Now(),
			},
			currentCondition:  &metav1.Condition{Status: metav1.ConditionTrue},
			observedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			cache := clusterConditionStore{
				clusterDataMap:   sync.Map{},
				failureThreshold: 10,
			}

			if test.clusterData != nil {
				cache.update("foo", test.clusterData)
			}

			cluster := &clusterv1alpha1.Cluster{}
			cluster.Name = "foo"
			if test.currentCondition != nil {
				meta.SetStatusCondition(&cluster.Status.Conditions, *test.currentCondition)
			}

			thresholdReadyCondition := cache.thresholdAdjustedReadyCondition(cluster, test.observedCondition)

			if test.expectedCondition.Status != thresholdReadyCondition.Status {
				t.Fatalf("expected: %s, but got: %s", test.expectedCondition.Status, thresholdReadyCondition.Status)
			}
		})
	}
}

@RainbowMango (Member) commented:

I just tried adding some cases (not finished).

@dddddai dddddai force-pushed the cluster-status branch 3 times, most recently from b5db3eb to 339f431 Compare May 24, 2022 09:19
Signed-off-by: dddddai <dddwq@foxmail.com>
@RainbowMango (Member) commented:

I tested it on my side; I think we are ready to move forward. Thanks @dddddai

@RainbowMango (Member) left a review:

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 24, 2022
@karmada-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2022
@karmada-bot karmada-bot merged commit 801d187 into karmada-io:master May 24, 2022