
Add cluster failure threshold #1829

Merged: 1 commit merged into karmada-io:master on May 24, 2022
Conversation

@dddddai (Member) commented May 18, 2022

What type of PR is this?
/kind feature

What this PR does / why we need it:
Currently, the cluster status is set to NotReady immediately if a health check fails.
To make this more robust, this PR adds a cluster failure threshold (defaulting to 30s), which is the duration that checks must keep failing before the cluster is considered unhealthy.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Should the flag cluster-failure-threshold be a number of failures (defaulting to 3) or a duration (defaulting to 30s)? Which is better?

Does this PR introduce a user-facing change?:

`karmada-controller-manager/karmada-agent`: Introduced `--cluster-failure-threshold` flag to specify cluster failure threshold.
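
For illustration, a minimal sketch of how such a flag might be registered (the flag name and the 30s default come from this PR; the Options type, field name, and usage text are assumptions, not the actual karmada code):

package options

import (
	"time"

	"github.com/spf13/pflag"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Options is a hypothetical subset of the controller options for this sketch.
type Options struct {
	// ClusterFailureThreshold is how long health checks must keep failing
	// before the cluster's Ready condition is set to False.
	ClusterFailureThreshold metav1.Duration
}

// AddFlags registers the cluster-failure-threshold flag on the given flag set.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	fs.DurationVar(&o.ClusterFailureThreshold.Duration, "cluster-failure-threshold",
		30*time.Second,
		"The duration of failure for the cluster to be considered unhealthy.")
}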

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label May 18, 2022
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 18, 2022
@RainbowMango (Member) commented:

Should the flag cluster-failure-threshold be a number of failures (defaulting to 3) or a duration (defaulting to 30s)? Which is better?

What's your opinion?

@dddddai (Member, Author) commented May 19, 2022

Either is fine with me; time is probably more intuitive, but users would need to keep cluster-failure-threshold a multiple of cluster-status-update-frequency.

Kubefed uses the number: https://github.com/kubernetes-sigs/kubefed/blob/v0.9.2/charts/kubefed/charts/controllermanager/templates/kubefedconfig.yaml#L19
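
(For example, assuming cluster-status-update-frequency=10s, a 30s time threshold and a failure count of 3 come to roughly the same thing: about three consecutive failed status probes before the cluster is marked NotReady.)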

@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 19, 2022
@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from 5da23b8 to a291bcb Compare May 19, 2022 10:16
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 20, 2022
@dddddai (Member, Author) commented May 20, 2022

Thank you, I've addressed your comments

@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from 7be1cc0 to 3ffa906 Compare May 20, 2022 03:36
@RainbowMango (Member) commented:

I'd really like something like ClusterData, which caches each cluster's last probe time and status. What do you think?

@RainbowMango (Member) commented:

By the way, we should clean up the data when deleting the cluster.

// The resource may no longer exist, in which case we stop the informer.
if apierrors.IsNotFound(err) {
c.InformerManager.Stop(req.NamespacedName.Name)
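// (per the comment above, the per-cluster condition cache would presumably also be cleaned up here)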
return controllerruntime.Result{}, nil
}

@dddddai (Member, Author) commented May 20, 2022

I'd really like something like ClusterData, which caches each cluster's last probe time and status. What do you think?

For "last prob time", kubefed has its own Condition but we are using metav1.Condition where we don't have "last prob time"

For "status", kubefed stores it because it will restore the old status, we just return and retry in the next health check, I think they take the same effect

So I don't see where we are gonna use it? Please correct me if I'm wrong

@RainbowMango (Member) commented:

What I mean is something like ClusterData: learn from the idea, not the whole struct.
And "last probe time" means the last success or last failure time; the time changes only when the condition changes.
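
For reference, a minimal sketch of what such a per-cluster cache could look like, based only on the field names that appear in the test snippet later in this thread (clusterDataMap, failureThreshold, readyCondition, thresholdStartTime); the comments and the duration type of failureThreshold are assumptions:

package status

import (
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterData caches the readiness state most recently observed for one cluster.
type clusterData struct {
	// readyCondition is the raw probe result, ignoring the failure threshold.
	readyCondition metav1.ConditionStatus

	// thresholdStartTime records when the current streak of this status began;
	// it changes only when readyCondition changes.
	thresholdStartTime time.Time
}

// clusterConditionStore keeps one clusterData entry per cluster.
type clusterConditionStore struct {
	// clusterDataMap maps a cluster name to its *clusterData entry.
	clusterDataMap sync.Map

	// failureThreshold is how long probes must keep failing before the
	// Ready condition reported on the Cluster object flips to False.
	failureThreshold time.Duration
}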

@dddddai (Member, Author) commented May 21, 2022

OK, added

pkg/controllers/status/cluster_status_controller.go (outdated, resolved)
@@ -195,6 +201,30 @@ func (c *ClusterStatusController) syncClusterStatus(cluster *clusterv1alpha1.Clu
return c.updateStatusIfNeeded(cluster, currentClusterStatus)
}

func (c *ClusterStatusController) retryOnFailure(online, healthy bool, cluster string) bool {
Member:

I can't tell exactly what, but something doesn't feel right here.
Let me think about it.

Member (Author):

I've modified it as you suggested.

@dddddai dddddai force-pushed the cluster-status branch 2 times, most recently from e08a0a1 to a97bc9a Compare May 23, 2022 13:23
Comment on lines 80 to 142
{
// the first retry
online: false,
healthy: false,
expectedReady: metav1.ConditionTrue,
},
{
// the second retry
online: false,
healthy: false,
expectedReady: metav1.ConditionTrue,
},
{
// the third retry
// cluster is still unhealthy after the threshold, set cluster status to not ready
online: false,
healthy: false,
expectedReady: metav1.ConditionFalse,
},
Member:

It seems these test cases interact with each other?

@dddddai (Member, Author), May 24, 2022:

Yes, they are observed statuses ordered by time, because thresholdReadyCondition depends on the previous statuses.

Member:

In this test we are testing thresholdAdjustedReadyCondition; we should set up the cache, the current condition, and the observed condition for each case. The cases should not depend on each other.

@dddddai (Member, Author), May 24, 2022:

That would be kind of tricky... How do we set up the cache (clusterDataMap and thresholdStartTime)?

Member (Author):

We are testing clusterConditionStore.thresholdAdjustedReadyCondition, where clusterConditionStore is stateful.
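
To make the statefulness concrete, here is a rough sketch of the behaviour the tests exercise, inferred only from the test cases quoted in this thread; it builds on the struct sketch earlier in the thread and assumes the usual time/meta/metav1/clusterv1alpha1 imports. The method body, the get helper, and the ClusterConditionReady constant are assumptions, not the actual karmada implementation:

func (c *clusterConditionStore) thresholdAdjustedReadyCondition(cluster *clusterv1alpha1.Cluster, observed *metav1.Condition) *metav1.Condition {
	saved := c.get(cluster.Name) // hypothetical accessor over clusterDataMap
	if saved == nil || saved.readyCondition != observed.Status {
		// First observation for this cluster, or the observed status changed:
		// remember the new status and (re)start the threshold timer.
		saved = &clusterData{readyCondition: observed.Status, thresholdStartTime: time.Now()}
		c.update(cluster.Name, saved)
	}

	current := meta.FindStatusCondition(cluster.Status.Conditions, clusterv1alpha1.ClusterConditionReady)
	if current == nil || observed.Status == metav1.ConditionTrue {
		// No previously reported condition to hold on to, or the cluster is
		// healthy again: report exactly what was observed.
		return observed
	}

	if time.Since(saved.thresholdStartTime) < c.failureThreshold {
		// Failing, but not yet for failureThreshold: keep the current condition.
		return current
	}
	// The failure has persisted past the threshold: report NotReady.
	return observed
}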

@RainbowMango (Member) commented:

// Note: to run this snippet on its own it would need roughly the following
// imports (the karmada import path is assumed):
//
//	import (
//		"sync"
//		"testing"
//		"time"
//
//		"k8s.io/apimachinery/pkg/api/meta"
//		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
//
//		clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
//	)
func TestThresholdAdjustedReadyCondition2(t *testing.T) {
	tests := []struct {
		name              string
		clusterData       *clusterData
		currentCondition  *metav1.Condition
		observedCondition *metav1.Condition
		expectedCondition *metav1.Condition
	}{
		{
			name:              "cluster just joined in ready state",
			clusterData:       nil, // no cache yet
			currentCondition:  nil, // no condition was set on cluster object yet
			observedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
		},
		{
			name:              "cluster just joined in not-ready state",
			clusterData:       nil, // no cache yet
			currentCondition:  nil, // no condition was set on cluster object yet
			observedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
		},
		{
			name: "cluster becomes not ready but still not reach threshold",
			clusterData: &clusterData{
				readyCondition:     metav1.ConditionFalse,
				thresholdStartTime: time.Now(),
			},
			currentCondition:  &metav1.Condition{Status: metav1.ConditionTrue},
			observedCondition: &metav1.Condition{Status: metav1.ConditionFalse},
			expectedCondition: &metav1.Condition{Status: metav1.ConditionTrue},
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			cache := clusterConditionStore{
				clusterDataMap:   sync.Map{},
				failureThreshold: 10,
			}

			if test.clusterData != nil {
				cache.update("foo", test.clusterData)
			}

			cluster := &clusterv1alpha1.Cluster{}
			cluster.Name = "foo"
			if test.currentCondition != nil {
				meta.SetStatusCondition(&cluster.Status.Conditions, *test.currentCondition)
			}

			thresholdReadyCondition := cache.thresholdAdjustedReadyCondition(cluster, test.observedCondition)

			if test.expectedCondition.Status != thresholdReadyCondition.Status {
				t.Fatalf("expected: %s, but got: %s", test.expectedCondition.Status, thresholdReadyCondition.Status)
			}
		})
	}
}

@RainbowMango (Member) commented:

I just tried adding some cases (not finished).

@dddddai dddddai force-pushed the cluster-status branch 3 times, most recently from b5db3eb to 339f431 Compare May 24, 2022 09:19
Signed-off-by: dddddai <dddwq@foxmail.com>
@RainbowMango (Member) commented:

I tested it on my side; I think we are ready to move forward. Thanks @dddddai

@RainbowMango (Member) left a review:

/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 24, 2022
@karmada-bot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 24, 2022
@karmada-bot karmada-bot merged commit 801d187 into karmada-io:master May 24, 2022