
[Failing Test] kubeadm-kinder-super-admin-latest keeps failing for timeout #2995

Closed
pacoxu opened this issue Jan 16, 2024 · 16 comments · Fixed by kubernetes/kubernetes#122811
Labels: kind/failing-test, priority/critical-urgent
Milestone: v1.30

Comments

pacoxu (Member) commented Jan 16, 2024

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest

+ docker exec kinder-super-admin-control-plane-1 kubeadm certs renew super-admin.conf
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
 timeout. The task did not complete in less than 5m0s as expected

It started failing yesterday, after kubernetes/kubernetes#122735 and kubernetes/kubernetes#122529 were merged.

pacoxu (Member, Author) commented Jan 16, 2024

A local test shows that this is caused by kubernetes/kubernetes#122529 @neolit123 @SataQiu

pacoxu added the kind/failing-test label on Jan 16, 2024

neolit123 (Member) commented Jan 16, 2024

> + docker exec kinder-super-admin-control-plane-1 kubeadm certs renew super-admin.conf
> [renew] Reading configuration from the cluster...
> [renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
> timeout. The task did not complete in less than 5m0s as expected
>
> It started failing yesterday, after kubernetes/kubernetes#122735 and kubernetes/kubernetes#122529 were merged.

where is this error happening?
is it a failing test or a flaky test?
EDIT: NVM, saw it here:
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest

neolit123 (Member) commented Jan 16, 2024

https://github.com/kubernetes/kubernetes/pull/122529/files#r1452908313

we should stop using exp backoff in kubeadm.

500ms / 1min is 120 retries
is it still failing, does it need more time?

pacoxu changed the title from "[Failing Test] https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest" to "[Failing Test] kubeadm-kinder-super-admin-latest keeps failing for timeout" on Jan 16, 2024
pacoxu (Member, Author) commented Jan 16, 2024

I tried to fix it in kubernetes/kubernetes#122802. I tested locally and it works.

> 500ms / 1min is 120 retries

Yes.

> is it still failing, does it need more time?

Still.

neolit123 (Member) commented Jan 16, 2024

> I tried to fix it in kubernetes/kubernetes#122802. I tested locally and it works.

i will try locally as well.
instead of exp backoff we can try KubernetesAPICall.Duration = 2m

but it's surprising that locally it fails with a 1m timeout.

neolit123 added this to the v1.30 milestone on Jan 16, 2024
neolit123 added the priority/critical-urgent label on Jan 16, 2024
SataQiu (Member) commented Jan 16, 2024

I have tested the backoff; it takes about ~320ms.

package kubeadm_test

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func TestBackoff(t *testing.T) {
	backoff := wait.Backoff{
		Steps:    4,
		Duration: 10 * time.Millisecond,
		Factor:   5.0,
		Jitter:   0.1,
	}
	start := time.Now()
	// The condition never reports done, so all 4 steps are exhausted and
	// ExponentialBackoff returns ErrWaitTimeout after the last attempt.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		return false, nil
	})
	t.Logf("elapsed time: %v", time.Since(start))
	// A timeout is expected here; anything else is a real failure.
	if err != nil && err != wait.ErrWaitTimeout {
		t.Error(err)
	}
}

pacoxu (Member, Author) commented Jan 16, 2024

> ~320ms

Steps:

  1. immediate
  2. after ~10ms
  3. after ~50ms (10ms × 5)
  4. after ~250ms (50ms × 5)

That is ~310ms of sleep in total; with jitter it matches the measured ~320ms.
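
A minimal sketch (editor's illustration, not thread code) of how that schedule adds up, assuming the Factor 5.0 / Duration 10ms values from the test above and ignoring jitter:

package main

import "fmt"

func main() {
	// Sleep schedule of wait.Backoff{Steps: 4, Duration: 10ms, Factor: 5.0}:
	// the first attempt is immediate, then each sleep is 5x the previous one.
	sleep, total := 10.0, 0.0 // milliseconds
	for attempt := 2; attempt <= 4; attempt++ {
		total += sleep
		fmt.Printf("attempt %d after ~%.0fms (cumulative ~%.0fms)\n", attempt, sleep, total)
		sleep *= 5
	}
	// Cumulative sleep is ~310ms; with up to 10% jitter per sleep this
	// lines up with the ~320ms measured above.
}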

neolit123 (Member) commented
is the error actually due to timeout?
a 500ms interval over 1 minute gives more attempts

pacoxu (Member, Author) commented Jan 16, 2024

> is the error actually due to timeout?
> a 500ms interval over 1 minute gives more attempts

> The task did not complete in less than 5m0s as expected

We have a 5-minute timeout for each step.

pre-init triggers 4 renews plus other steps:

# Make sure that the check-expiration and renew commands do not return errors
${CMD} kubeadm certs renew admin.conf || exit 1
${CMD} kubeadm certs renew super-admin.conf || exit 1
${CMD} kubeadm certs renew apiserver-kubelet-client || exit 1
${CMD} kubeadm certs check-expiration || exit 1
# Delete super-admin.conf and make sure check-expiration and renew do not return errors
${CMD} rm -f "/etc/kubernetes/super-admin.conf" || exit 1
${CMD} kubeadm certs renew super-admin.conf || exit 1

The timeout happened in the fourth renew, by which point the elapsed time was almost 5 minutes.

neolit123 (Member) commented
yes, but these certs renew calls should complete right away. they do not block the execution for 1 minute each.
a "poll function" returns as soon as it gets the job done - e.g. after 500ms, 1000ms, 1500ms...

pacoxu (Member, Author) commented Jan 16, 2024

my local test shows about one minute per renew.
I can reproduce the CI failure.

neolit123 (Member) commented Jan 16, 2024

i understand what the problem is now:

[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

the way some parts of kubeadm are designed is:

  • kubeadm checks if it can pull config from the cluster
  • if not, it uses the default config

in such conditions we should pass a shorter timeout to kubeadm, but not use an exp backoff.
it's due to this PR: kubernetes/kubernetes#122529
i will send an update.
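
A minimal sketch of that design, assuming hypothetical fetchFromCluster/defaultConfig helpers (not kubeadm symbols) and a 15s budget picked for illustration:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Config and fetchFromCluster are hypothetical stand-ins for kubeadm's
// kubeadm-config ConfigMap handling, used only to sketch the flow.
type Config struct{ Source string }

func fetchFromCluster(ctx context.Context) (*Config, error) {
	return nil, errors.New("apiserver unreachable") // simulate failure
}

func defaultConfig() *Config { return &Config{Source: "defaults"} }

func loadConfig() *Config {
	var cfg *Config
	// Bound the "read config from the cluster" step with a short, explicit
	// timeout instead of an exponential backoff.
	err := wait.PollUntilContextTimeout(context.Background(),
		500*time.Millisecond, 15*time.Second, true,
		func(ctx context.Context) (bool, error) {
			c, err := fetchFromCluster(ctx)
			if err != nil {
				return false, nil // not fatal: retry until the timeout
			}
			cfg = c
			return true, nil
		})
	if err != nil {
		// Timed out: mirror kubeadm's "Falling back to default
		// configuration" path instead of blocking the whole command.
		return defaultConfig()
	}
	return cfg
}

func main() {
	fmt.Println(loadConfig().Source) // prints "defaults" after ~15s here
}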

pacoxu (Member, Author) commented Jan 16, 2024

Adding another short timeout would be a better solution.

I roughly reverted that part 😄, which may not be the final solution.

neolit123 (Member) commented Jan 16, 2024

yeah, like i mentioned a few times, we should no longer use exp backoff in kubeadm (just for consistency):
just PollUntilContextTimeout or PollUntilContextCancel
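
A minimal before/after sketch of that consistency change; apiCall is a hypothetical stand-in, and the 2m budget echoes the KubernetesAPICall.Duration idea from earlier in the thread:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// apiCall is a hypothetical stand-in for a kubeadm request to the API server.
func apiCall(ctx context.Context) error { return nil }

// Before: exponential backoff; the total budget is an opaque product of
// Steps, Duration, and Factor.
func renewWithBackoff() error {
	return wait.ExponentialBackoff(
		wait.Backoff{Steps: 4, Duration: 10 * time.Millisecond, Factor: 5.0, Jitter: 0.1},
		func() (bool, error) { return apiCall(context.TODO()) == nil, nil },
	)
}

// After: a fixed interval and an explicit overall timeout, readable at the
// call site.
func renewWithPoll() error {
	return wait.PollUntilContextTimeout(context.Background(),
		500*time.Millisecond, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			return apiCall(ctx) == nil, nil
		},
	)
}

func main() {
	fmt.Println(renewWithBackoff(), renewWithPoll())
}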

neolit123 (Member) commented
@pacoxu please LGTM
kubernetes/kubernetes#122811

your PR is correct, but i used Poll* for consistency instead.
