
[Failing Test] kubeadm-kinder-super-admin-latest keeps failing for timeout #2995

Closed
pacoxu opened this issue Jan 16, 2024 · 16 comments · Fixed by kubernetes/kubernetes#122811
Labels: kind/failing-test, priority/critical-urgent
Milestone: v1.30

Comments

pacoxu (Member) commented Jan 16, 2024

https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest

+ docker exec kinder-super-admin-control-plane-1 kubeadm certs renew super-admin.conf
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
 timeout. The task did not complete in less than 5m0s as expected

It started failing yesterday, after kubernetes/kubernetes#122735 and kubernetes/kubernetes#122529 were merged.

pacoxu (Member, Author) commented Jan 16, 2024

A local test shows that this is caused by kubernetes/kubernetes#122529 @neolit123 @SataQiu

pacoxu added the kind/failing-test label on Jan 16, 2024

neolit123 (Member) commented Jan 16, 2024

> + docker exec kinder-super-admin-control-plane-1 kubeadm certs renew super-admin.conf
> [renew] Reading configuration from the cluster...
> [renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
> timeout. The task did not complete in less than 5m0s as expected
>
> It started failing yesterday, after kubernetes/kubernetes#122735 and kubernetes/kubernetes#122529 were merged.

where is this error happening?
is it a failing test or a flaky test?
EDIT: NVM, saw it here:
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest

neolit123 (Member) commented Jan 16, 2024

https://github.com/kubernetes/kubernetes/pull/122529/files#r1452908313

we should stop using exp backoff in kubeadm.

500ms / 1min is 120 retries
is it still failing, does it need more time?

pacoxu changed the title from "[Failing Test] https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-super-admin-latest" to "[Failing Test] kubeadm-kinder-super-admin-latest keeps failing for timeout" on Jan 16, 2024
pacoxu (Member, Author) commented Jan 16, 2024

I tried to fix it in kubernetes/kubernetes#122802. I tested locally and it works.

> 500ms / 1min is 120 retries

Yes.

> is it still failing, does it need more time?

Still.

neolit123 (Member) commented Jan 16, 2024

> I tried to fix it in kubernetes/kubernetes#122802. I tested locally and it works.

i will try locally as well.
instead of exp backoff we can try KubernetesAPICall.Duration = 2m

but it's surprising that locally it fails with a 1m timeout.

neolit123 added this to the v1.30 milestone on Jan 16, 2024
neolit123 added the priority/critical-urgent label on Jan 16, 2024
SataQiu (Member) commented Jan 16, 2024

I have tested the backoff; it takes about ~320ms.

package kubeadm_test

import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func TestBackoff(t *testing.T) {
	backoff := wait.Backoff{
		Steps:    4,
		Duration: 10 * time.Millisecond,
		Factor:   5.0,
		Jitter:   0.1,
	}
	start := time.Now()
	// The condition never reports done, so all 4 steps are exhausted and
	// ExponentialBackoff returns ErrWaitTimeout after the last attempt.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		return false, nil
	})
	t.Logf("elapsed time: %v", time.Since(start))
	// A timeout is expected here; anything else is a real failure.
	if err != nil && err != wait.ErrWaitTimeout {
		t.Error(err)
	}
}

pacoxu (Member, Author) commented Jan 16, 2024

> ~320ms

Steps:

  1. immediate
  2. after ~10ms
  3. after ~50ms (10ms × 5)
  4. after ~250ms (50ms × 5)

That is ~310ms of sleep in total; with jitter it matches the measured ~320ms.
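
A minimal sketch (editor's illustration, not thread code) of how that schedule adds up, assuming the Factor 5.0 / Duration 10ms values from the test above and ignoring jitter:

package main

import "fmt"

func main() {
	// Sleep schedule of wait.Backoff{Steps: 4, Duration: 10ms, Factor: 5.0}:
	// the first attempt is immediate, then each sleep is 5x the previous one.
	sleep, total := 10.0, 0.0 // milliseconds
	for attempt := 2; attempt <= 4; attempt++ {
		total += sleep
		fmt.Printf("attempt %d after ~%.0fms (cumulative ~%.0fms)\n", attempt, sleep, total)
		sleep *= 5
	}
	// Cumulative sleep is ~310ms; with up to 10% jitter per sleep this
	// lines up with the ~320ms measured above.
}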

neolit123 (Member) commented
is the error actually due to timeout?
a 500ms interval over 1 minute gives more attempts

pacoxu (Member, Author) commented Jan 16, 2024

> is the error actually due to timeout?
> a 500ms interval over 1 minute gives more attempts

> The task did not complete in less than 5m0s as expected

We have a 5-minute timeout for each step.

pre-init triggers 4 renews plus other steps:

# Make sure that the check-expiration and renew commands do not return errors
${CMD} kubeadm certs renew admin.conf || exit 1
${CMD} kubeadm certs renew super-admin.conf || exit 1
${CMD} kubeadm certs renew apiserver-kubelet-client || exit 1
${CMD} kubeadm certs check-expiration || exit 1
# Delete super-admin.conf and make sure check-expiration and renew do not return errors
${CMD} rm -f "/etc/kubernetes/super-admin.conf" || exit 1
${CMD} kubeadm certs renew super-admin.conf || exit 1

The timeout happened in the fourth renew, by which point the elapsed time was almost 5 minutes.

neolit123 (Member) commented
yes, but these certs renew calls should complete right away. they do not block the execution for 1 minute each.
a "poll function" returns as soon as it gets the job done - e.g. after 500ms, 1000ms, 1500ms...

pacoxu (Member, Author) commented Jan 16, 2024

my local test shows about one minute per renew.
I can reproduce the CI failure.

neolit123 (Member) commented Jan 16, 2024

i understand what the problem is now:

[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

the way some parts of kubeadm are designed is:

  • kubeadm checks if it can pull config from the cluster
  • if not, it uses the default config

in such conditions we should pass a shorter timeout to kubeadm, but not use an exp backoff.
it's due to this PR: kubernetes/kubernetes#122529
i will send an update.
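
A minimal sketch of that design, assuming hypothetical fetchFromCluster/defaultConfig helpers (not kubeadm symbols) and a 15s budget picked for illustration:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Config and fetchFromCluster are hypothetical stand-ins for kubeadm's
// kubeadm-config ConfigMap handling, used only to sketch the flow.
type Config struct{ Source string }

func fetchFromCluster(ctx context.Context) (*Config, error) {
	return nil, errors.New("apiserver unreachable") // simulate failure
}

func defaultConfig() *Config { return &Config{Source: "defaults"} }

func loadConfig() *Config {
	var cfg *Config
	// Bound the "read config from the cluster" step with a short, explicit
	// timeout instead of an exponential backoff.
	err := wait.PollUntilContextTimeout(context.Background(),
		500*time.Millisecond, 15*time.Second, true,
		func(ctx context.Context) (bool, error) {
			c, err := fetchFromCluster(ctx)
			if err != nil {
				return false, nil // not fatal: retry until the timeout
			}
			cfg = c
			return true, nil
		})
	if err != nil {
		// Timed out: mirror kubeadm's "Falling back to default
		// configuration" path instead of blocking the whole command.
		return defaultConfig()
	}
	return cfg
}

func main() {
	fmt.Println(loadConfig().Source) // prints "defaults" after ~15s here
}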

pacoxu (Member, Author) commented Jan 16, 2024

Adding another short timeout would be a better solution.

I roughly reverted that part 😄, which may not be the final solution.

neolit123 (Member) commented Jan 16, 2024

yeah, like i mentioned a few times, we should no longer use exp backoff in kubeadm (just for consistency):
just PollUntilContextTimeout or PollUntilContextCancel
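
A minimal before/after sketch of that consistency change; apiCall is a hypothetical stand-in, and the 2m budget echoes the KubernetesAPICall.Duration idea from earlier in the thread:

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// apiCall is a hypothetical stand-in for a kubeadm request to the API server.
func apiCall(ctx context.Context) error { return nil }

// Before: exponential backoff; the total budget is an opaque product of
// Steps, Duration, and Factor.
func renewWithBackoff() error {
	return wait.ExponentialBackoff(
		wait.Backoff{Steps: 4, Duration: 10 * time.Millisecond, Factor: 5.0, Jitter: 0.1},
		func() (bool, error) { return apiCall(context.TODO()) == nil, nil },
	)
}

// After: a fixed interval and an explicit overall timeout, readable at the
// call site.
func renewWithPoll() error {
	return wait.PollUntilContextTimeout(context.Background(),
		500*time.Millisecond, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			return apiCall(ctx) == nil, nil
		},
	)
}

func main() {
	fmt.Println(renewWithBackoff(), renewWithPoll())
}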

neolit123 (Member) commented
@pacoxu please LGTM
kubernetes/kubernetes#122811

your PR is correct, but i used Poll* for consistency instead.
