
Flaking test: setup e2e test environment #3667

Closed
RainbowMango opened this issue Jun 13, 2023 · 9 comments · Fixed by #3682 or #3699
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@RainbowMango
Member

RainbowMango commented Jun 13, 2023

Which jobs are flaking:

e2e test

Which test(s) are flaking:

e2e test(setup e2e test environment)

Reason for failure:

Anything else we need to know:

Waiting for kubeconfig file /home/runner/.kube/members.config and clusters member1 to be ready...
Waiting for running............................................................................................................................................................................................................................................................................................................
Error:  Timeout waiting for condition running
Error: Process completed with exit code 1.
@RainbowMango RainbowMango added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 13, 2023
@chaosi-zju
Member

chaosi-zju commented Jun 13, 2023

Firstly, we record several similar but not identical errors:


  1. https://github.com/chaosi-zju/karmada/actions/runs/5243453494/jobs/9468385141
Waiting for kubeconfig file /home/runner/.kube/members.config and clusters member3 to be ready...
Waiting for running............................................................................................................................................................................................................................................................................................................
Error:  Timeout waiting for condition running
Error: Process completed with exit code 1.

corresponding kind logs:

Deleting cluster "member3" ...
failed to update kubeconfig: failed to lock config file: open /home/runner/.kube/config.lock: file exists
ERROR: failed to delete cluster "member3": failed to lock config file: open /home/runner/.kube/config.lock: file exists


  2. https://github.com/chaosi-zju/karmada/actions/runs/5252848013/jobs/9489654495
Waiting for the host clusters to be ready...
Waiting for kubeconfig file /home/runner/.kube/karmada.config and clusters karmada-host to be ready...

Error:  Timeout waiting for file exist /home/runner/.kube/karmada.config
Error: Process completed with exit code 1.

corresponding kind logs:

• Starting control-plane 🕹️ ...
✗ Starting control-plane 🕹️
Deleted nodes: ["karmada-host-control-plane"]
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged karmada-host-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
...
I0613 08:20:07.763691 250 round_trippers.go:553] GET https://karmada-host-control-plane:6443/healthz?timeout=10s in 10003 milliseconds
I0613 08:20:17.866527 250 round_trippers.go:553] GET https://karmada-host-control-plane:6443/healthz?timeout=10s in 10041 milliseconds
...
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
...
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:234

@chaosi-zju
Member

I may have found the root cause of "Timeout waiting for condition running".

Summary in one sentence: deleting a cluster requires modifying the kubeconfig, and kind takes a file lock while doing so; when several clusters are deleted at the same time, a process that loses the race for the file lock fails, and the cluster deletion fails with it.


My conclusion is based on the kind logs, combined with reading the source code of kind and Karmada.

First, the failure logs from kind are as follows:

Deleting cluster "member3" ...
failed to update kubeconfig: failed to lock config file: open /home/runner/.kube/config.lock: file exists
ERROR: failed to delete cluster "member3": failed to lock config file: open /home/runner/.kube/config.lock: file exists

This gives us a clue that the failure is related to cluster deletion and file locks.

Second, we read the source code of Karmada:

[screenshots: the Karmada script code that invokes kind delete cluster]

From this we further learn that Karmada simply invokes the kind delete cluster command.

Third, we read the source code of kind:

[screenshot: kind's delete-cluster code, which removes the cluster from the kubeconfig under a file lock]

As you can see, when kind delete cluster executes, kind removes the cluster's entry from the KUBECONFIG file, and it uses a file lock to prevent concurrent modifications.

[screenshot: kind's kubeconfig path resolution code]

Where does this KUBECONFIG path come from? The precedence is: --kubeconfig > $KUBECONFIG > ${HOME}/.kube/config.
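
For illustration, the precedence works like this (the paths here are hypothetical, and the .lock suffix follows what we saw in the logs above):

kind delete cluster --name member1 --kubeconfig /tmp/a.config   # locks /tmp/a.config.lock
KUBECONFIG=/tmp/b.config kind delete cluster --name member1     # locks /tmp/b.config.lock
kind delete cluster --name member1                              # locks ${HOME}/.kube/config.lock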

So in this case, since neither --kubeconfig nor $KUBECONFIG is specified, three processes all want to execute kind delete cluster and race for ${HOME}/.kube/config.lock; once one of them fails to acquire the lock, the entire script fails with: ERROR: failed to delete cluster "XXX": failed to lock config file: open /home/runner/.kube/config.lock: file exists
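
A minimal sketch of the race (not our actual script; cluster names are just examples):

# three concurrent deletions, none given --kubeconfig, all contending
# for the same ${HOME}/.kube/config.lock
for c in member1 member2 member3; do
  kind delete cluster --name "$c" &
done
wait   # any process that loses the lock race exits non-zero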


If my inference holds, how do we solve this problem?

Method one

I considered the following:

  • kind delete cluster is very quick, so we can execute it serially.
  • kind create cluster takes relatively long, so we keep executing it concurrently; it has the same file-lock issue, but each write holds the lock only briefly, so the probability of conflict is low.

This cannot completely eliminate the problem, but it greatly reduces the probability of occurrence, and the advantage is that the changes are minimal.
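
A rough sketch of method one (cluster names and node image are only examples):

# delete serially: only one process touches ${HOME}/.kube/config at a time
for c in member1 member2 member3; do
  kind delete cluster --name "$c"
done

# create concurrently: each write holds the lock only briefly, so conflicts are rare
for c in member1 member2 member3; do
  kind create cluster --name "$c" --image kindest/node:v1.26.0 &
done
wait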

Method two

  1. Every cluster uses a different kubeconfig path, and we pass that path via --kubeconfig when running kind delete cluster.
  2. After all the clusters are installed, we export a single environment variable for the user, such as KUBECONFIG={KUBECONFIG1}:{KUBECONFIG2}:{KUBECONFIG3}, so that the many kubeconfig files do not force users to keep modifying environment variables.

This solves the problem, but the changes are relatively major.
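
For example (file paths are hypothetical):

kind create cluster --name member1 --kubeconfig /root/.kube/member1.config
kind delete cluster --name member1 --kubeconfig /root/.kube/member1.config   # locks only member1.config.lock

# after installation, expose all clusters to the user through one variable
export KUBECONFIG=/root/.kube/member1.config:/root/.kube/member2.config:/root/.kube/member3.config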

Thanks!

@RainbowMango
Member Author

Nice finding!!!

I lean toward method two, which solves this issue fundamentally. But I don't think it's a good idea to generate a kubeconfig file for each cluster. Do we have other solutions that don't change the user experience? Can we merge the kubeconfig files after creating the clusters?

@yike21
Member

yike21 commented Jun 14, 2023

I lean toward method two, which solves this issue fundamentally. But I don't think it's a good idea to generate a kubeconfig file for each cluster. Do we have other solutions that don't change the user experience? Can we merge the kubeconfig files after creating the clusters?

It sounds good! Thanks for your excellent work @chaosi-zju!

This tool kubecm seems helpful. Perhaps we could consider using it, but that would introduce a new dependency.

kubecm merge member1.config member2.config member3.config --config /root/.kube/members.config -y

Here is a sample run of kubecm:

  1. These are the config files to be merged: /root/.kube/members_temp.config and /root/.kube/karmada_temp.config.
    [screenshot: contents of the two config files]

  2. We can get the merged file merge.config, specified by --config, by running:
    kubecm merge members_temp.config karmada_temp.config --config /root/.kube/merge.config -y
    [screenshot: kubecm merge output]

  3. The merged config file merge.config is below.
    [screenshot: merged config file]

However, we can see that the context names are modified and are no longer member1 or karmada-apiserver. So even if we decided to use kubecm, there would still be remaining work to do.

$ cat merge.config

[screenshot: merge.config contents with the renamed context]
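
If we did adopt kubecm, the generated context names could presumably be renamed back afterwards with kubectl (the generated name below is a placeholder):

kubectl config rename-context <generated-name> member1 --kubeconfig /root/.kube/merge.config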

@chaosi-zju
Member

chaosi-zju commented Jun 15, 2023

Thanks for your commendable advice! @yike21 @RainbowMango

I have carefully considered your advice, and taking my own concerns into account, I came up with another method!


Method three

  1. When creating clusters, we do this:
kind create cluster --name member1 --kubeconfig="/root/.kube/member-member1.config" --image kindest/node:v1.26.0
kind create cluster --name member2 --kubeconfig="/root/.kube/member-member2.config" --image kindest/node:v1.26.0

As you can see, every cluster uses an individual kubeconfig (considering that the file-lock issue also exists during the creation process).

2. Merge the kubeconfigs with kubectl instead of kubecm:

# export KUBECONFIG=/root/.kube/member-member1.config:/root/.kube/member-member2.config

export KUBECONFIG=$(find /root/.kube -maxdepth 1 -type f | grep member- | tr '\n' ':')

kubectl config view --flatten > /root/.kube/members.config

No additional dependency, no context-name modification issue, and the state after installation stays the same as before.
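
A quick way to verify that the merged file keeps the original context names:

kubectl config get-contexts --kubeconfig /root/.kube/members.config
# expect member1 and member2 to be listed unchanged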

3. When deleting clusters, we do this:

kind delete clusters --kubeconfig /root/.kube/members.config --all

or

kind delete clusters --kubeconfig /root/.kube/members.config member1 member2

It is worth noting that you must specify --kubeconfig; otherwise, during a fresh installation where the clusters do not actually exist yet, you may encounter the following:

One kind process deletes a member cluster while another deletes the host cluster, and they are still likely to race for ${HOME}/.kube/config, which is the default kubeconfig path.

However, with --kubeconfig specified, the member clusters are deleted through their own kubeconfig file (and within a single command), so this is no longer a problem.


Here I just used the member clusters as an example; the host cluster is handled the same way.

There may be other problems here; please point them out, thanks!

@yike21
Member

yike21 commented Jun 15, 2023


Cool! I agree with you, this method is more reasonable. 👍

@RainbowMango
Member Author

/assign @chaosi-zju
in favor of #3682

@chaosi-zju
Member

chaosi-zju commented Jun 26, 2023

To recap the two errors recorded in my first comment above:

  1. Timeout waiting for condition running (kind failed to delete a cluster because it could not lock /home/runner/.kube/config.lock).
  2. Timeout waiting for file exist /home/runner/.kube/karmada.config (kind failed to create the karmada-host cluster because kubeadm init timed out).

There are two types of errors here, and I had only solved error No. 1 before: Timeout waiting for condition running

However, error No. 2 occurred frequently yesterday: Timeout waiting for file exist /home/runner/.kube/karmada.config

You can see the related failed CI: https://github.com/karmada-io/karmada/actions/runs/5367319510/jobs/9737422778


@yike21

Now I have found the reason for it:

We expect the kind version to be v0.17.0, but the ubuntu image in our CI runner now ships with kind@v0.20.0. The error mentioned above almost always occurs on ubuntu-20.04 when using kind@v0.20.0, but it runs normally on ubuntu-22.04.

Our CI logic checks whether a kind command already exists and, if so, skips installing it, so by default we are using the kind@v0.20.0 that ships with the image. Therefore, the solution is either to upgrade to ubuntu-22.04 or to force installation of kind@v0.17.
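
A minimal sketch of the forced installation (assuming the official release download URL; the actual CI step may differ):

KIND_VERSION=v0.17.0
curl -Lo /usr/local/bin/kind "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-amd64"
chmod +x /usr/local/bin/kind
kind version   # should now report v0.17.0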

https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0

[screenshot: kind v0.20.0 release notes]

@yike21
Member

yike21 commented Jun 26, 2023


I got it. Thanks for your excellent work!
