
Flaking test: setup e2e test environment #3667

Closed
RainbowMango opened this issue Jun 13, 2023 · 9 comments · Fixed by #3682 or #3699
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@RainbowMango
Member

RainbowMango commented Jun 13, 2023

Which jobs are flaking:

e2e test

Which test(s) are flaking:

e2e test(setup e2e test environment)

Reason for failure:

Anything else we need to know:

Waiting for kubeconfig file /home/runner/.kube/members.config and clusters member1 to be ready...
Waiting for running............................................................................................................................................................................................................................................................................................................
Error:  Timeout waiting for condition running
Error: Process completed with exit code 1.
@RainbowMango RainbowMango added the kind/flake Categorizes issue or PR as related to a flaky test. label Jun 13, 2023
@chaosi-zju
Member

chaosi-zju commented Jun 13, 2023

Firstly, we record several similar but not identical errors:


  1. https://github.com/chaosi-zju/karmada/actions/runs/5243453494/jobs/9468385141
Waiting for kubeconfig file /home/runner/.kube/members.config and clusters member3 to be ready...
Waiting for running............................................................................................................................................................................................................................................................................................................
Error:  Timeout waiting for condition running
Error: Process completed with exit code 1.

corresponding kind logs:

Deleting cluster "member3" ...
failed to update kubeconfig: failed to lock config file: open /home/runner/.kube/config.lock: file exists
ERROR: failed to delete cluster "member3": failed to lock config file: open /home/runner/.kube/config.lock: file exists


  2. https://github.com/chaosi-zju/karmada/actions/runs/5252848013/jobs/9489654495
Waiting for the host clusters to be ready...
Waiting for kubeconfig file /home/runner/.kube/karmada.config and clusters karmada-host to be ready...

Error:  Timeout waiting for file exist /home/runner/.kube/karmada.config
Error: Process completed with exit code 1.

corresponding kind logs:

• Starting control-plane 🕹️ ...
✗ Starting control-plane 🕹️
Deleted nodes: ["karmada-host-control-plane"]
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged karmada-host-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
...
I0613 08:20:07.763691 250 round_trippers.go:553] GET https://karmada-host-control-plane:6443/healthz?timeout=10s in 10003 milliseconds
I0613 08:20:17.866527 250 round_trippers.go:553] GET https://karmada-host-control-plane:6443/healthz?timeout=10s in 10041 milliseconds
...
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
...
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:234

@chaosi-zju
Member

I may have found the root cause of "Timeout waiting for condition running".

Summary in one sentence: deleting a cluster requires modifying the kubeconfig, and kind takes a file lock while doing so; when several clusters are deleted at the same time, a process that loses the race for the file lock fails, and the cluster deletion fails with it.


My conclusion is based on the kind logs, combined with reading the source code of kind and Karmada.

First, the failure logs from kind are as follows:

Deleting cluster "member3" ...
failed to update kubeconfig: failed to lock config file: open /home/runner/.kube/config.lock: file exists
ERROR: failed to delete cluster "member3": failed to lock config file: open /home/runner/.kube/config.lock: file exists

This gives us a clue that the failure is related to cluster deletion and file locks.

Second, we read the source code of Karmada:

[screenshots: the Karmada script code that invokes kind delete cluster]

From this we further learn that Karmada simply invokes the kind delete cluster command.

Third, we read the source code of kind:

[screenshot: kind's delete-cluster code, which removes the cluster from the kubeconfig under a file lock]

As you can see, when kind delete cluster executes, kind removes the cluster's entry from the KUBECONFIG file, and it uses a file lock to prevent concurrent modifications.

[screenshot: kind's kubeconfig path resolution code]

Where does this KUBECONFIG path come from? The precedence is: --kubeconfig > $KUBECONFIG > ${HOME}/.kube/config.
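
For illustration, the precedence works like this (the paths here are hypothetical, and the .lock suffix follows what we saw in the logs above):

kind delete cluster --name member1 --kubeconfig /tmp/a.config   # locks /tmp/a.config.lock
KUBECONFIG=/tmp/b.config kind delete cluster --name member1     # locks /tmp/b.config.lock
kind delete cluster --name member1                              # locks ${HOME}/.kube/config.lock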

So in this case, since neither --kubeconfig nor $KUBECONFIG is specified, three processes all want to execute kind delete cluster and race for ${HOME}/.kube/config.lock; once one of them fails to acquire the lock, the entire script fails with: ERROR: failed to delete cluster "XXX": failed to lock config file: open /home/runner/.kube/config.lock: file exists
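
A minimal sketch of the race (not our actual script; cluster names are just examples):

# three concurrent deletions, none given --kubeconfig, all contending
# for the same ${HOME}/.kube/config.lock
for c in member1 member2 member3; do
  kind delete cluster --name "$c" &
done
wait   # any process that loses the lock race exits non-zero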


If my inference holds, how do we solve this problem?

Method one

I considered the following:

  • kind delete cluster is very quick, so we can execute it serially.
  • kind create cluster takes relatively long, so we keep executing it concurrently; it has the same file-lock issue, but each write holds the lock only briefly, so the probability of conflict is low.

This cannot completely eliminate the problem, but it greatly reduces the probability of occurrence, and the advantage is that the changes are minimal.
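
A rough sketch of method one (cluster names and node image are only examples):

# delete serially: only one process touches ${HOME}/.kube/config at a time
for c in member1 member2 member3; do
  kind delete cluster --name "$c"
done

# create concurrently: each write holds the lock only briefly, so conflicts are rare
for c in member1 member2 member3; do
  kind create cluster --name "$c" --image kindest/node:v1.26.0 &
done
wait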

Method two

  1. Every cluster uses a different kubeconfig path, and we pass that path via --kubeconfig when running kind delete cluster.
  2. After all the clusters are installed, we export a single environment variable for the user, such as KUBECONFIG={KUBECONFIG1}:{KUBECONFIG2}:{KUBECONFIG3}, so that the many kubeconfig files do not force users to keep modifying environment variables.

This solves the problem, but the changes are relatively major.
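
For example (file paths are hypothetical):

kind create cluster --name member1 --kubeconfig /root/.kube/member1.config
kind delete cluster --name member1 --kubeconfig /root/.kube/member1.config   # locks only member1.config.lock

# after installation, expose all clusters to the user through one variable
export KUBECONFIG=/root/.kube/member1.config:/root/.kube/member2.config:/root/.kube/member3.config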

Thanks!

@RainbowMango
Member Author

Nice finding!!!

I lean toward method two, which solves this issue fundamentally. But I don't think it's a good idea to generate a kubeconfig file for each cluster. Do we have other solutions that don't change the user experience? Can we merge the kubeconfig files after creating the clusters?

@yike21
Member

yike21 commented Jun 14, 2023

I lean toward method two, which solves this issue fundamentally. But I don't think it's a good idea to generate a kubeconfig file for each cluster. Do we have other solutions that don't change the user experience? Can we merge the kubeconfig files after creating the clusters?

It sounds good! Thanks for your excellent work @chaosi-zju!

This tool kubecm seems helpful. Perhaps we could consider using it, but that would introduce a new dependency.

kubecm merge member1.config member2.config member3.config --config /root/.kube/members.config -y

Here is a sample run of kubecm:

  1. These are the config files to be merged: /root/.kube/members_temp.config and /root/.kube/karmada_temp.config.
    [screenshot: contents of the two config files]

  2. We can get the merged file merge.config, specified by --config, by running:
    kubecm merge members_temp.config karmada_temp.config --config /root/.kube/merge.config -y
    [screenshot: kubecm merge output]

  3. The merged config file merge.config is below.
    [screenshot: merged config file]

However, we can see that the context names are modified and are no longer member1 or karmada-apiserver. So even if we decided to use kubecm, there would still be remaining work to do.

$ cat merge.config

[screenshot: merge.config contents with the renamed context]
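
If we did adopt kubecm, the generated context names could presumably be renamed back afterwards with kubectl (the generated name below is a placeholder):

kubectl config rename-context <generated-name> member1 --kubeconfig /root/.kube/merge.config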

@chaosi-zju
Member

chaosi-zju commented Jun 15, 2023

Thanks for your commendable advice! @yike21 @RainbowMango

I have carefully considered your advice, and taking my own concerns into account, I came up with another method!


Method three

  1. When creating clusters, we do this:
kind create cluster --name member1 --kubeconfig="/root/.kube/member-member1.config" --image kindest/node:v1.26.0
kind create cluster --name member2 --kubeconfig="/root/.kube/member-member2.config" --image kindest/node:v1.26.0

As you can see, every cluster uses an individual kubeconfig (considering that the file-lock issue also exists during the creation process).

2. Merge the kubeconfigs with kubectl instead of kubecm:

# export KUBECONFIG=/root/.kube/member-member1.config:/root/.kube/member-member2.config

export KUBECONFIG=$(find /root/.kube -maxdepth 1 -type f | grep member- | tr '\n' ':')

kubectl config view --flatten > /root/.kube/members.config

No additional dependency, no context-name modification issue, and the state after installation stays the same as before.
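
A quick way to verify that the merged file keeps the original context names:

kubectl config get-contexts --kubeconfig /root/.kube/members.config
# expect member1 and member2 to be listed unchanged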

3. When deleting clusters, we do this:

kind delete clusters --kubeconfig /root/.kube/members.config --all

or

kind delete clusters --kubeconfig /root/.kube/members.config member1 member2

It is worth noting that you must specify --kubeconfig; otherwise, during a fresh installation where the clusters do not actually exist yet, you may encounter the following:

One kind process deletes a member cluster while another deletes the host cluster, and they are still likely to race for ${HOME}/.kube/config, which is the default kubeconfig path.

However, with --kubeconfig specified, the member clusters are deleted through their own kubeconfig file (and within a single command), so this is no longer a problem.


Here I just used the member clusters as an example; the host cluster is handled the same way.

There may be other problems here; please point them out, thanks!

@yike21
Member

yike21 commented Jun 15, 2023


Cool! I agree with you, this method is more reasonable. 👍

@RainbowMango
Member Author

/assign @chaosi-zju
in favor of #3682

@chaosi-zju
Member

chaosi-zju commented Jun 26, 2023

To recap the two errors recorded in my first comment above:

  1. Timeout waiting for condition running (kind failed to delete a cluster because it could not lock /home/runner/.kube/config.lock).
  2. Timeout waiting for file exist /home/runner/.kube/karmada.config (kind failed to create the karmada-host cluster because kubeadm init timed out).

There are two types of errors here, and I had only solved error No. 1 before: Timeout waiting for condition running

However, error No. 2 occurred frequently yesterday: Timeout waiting for file exist /home/runner/.kube/karmada.config

You can see the related failed CI: https://github.com/karmada-io/karmada/actions/runs/5367319510/jobs/9737422778


@yike21

Now I have found the reason for it:

We expect the kind version to be v0.17.0, but the ubuntu image in our CI runner now ships with kind@v0.20.0. The error mentioned above almost always occurs on ubuntu-20.04 when using kind@v0.20.0, but it runs normally on ubuntu-22.04.

Our CI logic checks whether a kind command already exists and, if so, skips installing it, so by default we are using the kind@v0.20.0 that ships with the image. Therefore, the solution is either to upgrade to ubuntu-22.04 or to force installation of kind@v0.17.
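
A minimal sketch of the forced installation (assuming the official release download URL; the actual CI step may differ):

KIND_VERSION=v0.17.0
curl -Lo /usr/local/bin/kind "https://kind.sigs.k8s.io/dl/${KIND_VERSION}/kind-linux-amd64"
chmod +x /usr/local/bin/kind
kind version   # should now report v0.17.0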

https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0

[screenshot: kind v0.20.0 release notes]

@yike21
Member

yike21 commented Jun 26, 2023


I got it. Thanks for your excellent work!
