[Flaking Test] chart-lint-test not stable recently #4917
cc @calvin0327 @chaosi-zju for help

/assign

Actually, I have been seeing this error intermittently for a long time (ಥ_ಥ).

@calvin0327 do you have any thoughts on the cause?
One more failed case: https://github.com/karmada-io/karmada/actions/runs/9153421301/job/25162342796
Another case: https://github.com/karmada-io/karmada/actions/runs/9168776891/job/25208102953

```
I0521 04:23:33.859739 1 leaderelection.go:250] attempting to acquire leader lease karmada-system/karmada-scheduler...
E0521 04:23:33.862701 1 leaderelection.go:332] error retrieving resource lock karmada-system/karmada-scheduler: Get "https://karmada-k1ah2k0r9l-apiserver.karmada-k1ah2k0r9l.svc.cluster.local:5443/apis/coordination.k8s.io/v1/namespaces/karmada-system/leases/karmada-scheduler": dial tcp 10.96.241.60:5443: connect: connection refused
E0521 04:23:37.112799 1 leaderelection.go:332] error retrieving resource lock karmada-system/karmada-scheduler: Get "https://karmada-k1ah2k0r9l-apiserver.karmada-k1ah2k0r9l.svc.cluster.local:5443/apis/coordination.k8s.io/v1/namespaces/karmada-system/leases/karmada-scheduler": dial tcp 10.96.241.60:5443: connect: connection refused
```
Problem locating is in progress; here are some clues. The direct reason for the helm failure is:

```
$ kubectl get po -A
NAMESPACE NAME READY STATUS RESTARTS AGE
karmada-nkuq2v3017 etcd-0 1/1 Running 0 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-aggregated-apiserver-769fff4f58-9prbb 1/1 Running 5 (3m4s ago) 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-apiserver-76b5b8894-6g4vw 1/1 Running 4 (3m48s ago) 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-controller-manager-6d775ffc74-tpk4m 0/1 CrashLoopBackOff 4 (75s ago) 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-kube-controller-manager-5877d89f57-mtqk5 1/1 Running 5 (3m41s ago) 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-scheduler-df578498f-226dx 1/1 Running 0 6m23s
karmada-nkuq2v3017 karmada-nkuq2v3017-webhook-5f6fc69445-rfhzf 1/1 Running 0 6m23s
karmada-system etcd-0 1/1 Running 0 116m
karmada-system karmada-aggregated-apiserver-6bf466fdc4-fv86h 1/1 Running 2 (116m ago) 116m
karmada-system karmada-apiserver-756b559f84-qf2td 1/1 Running 0 116m
karmada-system karmada-controller-manager-7b9f6f5f5-v5bwp 1/1 Running 3 (116m ago) 116m
karmada-system karmada-kube-controller-manager-7b6d45cbdf-5kk8d 1/1 Running 2 (116m ago) 116m
karmada-system karmada-scheduler-64db5cf5d6-bgd85 1/1 Running 0 116m
karmada-system karmada-webhook-7b6fc7f575-chqjk 1/1 Running 0 116m
```

Error logs of `kubectl logs -f karmada-nkuq2v3017-controller-manager-6d775ffc74-tpk4m -n karmada-nkuq2v3017`:

```
I0521 12:47:38.740245 1 feature_gate.go:249] feature gates: &{map[PropagateDeps:false]}
I0521 12:47:38.740443 1 controllermanager.go:139] karmada-controller-manager version: version.Info{GitVersion:"v1.10.0-preview4-130-g53af52e4a", GitCommit:"53af52e4a853ac04efb6c189583b5e63dff3c771", GitTreeState:"clean", BuildDate:"2024-05-21T11:48:26Z", GoVersion:"go1.21.10", Compiler:"gc", Platform:"linux/amd64"}
I0521 12:47:38.781620 1 reflector.go:351] Caches populated for *v1.Service from k8s.io/client-go/informers/factory.go:159
I0521 12:47:38.878755 1 context.go:160] Starting "endpointSlice"
I0521 12:47:38.878799 1 context.go:170] Started "endpointSlice"
I0521 12:47:38.878805 1 context.go:160] Starting "unifiedAuth"
I0521 12:47:38.878826 1 context.go:170] Started "unifiedAuth"
I0521 12:47:38.878830 1 context.go:160] Starting "endpointsliceCollect"
W0521 12:47:38.878840 1 context.go:167] Skipping "endpointsliceCollect"
I0521 12:47:38.878853 1 context.go:160] Starting "cronFederatedHorizontalPodAutoscaler"
I0521 12:47:38.878866 1 context.go:170] Started "cronFederatedHorizontalPodAutoscaler"
W0521 12:47:38.878874 1 context.go:157] "deploymentReplicasSyncer" is disabled
I0521 12:47:38.878878 1 context.go:160] Starting "remedy"
I0521 12:47:38.878893 1 context.go:170] Started "remedy"
I0521 12:47:38.878904 1 context.go:160] Starting "execution"
I0521 12:47:38.878923 1 context.go:170] Started "execution"
I0521 12:47:38.878928 1 context.go:160] Starting "workStatus"
I0521 12:47:38.879028 1 context.go:170] Started "workStatus"
I0521 12:47:38.879038 1 context.go:160] Starting "serviceImport"
I0521 12:47:38.879051 1 context.go:170] Started "serviceImport"
I0521 12:47:38.879056 1 context.go:160] Starting "gracefulEviction"
I0521 12:47:38.879078 1 context.go:170] Started "gracefulEviction"
I0521 12:47:38.879082 1 context.go:160] Starting "federatedHorizontalPodAutoscaler"
I0521 12:47:38.879151 1 context.go:170] Started "federatedHorizontalPodAutoscaler"
I0521 12:47:38.879171 1 context.go:160] Starting "workloadRebalancer"
I0521 12:47:38.879196 1 context.go:170] Started "workloadRebalancer"
I0521 12:47:38.879206 1 context.go:160] Starting "endpointsliceDispatch"
W0521 12:47:38.879213 1 context.go:167] Skipping "endpointsliceDispatch"
I0521 12:47:38.879219 1 context.go:160] Starting "namespace"
I0521 12:47:38.879239 1 context.go:170] Started "namespace"
I0521 12:47:38.879253 1 context.go:160] Starting "serviceExport"
I0521 12:47:38.879326 1 context.go:170] Started "serviceExport"
I0521 12:47:38.879337 1 context.go:160] Starting "federatedResourceQuotaSync"
I0521 12:47:38.879362 1 context.go:170] Started "federatedResourceQuotaSync"
I0521 12:47:38.879376 1 context.go:160] Starting "applicationFailover"
I0521 12:47:38.879397 1 context.go:170] Started "applicationFailover"
I0521 12:47:38.879405 1 context.go:160] Starting "multiclusterservice"
W0521 12:47:38.879412 1 context.go:167] Skipping "multiclusterservice"
W0521 12:47:38.879426 1 context.go:157] "hpaScaleTargetMarker" is disabled
I0521 12:47:38.879436 1 context.go:160] Starting "cluster"
E0521 12:47:38.897338 1 context.go:163] Error starting "cluster"
F0521 12:47:38.897365 1 controllermanager.go:821] error starting controllers: [no matches for kind "ResourceBinding" in version "work.karmada.io/v1alpha2", no matches for kind "ClusterResourceBinding" in version "work.karmada.io/v1alpha2"]
```

Pay attention to the fatal error on the last line.
As for error logs like:

```
E0521 14:04:55.290515 1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:04:58.278909 1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
E0521 14:05:02.319658 1 leaderelection.go:336] error initially creating leader election record: namespaces "karmada-system" not found
```

the cause is explained in the following comments.
@RainbowMango @XiShanYongYe-Chang I changed the CI step; maybe our install logic in the helm job is like …
The namespace in the scheduler logs is the namespace in the Karmada control plane, not the Kubernetes control plane. The namespace …
Hi @calvin0327, in the past several days I've made some new discoveries that I'd like to discuss with you. I found three problems; let's go through them one by one.

**Problem 1**

As for the error: the root cause is that the CRDs have not yet been installed when the controller-manager starts, and the absence of the CRDs makes the controller-manager crash. However, once the controller-manager crashes, the post-install job will not execute, which means the CRDs will never be installed. So it is a deadlock. This is also why our previous CI, even when it ran successfully, took nearly 15 minutes and was on the verge of timing out: a controller-manager crash is inevitable, but in many cases a short running window before the crash lets the post-install job run, which breaks the deadlock.

Our installation needs to be ordered, e.g. etcd, karmada-apiserver, and the CRDs first, then the remaining components. However, I don't know what the good practice is for implementing such sequencing in Helm; all I know of are pre-install hooks and splitting into sub-charts. Do you have more information?

I tried using pre-install hooks to achieve this install sequence, i.e. moving etcd / karmada-apiserver / CRDs to the pre-install stage. With that, this error is gone, the installation completes quickly, and there were no abnormal restarts of any component. However, this approach may be a little hacky.

**Problem 2**

As for …:
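For reference, marking a resource as a pre-install hook in Helm looks roughly like this (a minimal sketch; the `helm.sh/hook` annotations are standard Helm, but the job name, image, and command here are illustrative, not the actual chart contents):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: install-crds   # hypothetical name, not the real chart's job
  annotations:
    # run this resource before the rest of the release is installed
    "helm.sh/hook": pre-install
    # lower weights run first, so ordering among hooks is possible
    "helm.sh/hook-weight": "1"
    # delete the previous hook instance before creating a new one
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: install-crds
          image: bitnami/kubectl   # illustrative image
          command: ["kubectl", "apply", "-f", "/crds"]
```

Helm blocks on pre-install hooks until the Job completes, which is what gives the etcd/apiserver/CRD stage its ordering guarantee before the remaining components are applied.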
It is because of a value we defined in `charts/karmada/values.yaml` (line 44 at commit 2220124), which is used in the CRs of components such as `charts/karmada/templates/karmada-scheduler.yaml` (lines 44 to 53 at commit 2220124).

This doesn't make sense to me, since we have … I think if we install karmada at …

**Problem 3**

As for the error log …: I don't know the root cause yet, but it should also have something to do with the installation sequence. Since I fixed the installation sequence, this error no longer appears.
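The pattern under discussion presumably looks something like the following sketch (the key name `systemNamespace` and the `--leader-elect-resource-namespace` flag are my assumptions about the chart's shape at that commit, not verbatim contents):

```yaml
# values.yaml (sketch; the key name is an assumption)
systemNamespace: karmada-system
---
# templates/karmada-scheduler.yaml (sketch)
# The leader-election lease namespace comes from values.yaml, but the lease
# is created against the *karmada* apiserver, so this namespace must exist
# in the Karmada control plane, not (only) in the host cluster.
command:
  - /bin/karmada-scheduler
  - --kubeconfig=/etc/kubeconfig
  - --leader-elect-resource-namespace={{ .Values.systemNamespace }}
```

If the chart hard-wires this namespace while the release itself is installed elsewhere, the scheduler will keep failing to create its lease until the namespace exists in the Karmada control plane.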
@chaosi-zju You are quite right.

For problem 1: yes, we should ensure the installation order. Currently, apart from using hooks, I don't have a better way either. However, we could have the components that need to watch Karmada CRDs, such as the scheduler and controller components, installed using … However, I previously discovered some drawbacks to this approach, though I don't remember them clearly; I need to research it further.

For problem 2: in the very early days we did it this way, using …

For problem 3: sorry, I'm not clear about it either.
As for the installation order: I browsed a lot of material and consulted others. Maybe the best way is still …; other feasible but less ideal ways are … Some references: …
It's okay to me. I can submit a PR and let us see the effect first.
I found another reason for … If I defined …, that is not what we expected; we expect that … This can be referenced from the document link above.
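One detail from the Helm hooks documentation that may be the behavior referenced here (my reading, stated as an assumption since the original snippet is lost): resources created by hooks are not tracked as part of the release, so `helm uninstall` or `helm upgrade` will not manage them unless a delete policy is declared, for example:

```yaml
metadata:
  annotations:
    "helm.sh/hook": pre-install
    # Without a delete policy, a hook resource is left behind after
    # helm uninstall, because hook resources are not part of the release.
    # These two policies clean it up on re-install and on success.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```

This means any component moved into the pre-install stage silently loses the normal release lifecycle (upgrade, rollback, uninstall), which is one concrete drawback of the hook-based ordering approach.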
so, …
**Which jobs are flaking**: chart-lint-test

**Which test(s) are flaking**: see the logs below

**Reason for failure**: TBD

**Anything else we need to know**: