clusterctl upgrade tests are flaky #9688

killianmuldoon · 2023-11-08T12:35:58Z

The clusterctl upgrade tests have been significantly flaky in the last couple of weeks, with flakes occurring on main, release-1.4 and release-1.5.

The flakes are occurring across many forms of the clusterctl upgrade tests including v0.4=>current, v1.3=>current and v1.0=>current.

The failures take a number of forms, including but not limited to:

exec.ExitError: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&xjob=.*-provider-.*#f5ccd02ae151196a4bf1
failed to find releases: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#983e849a73bad197d73b
failed to discovery ownerGraph types: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*#176363ebfcd19172c1ac

There's an overall triage for tests with clusterctl upgrades in the name here: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-08&job=.*-cluster-api-.*&test=.*clusterctl%20upgrades.*&xjob=.*-provider-.*

/kind flake


killianmuldoon · 2023-11-08T12:36:49Z

@kubernetes-sigs/cluster-api-release-team These flakes are very disruptive to the test signal right now. It would be great if someone could prioritize investigating and fixing them ahead of the releases.

/triage accepted

killianmuldoon · 2023-11-08T12:45:42Z

/help

k8s-ci-robot · 2023-11-08T12:45:44Z

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

killianmuldoon · 2023-11-08T12:49:31Z

Note that each branch runs a different number of variants of this test, enumerated below, which may be responsible for some unevenness in the signal:

release-1.4: 7
release-1.5: 6
main: 5

adilGhaffarDev · 2023-11-15T19:01:54Z

I am looking into this one.

furkatgofurov7 · 2023-11-15T19:42:42Z

I will be pairing up with @adilGhaffarDev on this one since it is happening more frequently.

/assign @adilGhaffarDev

adilGhaffarDev · 2023-11-20T12:04:34Z

Adding a bit more explanation regarding failures. We have three failures in clusterctl upgrade:

exec.ExitError: This one happens at "Applying the cluster template yaml to the cluster". I opened a PR against release-1.4 that changes KubectlApply to a controller-runtime Create and ignores AlreadyExists errors, as @killianmuldoon suggested. So far I haven't seen this failure on my 1.4 PR (ref: 🌱 Adding to the test framework the equivalent to kubectl create -f. #9731). The test still fails, but not at the apply step, so I think switching from KubectlApply to Create and ignoring AlreadyExists fixes this one. I will create a PR on main too.
failed to discovery ownerGraph types: This one happens at "Running Post-upgrade steps against the management cluster". I have looked into the logs and I am seeing this error:

{"ts":1700405055471.4797,"caller":"builder/webhook.go:184","msg":"controller-runtime/builder: Conversion webhook enabled","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerClusterTemplate"}
{"ts":1700405055471.7637,"caller":"builder/webhook.go:139","msg":"controller-runtime/builder: skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}
{"ts":1700405055472.0557,"caller":"builder/webhook.go:168","msg":"controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}

It might be something related to DockerMachinePool; we might need to backport the recent DockerMachinePool fixes. Another interesting thing is that I don't see this failure on main. It is only happening on v1.4 and v1.5.

failed to find releases: This one happens at clusterctl init. I am still looking into this one.
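
To illustrate why a Create that ignores AlreadyExists is more retry-friendly than a one-shot apply, here is a minimal standard-library sketch. It is not the code from #9731: `fakeCluster`, `createAll` and `errAlreadyExists` are hypothetical stand-ins (real test code would use the controller-runtime client and an AlreadyExists check from apimachinery).

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical sentinel standing in for the API
// server's "already exists" response.
var errAlreadyExists = errors.New("already exists")

// fakeCluster is a hypothetical in-memory stand-in for the management cluster.
type fakeCluster struct {
	objects map[string]string
}

// create stores an object and fails if the name is already taken.
func (c *fakeCluster) create(name, spec string) error {
	if _, ok := c.objects[name]; ok {
		return fmt.Errorf("create %q: %w", name, errAlreadyExists)
	}
	c.objects[name] = spec
	return nil
}

// createAll creates every object and treats "already exists" as success, so
// calling it again after a partial failure does not trip over objects that
// an earlier attempt already created.
func createAll(c *fakeCluster, names []string, specs map[string]string) error {
	for _, name := range names {
		err := c.create(name, specs[name])
		if err != nil && !errors.Is(err, errAlreadyExists) {
			return err
		}
	}
	return nil
}

func main() {
	c := &fakeCluster{objects: map[string]string{"cluster": "old"}}
	names := []string{"cluster", "machine"}
	specs := map[string]string{"cluster": "new", "machine": "new"}
	// The pre-existing "cluster" object is left untouched; "machine" is added.
	fmt.Println(createAll(c, names, specs), len(c.objects), c.objects["cluster"])
}
```

Note the trade-off in this sketch: an object that already exists is not updated, which is acceptable only when the test creates fresh objects.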

sbueringer · 2023-11-20T15:26:20Z

I have looked into logs and I am seeing this error:

{"ts":1700405055471.4797,"caller":"builder/webhook.go:184","msg":"controller-runtime/builder: Conversion webhook enabled","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerClusterTemplate"}
{"ts":1700405055471.7637,"caller":"builder/webhook.go:139","msg":"controller-runtime/builder: skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}
{"ts":1700405055472.0557,"caller":"builder/webhook.go:168","msg":"controller-runtime/builder: skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","v":0,"GVK":"infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerMachinePool"}

This is not an error. These are just info messages that surface that we are calling ctrl.NewWebhookManagedBy(mgr).For(c).Complete() for an object that has no validating or defaulting webhooks (we still get the same on main as we should)

adilGhaffarDev · 2024-01-19T13:31:46Z

Update on this issue.
I am not seeing the following flakes anymore:

exec.ExitError
failed to find releases

failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)

Ref: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#3c64d10ff3eda504da75

sbueringer · 2024-01-22T12:07:05Z

@adilGhaffarDev So the clusterctl upgrade test is 100% stable apart from "failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)"?

Ref: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#3c64d10ff3eda504da75

Is not showing anything for me

adilGhaffarDev · 2024-01-22T12:24:12Z

@adilGhaffarDev So the clusterctl upgrade test is 100% stable apart from "failed to discovery ownerGraph types flake is still happening but only when upgrading from (v0.4=>current)"?

Sorry for the bad link, here is a more persistent link: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&test=clusterctl%20upgrades%20&xjob=.*-provider-.*

Maybe not 100% stable; there are very minor flakes that happen sometimes. But failed to find releases and exec.ExitError are not happening anymore.

sbueringer · 2024-01-22T12:49:38Z

@adilGhaffarDev exec.ExitError does not occur anymore because I improved the error output here:

cluster-api/test/framework/cluster_proxy.go

Line 258 in adce020

    
           return pkgerrors.New(fmt.Sprintf("%s: stderr: %s", err.Error(), exitErr.Stderr))

(https://github.com/kubernetes-sigs/cluster-api/pull/9737/files)

That doesn't mean the underlying errors are fixed unfortunately.

adilGhaffarDev · 2024-01-22T13:06:08Z

@adilGhaffarDev exec.ExitError does not occur anymore because I improved the error output here:

exec.ExitError was happening at step: "INFO: Applying the cluster template yaml to the cluster". I don't see any failure that is happening at the same step where exec.ExitError was happening. Do you see any failure on triage that is related to that? I am unable to find it.

sbueringer · 2024-01-22T13:55:12Z

Sounds good! Nope I didn't see any. Just wanted to clarify that the errors would look different now. But if the same step works now, it should be fine.

Just not sure what changed as I don't remember fixing/changing anything there.

adilGhaffarDev · 2024-01-23T07:50:38Z

Just not sure what changed as I don't remember fixing/changing anything there.

This is the new error that was happening after your PR, it seems like it stopped happening after 07-12-2023.
https://storage.googleapis.com/k8s-triage/index.html?date=2023-12-10&job=.*-cluster-api-.*&xjob=.*-provider-.*#6710a9c85a9bbdb4d278

Only PR on 07-12-2023 that might have fixed this seemed to be this one: #9819 , but I am not sure.

sbueringer · 2024-01-23T10:56:55Z

#9819 should not be related. That func is called later in clusterctl_upgrade.go (l.516), while the issue happens at l.389.

So this is the error we get there

{Expected success, but got an error:
    <*errors.fundamental | 0xc000912948>: 
    exit status 1: stderr: 
    {
        msg: "exit status 1: stderr: ",
        stack: [0x1f3507a, 0x2010aa2, 0x84e4db, 0x862a98, 0x4725a1],
    } failed [FAILED] Expected success, but got an error:
    <*errors.fundamental | 0xc000912948>: 
    exit status 1: stderr: 
    {
        msg: "exit status 1: stderr: ",
        stack: [0x1f3507a, 0x2010aa2, 0x84e4db, 0x862a98, 0x4725a1],
    }

This is the corresponding output (under "open stdout")

Running kubectl apply --kubeconfig /tmp/e2e-kubeconfig3133952171 -f -
stderr:
Unable to connect to the server: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

stdout:

So looks like the mgmt cluster was not reachable.

Thx for digging into this. I would say let's ignore this error for now as it's not occurring anymore. Good enough for me to know the issue stopped happening (I assumed it might be still there and just looks different).

adilGhaffarDev · 2024-02-09T13:28:49Z

A little more explanation of the clusterctl upgrade failure. We are now seeing only one flake, when upgrading from 0.4->1.4 or 0.4->1.5, as mentioned before. It's failing with the following error:

failed to discovery ownerGraph types: action failed after 9 attempts: failed to list "infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerCluster" resources: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha4, Kind=DockerCluster failed: Post "https://capd-webhook-service.capd-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority

This failure happens in the post-upgrade step, where we call ValidateOwnerReferencesOnUpdate. We have this post-upgrade step only when we are upgrading from v1alpha to v1beta. @killianmuldoon, I believe you have worked on it; can you check this when you get time?

chrischdi · 2024-02-23T10:18:29Z

Little more explanation to clusterctl upgrade failure. Now we are seeing only one flake when upgrading from 0.4->1.4 or 0.4->1.5, as mentioned before. Its failing with following error:

failed to discovery ownerGraph types: action failed after 9 attempts: failed to list "infrastructure.cluster.x-k8s.io/v1beta1, Kind=DockerCluster" resources: conversion webhook for infrastructure.cluster.x-k8s.io/v1alpha4, Kind=DockerCluster failed: Post "https://capd-webhook-service.capd-system.svc:443/convert?timeout=30s": x509: certificate signed by unknown authority

This failure happens in post upgrade step, where we are are calling ValidateOwnerReferencesOnUpdate . We have this post upgrade step only when we upgrading from v1alpha to v1beta. I believe @killianmuldoon you have worked on it, can you check this when you get time.

🤔 : may be helpful to collect cert-manager resources + logs to analyse this. Or is this locally reproducible?

adilGhaffarDev · 2024-02-23T10:21:26Z

🤔 : may be helpful to collect cert-manager resources + logs to analyse this. Or is this locally reproducible?

I haven't been able to reproduce it locally. I have run it multiple times.

chrischdi · 2024-02-26T08:49:42Z

Some observation via #10193 :

I0223 19:20:31.564028       1 controller.go:162] "re-queuing item due to optimistic locking on resource" logger="cert-manager.certificates-key-manager" key="capd-system/capd-serving-cert" error="Operation cannot be fulfilled on certificates.cert-manager.io \"capd-serving-cert\": the object has been modified; please apply your changes to the latest version and try again"

Source: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api/10193/pull-cluster-api-e2e-blocking-release-1-5/1761093535257333760/artifacts/5/clusters/clusterctl-upgrade-k13xsy/logs/cert-manager/cert-manager/cert-manager-6bbb455dc9-p88kt/cert-manager-controller.log

Maybe related cert-manager issue: cert-manager/cert-manager#6464

Edit: updated #10193 to now hopefully collect the cert-manager CRs. Maybe we can implement something which waits for the certificates to be ready or similar.

chrischdi · 2024-02-26T09:52:04Z

Maybe worth taking a look at whether the clusterctl upgrade flow could be improved. The question is:

Why does cert-manager need to update the ca at the CRD again? We are not deleting the CRD, only updating it.
- Edit: Answer: we delete the Certificate and CertificateSigningRequest, so the CRD contains an old CA and we need to wait for cert-manager for the new Certificate to be issued and ca-injector to inject the new certificate.

chrischdi · 2024-02-26T17:03:11Z

Testing a potential fix: 6f73c3a

at #10193

chrischdi · 2024-02-27T12:45:29Z

Fix is here for the failed to discovery ownerGraph types error:

🐛 test: retry GetOwnerGraph in owner references test on certificate errors #10201

this should catch all x509: certificate signed by unknown authority / failed to discovery ownerGraph types errors which occur in clusterctl_upgrade tests related to conversion webhooks.

Should get cherry-picked to all supported branches.

link for ownergraph errors

link for all x509 errors to check occurrences of the flake and confirm it is fixed.

adilGhaffarDev · 2024-03-11T10:36:18Z

@chrischdi thank you for working on it, now we are not seeing this flake too much, nice work. On k8s triage I can see that now ownergraph flake is only happening in (v0.4=>v1.6=>current) tests, the other flakes seem to be fixed or they are much less flaky.
ref: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#4f4c67c927112191922f

chrischdi · 2024-03-11T12:48:34Z

@chrischdi thank you for working on it, now we are not seeing this flake too much, nice work. On k8s triage I can see that now ownergraph flake is only happening in (v0.4=>v1.6=>current) tests, the other flakes seem to be fixed or they are much less flaky. ref: https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#4f4c67c927112191922f

Note: this is a different flake, not directly ownergraph but similar. It happens at a different place though.

cluster-api/test/e2e/clusterctl_upgrade.go

Lines 553 to 568 in 487ed95

    
    Consistently(func() bool {
        postUpgradeMachineList := &unstructured.UnstructuredList{}
        postUpgradeMachineList.SetGroupVersionKind(schema.GroupVersionKind{
            Group:   clusterv1.GroupVersion.Group,
            Version: coreCAPIStorageVersion,
            Kind:    "MachineList",
        })
        err = managementClusterProxy.GetClient().List(
            ctx,
            postUpgradeMachineList,
            client.InNamespace(workloadCluster.GetNamespace()),
            client.MatchingLabels{clusterv1.ClusterNameLabel: workloadCluster.GetName()},
        )
        Expect(err).ToNot(HaveOccurred())
        return validateMachineRollout(preUpgradeMachineList, postUpgradeMachineList)
    }, "3m", "30s").Should(BeTrue(), "Machines should remain the same after the upgrade")

We could probably also ignore the x509 errors here and ensure that the last try in Consistently succeeded (by storing and checking the last error outside of Consistently).

sbueringer · 2024-03-13T12:25:19Z

We could propably also ignore the x509 errors here and ensure that the last try in Consistently succeeded (by storing and checking the last error outside of Consistently)

We could also add an Eventually before to wait until the List call works and then keep the Consistently the same

Btw, thx folks, really nice work on this issue!

adilGhaffarDev · 2024-03-14T14:56:29Z

We could also add an Eventually before to wait until the List call works and then keep the Consistently the same

I will open a PR with your suggestion

adilGhaffarDev · 2024-04-09T07:06:19Z

#10301 did not fix the issue, failure is still there for (v0.4=>v1.6=>current) tests. I will try to reproduce it locally.

fabriziopandini · 2024-04-11T18:47:29Z

/priority important-soon

chrischdi · 2024-04-23T14:38:58Z

I implemented a fix at #10469 which should fix the situation.
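
Going only by the title of #10469 (cert-manager objects get applied before other provider objects), the ordering idea can be sketched as a stable sort. This is not the actual clusterctl implementation: the `object` type, `certManagerFirst`, and all object names except capd-serving-cert are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// object is a hypothetical, minimal stand-in for a provider component manifest.
type object struct {
	APIVersion string
	Kind       string
	Name       string
}

// isCertManager reports whether the object belongs to a cert-manager API group.
func isCertManager(o object) bool {
	group := strings.SplitN(o.APIVersion, "/", 2)[0]
	return group == "cert-manager.io" || strings.HasSuffix(group, ".cert-manager.io")
}

// certManagerFirst returns a copy with cert-manager objects moved to the
// front, keeping the relative order within each group.
func certManagerFirst(objs []object) []object {
	out := append([]object(nil), objs...)
	sort.SliceStable(out, func(i, j int) bool {
		return isCertManager(out[i]) && !isCertManager(out[j])
	})
	return out
}

func main() {
	objs := []object{
		{"apiextensions.k8s.io/v1", "CustomResourceDefinition", "dockerclusters"},
		{"cert-manager.io/v1", "Certificate", "capd-serving-cert"},
		{"apps/v1", "Deployment", "capd-controller-manager"},
		{"cert-manager.io/v1", "Issuer", "capd-selfsigned-issuer"},
	}
	for _, o := range certManagerFirst(objs) {
		fmt.Println(o.Kind, o.Name)
	}
}
```

Applying the Certificate and Issuer first gives cert-manager a head start on issuing the new certificate before the CRDs that need the injected CA are updated.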

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 8, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 8, 2023

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 8, 2023

killianmuldoon mentioned this issue Nov 9, 2023

Replace kubectl apply in e2e test framework #9696

Open

k8s-ci-robot assigned adilGhaffarDev Nov 15, 2023

adilGhaffarDev mentioned this issue Nov 20, 2023

🌱 Adding to the test framework the equivalent to kubectl create -f. #9731

Closed

fabriziopandini added area/clusterctl Issues or PRs related to clusterctl area/e2e-testing Issues or PRs related to e2e testing labels Jan 19, 2024

adilGhaffarDev mentioned this issue Feb 13, 2024

SSA failure after removal of old API versions #10051

Open

chrischdi mentioned this issue Feb 23, 2024

🌱 [DoNotReview] [WIP] debug cert-manager #10193

Closed

chrischdi mentioned this issue Feb 27, 2024

🐛 test: retry GetOwnerGraph in owner references test on certificate errors #10201

Merged

adilGhaffarDev mentioned this issue Mar 21, 2024

🐛 Add wait for MachineList to be available #10301

Merged

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 11, 2024

chrischdi mentioned this issue Apr 12, 2024

[WIP][DoNotReview]✨ clusterctl: verify cert-manager did inject the CA certificates into the objects before proceeding #10433

Closed

chrischdi mentioned this issue Apr 23, 2024

🐛 clusterctl: ensure cert-manager objects get applied before other provider objects #10469

Merged

Rozzii mentioned this issue May 6, 2024

Removal of v1alpha5 apiversion metal3-io/cluster-api-provider-metal3#971

Closed
