e2e conversion test intermittently times out when cleaning up resources #4937

Open
nrb opened this issue Apr 18, 2024 · 1 comment
Labels
kind/flake · needs-priority · triage/accepted

Comments

@nrb
Contributor

nrb commented Apr 18, 2024

/kind flake

What steps did you take and what happened:

The test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers often fails due to timeout in CI.

The upgrade test itself seems to pass, but teardown fails.

The output looks something like this:

  STEP: THE UPGRADED MANAGEMENT CLUSTER WORKS! @ 04/16/24 02:49:40.808
  STEP: PASSED! @ 04/16/24 02:49:40.808
  STEP: Dumping logs from the "clusterctl-upgrade-wcfhw0" workload cluster @ 04/16/24 02:49:40.818
  STEP: Dumping all the Cluster API resources in the "clusterctl-upgrade" namespace @ 04/16/24 02:49:40.818
  STEP: Deleting all cluster.x-k8s.io/v1beta1 clusters in namespace clusterctl-upgrade in management cluster clusterctl-upgrade-wcfhw0 @ 04/16/24 02:49:43.521
  STEP: Deleting cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf @ 04/16/24 02:49:43.623
  INFO: Waiting for the Cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted
  STEP: Waiting for cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted @ 04/16/24 02:49:43.685
  [FAILED] in [AfterEach] - /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
  STEP: Node 8 released resources: {ec2-normal:0, vpc:2, eip:2, ngw:2, igw:2, classiclb:2, ec2-GPU:0, volume-gp2:0, eventBridge-rules:50} @ 04/16/24 03:09:44.688 
  << Timeline
  [FAILED] Timed out after 1200.001s.
  Expected
      <bool>: false
  to be true
  In [AfterEach] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
  Full Stack Trace
    sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted({0x374cf98?, 0x5104ec0}, {{0x7f16fc3aa7a0?, 0xc000f16a20?}, 0xc002856700?}, {0xc001110ce0, 0x2, 0x2})
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 +0x1c3
    sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait({0x374cf98?, 0x5104ec0}, {{0x375fd40?, 0xc000f16a20?}, {0xc000d77770?, 0x7?}}, {0xc001110ce0, 0x2, 0x2})
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:272 +0x426
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func3()
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/e2e/clusterctl_upgrade.go:552 +0x4ba
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite] 
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
------------------------------
[SynchronizedAfterSuite] PASSED [1165.726 seconds]
[SynchronizedAfterSuite] 
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
  Timeline >>
  STEP: Dumping all the Cluster API resources in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:48.829
  STEP: Dumping all EC2 instances in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:49.149
  STEP: Deleting all clusters in the "functional-gpu-cluster-wqgqck" namespace with intervals ["20m" "10s"] @ 04/16/24 02:58:17.15
  STEP: Deleting cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj @ 04/16/24 02:58:17.157
  INFO: Waiting for the Cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted
  STEP: Waiting for cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted @ 04/16/24 02:58:17.164
  STEP: Deleting namespace used for hosting the "" test spec @ 04/16/24 03:04:47.359
  INFO: Deleting namespace functional-gpu-cluster-wqgqck
  folder created for eks clusters: /logs/artifacts/clusters/bootstrap/aws-resources
  STEP: Tearing down the management cluster @ 04/16/24 03:16:09.45
  INFO: Error getting pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-control-plane-system/pods/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58460->127.0.0.1:38745: read: connection reset by peer
  INFO: Error getting pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-bootstrap-system/pods/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: EOF
  INFO: Error getting pod capa-system/capa-controller-manager-6b8f8b488c-9dcnc, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capa-system/pods/capa-controller-manager-6b8f8b488c-9dcnc": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58472->127.0.0.1:38745: read: connection reset by peer
  INFO: Error getting pod capi-system/capi-controller-manager-656b74646d-djwxs, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-system/pods/capi-controller-manager-656b74646d-djwxs": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58458->127.0.0.1:38745: read: connection reset by peer
  STEP: Deleting cluster-api-provider-aws-sigs-k8s-io CloudFormation stack @ 04/16/24 03:16:13.393

The above output may be a red herring, however; there is also a log line indicating that we are not collecting logs from the cluster created for the upgrade test, while the cluster being torn down in the pasted section is functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj.
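For context, the [FAILED] comes from the framework's WaitForClusterDeleted helper (cluster_helpers.go:176 in the stack trace), which polls until the Cluster object disappears and gives up after the configured interval; the 1200.001s timeout matches the ["20m" "10s"] intervals visible in the teardown steps. A minimal sketch of that kind of wait, assuming a controller-runtime client rather than the exact framework code (the helper name and signature below are hypothetical):

```go
// Illustrative sketch only -- not the actual cluster-api test framework code.
// The framework effectively keeps Get-ing the Cluster until it returns
// NotFound, and fails once the timeout (20m here) is exceeded.
package sketch

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForClusterDeleted is a hypothetical helper; timeout/interval correspond
// to the ["20m" "10s"] intervals seen in the teardown output above.
func waitForClusterDeleted(ctx context.Context, c client.Client, namespace, name string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		cluster := &clusterv1.Cluster{}
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, cluster)
		if apierrors.IsNotFound(err) {
			// The Cluster object is gone; deletion finished.
			return nil
		}
		// Cluster still present (or a transient error): keep polling.
		time.Sleep(interval)
	}
	return fmt.Errorf("timed out after %s waiting for cluster %s/%s to be deleted", timeout, namespace, name)
}
```

If the workload cluster's AWS resources (the vpc, eip, ngw, igw, and classiclb entries released by the node above) are slow to clean up or stuck, the Cluster object never disappears and a loop of this shape is what hits the 20-minute ceiling.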

What did you expect to happen:

Test resources are cleaned up without timeout.

Anything else you would like to add:

@k8s-ci-robot added the kind/flake, needs-priority, and needs-triage labels Apr 18, 2024
@nrb
Contributor Author

nrb commented Apr 18, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Apr 18, 2024