e2e conversion test intermittently times out when cleaning up resources #4937

Open
nrb opened this issue Apr 18, 2024 · 1 comment
Labels
kind/flake · needs-priority · triage/accepted

Comments

@nrb
Contributor

nrb commented Apr 18, 2024

/kind flake

What steps did you take and what happened:

The test [It] [unmanaged] [Cluster API Framework] Clusterctl Upgrade Spec [from latest v1beta1 release to v1beta2] Should create a management cluster and then upgrade all the providers often fails due to timeout in CI.

The upgrade test itself seems to pass, but teardown fails.

The output looks something like this:

  STEP: THE UPGRADED MANAGEMENT CLUSTER WORKS! @ 04/16/24 02:49:40.808
  STEP: PASSED! @ 04/16/24 02:49:40.808
  STEP: Dumping logs from the "clusterctl-upgrade-wcfhw0" workload cluster @ 04/16/24 02:49:40.818
  STEP: Dumping all the Cluster API resources in the "clusterctl-upgrade" namespace @ 04/16/24 02:49:40.818
  STEP: Deleting all cluster.x-k8s.io/v1beta1 clusters in namespace clusterctl-upgrade in management cluster clusterctl-upgrade-wcfhw0 @ 04/16/24 02:49:43.521
  STEP: Deleting cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf @ 04/16/24 02:49:43.623
  INFO: Waiting for the Cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted
  STEP: Waiting for cluster clusterctl-upgrade/clusterctl-upgrade-nm67wf to be deleted @ 04/16/24 02:49:43.685
  [FAILED] in [AfterEach] - /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
  STEP: Node 8 released resources: {ec2-normal:0, vpc:2, eip:2, ngw:2, igw:2, classiclb:2, ec2-GPU:0, volume-gp2:0, eventBridge-rules:50} @ 04/16/24 03:09:44.688 
  << Timeline
  [FAILED] Timed out after 1200.001s.
  Expected
      <bool>: false
  to be true
  In [AfterEach] at: /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 @ 04/16/24 03:09:43.687
  Full Stack Trace
    sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted({0x374cf98?, 0x5104ec0}, {{0x7f16fc3aa7a0?, 0xc000f16a20?}, 0xc002856700?}, {0xc001110ce0, 0x2, 0x2})
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:176 +0x1c3
    sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait({0x374cf98?, 0x5104ec0}, {{0x375fd40?, 0xc000f16a20?}, {0xc000d77770?, 0x7?}}, {0xc001110ce0, 0x2, 0x2})
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/framework/cluster_helpers.go:272 +0x426
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func3()
    	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.6.1/e2e/clusterctl_upgrade.go:552 +0x4ba
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite] 
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
------------------------------
[SynchronizedAfterSuite] PASSED [1165.726 seconds]
[SynchronizedAfterSuite] 
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-aws/test/e2e/suites/unmanaged/unmanaged_suite_test.go:57
  Timeline >>
  STEP: Dumping all the Cluster API resources in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:48.829
  STEP: Dumping all EC2 instances in the "functional-gpu-cluster-wqgqck" namespace @ 04/16/24 02:57:49.149
  STEP: Deleting all clusters in the "functional-gpu-cluster-wqgqck" namespace with intervals ["20m" "10s"] @ 04/16/24 02:58:17.15
  STEP: Deleting cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj @ 04/16/24 02:58:17.157
  INFO: Waiting for the Cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted
  STEP: Waiting for cluster functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj to be deleted @ 04/16/24 02:58:17.164
  STEP: Deleting namespace used for hosting the "" test spec @ 04/16/24 03:04:47.359
  INFO: Deleting namespace functional-gpu-cluster-wqgqck
  folder created for eks clusters: /logs/artifacts/clusters/bootstrap/aws-resources
  STEP: Tearing down the management cluster @ 04/16/24 03:16:09.45
  INFO: Error getting pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-control-plane-system/pods/capi-kubeadm-control-plane-controller-manager-66d8956f77-2n4zj": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58460->127.0.0.1:38745: read: connection reset by peer
  INFO: Error getting pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-kubeadm-bootstrap-system/pods/capi-kubeadm-bootstrap-controller-manager-78d8cb7cf6-kdr97": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: EOF
  INFO: Error getting pod capa-system/capa-controller-manager-6b8f8b488c-9dcnc, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capa-system/pods/capa-controller-manager-6b8f8b488c-9dcnc": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58472->127.0.0.1:38745: read: connection reset by peer
  INFO: Error getting pod capi-system/capi-controller-manager-656b74646d-djwxs, container manager: Get "https://127.0.0.1:38745/api/v1/namespaces/capi-system/pods/capi-controller-manager-656b74646d-djwxs": dial tcp 127.0.0.1:38745: connect: connection refused - error from a previous attempt: read tcp 127.0.0.1:58458->127.0.0.1:38745: read: connection reset by peer
  STEP: Deleting cluster-api-provider-aws-sigs-k8s-io CloudFormation stack @ 04/16/24 03:16:13.393

The above output may be a red herring, however; there is also a log line indicating that we are not collecting logs from the cluster created for the upgrade test, while the cluster being torn down in the pasted section is functional-gpu-cluster-wqgqck/functional-gpu-cluster-44q1nj.
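For context, the [FAILED] comes from the framework's WaitForClusterDeleted helper (cluster_helpers.go:176 in the stack trace), which polls until the Cluster object disappears and gives up after the configured interval; the 1200.001s timeout matches the ["20m" "10s"] intervals visible in the teardown steps. A minimal sketch of that kind of wait, assuming a controller-runtime client rather than the exact framework code (the helper name and signature below are hypothetical):

```go
// Illustrative sketch only -- not the actual cluster-api test framework code.
// The framework effectively keeps Get-ing the Cluster until it returns
// NotFound, and fails once the timeout (20m here) is exceeded.
package sketch

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForClusterDeleted is a hypothetical helper; timeout/interval correspond
// to the ["20m" "10s"] intervals seen in the teardown output above.
func waitForClusterDeleted(ctx context.Context, c client.Client, namespace, name string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		cluster := &clusterv1.Cluster{}
		err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, cluster)
		if apierrors.IsNotFound(err) {
			// The Cluster object is gone; deletion finished.
			return nil
		}
		// Cluster still present (or a transient error): keep polling.
		time.Sleep(interval)
	}
	return fmt.Errorf("timed out after %s waiting for cluster %s/%s to be deleted", timeout, namespace, name)
}
```

If the workload cluster's AWS resources (the vpc, eip, ngw, igw, and classiclb entries released by the node above) are slow to clean up or stuck, the Cluster object never disappears and a loop of this shape is what hits the 20-minute ceiling.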

What did you expect to happen:

Test resources are cleaned up without timeout.

Anything else you would like to add:

@k8s-ci-robot added the kind/flake, needs-priority, and needs-triage labels Apr 18, 2024
@nrb
Contributor Author

nrb commented Apr 18, 2024

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Apr 18, 2024