
[Failing Test] gce-cos-k8sbeta-serial (ci-kubernetes-e2e-gce-cos-k8sbeta-serial) #85814

Closed · hasheddan opened this issue Dec 2, 2019 · 8 comments · Fixed by #85916

@hasheddan (Contributor) commented Dec 2, 2019

Which jobs are failing:
gce-cos-k8sbeta-serial (ci-kubernetes-e2e-gce-cos-k8sbeta-serial)

Which test(s) are failing:
Various CSI Driver tests.

Since when has it been failing:
Earliest observed: 11/17
Consistently failing since: 11/29

Testgrid link:
https://testgrid.k8s.io/sig-release-1.17-informing#gce-cos-k8sbeta-serial

Reason for failure:

I1202 09:51:54.336] Dec  2 09:51:54.335: INFO: Define cluster role csi-gce-pd-resizer-role-volume-expand-4470
I1202 09:51:54.372] Dec  2 09:51:54.372: INFO: creating *v1.ClusterRoleBinding: csi-gce-pd-resizer-binding-volume-expand-4470
I1202 09:51:54.409] Dec  2 09:51:54.409: INFO: creating *v1.ClusterRoleBinding: psp-csi-controller-driver-registrar-role-volume-expand-4470
I1202 09:51:54.445] Dec  2 09:51:54.444: INFO: creating *v1.DaemonSet: volume-expand-4470/csi-gce-pd-node
I1202 09:51:54.482] Dec  2 09:51:54.482: INFO: creating *v1.StatefulSet: volume-expand-4470/csi-gce-pd-controller
I1202 09:52:54.742] Dec  2 09:52:54.741: FAIL: waiting for csi driver node registration on: timed out waiting for the condition
I1202 09:52:54.743] 
I1202 09:52:54.744] Full Stack Trace
I1202 09:52:54.744] k8s.io/kubernetes/test/e2e/storage/drivers.(*gcePDCSIDriver).PrepareTest(0xc00061fa20, 0xc000b6c780, 0x101000000000000, 0x7d1d6e0)
I1202 09:52:54.744] 	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi.go:452 +0x204
I1202 09:52:54.744] k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumeExpandTestSuite).defineTests.func2()
[... 179 lines skipped ...]
I1202 09:52:55.928] Dec  2 09:52:55.927: INFO: 
I1202 09:52:55.928] Latency metrics for node test-34cf3ed1e3-minion-group-h8g8
I1202 09:52:55.929] Dec  2 09:52:55.927: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
I1202 09:52:55.965] STEP: Destroying namespace "volume-expand-4470" for this suite.
I1202 09:52:56.002] 
I1202 09:52:56.003] • Failure [62.488 seconds]
I1202 09:52:56.003] [sig-storage] CSI Volumes
I1202 09:52:56.004] /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/framework.go:23
I1202 09:52:56.004]   [Driver: pd.csi.storage.gke.io][Serial]
I1202 09:52:56.004]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/csi_volumes.go:55
I1202 09:52:56.004]     [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand
I1202 09:52:56.004]     /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/base.go:100
I1202 09:52:56.005]       Verify if offline PVC expansion works [It]
I1202 09:52:56.005]       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/volume_expand.go:159
I1202 09:52:56.005] 
I1202 09:52:56.005]       Dec  2 09:52:54.741: waiting for csi driver node registration on: timed out waiting for the condition
I1202 09:52:56.005] 
I1202 09:52:56.006]       /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/drivers/csi.go:452
I1202 09:52:56.006] ------------------------------
I1202 09:52:56.006] {"msg":"FAILED [sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand Verify if offline PVC expansion works","total":725,"completed":14,"skipped":271,"failed":1,"failures":["[sig-storage] CSI Volumes [Driver: pd.csi.storage.gke.io][Serial] [Testpattern: Dynamic PV (default fs)(allowExpansion)] volume-expand Verify if offline PVC expansion works"]}

Anything else we need to know:
/cc @kubernetes/ci-signal

/milestone v1.17
/sig storage
/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Dec 2, 2019
@hasheddan hasheddan added this to New (no response yet) in 1.17 CI Signal Dec 2, 2019
@hasheddan hasheddan changed the title [Failing Test] gce-cos-k8sbeta-serial [Failing Test] gce-cos-k8sbeta-serial (ci-kubernetes-e2e-gce-cos-k8sbeta-serial) Dec 2, 2019
@msau42 (Member) commented Dec 3, 2019

/assign @jingxu97
to investigate.
@gnufied are the expansion test flakes supposed to be fixed?

@gnufied (Member) commented Dec 3, 2019

@msau42 The only failures I see in CSI+expansion are from the gce-pd CSI driver. We need to fix kubernetes-sigs/gcp-compute-persistent-disk-csi-driver#433 for that.

@ahg-g (Member) commented Dec 3, 2019

We should move "SchedulerPredicates [Serial] validates MaxPods limit number of pods that are allowed to run [Slow]" out of the e2e tests. It relies on a particular state of the cluster before the test starts that is hard to validate. An integration test should be more than enough for this; even a unit test (which we already have) would be fine.
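
For reference, the property that test exercises reduces to a simple capacity comparison, which is easy to cover at the unit level; a minimal sketch (illustrative only, not the existing scheduler unit test):

```go
package scheduler

import "testing"

// podsFit is an illustrative restatement of the MaxPods check: a new pod fits
// only if the node's pod capacity still has room for it.
func podsFit(requested, used, capacity int64) bool {
	return used+requested <= capacity
}

func TestMaxPodsLimit(t *testing.T) {
	// A typical GCE node advertises a capacity of 110 pods.
	if podsFit(1, 110, 110) {
		t.Error("expected pod not to fit on a node already at capacity")
	}
	if !podsFit(1, 109, 110) {
		t.Error("expected pod to fit when one slot is free")
	}
}
```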

@alenkacz (Contributor) commented Dec 4, 2019

@ahg-g would you be willing to work on that? It's flaking a lot right now, definitely in more than 50% of runs.

@ahg-g (Member) commented Dec 4, 2019

I created #85900; it should be addressed soon.

@jingxu97 (Contributor) commented Dec 4, 2019

There are several flaky tests. Summarizing them below:

  1. Volume expansion tests
    See comments from #85814 (comment).

  2. subPath and a few other tests
    The cause of the failure is that the node is out of pod capacity. See one of the log messages: {kubelet test-34cf3ed1e3-minion-group-bdqr} OutOfpods: Node didn't have enough resource: pods, requested: 1, used: 110, capacity: 110

This is due to the SchedulerPredicates test mentioned above (#85814 (comment)), which creates more than 100 Pods at the same time. We need to move that test out to make sure other tests have enough resources.
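
For anyone triaging similar failures, the OutOfPods condition can be confirmed by comparing each node's pods allocatable with the number of pods currently bound to it; a rough client-go sketch (assumes a local kubeconfig; not part of the e2e suite):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		// Count the pods currently bound to this node.
		pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node.Name,
		})
		if err != nil {
			panic(err)
		}
		capacity := node.Status.Allocatable.Pods().Value()
		fmt.Printf("%s: %d/%d pods\n", node.Name, len(pods.Items), capacity)
	}
}
```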

1.17 CI Signal automation moved this from New (no response yet) to Observing (observe test failure/flake before marking as resolved) Dec 4, 2019