Failing Test: timeout in deployment on gci-gke-serial #69597
This test job has always been flaky, but is now failing more often than it's passing. Diagnosing it is difficult, though, because even the "successful" runs have multiple failures in the log, and the reported test failures do not seem to be independent failures, but rather consequences of repeated failures to deploy pods and storage in the test setup.
Does anyone have more insight into what's going on here? This feels more like a google cloud or test config failure than anything else.
It also looks like https://k8s-testgrid.appspot.com/sig-release-master-blocking#gci-gce-serial has similar issues, so this is actually not a GKE-specific problem.
One of the kubelet logs has:
not sure if that's helpful at all...
@AishSundar: GitHub didn't allow me to assign the following users: artemvmin.
The suites are failing on both gce and gke, so I don't think it's a gke specific issue.
Looking at one occurrence: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/4196
The first test that started failing was:
Because it couldn't schedule the pod. Looking at the scheduler logs, I only see these two lines referring to that pod and nothing else:
This is odd because there is usually at least one more log message if scheduling failed or succeeded. This implies that the pod is stuck somewhere in the scheduler and there could potentially be a deadlock. There are no other scheduling logs about other pods after this.
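For anyone who wants to repeat this check on another run, here is a rough sketch of how the scheduler log could be scanned: count the lines that mention the pod and see how many log lines follow the last mention. The function name `pod_log_summary` and the plain substring match are my own assumptions for illustration. This does not depend on the actual scheduler log format, and it is not part of any Kubernetes tooling.

```python
def pod_log_summary(log_lines, pod_name):
    """Summarize where a pod is mentioned in a list of log lines.

    Hypothetical helper: matches by plain substring, so it makes no
    assumption about the real scheduler log format.
    """
    # Indices of every line that mentions the pod name.
    hits = [i for i, line in enumerate(log_lines) if pod_name in line]
    if not hits:
        return {"mentions": 0, "lines_after_last_mention": None}
    return {
        "mentions": len(hits),
        # How much log output exists after the pod was last mentioned.
        "lines_after_last_mention": len(log_lines) - 1 - hits[-1],
    }
```

A small `mentions` count, with few or no scheduling-related lines after the last mention, would match the "stuck in the scheduler" pattern described above. It would not prove a deadlock on its own.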
An odd pattern that I noticed was that for all the serial runs where a ton of tests fail, the test case that always runs before the subsequent failing tests is the equivalence cache tests. And there's always a line in the scheduler.log like:
I wonder if there is some race condition regarding the pod deletion case.
I have seen this pattern in the most recent runs where a bunch of tests all fail:
The first failing test case is different across all these runs, but the commonality between them all is that the equivalence cache tests run right before it.
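To check this pattern mechanically rather than by eye, something like the following sketch could tally which test ran immediately before the first failure in each run. It assumes each run is available as an ordered list of `(test_name, passed)` pairs. That input shape and both function names are hypothetical, and extracting the pairs from the real job artifacts is not shown.

```python
from collections import Counter


def predecessor_of_first_failure(run):
    """Return the name of the test that ran just before the first failure.

    `run` is an ordered list of (test_name, passed) pairs (assumed shape).
    Returns None if nothing failed or the first test itself failed.
    """
    for i, (_name, passed) in enumerate(run):
        if not passed:
            return run[i - 1][0] if i > 0 else None
    return None


def common_predecessors(runs):
    """Count, across runs, which test preceded the first failure."""
    counts = Counter()
    for run in runs:
        prev = predecessor_of_first_failure(run)
        if prev is not None:
            counts[prev] += 1
    return counts
```

If the equivalence cache tests dominate the resulting counts across many failing runs, that would support the suspicion. A predecessor that varies from run to run would point elsewhere.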