e2e-mixer-noauth-v1alpha3-v2 flake: command failed: "Error from server (BadRequest): container \"prometheus\" in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n" exit status 1 #12431
Comments
I'm not sure what to do here. This is an infrastructure flake. Prometheus not starting up is not something specific to the test itself.
@duderino in the failures you linked, the actual failure happens earlier and is not Prometheus-specific:
In both cases, the test ran for a while before failing. Look at the pod status:
Galley, Egressgateway, Ingressgateway, Pilot, and sidecar-injector all failed to even start in time. This looks like a CircleCI issue -- maybe scheduling on a bad VM or something. I don't believe I'm the right person to diagnose this issue.
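(For anyone reproducing this triage, a minimal kubectl sketch of the checks described above — assuming the default istio-system namespace; the pod name is the one from this issue's title.)

```sh
# List control-plane pods with phase, readiness, and restart counts.
kubectl -n istio-system get pods -o wide

# For a pod stuck in PodInitializing/Pending, the Events section usually
# names the cause (failed scheduling, image pull backoff, probe failures).
kubectl -n istio-system describe pod prometheus-7b9868c46b-trl67

# If the suspicion is a bad/saturated CI VM, check node resource pressure.
kubectl describe nodes | grep -A 5 "Allocated resources"
```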
When a pod gets stuck in PodInitializing … Look at my PR and notice that …
I suspect most of the …
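(For context: a pod stays in PodInitializing until every one of its init containers has finished, so the init-container statuses and logs are the first thing to check. A rough sketch — the pod name is reused from the title, and the init-container name is a placeholder.)

```sh
NS=istio-system
POD=prometheus-7b9868c46b-trl67   # the stuck pod from the title

# Print each init container's name and current state.
kubectl -n "$NS" get pod "$POD" \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'

# Once the offending init container is known, pull its log.
kubectl -n "$NS" logs "$POD" -c <init-container-name>
```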
The test is still pretty flaky. Can we increase resources for CircleCI? @costinm do you know?
The test has failed many times over the last 10 days in presubmit, and sometimes in postsubmit as well. As shown in the log in #12431 (comment), I think the tests fail because the pilot/galley pods don't start correctly. Looking at the pilot-discovery log, it seems pilot cannot connect to galley:
Looking at the galley log:
It seems galley didn't get the istio-system/istio-galley service endpoints for some reason. Could this cause galley to fail to start and then block pilot from working? The comment on the function in galley seems to imply there might be a race condition here: istio/galley/pkg/crd/validation/webhook.go, line 310 at 36eaeee
@ayj, @ozevren, @douglas-reid, do you have any insights into galley's behavior in this case? The other suspicious thing is that the galley liveness/readiness probes errored:
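(A sketch of how the missing-endpoints theory could be checked directly while a run is wedged; the istio=galley label selector is an assumption based on the default install.)

```sh
# If galley never saw endpoints for its own service, this shows <none>.
kubectl -n istio-system get endpoints istio-galley

# Endpoints only appear once the pod passes its readiness probe, so also
# check the galley pod's readiness and recent probe events.
kubectl -n istio-system get pods -l istio=galley
kubectl -n istio-system describe pods -l istio=galley | grep -A 10 Events
```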
The comment in the code describes the inherent raciness and explains why the code is written that way (i.e. it has to deal with the race). The validation and distribution parts are not tied together. The Galley logs show that the MCP server is in the ready state and waiting for incoming connections (which never arrive). The Pilot logs corroborate this:
It looks like this is a network connectivity problem.
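(If so, it should be reproducible by hand from inside the pilot pod. A rough sketch — the istio=pilot label, the discovery container name, and the 9901 MCP port are assumptions based on the default install, and nc may not be present in every image.)

```sh
# Grab the pilot pod name.
PILOT_POD=$(kubectl -n istio-system get pod -l istio=pilot \
  -o jsonpath='{.items[0].metadata.name}')

# A plain TCP connect to galley's MCP port is enough to rule out
# basic connectivity (assumes nc exists in the discovery image).
kubectl -n istio-system exec "$PILOT_POD" -c discovery -- \
  nc -vz -w 3 istio-galley.istio-system.svc 9901
```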
It's CircleCI. The plan is to move away from CircleCI and use GKE instead, where we can get more resources.
@utka I'm still wondering whether this is caused by some bug in Istio or whether it really is a network connectivity problem. I found the following log in multiple failed tests, which might be related to #6085:
@fpesce do you still see this flake after we increased CircleCI resources? If not, please close this out. If so, please reassign to me.
Hey, there have been no updates for this test flake for 3 days.
🤔 ❄️ Hey, there's been no update for this test flake for 3 days.
I have checked the logs for this particular test over the last 30 runs: there were 5 failures, and all of them are legit. It does not look flaky at all.
Describe the bug
e2e-mixer-noauth-v1alpha3-v2 fails about 1 in 5 days with:
https://circleci.com/gh/istio/istio/352136
https://circleci.com/gh/istio/istio/348356