
e2e-mixer-noauth-v1alpha3-v2 flake: command failed: "Error from server (BadRequest): container \"prometheus\" in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n" exit status 1 #12431

Closed
duderino opened this issue Mar 12, 2019 · 15 comments

Comments

@duderino (Contributor)

Describe the bug

e2e-mixer-noauth-v1alpha3-v2 fails roughly one day out of five with:

* command failed: "Error from server (BadRequest): container \"prometheus\" in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n" exit status 1

https://circleci.com/gh/istio/istio/352136
https://circleci.com/gh/istio/istio/348356
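
For context, the quoted error is what kubectl logs returns when a container has not yet started; the test treats the resulting non-zero exit as a failure. A minimal Go sketch of the equivalent manual check (not the harness's actual code; it assumes kubectl on the PATH, the istio-system namespace, and the pod name from the failure above):

```go
// Sketch only: reproduce the failing step by hand. Asking kubectl for logs
// from a container whose pod is still initializing yields the BadRequest
// error quoted above. The pod name changes every run.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	out, err := exec.Command("kubectl", "logs",
		"prometheus-7b9868c46b-trl67", // pod name from the flake above
		"-n", "istio-system",          // assumed namespace
		"-c", "prometheus",
	).CombinedOutput()
	if err != nil {
		// While the pod is in PodInitializing this prints something like:
		// command failed: "Error from server (BadRequest): container \"prometheus\"
		// in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n"
		fmt.Printf("command failed: %q %v\n", string(out), err)
		return
	}
	fmt.Print(string(out))
}
```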

@douglas-reid (Contributor)

I'm not sure what to do here. This is an infrastructure flake. Prometheus not starting up is not something specific to the test itself.

@douglas-reid (Contributor)

@duderino in the failures you linked, the actual failure happens earlier and is not Prometheus-specific:

2019-03-12T19:31:55.698092Z	error	Failed to deploy Istio.
2019-03-12T19:31:55.698105Z	error	Failed to complete Init. Error context deadline exceeded

and

2019-03-07T21:53:07.805528Z	info	Deployment rollout ends after [16m0.000192082s] with err [context deadline exceeded]
2019-03-07T21:53:07.805621Z	error	Failed to deploy Istio.
2019-03-07T21:53:07.805634Z	error	Failed to complete Init. Error context deadline exceeded

In both cases, the test ran for 16m without being able to deploy Istio.

Look at the pod status:

istio-citadel-5776967f54-p4dp5                                    1/1       Running    0          16m
istio-cleanup-secrets-f41e07de2e0f65ae621fdaf51bf914e8-kf2fn      1/1       Running    0          16m
istio-egressgateway-799dc8984c-jsh9l                              0/1       Running    0          16m
istio-galley-8698b6c764-sn2jh                                     0/1       Running    0          16m
istio-ingressgateway-6544478674-thd49                             0/1       Running    0          16m
istio-pilot-675644c765-2ft7l                                      1/2       Running    0          16m
istio-policy-86f8d5849d-hb7zb                                     2/2       Running    1          16m
istio-policy-86f8d5849d-zzk25                                     2/2       Running    1          16m
istio-security-post-install-f41e07de2e0f65ae621fdaf51bf9144c9h4   1/1       Running    0          16m
istio-sidecar-injector-6d4bc9bff-tpl6z                            0/1       Running    0          16m
istio-telemetry-855f6d5486-ndwxc                                  2/2       Running    1          16m
prometheus-7b9868c46b-nwrj9                                       0/1       Init:0/1   0          16m

Galley, Egressgateway, Ingressgateway, Pilot, and sidecar-injector all failed to even start in time.

This looks like a CircleCI issue -- maybe scheduling on a bad VM or something. I don't believe I'm the right person to diagnose this issue.
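
For reference, a minimal Go sketch (not the actual e2e harness code; the deployment names are inferred from the pod list above) of the kind of rollout wait that produces the "context deadline exceeded" errors quoted earlier: each deployment is polled until it rolls out or the overall 16-minute deadline expires.

```go
// Sketch only: poll istio-system deployments until they roll out or the
// overall context deadline (16m in the failing runs) expires.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func waitForRollout(ctx context.Context, deployment string) error {
	for {
		out, err := exec.CommandContext(ctx, "kubectl", "-n", "istio-system",
			"rollout", "status", "deployment/"+deployment, "--watch=false").CombinedOutput()
		if err == nil && strings.Contains(string(out), "successfully rolled out") {
			return nil
		}
		select {
		case <-ctx.Done():
			// Mirrors the "Deployment rollout ends ... with err [context deadline exceeded]" log.
			return fmt.Errorf("deployment %s rollout ends with err [%v]", deployment, ctx.Err())
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 16*time.Minute)
	defer cancel()
	for _, d := range []string{"istio-galley", "istio-pilot", "istio-ingressgateway", "istio-sidecar-injector"} {
		if err := waitForRollout(ctx, d); err != nil {
			fmt.Println("Failed to deploy Istio:", err)
			return
		}
	}
	fmt.Println("Istio deployed")
}
```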

@esnible (Contributor) commented Mar 13, 2019

When a pod gets stuck in Init:0/1 it would help to do a kubectl describe instead of the kubectl logs the test currently does. I have a PR for that at #11904. Can anyone help me get that PR merged?

Look at my PR and notice that ci/circleci: e2e-pilot-auth-v1alpha3-v2-non-mcp failed. I wanted it to fail! https://circleci.com/gh/istio/istio/341725?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link shows that the reason Prometheus got stuck in Init:0/1 is:

Warning  FailedScheduling  44s (x57 over 16m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

I suspect most of the Init:0/1 failures for Prometheus are the same, but we won't know until we can get the events into the test log.
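
The sketch below is not the code in PR #11904, just an illustration of the idea: when a pod is stuck in Init:0/1, dump kubectl describe so scheduler events such as the "Insufficient cpu" warning above end up in the test log instead of only the empty container logs (namespace and pod name are taken from this issue and will differ per run):

```go
// Sketch only: dump `kubectl describe pod` for a stuck pod. The Events
// section at the bottom of the output is what explains states like
// Init:0/1 (e.g. "FailedScheduling ... Insufficient cpu").
package main

import (
	"fmt"
	"os/exec"
)

func dumpPodEvents(namespace, pod string) {
	out, err := exec.Command("kubectl", "describe", "pod", pod, "-n", namespace).CombinedOutput()
	if err != nil {
		fmt.Printf("kubectl describe failed: %v\n", err)
	}
	fmt.Println(string(out))
}

func main() {
	// Pod name taken from the pod listing earlier in this issue.
	dumpPodEvents("istio-system", "prometheus-7b9868c46b-nwrj9")
}
```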

@utka (Contributor) commented Mar 20, 2019

The test is still pretty flaky. Can we increase resources for CircleCI? @costinm do you know?

@geeknoid added this to the 1.2 milestone May 12, 2019
@yangminzhu (Contributor) commented May 22, 2019

The test has failed many times in presubmit over the last 10 days, and occasionally in postsubmit as well.

As shown in the log in #12431 (comment), I think the tests failed because the pilot/galley pods don't start correctly. Looking at the pilot-discovery log, it seems pilot cannot connect to galley:

2019-05-15T22:24:12.615946Z	info	grpc: addrConn.createTransport failed to connect to {istio-galley.istio-system.svc:9901 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout". Reconnecting...
2019-05-15T22:24:12.616005Z	info	pickfirstBalancer: HandleSubConnStateChange: 0xc00063dba0, TRANSIENT_FAILURE
2019-05-15T22:24:12.616061Z	error	mcp	Failed to create a new MCP sink stream: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout"

Looking at the galley log:

2019-05-15T22:24:51.582090Z	warn	validation	istio-system/istio-galley endpoint not ready: no ready addresses

It seems galley didn't get the istio-system/istio-galley service endpoints for some reason. Could this cause galley to fail to start and then block pilot from working? The comment on the function in galley seems to imply there might be a race condition here:

// During initial Istio installation its possible for custom

@ayj, @ozevren, @douglas-reid, do you have any insights into galley's behavior in this case?

The other suspicious thing is that the galley liveness/readiness probes errored:

Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
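
One way to separate the two theories (galley never becoming ready vs. a genuine network problem) is to check the istio-galley endpoints and then dial the MCP port that pilot uses. A minimal Go sketch, assuming it runs somewhere the service DNS name istio-galley.istio-system.svc resolves and kubectl is configured:

```go
// Sketch only: distinguish "no ready endpoints behind istio-galley" (matches
// the "no ready addresses" warning in the galley log) from a network problem
// (matches the "dial tcp ...:9901: i/o timeout" in the pilot log).
package main

import (
	"fmt"
	"net"
	"os/exec"
	"time"
)

func main() {
	// ENDPOINTS showing "<none>" means galley never passed its readiness probe.
	out, err := exec.Command("kubectl", "get", "endpoints", "istio-galley",
		"-n", "istio-system", "-o", "wide").CombinedOutput()
	if err != nil {
		fmt.Printf("kubectl get endpoints failed: %v\n", err)
	}
	fmt.Print(string(out))

	// Dial the MCP port pilot connects to; an i/o timeout here points at
	// connectivity rather than galley readiness.
	conn, err := net.DialTimeout("tcp", "istio-galley.istio-system.svc:9901", 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	conn.Close()
	fmt.Println("MCP port reachable")
}
```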

@ozevren (Contributor) commented May 22, 2019

The comment in the code describes the inherent raciness and explains why the code is written that way (i.e. it has to deal with the race). The validation and distribution parts are not tied together. The Galley logs show that the MCP server is in a ready state and waiting for incoming connections (which never arrive). The Pilot logs corroborate this:

2019-05-15T22:23:07.368658Z	info	pickfirstBalancer: HandleSubConnStateChange: 0xc00063dba0, TRANSIENT_FAILURE
2019-05-15T22:23:07.368803Z	error	mcp	Failed to create a new MCP sink stream: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout"

It looks like this is a network connectivity problem.

@douglas-reid (Contributor)

@utka if we are seeing an increased rate of flakes resulting from network connectivity issues (as @ozevren points out), what is our path forward? Is this something we should take up with the GKE team? I'm not sure how to proceed.

@utka (Contributor) commented May 23, 2019

It's CircleCI; the plan is to move away from CircleCI and use GKE instead, where we can get more resources.

@yangminzhu (Contributor)

@utka
It seems the test is failing more frequently on prow: https://testgrid.k8s.io/istio-presubmits-master#e2e-mixer-no_auth&width=20

I'm still wondering whether this is caused by a bug in Istio or whether it really is a network connectivity problem. I found the following log in multiple failed tests, which might be related to #6085:

[2019-05-23 21:02:04.346][20][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 14, no healthy upstream

@duderino assigned fpesce and unassigned utka Jun 12, 2019
@duderino (Contributor, Author)

@fpesce do you still see this flake after we increased the CircleCI resources? If not, please close this out. If so, please reassign it to me.

@istio-testing (Collaborator)

Hey, there's no updates for this test flakes for 3 days.


@istio-testing (Collaborator)

🤔 ❄️ Hey there's no update for this test flakes for 3 days.


@fpesce (Contributor) commented Jul 10, 2019

I have checked the logs for this particular test over the last 30 runs; there were 5 failures, and all of them are legitimate. It does not look flaky at all.

@fpesce closed this as completed Jul 10, 2019