
e2e-mixer-noauth-v1alpha3-v2 flake: command failed: "Error from server (BadRequest): container \"prometheus\" in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n" exit status 1 #12431

Closed
duderino opened this issue Mar 12, 2019 · 15 comments

Comments

@duderino (Contributor)

Describe the bug

e2e-mixer-noauth-v1alpha3-v2 fails roughly one day out of five with:

* command failed: "Error from server (BadRequest): container \"prometheus\" in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n" exit status 1

https://circleci.com/gh/istio/istio/352136
https://circleci.com/gh/istio/istio/348356
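
For context, the quoted error is what kubectl logs returns when a container has not yet started; the test treats the resulting non-zero exit as a failure. A minimal Go sketch of the equivalent manual check (not the harness's actual code; it assumes kubectl on the PATH, the istio-system namespace, and the pod name from the failure above):

```go
// Sketch only: reproduce the failing step by hand. Asking kubectl for logs
// from a container whose pod is still initializing yields the BadRequest
// error quoted above. The pod name changes every run.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	out, err := exec.Command("kubectl", "logs",
		"prometheus-7b9868c46b-trl67", // pod name from the flake above
		"-n", "istio-system",          // assumed namespace
		"-c", "prometheus",
	).CombinedOutput()
	if err != nil {
		// While the pod is in PodInitializing this prints something like:
		// command failed: "Error from server (BadRequest): container \"prometheus\"
		// in pod \"prometheus-7b9868c46b-trl67\" is waiting to start: PodInitializing\n"
		fmt.Printf("command failed: %q %v\n", string(out), err)
		return
	}
	fmt.Print(string(out))
}
```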

@douglas-reid (Contributor)

I'm not sure what to do here. This is an infrastructure flake. Prometheus not starting up is not something specific to the test itself.

@douglas-reid (Contributor)

@duderino in the failures you linked, the actual failure happens earlier and is not Prometheus-specific:

2019-03-12T19:31:55.698092Z	error	Failed to deploy Istio.
2019-03-12T19:31:55.698105Z	error	Failed to complete Init. Error context deadline exceeded

and

2019-03-07T21:53:07.805528Z	info	Deployment rollout ends after [16m0.000192082s] with err [context deadline exceeded]
2019-03-07T21:53:07.805621Z	error	Failed to deploy Istio.
2019-03-07T21:53:07.805634Z	error	Failed to complete Init. Error context deadline exceeded

In both cases, the test ran for 16m without being able to deploy Istio.

Look at the pod status:

istio-citadel-5776967f54-p4dp5                                    1/1       Running    0          16m
istio-cleanup-secrets-f41e07de2e0f65ae621fdaf51bf914e8-kf2fn      1/1       Running    0          16m
istio-egressgateway-799dc8984c-jsh9l                              0/1       Running    0          16m
istio-galley-8698b6c764-sn2jh                                     0/1       Running    0          16m
istio-ingressgateway-6544478674-thd49                             0/1       Running    0          16m
istio-pilot-675644c765-2ft7l                                      1/2       Running    0          16m
istio-policy-86f8d5849d-hb7zb                                     2/2       Running    1          16m
istio-policy-86f8d5849d-zzk25                                     2/2       Running    1          16m
istio-security-post-install-f41e07de2e0f65ae621fdaf51bf9144c9h4   1/1       Running    0          16m
istio-sidecar-injector-6d4bc9bff-tpl6z                            0/1       Running    0          16m
istio-telemetry-855f6d5486-ndwxc                                  2/2       Running    1          16m
prometheus-7b9868c46b-nwrj9                                       0/1       Init:0/1   0          16m

Galley, Egressgateway, Ingressgateway, Pilot, and sidecar-injector all failed to even start in time.

This looks like a CircleCI issue -- maybe scheduling on a bad VM or something. I don't believe I'm the right person to diagnose this issue.
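
For reference, a minimal Go sketch (not the actual e2e harness code; the deployment names are inferred from the pod list above) of the kind of rollout wait that produces the "context deadline exceeded" errors quoted earlier: each deployment is polled until it rolls out or the overall 16-minute deadline expires.

```go
// Sketch only: poll istio-system deployments until they roll out or the
// overall context deadline (16m in the failing runs) expires.
package main

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

func waitForRollout(ctx context.Context, deployment string) error {
	for {
		out, err := exec.CommandContext(ctx, "kubectl", "-n", "istio-system",
			"rollout", "status", "deployment/"+deployment, "--watch=false").CombinedOutput()
		if err == nil && strings.Contains(string(out), "successfully rolled out") {
			return nil
		}
		select {
		case <-ctx.Done():
			// Mirrors the "Deployment rollout ends ... with err [context deadline exceeded]" log.
			return fmt.Errorf("deployment %s rollout ends with err [%v]", deployment, ctx.Err())
		case <-time.After(10 * time.Second):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 16*time.Minute)
	defer cancel()
	for _, d := range []string{"istio-galley", "istio-pilot", "istio-ingressgateway", "istio-sidecar-injector"} {
		if err := waitForRollout(ctx, d); err != nil {
			fmt.Println("Failed to deploy Istio:", err)
			return
		}
	}
	fmt.Println("Istio deployed")
}
```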

@esnible (Contributor) commented Mar 13, 2019

When a pod gets stuck in Init:0/1 it would help to do a kubectl describe instead of the kubectl logs the test currently does. I have a PR for that at #11904. Can anyone help me get that PR merged?

Look at my PR and notice that ci/circleci: e2e-pilot-auth-v1alpha3-v2-non-mcp failed. I wanted it to fail! https://circleci.com/gh/istio/istio/341725?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link shows that the reason Prometheus got stuck in Init:0/1 is:

Warning  FailedScheduling  44s (x57 over 16m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

I suspect most of the Init:0/1 failures for Prometheus are the same, but we won't know until we can get the events into the test log.
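
The sketch below is not the code in PR #11904, just an illustration of the idea: when a pod is stuck in Init:0/1, dump kubectl describe so scheduler events such as the "Insufficient cpu" warning above end up in the test log instead of only the empty container logs (namespace and pod name are taken from this issue and will differ per run):

```go
// Sketch only: dump `kubectl describe pod` for a stuck pod. The Events
// section at the bottom of the output is what explains states like
// Init:0/1 (e.g. "FailedScheduling ... Insufficient cpu").
package main

import (
	"fmt"
	"os/exec"
)

func dumpPodEvents(namespace, pod string) {
	out, err := exec.Command("kubectl", "describe", "pod", pod, "-n", namespace).CombinedOutput()
	if err != nil {
		fmt.Printf("kubectl describe failed: %v\n", err)
	}
	fmt.Println(string(out))
}

func main() {
	// Pod name taken from the pod listing earlier in this issue.
	dumpPodEvents("istio-system", "prometheus-7b9868c46b-nwrj9")
}
```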

@utka (Contributor) commented Mar 20, 2019

The test is still pretty flaky. Can we increase resources for CircleCI? @costinm do you know?

@geeknoid added this to the 1.2 milestone May 12, 2019
@yangminzhu (Contributor) commented May 22, 2019

The test has failed many times in presubmit over the last 10 days, and occasionally in postsubmit as well.

As shown in the log in #12431 (comment), I think the tests failed because the pilot/galley pods don't start correctly. Looking at the pilot-discovery log, it seems pilot cannot connect to galley:

2019-05-15T22:24:12.615946Z	info	grpc: addrConn.createTransport failed to connect to {istio-galley.istio-system.svc:9901 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout". Reconnecting...
2019-05-15T22:24:12.616005Z	info	pickfirstBalancer: HandleSubConnStateChange: 0xc00063dba0, TRANSIENT_FAILURE
2019-05-15T22:24:12.616061Z	error	mcp	Failed to create a new MCP sink stream: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout"

Looking at the galley log:

2019-05-15T22:24:51.582090Z	warn	validation	istio-system/istio-galley endpoint not ready: no ready addresses

It seems galley didn't get the istio-system/istio-galley service endpoints for some reason. Could this cause galley to fail to start and then block pilot from working? The comment on the function in galley seems to imply there might be a race condition here:

// During initial Istio installation its possible for custom

@ayj, @ozevren, @douglas-reid, do you have any insights into galley's behavior in this case?

The other suspicious thing is that the galley liveness/readiness probes errored:

Liveness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
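
One way to separate the two theories (galley never becoming ready vs. a genuine network problem) is to check the istio-galley endpoints and then dial the MCP port that pilot uses. A minimal Go sketch, assuming it runs somewhere the service DNS name istio-galley.istio-system.svc resolves and kubectl is configured:

```go
// Sketch only: distinguish "no ready endpoints behind istio-galley" (matches
// the "no ready addresses" warning in the galley log) from a network problem
// (matches the "dial tcp ...:9901: i/o timeout" in the pilot log).
package main

import (
	"fmt"
	"net"
	"os/exec"
	"time"
)

func main() {
	// ENDPOINTS showing "<none>" means galley never passed its readiness probe.
	out, err := exec.Command("kubectl", "get", "endpoints", "istio-galley",
		"-n", "istio-system", "-o", "wide").CombinedOutput()
	if err != nil {
		fmt.Printf("kubectl get endpoints failed: %v\n", err)
	}
	fmt.Print(string(out))

	// Dial the MCP port pilot connects to; an i/o timeout here points at
	// connectivity rather than galley readiness.
	conn, err := net.DialTimeout("tcp", "istio-galley.istio-system.svc:9901", 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	conn.Close()
	fmt.Println("MCP port reachable")
}
```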

@ozevren (Contributor) commented May 22, 2019

The comment in the code describes the inherent raciness and explains why the code is written that way (i.e. it has to deal with the race). The validation and distribution parts are not tied together. The Galley logs show that the MCP server is in a ready state and waiting for incoming connections (which never arrive). The Pilot logs corroborate this:

2019-05-15T22:23:07.368658Z	info	pickfirstBalancer: HandleSubConnStateChange: 0xc00063dba0, TRANSIENT_FAILURE
2019-05-15T22:23:07.368803Z	error	mcp	Failed to create a new MCP sink stream: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.104.135.73:9901: i/o timeout"

It looks like this is a network connectivity problem.

@douglas-reid (Contributor)

@utka if we are seeing an increased rate of flakes resulting from network connectivity issues (as @ozevren points out), what is our path forward? Is this something we should take up with the GKE team? I'm not sure how to proceed.

@utka (Contributor) commented May 23, 2019

It's CircleCI; the plan is to move away from CircleCI and use GKE instead, where we can get more resources.

@yangminzhu (Contributor)

@utka
It seems the test is failing more frequently on prow: https://testgrid.k8s.io/istio-presubmits-master#e2e-mixer-no_auth&width=20

I'm still wondering whether this is caused by a bug in Istio or whether it really is a network connectivity problem. I found the following log in multiple failed tests, which might be related to #6085:

[2019-05-23 21:02:04.346][20][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 14, no healthy upstream

@duderino assigned fpesce and unassigned utka Jun 12, 2019
@duderino (Contributor, Author)

@fpesce do you still see this flake after we increased the CircleCI resources? If not, please close this out. If so, please reassign it to me.

@istio-testing (Collaborator)

Hey, there's no updates for this test flakes for 3 days.


@istio-testing (Collaborator)

🤔 ❄️ Hey there's no update for this test flakes for 3 days.


@fpesce (Contributor) commented Jul 10, 2019

I have checked the logs for this particular test over the last 30 runs; there were 5 failures, and all of them are legitimate. It does not look flaky at all.

@fpesce closed this as completed Jul 10, 2019