istio-integ-k8s-tests timeout on Prow #13280

Closed
howardjohn opened this issue Apr 12, 2019 · 14 comments

Comments

@howardjohn
Member

Tests are failing after 2 hours. Example: https://k8s-gubernator.appspot.com/build/istio-prow/pr-logs/pull/istio_istio/13182/istio-integ-k8s-tests/6701

istio.io/istio/tests/integration/security/sds_citadel_flow and istio.io/istio/tests/integration/security/healthcheck are each killed after running for 30 minutes.

@howardjohn
Member Author

@incfly you wrote health check, any insight?
@lei-tang you wrote sds citadel flow, any ideas?

@incfly

incfly commented Apr 12, 2019

The mTLS health check test itself has been passing stably for a while, judging from testgrid: https://testgrid.k8s.io/istio-postsubmits#integration-k8s-tests&show-stale-tests=
I think this is related to the fact that we recently added more tests. MtlsHealthCheck and SdsVaultTest each require an Istio installation with different Helm flags, and deploying Istio is the time killer, hence the timeout.

Solution for the mTLS health check test: someone from the community recently helped add support for per-annotation app prober rewrite, so we can get rid of the special Istio installation. I can send a PR to update that.

Question: why can't I see it running on testgrid now? I don't see it being skipped in the code. Any ideas? Just curious.

@incfly

incfly commented Apr 12, 2019

We might not be able to do the same thing for the SDS-related tests, @lei-tang, unless those workflows support per-deployment test cases.

@howardjohn
Member Author

I assume they don't show up in testgrid because the tests get killed rather than fail? That should be fixed, though.

@lei-tang
Contributor

I tried running the failed tests on a GKE cluster a few times and they all passed. My thoughts:

  • Whom can we consult about the difference between running the tests on Prow and running them on a GKE cluster through "go test -v TEST-DIRECTORY"?
  • Are the Prow go tests running without "-p 1"? According to the integration test documentation, "-p 1" is needed to avoid interference between multiple concurrent tests.

@howardjohn
Member Author

You can look at the build log to see the arguments used:

+ T=-v
+ make test.integration.kube
mkdir -p /logs/artifacts/
set -o pipefail; \
/usr/local/go/bin/go test -p 1 -v istio.io/istio/tests/integration/citadel istio.io/istio/tests/integration/echo istio.io/istio/tests/integration/framework istio.io/istio/tests/integration/galley/conversion istio.io/istio/tests/integration/galley/validation istio.io/istio/tests/integration/mixer istio.io/istio/tests/integration/pilot istio.io/istio/tests/integration/pilot/locality istio.io/istio/tests/integration/security/authn istio.io/istio/tests/integration/security/healthcheck istio.io/istio/tests/integration/security/reachability istio.io/istio/tests/integration/security/sds_citadel_flow istio.io/istio/tests/integration/security/sds_vault_flow  --istio.test.ci -timeout 30m \
--istio.test.env kube \
--istio.test.kube.config ~/.kube/config \
--istio.test.hub=gcr.io/istio-testing \
--istio.test.tag=5a48a834e9f48861db56d2107f4e02f31c578a72 \
--istio.test.pullpolicy=IfNotPresent       \
 \
--istio.test.work_dir /logs/artifacts \
2>&1 | tee >(/opt/go/bin/go-junit-report > /logs/artifacts/junit_unit-tests.xml)

Looks like -p 1 is passed

@lei-tang
Contributor

lei-tang commented Apr 12, 2019

@incfly When running the failed tests on a GKE cluster, the Istio deployment takes only around 2 minutes, far less than 30 minutes. I am not sure about the Istio deployment time on Prow. For the integration tests that require a Helm template to configure the Istio deployment, I think a new Istio deployment is needed because Istio components (e.g. node agent, Citadel) need to be restarted with the new configuration. Testing in a new Istio deployment also ensures that a previous test does not contaminate the current test.

@lei-tang
Contributor

lei-tang commented Apr 12, 2019

@howardjohn Thanks. "-p 1" is used, so in theory these tests run one by one without interfering with each other. However, I have a few more speculations about the 30-minute timeout failures in the Prow test environment.

  • The cluster used in Prow tests may run out of resources to create a new test deployment. It may have problems releasing the resources of previous tests, e.g., a new test is being deployed while the resources of the previous test have not been released yet.
  • In Prow, the tests are run with "go test -p 1 test-directory-1 test-directory-2 ...", which may differ from running "go test test-directory-1" and then "go test test-directory-2". Even though "-p 1" should in theory give equivalent results, I am not sure that is the case in practice (e.g., the timing of consecutive tests and the release of resources).
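
One way to check the second speculation would be to run each package in its own go test invocation and compare that against the single combined run. Note that go test's -timeout flag applies to each test binary (that is, per package), which is consistent with the two packages being killed after 30 minutes each. The following is only a rough Go sketch that builds and prints the per-package command lines as a dry run. perPackageCommands is a hypothetical helper, not part of the Istio test framework; the package names and flags are taken from the build log earlier in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// perPackageCommands is a hypothetical helper for this sketch. It returns one
// "go test" argument list per package, so that each package runs in its own
// invocation instead of sharing a single "go test pkg1 pkg2 ..." command.
func perPackageCommands(pkgs []string, timeout string, extra ...string) [][]string {
	cmds := make([][]string, 0, len(pkgs))
	for _, pkg := range pkgs {
		// -timeout bounds each test binary, so the per-package limit is
		// the same as in the combined run.
		args := []string{"go", "test", "-p", "1", "-v", "-timeout", timeout, pkg}
		args = append(args, extra...)
		cmds = append(cmds, args)
	}
	return cmds
}

func main() {
	// Two of the packages from the build log; the list is illustrative.
	pkgs := []string{
		"istio.io/istio/tests/integration/security/healthcheck",
		"istio.io/istio/tests/integration/security/sds_citadel_flow",
	}
	// Dry run: print the commands rather than executing them.
	for _, cmd := range perPackageCommands(pkgs, "30m", "--istio.test.env", "kube") {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

Timing each printed command separately would show whether a package only hangs when it runs after the others in the same Prow job.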

@howardjohn
Member Author

Looking at logs, Istio is successfully deployed. I am pretty sure it gets stuck on this:

_, err := deployment.New(ctx, deployment.Config{
	Yaml: `apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "mtls-strict-for-healthcheck"
spec:
  targets:
  - name: "healthcheck"
  peers:
  - mtls:
      mode: STRICT
`,
	Namespace: ns,
})

I think we could maybe change these to use Galley? Not sure whether that would fix the problem, though.

These tests have been failing consistently for days; it seems we just never hit the timeout window until recently. I think we should disable these tests until they are fixed.
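
If a single call such as the one above can hang, one option is to bound it with its own, much shorter deadline, so the test fails with a clear error instead of the whole test binary being killed by -timeout 30m. The following is a minimal Go sketch of that idea under stated assumptions: runWithDeadline and errDeadline are hypothetical names, the helper does not use the Istio test framework, and the blocking function in main is only a stand-in for the stuck call.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDeadline is a hypothetical sentinel error used only in this sketch.
var errDeadline = errors.New("did not finish before deadline")

// runWithDeadline runs fn in a goroutine and waits at most d for it to return.
// If fn is still blocked after d, it returns an error wrapping errDeadline.
// fn itself is not cancelled: its goroutine leaks until fn returns.
func runWithDeadline(what string, d time.Duration, fn func() error) error {
	// Buffered so the goroutine can still send and exit after a timeout.
	done := make(chan error, 1)
	go func() { done <- fn() }()

	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return fmt.Errorf("%s: %w (waited %v)", what, errDeadline, d)
	}
}

func main() {
	// Stand-in for a call that never returns, e.g. waiting on a resource
	// that never becomes ready.
	err := runWithDeadline("apply policy", 100*time.Millisecond, func() error {
		select {} // block forever
	})
	fmt.Println(err)
}
```

Because the blocked call is abandoned rather than cancelled, this only makes the failure visible and attributable to one step; it does not fix whatever causes the hang.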

@howardjohn
Member Author

PR to disable: #13305

@howardjohn
Member Author

Now with #13305 I see the other 2 security tests also timing out: https://k8s-gubernator.appspot.com/build/istio-prow/pr-logs/pull/istio_istio/13305/istio-integ-k8s-tests/6735

Not sure what it is about the security tests that makes them different.

@lei-tang
Contributor

I am trying to debug why these tests fail in the Prow environment but succeed when run with "go test integration-test-directory-name" on a GKE cluster. However, the Prow test scripts (e.g., prow/istio-integ-k8s-tests.sh) fail to run in a desktop terminal (they fail at getting a Boskos resource).
I created issue #13307 to ask for instructions/documentation on running Prow scripts in order to debug Prow test failures.

@howardjohn
Member Author

Pretty sure security tests are not the issue, rather they happen to be the ones to fail because they are run last. See https://k8s-gubernator.appspot.com/build/istio-prow/pr-logs/pull/istio_istio/13348/istio-integ-k8s-tests/6845?log#log

I ran only the security tests, and added some more debug statements (shouldn't change anything), and the tests pass.

@howardjohn
Member Author

I am almost certain the issue is with the new locality LB tests recently added. Supporting evidence:

For now let's disable the locality test. It is broken anyway.

@liamawhite
