Tracing test is flakey #7538

Open

olix0r opened this issue Dec 28, 2021 · 6 comments
Comments

@olix0r
Member

olix0r commented Dec 28, 2021

The tracing test frequently fails with spurious errors like:

--- FAIL: TestTracing (179.94s)
    --- FAIL: TestTracing/expect_full_trace (120.03s)
        tracing_test.go:224: No trace found with processes: [nginx web emoji voting linkerd-proxy]
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/tracing	180.065s
FAIL

That is, after the other tests have run, we wait 3 minutes for this test to fail, which invalidates the whole CI run.

I'm going to disable this test until we can get at the root of this flakiness.

See also #7403

olix0r added a commit that referenced this issue Dec 28, 2021
As #7538 describes, the tracing test is especially flakey.

This change skips running this test unless the `RUN_FLAKEY_TEST`
environment variable is set.
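For reference, a minimal sketch of how such a guard might look in a Go integration test; the `skipUnlessFlakeyEnabled` helper name is hypothetical, and the actual change may wire the check up differently:

```go
package tracing

import (
	"os"
	"testing"
)

// skipUnlessFlakeyEnabled is a hypothetical helper: it skips the calling
// test unless the RUN_FLAKEY_TEST environment variable is set.
func skipUnlessFlakeyEnabled(t *testing.T) {
	t.Helper()
	if os.Getenv("RUN_FLAKEY_TEST") == "" {
		t.Skip("skipping flakey test; set RUN_FLAKEY_TEST to run it")
	}
}

func TestTracing(t *testing.T) {
	skipUnlessFlakeyEnabled(t)
	// ... existing tracing assertions run only when the guard passes ...
}
```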
@olix0r olix0r added this to the stable-2.12.0 milestone Dec 28, 2021
olix0r added a commit that referenced this issue Jan 3, 2022
As #7538 describes, the tracing test is especially flakey.

This change skips running this test unless the `RUN_FLAKEY_TEST`
environment variable is set.
@Pothulapati
Contributor

Pothulapati commented Jan 6, 2022

So, I dug into this test some more, and I have the following findings:

  • It is currently one of the slowest test suites, taking more than 3m. Deploying and waiting for linkerd-emojivoto seems to be the main reason. We also seem to deploy emojivoto in the policy integration test; it would be better to deploy it once during install_test.go and reuse it across suites. (Tests that need additional inject configuration would have to update it and perform a rollout.)
  • As for the flakiness, the reason seems to be that we always expect the full trace across [nginx web emoji voting linkerd-proxy] to be present. We could do the following to improve this:
    • Increase the timeout. We currently wait 120s, but multiple components are involved, so raising this might help.
    • Test only for the linkerd-proxy span. I don't think we should be waiting for the full trace, since some of the application spans may be missing for various reasons. (A rough sketch of this check follows below.)
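A rough, hedged sketch of what checking for just the linkerd-proxy span against a longer deadline could look like, assuming the test can reach the standard Jaeger query API over HTTP; the `hasProxySpan` helper, the URL shape, and the poll interval are illustrative, not the suite's actual code:

```go
package tracing

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// hasProxySpan polls the Jaeger query API until some trace for the given
// service contains a process named "linkerd-proxy", or the deadline passes.
func hasProxySpan(jaegerURL, service string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(fmt.Sprintf("%s/api/traces?service=%s&limit=20",
			jaegerURL, url.QueryEscape(service)))
		if err == nil {
			var body struct {
				Data []struct {
					Processes map[string]struct {
						ServiceName string `json:"serviceName"`
					} `json:"processes"`
				} `json:"data"`
			}
			decodeErr := json.NewDecoder(resp.Body).Decode(&body)
			resp.Body.Close()
			if decodeErr == nil {
				for _, trace := range body.Data {
					for _, proc := range trace.Processes {
						if proc.ServiceName == "linkerd-proxy" {
							return nil
						}
					}
				}
			}
		}
		// No matching trace yet; retry until the (longer) deadline expires.
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("no trace with a linkerd-proxy span within %s", timeout)
}
```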

@mateiidavid WDYT?

@mateiidavid
Member

mateiidavid commented Jan 11, 2022

@Pothulapati do we know how long it usually takes for emojivoto to deploy? I wonder why it would take so long for it to start up; we make use of it in other tests and I don't think they take as long as tracing 🤔 We also seem to install linkerd-jaeger. As part of #7403, I want to create a viz suite. I wonder if it would be better to install jaeger when we install the control plane and viz for the test suite; that might shave some seconds off.

@Pothulapati I looked through the tracing test and there are a couple of things I think we can do:

  1. We seem to read all the resources in the test from a file and then inject them. We could perhaps optimize here a bit and inject them before replacing the configuration parameters.
  2. linkerd-jaeger is deployed as part of the test. We should maybe pull that out; we could install it in install_test.go. As part of ci: re-organize and isolate integration test suites #7403 we'll create a viz suite, so I think it would make sense to install jaeger at the same time as viz.
  3. If you're comfortable with only testing for the linkerd-proxy span, then we can probably do that. It seems sufficient to me; you may know more on the topic, though.

We could move emojivoto to install_test, but one advantage of having it structured this way is that all resources are properly cleaned up after the test runs. We'd have to clean up the resources on our own, and I think we might want to avoid an explicit clean-up test in install.go. I'm surprised it takes so long to install it, though; perhaps we can offset that by making use of a cache for external dependencies? It was suggested by @alpeb but I never got around to trying it out properly. If we can cache the images for our dependencies (e.g. emojivoto, rabbitmq, etc.), then maybe we don't have to deploy them ahead of time.

One solution from my side would be to get rid of emojivoto and just rely on slow-cooker and nginx (the latter is already used in the test).

Edit: raised #7587 to investigate the possibility of caching the images for emojivoto, maybe that might help?

@Pothulapati
Contributor

> I wonder why it would take so long for it to start up; we make use of it in other tests and I don't think they take as long as tracing

I had a similar question, and I think the reason is that the tracing test seems to be the first to spin them up, so the image pulls end up taking some time. Pulling the images AOT (ahead of time) might definitely help, I think.

> We could move emojivoto to install_test, but one advantage of having it structured this way is that all resources are properly cleaned up after the test runs. We'd have to clean up the resources on our own, and I think we might want to avoid an explicit clean-up test in install.go

Yeah, as each test seems to have its own inject configuration, that doesn't seem like a great way to fix this. We should probably start with AOT image loading, and then see if it's still slow?

@alpeb
Member

alpeb commented Jan 12, 2022

I did some testing and realized the bottleneck is the loading of the nginx-ingress image. Unpacked, it weighs 555MB. Upgrading to 0.33.0 brings that down to 327MB, but that'd require updating other things in the test. OTOH, preloading the image didn't make much of a difference for me in terms of total time, but maybe it can help avoid the flakiness. Replacing that ingress with another one might be another avenue to improve times...

@Pothulapati
Contributor

@alpeb That's a good idea! I will start working on this, i.e. performing the same test without using nginx-ingress.

Pothulapati added a commit that referenced this issue Jan 13, 2022
Part of #7538

This PR removes the `nginx-ingress` from the tracing integration
test, and also replaces the emojivoto manifest with the official
manifest from the getting started guide.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Pothulapati added a commit that referenced this issue Jan 17, 2022
Part of #7538

This PR removes the `nginx-ingress` from the tracing integration
test, and also replaces the emojivoto manifest with the official
manifest from the getting started guide.

This should increase the speed of this test, as initializing `nginx-ingress`
[seems to take a while](#7538 (comment)).

```bash
# Before
Test script: [tracing] Params: []
ok      github.com/linkerd/linkerd2/test/integration/tracing    222.845s
# After
Test script: [tracing] Params: []
ok      github.com/linkerd/linkerd2/test/integration/tracing    170.030s
```

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
@olix0r
Member Author

olix0r commented Jan 18, 2022

Notably, even the merge of #7601 (f2ad5a9 log) failed on this test:

--- FAIL: TestPolicy (494.98s)
    --- FAIL: TestPolicy/linkerd_viz_stat_srv_-n_linkerd-stat-authz-test (180.35s)
        policy_test.go:146: Unexpected success rate for [emoji-grpc], got [-]
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/policy	495.078s
FAIL

I've seen other branches fail on main as well, like #7623 (0704f1b log):

--- FAIL: TestTracing (207.01s)
    --- FAIL: TestTracing/expect_full_trace (180.04s)
        tracing_test.go:169: No trace found with processes: linkerd-proxy
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/tracing	207.116s
FAIL

adleong added a commit that referenced this issue Jan 28, 2022
Skip tracing test for now until we can make it less flakey. See #7538

Signed-off-by: Alex Leong <alex@buoyant.io>
olix0r added a commit that referenced this issue Feb 16, 2022
Per #7538, the tracing test is especially flakey.

This change skips the test when certain failures are encountered

Signed-off-by: Oliver Gould <ver@buoyant.io>
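As a hedged illustration only (reusing the hypothetical `hasProxySpan` helper and the same assumed file as the earlier sketches, with Jaeger's query service port-forwarded to its default port 16686), here is one way a test can convert a known-flakey "no trace found" failure into a skip rather than a hard failure; the test name and gating are not the change's actual code:

```go
func TestTracingProxySpan(t *testing.T) {
	// jaegerURL is assumed to point at a port-forwarded Jaeger query service.
	const jaegerURL = "http://localhost:16686"

	if err := hasProxySpan(jaegerURL, "linkerd-proxy", 3*time.Minute); err != nil {
		// Downgrade the known-flakey "no trace found" outcome to a skip so
		// a spurious failure does not invalidate the whole CI run.
		t.Skipf("skipping known-flakey tracing check: %s", err)
	}
}
```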
mateiidavid pushed a commit that referenced this issue Feb 17, 2022
Per #7538, the tracing test is especially flakey.

This change skips the test when certain failures are encountered

Signed-off-by: Oliver Gould <ver@buoyant.io>
olix0r pushed a commit that referenced this issue Apr 14, 2022
Part of #7538

This PR removes the `nginx-ingress` from the tracing integration
test, and also replaces the emojivoto manifest with the official
manifest from the getting started guide.

This should increase the speed of this test, as initializing `nginx-ingress`
[seems to take a while](#7538 (comment)).

```bash
# Before
Test script: [tracing] Params: []
ok      github.com/linkerd/linkerd2/test/integration/tracing    222.845s
# After
Test script: [tracing] Params: []
ok      github.com/linkerd/linkerd2/test/integration/tracing    170.030s
```

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
@adleong adleong closed this as completed Jul 7, 2022
@adleong adleong reopened this Jul 7, 2022
@adleong adleong removed this from the stable-2.12.0 milestone Jul 7, 2022