
Flaky Test - load-test (TestLog10kDPS) #9094

Closed
TylerHelmuth opened this issue Apr 5, 2022 · 9 comments · Fixed by #9105
Labels
bug Something isn't working

Comments

@TylerHelmuth
Member

Describe the bug
The TestLog10kDPS test within the load-test workflow is failing intermittently. Here are some examples:

Steps to reproduce
Run the load-test workflow

What did you expect to see?
The TestLog10kDPS tests passing unless code had been modified that would degrade performance.

What did you see instead?
The TestLog10kDPS tests failing intermittently.

Additional context
There are other tests in the load-test workflow that are failing, but I noticed TestLog10kDPS failing the most.

@TylerHelmuth TylerHelmuth added the bug Something isn't working label Apr 5, 2022
@jpkrohling
Member

@open-telemetry/collector-contrib-approvers, @open-telemetry/collector-approvers, the load tests have always been a source of test failures for us. All we ever do is increase the limits once we start hitting them more frequently. I don't remember ever seeing a true positive out of those tests. Are those tests providing value to us? Or are they just sources of noise?

@mx-psi
Member

mx-psi commented Apr 6, 2022

I agree that the current tests don't add a lot of value and are a frequent source of flakes.

For something unrelated, I recently had a look at the Rust compiler's process for this, which may be interesting for this discussion. The process is as follows:

  1. PRs are benchmarked out-of-band, and a bot posts a comment when there are regressions and tags the PR with ‘perf-regression’ (example)
  2. The benchmark compares against the previous commit and reports the relative change (example). There are defined criteria, based on basic statistics, for when a change is considered relevant (a rough sketch of that idea follows this comment)
  3. Someone goes through all PRs tagged as regressions to investigate whether they are justified (presumably before a release?)

Reproducing this in our setup would require additional infrastructure to run benchmarks and store historical data, so I don't think this is something we can do without significant effort, but I still think it's interesting to see what other projects do for inspiration.
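
To make the relative-change criterion in step 2 concrete, here is a minimal Go sketch of the idea: compare a PR's measurement against the previous commit's and flag the benchmark when the relative delta exceeds a threshold. This is an illustration only, not how rustc-perf or the collector testbed actually implement it; the `benchResult` type, the 10% threshold, and the sample numbers are hypothetical, and a real setup would also account for run-to-run noise.

```go
package main

import (
	"fmt"
	"math"
)

// benchResult holds one measured metric (e.g. CPU % or RSS in MiB) for a
// single benchmark scenario. The type and field names are hypothetical;
// they are not taken from rustc-perf or the collector testbed.
type benchResult struct {
	Name     string
	Baseline float64 // value measured on the previous commit
	Current  float64 // value measured on the PR's commit
}

// relevantChange reports the relative delta and whether it exceeds the
// threshold. A production setup would also compare the delta against the
// benchmark's historical variance to filter out noise; that is omitted here.
func relevantChange(r benchResult, threshold float64) (delta float64, relevant bool) {
	if r.Baseline == 0 {
		return 0, false
	}
	delta = (r.Current - r.Baseline) / r.Baseline
	return delta, math.Abs(delta) >= threshold
}

func main() {
	// Hypothetical measurements, not real CI numbers.
	results := []benchResult{
		{Name: "Log10kDPS/OTLP cpu%", Baseline: 20.0, Current: 23.1},
		{Name: "Log10kDPS/OTLP ram_mib", Baseline: 70.0, Current: 71.0},
	}
	for _, r := range results {
		if delta, bad := relevantChange(r, 0.10); bad {
			fmt.Printf("possible regression in %s: %+.1f%%\n", r.Name, delta*100)
		}
	}
}
```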

@tigrannajaryan
Member

I don't remember ever seeing a true positive out of those tests. Are those tests providing value to us? Or are they just sources of noise?

I remember they once caught a performance regression when the Protobuf dependency was updated to a new version and caused a significant slowdown.

CNCF provided us with dedicated EC2 instances where we could run the perf tests in a more stable environment, but we never had time to work on this.

@djaglowski
Member

Responding narrowly to the immediate issue - this specific test expectation was "fixed" in #9023, with the standard approach of raising the limit.

@TylerHelmuth
Member Author

TylerHelmuth commented Apr 6, 2022

@djaglowski it looks like that PR "fixed" the test named "filelog" within the TestLog10kDPS suite, but the failures listed in this issue are for the test named "OTLP".

Should I submit a PR to raise the OTLP test limit and continue the discussion about a proper long-term solution in a separate issue?

@djaglowski
Member

@TylerHelmuth, you're right, I was too broad in my assertion there.

I think what you've suggested makes sense. A larger discussion on these tests deserves its own issue, but we can loosen these test constraints immediately.

@TylerHelmuth
Member Author

OK, I will do that. Should I go ahead and raise all the values (memory and CPU) for all the tests in this suite? It would probably go a long way toward making PRs show green checks instead of red Xs.

@djaglowski
Member

The rule of thumb followed so far is to set the limits at roughly 15% above observed values. We should probably stick to that until there's consensus on the larger question.
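
As a concrete illustration of that rule of thumb (my own sketch; the helper name, the rounding step, and the sample numbers are not from the thread or the testbed code):

```go
package main

import (
	"fmt"
	"math"
)

// limitWithHeadroom applies the ~15% rule of thumb: take the peak value
// observed in recent CI runs and add roughly 15% headroom, rounding up.
// The rounding and the integer result are choices made for readability.
func limitWithHeadroom(observedMax float64) uint32 {
	return uint32(math.Ceil(observedMax * 1.15))
}

func main() {
	// Hypothetical observed peaks for one scenario, not real CI numbers.
	fmt.Println(limitWithHeadroom(20)) // CPU %: 20 -> 23
	fmt.Println(limitWithHeadroom(70)) // RAM MiB: 70 -> 81
}
```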

@djaglowski
Member

I've opened #9107 to capture the higher level discussion here. This issue will be closed by #9105.
