Incorrect CI PR benchmark comparison #240

Closed
ptoffy opened this issue Apr 1, 2024 · 3 comments


ptoffy commented Apr 1, 2024

Hi,

While trying to set up benchmarking in CI for Vapor's JWTKit, I'm running into some issues I can't get to the bottom of. The main one is an incorrectly measured comparison between branches, of which you can see an example here (it's a fork of the repo, to keep the main repo "clean" until I find a solution). That PR makes absolutely no changes to the benchmarked code (it's just a test README update), yet as you can see the measurements are completely wrong. The same happens on the main repo too: https://github.com/vapor/jwt-kit/actions/runs/8492609046/job/23265790197?pr=152

The result is always either a benchmarkThresholdRegression or a benchmarkThresholdImprovement error, which is why I tried to increase the thresholds for the benchmarks over here; however, this doesn't seem to work, as the threshold in the results (you can check the first linked PR's results) is always 5. I also tried setting the thresholds on the individual benchmarks directly, but no luck. It's not a caching issue or similar, since enabling and disabling various metrics works as expected (as you can see, the only thing enabled right now is throughput); something else must be wrong when the metrics are updated. Finally, here is the workflow we're using to benchmark PRs; we tried to follow the one in this repo.
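For reference, this is roughly the shape of the threshold configuration I'm trying (just a sketch assuming package-benchmark's per-metric BenchmarkThresholds API; the percentile keys, values and benchmark body are placeholders):

import Benchmark

let benchmarks = {
    // Attempt: widen the allowed relative deviation for the throughput metric
    // (percentile keys and values here are illustrative, not the real settings).
    Benchmark.defaultConfiguration.thresholds = [
        .throughput: .init(relative: [.p50: 25.0, .p90: 25.0])
    ]

    Benchmark("Signing") { benchmark in
        for _ in benchmark.scaledIterations {
            // JWT signing work under test goes here
        }
    }
}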

I've also considered that it might be solved with a self-hosted GitHub runner. What do you think?

Thanks!


hassila commented Apr 2, 2024

Thanks for the report. Unfortunately, I'm away for a couple of weeks and won't have time to check on this until then; sorry for the delay.


vsarunas commented Apr 2, 2024

Hi @ptoffy,

Yes, I believe it is mostly due to running on GitHub's CI infrastructure and "noisy neighbours".

You can verify this locally by creating three baseline runs on the machine where development was done:

swift package --disable-sandbox benchmark baseline update run1 --target Signing
swift package --disable-sandbox benchmark baseline update run2 --target Signing
swift package --disable-sandbox benchmark baseline update run3 --target Signing

And comparing them:

swift package benchmark baseline compare run1 run2
swift package benchmark baseline compare run1 run3
swift package benchmark baseline compare run2 run3

For me, the metrics on macOS seem to vary less than on Linux, but in both cases the variations are much smaller than those in the linked workflow result.

You may find that increasing the number of samples by increasing the test duration will also create more stability:

Benchmark.defaultConfiguration.maxDuration = .seconds(10)
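That setting goes in the benchmark target's entry point, roughly like this (a sketch; the benchmark name and body are placeholders):

import Benchmark

let benchmarks = {
    // Run each benchmark for up to 10 seconds so the percentiles are
    // computed from more samples (10 is just an example value).
    Benchmark.defaultConfiguration.maxDuration = .seconds(10)

    Benchmark("Signing") { benchmark in
        for _ in benchmark.scaledIterations {
            // signing work under test goes here
        }
    }
}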


hassila commented Apr 20, 2024

Hi @ptoffy , yeah @vsarunas is correct, running time/throughput measurements on GitHub infrastructure isn't a good proposition (which is why we recommend a standalone runner for such metrics if you want them in CI) - the environment is simply not controlled enough. We are considering adding e.g instruction count also (in #242) which could be a better proposition (if we can get those stats on the runner, remains to be seen..) as it then is a proxy of the runtime in practice. Other metrics such as malloc count and sys calls works fine on GitHub ci though. Closing this for now, feel free to open up follow up issues as needed.

hassila closed this as completed Apr 20, 2024