test flake: omicron-nexus::test_all integration_tests::metrics::test_instance_watcher_metrics #5752
Comments
another: https://github.com/oxidecomputer/omicron/pull/5765/checks?check_run_id=24966472198
It looks like these tests ask an instance to transition through various states, activate the instance-check background task, and then assert on the state of the metrics in ClickHouse. There's a race here between the collection task and the BG task publishing the metrics. While the code appears to wait until the BG task itself finishes, there's no guarantee that oximeter has scraped and inserted those metrics by the time the check is done. Usually the trick here is to force oximeter to collect any outstanding metrics. The test context has access to the oximeter collector instance, and the cc @hawkw
Ah, thanks, I think I missed that. I'll go and fix it shortly.
Presently, `test_instance_watcher_metrics` will wait for the `instance_watcher` background task to have run before making assertions about metrics, but it does *not* ensure that oximeter has actually collected those metrics. This can result in flaky failures --- see #5752. This commit adds explicit calls to `oximeter.force_collect()` prior to making assertions, to ensure that the latest metrics have been collected. Fixes #5752
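As a rough illustration of why the explicit collection step matters, here is a minimal, self-contained sketch of the race. `Producer`, `Collector`, and the timeseries name are hypothetical stand-ins, not omicron or oximeter types; only the name `force_collect` echoes the call mentioned above, and its signature here is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a metric producer: samples are buffered here
/// until a collector pulls them.
#[derive(Default)]
struct Producer {
    pending: Vec<(String, u64)>,
}

impl Producer {
    /// Buffer a sample; nothing is visible to queries yet.
    fn record(&mut self, timeseries: &str, count: u64) {
        self.pending.push((timeseries.to_string(), count));
    }
}

/// Hypothetical stand-in for the collector plus the database it writes to.
#[derive(Default)]
struct Collector {
    stored: HashMap<String, u64>,
}

impl Collector {
    /// Synchronously drain everything the producer has buffered so far.
    fn force_collect(&mut self, producer: &mut Producer) {
        for (name, count) in producer.pending.drain(..) {
            *self.stored.entry(name).or_insert(0) += count;
        }
    }

    /// Read back the stored count for a timeseries (0 if never collected).
    fn query(&self, timeseries: &str) -> u64 {
        self.stored.get(timeseries).copied().unwrap_or(0)
    }
}

fn main() {
    let mut producer = Producer::default();
    let mut collector = Collector::default();

    // The background task has finished and published one check...
    producer.record("instance_check", 1);

    // ...but until a collection happens, a query sees nothing.
    assert_eq!(collector.query("instance_check"), 0);

    // Forcing a collection before asserting closes that window.
    collector.force_collect(&mut producer);
    assert_eq!(collector.query("instance_check"), 1);
}
```

In this toy model, waiting for the background task alone corresponds to the first assertion: the sample exists but has not been collected, so a test that queries at that point may fail intermittently.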
well, hm. https://github.com/oxidecomputer/omicron/runs/24977844070
The test `integration_tests::metrics::test_instance_watcher_metrics` remains flaky even after adding an explicit call to `Oximeter::force_collect` to ensure that metrics have been collected. I believe this is because, if the test runs long enough, the `instance_watcher` background task may be activated by its timer, causing metrics to be collected another time, in addition to the test's explicit activations. This can cause flaky failures when we then assert that exactly a certain number of timeseries have been counted.

This branch changes the test to make assertions based on inequality instead. Now, we assert that the timeseries has *at least* the expected count, so if the `instance_watcher` task has collected instance metrics an additional time, we can tolerate that while still verifying that the expected counts are present. This is based on the approach suggested by @bnaecker in [this comment][1].

I've re-run the test five times on my machine, and it appears to always pass. Hopefully, this should actually fix #5752, but we probably shouldn't close the issue until this has made it through a few CI runs...

[1] #5768 (comment)
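A minimal sketch of the inequality-based check described above, assuming a hypothetical helper `check_at_least` and made-up activation counts (neither is taken from the omicron test itself):

```rust
/// Hypothetical helper: succeed when the observed count is at least the
/// expected minimum, so extra timer-driven activations are tolerated.
fn check_at_least(timeseries: &str, observed: u64, expected_min: u64) -> Result<(), String> {
    if observed >= expected_min {
        Ok(())
    } else {
        Err(format!(
            "timeseries {timeseries}: expected at least {expected_min}, found {observed}"
        ))
    }
}

fn main() {
    // Activations the test triggers explicitly.
    let explicit_activations: u64 = 2;
    // Activations fired by the task's own timer; may be zero or more.
    let timer_activations: u64 = 1;
    let observed = explicit_activations + timer_activations;

    // An exact-equality check would fail here; the lower bound still holds.
    assert_ne!(observed, explicit_activations);
    check_at_least("instance_check", observed, explicit_activations)
        .expect("lower bound should hold");
}
```

The trade-off is that a lower bound no longer catches over-counting, which seems acceptable here because the extra samples come from legitimate timer-driven activations.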
Example:
https://buildomat.eng.oxide.computer/wg/0/details/01HXT52WBZWPKP4HRFE5BZEJ90/TzkBQ5mcdy8cXJtRPqbP8sTUggD48AxTs1uwGGPeMg15q7e7/01HXT53AB77VSYQFJ33G6K5FSE
I feel like I saw something similar go by at some point.