
test flake: omicron-nexus::test_all integration_tests::metrics::test_instance_watcher_metrics #5752

Closed
sunshowers opened this issue May 14, 2024 · 6 comments · Fixed by #5768 or #5784
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@sunshowers
Contributor

Example:

https://buildomat.eng.oxide.computer/wg/0/details/01HXT52WBZWPKP4HRFE5BZEJ90/TzkBQ5mcdy8cXJtRPqbP8sTUggD48AxTs1uwGGPeMg15q7e7/01HXT53AB77VSYQFJ33G6K5FSE

I feel like I saw something similar go by at some point.

@sunshowers sunshowers added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label May 14, 2024
@sunshowers
Contributor Author

Ahh looks like it was #5645.

cc @bnaecker maybe you have thoughts?

@iliana
Contributor

iliana commented May 14, 2024

another: https://github.com/oxidecomputer/omicron/pull/5765/checks?check_run_id=24966472198

--- creating instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:464:14] count_state(&checks, instance1_uuid, STATE_STARTING) = 1
--- creating instance 2 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:480:15] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:481:15] count_state(&checks, instance2_uuid, STATE_STARTING) = 1
--- starting instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:498:9] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:499:23] count_state(&checks, instance1_uuid, STATE_RUNNING) = 1
[nexus/tests/integration_tests/metrics.rs:500:15] count_state(&checks, instance2_uuid, STATE_STARTING) = 2
--- starting instance 2 ---
--- start stopping instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:524:9] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:525:23] count_state(&checks, instance1_uuid, STATE_RUNNING) = 1
[nexus/tests/integration_tests/metrics.rs:527:9] count_state(&checks, instance1_uuid, STATE_STOPPING) = 2
[nexus/tests/integration_tests/metrics.rs:529:9] count_state(&checks, instance2_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:530:23] count_state(&checks, instance2_uuid, STATE_RUNNING) = 2
thread 'integration_tests::metrics::test_instance_watcher_metrics' panicked at nexus/tests/integration_tests/metrics.rs:533:5:
assertion `left == right` failed
  left: 2
 right: 1

@bnaecker
Collaborator

It looks like these tests are asking an instance to transition through various states; activating the instance-check background task; and then asserting the state of the metrics in ClickHouse. There's a race here between the collection task and the BG task publishing the metrics. While the code appears to wait until the BG task itself finishes, there's no guarantee that oximeter scrapes and inserts those metrics by the time the check is done.

Usually the trick here is to force oximeter to collect any outstanding metrics. The test context has access to the oximeter collector instance, and the Oximeter::force_collect() method will ensure that it polls all its producers. It will not return until those collections complete, either because they failed or because the data is in ClickHouse. There are several examples of that in the same test file where these failing tests are.

cc @hawkw

@hawkw
Member

hawkw commented May 14, 2024

> Usually the trick here is to force oximeter to collect any outstanding metrics. The test context has access to the oximeter collector instance, and the Oximeter::force_collect() method will ensure that it polls all its producers. It will not return until those collections complete, either because they failed or because the data is in ClickHouse. There are several examples of that in the same test file where these failing tests are.
>
> cc @hawkw

Ah, thanks, I think I missed that. I'll go and fix it shortly.

hawkw added a commit that referenced this issue May 14, 2024
Presently, `test_instance_watcher_metrics` will wait for the
`instance_watcher` background task to have run before making assertions
about metrics, but it does *not* ensure that oximeter has actually
collected those metrics. This can result in flaky failures --- see
#5752.

This commit adds explicit calls to `oximeter.force_collect()` prior to
making assertions, to ensure that the latest metrics have been
collected.

Fixes #5752
@hawkw
Member

hawkw commented May 14, 2024

#5768 should fix this but it's waiting for #5767 to debreak the build first :/

hawkw added a commit that referenced this issue May 14, 2024
hawkw added a commit that referenced this issue May 15, 2024
@iliana
Contributor

iliana commented May 15, 2024

well, hm. https://github.com/oxidecomputer/omicron/runs/24977844070

https://buildomat.eng.oxide.computer/wg/0/details/01HXWWGSG3442MM3JRMCY29Q72/ylQBEwmodRd8cuE6MefDsEoP7HdpgJcoPUrQjKYzxhrgvXIt/01HXWWH2AW89XR7MRJMGVFXCY6#S4469

thread 'integration_tests::metrics::test_instance_watcher_metrics' panicked at nexus/tests/integration_tests/metrics.rs:563:5:
assertion `left == right` failed
  left: 3
 right: 2

@iliana iliana reopened this May 15, 2024
hawkw added a commit that referenced this issue May 16, 2024
The test `integration_tests::metrics::test_instance_watcher_metrics`
remains flaky even after adding an explicit call to
`Oximeter::force_collect` to ensure that metrics have been collected. I
believe this is due to the fact that, if the test runs long enough, the
`instance_watcher` background task may be activated by its timer, causing
metrics to be collected another time, in addition to the test's explicit
activations. This can cause flaky failures when we then assert that
there is exactly a certain number of timeseries counted.

This branch changes the test to make assertions based on inequality,
instead. Now, we assert that the timeseries has *at least* the expected
count, so if the `instance_watcher` task has collected instance metrics
an additional time, we can tolerate that. We're still able to assert
that at least the expected counts are present. This is based on the
approach suggested by @bnaecker in [this comment][1].

I've re-run the test five times on my machine, and it appears to always
pass. Hopefully, this should actually fix #5752, but we probably
shouldn't close the issue until this has made it through a few CI
runs...

[1] #5768 (comment)