
test flake: omicron-nexus::test_all integration_tests::metrics::test_instance_watcher_metrics #5752

Closed
sunshowers opened this issue May 14, 2024 · 6 comments · Fixed by #5768 or #5784
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@sunshowers
Contributor

Example:

https://buildomat.eng.oxide.computer/wg/0/details/01HXT52WBZWPKP4HRFE5BZEJ90/TzkBQ5mcdy8cXJtRPqbP8sTUggD48AxTs1uwGGPeMg15q7e7/01HXT53AB77VSYQFJ33G6K5FSE

I feel like I saw something similar go by at some point.

@sunshowers sunshowers added the Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken. label May 14, 2024
@sunshowers
Contributor Author

Ahh looks like it was #5645.

cc @bnaecker maybe you have thoughts?

@iliana
Contributor

iliana commented May 14, 2024

another: https://github.com/oxidecomputer/omicron/pull/5765/checks?check_run_id=24966472198

--- creating instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:464:14] count_state(&checks, instance1_uuid, STATE_STARTING) = 1
--- creating instance 2 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:480:15] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:481:15] count_state(&checks, instance2_uuid, STATE_STARTING) = 1
--- starting instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:498:9] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:499:23] count_state(&checks, instance1_uuid, STATE_RUNNING) = 1
[nexus/tests/integration_tests/metrics.rs:500:15] count_state(&checks, instance2_uuid, STATE_STARTING) = 2
--- starting instance 2 ---
--- start stopping instance 1 ---
 --- activating instance watcher ---
[nexus/tests/integration_tests/metrics.rs:524:9] count_state(&checks, instance1_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:525:23] count_state(&checks, instance1_uuid, STATE_RUNNING) = 1
[nexus/tests/integration_tests/metrics.rs:527:9] count_state(&checks, instance1_uuid, STATE_STOPPING) = 2
[nexus/tests/integration_tests/metrics.rs:529:9] count_state(&checks, instance2_uuid, STATE_STARTING) = 2
[nexus/tests/integration_tests/metrics.rs:530:23] count_state(&checks, instance2_uuid, STATE_RUNNING) = 2
thread 'integration_tests::metrics::test_instance_watcher_metrics' panicked at nexus/tests/integration_tests/metrics.rs:533:5:
assertion `left == right` failed
  left: 2
 right: 1

@bnaecker
Collaborator

It looks like these tests are asking an instance to transition through various states; activating the instance-check background task; and then asserting the state of the metrics in ClickHouse. There's a race here between the collection task and the BG task publishing the metrics. While the code appears to wait until the BG task itself finishes, there's no guarantee that oximeter scrapes and inserts those metrics by the time the check is done.

Usually the trick here is to force oximeter to collect any outstanding metrics. The test context has access to the oximeter collector instance, and the Oximeter::force_collect() method will ensure that it polls all its producers. It will not return until those collections complete, either because they failed or because the data is in ClickHouse. There are several examples of that in the same test file where these failing tests are.

cc @hawkw

@hawkw
Member

hawkw commented May 14, 2024

> Usually the trick here is to force oximeter to collect any outstanding metrics. The test context has access to the oximeter collector instance, and the Oximeter::force_collect() method will ensure that it polls all its producers. It will not return until those collections complete, either because they failed or because the data is in ClickHouse. There are several examples of that in the same test file where these failing tests are.
>
> cc @hawkw

Ah, thanks, I think I missed that. I'll go and fix it shortly.

hawkw added a commit that referenced this issue May 14, 2024
Presently, `test_instance_watcher_metrics` will wait for the
`instance_watcher` background task to have run before making assertions
about metrics, but it does *not* ensure that oximeter has actually
collected those metrics. This can result in flaky failures --- see
#5752.

This commit adds explicit calls to `oximeter.force_collect()` prior to
making assertions, to ensure that the latest metrics have been
collected.

Fixes #5752
@hawkw
Member

hawkw commented May 14, 2024

#5768 should fix this but it's waiting for #5767 to debreak the build first :/

hawkw added a commit that referenced this issue May 14, 2024
hawkw added a commit that referenced this issue May 15, 2024
@iliana
Contributor

iliana commented May 15, 2024

well, hm. https://github.com/oxidecomputer/omicron/runs/24977844070

https://buildomat.eng.oxide.computer/wg/0/details/01HXWWGSG3442MM3JRMCY29Q72/ylQBEwmodRd8cuE6MefDsEoP7HdpgJcoPUrQjKYzxhrgvXIt/01HXWWH2AW89XR7MRJMGVFXCY6#S4469

thread 'integration_tests::metrics::test_instance_watcher_metrics' panicked at nexus/tests/integration_tests/metrics.rs:563:5:
assertion `left == right` failed
  left: 3
 right: 2

@iliana iliana reopened this May 15, 2024
hawkw added a commit that referenced this issue May 16, 2024
The test `integration_tests::metrics::test_instance_watcher_metrics`
remains flaky even after adding an explicit call to
`Oximeter::force_collect` to ensure that metrics have been collected. I
believe this is due to the fact that, if the test runs long enough, the
`instance_watcher` background task may be activated by its timer, causing
metrics to be collected another time, in addition to the test's explicit
activations. This can cause flaky failures when we then assert that
there is exactly a certain number of timeseries counted.

This branch changes the test to make assertions based on inequality,
instead. Now, we assert that the timeseries has *at least* the expected
count, so if the `instance_watcher` task has collected instance metrics
an additional time, we can tolerate that. We're still able to assert
that at least the expected counts are present. This is based on the
approach suggested by @bnaecker in [this comment][1].

I've re-run the test five times on my machine, and it appears to always
pass. Hopefully, this should actually fix #5752, but we probably
shouldn't close the issue until this has made it through a few CI
runs...

[1] #5768 (comment)