Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCI metrics intermittent failure(s) #6112

Closed
tomas-langer opened this issue Feb 7, 2023 · 5 comments
Closed

OCI metrics intermittent failure(s) #6112

tomas-langer opened this issue Feb 7, 2023 · 5 comments
Assignees
Labels
Projects

Comments

@tomas-langer
Copy link
Member

Error:  Failures: 
Error:    OciMetricsSupportTest.testEndpoint:196 
Expected: is "https://telemetry.dummyendpoint.com/"
     but: was "https://telemetry-ingestion.dummyendpoint.com/"
Error:    OciMetricsSupportTest.testMetricScope:320->validateMetricCount:423->validateMetricCount:439 
Expected: is <1>
     but: was <6>

I have seen the second issue at least 3 times

@github-actions github-actions bot added this to Triage in Backlog Feb 7, 2023
@klustria klustria added the 4.x Version 4.x label Feb 7, 2023
@klustria
Copy link
Member

klustria commented Feb 8, 2023

Been running from 100 to 200 continuous retries of the OciMetricsSupportTest like below:

 for i in {1..200}; do mvn clean test -Dtest=OciMetricsSupportTest; done 

and so far have not reproduced this issue. Will continue to attempt to reproduce this.

@m0mus m0mus added the P2 label Feb 9, 2023
@m0mus m0mus moved this from Triage to High priority in Backlog Feb 9, 2023
@klustria
Copy link
Member

The use of @RepeatedTest(4000) on testEndpoint helped reproduce the problem and I think I have now an understanding of the behavior:

  1. The failure in OciMetricsSupportTest.testEndpoint is a race condition wherein the restore of the endpoint in the OciMetricsSupport has not been executed yet before the test assert call is made to validate it. I think that the test should wait for a certain period of time (maybe a few seconds even though it probably just missed it by a few milliseconds) to validate that the endPoint has been restored.
  2. The failure in OciMetricsSupportTest.testMetricScope is a chain reaction effect of the above problem. Because it failed, the webServer was not stopped and so the ociMetricsSupport executor thread will continue to be running and at a certain interval, will be retriggered, and will call countDownLatch1.countDown() which will affect one of the validations in OciMetricsSupportTest.testMetricScope. Because countDownLatch1 is a global static variable, the one used by the failed test will be reused by the new test causing this failure. We can ignore this and fix the above issue and this will not be a problem anymore or make countDownLatch1 as test method local variable.

@klustria
Copy link
Member

klustria commented Feb 10, 2023

Problem can be reproduced consistently by using these steps:

  1. Replace @Test with @RepeatedTest(4000) in OciMetricsSupportTest.testEndpoint
  2. Run a mvn test executing OciMetricsSupportTest.testEndpoint and OciMetricsSupportTest.testMetricScope 5 times or until it fails using this one liner bash command:
    for i in {1..5}; do mvn clean test -Dtest=OciMetricsSupportTest#testEndpoint+testMetricScope || break && echo "Finished test #$i"; done;

klustria added a commit to klustria/helidon that referenced this issue Feb 10, 2023
RCA is described here: helidon-io#6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the amount of validation time to 3 seconds for checking that the metric endpoint has been restored. Intermittently, a race condition exist where the validation happens before the endpoint is restored.
2. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
3. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
4. Remove the use of fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
6. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
7. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
@klustria
Copy link
Member

PR #6151 has been created as potential fix for this issue.

klustria added a commit that referenced this issue Feb 11, 2023
* Fix intermittent issue on OciMetricsSupportTest

RCA is described here: #6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the amount of validation time to 10 seconds for checking that the metric endpoint has been restored. Intermittently, a race condition exist where the validation happens before the endpoint is restored.
2. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
3. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
4. Remove the use of fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
6. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
7. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
8. Fail the OciMetricsCdiExtensionTest if enabled OCI Metrics validation times out on countDownLatch.await()
@klustria
Copy link
Member

PR #6151 merged, hence closing.

Backlog automation moved this from High priority to Closed Feb 13, 2023
klustria added a commit to klustria/helidon that referenced this issue Feb 13, 2023
RCA is described here: helidon-io#6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the amount of validation time to 10 seconds for checking that the metric endpoint has been restored. Intermittently, a race condition exist where the validation happens before the endpoint is restored.
2. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
3. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
4. Remove the use of fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
6. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
7. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
8. Fail the OciMetricsCdiExtensionTest if enabled OCI Metrics validation times out on countDownLatch.await()
klustria added a commit to klustria/helidon that referenced this issue Feb 13, 2023
RCA is described here: helidon-io#6112 (comment), but in this version, OciMetricsSupportTest.testEndpoint is not applicable. Hence, only peripheral changes outside of the ociMetricsSupportTest.testEndpoint race condition issue will be included.

Changes include the following:
1. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
2. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
3. Remove the use of fixed port when starting a WebServer.
4. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
5. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
6. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
7. Fail the OciMetricsCdiExtensionTest if enabled OCI Metrics validation times out on countDownLatch.await()
klustria added a commit that referenced this issue Feb 14, 2023
RCA is described here: #6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the amount of validation time to 10 seconds for checking that the metric endpoint has been restored. Intermittently, a race condition exist where the validation happens before the endpoint is restored.
2. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
3. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
4. Remove the use of fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
6. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
7. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
8. Fail the OciMetricsCdiExtensionTest if enabled OCI Metrics validation times out on countDownLatch.await()
klustria added a commit that referenced this issue Feb 14, 2023
RCA is described here: #6112 (comment), but in this version, OciMetricsSupportTest.testEndpoint is not applicable. Hence, only peripheral changes outside of the ociMetricsSupportTest.testEndpoint race condition issue will be included.

Changes include the following:
1. Modify all countdownLatch to be locally defined in the test methods rather than being a static variable, which is causing chain reaction failure to other tests if a previous test fails because they share the same countdownLatch.
2. Always check that countDownLatch.await() is verified to have completed or otherwise, assert a failure.
3. Remove the use of fixed port when starting a WebServer.
4. Reset postingEndPoint to its original value before each test, so @RepeatedTest can be used in the future for debugging purposes.
5. Apply Helidon Code Style on both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This would include making the tests's class and methods package local  rather than public, rearranging variable fields order based on whether they are static, final, etc.
6. Note that OciMetricsCdiExtensionTest only involves Code Style change and removal of delay method which is never used, so logic in that test class will be the same as before. Only OciMetricsSupportTest contain significant change to resolve the issue reported.
7. Fail the OciMetricsCdiExtensionTest if enabled OCI Metrics validation times out on countDownLatch.await()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Backlog
  
Closed
Development

No branches or pull requests

3 participants