OCI metrics intermittent failure(s) #6112
Comments
I have been running 100 to 200 consecutive retries of OciMetricsSupportTest, as shown below:
for i in {1..200}; do mvn clean test -Dtest=OciMetricsSupportTest; done
So far I have not reproduced this issue. I will continue trying to reproduce it.
The use of
Problem can be reproduced consistently by using these steps:
RCA is described here: helidon-io#6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the validation time to 3 seconds when checking that the metric endpoint has been restored. Intermittently, a race condition exists where the validation happens before the endpoint is restored.
2. Define every countDownLatch locally in the test methods rather than as a static variable. Sharing one static countDownLatch causes a chain reaction of failures in other tests when an earlier test fails.
3. Always verify that countDownLatch.await() has completed; otherwise, assert a failure.
4. Remove the use of a fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so that @RepeatedTest can be used in the future for debugging purposes.
6. Apply the Helidon Code Style to both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This includes making the test classes and methods package-local rather than public, and reordering fields based on whether they are static, final, etc.
7. OciMetricsCdiExtensionTest only receives Code Style changes and the removal of the delay method, which is never used, so the logic in that test class is the same as before. Only OciMetricsSupportTest contains significant changes to resolve the reported issue.
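For readers who want to see what items 2 and 3 amount to, here is a minimal sketch using only the JDK. It is not the actual test code from the PR: the class name `LocalLatchSketch` and the helper `postMetricsAsync` are hypothetical stand-ins for the test and for the code that posts metrics. The point is that the latch is created inside the method and that the boolean returned by `await` is checked rather than ignored.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LocalLatchSketch {

    // Hypothetical stand-in for the code under test that posts metrics asynchronously.
    static void postMetricsAsync(ExecutorService executor, CountDownLatch latch) {
        executor.execute(latch::countDown);
    }

    // Returns true only if the latch reached zero within the timeout.
    static boolean awaitPost(long timeoutMillis) {
        // The latch is local to this call, so a failure here cannot leak state into another test.
        CountDownLatch latch = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            postMetricsAsync(executor, latch);
            // await(timeout, unit) returns false on timeout; ignoring that result hides the failure.
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        if (!awaitPost(5_000)) {
            throw new AssertionError("latch did not count down within the timeout");
        }
        System.out.println("latch completed");
    }
}
```

In a JUnit test the final check would typically be an assertion on the value returned by `await`, so that a timeout fails the test that caused it instead of a later one.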
PR #6151 has been created as a potential fix for this issue.
* Fix intermittent issue on OciMetricsSupportTest

RCA is described here: #6112 (comment)

Changes include the following:
1. In OciMetricsSupportTest.testEndpoint, extend the validation time to 10 seconds when checking that the metric endpoint has been restored. Intermittently, a race condition exists where the validation happens before the endpoint is restored.
2. Define every countDownLatch locally in the test methods rather than as a static variable. Sharing one static countDownLatch causes a chain reaction of failures in other tests when an earlier test fails.
3. Always verify that countDownLatch.await() has completed; otherwise, assert a failure.
4. Remove the use of a fixed port when starting a WebServer.
5. Reset postingEndPoint to its original value before each test, so that @RepeatedTest can be used in the future for debugging purposes.
6. Apply the Helidon Code Style to both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This includes making the test classes and methods package-local rather than public, and reordering fields based on whether they are static, final, etc.
7. OciMetricsCdiExtensionTest only receives Code Style changes and the removal of the delay method, which is never used, so the logic in that test class is the same as before. Only OciMetricsSupportTest contains significant changes to resolve the reported issue.
8. Fail OciMetricsCdiExtensionTest if the enabled OCI Metrics validation times out on countDownLatch.await().
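Item 1 describes waiting up to a fixed time for the endpoint to be restored instead of checking once. A rough sketch of that idea follows, assuming a simple polling loop with a deadline; the actual test may implement the wait differently. The name `pollUntil` and the condition passed in `main` are illustrative only and do not come from the Helidon code base.

```java
import java.util.function.BooleanSupplier;

public class PollUntilSketch {

    // Re-checks the condition until it holds or the timeout elapses.
    // Returns true if the condition became true in time, false otherwise.
    static boolean pollUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Hypothetical condition: becomes true about 200 ms after start,
        // standing in for "the metric endpoint has been restored".
        boolean restored = pollUntil(() -> System.currentTimeMillis() - start >= 200, 10_000, 50);
        if (!restored) {
            throw new AssertionError("endpoint was not restored within 10 seconds");
        }
        System.out.println("endpoint restored");
    }
}
```

A generous upper bound such as 10 seconds costs nothing when the condition is met quickly, because the loop returns as soon as the check succeeds.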
PR #6151 merged, hence closing.
RCA is described here: helidon-io#6112 (comment), but in this version, OciMetricsSupportTest.testEndpoint is not applicable. Hence, only the peripheral changes outside of the OciMetricsSupportTest.testEndpoint race condition issue are included.

Changes include the following:
1. Define every countDownLatch locally in the test methods rather than as a static variable. Sharing one static countDownLatch causes a chain reaction of failures in other tests when an earlier test fails.
2. Always verify that countDownLatch.await() has completed; otherwise, assert a failure.
3. Remove the use of a fixed port when starting a WebServer.
4. Reset postingEndPoint to its original value before each test, so that @RepeatedTest can be used in the future for debugging purposes.
5. Apply the Helidon Code Style to both OciMetricsSupportTest and OciMetricsCdiExtensionTest. This includes making the test classes and methods package-local rather than public, and reordering fields based on whether they are static, final, etc.
6. OciMetricsCdiExtensionTest only receives Code Style changes and the removal of the delay method, which is never used, so the logic in that test class is the same as before. Only OciMetricsSupportTest contains significant changes to resolve the reported issue.
7. Fail OciMetricsCdiExtensionTest if the enabled OCI Metrics validation times out on countDownLatch.await().
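The "remove the fixed port" item can be illustrated with plain JDK sockets. This sketch does not use the Helidon WebServer API; it only shows the general technique of binding to port 0 so the operating system picks a free port, then reading the assigned port back before connecting. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class EphemeralPortSketch {

    // Binds a server to an OS-assigned port, connects a client to it,
    // and returns the port that was actually used (or -1 if the connect failed).
    static int startAndConnect() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port, so parallel or repeated runs do not collide.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // The port is only known after binding, so it is read back from the server.
            int port = server.getLocalPort();
            try (Socket client = new Socket(loopback, port)) {
                return client.isConnected() ? port : -1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("server was reachable on port " + startAndConnect());
    }
}
```

The same pattern applies to a test server in general: start it without a hard-coded port, then build the client URL from the port the server reports.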
I have seen the second issue at least 3 times