
Object retrier #514

Merged: 4 commits into openstack-charmers:master from object-retrier on Mar 25, 2021

Conversation

ajkavanagh
Contributor

This adds a wrapper class that detects if a callable on the wrapped object,
or on any of its descendent objects, raises an Exception. If so, it retries
that call.

This is to attempt to make the zaza tests a little more robust in the
face of small network failures or strange restarts. This is a test, and
robust logging and reporting should be used to determine whether it is
covering up actual bugs rather than CI system issues.

Related Bug: (zot repo)#348
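
For readers unfamiliar with the pattern, here is a minimal sketch of the wrap-and-retry idea; the class name, parameters and defaults below are illustrative assumptions, not the implementation merged in this PR.

import logging
import time


class RetryingProxy:
    """Wrap an object and retry calls that raise a retryable exception."""

    def __init__(self, obj, num_retries=3, backoff=5, retry_exceptions=None):
        self._obj = obj
        self._num_retries = num_retries
        self._backoff = backoff
        self._retry_exceptions = retry_exceptions

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def _retrying_call(*args, **kwargs):
            for attempt in range(self._num_retries + 1):
                try:
                    return attr(*args, **kwargs)
                except Exception as e:
                    # Retry only if the exception type was registered (or if
                    # retry_exceptions is None, meaning retry everything),
                    # and only while attempts remain.
                    retryable = (self._retry_exceptions is None or
                                 isinstance(e, tuple(self._retry_exceptions)))
                    if not retryable or attempt == self._num_retries:
                        raise
                    logging.warning("Retrying %s after: %s", name, e)
                    time.sleep(self._backoff)

        return _retrying_call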

@codecov-io

codecov-io commented Mar 3, 2021

Codecov Report

Merging #514 (000c977) into master (802e0e0) will increase coverage by 0.33%.
The diff coverage is 90.69%.


@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
+ Coverage   18.49%   18.82%   +0.33%     
==========================================
  Files         160      160              
  Lines        9525     9565      +40     
==========================================
+ Hits         1762     1801      +39     
- Misses       7763     7764       +1     
Impacted Files                               | Coverage Δ
---------------------------------------------|---------------------
zaza/openstack/charm_tests/octavia/tests.py  | 0.00% <0.00%> (ø)
zaza/openstack/utilities/__init__.py         | 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 802e0e0...000c977. Read the comment docs.

@ajkavanagh ajkavanagh force-pushed the object-retrier branch 2 times, most recently from bff1a7b to cbce35c on March 8, 2021 14:22
openstack-mirroring pushed a commit to openstack/openstack that referenced this pull request Mar 8, 2021
* Update charm-octavia from branch 'master'
  to 0c07a1f7959dfca82e2e2f12070dd4c8f5168988
  - Merge "Ensure octavia-driver-agents gets installed"
  - Ensure octavia-driver-agents gets installed
    
openstack-mirroring pushed a commit to openstack/charm-octavia that referenced this pull request Mar 8, 2021
Due to a 'quirk' being fixed, the octavia-driver-agents don't get
installed if the ovsdb-subordinate relation is the last to be made.
This is due to the update-status hook gating the reactive handler from
running, which 'fixed' the issue.  Looking deeper, if the handler isn't
the first to run, then the charm's packages won't get initialised
properly.

This patch fixes that by using the dynamic properties 'all_packages' and
'full_service_list' available in charms.openstack core for the Charm
class.  This means that it doesn't matter when the flag is set, as long
as it is before the property is accessed.  Looking at the handler, this
will be in the right place for the install.

func-test-pr: openstack-charmers/zaza-openstack-tests#514

Change-Id: I5fd75c9d371390bca402d6a3a264421a44fd092a
Closes-Bug: #1916764
@ajkavanagh ajkavanagh marked this pull request as ready for review March 9, 2021 15:04
@ajkavanagh
Contributor Author

Tested with https://review.opendev.org/c/openstack/charm-octavia/+/778003 (which has now merged)

This adds a wrapper class that detects if a callable on the wrapped object,
or on any of its descendent objects, raises an Exception.  If so, it retries
that call.

This is to attempt to make the zaza tests a little more robust in the
face of small network failures or strange restarts.  This is a test, and
robust logging and reporting should be used to determine whether it is
covering up actual bugs rather than CI system issues.

Related Bug: (zot repo)openstack-charmers#348
The main failure that seems to occur with clients is ConnectFailure,
which, according to the docs, is retry-able.  Thus provide a function that
adds that exception condition automatically.
Also fix the import problem in the octavia test for the object retrier that
is being used to validate the ObjectRetrier feature.
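
The commit message above mentions a function that adds the ConnectFailure condition automatically; a hedged sketch of what such a helper could look like, reusing the RetryingProxy sketched earlier (the helper name and the ConnectFailure import path are assumptions for illustration):

from keystoneauth1.exceptions.connection import ConnectFailure


def retry_on_connect_failure(client, **kwargs):
    """Wrap a client so that ConnectFailure is always treated as retryable."""
    retry_exceptions = list(kwargs.pop('retry_exceptions', None) or [])
    retry_exceptions.append(ConnectFailure)
    return RetryingProxy(client, retry_exceptions=retry_exceptions, **kwargs)
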
Contributor

@thedac thedac left a comment

Not blocking but a few thoughts. The ObjectRetrier call really feels like a re-implementation of tenacity. Is there a reason you are not using tenacity here?

I was a bit horrified to see the first usage was at the setUpClass level. Are we truly seeing failures at this stage? That would seem to indicate the overcloud is not really ready yet. Does it not?

It feels like we will want this kind of thing everywhere. Should we consider wrapping the openstack clients and importing the wrapped versions rather than the direct clients?

try:
    return obj(*args, **kwargs)
except Exception as e:
    # if retry_exceptions is None, or the type of the exception is
Contributor

Should this be 'is not None'?

Contributor Author

Yup, typo; thanks for picking this up.
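
For context, the corrected condition reads roughly like this (a sketch of the intent, not the exact merged diff):

# Raise immediately unless this exception type was registered as retryable;
# when retry_exceptions is None, every exception is considered retryable.
if retry_exceptions is not None and not isinstance(e, tuple(retry_exceptions)):
    raise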

Comment on lines 68 to 73
cls.keystone_client = ObjectRetrier(
    openstack_utils.get_keystone_session_client(cls.keystone_session))
cls.neutron_client = ObjectRetrier(
    openstack_utils.get_neutron_session_client(cls.keystone_session))
cls.octavia_client = ObjectRetrier(
    openstack_utils.get_octavia_session_client(cls.keystone_session))
Contributor

I am a bit surprised to see these at the setUpClass level. Are we seeing failures here?

Contributor Author

Ah, I may not have explained how it works properly. The __init__(...) method does the wrapping of the object, but it's when the client is used that the retry pattern is activated, e.g.

# wrap a client that is used to make calls.
client = ObjectRetrier(some_object_instance)
...
# later
...
# This is where the retries are done if needed
client.do_call()

Contributor

This makes complete sense. I am not sure why I didn't see it at review time.

@ajkavanagh
Contributor Author

Not blocking but a few thoughts. The ObjectRetrier call really feels like a re-implementation of tenacity. Is there a reason you are not using tenacity here?

Why didn't I use tenacity? It's a fair question. It was mostly about having more obvious (in the code) control over how the exceptions from the wrapped code are handled, determining whether a retry will happen, and then either eating or propagating that exception; I felt that tenacity hid this too much, and it was also quite awkward to set up.
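
For comparison, a per-call retry with tenacity might look roughly like this (illustrative only; the helper function and the client method name are assumptions, not code from this PR):

import tenacity
from keystoneauth1.exceptions.connection import ConnectFailure


@tenacity.retry(
    retry=tenacity.retry_if_exception_type(ConnectFailure),
    wait=tenacity.wait_fixed(5),
    stop=tenacity.stop_after_attempt(3),
    reraise=True)
def _list_load_balancers(octavia_client):
    # Each call site that needs retries gets its own decorated wrapper,
    # which is part of why wrapping the whole client object felt clearer here.
    return octavia_client.load_balancer_list()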

I was a bit horrified to see the first usage was at the setUpClass level. Are we truly seeing failures at this stage? That would seem to indicate the overcloud is not really ready yet. Does it not?

No, that's just the setup code to wrap the object; no retries happen at that point - please see the inline comment. I can see how the naming of the class might make people think that that is when the retries happen. Maybe the class should be renamed to SetupObjectRetrier or something that telegraphs that it's a setup phase, rather than when the retries happen. Alternatively, I could make __init__() a no-op and use a class method (say _setup_) to do the wrapping. However, I was trying to be minimally invasive of the namespace of the wrapped object, as all 'calls' go to it via the retry mechanism. Hence renaming the class may be the better idea.

It feels like we will want this kind of thing everywhere. Should we consider wrapping the openstack clients and importing the wrapped versions rather than the direct clients?

Valid point. My purpose here (in this PR) was to introduce the code for the ObjectRetrier, show how it can be used, and test it as a fix for the Octavia tests. I didn't want to introduce it on everything (yet), as I wanted to see how it might impact our 'worst' offender set of tests first! :) What do you think?

@thedac
Contributor

thedac commented Mar 24, 2021

Not blocking but a few thoughts. The ObjectRetrier call really feels like a re-implementation of tenacity. Is there a reason you are not using tenacity here?

Why didn't I use tenacity? It's a fair question. It was mostly about having more obvious (in the code) control over how the exceptions from the wrapped code are handled, determining whether a retry will happen, and then either eating or propagating that exception; I felt that tenacity hid this too much, and it was also quite awkward to set up.

I am torn on this since we have the pattern of using tenacity everywhere else. But as I said, I am not blocking on this.

I was a bit horrified to see the first usage was at the setUpClass level. Are we truly seeing failures at this stage? That would seem to indicate the overcloud is not really ready yet. Does it not?

No, that's just the setup code to wrap the object; no retries happen at that point - please see the inline comment. I can see how the naming of the class might make people think that that is when the retries happen. Maybe the class should be renamed to SetupObjectRetrier or something that telegraphs that it's a setup phase, rather than when the retries happen. Alternatively, I could make __init__() a no-op and use a class method (say _setup_) to do the wrapping. However, I was trying to be minimally invasive of the namespace of the wrapped object, as all 'calls' go to it via the retry mechanism. Hence renaming the class may be the better idea.

Makes complete sense now.

It feels like we will want this kind of thing everywhere. Should we consider wrapping the openstack clients and importing the wrapped versions rather than the direct clients?

Valid point. My purpose here (in this PR) was to introduce the code for the ObjectRetrier, show how it can be used, and test it as a fix for the Octavia tests. I didn't want to introduce it on everything (yet), as I wanted to see how it might impact our 'worst' offender set of tests first! :) What do you think?

I appreciate this is the first round and is targeted. It makes sense to land a lighter weight change first.

Contributor

@thedac thedac left a comment

To be clear we are just waiting on the typo fix to land this.

Fixes a typo (None -> not None) and renames the class from
ObjectRetrier -> ObjectRetrierWraps to make it clearer that the class
instantiation is to wrap an object with the retrier code rather than do
retries at that moment.
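
After the rename, usage looks roughly like this (a sketch; the import paths and the client method are inferred from the files this PR touches and the quoted octavia test, so treat them as assumptions):

from zaza.openstack.utilities import ObjectRetrierWraps
import zaza.openstack.utilities.openstack as openstack_utils

keystone_session = openstack_utils.get_overcloud_keystone_session()
# Wrapping happens here, but retries only kick in when the wrapped client
# is actually called later in the test.
octavia_client = ObjectRetrierWraps(
    openstack_utils.get_octavia_session_client(keystone_session))
octavia_client.load_balancer_list()  # retries happen here if needed
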
Contributor

@thedac thedac left a comment

Looks good. Landing. Thanks for the work.

@thedac thedac merged commit 0ace737 into openstack-charmers:master Mar 25, 2021
@ajkavanagh ajkavanagh deleted the object-retrier branch March 25, 2021 16:14