New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow multiple routers to update route status #8309
Allow multiple routers to update route status #8309
Conversation
@@ -157,7 +169,7 @@ func (a *StatusAdmitter) hasIngressBeenTouched(route *routeapi.Route, lastTouch | |||
// recordIngressTouch tracks whether the ingress record updated succeeded and returns true if the admitter can | |||
// continue. Conflict errors are treated as no error, but indicate the touch was not successful and the caller | |||
// should retry. | |||
func (a *StatusAdmitter) recordIngressTouch(route *routeapi.Route, touch *unversioned.Time, err error) (bool, error) { | |||
func (a *StatusAdmitter) recordIngressTouch(route *routeapi.Route, touch *unversioned.Time, oldTouch *unversioned.Time, err error) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable should probably be called ourTouch
or localTime
, since it belongs to us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually the previous touch time (before we updated it to be now
). I'm not super-averse to changing it, but IMO it's clearer as oldTouch
or prevTouch
(or possibly touch
could be newTouch
, and oldTouch
could be currentTouch
)
I've added in new unit tests, and done a some testing in my local setup, but it would be nice to have another couple pairs of eyes run through the logic. |
// still need to set the last-touched time so that others can tell we've | ||
// modified this Ingress value | ||
now := nowFn() | ||
ingress.Conditions[i].LastTransitionTime = &now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this modifying the ingress, is false really appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it actually should stay as false
-- it's only used in the rejection path, and it's actually in a path where we don't want to update if that's the only thing that would be changed, AFAICT, but I'll double check in the morning when I'm more awake ;-).
d5cdfdb
to
3df79a4
Compare
|
||
} | ||
|
||
func TestStatusFightBetweenRouters(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want to run a longer sequence of random fighting (increment one or the other randomly and invoke handle route in a for loop), then simulate a jump forward in the clock (set now past the content interval) and verify only one ends up winning.
Would do that as a separate test:
i.e.:
for i ... 100
if rand.Bool()
mutateA
else
mutateB
verify postcondition still holds
advance clock > contention interval
for 1..100
Might even want to simulate 3. It should be fast to run, but I really want to ensure we have a realistic "fight".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try adding something like this in.
3b8ab82
to
8fe78b8
Compare
Alright, I've added in a simulated protracted series of events between 2 'old' routers and two 'new' routers. It shows what happens when the new routers spin up, one hits a conflict, and the updates that result. It also shows what happens when an update to a route comes in in the middle of a rolling upgrade, and tests that cache entries expire after a while. Additionally, I uncovered a couple another corner case that appeared to be incorrect, corrected it, and added tests so that we don't regress in the future. Let me know if I've misdiagnosed anything. Namely: if we reject a route, but only change the hostname, leaving the message the same, we should still consider that changed, and try to update. |
|
||
t.Logf("...which should cause 'new' router #2 and the two 'old' routers to receive an update, and ignore it") | ||
now = unversioned.Time{Time: now.Add(1 * time.Second)} | ||
nowFn = func() unversioned.Time { return now } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wouldn't expect you to have to redefine this since the closure should capture the variable.
Changes look good to me, thanks for digging in [test] |
Whoops, don't merge quite yet, there's one tweak I need to make to the test b/c the contention period causes one of the back-and-forths to last 1 round-trip longer (just realized it as I was looking at the tests one last time ;-) ) |
Previously, if multiple differently-named routers attempted to update their respective Ingress entries in a route status, one would go through, while the others would recieve conflict errors. The last-touched cache, designed to prevent fighting between two same-named routers (for example, on a rolling update), would get the route added with the zero-time, meaning that the router would not submit further updates for that route (and consequently that router would never set the proper status on the route). Now, conflicts no longer insert an entry into the last-touched cache. Instead, the code now properly updates the last-touched time, which prevents the fighting between instances of the same router while tolerating conflicts inbetween instances of different routers. Additionally, cache entries now expire after a period of time (the "contention period"), as originally intended.
8fe78b8
to
e8d857a
Compare
That should fix the "incorrect" test (nothing was broken -- it just slightly misrepresented the sequence of events due to the 'old' routers still being in their contention period). |
Evaluated for origin test up to e8d857a |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2646/) |
Lgtm [merge] |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5500/) (Image: devenv-rhel7_3887) |
Evaluated for origin merge up to e8d857a |
Previously, if multiple differently-named routers attempted to update
their respective Ingress entries in a route status, one would go
through, while the others would recieve conflict errors.
The last-touched cache, designed to prevent fighting between two
same-named routers (for example, on a rolling update), would get
the route added with the zero-time, meaning that the router would
not submit further updates for that route (and consequently that
router would never set the proper status on the route).
Now, conflicts no longer insert an entry into the last-touched cache.
Instead, the code now properly updates the last-touched time, which
prevents the fighting between instances of the same router while
tolerating conflicts inbetween instances of different routers.
Additionally, cache entries now expire after a period of time
(the "contention period"), as originally intended.