MegaFix global behavior bugs. #225

Baliedge · 2024-02-26T19:51:03Z

Every call to GetRateLimits would reset the ResetTime and not the Remaining counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:

The Duration value was never properly propagated in global behavior. This was added to the global broadcast logic.
- The changes in PR Change global behavior #219 fixes propagation issues in UpdatePeerGlobals during a global broadcast but neglected to propagate Duration.
- As a result, logic in algorithms.go would detect a change in Duration to zero and trigger a reset of the ResetTime. This code path does not reset the Remaining counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.
- I had wondered why this was never a problem before that PR. That's because that PR fixed a global broadcast bug that was setting the wrong data type in a CacheItem struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checks Duration. Once the data type was corrected, the Duration bug was revealed.
The ResetTime generated by the owning and non-owning peers did not always match exactly.
- Value would vary slightly depending on network lag and system time synchronization because peers were generating ResetTime in multiple places based on clock.Now().
- This isn't a showstopper normally, but it does prevent writing a unit test to ensure ResetTime doesn't change due to the above bug.
- GetRateLimits() will set a requestTime and pass it around so that any date/time computation to set ResetTime will always use the same base value instead of clock.Now().

Fix race condition in QueueUpdate() used by peers to propagate updates to rate limits that it owns.

Updates include ratelimit state, such as the Remaining counter. So, if the same key were updated multiple times it may get added in non-chronological order. The last update wins, potentially passing a stale Remaining count, thereby dropping hits already applied.
The fix is to pass only ratelimit key info to QueueUpdates(). Then, when the timer calls to propagate the update, get the current ratelimit state of each queued update just before sending to the peers.

Fix inconsistency with over limit status when calling GetRateLimits on a non-owner peer with global behavior.

The logic would always return a response with status UNDER_LIMIT no matter how many hits were applied.
This differs when the same request reaches the owner peer, which will return the appropriate status.
The fix adds a check if hits > remaining and set status accordingly.

Optimize calls to GetRateLimits with zero hits to not trigger any global updates because nothing changed.

Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to GetRateLimits.

Fix doublecounting of metric gubernator_over_limit_counter on both non-owner and owner peers. Only count on owner peer.

Fix metric doublecounting of gubernator_getratelimit_counter. When a non-owner uses Global behavior to process a request, do not increment the counter. After it global sends to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.

Remove redundant metric gubernator_broadcast_counter. Use gubernator_broadcast_duration_count instead.

Fix intermittent test error related to TestHealthCheck that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.

Every call to `GetRateLimits` would reset the `ResetTime` and not the `Remaining` counter. This would cause counters to eventually deplete and never fully reset.

Request time is resolved at first call to `getLocalRateLimit()`, then is propagated across peer-to-peer for global behavior.

…fix-global

QueueUpdate() allowed for sending request/response when local ratelimits are updated. However, the order they get called to QueueUpdate() is not guaranteed to be chronological. This causes stale updates to propagate, causing lost hits. Instead, QueueUpdate() will only pass the request. The current ratelimit state will be retrieved immediately before propagation. Rigorous functional tests added around global behavior.

- Simplify passing of request time across layers. - Better handling of metrics in tests. - Better detection of global broadcasts, global updates, and idle. - Drop redundant metric `guberator_global_broadcast_counter`. - Fix metric `gubernator_global_queue_length` for global broadcast. - Add metric `gubernator_global_send_queue_length` for global send.

…on-owner" Better tracks the code path. Can exclude non-owner activity. Can get accurate total rate limit checks with query like `rate(gubernator_getratelimit_counter{calltype="local"}[1m])`

…rement a counter.

Non-owners shouldn't be persisting rate limit state.

thrawn01

LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?

Baliedge · 2024-03-13T17:52:05Z

LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?

Agreed on both.

Fix global behavior ResetTime bug.

f02cb5b

Every call to `GetRateLimits` would reset the `ResetTime` and not the `Remaining` counter. This would cause counters to eventually deplete and never fully reset.

Baliedge requested a review from seany-boy13 February 26, 2024 19:51

Baliedge marked this pull request as draft February 28, 2024 21:50

Baliedge added 5 commits February 28, 2024 17:13

Refine request time propagation.

a665c3c

Request time is resolved at first call to `getLocalRateLimit()`, then is propagated across peer-to-peer for global behavior.

Merge branch 'master' of github.com:mailgun/gubernator into Baliedge/…

427dd2b

…fix-global

Fix compile error.

56a0b22

Fix intermittent test error caused by TestHealthCheck.

cb3816a

Baliedge changed the title ~~Fix global behavior ResetTime bug.~~ Fix global behavior bugs. Mar 7, 2024

Baliedge added 14 commits March 11, 2024 11:17

Fix lint errors.

7c67a32

Tidy code.

24bee89

Add back tests that were erroneously removed.

ea42442

Fix tests.

bd38ee6

Fix benchmark test errors.

65ee4fa

Fix TestHealthCheck.

d5c74d2

Fix flaky test TestHealthCheck.

cd7bbab

Backwards compatibility needed for upgrading.

0fb2a33

Fix compile error.

47eede5

Fix test.

2229596

Fix for overlimit metric doublecounting on non-owner and owner.

c0608d5

Metric gubernator_getratelimit_counter adds calltype value "local n…

517bf63

…on-owner" Better tracks the code path. Can exclude non-owner activity. Can get accurate total rate limit checks with query like `rate(gubernator_getratelimit_counter{calltype="local"}[1m])`

Changed mind. Instead of calltype="local non-owner", just don't inc…

5f137ad

…rement a counter.

Baliedge changed the title ~~Fix global behavior bugs.~~ MegaFix global behavior bugs. Mar 11, 2024

Don't call OnChange() event from non-owner.

f51861d

Non-owners shouldn't be persisting rate limit state.

Baliedge marked this pull request as ready for review March 12, 2024 16:26

Baliedge added 3 commits March 13, 2024 10:23

Simplify cache item expiration check.

d55016d

Rename RequestTime to CreatedAt in protos.

5ce3bc1

Revert optimization that won't work.

0b2adf6

thrawn01 approved these changes Mar 13, 2024

View reviewed changes

Baliedge merged commit 3982e50 into master Mar 19, 2024
4 checks passed

Baliedge mentioned this pull request Mar 19, 2024

Unexpected GetRateLimits result from non-owning peer when using Global behaviour #208

Closed

Baliedge deleted the Baliedge/fix-global branch March 19, 2024 13:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MegaFix global behavior bugs. #225

MegaFix global behavior bugs. #225

Baliedge commented Feb 26, 2024 •

edited

thrawn01 left a comment

Baliedge commented Mar 13, 2024

MegaFix global behavior bugs. #225

MegaFix global behavior bugs. #225

Conversation

Baliedge commented Feb 26, 2024 • edited

thrawn01 left a comment

Choose a reason for hiding this comment

Baliedge commented Mar 13, 2024

Baliedge commented Feb 26, 2024 •

edited