Skip to content
This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

MegaFix global behavior bugs. #225

Merged
merged 24 commits into from
Mar 19, 2024
Merged

MegaFix global behavior bugs. #225

merged 24 commits into from
Mar 19, 2024

Conversation

Baliedge
Copy link
Contributor

@Baliedge Baliedge commented Feb 26, 2024

Every call to GetRateLimits would reset the ResetTime and not the Remaining counter. This would cause counters to eventually deplete and never fully reset. The solution involved fixing two issues:

  • The Duration value was never properly propagated in global behavior. This was added to the global broadcast logic.
    • The changes in PR Change global behavior #219 fixes propagation issues in UpdatePeerGlobals during a global broadcast but neglected to propagate Duration.
    • As a result, logic in algorithms.go would detect a change in Duration to zero and trigger a reset of the ResetTime. This code path does not reset the Remaining counter because it's meant for cases where an existing rate limit had been extended or abbreviated in duration.
    • I had wondered why this was never a problem before that PR. That's because that PR fixed a global broadcast bug that was setting the wrong data type in a CacheItem struct and logic in algorithms.go would ignore it, causing it to short circuit around the logic that checks Duration. Once the data type was corrected, the Duration bug was revealed.
  • The ResetTime generated by the owning and non-owning peers did not always match exactly.
    • Value would vary slightly depending on network lag and system time synchronization because peers were generating ResetTime in multiple places based on clock.Now().
    • This isn't a showstopper normally, but it does prevent writing a unit test to ensure ResetTime doesn't change due to the above bug.
    • GetRateLimits() will set a requestTime and pass it around so that any date/time computation to set ResetTime will always use the same base value instead of clock.Now().

Fix race condition in QueueUpdate() used by peers to propagate updates to rate limits that it owns.

  • Updates include ratelimit state, such as the Remaining counter. So, if the same key were updated multiple times it may get added in non-chronological order. The last update wins, potentially passing a stale Remaining count, thereby dropping hits already applied.
  • The fix is to pass only ratelimit key info to QueueUpdates(). Then, when the timer calls to propagate the update, get the current ratelimit state of each queued update just before sending to the peers.

Fix inconsistency with over limit status when calling GetRateLimits on a non-owner peer with global behavior.

  • The logic would always return a response with status UNDER_LIMIT no matter how many hits were applied.
  • This differs when the same request reaches the owner peer, which will return the appropriate status.
  • The fix adds a check if hits > remaining and set status accordingly.

Optimize calls to GetRateLimits with zero hits to not trigger any global updates because nothing changed.

Add rigorous functional tests around global behavior to verify full peer-to-peer propagation after a call to GetRateLimits.

Fix doublecounting of metric gubernator_over_limit_counter on both non-owner and owner peers. Only count on owner peer.

Fix metric doublecounting of gubernator_getratelimit_counter. When a non-owner uses Global behavior to process a request, do not increment the counter. After it global sends to the owner, the owner will increment the counter. This counter shall be the accurate count of rate limits checked.

Remove redundant metric gubernator_broadcast_counter. Use gubernator_broadcast_duration_count instead.

Fix intermittent test error related to TestHealthCheck that causes the next test to fail because the Gubernator services were restarted and aren't always ready in time to accept requests.

Every call to `GetRateLimits` would reset the `ResetTime` and not the `Remaining` counter.  This would cause counters to eventually deplete and never fully reset.
@Baliedge Baliedge marked this pull request as draft February 28, 2024 21:50
Request time is resolved at first call to `getLocalRateLimit()`, then is propagated across peer-to-peer for global behavior.
QueueUpdate() allowed for sending request/response when local ratelimits are updated.  However, the order they get called to QueueUpdate() is not guaranteed to be chronological.  This causes stale updates to propagate, causing lost hits.
Instead, QueueUpdate() will only pass the request.  The current ratelimit state will be retrieved immediately before propagation.

Rigorous functional tests added around global behavior.
@Baliedge Baliedge changed the title Fix global behavior ResetTime bug. Fix global behavior bugs. Mar 7, 2024
- Simplify passing of request time across layers.
- Better handling of metrics in tests.
- Better detection of global broadcasts, global updates, and idle.
- Drop redundant metric `guberator_global_broadcast_counter`.
- Fix metric `gubernator_global_queue_length` for global broadcast.
- Add metric `gubernator_global_send_queue_length` for global send.
…on-owner"

Better tracks the code path.
Can exclude non-owner activity.
Can get accurate total rate limit checks with query like `rate(gubernator_getratelimit_counter{calltype="local"}[1m])`
@Baliedge Baliedge changed the title Fix global behavior bugs. MegaFix global behavior bugs. Mar 11, 2024
Non-owners shouldn't be persisting rate limit state.
@Baliedge Baliedge marked this pull request as ready for review March 12, 2024 16:26
Copy link
Contributor

@thrawn01 thrawn01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?

@Baliedge
Copy link
Contributor Author

LGTM, We should do something about the global code in v3. With all the counters used for functional tests I feel like it would be simpler to just add a new API call which gets the current status of broadcasts and updates. Thoughts?

Agreed on both.

@Baliedge Baliedge merged commit 3982e50 into master Mar 19, 2024
4 checks passed
@Baliedge Baliedge deleted the Baliedge/fix-global branch March 19, 2024 13:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants