This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Fix global behavior #216

Closed · wants to merge 4 commits

Conversation

thrawn01
Contributor

Over Limit Behavior

The way it's supposed to work is that the owner of the rate limit broadcasts the computed result of the rate limit to the other peers, and that result is what is returned to clients who submit hits to a peer. The owner of the rate limit should be setting the OVER_LIMIT status before broadcasting.

However, we don't store OVER_LIMIT in the cache (see this comment), because it is expected that every time the functions in algorithms.go are called, OVER_LIMIT will be calculated and set properly. Because of this, the owning peer must perform those calculations and set the OVER_LIMIT status before broadcasting to other peers.
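
To illustrate, here is a minimal hypothetical sketch (not the actual algorithms.go code): only the counters live in the cache, the status falls out of each call, and so whoever holds the authoritative counters has to be the one to compute it.

```go
package main

import "fmt"

// Status stands in for the proto statuses; illustrative only.
type Status int

const (
	UnderLimit Status = iota
	OverLimit
)

// bucket is a hypothetical stand-in for the cached token-bucket state.
// Note that no status is stored here, only the counters.
type bucket struct {
	Remaining int64
	Limit     int64
}

// takeHits recomputes the status on every call from the cached counters,
// which is why the owning peer must run this before broadcasting a result.
func takeHits(b *bucket, hits int64) Status {
	if hits > b.Remaining {
		return OverLimit
	}
	b.Remaining -= hits
	return UnderLimit
}

func main() {
	b := &bucket{Remaining: 2, Limit: 2}
	fmt.Println(takeHits(b, 1), takeHits(b, 1), takeHits(b, 1)) // 0 0 1 (third hit is over the limit)
}
```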

As these wonderful contributors (@miparnisari and @philipgough) have noticed, it is currently not doing this! As @miparnisari pointed out, I do not know why this has not been caught before now. (Probably because of the lack of testing around the global functionality.)

At one time, I had envisioned the broadcaster would only send hits to the peers, and each peer would then calculate the correct OVER_LIMIT status based on the number of hits received from the owning peer. However, that plan would require more logic in the peers to handle things like RESET_REMAINING, negative hits, etc. Instead, I decided the owning peer would broadcast a RateLimitResp to each peer, which in turn would be returned to the client as the most up-to-date status of the rate limit.

It looks like I goofed when I made that transition and decided to make a call to getLocalRateLimit() for each queued broadcast. As a consequence, I had to set rl.Hits = 0, which short-circuits the OVER_LIMIT calculation, so the broadcasted responses do not have the proper OVER_LIMIT set. I think the correct way to handle this is to have the broadcaster queue the responses and send them to peers without recalculating the current status before broadcast. (Included in this PR.)
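
Here is a rough sketch of that idea with placeholder names (not the real gubernator types): the owner queues the response it already computed, including any OVER_LIMIT status, and the broadcast loop sends it verbatim.

```go
package main

import "sync"

// rateLimitResp is a hypothetical stand-in for the broadcasted response.
type rateLimitResp struct {
	Status    string // e.g. "UNDER_LIMIT" or "OVER_LIMIT"
	Remaining int64
	ResetTime int64
}

// broadcaster queues the owner's already-computed responses.
type broadcaster struct {
	mu      sync.Mutex
	pending map[string]rateLimitResp // keyed by name + unique key
}

// queueResponse records the result the owner computed while serving the hit,
// including any OVER_LIMIT status it set.
func (b *broadcaster) queueResponse(key string, resp rateLimitResp) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.pending == nil {
		b.pending = make(map[string]rateLimitResp)
	}
	b.pending[key] = resp
}

// drain hands the queued responses to the send loop as-is; nothing is
// recalculated here, so the broadcasted status matches what the owner saw.
func (b *broadcaster) drain() map[string]rateLimitResp {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.pending
	b.pending = nil
	return out
}

func main() {
	var b broadcaster
	b.queueResponse("test_global_token_limit_account:12345",
		rateLimitResp{Status: "OVER_LIMIT", Remaining: 0})
	_ = b.drain() // would be sent to each peer unchanged
}
```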

Also, I noted during testing that when using TOKEN_BUCKET the non-owning peer would respond with UNDER_LIMIT on the third hit (which is kind of correct, because it can only respond with what the owner has broadcast to it), but that isn't what most users would expect when using TOKEN_BUCKET. So I added some code for the peer to return OVER_LIMIT if the broadcast response has a remaining count of 0.
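
Sketched with placeholder names (not the actual peer code), that check looks roughly like this:

```go
package main

import "fmt"

type algorithm int

const (
	tokenBucket algorithm = iota
	leakyBucket
)

// broadcastState is a hypothetical copy of the owner's last broadcast
// held by a non-owning peer.
type broadcastState struct {
	Status    string
	Remaining int64
}

// peerStatus is what a non-owning peer would return for a new hit.
// For TOKEN_BUCKET, a remaining count of zero means any further hit is over
// the limit, even before the owner's next broadcast arrives.
func peerStatus(last broadcastState, algo algorithm, hits int64) string {
	if algo == tokenBucket && last.Remaining == 0 && hits > 0 {
		return "OVER_LIMIT"
	}
	return last.Status
}

func main() {
	last := broadcastState{Status: "UNDER_LIMIT", Remaining: 0}
	fmt.Println(peerStatus(last, tokenBucket, 1)) // OVER_LIMIT
	fmt.Println(peerStatus(last, leakyBucket, 1)) // UNDER_LIMIT (the peer cannot infer the leak)
}
```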

I think this works for TOKEN_BUCKET because the expiration time on the owner and the peer is the same, such that when the bucket expires, both the peer and the owner will correctly respond with 0 hits at the same time. However, for LEAKY_BUCKET the peer doesn't really know how many tokens have leaked since we last checked. I guess we could have the peer calculate the leak, but that sounds icky. Maybe someone can convince me otherwise?
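
For reference, the leak calculation the peer would have to do looks roughly like the sketch below, assuming the peer knows the limit, the duration, and when the owner's last broadcast was taken. Part of why it feels icky is that hits at other peers since that broadcast are invisible here, so the estimate can drift from the owner's real state.

```go
package main

import (
	"fmt"
	"time"
)

// estimateLeakyRemaining is a sketch of the calculation a non-owning peer
// would need in order to guess the current remaining count for LEAKY_BUCKET.
// It assumes the peer knows the limit, the duration, and how long ago the
// owner's last broadcast was taken; hits served by other peers since then are
// invisible, which is why this estimate can drift from the owner's real state.
func estimateLeakyRemaining(broadcastRemaining, limit int64, duration, since time.Duration) int64 {
	// Capacity is restored (the bucket leaks) at a rate of limit per duration.
	leaked := int64(float64(limit) * (since.Seconds() / duration.Seconds()))
	remaining := broadcastRemaining + leaked
	if remaining > limit {
		remaining = limit
	}
	return remaining
}

func main() {
	// Owner last broadcast "0 remaining" 500ms ago; limit is 2 per second.
	fmt.Println(estimateLeakyRemaining(0, 2, time.Second, 500*time.Millisecond)) // 1
}
```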

functional_test.go (outdated, resolved)
functional_test.go (outdated, resolved)
{
	Name:      "test_global_token_limit",
	UniqueKey: "account:12345",
	Algorithm: guber.Algorithm_TOKEN_BUCKET,
Contributor

Instead of hardcoding the algorithm here, you can do something like:

	algorithms := []guber.Algorithm{guber.Algorithm_TOKEN_BUCKET, guber.Algorithm_LEAKY_BUCKET}
	for _, algo := range algorithms {
		t.Run(algo.String(), func(t *testing.T) {
			// Send two hits that should be processed by the owner and the peer and deplete the remaining
			sendHit(algo, guber.Status_UNDER_LIMIT, 1)
			// ... (rest of the existing test body, parameterized on algo)
		})
	}

And you can remove the other test (assuming that everything else is the same)

Contributor

Personally, I think it makes sense to keep the tests split for clarity, but it's a good point that we could merge them if we expect the results to always be the same.

Contributor Author

Always avoid DRY in tests; functional tests should be clear and verbose, and should generally not change once written.

functional_test.go (outdated, resolved)
functional_test.go (outdated, resolved)
functional_test.go (resolved)
//
// We cannot perform this for LEAKY_BUCKET as we don't know how much time or what other requests
// might have influenced the leak rate at the owning peer.
// (Maybe we should perform the leak calculation here?????)
Contributor

As a newbie to this codebase, I read this comment as suggesting that we are breaking the single responsibility principle and need to move this elsewhere. My thought was that we need a new API called PeekRateLimit that doesn't affect the hit count and only calculates the right status.
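
Roughly what I have in mind, where the names and shapes are only a sketch and the request/response structs are placeholders, not the real proto messages:

```go
package ratelimit

import "context"

// Placeholder stand-ins for the real proto messages.
type RateLimitReq struct {
	Name      string
	UniqueKey string
	Hits      int64
}

type RateLimitResp struct {
	Status    string
	Remaining int64
}

// RateLimiter sketches how a read-only PeekRateLimit could sit next to the
// existing hit-applying call.
type RateLimiter interface {
	// GetRateLimits applies req.Hits and updates the stored counters.
	GetRateLimits(ctx context.Context, req *RateLimitReq) (*RateLimitResp, error)

	// PeekRateLimit computes the status req would receive without recording
	// the hits, so callers can check the limit without consuming tokens.
	PeekRateLimit(ctx context.Context, req *RateLimitReq) (*RateLimitResp, error)
}
```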

Contributor Author

100% agree with this, as I had the exact same thought. However, I'm planning on removing GRPC, so I decided to wait for a future PR.

I could be convinced otherwise!

Contributor

I would avoid major refactors that are outside the context of this PR, which I consider to be a bugfix. A follow-up might make sense though.

@philipgough
Contributor

> I guess we could have the peer calculate the leak... but that sounds icky... Maybe someone can convince me otherwise?

I agree that sounds like a bit too much logic leaking into the peer.

@thrawn01 thanks for taking this up. The changes make sense and lgtm, as best as I can understand the flow.

@philipgough (Contributor) left a comment

lgtm - thanks again :)

@miparnisari (Contributor) left a comment

Sorry, I'm taking back my approval because we tested this and, at big scale, it doesn't work. We will send a PR soon with our findings. See #218.

@thrawn01 mentioned this pull request Feb 10, 2024
@thrawn01
Contributor Author

This PR is replaced by #219

@thrawn01 closed this Feb 10, 2024