
Unexpected GetRateLimits result from non-owning peer when using Global behaviour #208

Closed
philipgough opened this issue Dec 13, 2023 · 12 comments

Comments

@philipgough
Contributor

philipgough commented Dec 13, 2023

I am trying to understand whether we have hit a bug/regression or whether this behaviour is somehow expected and I am missing something.

I have a reverse proxy that reaches out to a Gubernator cluster, all running on Kubernetes. The requests are load balanced over gRPC on the client side, round robin across the replicas in the Gubernator cluster. I have observed that when a request lands on a replica that does not own a rate limit but fetches the result from its cache, I always get an under-limit status response from that replica. Prometheus metrics show that broadcasts, peering, etc. are working as I would expect.

An example request might look like

	resp, err := c.client.GetRateLimits(ctx, &gubernator.GetRateLimitsReq{
		Requests: []*gubernator.RateLimitReq{{
			Name:      "some_name",
			UniqueKey: "some_key",
			Hits:      1,
			Limit:     1,
			Duration:  30000, // milliseconds (30 seconds)
			Algorithm: gubernator.Algorithm_LEAKY_BUCKET,
			Behavior:  gubernator.Behavior_GLOBAL,
		}},
	})

Running, for example, 30 test requests against two different scenarios (the first with a single replica, so we can guarantee the owner; the second with two replicas, which should split requests 50/50 when sent synchronously) produces the following results:

Gubernator version is v2.2.1
Gubernator is running with a single replica
Count of 200 is 1
Count of 429 is 29

Gubernator version is v2.2.1
Gubernator is running with two replicas
Count of 200 is 16
Count of 429 is 14
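
For reference, the counts above come from a simple loop like the following (a sketch, not the exact harness; c.client, ctx and req stand for the gRPC client, context and request from the example above, and the proxy maps UNDER_LIMIT/OVER_LIMIT to HTTP 200/429):

// Sketch only: send 30 requests and tally the statuses the proxy maps to 200/429.
counts := map[gubernator.Status]int{}
for i := 0; i < 30; i++ {
	resp, err := c.client.GetRateLimits(ctx, req)
	if err != nil {
		log.Fatal(err)
	}
	counts[resp.Responses[0].Status]++
}
fmt.Printf("Count of 200 is %d\n", counts[gubernator.Status_UNDER_LIMIT])
fmt.Printf("Count of 429 is %d\n", counts[gubernator.Status_OVER_LIMIT])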

I believe I have narrowed down the change of behaviour to #157, although I can't exactly explain why.

But running the above scenarios once again on the prior release returns the following, which is what I would expect as correct:

Gubernator version is v2.0.0-rc.32
Gubernator is running with two replicas
Count of 200 is 1
Count of 429 is 29

Gubernator version is v2.0.0-rc.32
Gubernator is running with one replica
Count of 200 is 1
Count of 429 is 29

I've added a test case (which fails, as expected) to reproduce this exact scenario in #207, and reverting #157 on top of that PR allows the test to pass.

Hits appears to be 0 for non-owning peers in algorithms.go when I step through with the debugger.

@philipgough
Contributor Author

philipgough commented Dec 13, 2023

Actually, it looks to be related to https://github.com/mailgun/gubernator/blob/master/global.go#L225, which sets hits to zero in all cases on the broadcast, so when the update arrives at the non-owning peer it will never get past the code added by #157 and report over the limit.
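
To make it concrete, here is a toy model of how I read the interaction (illustrative names only, not the actual gubernator code): with Hits zeroed on the broadcast, the over-limit check is effectively skipped, so the status can never flip to over the limit even when the bucket is exhausted.

package main

import "fmt"

// Toy model of the Hits == 0 path discussed in this issue.
type toyReq struct {
	Hits      int64
	Limit     int64
	Remaining int64
}

func toyCheck(r toyReq) string {
	status := "UNDER_LIMIT"
	if r.Hits == 0 {
		// Early return: Remaining is never inspected on this path.
		return status
	}
	if r.Remaining-r.Hits < 0 {
		status = "OVER_LIMIT"
	}
	return status
}

func main() {
	// The broadcast zeroes Hits, so even an exhausted bucket reports UNDER_LIMIT.
	fmt.Println(toyCheck(toyReq{Hits: 0, Limit: 1, Remaining: 0}))
}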

@philipgough changed the title from "Unexpected GetRateLimits when using Global behaviour with Leaky Bucket" to "Unexpected GetRateLimits result from non-owning peer when using Global behaviour" on Dec 13, 2023
@douglascamata

douglascamata commented Dec 13, 2023

@philipgough's findings seem pretty odd. Imagine this scenario:

  • A rate limit for a given key is at 5/5 (hits/limit), with status UNDER_LIMIT.
  • On the 6th hit that reaches the key owner, the limit goes to 6/5 OVER_LIMIT and a broadcast from the owner should happen: before broadcasting, the owner runs getLocalRateLimit with a request that has 0/5 OVER_LIMIT to get the most up-to-date status of that limit:
    status, err := gm.instance.getLocalRateLimit(ctx, rl)
  • This will run through each algorithm's code, which is where things get problematic and we end up getting status UNDER_LIMIT for some reason. 🤷

I tried to understand where that UNDER_LIMIT status comes from and I failed. 😭

Both leaky bucket and token bucket algorithms show this problem, potentially for different reasons... I don't know.

@philipgough
Contributor Author

@Baliedge would you mind taking a look at this if you can spare some time? Not sure if it's an issue with understanding the correct usage or a valid problem! TY

@miparnisari
Contributor

miparnisari commented Jan 20, 2024

I stumbled upon this issue yesterday because I saw in our server logs that, for any given request, 429s were always returned by one specific host in our cluster (the owner host for that request). I spent a couple of hours looking into this and I believe I know what's going on.

I wrote a test (in a private repo, sorry) that does this (a rough sketch follows the list):

  1. Create two peers with the default settings. Peer 2 will be the owner.
  2. Call GetRateLimits on peer 1 with Behavior: GLOBAL, Amount: 1, Limit: 1, Duration: 1 * time.Minute. This is a cache miss in peer 1, so it goes to peer 2. Peer 2 gets the hit and should update peer 1 with remaining = 0, status = under_limit.
  3. Verify that Allowed = true, Remaining = 0.
  4. time.Sleep(1 * time.Second). This gives enough time to broadcast the change from peer 2 to peer 1.
  5. Call GetRateLimits on peer 1, again, with Behavior: GLOBAL, Amount: 1, Limit: 1, Duration: 1 * time.Minute. This is a cache hit in peer 1.
  6. Verify that Allowed = false. This assertion fails.
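
Roughly, the test looks like this (a sketch against the public gubernator types shown elsewhere in this thread; startTwoPeers is a stand-in helper, not part of gubernator, and the key names and module path are assumed):

package gubernator_test

import (
	"context"
	"testing"
	"time"

	gubernator "github.com/mailgun/gubernator/v2"
	"github.com/stretchr/testify/require"
)

func TestGlobalNonOwnerReportsOverLimit(t *testing.T) {
	ctx := context.Background()
	// startTwoPeers is a stand-in for the private test setup: it starts two
	// peered instances and returns a client connected to the non-owning peer.
	client := startTwoPeers(t)

	req := &gubernator.GetRateLimitsReq{
		Requests: []*gubernator.RateLimitReq{{
			Name:      "test_global",
			UniqueKey: "account:1234",
			Hits:      1,
			Limit:     1,
			Duration:  60000, // 1 minute, in milliseconds
			Behavior:  gubernator.Behavior_GLOBAL,
		}},
	}

	// Steps 2-3: cache miss on peer 1, forwarded to the owner (peer 2); the hit
	// is allowed and exhausts the limit.
	resp, err := client.GetRateLimits(ctx, req)
	require.NoError(t, err)
	require.Equal(t, gubernator.Status_UNDER_LIMIT, resp.Responses[0].Status)
	require.Equal(t, int64(0), resp.Responses[0].Remaining)

	// Step 4: give the owner time to broadcast its state to peer 1.
	time.Sleep(1 * time.Second)

	// Steps 5-6: served from peer 1's cache; this should now be over the limit,
	// but the assertion fails.
	resp, err = client.GetRateLimits(ctx, req)
	require.NoError(t, err)
	require.Equal(t, gubernator.Status_OVER_LIMIT, resp.Responses[0].Status)
}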

I think the reason the assertion fails is that in algorithms.go, both algorithms have:

rl := &RateLimitResp{
    Status:    Status_UNDER_LIMIT,
    // ...
}

// ...

if r.Hits == 0 {
    return rl, nil
}

When the owner of a key is done aggregating all the hits it got and makes a final peek to GetRateLimits (i.e. it sends a copy of the request with hits = 0), we don't want to blindly return whatever we had. We are not inspecting the value of r.Remaining, which could be zero at this point, so we never set the status to OVER_LIMIT on the owner, which means the peers will not see it either.

Either we have to fix this in the algorithms themselves (risky, this code doesn't have unit tests!), or we need to change the code of global.broadcastPeers so that after doing getLocalRateLimit we manually inspect the value of Remaining and, if it's zero, manually set the status to OVER_LIMIT.
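
A rough sketch of the second option (not a patch against the real broadcastPeers body; it just reuses the getLocalRateLimit call shown earlier in this thread):

status, err := gm.instance.getLocalRateLimit(ctx, rl)
if err != nil {
	continue // or however broadcastPeers already handles this error
}
// Work around the Hits == 0 early return: the peek never flips the status,
// so derive it from Remaining before the update is sent to the peers.
if status.Remaining <= 0 {
	status.Status = Status_OVER_LIMIT
}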

Judging by the git blame, this problem has been there for roughly 3 years. But if my judgement is correct, then it's odd that it was noticed by two independent reporters just a few days apart. As @philipgough said, the test passes on v2.0.0-rc.32, which means this has been broken since October 2022 😅

@thrawn01
Contributor

I might have a fix: #216

@Baliedge
Contributor

PR #219 was merged recently. Did this fix the issue reported here?

@douglascamata

@Baliedge: @philipgough kindly left a PR with a reproduction of the problem. So it would be cool to incorporate the test into the codebase (if it's not already done) and confirm whether the issue is fixed through the test.

@Baliedge
Contributor

@douglascamata I'm not finding any PRs to your name. Could you provide a link?

@douglascamata

@Baliedge: as I said, it is @philipgough's PR... mentioned in his first comment in this issue. Here it is: #207

@Baliedge
Contributor

Baliedge commented Feb 22, 2024

@douglascamata @philipgough: Would you both please take a look at PR #224?

@philipgough
Contributor Author

@Baliedge - sorry, I was on vacation for a bit so I missed this ping.

@Baliedge
Contributor

I've released v2.4.0, which includes fixes in PRs #219 and #225 that specifically address this problem. We've been running it in production for the last week and I'm satisfied with the results.

Let me know if you have any feedback. Thanks for your help!
