
Ratelimit: All requests are responded with 429 when redis is down #12115

Closed
fnp927 opened this issue Feb 27, 2019 · 22 comments

@fnp927

fnp927 commented Feb 27, 2019

Describe the bug
I applied the Redis rate limit function on my cluster and used a shell script to hit it 10 times every second. The script only prints the response status code. As the screenshot below shows, it runs fine (200 means the request passed and 429 means too many requests).

(screenshot: response status codes mixing 200 and 429 while Redis is up)

However, when Redis is down, all requests are responded to with 429 ("too many requests").

(screenshot: every response is 429 after Redis goes down)

Is this design reasonable? From my point of view, rate limiting should not be applied to the system when part of it encounters errors.

Expected behavior
Rate limiting should not be enforced when something is wrong with the quota backend.

Version
Istio version: 1.0.0

Environment
Istio 1.0.0

Some configurations
Redis quota:
(screenshot: redisquota handler, instance, and rule configuration)

quotaSpecBinding:
(screenshot: QuotaSpec and QuotaSpecBinding configuration)

shell scripts
(screenshot: the test loop; a rough equivalent is sketched below)
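For reference, the loop can be approximated in Go; the target URL, path, and exact rate below are placeholders and are not taken from the original script:

```go
// Rough stand-in for the shell script used to reproduce the issue:
// send ~10 requests per second and print only the HTTP status code.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	url := "http://<ingress-ip>/productpage" // placeholder target; substitute the service under test
	client := &http.Client{Timeout: 2 * time.Second}
	ticker := time.NewTicker(100 * time.Millisecond) // ~10 requests per second
	defer ticker.Stop()

	for range ticker.C {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(resp.StatusCode) // 200 = passed, 429 = rate limited
		resp.Body.Close()
	}
}
```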

Thanks

@gargnupur
Contributor

@fnp927: Thanks for reporting this!
Do you have Mixer logs from when this happens?

@douglas-reid
Contributor

@gargnupur this was reported in the Slack channel as well. When Redis failures occur, we send back a default quota result, which is 0 granted quota. It looks like that was intentional.

I'm not sure what the correct thing to do here is, but it is probably better to send back an error and let the policy fail_open/fail_closed setting take over.

@gargnupur
Contributor

Thanks, Doug, for the info.
Sure, rather than sending empty quota results in this function (https://github.com/istio/istio/blob/master/mixer/adapter/redisquota/redisquota.go#L313:19), we can send a status for the errors.

policy fail_open/fail_closed setting: where do we define this? Are you referring to the network_fail_open policy here: https://sourcegraph.com/github.com/istio/proxy@ce051fe8e813e19c87cabae5fb12643d4c063edb/-/blob/include/istio/mixerclient/options.h#L42:8 ?
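A minimal Go sketch of that direction, assuming a simplified quota-handler shape; QuotaResult, Backend, and HandleQuota here are illustrative stand-ins, not the actual Mixer adapter API:

```go
// Hypothetical sketch: propagate backend failures as an error/status instead
// of silently returning a zero-grant ("deny everything") quota result.
package quotasketch

import (
	"context"
	"fmt"
	"time"
)

// QuotaResult stands in for a quota result that can carry an error/status.
type QuotaResult struct {
	Amount        int64
	ValidDuration time.Duration
	Err           error // non-nil when the quota backend could not be reached
}

// Backend abstracts the Redis call that actually debits the quota.
type Backend interface {
	Alloc(ctx context.Context, key string, amount int64) (int64, error)
}

// HandleQuota returns the granted amount, or a result whose Err field is set
// when Redis is unavailable, so the caller can apply its fail_open/fail_closed policy.
func HandleQuota(ctx context.Context, b Backend, key string, want int64) QuotaResult {
	granted, err := b.Alloc(ctx, key, want)
	if err != nil {
		// Old behavior: return QuotaResult{} (0 granted, so every request gets a 429).
		// Proposed behavior: surface the error and let policy decide.
		return QuotaResult{Err: fmt.Errorf("quota backend unavailable: %w", err)}
	}
	return QuotaResult{Amount: granted, ValidDuration: time.Minute}
}
```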

@kyessenov
Contributor

@gargnupur Yes, and that setting is populated from https://github.com/istio/api/blob/master/mesh/v1alpha1/config.proto#L61 and from pod annotations. Propagating an error to the client seems reasonable to me.

@douglas-reid douglas-reid added this to the 1.2 milestone Mar 5, 2019
gargnupur added a commit to gargnupur/api that referenced this issue May 14, 2019
This is so that we can return a status when we are not able to fetch quota, and the check call can fail based on the fail-open policy set by the user.
Issue: istio/istio#12115
istio-testing pushed a commit to istio/api that referenced this issue May 16, 2019
* Add a status field in QuotaResult

This is so that we can return a status when we are not able to fetch quota, and the check call can fail based on the fail-open policy set by the user.
Issue: istio/istio#12115

* Fix build
@douglas-reid
Contributor

@gargnupur what's left here?

@gargnupur
Contributor

Reopening this PR: #12507 and using the status field added in PR istio/api@657d9f2.
After that, using the added status in the QuotaResult in istio-proxy...

Will try to get to these next week.

gargnupur added a commit to gargnupur/api that referenced this issue Jul 2, 2019
This is the QuotaResult that is used by the grpcServer in istio.io/istio; we can set this when we are not able to fetch quota, and the check call can then fail in the proxy based on the fail-open policy set by the user.

Ref: istio/istio#12115
istio-testing pushed a commit to istio/api that referenced this issue Jul 8, 2019
* Add a Status field in CheckResponse Quota Result

This is the QuotaResult that is used by the grpcServer in istio.io/istio; we can set this when we are not able to fetch quota, and the check call can then fail in the proxy based on the fail-open policy set by the user.

Ref: istio/istio#12115

* Run make proto-commit
@rlenglet rlenglet modified the milestones: 1.4, 1.3 Jul 9, 2019
@duderino
Contributor

@kyessenov @douglas-reid could one of you guide #15390 into master and then cherry-pick it into the release-1.3 branch?

@douglas-reid
Contributor

I've approved that PR, but I'm not 100% certain that it is the only bit needed. I think we'll want to check the proxy code to be sure that the quota status returns are being handled appropriately.

@douglas-reid
Contributor

@duderino after some basic manual testing with the PR changes, I'm convinced that we need a change to: https://github.com/istio/proxy/blob/master/src/istio/mixerclient/client_impl.cc#L286-L288 as well.
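For illustration only, here is a conceptual Go sketch of the decision the linked C++ code would need to make once the quota check can surface a transport-level status; the real logic lives in the mixerclient, and the function below is hypothetical:

```go
// Conceptual sketch only; the actual implementation is in the C++ mixerclient.
package failopensketch

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// decide maps the status of a quota check onto an allow/deny decision.
// failOpen corresponds to the mesh-wide policy-check fail-open setting.
func decide(checkErr error, failOpen bool) bool {
	if checkErr == nil {
		return true // quota granted
	}
	if status.Code(checkErr) == codes.ResourceExhausted {
		return false // a genuine rate-limit decision: keep returning 429
	}
	// Any other error (e.g. UNAVAILABLE because Redis/Mixer is down) is an
	// infrastructure failure, so the fail-open policy applies.
	return failOpen
}
```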

@duderino
Contributor

@douglas-reid how should that code change?

@douglas-reid
Contributor

@duderino istio/proxy#2401 is my attempt at getting it right. Curious for your thoughts.

@duderino
Contributor

duderino commented Sep 3, 2019

@douglas-reid lgtm. Looks like a nice addition to 1.3.

Please close this issue if there is no more work remaining.

@douglas-reid
Contributor

I believe this can be closed now.

@sbgreene1307

sbgreene1307 commented Dec 26, 2019

@kyessenov @douglas-reid @gargnupur

We are still seeing this issue on Istio 1.3.5 after a bit of stagnant traffic.

Steps to reproduce:

  1. Use Redis for rate limit quota tracking.
  2. Disable Redis by simply changing the replica count in the StatefulSet to 0.
  3. An internal mesh service-to-service call receives a 429 status code from Mixer.
    • The initial request fails, as do subsequent requests sent after traffic has been stagnant.
  4. Let the network traffic stagnate for anywhere from 10-30 minutes.
  5. Send another request; sometimes it works fine, sometimes it receives a 429.

The "GC forced" messages seem to be triggered by making a request, but they also occur periodically when no traffic is being sent on the service-to-service connection that uses Redis.

istio-system/istio-policy-585c5c8d55-4vwnn[mixer]: GC forced
istio-system/istio-policy-585c5c8d55-4vwnn[mixer]: gc 12029 @1441943.580s 0%: 0.015+26+0.008 ms clock, 0.060+0/23/53+0.033 ms cpu, 12->12->8 MB, 17 MB goal, 6 P
istio-system/istio-policy-585c5c8d55-wpt7z[mixer]: GC forced
istio-system/istio-policy-585c5c8d55-wpt7z[mixer]: gc 11486 @1377082.175s 0%: 0.020+9.1+0.006 ms clock, 0.081+0/9.7/39+0.024 ms cpu, 11->11->8 MB, 17 MB goal, 6 P
istio-system/istio-policy-585c5c8d55-fqjlp[mixer]: GC forced
istio-system/istio-policy-585c5c8d55-fqjlp[mixer]: gc 6229 @747151.685s 0%: 0.034+12+0.006 ms clock, 0.13+0/13/40+0.026 ms cpu, 12->12->8 MB, 16 MB goal, 6 P
istio-system/istio-policy-585c5c8d55-wpt7z[mixer]: scvg9179: 0 MB released
istio-system/istio-policy-585c5c8d55-wpt7z[mixer]: scvg9179: inuse: 15, idle: 41, sys: 56, released: 36, consumed: 20 (MB)
istio-system/istio-policy-585c5c8d55-fqjlp[mixer]: 2019-12-26T14:29:24.622220Z	error	adapters	failed to run quota script: dial tcp 10.0.108.236:6379: connect: connection refused	{"adapter": "redishandler.stage"}
istio-system/istio-policy-585c5c8d55-4vwnn[mixer]: 2019-12-26T14:29:25.129100Z	warn	Requested quota 'requestcountquota' is not configured

@idouba idouba reopened this Feb 14, 2020
@idouba
Member

idouba commented Feb 14, 2020

It seems the proxy sometimes uses its cache and makes a "RESOURCE_EXHAUSTED: Quota is exhausted" decision locally, instead of making a remote call, receiving the UNAVAILABLE status from the Mixer server, and failing open.

[2020-02-09 09:50:57.959][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=true   
[2020-02-09 09:50:59.228][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=false   
[2020-02-09 09:51:00.010][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=false, status=UNAVAILABLE, remote_call=true   
[2020-02-09 09:51:00.030][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=false, status=UNAVAILABLE, remote_call=true   
[2020-02-09 09:51:00.498][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=true   
[2020-02-09 09:51:05.391][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=RESOURCE_EXHAUSTED:Quota is exhausted for: countquota-rest,
[2020-02-09 09:51:07.958][30][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=false   
[2020-02-09 09:51:09.228][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=false   
[2020-02-09 09:51:09.244][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=RESOURCE_EXHAUSTED:Quota is exhausted for: countquota-rest,
[2020-02-09 09:51:10.024][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=true   
[2020-02-09 09:51:12.331][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=RESOURCE_EXHAUSTED:Quota is exhausted for: countquota-rest,
[2020-02-09 09:51:13.832][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=RESOURCE_EXHAUSTED:Quota is exhausted for: countquota-rest,
[2020-02-09 09:51:15.006][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=true   
[2020-02-09 09:51:17.958][30][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=false, status=UNAVAILABLE, remote_call=true   
[2020-02-09 09:51:19.228][30][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=OK, remote_call=false   
[2020-02-09 09:51:19.454][31][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=true, status=RESOURCE_EXHAUSTED:Quota is exhausted for: countquota-rest,
[2020-02-09 09:51:24.276][30][debug][filter] [src/istio/mixerclient/client_impl.cc:125] Quota cache hit=false, status=UNAVAILABLE, remote_call=true   

@kyessenov @douglas-reid @gargnupur
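A minimal Go sketch of the prefetch behavior being described, assuming (this is not the actual mixerclient implementation) that the client prefetches an allocation per window and serves decisions locally until the window expires:

```go
// Sketch of a prefetching quota cache: once the prefetched allocation for the
// current window is used up, denials are served from the cache until the
// window expires, without a remote call that could surface an UNAVAILABLE
// status from the backend.
package prefetchsketch

import (
	"sync"
	"time"
)

type prefetchCache struct {
	mu        sync.Mutex
	remaining int64     // tokens left from the last prefetch
	expiry    time.Time // end of the current prefetch window
}

// Allow returns (allowed, remoteCall). prefetch asks the backend for a new
// batch of tokens and may fail when Redis/Mixer is unreachable.
func (c *prefetchCache) Allow(prefetch func(want int64) (int64, error)) (allowed, remoteCall bool) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if time.Now().Before(c.expiry) {
		if c.remaining > 0 {
			c.remaining--
			return true, false // cache hit, granted locally
		}
		return false, false // cache hit, "RESOURCE_EXHAUSTED" served locally
	}

	granted, err := prefetch(10) // window expired: refresh from the backend
	c.expiry = time.Now().Add(time.Minute)
	if err != nil || granted <= 0 {
		c.remaining = 0
		return false, true // with fail-open, this case should instead allow the request
	}
	c.remaining = granted - 1
	return true, true
}
```

Disabling the quota cache, as later comments describe, forces every decision through the remote path, which explains why it avoids the stale 429s at a large performance cost.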

@fnp927
Author

fnp927 commented Feb 18, 2020

Some of the requests still get a 429 response when Redis restarts

Thanks for the previous work; I've already updated my Mixer to a build that includes the fix for this bug.
After I deployed the new Mixer, it worked fine: when I took Redis down, all requests passed and none of them were limited by Mixer. In other words, the fix really works.
However, when I tried to restart Redis (and I'm sure Redis was down at that time), some of the requests were limited and some of them passed, as if Redis were still up. After the restart, everything worked fine again. The point here is that Mixer may not track the status of Redis in real time, or the code may be running into the prefetch logic.

@ltdeoliveira

I'm also facing the same issue with Istio 1.3.8.

When Redis is up, quotas work properly. If I delete Redis, then all requests are responded to with 429. The Mixer logs show the following:

2020-02-28T17:39:49.310587Z error adapters failed to run quota script: dial tcp: lookup redis-cluster-svc.redis.svc.cluster.local on 10.100.0.10:53: no such {"adapter": "redishandler.my-namespace"}

@wsdlll

wsdlll commented Mar 9, 2020

The same issue with Istio 1.3.0.

Just as @idouba said, it may be caused by the cache check and prefetch.
I tried disabling the quota cache by setting disable_quota_cache=true, and the problem was solved: there were no 429s when Redis restarted.

Please help to check the quota_cache.

@PeterPanHaiHai

I'm also facing the same issue with Istio 1.3.0.

As @fnp927, @wsdlll, and @idouba said above, when I restart Redis, all requests during the restart are responded to with 429. After disabling the quota cache, all requests during the restart are responded to with 200.

Is there any other way to solve this problem besides setting disable_quota_cache=true?

@PeterPanHaiHai

As mentioned above, after setting disable_quota_cache=true, all requests during the Redis restart are responded to with 200, but the performance decline is serious, reaching more than 50%:

| disable_quota_cache | QPS     |
| ------------------- | ------- |
| false               | 8958.45 |
| true                | 3819.49 |

PS: the test command is as follows:
./wrk -t 4 -c 300 -d 120 http://<nginx PodIP>:80

@douglas-reid @gargnupur any help appreciated

@gargnupur
Contributor

@PeterPanHaiHai: apologies for the late reply.
From Istio 1.5 onward, we are suggesting that users use Envoy's rate limiting service. I am not sure we want to invest in improving the performance of older versions.
@mandarjog, @douglas-reid wdyt?

@gargnupur
Contributor

Closing due to no response.
