
How do I set up RPS limit #670

Closed
jawlitkp opened this issue Mar 7, 2019 · 44 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jawlitkp

jawlitkp commented Mar 7, 2019

No description provided.

@rramkumar1
Contributor

@jawlitkp FWIW, we currently set the RPS per backend to 1 for both the InstanceGroup and NEG cases [1].

We can definitely surface this setting to users via the BackendConfig CRD [2], but note that the setting only makes sense when using Ingress with NEGs [3]. If you are using InstanceGroups, the setting does not behave the way you would expect.

[1] https://github.com/kubernetes/ingress-gce/blob/master/pkg/backends/ig_linker.go#L62
[2] https://github.com/kubernetes/ingress-gce/blob/master/pkg/apis/backendconfig/v1beta1/types.go
[3] https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing
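For reference, the default described above ends up as a backend entry on the GCE backend service. A rough sketch of the NEG case is below; the group URL is a placeholder, and the shape matches the gcloud backend-services output quoted later in this thread.

  # Backend entry the controller programs for a NEG-backed service (sketch; placeholder group URL)
  - balancingMode: RATE
    capacityScaler: 1.0
    group: https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/networkEndpointGroups/<neg-name>
    maxRatePerEndpoint: 1.0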

@rramkumar1 rramkumar1 changed the title How do I set up RPS limit ,currently by default it is being set up to 100000000000000 How do I set up RPS limit Mar 11, 2019
@rramkumar1 rramkumar1 added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 11, 2019
@jskeet

jskeet commented Mar 16, 2019

I've manually updated the RPS in my GCE Load Balancer to 100 for each service. (Currently NEG is out of scope as my cluster doesn't have VPC enabled, and I'd rather not go down that path right now if I don't need to.)

Will that RPS configuration be "sticky", or will it be lost next time I update the ingress, e.g. to add another domain?

@bowei
Member

bowei commented Mar 17, 2019

What is your use case for setting an RPS limit? Traffic is double-balanced: it comes in to a node first, then to a pod, so it is balanced across nodes first. You will probably get worse balancing behavior with the limit than with setting it to 1 (i.e. a completely uniform spread across all nodes).

@jskeet

jskeet commented Mar 17, 2019

Ah, I see. It was mostly just to remove the warning when looking at monitoring, to be honest.

@lentzi90

The documentation for the RPS limit is quite confusing when combined with how GKE is using it. If I understood the comments above correctly, it is not really a capacity, but rather a weight to make sure the load is spread evenly over all instances. This makes sense, but I think the documentation should also mention this. Maybe it should even be renamed and the warning removed from the cloud console. (I can open a new issue for this.)

Perhaps more severe is the following. What happens if we use multiple regions? With 1 RPS capacity, would the load be spread uniformly over all regions or would requests go to the closest region even if "usage is at capacity" (preferred since it is not really capacity)?

This seems to indicate that requests would actually go to other regions if there is no available capacity:

  1. If the closest instances to the user have available capacity, then the request is forwarded to that closest set of instances.
  2. Incoming requests to the given region are distributed evenly across all available backend services and instances in that region. However, at very small loads, the distribution may appear to be uneven.
  3. If there are no healthy instances with available capacity in a given region, the load balancer instead sends the request to the next closest region with available capacity.
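To make the spillover behavior concrete under the 1-RPS default (the numbers here are purely illustrative): with maxRatePerEndpoint = 1 and, say, 30 endpoints in the closest region, the load balancer considers that region "at capacity" at roughly 30 RPS, so additional requests become eligible to spill over to the next closest region even though the pods themselves may be nowhere near their real limit.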

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 19, 2019
@lgelfan

lgelfan commented Aug 25, 2019

We recently started testing GCE ingress / HTTP(S) LB, as all our others are nginx-ingress, and noticed this issue. It would certainly be good to clarify the MAX=1 (per instance) default and whether there are use cases for increasing it to improve performance. I'm not sure how using sessionAffinity might affect the optimal setting. We are using NEG, so I assume from the previous comments that's how it was set, but it's not clear if increasing it would improve performance. We are using websockets for this app and see a fair amount of 502 errors, so I was wondering if that might have anything to do with it.

We increased the number of instances to more than double what we had for the nginx-ingress setup, and that did help. It increased the MAX=n value due to there being more instances per zone, but again, I don't know if it's the increased application capacity or the fact that this (effectively) changed the RPS setting. While there were a few long-running connections (websocket and long-polling), overall the traffic was pretty low and CPU and memory usage were minimal (Node.js app).
The backend-services output includes:

- balancingMode: RATE
  capacityScaler: 1.0
  group: https://www.googleapis.com/compute/v1/projects/xxx
  maxRatePerEndpoint: 1.0

I would be interested in any updates or clarification.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 25, 2019
@Jannis

Jannis commented Oct 2, 2019

We're using Kubernetes on Google Cloud, with the GCE ingress for load balancing. As far as I can see, there is currently no way to configure the RPS per instance/group in BackendConfig. It would be fantastic if that were possible, so that you don't have to use the Google Cloud Platform console to configure it.

Am I maybe missing something, and is configuring this already possible through Kubernetes?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mheffner

Was there a fix here? I see this got closed due to inactivity, but I didn't see a followup.

@marcin-ptaszynski

marcin-ptaszynski commented Jul 1, 2020

Hi @rramkumar1, I've been using GKE with NEGs, and so far I was able to manually set the BackendService maxRatePerEndpoint and the setting persisted. Now, however, it seems the setting is reverted back to 1 RPS/endpoint by the controller after a few minutes. Is there any way to work around this?

@rramkumar1
Contributor

@marcin-ptaszynski This is not something that is surfaced today in any of our APIs.

cc @freehan to see if there are any workarounds

@rramkumar1 rramkumar1 reopened this Jul 6, 2020
@rramkumar1 rramkumar1 removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 6, 2020
@freehan
Contributor

freehan commented Jul 6, 2020

@marcin-ptaszynski Can you check if the backend service contains any beta or alpha feature? We recently fixed a bug where the NEG linker would refresh the backends when an alpha/beta feature is enabled on the backend service (#1162). I think the fix will be included in the latest rollout in the GKE rapid channel.

@marcin-ptaszynski

@freehan, thank you. We're using gRPC and our service annotations look like:

  annotations:
    beta.cloud.google.com/backend-config: '{"ports": {"http2":"grpc-backend-config"}}'
    cloud.google.com/app-protocols: '{"http2":"HTTP2"}'
    cloud.google.com/neg: '{"ingress": true}'

So it seems we're hitting all of the beta features.

In any case, exposing a MaxRPS config via the BackendConfig CRD would be great to have, as it (together with the managed certificates operator) would allow automating the full LB lifecycle from GKE.
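To make the ask concrete, here is a purely hypothetical sketch of how such a knob could look in the BackendConfig CRD. The maxRatePerEndpoint field below does not exist in the current API; it only illustrates the request, reusing the grpc-backend-config name from the annotations above.

  apiVersion: cloud.google.com/v1
  kind: BackendConfig
  metadata:
    name: grpc-backend-config
  spec:
    # Hypothetical field, NOT part of the current BackendConfig API:
    # per-endpoint request rate used for RATE balancing on the backend service.
    maxRatePerEndpoint: 100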

@williambrode

It's worth noting that if you can't change the MaxRPS then Session Affinity is virtually worthless. Session Affinity only works when the endpoints/instances aren't at max load. I might open a separate issue that references this for clarity on the user-facing issue.
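For context, session affinity itself is configurable through BackendConfig today. A minimal sketch of that kind of config (name and TTL are placeholders), whose usefulness is what the fixed 1-RPS cap undermines:

  apiVersion: cloud.google.com/v1
  kind: BackendConfig
  metadata:
    name: affinity-backend-config
  spec:
    sessionAffinity:
      affinityType: "GENERATED_COOKIE"
      affinityCookieTtlSec: 3600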

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@bowei
Member

bowei commented Feb 1, 2021

/reopen

@k8s-ci-robot
Contributor

@bowei: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 1, 2021
@bowei
Member

bowei commented Feb 1, 2021

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 1, 2021
@grzegorz655

This is still an issue. Is there any way to add balancing-mode and max RPS to BackendConfig with NEG?

@bowei
Member

bowei commented Mar 3, 2021

Yes, we are looking at pulling it from our backlog; the most likely avenue is an annotation on the Service.

@ivansenic

Can anyone clarify what Max RPS means in this context? Is it really the maximum requests per second allowed to be executed against the backend? That would be incredibly stupid.

However, we had a traffic spike over the weekend and indeed there were a bunch of 502s, most likely coming from the ingress. The GKE backends were healthy the whole time and perfectly fine, using almost no resources at all, with no restarts either. Can this be the reason? And what's the logic for setting this to 1 by default?

One of our services currently handles 165 RPS and is labeled with "The usage is at capacity" and a nice ⚠️ sign. Why the heck are we paying for this if it gives us such ridiculous defaults? Increasing it manually did remove the warning, but that is not the way we want to manage it. Fix this!

@williambrode

@ivansenic no, it doesn't mean it will refuse requests if you go beyond the limit. It's used for load balancing: if you have 2 endpoints in your NEG and a max of 10 RPS, and one endpoint starts getting more than 10 RPS, it will start routing requests to the other one. With 1 RPS on each endpoint it just means distributing the requests as evenly as possible.

If you use a global multi-cluster load balancer it will also sum the total RPS of a region and route to other regions if it goes beyond the max. So in that case you really don't want 1 RPS max because it would start routing your US-west traffic to your Australia cluster for example (I know from experience).

@bowei
Member

bowei commented May 24, 2021

Max rps means try to fill backends up to max rps before spilling over; i.e. if you have 5 backends with max RPS / endpoint = 10, they will each get 10 rps before getting more. No requests will be dropped -- the only effect is on how the traffic is distributed.

@ivansenic

Ok, thanks. Sorry for the confusion.

@ben-xo

ben-xo commented May 28, 2021

We have also hit this issue. Is there a way to attach a maxRatePerEndpoint to an Ingress in GKE yet? (The issue, as above, is that each backend can handle about 12 RPS, but when the pods are unevenly distributed by zone, Google will round-robin across them since it doesn't know where there is free capacity, despite knowing that the backends are unevenly distributed.)

@williambrode

@ben-xo I'm confused about what you mean. The load balancer should be able to balance directly to each pod in GKE, so I'm not sure why uneven distribution by zone would be a problem.

If you are using the autoneg-controller, then you set the maxRatePerEndpoint via an annotation on the service:

anthos.cft.dev/autoneg: '{"name":"autoneg_test", "max_rate_per_endpoint":1000}'
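For context, a rough sketch of where that annotation sits on a Service; the name, selector, ports, and the standalone-NEG annotation are assumptions for illustration, since autoneg manages standalone NEGs rather than Ingress-managed ones:

  apiVersion: v1
  kind: Service
  metadata:
    name: autoneg-test  # placeholder name
    annotations:
      # Standalone NEG annotation (assumed setup for the autoneg-controller).
      cloud.google.com/neg: '{"exposed_ports": {"80":{}}}'
      anthos.cft.dev/autoneg: '{"name":"autoneg_test", "max_rate_per_endpoint":1000}'
  spec:
    selector:
      app: my-app  # placeholder selector
    ports:
      - port: 80
        targetPort: 8080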

@abrahammartin

Is there any ETA for this? We currently have to change this manually via the UI instead of via a K8s manifest, which is far from ideal.

@cparedes

cparedes commented Sep 9, 2021

Running into this too. Would love to be able to set this programmatically with either BackendConfig or a service annotation

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 8, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 7, 2022
@ben-xo

ben-xo commented Jan 24, 2022

/remove-lifecycle stale

@ben-xo

ben-xo commented Jan 24, 2022

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 24, 2022
@swetharepakula
Member

We do not plan to add this feature to Ingress. It is available with the GKE Gateway API implementation, so please use the Gateway API for this feature.
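For anyone landing here later: with the GKE Gateway implementation, the per-endpoint capacity is expected to be set on the backend Service itself. A minimal sketch under the assumption that the annotation key is networking.gke.io/max-rate-per-endpoint; verify the exact key and semantics against the current GKE Gateway traffic management documentation.

  apiVersion: v1
  kind: Service
  metadata:
    name: my-backend  # placeholder name
    annotations:
      # Assumed capacity annotation read by the GKE Gateway controller;
      # check the GKE docs for the exact key and supported values.
      networking.gke.io/max-rate-per-endpoint: "100"
  spec:
    selector:
      app: my-backend
    ports:
      - port: 80
        targetPort: 8080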

@swetharepakula swetharepakula closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 14, 2022