Support large clusters without throttling #247

Closed
19 tasks done
feiskyer opened this issue Oct 8, 2019 · 15 comments
feiskyer (Member) commented Oct 8, 2019

Is your feature request related to a problem?/Why is this needed

From the Kubernetes cloud provider side, when a cluster grows to 500+ nodes, throttling issues are observed from ARM and CRP in the form of 429 responses. What's worse, the cluster is unable to recover from this state unless we manually shut down the controller manager. This happens for a few reasons:

  • The Azure cloud provider iterates through each VM instance instead of using VMSS.List, so the number of requests grows as O(N) with the number of nodes.
  • The Azure-generated Go SDK retries by itself. However, it does not count 429 responses toward the number of attempts, so it blindly retries when throttled, which keeps us stuck in the throttled state forever.
  • The Azure cloud provider caches VMSS and VM data for 1 minute. However, before issuing a PUT, it must refresh the cache if it is older than 1 minute to make sure it does not overwrite existing changes.
  • The Azure cloud provider needs to refill its caches when kube-controller-manager is restarted. However, if 429s are already occurring, cache initialization won't succeed until ARM accepts those requests.

From the CRP side, when the call rate limit overage is severe, CRP will stop accepting any requests and will block all client calls for the particular operation group until the call rate goes down significantly for some minimum "penalty window". This penalty window is what's returned in the Retry-After header. If a high number of concurrent callers rush to make the same calls again later, they can cause the hard throttling to start again. Exponential back-off on throttling errors is recommended to mitigate the throttling issue.

Describe the solution you'd like in detail

To support large clusters without throttling, the following changes are proposed for the short term (during Kubernetes v1.17):

1) Honor Retry-After headers and get rid of Go SDK silent retries

When the cloud provider reaches the request limit, ARM returns a 429 HTTP status code together with a Retry-After header. The cloud provider should wait the number of seconds indicated by Retry-After before sending the next request. To honor this header, the current Go SDK retry logic should be replaced with our own implementation.

go-autorest added a send decorator via context in #417. The Kubernetes cloud provider needs to adopt it and switch to the cloud provider's own retry logic.
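For illustration only, a minimal sketch of a retry loop that honors Retry-After and counts 429s toward the retry budget; the function name, its parameters and the seconds-only Retry-After parsing are assumptions for this sketch, not the real client code:

```go
// Illustrative only: retry a GET-style ARM request, honoring Retry-After.
package armclient

import (
	"context"
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// doWithRetryAfter sends the request and, on 429, waits for the server-provided
// Retry-After window (falling back to exponential back-off) before retrying.
// Unlike the generated SDK, 429s count toward the retry budget. It assumes an
// idempotent request without a body; a request with a body would need rebuilding.
func doWithRetryAfter(ctx context.Context, client *http.Client, req *http.Request, maxRetries int) (*http.Response, error) {
	backoff := 2 * time.Second
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil
		}
		if attempt >= maxRetries {
			return resp, fmt.Errorf("still throttled after %d attempts", attempt+1)
		}
		// Only the seconds form of Retry-After is handled in this sketch.
		wait := backoff
		if secs, perr := strconv.Atoi(resp.Header.Get("Retry-After")); perr == nil && secs > 0 {
			wait = time.Duration(secs) * time.Second
		}
		resp.Body.Close()
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return nil, ctx.Err()
		}
		backoff *= 2
	}
}
```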

2) Switching to LIST instead of GET for VMSS, VMSSVM and VM

After switching to the LIST API, the cloud provider would invoke LIST VMSS first and then LIST the instances of each VMSS. This way, the request count improves to O(number of VMSS). The VMs are paged by LIST, and VMSS returns 120 VMs with instanceView per page.

When updating the cache for a VMSS, we should also avoid concurrent GETs by adding per-entry locks (an entry being the list of VMs in a VMSS).
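A minimal sketch of such a per-entry lock around a LIST-based refresh; the cache layout, type names and listVMSSVMs hook are illustrative assumptions, not the real implementation:

```go
// Illustrative only: a per-VMSS cache entry guarded by its own lock.
package vmsscache

import (
	"context"
	"sync"
	"time"
)

// VirtualMachineScaleSetVM is a trimmed stand-in for the compute SDK type.
type VirtualMachineScaleSetVM struct {
	Name       string
	InstanceID string
}

// vmssEntry caches the full LIST result for one scale set (one entry per VMSS).
type vmssEntry struct {
	mu        sync.Mutex // per-entry lock: only one refresh per VMSS at a time
	vms       []VirtualMachineScaleSetVM
	expiresAt time.Time
}

type vmssCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]*vmssEntry // keyed by scale set name
	// listVMSSVMs is the paged LIST call; one LIST per VMSS replaces N GETs,
	// so the ARM request count becomes O(number of VMSS), not O(number of nodes).
	listVMSSVMs func(ctx context.Context, vmssName string) ([]VirtualMachineScaleSetVM, error)
}

func newVMSSCache(ttl time.Duration, list func(context.Context, string) ([]VirtualMachineScaleSetVM, error)) *vmssCache {
	return &vmssCache{ttl: ttl, entries: map[string]*vmssEntry{}, listVMSSVMs: list}
}

// getVMs returns the cached VM list for one VMSS, refreshing via LIST when expired.
func (c *vmssCache) getVMs(ctx context.Context, vmssName string) ([]VirtualMachineScaleSetVM, error) {
	c.mu.Lock()
	entry, ok := c.entries[vmssName]
	if !ok {
		entry = &vmssEntry{}
		c.entries[vmssName] = entry
	}
	c.mu.Unlock()

	entry.mu.Lock() // concurrent callers wait here instead of issuing duplicate LISTs
	defer entry.mu.Unlock()
	if time.Now().Before(entry.expiresAt) {
		return entry.vms, nil
	}
	vms, err := c.listVMSSVMs(ctx, vmssName)
	if err != nil {
		return nil, err
	}
	entry.vms, entry.expiresAt = vms, time.Now().Add(c.ttl)
	return vms, nil
}
```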

3) Reuse the outdated caches on throttling

Though the cloud provider already has caches, ARM requests are required to refresh them when they are outdated or have just been updated by a PUT. When the subscription is throttled, all of the caches may be outdated and there is no way to recover until ARM starts accepting requests again.

Note: when a data disk is being attached to a VM, or a VM is being added to a LoadBalancer backend address pool, the cloud provider makes a PUT request for the VM/VMSSVM.

The Azure cloud provider refreshes the cache mainly for three kinds of changes: the power state, the LB configuration and the data disks. It's OK to reuse outdated caches under the assumptions below:

  • The LB configuration is set up during provisioning and rarely changes afterwards. On node scaling (up or down), the LB configuration is already managed by the cloud provider, so the cloud provider already knows what new updates have been made.
  • Power state changes can only be discovered by a new query, but it's OK to treat the node as shut down or removed with some extra delay, because the node would be NotReady anyway (new Pods can't be scheduled onto it).
  • The Azure cloud provider also knows about newly attached AzureDisk PVs. However, if users have attached other data disks out of band, those disks may be detached by the cloud provider. It's OK to assume users won't attach disks manually, because Kubernetes PVs are the encouraged mechanism.

When requests are throttled, the cloud provider retries with its backoff configuration. If it still fails with throttling errors after those retries, the cloud provider extends the cache TTL based on the Retry-After header (which means the outdated caches would still be used for another Retry-After delay).

Reusing the caches works for most cases, but there is one exception: AzureDisk attach/detach failures. A couple of reasons may cause these, and issuing the PUT again with the cached VM would usually still fail. So in this case, the cloud provider invalidates and refreshes the cache.

When throttled, the x-ms-failure-cause header indicates which service is throttling. We should log it so that it's easy to tell why a request was throttled.
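Continuing the cache sketch above, this is roughly how a refresh could fall back to the outdated entry, extend its TTL by the Retry-After window and log x-ms-failure-cause; the throttlingError type and its fields are assumptions for illustration:

```go
// Illustrative only: reuse the outdated entry when ARM is still throttling.
package vmsscache

import (
	"context"
	"log"
	"time"
)

// throttlingError is an assumed error type carrying details of a 429 response.
type throttlingError struct {
	RetryAfter   time.Duration // parsed from the Retry-After header
	FailureCause string        // from the x-ms-failure-cause header
}

func (e *throttlingError) Error() string { return "request throttled by " + e.FailureCause }

// refreshOrReuse tries to refresh one entry; if the LIST is still throttled
// after the client's own retries, it keeps the outdated data and pushes the
// expiry out by the Retry-After window instead of failing the caller.
func (c *vmssCache) refreshOrReuse(ctx context.Context, vmssName string, entry *vmssEntry) error {
	vms, err := c.listVMSSVMs(ctx, vmssName)
	if err == nil {
		entry.vms, entry.expiresAt = vms, time.Now().Add(c.ttl)
		return nil
	}
	if te, ok := err.(*throttlingError); ok {
		// Log which service is throttling (x-ms-failure-cause) and keep the
		// stale entry alive for another Retry-After window.
		log.Printf("throttled by %s, reusing cached data of VMSS %q for another %s", te.FailureCause, vmssName, te.RetryAfter)
		entry.expiresAt = time.Now().Add(te.RetryAfter)
		return nil
	}
	return err
}
```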

4) Extend cache TTL from 1 minute to 10 minutes

To reduce the total number of requests, a default TTL of 10 minutes would be used. The cache TTL would also be configurable via the cloud provider config file (cloud-config), so that customers can easily tune it.
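Purely as an illustration of what the cloud-config knobs could look like (the field names below are placeholders, not the final configuration API):

```go
// Illustrative only: how the cache TTLs could surface in the azure cloud-config.
package azureconfig

// CacheConfig sketches possible cloud-config fields; the names are placeholders,
// not the final configuration API.
type CacheConfig struct {
	// VMSSCacheTTLInSeconds overrides the default 10-minute VMSS cache TTL.
	VMSSCacheTTLInSeconds int `json:"vmssCacheTTLInSeconds,omitempty"`
	// VMSSVirtualMachinesCacheTTLInSeconds overrides the TTL of the per-VMSS VM list cache.
	VMSSVirtualMachinesCacheTTLInSeconds int `json:"vmssVirtualMachinesCacheTTLInSeconds,omitempty"`
	// VMCacheTTLInSeconds overrides the TTL of the standalone VM cache.
	VMCacheTTLInSeconds int `json:"vmCacheTTLInSeconds,omitempty"`
}
```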

Describe alternatives you've considered

Azure Resource Graph (ARG) also supports a much higher query volume; however, it's not considered because a few issues are not yet resolved:

  • Not all data is available there, e.g. instanceView is missing for "microsoft.compute/virtualmachines", and "microsoft.compute/virtualmachinescalesets/virtualmachines" is not supported at all.
  • ARG resources may lag by about 1 minute after an update, while we expect to get the latest data.
  • The API versions for each resource are not defined in ARG, so there may be breaking changes that we are not aware of.

Cross-project improvements

The cloud provider is not the only source of rate limit issues; there are also many add-ons running on the cluster, e.g. cluster-autoscaler and pod identity.

All of those projects should reduce their ARM requests and fix 429 issues. We're planning to implement this as a library in the cloud provider, so that other projects can easily vendor and reuse the same logic.
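As a rough sketch of the client-side, per-client rate limiting such a shared library could offer (the type, defaults and error message are illustrative; the real clients may be organized differently), each Azure client could own a token bucket, e.g. via client-go's flowcontrol package:

```go
// Illustrative only: per-client token-bucket rate limiting for ARM calls.
package armclient

import (
	"fmt"

	"k8s.io/client-go/util/flowcontrol"
)

// rateLimitedClient gives each Azure client (VMSS, VM, LoadBalancer, ...) its own
// token bucket, so one noisy client cannot consume the whole subscription quota.
type rateLimitedClient struct {
	name    string
	limiter flowcontrol.RateLimiter
}

func newRateLimitedClient(name string, qps float32, bucket int) *rateLimitedClient {
	return &rateLimitedClient{
		name:    name,
		limiter: flowcontrol.NewTokenBucketRateLimiter(qps, bucket),
	}
}

// preflight is called before every outgoing request; it fails fast instead of
// queueing when the local budget is exhausted, leaving room for ARM's own limits.
func (c *rateLimitedClient) preflight() error {
	if !c.limiter.TryAccept() {
		return fmt.Errorf("%s: client-side rate limit reached, try again later", c.name)
	}
	return nil
}
```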

Work items

These work items target Kubernetes v1.18 and would be cherry-picked to v1.15-v1.17:

Rewrite the Azure clients to support per-client rate limiting, backoff retries and honoring of Retry-After headers. These items would only be included in v1.18:

feiskyer added the area/provider/azure label on Oct 8, 2019
feiskyer self-assigned this on Oct 8, 2019
feiskyer (Member, Author) commented Oct 8, 2019

/cc @craiglpeters

pawelpabich commented:

We've also suffered from this problem when we had 9+ small clusters with autoscaler. Each cluster had no more than 5 VMs.

sylus commented Oct 24, 2019

We are also seeing this issue in AKS Engine. We documented the whole issue, base reproduction case, current environment, and attempted fixes over at:

Azure/aks-engine#1860 (comment)

Kubernetes Version: v1.15.5 (1 master, 4 nodes)
AKS Engine: v0.42.2
Kernel Version: 5.0.0-1023-azure
OS Image: Ubuntu 18.04.3 LTS
Date Issue Started: Last week (Oct 15, 2019), on clusters that had been running without issue for 3 months

devigned (Contributor) commented:

Looks like go-autorest has respected Retry-After headers via autorest.DelayWithRetryAfter since v12.2.0 (https://github.com/Azure/go-autorest/blob/740293c019d8314ce3378d456b4327fa646297e6/CHANGELOG.md#bug-fixes-5). We might just be able to update the go.mod in cloud-provider-azure; it's currently at v0.9.0 for autorest.

aramase (Member) commented Oct 30, 2019

In an effort to reduce the # of calls that are made in an idle cluster, we have merged kubernetes/kubernetes#83685.

What does this PR change?

This PR introduces 2 read types for data in the cache:

  • Default read type - if the cached data is expired, new data is fetched from the cloud, stored in the cache and returned to the caller.
  • Unsafe read type - if the data exists in the cache (expired or not), it's returned to the caller.

This segregation is done because some calls in the cloud provider are safe to serve from stale cache data.

For instance, in a cluster with 1 VMSS, 100 nodes and 100 disks, the majority of calls from the cloud provider are made as part of the reconciliation loop run by the volume controller. This loop runs every minute to ensure all mounted disks are attached. That means a call is made every minute for each disk to verify it's attached, which in this scenario accounts for ~6k-7k calls/h.

Every time a disk is attached/detached, the cache entry for the VM instance is invalidated. The next time the reconcile loop runs, fresh data is fetched and we can validate that the state looks fine after the initial attach/detach. This means it's safe to read stale data from the cache in the disk reconcile loop. With the change introduced in the PR, the total number of calls to ARM for an idle cluster with 100 disks drops from the previous ~7k calls/h to 100 calls/h.
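For illustration only, a minimal sketch of the two read types (the names below are placeholders; the actual cache in kubernetes/kubernetes#83685 differs in detail):

```go
// Illustrative only: a timed cache with "default" and "unsafe" read types.
package azcache

import (
	"sync"
	"time"
)

// ReadType selects how expired data is treated on a cache read.
type ReadType int

const (
	// Default refetches from the cloud when the entry has expired.
	Default ReadType = iota
	// Unsafe returns whatever is cached, expired or not, and only goes to the
	// cloud when the entry is missing entirely.
	Unsafe
)

type entry struct {
	data      interface{}
	expiresAt time.Time
}

type TimedCache struct {
	mu     sync.Mutex
	ttl    time.Duration
	store  map[string]*entry
	getter func(key string) (interface{}, error) // fetches fresh data from Azure
}

func NewTimedCache(ttl time.Duration, getter func(string) (interface{}, error)) *TimedCache {
	return &TimedCache{ttl: ttl, store: map[string]*entry{}, getter: getter}
}

// Get serves a key according to the requested read type. The disk reconcile
// loop can use Unsafe reads because attach/detach invalidates the entry anyway,
// which is what cuts an idle 100-disk cluster from ~7k to ~100 ARM calls/h.
func (c *TimedCache) Get(key string, rt ReadType) (interface{}, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.store[key]
	if ok && (rt == Unsafe || time.Now().Before(e.expiresAt)) {
		return e.data, nil
	}
	data, err := c.getter(key)
	if err != nil {
		return nil, err
	}
	c.store[key] = &entry{data: data, expiresAt: time.Now().Add(c.ttl)}
	return data, nil
}
```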

Behavioral changes

  • Since the reconcile loop reads from the stale cache, it's recommended that users don't make any changes to the disks outside of the cluster primitives (i.e. use PVCs).
  • All other disk operations (attach/detach) continue to use the default read type.
  • GetInstanceTypeByNodeName, GetZoneByNodeName and GetZone now read stale data from the cache, as these values don't change.

tdihp commented Nov 14, 2019

Is action 2 (Switching to LIST instead of GET for VMSS, VMSSVM and VM) applicable to the VMAS implementation of cluster-autoscaler?

feiskyer (Member, Author) commented Nov 15, 2019

@tdihp we're planning to do that, but it hasn't been added to cluster-autoscaler yet.

nilo19 (Contributor) commented Nov 30, 2019

/cc

jcrowthe commented Dec 2, 2019

Does kubernetes/autoscaler#2527 address the usage of LIST instead of GET in the cluster-autoscaler?

feiskyer (Member, Author) commented:

/milestone v1.18

k8s-ci-robot modified the milestones: v1.17, v1.18 on Dec 17, 2019
nilo19 (Contributor) commented Dec 17, 2019

> Does kubernetes/autoscaler#2527 address the usage of LIST instead of GET in the cluster-autoscaler?

Yeah, I think so.

feiskyer (Member, Author) commented:

Status update: validation results on Kubernetes v1.18

Validation scenario: scaling one VMSS up and down between 1 node and 100 nodes.

Without the fix (v1.18.0-alpha.1):

  • HighCostGetVMScaleSet3Min: allowedRequestCount 180 and measuredRequestCount 1044
  • HighCostGetVMScaleSet30Min: allowedRequestCount 900 and measuredRequestCount 6404
  • it's hard to recover because of retries in kube-controller-manager, and "measuredRequestCount" quickly climbs into the thousands (e.g. 9000)
  • many scaled-down nodes are stuck in the NotReady state because the node lifecycle controller is unable to determine their state

With the fix (latest master branch):

  • No more HighCostGetVMScaleSet3Min and HighCostGetVMScaleSet30Min errors, since Retry-After headers are honored (the delay is usually within 2 minutes)
  • The number of throttled requests drops to fewer than 10, and most requests succeed
  • No more cluster hanging issues; clients retry after the Retry-After delay expires

Here is a chart of the number of throttled and successful requests:

[chart: v1.18 comparison of throttled vs. successful request counts]

feiskyer (Member, Author) commented Mar 6, 2020

Since the Kubernetes code freeze has started, the unit tests part will be moved to v1.19 (tracked by #310).

feiskyer (Member, Author) commented Mar 6, 2020

Most items have been done, and the remaining tests (including e2e and unit tests) are tracked by separate issues.

/close

k8s-ci-robot (Contributor) commented:

@feiskyer: Closing this issue.

In response to this:

> Most items have been done, and the remaining tests (including e2e and unit tests) are tracked by separate issues.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

enxebre added a commit to enxebre/installer that referenced this issue Mar 10, 2020
Client-side rate limiting has been problematic for fresh installs and scaling operations [1].

Azure ARM throttling is applied at the subscription level, so client-side rate limiting helps prevent clusters sharing the same subscription from disrupting each other.
However, there are lower limits that apply at the SP/tenant and resource level, e.g. ARM limits the number of write calls per service principal to 1200/hour [2]. Since we ensure dedicated SPs per cluster via the Cloud Credential Operator, it should be relatively safe to disable client-side rate limiting.

Orthogonally to this, some improvements to the rate limiting and back-off mechanisms are being added to the cloud provider [3].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1782516.
[2] https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling
[3] kubernetes-sigs/cloud-provider-azure#247
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this issue May 19, 2020
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this issue Jun 4, 2020