Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid using tag filters for EC2 API where possible #76749

Merged
merged 1 commit into from May 7, 2019
Merged

Avoid using tag filters for EC2 API where possible #76749

merged 1 commit into from May 7, 2019

Conversation

mcrute
Copy link
Contributor

@mcrute mcrute commented Apr 18, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
For very large clusters these tag filters are not efficient within the EC2 API and will result in rate limiting. Most of these queries have filters that are targeted narrowly enough that the elimination of the tags filter will not return significantly more data but will be executed more efficiently by the EC2 API and reduce the chance of throttling.

Additionally, some API wrappers did not support pagination despite the underlying API calls being paginated. This change adds pagination to prevent accidentally truncating the returned results.

This change, once approved, will also need back-ported to 1.13 and 1.14.

Which issue(s) this PR fixes:
Fixes #76747

Does this PR introduce a user-facing change?:

Limit use of tags when calling EC2 API to prevent API throttling for very large clusters

/sig aws
/sig cloud-provider
/priority critical-urgent
/assign @mcrute @justinsb

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/aws priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 18, 2019
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider labels Apr 18, 2019
baseFilters := []*ec2.Filter{newEc2Filter("instance-state-name", "running")}

filters := c.tagging.addFilters(baseFilters)
di, err := c.describeInstances(filters)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you pls explain more in detail why this is going to help?

I think aws limits rate by number of calls and looks like this is going to double the DescribeInstances call count when provisioning volume

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two rate limits at play here; the rate limit for DescribeInstances and an overload protection rate limit for some downstream services used by DescribeInstances. The previous tag query was particularly inefficient and resulted in the triggering of the downstream overload protection limit which can not be raised. It is significantly more efficient to make two DescribeInstances calls with a more simple tag filter than one with the combined filter. It should be possible to raise the DescribeInstances rate limit if a user still hits that limit with this update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mcrute

I double checked the code, since this function is only used in volume creation, and it is just looking for a set of zones that has running instances from the cluster, would using kube client (aws.Cloud already has this) to fetch such information be enough? This will not need to make cloud provider call at all. We can start with List from api server directly and look for optimization by listing from cache.

It is true that there could be a race between node die and and the node actually disappear from api server, but under such condition, there will be more high level reconciliation triggered so we are good.

wanna see what other people thinks about it

Copy link
Contributor

@zhan849 zhan849 Apr 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like getting zone from kubernetes work - we tried #76976 (same logic but slightly different function calls in the version we use (1.12.3)), and before we were able only to create 1 PVC every ~5sec, and now with the reduced calls, we are able to create 50 PVC all at a time and get all volumes bound within 30sec, all pods started running within 90sec.

// There are no "or" filters by key, so we look for both the legacy and new key, and then we have to post-filter
f := newEc2Filter("tag-key", TagNameKubernetesClusterLegacy, t.clusterTagKey())

f := newEc2Filter("tag-key", t.clusterTagKey())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do we use "tag-key" as filter name here but "tag:{name}" filter name in addLegacyFilters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this backwards... will swap the lines and update.

// 1.5 -> 1.6 clusters and exists for backwards compatibility
//
// This lets us run multiple k8s clusters in a single EC2 AZ
func (t *awsTagging) addLegacyFilters(filters []*ec2.Filter) []*ec2.Filter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you are removing quite a few legacy filters from various places including load balancer, routes, and subnets, is this going to change the behavior and cause backward compatibility issues?

Copy link
Contributor Author

@mcrute mcrute Apr 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing these filters may result in a few more items being returned from the API calls but those extra items will be filtered out in the existing filtering loops below each change in this PR. The trade-off is that these queries are much more efficient within the EC2 API and are less likely to trigger internal rate limits as was the case with DescribeInstances. Given we're already doing a second pass of filtering this code is entirely backwards compatible.

Also worth noting is that many of the existing filters are incredibly specific and should result in a very few items being returned so the additional tag filtering within the API was redundant.

Copy link
Contributor

@zhan849 zhan849 Apr 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, looks good with recent version, @justinsb could you pls check this as well to see if clusters provisioned by legacy kube up (not sure if we still care about it) and old versions of kops are fine.

Copy link
Member

@micahhausler micahhausler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 23, 2019
@zhan849
Copy link
Contributor

zhan849 commented May 7, 2019

since there is no objections from community, shall we can proceed and merge it in? @mcrute @micahhausler

For very large clusters these tag filters are not efficient within the
EC2 API and will result in rate limiting. Most of these queries have
filters that are targeted narrowly enough that the elimination of the
tags filter will not return significantly more data but will be executed
more efficiently by the EC2 API.

Additionally, some API wrappers did not support pagination despite the
underlying API calls being paginated. This change adds pagination to
prevent truncating the returned results.
@micahhausler
Copy link
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mcrute, micahhausler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit f5d958a into kubernetes:master May 7, 2019
@k8s-ci-robot
Copy link
Contributor

@mcrute: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-100-performance c8edfa2 link /test pull-kubernetes-e2e-gce-100-performance

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mcrute mcrute deleted the ec2-rate-limit-fix branch May 20, 2019 09:30
justinsb added a commit to justinsb/kubernetes that referenced this pull request May 31, 2019
Copy-pasted accidentally in kubernetes#76749
k8s-ci-robot added a commit that referenced this pull request Aug 31, 2019
…-upstream-release-1.14

Automated cherry pick of #76749: Avoid using tag filters for EC2 API where possible
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tag expression results in EC2 API throttling
4 participants