Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloud provider AWS library should query instance by ID when possible #78140

Merged
merged 1 commit into from Jul 12, 2019

Conversation

@zhan849
Copy link
Contributor

commented May 20, 2019

What type of PR is this?
/kind feature
/sig aws
/sig cloud-provider
/sig scalability

What this PR does / why we need it:
This PR changed the implementation of getInstanceByNodeName() in aws library to query instance from provider using instance id when possible, since it is much more efficient to get instance by ID in aws than run a filter. This PR checks if it is able to get provider id from v1.Node, if possible, it will get instance by node id, or it will fall back to running the query

We tested internally wit batch PVC provisioning benchmark and here are the improvements shown that makes k8s more scalable in aws environment:

  • Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)
  • Latency of each DescribeInstance call also got reduced
  • Before this change, peak DescribeInstances QPS grows linearly and ~= batch size, with this change, peak DescribeInstances QPS stays stably at ~10 calls/sec

Which issue(s) this PR fixes:
Fixes #78136

This is also the first step towards fixing #78409 , by introduce informer-fed node cache to aws library.

Special notes for your reviewer:
Regarding the tradeoff of offloading the workload to kubernetes:

  1. Node querying has limited frequency, mostly during node controller reconciliation, and the only bursts are during batch infrastructure provisioning (i.e. create a batch of 100 PVC with ebs, aggressive autoscaling by adding 100 nodes to cluster, etc). Get() calls are also relatively cheap. Therefore, load added to api server is limited, and can be optimized further by introducing informer interface to aws cloud provider

  2. Usually in production, each aws account would have multiple Kubernetes clusters and other non-k8s clusters, but aws related api call throttles are account-wide, so that is to say, if one kubernetes cluster is making inefficient call to cloud provider, other k8s/non-k8s clusters will be affected as well. so it worth the tradeoff and affects of burst in k8s calls can be limited to k8s cluster level, with proper retry mechanism (which we already have) control plane can make progress eventually

Does this PR introduce a user-facing change?:

none

Release note:

Optimize EC2 DescribeInstances API calls in aws cloud provider library by querying instance ID instead of EC2 filters when possible

/assign @zhan849 @mcrute @micahhausler
/cc @justinsb @jsafrane

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented May 20, 2019

Hi @zhan849. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zhan849 zhan849 force-pushed the zhan849:aws-get-instance-by-id branch from f506114 to 20c2c4f May 21, 2019

@k8s-ci-robot k8s-ci-robot added size/M and removed size/S labels May 21, 2019

@mcrute

This comment has been minimized.

Copy link
Member

commented May 21, 2019

/ok-to-test

@mcrute

This comment has been minimized.

Copy link
Member

commented May 21, 2019

@zhan849 could you please add some release notes as well... something along the lines of (feel free to re-word this):

reduce the number of EC2 DescribeInstances API calls by querying the instance ID from the API server
@mcrute

This comment has been minimized.

Copy link
Member

commented May 21, 2019

/priority important-soon

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Jun 28, 2019

@zhan849 zhan849 force-pushed the zhan849:aws-get-instance-by-id branch from 631e65b to 37b59d2 Jun 28, 2019

@k8s-ci-robot k8s-ci-robot added size/L and removed size/M labels Jun 28, 2019

@zhan849 zhan849 force-pushed the zhan849:aws-get-instance-by-id branch from 37b59d2 to 1c94382 Jun 28, 2019

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

/retest

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

added informer-fed node cache to this PR, added tests. all checks passed.
@jsafrane @justinsb @mcrute

@jsafrane

This comment has been minimized.

Copy link
Member

commented Jul 1, 2019

lgtm-ish, but I'd prefer second pair of yes on this. @justinsb @gnufied @liggitt

}

node, err := c.nodeInformer.Lister().Get(string(nodeName))
if err != nil {

This comment has been minimized.

Copy link
@gnufied

gnufied Jul 1, 2019

Member

Does this require a RBAC policy change so as AWS cloudprovider can list nodes?

This comment has been minimized.

Copy link
@zhan849

zhan849 Jul 1, 2019

Author Contributor

Good question! I checked that after v1.15, we have deprecated auto-provisioned cloud provider role, and have user to create them #66635

I will modify the release note about the face that we need "list" permission now.

This comment has been minimized.

Copy link
@zhan849

zhan849 Jul 1, 2019

Author Contributor

Also, @justinsb we might want to modify kops in this case as well? kubernetes/kops#6899

This comment has been minimized.

Copy link
@gnufied

gnufied Jul 1, 2019

Member

I do not remember if we need a RBAC rule if we are using a shared informer. If we do not - we are probably fine. We need to confirm that...

This comment has been minimized.

Copy link
@zhan849

zhan849 Jul 2, 2019

Author Contributor

actually @justinsb I did some tests and verified that we do not need additional RBAC rules for system:aws-cloud-provider to use shared informer. Our clusters are provisioned using kops and all roles contain default values

@zhan849 zhan849 force-pushed the zhan849:aws-get-instance-by-id branch from 1c94382 to 6ae7620 Jul 1, 2019

@gnufied

This comment has been minimized.

Copy link
Member

commented Jul 1, 2019

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jul 1, 2019

Two more questions:

fix kubelet log flushing issue in azure disk

Why is this line in PR description?

Reduced more than 90% DescribeInstance calls (1638 calls including lots of retries -> 116 calls all successful with batch size 50)

I understand fetching by instance-id is more efficient but how does this PR reduces number of DescribeInstance calls? Can you explain more?

thanks @gnufied
for question 1 - fixed, might have been some mis-copied stuff.
for question 2 - it's mostly about retries, in practice, making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice), but actual amount of gain would depend on other activity in the same aws account

@gnufied

This comment has been minimized.

Copy link
Member

commented Jul 1, 2019

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jul 1, 2019

making describing call with filters instead of ids at a high frequency are more likely to trigger throttling or 500+ return code, and result in long tail of backoff and retries (lots of them in our practice).

Is there any AWS documentation for this?

@gnufied I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

@gnufied

This comment has been minimized.

Copy link
Member

commented Jul 1, 2019

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jul 2, 2019

I cannot find any - higher failure rate for filter is just our observation and is likely to be account specific

Okay, I think the change is reasonable. I would like us to confirm RBAC rule change, whether we need it or not.

@gnufied I did some test and verified that we do not need additional RBAC for aws:cloud-provider to shared informer

@gnufied

This comment has been minimized.

Copy link
Member

commented Jul 2, 2019

/lgtm

@gnufied

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

I think we had enough opportunity for other approves to take a look at it.

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, zhan849

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jul 12, 2019

/retest

1 similar comment
@zhan849

This comment has been minimized.

Copy link
Contributor Author

commented Jul 12, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit 416a717 into kubernetes:master Jul 12, 2019

23 checks passed

cla/linuxfoundation zhan849 authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.