Decrease dynamic provisioning time by 1.5 seconds #2021

Merged

Conversation

AndrewSirenko
Contributor

@AndrewSirenko AndrewSirenko commented Apr 23, 2024

Is this a bug fix or adding new feature?
Improvement

What is this PR about? / Why do we need it?
Cut median dynamic provisioning latency nearly in half (from ~3.7 to ~2 seconds) by polling EBS volume creation more aggressively.

Today, when dynamically provisioning a volume via our CreateVolume RPC, we wait 3s after the EC2 CreateVolume call before we start polling DescribeVolumes. Based on testing in us-west-2, ap-northeast-1, and us-east-1, both the p10 and the median time between calling EC2 CreateVolume and the volume being created are ~1.5s.

Polling every 3 seconds means that a typical CreateVolume RPC takes ~3.5 seconds (with a worst case of 6+ seconds roughly once every hundred volumes). In this PR, we use a shorter initial delay and polling interval. We also switch to an exponential backoff to decrease the likelihood of being rate-limited on DescribeVolumes if volume creation time slows down.

What testing is done?

Measured the seconds between the first start and the first success of the external-provisioner CreateVolume RPC for 100 PVCs launched across 100 pods on a 30-node cluster. Repeated 3 times for each combination of parameters.

Final column is what we went for in this PR.

| | 3s initial sleep; 3s poll duration (today's performance) | 1.5s initial sleep; 1s poll duration | 1s initial sleep; exponential backoff with 0.75s initial | 1s initial sleep; exponential backoff with 0.5s initial | 1.5s initial sleep; exponential backoff with 0.5s initial | 1.25s initial sleep; exponential backoff with 0.5s initial |
|---|---|---|---|---|---|---|
| p10 (s) | 3.52 | 1.90 | 1.76 | 1.73 | 1.93 | 1.76 |
| p50 (s) | 3.76 | 2.19 | 2.52 | 2.57 | 2.19 | 2.02 |
| p90 (s) | 3.85 | 2.81 | 2.99 | 2.86 | 2.60 | 2.85 |
| p95 (s) | 3.86 | 3.25 | 3.11 | 3.63 | 3.12 | 3.09 |
| p99 (s) | 6.66 | 5.31 | 4.37 | 3.94 | 4.40 | 4.01 |
| Avg number of DescribeVolumes (DV) calls | 27 | 41 | 30 | 34 | 33 | 32 |

NOTE: Tested with a 500ms maxDelay (instead of 1s) for the describeVolume batcher, because we will decrease that value in a future parameter-tuning PR. The maxDelay value mostly affected p90 and p95 (I presume the affected calls were the first DescribeVolume calls in every batch).

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 23, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 23, 2024

Code Coverage Diff

This PR does not change the code coverage

@torredil
Member

/retest

Member

@torredil torredil left a comment

Great work! Thank you for digging into this and for the efficiency improvement; this PR is very significant 💯

Two review threads on pkg/cloud/cloud.go (outdated, resolved).
@ConnorJC3
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2024
Member

@torredil torredil left a comment

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2024
@AndrewSirenko
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot merged commit 48b2755 into kubernetes-sigs:master Apr 25, 2024
19 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.