
When a lot of persistentvolumes are created together, POST persistentvolume request latency grows significantly #87808

Open · mborsz opened this issue on Feb 4, 2020 · 6 comments

Labels: kind/bug, lifecycle/frozen, sig/scalability, sig/storage

@mborsz (Member) commented on Feb 4, 2020:

What happened:
I was running a scale test in which I created ~8 PVs per second for some period of time. I observed that POST persistentvolume latency started growing.

I debugged this already:

What you expected to happen:

POST persistentvolume latency to be ~constant during the test.

This can be achieved by:

  1. Removing the cloud.GetAutoLabelsForZonalPD call from https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gcepd/gce_util.go#L188 (see the first sketch after this list):
    • this way, there is no API call that can fail after the object has been created;
    • since we have just successfully created the PD, we already know all of its parameters (zone, region, and potentially other disk properties in the future), we can use that information to compute the labels, and there is no point in validating that the disk still exists.
  2. Implementing a proper retry strategy for the cloud.GetAutoLabelsForPD calls so that we do not shift load onto kube-apiserver when kube-controller-manager is overloaded (see the retry sketch further below).
  3. (Partially) simply passing the zone argument at https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/gcepd/gce_util.go#L188 would reduce the number of calls there 3-4x for zonal disks, but this is only a mitigation.
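To illustrate option 1, here is a minimal sketch of computing the labels locally from parameters the provisioner already knows at creation time, instead of issuing another cloud API call. The helper names and the exact label keys are assumptions for illustration only, not the actual code from the draft PR:

```go
// Minimal sketch (assumed names): build PV labels from the zone the
// provisioner already knows because it just created the disk, instead of
// issuing another cloud API call via GetAutoLabelsForZonalPD.
package main

import (
	"fmt"
	"strings"
)

// zoneToRegion derives the GCE region from a zone name,
// e.g. "us-central1-b" -> "us-central1".
func zoneToRegion(zone string) (string, error) {
	idx := strings.LastIndex(zone, "-")
	if idx <= 0 {
		return "", fmt.Errorf("unexpected zone format: %q", zone)
	}
	return zone[:idx], nil
}

// labelsForCreatedZonalPD computes the zone/region labels for a PD that was
// just created, using only local information. The label keys here are the
// legacy failure-domain labels; treat them as an illustrative assumption.
func labelsForCreatedZonalPD(zone string) (map[string]string, error) {
	region, err := zoneToRegion(zone)
	if err != nil {
		return nil, err
	}
	return map[string]string{
		"failure-domain.beta.kubernetes.io/zone":   zone,
		"failure-domain.beta.kubernetes.io/region": region,
	}, nil
}

func main() {
	labels, err := labelsForCreatedZonalPD("us-central1-b")
	if err != nil {
		panic(err)
	}
	fmt.Println(labels) // zone: us-central1-b, region: us-central1
}
```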

For me, option 1 is the most reasonable thing to do (and I have a draft PR for it), but I wanted to hear your opinion before I start cleaning up the PR.
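That said, if option 2 were pursued instead (or in addition), a client-side exponential backoff around the cloud call could look roughly like the sketch below; the getAutoLabels stub and the backoff parameters are assumptions for illustration only.

```go
// Minimal sketch (assumed names): wrap a GetAutoLabelsForPD-style cloud call
// in an exponential backoff so that transient cloud failures do not turn into
// an immediate retry storm while kube-controller-manager is overloaded.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getAutoLabels stands in for the real cloud call; its signature is an assumption.
func getAutoLabels(name, zone string) (map[string]string, error) {
	return nil, fmt.Errorf("transient cloud error")
}

func getAutoLabelsWithBackoff(name, zone string) (map[string]string, error) {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // initial delay
		Factor:   2.0,             // double the delay after each failure
		Steps:    5,               // give up after 5 attempts
	}
	var labels map[string]string
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		l, err := getAutoLabels(name, zone)
		if err != nil {
			// Treat the error as retryable; returning (false, nil) retries.
			return false, nil
		}
		labels = l
		return true, nil
	})
	return labels, err
}

func main() {
	if _, err := getAutoLabelsWithBackoff("pd-1", "us-central1-b"); err != nil {
		fmt.Println("giving up:", err) // wait.ErrWaitTimeout once the steps are exhausted
	}
}
```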

Potential drawbacks of option 1 are:

How to reproduce it (as minimally and precisely as possible):
Just create a lot of PVs together.
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/assign @saad-ali
/cc @wojtek-t

@mborsz added the kind/bug label on Feb 4, 2020
@k8s-ci-robot added the needs-sig label on Feb 4, 2020
@wojtek-t (Member) commented on Feb 4, 2020:

/sig storage

@kubernetes/sig-storage-bugs @msau42

@k8s-ci-robot added the sig/storage label and removed the needs-sig label on Feb 4, 2020
@mborsz (Member, Author) commented on Feb 4, 2020:

A draft PR that implements option 1: #87811

The overall idea is to modify the CreateDisk methods to return a Disk object on success (the CreateDisk* methods have all the data necessary to do that).
If we need to add another label in the future that can be deduced from the input data, this approach will still work fine. Otherwise (e.g. if we want to add labels that cannot easily be deduced from the input, such as an expected IOPS rate or a creationTimestamp), we will need to add the GCE call back.
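A rough sketch of the shape this could take; the Disk type, its fields, and the CreateDisk signature below are assumptions for illustration and may differ from the actual PR:

```go
// Minimal sketch (assumed names): CreateDisk returns the created Disk, and the
// provisioner derives labels from that object instead of calling the cloud
// API again after creation.
package main

import "fmt"

// Disk carries the properties the provisioner needs for labeling; the fields
// and their names are assumptions for illustration.
type Disk struct {
	Name   string
	Zone   string
	Region string
}

// CreateDisk simulates a cloud CreateDisk call that returns the created disk
// on success; in the real code this would issue the GCE API request.
func CreateDisk(name, zone, region string, sizeGB int64) (*Disk, error) {
	return &Disk{Name: name, Zone: zone, Region: region}, nil
}

func main() {
	disk, err := CreateDisk("pd-1", "us-central1-b", "us-central1", 10)
	if err != nil {
		panic(err)
	}
	// Labels are computed from the returned object; no follow-up GET is needed.
	labels := map[string]string{
		"failure-domain.beta.kubernetes.io/zone":   disk.Zone,
		"failure-domain.beta.kubernetes.io/region": disk.Region,
	}
	fmt.Println(labels)
}
```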

Please let me know what you think about this approach.

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 4, 2020
@wojtek-t added the sig/scalability label on May 4, 2020
@wojtek-t (Member) commented on May 4, 2020:

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on May 4, 2020
@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 2, 2020
@wojtek-t (Member) commented on Aug 3, 2020:

/remove-lifecycle stale

This has been partially addressed by #87811, but we need to verify whether more is needed.
/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Aug 3, 2020