
Mounting (only 'default-token') volume takes a long time when creating a batch of pods #28616

Closed
coufon opened this issue Jul 7, 2016 · 18 comments
Labels: priority/important-soon, sig/node, sig/scalability, sig/storage

Comments

@coufon
Contributor

coufon commented Jul 7, 2016

We break down the e2e latency of creating a batch of pods and find that mounting volumes is a serialization point that slows down the process. The pods in the test do not specify any volumes, but each has the 'default-token' secret volume.

When creating 30 nginx pods on a desktop machine, it takes around 25s with volume mounting and about 12s without.

With volume mounting, the cumulative histogram of pod counts is as follows. Each curve shows the total number of pods that have reached a given point (firstSeen: pod addition detected; volume: just before 'WaitForAttachAndMount' in syncPod; container: just before 'containerRuntime.SyncPod'; running: pod status is running). It shows that volume mounting starts very quickly, but pods arrive at 'container' slowly, one by one.

<img src="https://cloud.githubusercontent.com/assets/11655397/16663304/7d0eba58-4430-11e6-9298-625ca13575b2.png" width="70%", height="70%">

If we skip the whole 'WaitForAttachAndMount' function call, the cumulative histogram becomes:

<img src="https://cloud.githubusercontent.com/assets/11655397/16663475/0b48fafe-4431-11e6-9093-e0ba3f1303c5.png" width="70%", height="70%">

@coufon
Contributor Author

coufon commented Jul 7, 2016

@yujuhong @dchen1107

@yujuhong added the sig/node and sig/storage labels on Jul 7, 2016
@yujuhong
Contributor

yujuhong commented Jul 7, 2016

/cc @kubernetes/sig-storage @kubernetes/sig-node

@saad-ali
Member

saad-ali commented Jul 7, 2016

@coufon Volumes are mounted asynchronously by the volume manager. The loops in volume manager operate at 10Hz or slower. Based on the existing periods for these loops it could take as much as 500ms or more for a pod's volume to get mounted even if the mount operation itself is much faster than that.

For the sake of testing, could you rerun your test with the following changes?
In volume_manager, change the following values:

```go
reconcilerLoopSleepPeriod                   time.Duration = 10 * time.Millisecond
desiredStateOfWorldPopulatorLoopSleepPeriod time.Duration = 10 * time.Millisecond
podAttachAndMountRetryInterval              time.Duration = 30 * time.Millisecond
```
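To see why the loop period alone matters, here is a minimal sketch (the `reconciler` type and its fields below are invented for illustration; this is not the kubelet's actual volume manager code): a volume that becomes pending just after a tick waits up to a full loop period before its mount even starts, regardless of how fast the mount itself is.

```go
// Minimal sketch of a periodic reconciler loop. Names (reconciler,
// loopSleepPeriod, pending) are invented for illustration.
package main

import (
	"fmt"
	"time"
)

type reconciler struct {
	loopSleepPeriod time.Duration
	pending         chan string // volumes waiting to be mounted
}

func (r *reconciler) run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		r.reconcile()
		// The sleep between iterations is a latency floor: work that arrives
		// just after an iteration waits up to a full period before it starts.
		time.Sleep(r.loopSleepPeriod)
	}
}

func (r *reconciler) reconcile() {
	for {
		select {
		case vol := <-r.pending:
			fmt.Printf("mounting %q at %s\n", vol, time.Now().Format("15:04:05.000"))
		default:
			return
		}
	}
}

func main() {
	r := &reconciler{loopSleepPeriod: 100 * time.Millisecond, pending: make(chan string, 10)}
	stop := make(chan struct{})
	go r.run(stop)

	time.Sleep(10 * time.Millisecond) // the volume shows up just after a tick...
	r.pending <- "default-token"      // ...so it waits most of a period to be mounted
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```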

@matchstick
Contributor

@coufon did you run your test on 1.2 or 1.3? If not on 1.3, can you try there too? We recently changed a lot of code in that path; hopefully you will see an improvement.

@yujuhong added the sig/scalability label on Jul 7, 2016
@yujuhong
Contributor

yujuhong commented Jul 7, 2016

To add some context, we don't have an SLO for pod startup during batch creation yet. @coufon is helping us with the node performance benchmark, gathering and analyzing the data, so that we can have a more complete picture of the node performance (e.g., pod startup/deletion throughput and latency).

> @coufon Volumes are mounted asynchronously by the volume manager. The loops in volume manager operate at 10Hz or slower. Based on the existing periods for these loops it could take as much as 500ms or more for a pod's volume to get mounted even if the mount operation itself is much faster than that.

The reconciler loops with a 500ms period. In each iteration, does it mount volumes for pods sequentially, or does it parallelize the work? IIUC, in v1.2, mounts are handled by each pod worker in parallel. Would this have affected the latency?

@saad-ali
Member

saad-ali commented Jul 8, 2016

> In each iteration, does it mount volumes for pods sequentially, or does it parallelize the work?

Volume manager parallelizes work as long as the underlying volume is not the same. For secret volumes that means as long as the SecretName is different. In this case, if all the pods being batch created are identical (and therefore referencing the same volume), then they would be handled serially.
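To make that serialization concrete, here is a rough sketch (with invented names, not the actual GoRoutineMap implementation) of per-key serialization: operations sharing a key run one at a time, so if the key is the underlying volume (e.g. the secret name), identical pods queue behind each other while different volumes proceed in parallel.

```go
// Sketch of per-key operation serialization (invented names). With the
// underlying volume name as the key, identical pods referencing the same
// secret are handled serially.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedSerializer() *keyedSerializer {
	return &keyedSerializer{locks: map[string]*sync.Mutex{}}
}

// Run executes op, serializing against any other op that uses the same key.
func (s *keyedSerializer) Run(key string, op func()) {
	s.mu.Lock()
	l, ok := s.locks[key]
	if !ok {
		l = &sync.Mutex{}
		s.locks[key] = l
	}
	s.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	op()
}

func main() {
	s := newKeyedSerializer()
	start := time.Now()
	var wg sync.WaitGroup
	// Five identical pods, all referencing the same default-token secret:
	// the mounts queue behind one another and finish roughly 50ms apart.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(pod int) {
			defer wg.Done()
			s.Run("default-token", func() {
				time.Sleep(50 * time.Millisecond) // stand-in for the mount work
				fmt.Printf("pod %d mounted after %v\n", pod, time.Since(start).Round(time.Millisecond))
			})
		}(i)
	}
	wg.Wait()
}
```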

@coufon
Contributor Author

coufon commented Jul 8, 2016

@saad-ali I redid the test with the new parameters. The result is the same.

@matchstick My local Kubernetes code was pulled on June 15th, so it does not contain the past month's updates. I will run the tests on both 1.2 and 1.3 later.

@saad-ali
Member

saad-ali commented Jul 8, 2016

@coufon Ya, then it's because the volume mounts are happening serially, as mentioned above. We can probably optimize this by enabling parallelization even when the underlying volume is the same, for volume plugins where multiple pending operations don't matter (like secrets).

@jingxu97
Contributor

jingxu97 commented Jul 8, 2016

If performance is something we might want to address in v1.4, I can help work on it, since we do see opportunities to optimize it.


@yujuhong added this to the next-candidate milestone on Jul 8, 2016
@yujuhong
Contributor

yujuhong commented Jul 8, 2016

Tentatively marking this next-candidate. If this is low-hanging fruit, we can consider doing it.

@yujuhong removed this from the next-candidate milestone on Jul 8, 2016
@yujuhong
Contributor

yujuhong commented Jul 8, 2016

> If we skip the whole 'WaitForAttachAndMount' function call, the cumulative histogram becomes:

@coufon, let's run the test against kubernetes v1.2 to see if this is truly a regression first.

By the way, the secret volume plugin has to get the secret from the apiserver and then write it to disk. We have an apiserver QPS limit, so even if we parallelize this task, we will still be restricted by the QPS limit (although maybe to a lesser extent).

I guess the question is why we are fetching the same secret for many pods repeatedly and whether it's safe to cache secrets.
(found a related issue: #19188).
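For illustration, a minimal sketch of the kind of caching being discussed (hypothetical types and names, not the kubelet's actual secret handling; invalidation and staleness are ignored): fetch each secret from the apiserver once and reuse it across pods, so only the first pod pays the QPS-limited round trip.

```go
// Hypothetical secret cache: only the first pod's Get hits the apiserver.
package main

import (
	"fmt"
	"sync"
)

type secret struct {
	name string
	data map[string][]byte
}

// fetchFunc stands in for the real, QPS-limited apiserver GET.
type fetchFunc func(namespace, name string) (*secret, error)

type cachingSecretGetter struct {
	mu    sync.Mutex
	cache map[string]*secret
	fetch fetchFunc
}

func (c *cachingSecretGetter) Get(namespace, name string) (*secret, error) {
	key := namespace + "/" + name
	c.mu.Lock()
	if s, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return s, nil // cache hit: no apiserver call
	}
	c.mu.Unlock()

	s, err := c.fetch(namespace, name)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.cache[key] = s
	c.mu.Unlock()
	return s, nil
}

func main() {
	apiCalls := 0
	getter := &cachingSecretGetter{
		cache: map[string]*secret{},
		fetch: func(ns, name string) (*secret, error) {
			apiCalls++ // each of these would count against the apiserver QPS limit
			return &secret{name: name, data: map[string][]byte{"token": []byte("...")}}, nil
		},
	}
	for i := 0; i < 30; i++ { // 30 pods sharing one default token secret
		if _, err := getter.Get("default", "default-token-abcde"); err != nil {
			panic(err)
		}
	}
	fmt.Println("apiserver calls:", apiCalls) // 1 instead of 30
}
```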

@saad-ali
Member

saad-ali commented Jul 8, 2016

> @coufon, let's run the test against kubernetes v1.2 to see if this is truly a regression first.

It will be, to some extent, since 1.2 has no protection against concurrent attach/mount operations on the same device.

> By the way, the secret volume plugin has to get the secret from the apiserver and then write it to disk. We have an apiserver QPS limit, so even if we parallelize this task, we will still be restricted by the QPS limit (although maybe to a lesser extent).

True, depending on what the API QPS limit is, parallelization may not help much. But it's fairly trivial to do and safe for non-attachable volumes, so it's worth doing. I plan to send out a PR making secret/configmap/etc. volume mounting parallelized.
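A rough sketch of the idea (the helper below is invented for illustration, not code from the actual PR): key pending operations by volume plus pod for non-attachable plugins, so mounts of the same secret for different pods can proceed concurrently, while attachable volumes keep the per-volume key to avoid concurrent operations on the same device.

```go
// Sketch of the keying change for pending volume operations (invented names).
package main

import "fmt"

func operationKey(volumeName, podName string, attachable bool) string {
	if attachable {
		return volumeName // serialize per volume/device
	}
	return volumeName + "/" + podName // safe to parallelize across pods
}

func main() {
	fmt.Println(operationKey("default-token", "nginx-1", false)) // default-token/nginx-1
	fmt.Println(operationKey("default-token", "nginx-2", false)) // default-token/nginx-2 (can run concurrently)
	fmt.Println(operationKey("my-pd-disk", "db-0", true))        // my-pd-disk (still serialized)
}
```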

@smarterclayton
Contributor

This does sound like what we're seeing: large numbers of pods scheduled on the same machine result in some of them hitting higher-level timeouts.

@saad-ali added this to the v1.3 milestone on Jul 12, 2016
@saad-ali added the priority/important-soon label on Jul 12, 2016
@matchstick
Contributor

@saad-ali Officially assigning this to Saad, but hoping that @smarterclayton will be involved in the process. I agree this should be in 1.3, as it can be viewed as a behaviour change.

@smarterclayton
Contributor

@kubernetes/rh-storage as discussed, we need to help Saad with this.

@mboersma
Contributor

mboersma commented Jul 13, 2016

I see something very similar (deis/workflow#372) when four different pods attempt to mount the same secret volume at roughly the same time, except the timeouts are pathological and the pods will be stuck in the ContainerCreating status indefinitely. This isn't an issue in k8s 1.2.5.

@saad-ali
Member

@mboersma You're likely hitting #28750. The fix for this issue should help a lot.

mboersma added a commit to mboersma/workflow that referenced this issue Jul 18, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
k8s-github-robot pushed a commit that referenced this issue Jul 20, 2016
Automatic merge from submit-queue

Allow mounts to run in parallel for non-attachable volumes

This PR:
* Fixes #28616
  * Enables mount volume operations to run in parallel for non-attachable volume plugins.
  * Enables unmount volume operations to run in parallel for all volume plugins.
* Renames `GoRoutineMap` to `GoroutineMap`, resolving a long outstanding request from @thockin: `"Goroutine" is a noun`
@pmorie
Member

pmorie commented Jul 27, 2016

Fix: #29673
