
Mounting (only 'default-token') volume takes a long time when creating a batch of pods #28616

Closed
coufon opened this issue Jul 7, 2016 · 18 comments
Labels: priority/important-soon, sig/node, sig/scalability, sig/storage

Comments

@coufon
Contributor

coufon commented Jul 7, 2016

We break down the e2e latency of creating a batch of pods and find that mounting volumes is a serialization point that slows down the process. The pods in the test do not specify any volumes, but each has the 'default-token' secret volume.

When creating 30 nginx pods on a desktop machine, it takes around 25s with volume mounting and about 12s without.

With volume mounting, the cumulative histogram of pod counts is as follows. Each curve shows the total number of pods that have reached a given point (firstSeen: pod addition detected; volume: just before 'WaitForAttachAndMount' in syncPod; container: just before 'containerRuntime.SyncPod'; running: pod status is running). It shows that volume mounting starts very quickly, but pods arrive at 'container' slowly, one by one.

<img src="https://cloud.githubusercontent.com/assets/11655397/16663304/7d0eba58-4430-11e6-9298-625ca13575b2.png" width="70%", height="70%">

If we skip the whole 'WaitForAttachAndMount' function call, the cumulative histogram becomes:

<img src="https://cloud.githubusercontent.com/assets/11655397/16663475/0b48fafe-4431-11e6-9093-e0ba3f1303c5.png" width="70%", height="70%">

@coufon
Contributor Author

coufon commented Jul 7, 2016

@yujuhong @dchen1107

@yujuhong added the sig/node and sig/storage labels on Jul 7, 2016
@yujuhong
Contributor

yujuhong commented Jul 7, 2016

/cc @kubernetes/sig-storage @kubernetes/sig-node

@saad-ali
Member

saad-ali commented Jul 7, 2016

@coufon Volumes are mounted asynchronously by the volume manager. The loops in volume manager operate at 10Hz or slower. Based on the existing periods for these loops it could take as much as 500ms or more for a pod's volume to get mounted even if the mount operation itself is much faster than that.

For the sake of testing, could you rerun your test with the following changes?
In volume_manager, change the following values:

```go
reconcilerLoopSleepPeriod                   time.Duration = 10 * time.Millisecond
desiredStateOfWorldPopulatorLoopSleepPeriod time.Duration = 10 * time.Millisecond
podAttachAndMountRetryInterval              time.Duration = 30 * time.Millisecond
```
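To see why the loop period alone matters, here is a minimal sketch (the `reconciler` type and its fields below are invented for illustration; this is not the kubelet's actual volume manager code): a volume that becomes pending just after a tick waits up to a full loop period before its mount even starts, regardless of how fast the mount itself is.

```go
// Minimal sketch of a periodic reconciler loop. Names (reconciler,
// loopSleepPeriod, pending) are invented for illustration.
package main

import (
	"fmt"
	"time"
)

type reconciler struct {
	loopSleepPeriod time.Duration
	pending         chan string // volumes waiting to be mounted
}

func (r *reconciler) run(stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		r.reconcile()
		// The sleep between iterations is a latency floor: work that arrives
		// just after an iteration waits up to a full period before it starts.
		time.Sleep(r.loopSleepPeriod)
	}
}

func (r *reconciler) reconcile() {
	for {
		select {
		case vol := <-r.pending:
			fmt.Printf("mounting %q at %s\n", vol, time.Now().Format("15:04:05.000"))
		default:
			return
		}
	}
}

func main() {
	r := &reconciler{loopSleepPeriod: 100 * time.Millisecond, pending: make(chan string, 10)}
	stop := make(chan struct{})
	go r.run(stop)

	time.Sleep(10 * time.Millisecond) // the volume shows up just after a tick...
	r.pending <- "default-token"      // ...so it waits most of a period to be mounted
	time.Sleep(300 * time.Millisecond)
	close(stop)
}
```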

@matchstick
Contributor

@coufon did you run your test on 1.2 or 1.3? If not on 1.3, can you try there too? We recently changed a lot of code in that path; hopefully you will see an improvement.

@yujuhong added the sig/scalability label on Jul 7, 2016
@yujuhong
Contributor

yujuhong commented Jul 7, 2016

To add some context, we don't have an SLO for pod startup during batch creation yet. @coufon is helping us with the node performance benchmark, gathering and analyzing the data, so that we can have a more complete picture of the node performance (e.g., pod startup/deletion throughput and latency).

> @coufon Volumes are mounted asynchronously by the volume manager. The loops in volume manager operate at 10Hz or slower. Based on the existing periods for these loops it could take as much as 500ms or more for a pod's volume to get mounted even if the mount operation itself is much faster than that.

The reconciler loops with a 500ms period. In each iteration, does it mount volumes for pods sequentially, or does it parallelize the work? IIUC, in v1.2, mounts are handled by each pod worker in parallel. Would this have affected the latency?

@saad-ali
Member

saad-ali commented Jul 8, 2016

> In each iteration, does it mount volumes for pods sequentially, or does it parallelize the work?

Volume manager parallelizes work as long as the underlying volume is not the same. For secret volumes that means as long as the SecretName is different. In this case, if all the pods being batch created are identical (and therefore referencing the same volume), then they would be handled serially.
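To make that serialization concrete, here is a rough sketch (with invented names, not the actual GoRoutineMap implementation) of per-key serialization: operations sharing a key run one at a time, so if the key is the underlying volume (e.g. the secret name), identical pods queue behind each other while different volumes proceed in parallel.

```go
// Sketch of per-key operation serialization (invented names). With the
// underlying volume name as the key, identical pods referencing the same
// secret are handled serially.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedSerializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedSerializer() *keyedSerializer {
	return &keyedSerializer{locks: map[string]*sync.Mutex{}}
}

// Run executes op, serializing against any other op that uses the same key.
func (s *keyedSerializer) Run(key string, op func()) {
	s.mu.Lock()
	l, ok := s.locks[key]
	if !ok {
		l = &sync.Mutex{}
		s.locks[key] = l
	}
	s.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	op()
}

func main() {
	s := newKeyedSerializer()
	start := time.Now()
	var wg sync.WaitGroup
	// Five identical pods, all referencing the same default-token secret:
	// the mounts queue behind one another and finish roughly 50ms apart.
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(pod int) {
			defer wg.Done()
			s.Run("default-token", func() {
				time.Sleep(50 * time.Millisecond) // stand-in for the mount work
				fmt.Printf("pod %d mounted after %v\n", pod, time.Since(start).Round(time.Millisecond))
			})
		}(i)
	}
	wg.Wait()
}
```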

@coufon
Contributor Author

coufon commented Jul 8, 2016

@saad-ali I redid the test with the new parameters. The result is the same.

@matchstick My local Kubernetes code was pulled on June 15th, so it does not contain the past month's updates. I will run the tests on both 1.2 and 1.3 later.

@saad-ali
Member

saad-ali commented Jul 8, 2016

@coufon Ya, then it's because the volume mounts are happening serially, as mentioned above. We can probably optimize this by enabling parallelization even when the underlying volume is the same, for volume plugins where multiple pending operations don't matter (like secrets).

@jingxu97
Contributor

jingxu97 commented Jul 8, 2016

If performance is something we might want to address in v1.4, I can help work on it, since we do see opportunities to optimize it.


@yujuhong added this to the next-candidate milestone on Jul 8, 2016
@yujuhong
Contributor

yujuhong commented Jul 8, 2016

Tentatively marking this next-candidate. If this is low-hanging fruit, we can consider doing it.

@yujuhong removed this from the next-candidate milestone on Jul 8, 2016
@yujuhong
Contributor

yujuhong commented Jul 8, 2016

> If we skip the whole 'WaitForAttachAndMount' function call, the cumulative histogram becomes:

@coufon, let's run the test against kubernetes v1.2 to see if this is truly a regression first.

By the way, the secret volume plugin has to get the secret from the apiserver and then write it to disk. We have an apiserver QPS limit, so even if we parallelize this task, we will still be restricted by the QPS limit (although maybe to a lesser extent).

I guess the question is why we are fetching the same secret for many pods repeatedly and whether it's safe to cache secrets.
(found a related issue: #19188).
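For illustration, a minimal sketch of the kind of caching being discussed (hypothetical types and names, not the kubelet's actual secret handling; invalidation and staleness are ignored): fetch each secret from the apiserver once and reuse it across pods, so only the first pod pays the QPS-limited round trip.

```go
// Hypothetical secret cache: only the first pod's Get hits the apiserver.
package main

import (
	"fmt"
	"sync"
)

type secret struct {
	name string
	data map[string][]byte
}

// fetchFunc stands in for the real, QPS-limited apiserver GET.
type fetchFunc func(namespace, name string) (*secret, error)

type cachingSecretGetter struct {
	mu    sync.Mutex
	cache map[string]*secret
	fetch fetchFunc
}

func (c *cachingSecretGetter) Get(namespace, name string) (*secret, error) {
	key := namespace + "/" + name
	c.mu.Lock()
	if s, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return s, nil // cache hit: no apiserver call
	}
	c.mu.Unlock()

	s, err := c.fetch(namespace, name)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	c.cache[key] = s
	c.mu.Unlock()
	return s, nil
}

func main() {
	apiCalls := 0
	getter := &cachingSecretGetter{
		cache: map[string]*secret{},
		fetch: func(ns, name string) (*secret, error) {
			apiCalls++ // each of these would count against the apiserver QPS limit
			return &secret{name: name, data: map[string][]byte{"token": []byte("...")}}, nil
		},
	}
	for i := 0; i < 30; i++ { // 30 pods sharing one default token secret
		if _, err := getter.Get("default", "default-token-abcde"); err != nil {
			panic(err)
		}
	}
	fmt.Println("apiserver calls:", apiCalls) // 1 instead of 30
}
```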

@saad-ali
Member

saad-ali commented Jul 8, 2016

> @coufon, let's run the test against kubernetes v1.2 to see if this is truly a regression first.

It will be, to some extent, since 1.2 has no protection against concurrent attach/mount operations on the same device.

> By the way, the secret volume plugin has to get the secret from the apiserver and then write it to disk. We have an apiserver QPS limit, so even if we parallelize this task, we will still be restricted by the QPS limit (although maybe to a lesser extent).

True, depending on what the API QPS limit is, parallelization may not help much. But it's fairly trivial to do and safe for non-attachable volumes, so it's worth doing. I plan to send out a PR making secret/configmap/etc. volume mounting parallelized.
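A rough sketch of the idea (the helper below is invented for illustration, not code from the actual PR): key pending operations by volume plus pod for non-attachable plugins, so mounts of the same secret for different pods can proceed concurrently, while attachable volumes keep the per-volume key to avoid concurrent operations on the same device.

```go
// Sketch of the keying change for pending volume operations (invented names).
package main

import "fmt"

func operationKey(volumeName, podName string, attachable bool) string {
	if attachable {
		return volumeName // serialize per volume/device
	}
	return volumeName + "/" + podName // safe to parallelize across pods
}

func main() {
	fmt.Println(operationKey("default-token", "nginx-1", false)) // default-token/nginx-1
	fmt.Println(operationKey("default-token", "nginx-2", false)) // default-token/nginx-2 (can run concurrently)
	fmt.Println(operationKey("my-pd-disk", "db-0", true))        // my-pd-disk (still serialized)
}
```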

@smarterclayton
Contributor

This does sound like what we're seeing: large numbers of pods scheduled on the same machine result in some of them hitting higher-level timeouts.

@saad-ali added this to the v1.3 milestone on Jul 12, 2016
@saad-ali added the priority/important-soon label on Jul 12, 2016
@matchstick
Contributor

@saad-ali Officially assigning this to Saad, but hoping that @smarterclayton will be involved in the process. I agree this should be in 1.3, as it can be viewed as a behaviour change.

@smarterclayton
Contributor

@kubernetes/rh-storage as discussed, we need to help Saad with this.

@mboersma
Contributor

mboersma commented Jul 13, 2016

I see something very similar (deis/workflow#372) when four different pods attempt to mount the same secret volume at roughly the same time, except the timeouts are pathological and the pods will be stuck in the ContainerCreating status indefinitely. This isn't an issue in k8s 1.2.5.

@saad-ali
Member

@mboersma You're likely hitting #28750. The fix for this issue should help a lot.

mboersma added a commit to mboersma/workflow that referenced this issue Jul 18, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
mboersma added a commit to mboersma/workflow that referenced this issue Jul 19, 2016
k8s-github-robot pushed a commit that referenced this issue Jul 20, 2016
Automatic merge from submit-queue

Allow mounts to run in parallel for non-attachable volumes

This PR:
* Fixes #28616
  * Enables mount volume operations to run in parallel for non-attachable volume plugins.
  * Enables unmount volume operations to run in parallel for all volume plugins.
* Renames `GoRoutineMap` to `GoroutineMap`, resolving a long outstanding request from @thockin: `"Goroutine" is a noun`
@pmorie
Member

pmorie commented Jul 27, 2016

Fix: #29673
