Operation executor doesn't seem to be throttling or backing off on retries #71569

msau42 · 2018-11-29T18:47:28Z

What happened:
In this test, the volume mount failed because the driver-registrar took a minute to come up. However, the volume mount operation appears to be retried without any backoff.

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400/artifacts/bootstrap-e2e-minion-group-lztn/kubelet.log

This message appears in kubelet.log 400+ times in less than a minute.

I1126 15:27:50.424926    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers
...
I1126 15:28:32.294522    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers

What you expected to happen:
Operation executor should not be retrying operations so frequently.

@kubernetes/sig-storage-bugs

/kind bug


saad-ali · 2018-11-29T18:56:55Z

I'll take a look at this.
/assign

saad-ali · 2018-11-29T19:51:47Z

The problem is that the reconciler depends on the operation executor to do backoff. In the normal case, the operation executor returns a method that does the actual work, and any error returned by that method results in exponential backoff.

But in this case, the error comes from the code that creates the mount method. That code is not covered by the backoff logic, so the reconciler calls it again and again, resulting in log and event spam.

This didn't used to be a problem because initialization errors were uncommon (plugins were built in), but with CSI the driver may not exist yet, so these initialization errors are expected.

I'll look into cleaning this up.
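To illustrate the gap described above, here is a minimal Go sketch. It is not taken from the kubelet source: the `executor` type, the `backoffOnGenerateError` switch, the `simulate` helper, the 100ms reconciler tick, and the backoff constants are all hypothetical and chosen only for illustration. It models a reconciler that asks for a mount function on every tick, where creating that function always fails, and compares how many attempts happen when creation errors bypass backoff versus when they feed into it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative values only; assumed for this sketch, not read from kubelet.
const (
	initialDelay = 500 * time.Millisecond
	maxDelay     = 2 * time.Minute
)

// backoffEntry remembers the last failure for one volume and how long to wait.
type backoffEntry struct {
	lastFailure time.Time
	delay       time.Duration
}

// executor is a hypothetical stand-in for the operation executor.
type executor struct {
	// When false, errors from creating the operation skip backoff
	// (the behavior described in this issue).
	backoffOnGenerateError bool
	entries                map[string]*backoffEntry
}

// recordFailure starts the backoff at initialDelay and doubles it up to maxDelay.
func (e *executor) recordFailure(volume string, now time.Time) {
	b, ok := e.entries[volume]
	if !ok {
		e.entries[volume] = &backoffEntry{lastFailure: now, delay: initialDelay}
		return
	}
	b.delay *= 2
	if b.delay > maxDelay {
		b.delay = maxDelay
	}
	b.lastFailure = now
}

// run creates the operation with generate and then executes it.
// It reports whether an attempt was made at all on this call.
func (e *executor) run(volume string, now time.Time, generate func() (func() error, error)) (bool, error) {
	if b, ok := e.entries[volume]; ok && now.Before(b.lastFailure.Add(b.delay)) {
		return false, nil // still backing off, skip this tick
	}
	op, err := generate()
	if err != nil {
		if e.backoffOnGenerateError {
			e.recordFailure(volume, now)
		}
		return true, err
	}
	if err := op(); err != nil {
		e.recordFailure(volume, now)
		return true, err
	}
	delete(e.entries, volume) // success resets the backoff
	return true, nil
}

// simulate runs 600 reconciler ticks of 100ms (one simulated minute) against
// a generate function that always fails, and returns the number of attempts.
func simulate(backoffOnGenerateError bool) int {
	e := &executor{
		backoffOnGenerateError: backoffOnGenerateError,
		entries:                map[string]*backoffEntry{},
	}
	generate := func() (func() error, error) {
		return nil, errors.New("driver not found in the list of registered CSI drivers")
	}
	start := time.Unix(0, 0)
	attempts := 0
	for tick := 0; tick < 600; tick++ {
		now := start.Add(time.Duration(tick) * 100 * time.Millisecond)
		if attempted, _ := e.run("pvc-example", now, generate); attempted {
			attempts++
		}
	}
	return attempts
}

func main() {
	fmt.Println("attempts without backoff on creation errors:", simulate(false))
	fmt.Println("attempts with backoff on creation errors:", simulate(true))
}
```

With these assumed constants, the simulated minute yields 600 attempts when creation errors bypass backoff and 7 when they are recorded like execution errors. Each failed attempt would correspond to one log line and one event.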

AishSundar · 2018-11-29T20:26:26Z

Is this something that will affect users in 1.13 and needs CPing?

saad-ali · 2018-11-29T21:06:18Z

This will result in kubelet log spam and possibly kubernetes event spam.

It happens when someone creates a pod referencing a CSI volume before the CSI driver for that volume is installed. This happens frequently in E2E tests because they install and remove the driver with every test, but it should be rare in real-world usage since most users are likely to install a driver before using it.

saad-ali · 2018-11-29T21:08:14Z

If the release team is worried about this, we could create a quick PR to suppress logging and event generation for these errors.

Otherwise we plan to address this more holistically for 1.14 and 1.13.1.

k8s-ci-robot added the kind/bug and sig/storage labels Nov 29, 2018

msau42 mentioned this issue Nov 29, 2018

Flaking Test: subpath failures in new-master-upgrade-cluster-new-parallel, other jobs #71383

Closed

k8s-ci-robot assigned saad-ali Nov 29, 2018

saad-ali mentioned this issue Nov 30, 2018

Reduce CSI log and event spam #71581

Merged

k8s-ci-robot closed this as completed in #71581 Dec 1, 2018

saad-ali mentioned this issue Dec 18, 2018

Automated Cherry Pick of #71581 to release-1.13 #72161

Merged
