
Operation executor doesn't seem to be throttling or backing off on retries #71569

Closed
msau42 opened this issue Nov 29, 2018 · 5 comments · Fixed by #71581
Labels: kind/bug, sig/storage

msau42 (Member) commented Nov 29, 2018

What happened:
In this test, the volume mount failed because the driver-registrar took a minute to come up. However, the volume mount operation appears to be retried without any backoff.

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400/artifacts/bootstrap-e2e-minion-group-lztn/kubelet.log

This message appears in kubelet.log 400+ times in less than a minute.

I1126 15:27:50.424926    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers
...
I1126 15:28:32.294522    1446 server.go:459] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-tests-csi-volumes-92vsl", Name:"pod-subpath-test-csi-hostpath-dynamicpv-mgnk", UID:"cc5409be-f18f-11e8-9a1b-42010a800002", APIVersion:"v1", ResourceVersion:"47363", FieldPath:""}): type: 'Warning' reason: 'FailedMount' MountVolume.NewMounter initialization failed for volume "pvc-c9e516f1-f18f-11e8-9a1b-42010a800002" : driver name csi-hostpath-e2e-tests-csi-volumes-92vsl not found in the list of registered CSI drivers

What you expected to happen:
The operation executor should not retry operations this frequently.

@kubernetes/sig-storage-bugs

/kind bug

@k8s-ci-robot added the labels kind/bug and sig/storage on Nov 29, 2018
saad-ali (Member) commented

I'll take a look at this.
/assign

saad-ali (Member) commented

The problem is that the reconciler depends on the operation executor to do backoff. In the normal case, the operation executor returns a method that does the actual work, and any error from that method results in exponential backoff.

In this case, however, the error comes from the code that creates the mount method. That error is not handled by the backoff logic, so the reconciler calls the generator again and again, resulting in log and event spam.

This wasn't a problem before because initialization errors were uncommon (plugins were built in), but with CSI the driver may not be installed yet, so these initialization errors are expected.

I'll look into cleaning this up.
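
To make the gap concrete, here is a minimal Go sketch of the control flow described above. The names (generateMountFunc, nextBackoff) and the loop are illustrative stand-ins, not the actual operation executor or reconciler code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// generateMountFunc stands in for the operation generator: it can fail while
// *constructing* the mount operation (e.g. NewMounter finds no registered CSI
// driver) before any backoff-tracked work runs.
func generateMountFunc(driverRegistered bool) (func() error, error) {
	if !driverRegistered {
		return nil, errors.New("driver not found in the list of registered CSI drivers")
	}
	return func() error { return nil /* the actual mount work */ }, nil
}

// nextBackoff doubles the delay; a real implementation would cap it.
func nextBackoff(d time.Duration) time.Duration {
	if d == 0 {
		return 500 * time.Millisecond
	}
	return 2 * d
}

func main() {
	backoff := time.Duration(0) // per-volume backoff state (sketch)

	for i := 0; i < 5; i++ { // stands in for the reconciler's periodic loop
		op, err := generateMountFunc(false)
		if err != nil {
			// Buggy path: the generation error is logged and the loop simply
			// retries on its next iteration, with no delay and no backoff,
			// producing the 400+ FailedMount events seen in kubelet.log.
			fmt.Println("FailedMount:", err)
			continue
		}
		// Only errors returned by the generated operation feed the backoff:
		if err := op(); err != nil {
			backoff = nextBackoff(backoff)
			time.Sleep(backoff)
		}
	}
}
```

The key point is that the generation-failure path never touches the backoff state, so it retries at the reconciler's full loop frequency.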

AishSundar (Contributor) commented

Is this something that will affect users in 1.13 and need a cherry-pick?

saad-ali (Member) commented

This will result in kubelet log spam and possibly Kubernetes event spam.

It happens when someone creates a pod referencing a CSI volume before the CSI driver for that volume is installed. That happens frequently in E2E tests, which install and remove the driver with every test, but it should be rare in real-world usage since most users will install a driver before using it.

saad-ali (Member) commented

If the release team is worried about this, we could create a quick PR to suppress logging/event generation for these errors.

Otherwise, we plan to address this more holistically in 1.14 and 1.13.1.
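
For reference, one possible shape of such a fix is sketched below: defer the initialization error into the generated operation itself so the existing exponential backoff sees it. This is an illustrative assumption about the approach, not necessarily what #71581 implements:

```go
package main

import (
	"errors"
	"fmt"
)

// generateMountFuncDeferred returns an operation that reports the
// initialization error only when it runs. The error then flows through the
// same path as any other mount failure, so the operation executor's existing
// exponential backoff throttles the retries.
func generateMountFuncDeferred(driverRegistered bool) func() error {
	return func() error {
		if !driverRegistered {
			// Surfaced from inside the operation, so it is tracked for backoff.
			return errors.New("driver not found in the list of registered CSI drivers")
		}
		return nil // the actual mount work would happen here
	}
}

func main() {
	op := generateMountFuncDeferred(false)
	if err := op(); err != nil {
		fmt.Println("mount failed (now subject to backoff):", err)
	}
}
```

With the error deferred like this, repeated failures for the same volume would back off exponentially instead of firing on every reconciler pass.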
