Operation executor doesn't seem to be throttling or backing off on retries #71569
Comments
I'll take a look at this.
The problem is that the reconciler depends on the operation executor to do backoff. In the normal case, the operation executor returns a method that does the actual work, and any error from that method results in exponential backoff. In this case, however, the error comes from the code that creates the mount method. That error is not handled by the backoff logic, so the reconciler calls the creation code again and again, resulting in log and event spam. This didn't use to be a problem because initialization errors were uncommon (plugins were built in), but with CSI the driver may not exist, so these initialization errors are expected. I'll look into cleaning this up.
Is this something that will affect users in 1.13 and need cherry-picking?
This will result in kubelet log spam and possibly Kubernetes event spam. It happens when someone creates a pod referencing a CSI volume before the CSI driver for that volume is installed. That happens frequently in E2E tests because they install and remove the driver with every test, but it should be rare in real-world usage, since most users are likely to install a driver before using it.
If the release team is worried about this, we could create a quick PR to suppress logging and event generation for these errors. Otherwise we plan to address this more holistically for 1.14 and 1.13.1.
What happened:
In this test, the volume mount failed because the driver-registrar took a minute to come up. However, it seems that the volume mount operation is retried without any sort of backoff.
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new-parallel/400/artifacts/bootstrap-e2e-minion-group-lztn/kubelet.log
This message appears in kubelet.log 400+ times in less than a minute.
What you expected to happen:
Operation executor should not be retrying operations so frequently.
@kubernetes/sig-storage-bugs
/kind bug