Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Injector failure mode prevents Pod deletion #500

Closed
cjyar opened this issue Jul 17, 2023 · 6 comments · Fixed by #501
Closed

Injector failure mode prevents Pod deletion #500

cjyar opened this issue Jul 17, 2023 · 6 comments · Fixed by #501
Labels
bug Something isn't working

Comments

@cjyar
Copy link
Contributor

cjyar commented Jul 17, 2023

Describe the bug
If a Pod has the agent-inject annotation yet gets created without the injected sidecars, then any future update to the Pod will trigger the injector to add the sidecars. If the Pod has already been created, these attempts to modify spec.containers or spec.initContainers will fail, thus causing the Pod update to fail. If the Pod has a finalizer, it will be impossible to remove the finalizer, and therefore it will be impossible to remove the Pod.

It's easy to enter this failure mode if there's a temporary connectivity error between the Kubernetes apiserver and the vault agent injector.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy vault-agent-injector. Use the default settings which result in the MutatingWebhookConfiguration having spec.failurePolicy: Ignore and spec.timeoutSeconds: 30.
  2. Force a vault-agent-injector failure by running kubectl scale deploy --replicas=0.
  3. Create a Job with the agent-inject annotation. Configure the Job's entrypoint to wait a while, e.g. sleep 300.
  4. After the Pod is created, it should take 30 seconds before it starts running due to the MWC timeout.
  5. After the Pod is running, but before it exits, restart vault-agent-injector with kubectl scale deploy --replicas=2.
  6. After the Pod exits, it will fail to be deleted.
  7. The Kubernetes controller-manager (job-controller) will try to delete the batch.kubernetes.io/job-tracking finalizer, but it will fail with Pod "pod-name" is invalid: spec.initContainers: Forbidden: pod updates may not add or remove containers.
  8. Attempts to delete the Pod will succeed, but the Pod won't be deleted because the finalizer won't be removed.
  9. Attempts to manually delete the finalizer using kubectl edit pod will fail with the same error.

Expected behavior
The vault-agent-injector shouldn't block Pods from being finalized.

Environment

  • Kubernetes version:
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): GKE v1.25.10-gke.1200
    • Other configuration options or runtime services (istio, etc.): none
  • vault-k8s version: 1.2.1

Workaround
To delete a Pod that's stuck in this state, use kubectl edit pod to delete the vault.hashicorp.com/agent-inject annotation.

Fix
Even though this behavior is surprising, and the Kubernetes error message isn't super helpful, I think Kubernetes is actually doing the right thing. I think the injector could be modified to address this problem, though. If it's being asked to mutate a Pod, and the Pod's status.phase is a string other than Pending, then it should do nothing. In other words, if a Pod has already been created, the injector shouldn't try to add containers because the Pod's spec.initContainers and spec.containers are immutable.

@cjyar cjyar added the bug Something isn't working label Jul 17, 2023
@komapa
Copy link

komapa commented Aug 22, 2023

Hello, we were wondering over here why even watch for UPDATE events in the operator? What is it accounting for? Thanks!

@cjyar
Copy link
Contributor Author

cjyar commented Aug 23, 2023

@komapa You're probably right that only watching for CREATE events would be a cleaner fix for this problem. It would be nice to hear from the vault-k8s maintainers.

@alculquicondor
Copy link

alculquicondor commented Nov 3, 2023

I don't think the fix #501 is bullet proof.

Let's assume the pod from step 3 fails to schedule. The user realizes this and attempts to delete the Pod (or Job).

At this point, the Pod still has phase: Pending, but its containers also can't be mutated.

The more appropriate fix is to simply NOT add a container if the webhook is called for an UPDATE operation.

/reopen

@tomhjp tomhjp reopened this May 10, 2024
@tomhjp
Copy link
Contributor

tomhjp commented May 10, 2024

Sorry for letting this linger, and thanks for the reports. I'm going to work on getting hashicorp/vault-helm#783 merged which will fix this properly as per the above comments.

@tomhjp
Copy link
Contributor

tomhjp commented May 10, 2024

#619 is also related for anyone using the deployment yaml instead of the helm chart.

@tomhjp
Copy link
Contributor

tomhjp commented May 10, 2024

#783 is merged, hopefully this fix will stick with the next release of the helm chart.

@tomhjp tomhjp closed this as completed May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants