Helm operator gets stuck if an install/upgrade is aborted by an unexpected exit of the operator #94

Open
SimonBaeumer opened this issue Jun 10, 2021 · 4 comments · May be fixed by #116

@SimonBaeumer
Member

If I kill the operator while a helm install is in progress, it is not able to recover, because it always receives an error due to the existing lock.

@joelanford
Member

Hmm. I wasn't aware that helm created a lock during installs (and I assume other operations that affect release data?).

Could you share the error you were getting and/or any other breadcrumbs? Assuming the lock is some sort of kube object helm creates, I wonder if we could inject a label or other identifier that could help us identify it later as a lock we created (vs the helm CLI or another client) and try to recover.

@SimonBaeumer
Member Author

SimonBaeumer commented Jun 16, 2021

Hey @joelanford! Thanks for your answer! :)
Helm sets the status of its release secret to pending-upgrade:

❯ helm get all <release-name> | head
NAME: stackrox-secured-cluster-services
LAST DEPLOYED: Tue Jun 15 18:47:11 2021
NAMESPACE: stackrox
STATUS: pending-upgrade
REVISION: 8
TEST SUITE: None

I even found several issues in helm like this, but so far it looks like the only solutions are:

  • deleting the secret
  • updating the status of the release manually
  • doing a rollback

I'll try to fix this today and add you to the PR - would be really useful to have feedback from you on this.
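
The stuck state in the `helm get all` output above can be recognized from the release status alone. A minimal sketch in Go, assuming the status strings Helm 3 uses for in-flight operations (`pending-install`, `pending-upgrade`, `pending-rollback`). The function name `isPending` is hypothetical and not part of Helm or the operator; this only detects the state and does not apply any of the three remedies listed above.

```go
package main

import "fmt"

// isPending reports whether a Helm release status string denotes an
// operation that was started but never marked as finished.
// Hypothetical helper for illustration only.
func isPending(status string) bool {
	switch status {
	case "pending-install", "pending-upgrade", "pending-rollback":
		return true
	default:
		return false
	}
}

func main() {
	// Print the classification for a few example status values.
	for _, s := range []string{"deployed", "pending-upgrade", "failed"} {
		fmt.Printf("%s pending=%t\n", s, isPending(s))
	}
}
```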

@joelanford
Member

Ah! I'm pretty sure (though not positive) that we inject an owner reference into the release secret. If so, that would help us identify release secrets we create. Whatever solution we choose, I think it should include a check that we will only try to automatically resolve it if we see that we are the only interested party to the release.

That would avoid a situation where the operator suddenly inherits and potentially stomps on a release when a CR is created for an existing release.
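
The ownership guard described here could look roughly like the sketch below. It is written under several assumptions: `OwnerRef` is a simplified stand-in for the Kubernetes owner reference type, "the only interested party" is interpreted as "the release secret has exactly one owner reference and it points at our CR", and the names `ownedOnlyBy` and `safeToRecover` are hypothetical. Whether the operator actually sets an owner reference on the release secret still needs to be confirmed.

```go
package main

import "fmt"

// OwnerRef is a simplified, hypothetical stand-in for the owner
// reference metadata Kubernetes stores on an object.
type OwnerRef struct {
	Kind string
	Name string
	UID  string
}

// ownedOnlyBy reports whether refs contains exactly one owner and
// that owner has the given UID.
func ownedOnlyBy(refs []OwnerRef, uid string) bool {
	return len(refs) == 1 && refs[0].UID == uid
}

// safeToRecover reports whether a release is stuck in a pending state
// and its secret is owned solely by the CR with UID crUID.
func safeToRecover(status string, refs []OwnerRef, crUID string) bool {
	pending := status == "pending-install" ||
		status == "pending-upgrade" ||
		status == "pending-rollback"
	return pending && ownedOnlyBy(refs, crUID)
}

func main() {
	refs := []OwnerRef{{Kind: "Example", Name: "my-cr", UID: "abc"}}
	// Stuck release whose secret is owned by our CR.
	fmt.Println(safeToRecover("pending-upgrade", refs, "abc"))
	// Stuck release whose secret has no owner, e.g. one created by the helm CLI.
	fmt.Println(safeToRecover("pending-upgrade", nil, "abc"))
}
```

With a guard like this, a release that was created by the helm CLI or another client and merely shares a name with a new CR would be left untouched.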

@SimonBaeumer
Member Author

Action items / open questions:

  • Check for the owner reference on Helm secrets and config maps, so that pending states are only resolved on operator-owned resources
  • Should resources that are not owned by the operator also recover from pending states?
  • Add tests
