System application cilium stops upgrading because another operation appears to be in progress #12846

Closed
Tracked by #12095
embik opened this issue Nov 16, 2023 · 6 comments · Fixed by #13301
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/app-management Denotes a PR or issue as being assigned to SIG App Management. sig/networking Denotes a PR or issue as being assigned to SIG Networking.

@embik
Member

embik commented Nov 16, 2023

What happened?

On our captain environment, a majority of user clusters that use the cilium system application as CNI are failing to upgrade the Cilium Helm chart to deploy any updates (e.g. when trying to update from 1.13.3 to 1.13.8). The user-cluster-controller-manager logs:

{"level":"info","time":"2023-11-16T09:50:02.361Z","logger":"kkp-app-installation-controller","caller":"action/upgrade.go:144","msg":"preparing upgrade for kube-system-cilium","applicationinstallation":"kube-system/cilium"}
{"level":"error","time":"2023-11-16T09:50:02.651Z","logger":"kkp-app-installation-controller","caller":"application-installation-controller/controller.go:139","msg":"ReconcilingError","applicationinstallation":"kube-system/cilium","error":"handling installation of application installation: another operation (install/upgrade/rollback) is in progress"}
{"level":"error","time":"2023-11-16T09:50:02.651Z","caller":"controller/controller.go:274","msg":"Reconciler error","controller":"kkp-app-installation-controller","object":{"name":"cilium","namespace":"kube-system"},"namespace":"kube-system","name":"cilium","reconcileID":"4025b08d-fe2f-461f-aa7f-0aa8c6d6d4a4","error":"handling installation of application installation: another operation (install/upgrade/rollback) is in progress"}

Helm shows that the release is in "pending-upgrade" state and has a very high revision number:

NAME              	NAMESPACE  	REVISION	UPDATED                                	STATUS         	CHART               	APP VERSION
kube-system-cilium	kube-system	23523   	2023-09-19 21:08:45.002273733 +0000 UTC	pending-upgrade	cilium-1.13.3       	1.13.3
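To spot affected releases quickly, a small filter over the `helm list` output can help. This is only a sketch: `pending_releases` is a hypothetical helper name, and it assumes the tab-separated table layout shown above, with STATUS in the fifth column.

```shell
# Print name, namespace, revision and status of every release whose
# status starts with "pending-" (pending-install, pending-upgrade,
# pending-rollback). Reads the tabular output of `helm list` on stdin.
pending_releases() {
  awk -F'\t' '
    NR == 1 { next }   # skip the header row
    {
      # helm pads columns with spaces, so trim each field first
      for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
      if ($5 ~ /^pending-/) print $1, $2, $3, $5
    }'
}

# Usage against a live cluster (needs helm and cluster access):
#   helm list --namespace kube-system --all | pending_releases
```

`--all` is used because `helm list` only shows deployed and failed releases by default; `helm list --pending` should give a similar view without the helper.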

I'm not sure what creates this condition (it seems to happen to all clusters sooner or later), but my suspicion is that when the user-cluster-controller-manager gets terminated in the middle of an upgrade, the Helm release is left stuck in this state, and the controller does not try to pick it up again after a restart. Since we are constantly running release upgrades (see #12095), the chance of a termination hitting one of those upgrades is quite high.

Expected behavior

KKP does not stop reconciling the CNI system application.

How to reproduce the issue?

Unclear. Probably: install the latest KKP, create a user cluster with Cilium as CNI, and run it through a couple of Kubernetes upgrades for both the underlying seed cluster and the user cluster.

How is your environment configured?

  • KKP version: v2.23.8
  • Shared or separate master/seed clusters?: shared

Provide your KKP manifest here (if applicable)

# paste manifest here

What cloud provider are you running on?

N/A

What operating system are you running in your user cluster?

N/A

Additional information

Workaround

To unblock, it is possible to delete the Helm release secrets from the kube-system namespace. The user-cluster-controller-manager will then re-deploy after some time.

To find the release secrets, run the following command, then delete the secrets it lists:

$ kubectl get secrets -n kube-system | grep sh.helm.release.v1.kube-system-cilium
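A slightly stricter way to select the secrets, as a sketch only: `release_secrets` is a hypothetical helper that keeps just the names Helm 3 uses for its storage secrets (`sh.helm.release.v1.<release>.v<revision>`), so unrelated secrets that merely contain the string are not picked up.

```shell
# Keep only the Helm release secrets that belong to one release.
# stdin: output of `kubectl get secrets -o name` (lines like "secret/<name>")
# $1:    Helm release name, e.g. kube-system-cilium
release_secrets() {
  awk -v prefix="secret/sh.helm.release.v1.${1}.v" '
    # the line must start with the prefix and end in a numeric revision
    index($0, prefix) == 1 && substr($0, length(prefix) + 1) ~ /^[0-9]+$/'
}

# Usage against a live cluster (review the list before deleting):
#   kubectl get secrets -n kube-system -o name | release_secrets kube-system-cilium
#   kubectl get secrets -n kube-system -o name \
#     | release_secrets kube-system-cilium \
#     | xargs kubectl delete -n kube-system
```

Helm also labels these secrets with `owner=helm` and `name=<release>`, so `kubectl get secrets -n kube-system -l owner=helm,name=kube-system-cilium` should select the same set.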
@embik added the kind/bug, sig/app-management, and sig/networking labels Nov 16, 2023
@embik added this to the KKP 2.25 milestone Nov 16, 2023
@xrstf
Contributor

xrstf commented Nov 29, 2023

Instead of deleting the Secrets manually, I found that simply running helm rollback also helps.
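A rough sketch of what that could look like, assuming the release name and namespace from the issue description. `last_deployed_revision` is a hypothetical helper that reads the tab-separated `helm history` table and picks the newest revision still marked as deployed. Whether that is the right rollback target for a given cluster is an assumption to verify first.

```shell
# Print the highest revision whose status is "deployed".
# stdin: tabular output of `helm history <release>`
#        (columns: REVISION, UPDATED, STATUS, CHART, APP VERSION, DESCRIPTION)
last_deployed_revision() {
  awk -F'\t' '
    BEGIN { best = 0 }
    NR > 1 {
      rev = $1; status = $3
      gsub(/ /, "", rev); gsub(/ /, "", status)   # strip column padding
      if (status == "deployed" && rev + 0 > best) best = rev + 0
    }
    END { if (best > 0) print best }'
}

# Usage against a live cluster:
#   rev=$(helm history kube-system-cilium -n kube-system | last_deployed_revision)
#   helm rollback kube-system-cilium "$rev" -n kube-system
```

If the revision argument is omitted, `helm rollback` rolls back to the previous release, which may already be enough here.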

@kubermatic-bot
Contributor

Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@kubermatic-bot added the lifecycle/stale label Feb 28, 2024
@teimyBr

teimyBr commented Feb 28, 2024

Would there be a fix in KKP 2.24.x?

@embik
Member Author

embik commented Feb 28, 2024

/remove-lifecycle stale

@kubermatic-bot removed the lifecycle/stale label Feb 28, 2024
@embik
Member Author

embik commented Feb 28, 2024

Hi @teimyBr, this would likely depend on the fix. We are working on improving the general logic around Helm installations, but a fix (just) for getting out of the described situation would very likely be backported to 2.24, yes.

@toschneck
Member

crossref 2.25 backport #13332
