
Cannot get cluster machine-config operator out of degraded state after fixing reason for an upgrade failure #1261

Closed
markusdd opened this issue Jun 14, 2022 · 5 comments

Comments

@markusdd

Due to coreos/fedora-coreos-tracker#701
we had an upgrade failure in our cluster: kubelet would not start because of SELinux denials after the machine-config operator rebooted the nodes.
We were able to fix that by restoring the SELinux policy to the version delivered by FCOS before re-applying our necessary changes (we run some EDA tools that need execheap and execmod on nfs_t).
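For reference, the per-node workaround looked roughly like the sketch below; the /usr/etc path, the availability of audit2allow on the node, and the module name eda_nfs are assumptions, not details from this issue.

```shell
# Run on the affected node, e.g. via: oc debug node/<node>
# Show the recent SELinux denials that keep kubelet from starting
sudo ausearch -m avc -ts recent

# Restore the policy shipped by FCOS; on rpm-ostree systems the pristine
# copy of /etc is kept under /usr/etc (path assumed)
sudo cp -a /usr/etc/selinux/. /etc/selinux/

# Re-apply our local customizations (execheap/execmod on nfs_t) as a
# local module; the module name "eda_nfs" is hypothetical
sudo ausearch -m avc -ts recent | sudo audit2allow -M eda_nfs
sudo semodule -i eda_nfs.pp

sudo systemctl reboot
```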

Using this method, our first master finished the upgrade and the upgrade progressed through the worker nodes flawlessly. The two remaining masters, however, do not seem to be upgrading, because the machine-config operator is permanently in a degraded state and I have no idea how to get it out of that state. Other than that the cluster works perfectly and jobs are running, but the update isn't progressing.
I have seen some messages regarding etcd, but I have no reason to believe it is degraded: it is running on all nodes, the guards are there, and no degradation is reported.

All other operators are already at the new version and running fine, so this is the state:
[screenshots of the cluster state]
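For anyone comparing against their own cluster, the same state can be inspected from the CLI with standard oc commands, e.g.:

```shell
# Upgrade progress and overall operator health
oc adm upgrade
oc get clusteroperators

# Details on the degraded machine-config operator and its pools
oc describe clusteroperator machine-config
oc get machineconfigpools
oc get nodes -o wide
```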

Please help us get out of this mess. I also tried to retrigger the machine-config daemon on the first master by using sudo touch /run/machine-config-daemon-force and removing the daemon pod to force re-creation. This led to a reboot and re-application of the config, but it ended up in the same state.
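Concretely, the force-rerun attempt was along these lines (<master> is a placeholder for the node name):

```shell
# On the affected master, e.g. via: oc debug node/<master>
sudo touch /run/machine-config-daemon-force

# Delete that node's machine-config-daemon pod so it is recreated
# and re-applies the rendered machine config
oc delete pod -n openshift-machine-config-operator \
  -l k8s-app=machine-config-daemon \
  --field-selector spec.nodeName=<master>
```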

Describe the bug
After fixing the SELinux issue, the worker node updates progressed and finished, but the remaining two masters will not update because I cannot get the machine-config operator out of its degraded state.

Version
4.10.0-0.okd-2022-05-28-062148, upgrading to 4.10.0-0.okd-2022-06-10-131327

Log bundle
https://next.mkcloud.dynu.net/index.php/s/MeDEWsPgFgJnnKg

@markusdd
Author

Ok, haha, I got this running again, but it was super subtle and the message above is misleading, because it says 'retrying'.
[screenshot: update status message reporting machine-config degraded and retrying]

Actually, the master MachineConfigPool had set itself to 'paused'; I only noticed this by accident. Setting it back to resumed actually made the update continue.
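For reference, the paused state can also be checked and cleared from the CLI; a minimal sketch with standard oc commands:

```shell
# Check whether the master pool is paused
oc get machineconfigpool master -o jsonpath='{.spec.paused}{"\n"}'

# Resume it so the rollout continues
oc patch machineconfigpool master --type merge --patch '{"spec":{"paused":false}}'
```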

So this issue would now be about how to improve this log situation, because to me it looked like something was still fundamentally broken.

@vrutkovs
Member

I don't follow - which log needs to be changed?

@markusdd
Author

markusdd commented Jun 15, 2022

[screenshot: update status message reporting machine-config degraded and retrying]

Here it reports that the machine config is degraded, and that it is retrying.

In reality, apparently due to the timeout, the master MachineConfigPool paused itself. So you have to go in and unpause it once the underlying problem (in our case the SELinux policy) is fixed.
If you don't notice that paused state, you are left wondering why nothing is continuing. It claims it is retrying; in reality it does nothing.

@vrutkovs
Member

managedFields says the "Mozilla" client set paused last, so it did not happen automatically
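(For anyone following along: the pool's managed fields can be shown with the standard --show-managed-fields flag to see which client last set spec.paused.)

```shell
oc get machineconfigpool master -o yaml --show-managed-fields
```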

@markusdd
Author

markusdd commented Jun 15, 2022

Ok, then I know the problem: there is this 'Pause update' button in the Cluster Settings menu during an update.
We used that to investigate and later pressed 'Resume update'. For the workers this finished the update, but for the masters it did not clear the paused state.

It must have been this, because frankly I did not even know you could pause an MCP from the GUI in the MachineConfigPools screen.
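A quick way to spot a forgotten pause in the future is to list the paused flag for all pools, e.g.:

```shell
oc get machineconfigpool \
  -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused,UPDATED:.status.updatedMachineCount
```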
