
Cannot get cluster machine-config operator out of degraded state after fixing reason for an upgrade failure #1261

Closed
markusdd opened this issue Jun 14, 2022 · 5 comments

Comments

@markusdd

Due to coreos/fedora-coreos-tracker#701
we had an upgrade failure in our cluster: kubelet would not start because of SELinux denials after the machine-config operator rebooted the nodes.
We were able to fix that by restoring the SELinux policy to the version delivered by FCOS before re-applying our necessary changes (we run some EDA tools that need execheap and execmod on nfs_t).
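For reference, the per-node workaround looked roughly like the sketch below; the /usr/etc path, the availability of audit2allow on the node, and the module name eda_nfs are assumptions, not details from this issue.

```shell
# Run on the affected node, e.g. via: oc debug node/<node>
# Show the recent SELinux denials that keep kubelet from starting
sudo ausearch -m avc -ts recent

# Restore the policy shipped by FCOS; on rpm-ostree systems the pristine
# copy of /etc is kept under /usr/etc (path assumed)
sudo cp -a /usr/etc/selinux/. /etc/selinux/

# Re-apply our local customizations (execheap/execmod on nfs_t) as a
# local module; the module name "eda_nfs" is hypothetical
sudo ausearch -m avc -ts recent | sudo audit2allow -M eda_nfs
sudo semodule -i eda_nfs.pp

sudo systemctl reboot
```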

Using this method, our first master finished the upgrade and the upgrade progressed through the worker nodes flawlessly. The two remaining masters, however, do not seem to be upgrading, because the machine-config operator is permanently in a degraded state and I have no idea how to get it out of that state. Other than that the cluster works perfectly and jobs are running, but the update isn't progressing.
I have seen some messages regarding etcd, but I have no reason to believe it is degraded: it is running on all nodes, the guards are there, and no degradation is reported.

All other operators are already at the new version and running fine, so this is the state:
[screenshots of the cluster state]
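For anyone comparing against their own cluster, the same state can be inspected from the CLI with standard oc commands, e.g.:

```shell
# Upgrade progress and overall operator health
oc adm upgrade
oc get clusteroperators

# Details on the degraded machine-config operator and its pools
oc describe clusteroperator machine-config
oc get machineconfigpools
oc get nodes -o wide
```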

Please help us get out of this mess. I also tried to retrigger the machine-config daemon on the first master by using sudo touch /run/machine-config-daemon-force and removing the daemon pod to force re-creation. This led to a reboot and re-application of the config, but it ended up in the same state.
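Concretely, the force-rerun attempt was along these lines (<master> is a placeholder for the node name):

```shell
# On the affected master, e.g. via: oc debug node/<master>
sudo touch /run/machine-config-daemon-force

# Delete that node's machine-config-daemon pod so it is recreated
# and re-applies the rendered machine config
oc delete pod -n openshift-machine-config-operator \
  -l k8s-app=machine-config-daemon \
  --field-selector spec.nodeName=<master>
```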

Describe the bug
After fixing the SELinux issue, the worker node updates progressed and finished, but the remaining two masters will not update because I cannot get the machine-config operator out of its degraded state.

Version
4.10.0-0.okd-2022-05-28-062148, upgrading to 4.10.0-0.okd-2022-06-10-131327

Log bundle
https://next.mkcloud.dynu.net/index.php/s/MeDEWsPgFgJnnKg

@markusdd
Author

Ok, haha, I got this running again, but it was super subtle and the message above is misleading, because it says 'retrying'.
[screenshot: update status message reporting machine-config degraded and retrying]

Actually, the master MachineConfigPool had set itself to 'paused'; I only noticed this by accident. Setting it back to resumed actually made the update continue.
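For reference, the paused state can also be checked and cleared from the CLI; a minimal sketch with standard oc commands:

```shell
# Check whether the master pool is paused
oc get machineconfigpool master -o jsonpath='{.spec.paused}{"\n"}'

# Resume it so the rollout continues
oc patch machineconfigpool master --type merge --patch '{"spec":{"paused":false}}'
```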

So this issue would now be about how to improve this log situation, because to me it looked like something was still fundamentally broken.

@vrutkovs
Member

I don't follow - which log needs to be changed?

@markusdd
Author

markusdd commented Jun 15, 2022

[screenshot: update status message reporting machine-config degraded and retrying]

Here it reports that the machine config is degraded, and that it is retrying.

In reality, apparently due to the timeout, the master MachineConfigPool paused itself. So you have to go in and unpause it once the underlying problem (in our case the SELinux policy) is fixed.
If you don't notice that paused state, you are left wondering why nothing is continuing. It claims it is retrying; in reality it does nothing.

@vrutkovs
Member

managedFields says the "Mozilla" client set paused last, so it did not happen automatically
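(For anyone following along: the pool's managed fields can be shown with the standard --show-managed-fields flag to see which client last set spec.paused.)

```shell
oc get machineconfigpool master -o yaml --show-managed-fields
```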

@markusdd
Author

markusdd commented Jun 15, 2022

Ok, then I know the problem: there is this 'Pause update' button in the Cluster Settings menu during an update.
We used that to investigate and later pressed 'Resume update'. For the workers this finished the update, but for the masters it did not clear the paused state.

It must have been this, because frankly I did not even know you could pause an MCP from the GUI in the MachineConfigPools screen.
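A quick way to spot a forgotten pause in the future is to list the paused flag for all pools, e.g.:

```shell
oc get machineconfigpool \
  -o custom-columns=NAME:.metadata.name,PAUSED:.spec.paused,UPDATED:.status.updatedMachineCount
```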
