4.10 upgrade blockers #1168

vrutkovs · 2022-03-20T06:30:43Z

Upgrade from 4.10.0-0.okd-2022-03-07-131213 to latest nightly gets stuck on AWS/GCP due to broken networking.

FCOS tracking bug - coreos/fedora-coreos-tracker#1136

LorbusChris · 2022-03-25T11:29:56Z

I'm doing some testing over in openshift/okd-machine-os#328

We're possibly hitting https://bugzilla.redhat.com/show_bug.cgi?id=2058030

fortinj66 · 2022-04-01T17:18:25Z

I can reproduce this update issue on AWS. After an update to any nightly, the first master and worker lose networking: neither can be reached over SSH or ping, while the other nodes are still fine.

Unfortunately, without being able to get into the broken servers, debugging is difficult at best.

Note that a new cluster with the same nightly installs fine.

I have not tried a nightly-to-nightly upgrade.

What confuses me: according to the release notes, there has not been an update to MCO since the 2022-03-07 OKD release.

fortinj66 · 2022-04-01T21:41:53Z

I found the issue:

Apr 01 20:53:03 localhost.localdomain systemd[1]: Starting Open vSwitch Database Unit...
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2270]: ovsdb-tool: I/O error: /etc/openvswitch/conf.db: open failed (Permission denied)
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2272]: ovsdb-server: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2227]: Starting ovsdb-server ... failed!

During the upgrade the file ownership in /etc/openvswitch is changed to the following:

[core@localhost openvswitch]$ ls -al
total 624
drwxr-xr-x.  2 gluster gluster     86 Apr  1 21:16 .
drwxr-xr-x. 90 root    root      8192 Apr  1 21:22 ..
-rw-------.  1 gluster gluster      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 gluster gluster 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 gluster gluster    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 gluster gluster     37 Apr  1 18:36 system-id.conf
[core@localhost openvswitch]$ ls -aln
total 624
drwxr-xr-x.  2 983 979     86 Apr  1 21:16 .
drwxr-xr-x. 90   0   0   8192 Apr  1 21:22 ..
-rw-------.  1 983 979      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 983 979 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 983 979    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 983 979     37 Apr  1 18:36 system-id.conf

It should be:


[core@localhost openvswitch]$ ls -al
total 624
drwxr-xr-x.  2 openvswitch gluster     86 Apr  1 21:16 .
drwxr-xr-x. 90 root        root      8192 Apr  1 21:22 ..
-rw-------.  1 openvswitch gluster      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 openvswitch gluster 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 openvswitch gluster    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 openvswitch gluster     37 Apr  1 18:36 system-id.conf
[core@localhost openvswitch]$ 
[core@localhost openvswitch]$ ls -aln
total 624
drwxr-xr-x.  2 985 979     86 Apr  1 21:16 .
drwxr-xr-x. 90   0   0   8192 Apr  1 21:22 ..
-rw-------.  1 985 979      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 985 979 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 985 979    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 985 979     37 Apr  1 18:36 system-id.conf
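The comparison between the two numeric listings can be automated. The following is a minimal sketch, not something used in this thread: it parses `ls -aln` output and reports entries whose UID differs from the expected one. The function names `parse_ls_numeric` and `wrong_owner` are hypothetical, and the expected UID 985 is taken from the listing above, so it may differ on other nodes.

```python
def parse_ls_numeric(listing):
    """Parse `ls -aln` output into (name, uid, gid) tuples.

    Lines that are not long-format entries (the `total` line, shell
    prompts, blank lines) are skipped.
    """
    entries = []
    for line in listing.splitlines():
        # Long format: mode, links, uid, gid, size, month, day, time, name
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        if not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        entries.append((fields[8], int(fields[2]), int(fields[3])))
    return entries


def wrong_owner(listing, expected_uid, skip=("..",)):
    """Return names whose numeric owner is not expected_uid.

    The parent directory entry `..` is skipped by default because it
    is normally owned by root.
    """
    return [
        name
        for name, uid, _gid in parse_ls_numeric(listing)
        if name not in skip and uid != expected_uid
    ]


# Sample copied from the broken node's `ls -aln` output in this comment.
BROKEN_LISTING = """\
total 624
drwxr-xr-x.  2 983 979     86 Apr  1 21:16 .
drwxr-xr-x. 90   0   0   8192 Apr  1 21:22 ..
-rw-------.  1 983 979      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 983 979 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 983 979    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 983 979     37 Apr  1 18:36 system-id.conf
"""
```

Calling `wrong_owner(BROKEN_LISTING, 985)` flags the directory itself and all four files, which matches what the two listings show side by side.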

Changing the file ownership and rebooting resolves the issue.

So the question is: where is this being changed?

Edit: the group ownership also needs to be changed to hugetlbfs for consistency.
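For anyone applying the manual workaround on an affected node, here is a rough sketch of the ownership check and reset described above. It is an illustration only, not the fix that was merged for this issue. The helper names `find_mismatched` and `fix_ownership` are hypothetical; the user `openvswitch`, the group `hugetlbfs`, and the path `/etc/openvswitch` come from this comment. It assumes both accounts exist on the node and that it runs as root when `dry_run` is off.

```python
import grp
import os
import pwd
from pathlib import Path


def find_mismatched(directory, expected_uid, expected_gid):
    """Return the directory and its direct children whose uid or gid
    differs from the expected values."""
    root = Path(directory)
    mismatched = []
    for path in [root, *sorted(root.iterdir())]:
        st = path.lstat()  # do not follow symlinks
        if st.st_uid != expected_uid or st.st_gid != expected_gid:
            mismatched.append(str(path))
    return mismatched


def fix_ownership(directory, user="openvswitch", group="hugetlbfs", dry_run=True):
    """Reset ownership on mismatched entries and return their paths.

    With dry_run=True (the default) nothing is changed; the paths that
    would be changed are only reported.
    """
    uid = pwd.getpwnam(user).pw_uid   # raises KeyError if the user is missing
    gid = grp.getgrnam(group).gr_gid  # raises KeyError if the group is missing
    mismatched = find_mismatched(directory, uid, gid)
    if not dry_run:
        for path in mismatched:
            os.chown(path, uid, gid, follow_symlinks=False)
    return mismatched
```

A cautious way to use it would be to call `fix_ownership("/etc/openvswitch")` first to see which paths are reported, then repeat with `dry_run=False` and reboot, as described above.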

vrutkovs · 2022-04-05T12:40:56Z

Reopening to track when this lands in the stable channel. The fix went into https://amd64.origin.releases.ci.openshift.org/releasestream/4.10.0-0.okd/release/4.10.0-0.okd-2022-04-05-104439

fortinj66 · 2022-04-09T12:27:03Z

I think #1182 is a blocker too, at least for vSphere IPI. Unless you know the workaround, bootstrap never finishes.

vrutkovs · 2022-04-27T07:00:16Z

All blockers are resolved, and a new 4.10 version has been released.

vrutkovs pinned this issue Mar 20, 2022

vrutkovs mentioned this issue Mar 21, 2022

CVE-2022-0811 Container escape in cri-o #1156

Closed

LorbusChris changed the title ~~New 4.10 stable cannot be released~~ 4.10 upgrade blockers Mar 23, 2022

This was referenced Apr 2, 2022

OKD overlay: fix openvswitch permissions openshift/okd-machine-os#331

Closed

okd overlay: fix openvswitch permissions openshift/okd-machine-os#334

Merged

abaxo mentioned this issue Apr 4, 2022

Upgrade from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 results in failed networking #1169

Closed

openshift-merge-robot closed this as completed in openshift/okd-machine-os#334 Apr 5, 2022

vrutkovs reopened this Apr 5, 2022

vrutkovs closed this as completed Apr 27, 2022

vrutkovs unpinned this issue May 25, 2022
