4.10 upgrade blockers #1168

Closed
vrutkovs opened this issue Mar 20, 2022 · 6 comments · Fixed by openshift/okd-machine-os#334

Comments

@vrutkovs
Member

Upgrading from 4.10.0-0.okd-2022-03-07-131213 to the latest nightly gets stuck on AWS/GCP due to broken networking.

FCOS tracking bug - coreos/fedora-coreos-tracker#1136

@vrutkovs vrutkovs pinned this issue Mar 20, 2022
@LorbusChris LorbusChris changed the title New 4.10 stable cannot be released 4.10 upgrade blockers Mar 23, 2022
@LorbusChris
Contributor

I'm doing some testing over in openshift/okd-machine-os#328

We're possibly hitting https://bugzilla.redhat.com/show_bug.cgi?id=2058030

@fortinj66
Contributor

I can reproduce this update issue on AWS. After an update to any nightly, the first master and worker have broken networking. Neither can be reached over SSH or pinged, while the others are still fine.

Unfortunately, without being able to get into the broken servers, debugging is difficult at best.

Note that a new cluster with the same nightly installs fine.

I have not tried a nightly-to-nightly upgrade.

Here is my confusion. If you look at the release notes, there has not been an update to MCO since the 2022-03-07 OKD release.

@fortinj66
Contributor

fortinj66 commented Apr 1, 2022

I found the issue:

Apr 01 20:53:03 localhost.localdomain systemd[1]: Starting Open vSwitch Database Unit...
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2270]: ovsdb-tool: I/O error: /etc/openvswitch/conf.db: open failed (Permission denied)
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2272]: ovsdb-server: I/O error: /etc/openvswitch/conf.db: failed to lock lockfile (Resource temporarily unavailable)
Apr 01 20:53:04 localhost.localdomain ovs-ctl[2227]: Starting ovsdb-server ... failed!

During the upgrade the file ownership in /etc/openvswitch is changed to the following:

[core@localhost openvswitch]$ ls -al
total 624
drwxr-xr-x.  2 gluster gluster     86 Apr  1 21:16 .
drwxr-xr-x. 90 root    root      8192 Apr  1 21:22 ..
-rw-------.  1 gluster gluster      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 gluster gluster 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 gluster gluster    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 gluster gluster     37 Apr  1 18:36 system-id.conf
[core@localhost openvswitch]$ ls -aln
total 624
drwxr-xr-x.  2 983 979     86 Apr  1 21:16 .
drwxr-xr-x. 90   0   0   8192 Apr  1 21:22 ..
-rw-------.  1 983 979      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 983 979 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 983 979    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 983 979     37 Apr  1 18:36 system-id.conf

It should be:

[core@localhost openvswitch]$ ls -al
total 624
drwxr-xr-x.  2 openvswitch gluster     86 Apr  1 21:16 .
drwxr-xr-x. 90 root        root      8192 Apr  1 21:22 ..
-rw-------.  1 openvswitch gluster      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 openvswitch gluster 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 openvswitch gluster    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 openvswitch gluster     37 Apr  1 18:36 system-id.conf
[core@localhost openvswitch]$ 
[core@localhost openvswitch]$ ls -aln
total 624
drwxr-xr-x.  2 985 979     86 Apr  1 21:16 .
drwxr-xr-x. 90   0   0   8192 Apr  1 21:22 ..
-rw-------.  1 985 979      0 Apr  1 18:36 .conf.db.~lock~
-rw-r-----.  1 985 979 618081 Apr  1 21:16 conf.db
-rw-r--r--.  1 985 979    163 Apr  1 18:35 default.conf
-rw-r--r--.  1 985 979     37 Apr  1 18:36 system-id.conf

Changing the file ownership and rebooting resolves the issue.

So the question is: where is this being changed?

Edit: the group ownership also needs to be changed to hugetlbfs for consistency.
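
For anyone hitting the same symptom, the manual workaround described above could be sketched roughly as follows. This is only an illustration, not a script taken from the fix: the helper names `list_wrong_owner` and `fix_owner` are hypothetical, and the `openvswitch` user and `hugetlbfs` group are the values reported in this comment, so check them against a healthy node before changing anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of the actual fix): print every entry
# under DIR whose owning user is not EXPECTED_USER (name or numeric uid).
list_wrong_owner() {
    dir="$1"
    expected_user="$2"
    find "$dir" ! -user "$expected_user" -print
}

# Hypothetical helper: recursively reset ownership of DIR to OWNER.
# OWNER is "user" or "user:group"; on a real node this needs root.
fix_owner() {
    dir="$1"
    owner="$2"   # assumption from this thread: openvswitch:hugetlbfs
    chown -R "$owner" "$dir"
}
```

On an affected node this would be used as `list_wrong_owner /etc/openvswitch openvswitch` to confirm the mismatch, then, as root, `fix_owner /etc/openvswitch openvswitch:hugetlbfs`, followed by a reboot as noted above.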

@vrutkovs
Member Author

vrutkovs commented Apr 5, 2022

Reopening to track when this lands in the stable channel. The fix went into https://amd64.origin.releases.ci.openshift.org/releasestream/4.10.0-0.okd/release/4.10.0-0.okd-2022-04-05-104439

@vrutkovs vrutkovs reopened this Apr 5, 2022
@fortinj66
Contributor

I think #1182 is a blocker too, at least for vSphere IPI. Unless you know the workaround, bootstrap never finishes.

@vrutkovs
Copy link
Member Author

All blockers have been resolved and a new 4.10 version has been released.

@vrutkovs vrutkovs unpinned this issue May 25, 2022