Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

trozet · 2020-10-01T14:31:53Z

There is a period of time where MCO will lay down the files for 4.6
before it reboots the node into the new OS. If the ovs pods are coming
up at this time they could accidentally think systemd ovs is running,
which it is not. This patch modifies the check to look in systemd, where
the ovs-configuration service will be enabled only when 4.6 is booted.

Signed-off-by: Tim Rozet trozet@redhat.com

openshift-ci-robot · 2020-10-01T14:32:01Z

@trozet: This pull request references Bugzilla bug 1884101, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.6.0) matches configured target release for branch (4.6.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884101: Fixes systemd ovs check for ovn/sdn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

trozet · 2020-10-01T14:32:57Z

@runcom @juanluisvaladas @danwinship PTAL

knobunc · 2020-10-01T14:37:11Z

/lgtm

juanluisvaladas · 2020-10-01T14:59:12Z

As @dcbw and @squeed told me in the past that checking if something is enabled in systemd by checking the file is unreliable.
Given that we control the whole environment and that code freeze is tomorrow, I think it's reasonably safe and I would move forward with this.

abhat · 2020-10-01T15:00:41Z

As @dcbw and @squeed told me in the past that checking if something is enabled in systemd by checking the file is unreliable.
Given that we control the whole environment and that code freeze is tomorrow, I think it's reasonably safe and I would move forward with this.

Yeah, I think the concern was systemd can have those files put anywhere. But yeah it's ok to move forward with this for now.

trozet · 2020-10-01T15:10:25Z

systemd will put the file where you choose the install target. I don't see how that is unreliable. I guess the unreliable part is, if the service changed to install to a different place then that is a possibility. It's not necessarily a problem for us because we control the ovs-configuration service as part of MCO. It would be more risky if that came from a package rpm in a repo somewhere.

dcbw · 2020-10-01T16:10:38Z

I tihnk there are still issues here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/816/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1311675419161792512/artifacts/e2e-gcp-ovn/pods/

the OVS pods think system OVS has started, but cant' tail logs. And ovn-controller can't talk to OVS on the host

dcbw · 2020-10-01T16:10:52Z

/hold

dcbw · 2020-10-01T16:21:14Z

2020-10-01T15:03:25+00:00 - starting ovs-daemons
ovsdb-server is already running.
ovs-vswitchd is already running.
Enabling remote OVSDB managers.
Enabling udp to port 6081 with iptables.
2020-10-01T15:03:25+00:00 - ovs-daemons running
tail: tail: cannot open '/var/log/openvswitch/ovsdb-server.log' for readingcannot open '/var/log/openvswitch/ovs-vswitchd.log' for reading: No such file or directory: No such file or directory

which indicates we did not succeed in the check this PR changes: if [[ -f '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service' ]]; then

abhat · 2020-10-01T18:01:51Z

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

dcbw · 2020-10-01T18:09:35Z

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

Most things in the systemd directories are already symlinks, so -f wouldn't match those, which (we think) caused it to always double-start OVS.

abhat · 2020-10-01T18:11:17Z

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

Most things in the systemd directories are already symlinks, so -f wouldn't match those, which (we think) caused it to always double-start OVS.

Yep, that makes sense.

There is a period of time where MCO will lay down the files for 4.6 before it reboots the node into the new OS. If the ovs pods are coming up at this time they could accidentally think systemd ovs is running, which it is not. This patch modifies the check to look in systemd, where the ovs-configuration service will be enabled only when 4.6 is booted. Signed-off-by: Tim Rozet <trozet@redhat.com>

dcbw · 2020-10-01T19:40:30Z

/hold cancel
/lgtm

openshift-ci-robot · 2020-10-01T19:40:47Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, knobunc, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dcbw,knobunc,trozet]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

trozet · 2020-10-01T19:41:25Z

as @dcbw pointed out using -e wont work because it will try to dereference the symlink. So -e /host/etc/systemd/system/network-online.target.wants/ovs-configuration.service will try to dereference to /etc/systemd/system/ovs-configuration.service in the container, which is really mounted at /host/etc/systemd/system/ovs-configuration.service. -L will not dereference the symlink so that will work.

rbbratta · 2020-10-01T20:37:40Z

bindata/network/openshift-sdn/sdn-ovs.yaml

          # Check to see if ovs is provided by the node:
-          if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then
+          if [[ -L '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service' ]]; then


as a fail-safe is it possible with dbus to try to start ovs-vswitchd regardless. presumably org.freedesktop.systemd1.Unit.Start is a no-op if ovs-vswitchd is already started.

We tried that originally and the short answer is "it's more complicated than we thought, error-prone, and not possible to do yet". Mainly due to containers having different UID, PID space, and stuff like that.

the dbus API seems to work inside the ovs container, it starts a stopped ovs-vswitchd and doesn't affect a started ovs-vswitchd

gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/ovs_2dvswitchd_2eservice --method org.freedesktop.systemd1.Unit.Start "replace"

we dont want to start ovs here, we just want to detect if it is running in systemd. Through upgrade process we need container ovs to still work.

Ah, I meant after this line, once we detect ovs-configuration.service we could add a failsafe ovs-vswitchd start.

If ovs-vswitchd is already running then systemd won't do anything, but other-wise we have a failsafe.

We seem to already have a failsafe start of ovsdb-server on line 63.

--- a/bindata/network/openshift-sdn/sdn-ovs.yaml +++ b/bindata/network/openshift-sdn/sdn-ovs.yaml @@ -38,12 +38,12 @@ spec: # systemctl cannot be used in a separate PID namespace to reach # the systemd running in PID 1. Therefore we need to use the dbus API - systemctl_restart(){ + systemctl_dbus(){ gdbus call \ --system \ --dest org.freedesktop.systemd1 \ - --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${1})"_2eservice \ - --method org.freedesktop.systemd1.Unit.Restart "replace" + --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${2})"_2eservice \ + --method org.freedesktop.systemd1.Unit."${1}" "replace" } svc_encode_name(){ # systemd encodes some characters, so far we only need to encode @@ -53,13 +53,14 @@ spec: # Check to see if ovs is provided by the node: if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then echo "openvswitch is running in systemd" + systemctl_dbus Start ovs-vswitchd # In some very strange corner cases, the owner for /run/openvswitch # can be wrong, so we need to clean up and restart. ovs_uid=$(chroot /host id -u openvswitch) ovs_gid=$(chroot /host id -g openvswitch) chown -R "${ovs_uid}:${ovs_gid}" /run/openvswitch if [[ ! -S /run/openvswitch/db.sock ]]; then - systemctl_restart ovsdb-server + systemctl_dbus Restart ovsdb-server fi # Don't need to worry about restoring flows; this can only change if we've rebooted exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log

openshift-bot · 2020-10-01T21:12:12Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-10-01T21:38:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

dcbw · 2020-10-01T21:54:28Z

Most recent test failures are either Azure resource problems bringing up VMs, or a flake (apiserver terminating gracefully) on the SDN multi job, Windows job is the current AWS outage.

openshift-bot · 2020-10-01T22:04:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-10-01T23:35:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-10-02T00:40:34Z

/retest

Please review the full test history for this PR and help us cut down flakes.

trozet · 2020-10-02T00:52:08Z

@dcbw @knobunc e2e upgrade is failing due to:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:296]: Oct 2 00:23:51.411: Service was unreachable during disruption for at least 2m11s of 1h27m10s (3%)

disruption_tests: [sig-network-edge] Application behind service load balancer with PDB is not disrupted expand_less	1h30m12s
Oct 2 00:23:51.411: Service was unreachable during disruption for at least 2m11s of 1h27m10s (3%):

: [sig-cluster-lifecycle] cluster upgrade should be fast expand_less	1h26m50s
Upgrade took too long: 86.83529349833333

I can see in all the ovs logs ovs is running as systemd. Not sure if we want to override this or wait a few more iterations to see if we get a pass.

openshift-bot · 2020-10-02T00:53:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-10-02T01:22:57Z

@trozet: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-openstack	`6899e83`	link	`/test e2e-openstack`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-bot · 2020-10-02T01:32:15Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-10-02T03:32:36Z

@trozet: All pull requests linked via external trackers have merged:

openshift/cluster-network-operator#816

Bugzilla bug 1884101 has been moved to the MODIFIED state.

In response to this:

Bug 1884101: Fixes systemd ovs check for ovn/sdn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Oct 1, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Oct 1, 2020

openshift-ci-robot requested review from danwinship and tssurya October 1, 2020 14:32

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2020

trozet force-pushed the fix_ovs_systemd_check branch from 9df5674 to ac28599 Compare October 1, 2020 14:32

openshift-ci-robot assigned knobunc Oct 1, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020

trozet force-pushed the fix_ovs_systemd_check branch from ac28599 to c6022ac Compare October 1, 2020 17:15

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020

trozet force-pushed the fix_ovs_systemd_check branch from c6022ac to 2fb87d6 Compare October 1, 2020 18:33

trozet force-pushed the fix_ovs_systemd_check branch from 2fb87d6 to 6899e83 Compare October 1, 2020 19:36

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020

openshift-ci-robot assigned dcbw Oct 1, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020

rbbratta reviewed Oct 1, 2020

View reviewed changes

openshift-merge-robot merged commit dfea1dc into openshift:master Oct 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

trozet commented Oct 1, 2020

openshift-ci-robot commented Oct 1, 2020

trozet commented Oct 1, 2020

knobunc commented Oct 1, 2020

juanluisvaladas commented Oct 1, 2020

abhat commented Oct 1, 2020

trozet commented Oct 1, 2020

dcbw commented Oct 1, 2020

dcbw commented Oct 1, 2020

dcbw commented Oct 1, 2020

abhat commented Oct 1, 2020

dcbw commented Oct 1, 2020

abhat commented Oct 1, 2020

dcbw commented Oct 1, 2020

openshift-ci-robot commented Oct 1, 2020

trozet commented Oct 1, 2020 •

edited

rbbratta Oct 1, 2020

dcbw Oct 1, 2020

rbbratta Oct 1, 2020

trozet Oct 1, 2020

rbbratta Oct 2, 2020

rbbratta Oct 2, 2020

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

dcbw commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 2, 2020

trozet commented Oct 2, 2020 •

edited

openshift-bot commented Oct 2, 2020

openshift-ci-robot commented Oct 2, 2020 •

edited

openshift-bot commented Oct 2, 2020

openshift-ci-robot commented Oct 2, 2020

Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

Conversation

trozet commented Oct 1, 2020

openshift-ci-robot commented Oct 1, 2020

trozet commented Oct 1, 2020

knobunc commented Oct 1, 2020

juanluisvaladas commented Oct 1, 2020

abhat commented Oct 1, 2020

trozet commented Oct 1, 2020

dcbw commented Oct 1, 2020

dcbw commented Oct 1, 2020

dcbw commented Oct 1, 2020

abhat commented Oct 1, 2020

dcbw commented Oct 1, 2020

abhat commented Oct 1, 2020

dcbw commented Oct 1, 2020

openshift-ci-robot commented Oct 1, 2020

trozet commented Oct 1, 2020 • edited

rbbratta Oct 1, 2020

Choose a reason for hiding this comment

dcbw Oct 1, 2020

Choose a reason for hiding this comment

rbbratta Oct 1, 2020

Choose a reason for hiding this comment

trozet Oct 1, 2020

Choose a reason for hiding this comment

rbbratta Oct 2, 2020

Choose a reason for hiding this comment

rbbratta Oct 2, 2020

Choose a reason for hiding this comment

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

dcbw commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 1, 2020

openshift-bot commented Oct 2, 2020

trozet commented Oct 2, 2020 • edited

openshift-bot commented Oct 2, 2020

openshift-ci-robot commented Oct 2, 2020 • edited

openshift-bot commented Oct 2, 2020

openshift-ci-robot commented Oct 2, 2020

trozet commented Oct 1, 2020 •

edited

trozet commented Oct 2, 2020 •

edited

openshift-ci-robot commented Oct 2, 2020 •

edited