Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1884101: Fixes systemd ovs check for ovn/sdn #816

Merged

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Oct 1, 2020

There is a period of time where MCO will lay down the files for 4.6
before it reboots the node into the new OS. If the ovs pods are coming
up at this time they could accidentally think systemd ovs is running,
which it is not. This patch modifies the check to look in systemd, where
the ovs-configuration service will be enabled only when 4.6 is booted.

Signed-off-by: Tim Rozet trozet@redhat.com

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Oct 1, 2020
@openshift-ci-robot
Copy link
Contributor

@trozet: This pull request references Bugzilla bug 1884101, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1884101: Fixes systemd ovs check for ovn/sdn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2020
@trozet
Copy link
Contributor Author

trozet commented Oct 1, 2020

@runcom @juanluisvaladas @danwinship PTAL

@knobunc
Copy link
Contributor

knobunc commented Oct 1, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020
@juanluisvaladas
Copy link
Contributor

As @dcbw and @squeed told me in the past that checking if something is enabled in systemd by checking the file is unreliable.
Given that we control the whole environment and that code freeze is tomorrow, I think it's reasonably safe and I would move forward with this.

@abhat
Copy link
Contributor

abhat commented Oct 1, 2020

As @dcbw and @squeed told me in the past that checking if something is enabled in systemd by checking the file is unreliable.
Given that we control the whole environment and that code freeze is tomorrow, I think it's reasonably safe and I would move forward with this.

Yeah, I think the concern was systemd can have those files put anywhere. But yeah it's ok to move forward with this for now.

@trozet
Copy link
Contributor Author

trozet commented Oct 1, 2020

systemd will put the file where you choose the install target. I don't see how that is unreliable. I guess the unreliable part is, if the service changed to install to a different place then that is a possibility. It's not necessarily a problem for us because we control the ovs-configuration service as part of MCO. It would be more risky if that came from a package rpm in a repo somewhere.

@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

I tihnk there are still issues here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/816/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1311675419161792512/artifacts/e2e-gcp-ovn/pods/

the OVS pods think system OVS has started, but cant' tail logs. And ovn-controller can't talk to OVS on the host

@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

2020-10-01T15:03:25+00:00 - starting ovs-daemons
ovsdb-server is already running.
ovs-vswitchd is already running.
Enabling remote OVSDB managers.
Enabling udp to port 6081 with iptables.
2020-10-01T15:03:25+00:00 - ovs-daemons running
tail: tail: cannot open '/var/log/openvswitch/ovsdb-server.log' for readingcannot open '/var/log/openvswitch/ovs-vswitchd.log' for reading: No such file or directory: No such file or directory

which indicates we did not succeed in the check this PR changes: if [[ -f '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service' ]]; then

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020
@abhat
Copy link
Contributor

abhat commented Oct 1, 2020

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

Most things in the systemd directories are already symlinks, so -f wouldn't match those, which (we think) caused it to always double-start OVS.

@abhat
Copy link
Contributor

abhat commented Oct 1, 2020

Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?

Most things in the systemd directories are already symlinks, so -f wouldn't match those, which (we think) caused it to always double-start OVS.

Yep, that makes sense.

There is a period of time where MCO will lay down the files for 4.6
before it reboots the node into the new OS. If the ovs pods are coming
up at this time they could accidentally think systemd ovs is running,
which it is not. This patch modifies the check to look in systemd, where
the ovs-configuration service will be enabled only when 4.6 is booted.

Signed-off-by: Tim Rozet <trozet@redhat.com>
@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

/hold cancel
/lgtm

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 1, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dcbw, knobunc, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dcbw,knobunc,trozet]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@trozet
Copy link
Contributor Author

trozet commented Oct 1, 2020

as @dcbw pointed out using -e wont work because it will try to dereference the symlink. So -e /host/etc/systemd/system/network-online.target.wants/ovs-configuration.service will try to dereference to /etc/systemd/system/ovs-configuration.service in the container, which is really mounted at /host/etc/systemd/system/ovs-configuration.service. -L will not dereference the symlink so that will work.

# Check to see if ovs is provided by the node:
if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then
if [[ -L '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service' ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a fail-safe is it possible with dbus to try to start ovs-vswitchd regardless. presumably org.freedesktop.systemd1.Unit.Start is a no-op if ovs-vswitchd is already started.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried that originally and the short answer is "it's more complicated than we thought, error-prone, and not possible to do yet". Mainly due to containers having different UID, PID space, and stuff like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dbus API seems to work inside the ovs container, it starts a stopped ovs-vswitchd and doesn't affect a started ovs-vswitchd

gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/ovs_2dvswitchd_2eservice --method org.freedesktop.systemd1.Unit.Start "replace"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont want to start ovs here, we just want to detect if it is running in systemd. Through upgrade process we need container ovs to still work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I meant after this line, once we detect ovs-configuration.service we could add a failsafe ovs-vswitchd start.

If ovs-vswitchd is already running then systemd won't do anything, but other-wise we have a failsafe.

We seem to already have a failsafe start of ovsdb-server on line 63.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--- a/bindata/network/openshift-sdn/sdn-ovs.yaml
+++ b/bindata/network/openshift-sdn/sdn-ovs.yaml
@@ -38,12 +38,12 @@ spec:

           # systemctl cannot be used in a separate PID namespace to reach
           # the systemd running in PID 1. Therefore we need to use the dbus API
-          systemctl_restart(){
+          systemctl_dbus(){
             gdbus call \
               --system \
               --dest org.freedesktop.systemd1 \
-              --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${1})"_2eservice \
-              --method org.freedesktop.systemd1.Unit.Restart "replace"
+              --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${2})"_2eservice \
+              --method org.freedesktop.systemd1.Unit."${1}" "replace"
           }
           svc_encode_name(){
             # systemd encodes some characters, so far we only need to encode
@@ -53,13 +53,14 @@ spec:
           # Check to see if ovs is provided by the node:
           if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then
             echo "openvswitch is running in systemd"
+            systemctl_dbus Start ovs-vswitchd
             # In some very strange corner cases, the owner for /run/openvswitch
             # can be wrong, so we need to clean up and restart.
             ovs_uid=$(chroot /host id -u openvswitch)
             ovs_gid=$(chroot /host id -g openvswitch)
             chown -R "${ovs_uid}:${ovs_gid}" /run/openvswitch
             if [[ ! -S /run/openvswitch/db.sock ]]; then
-              systemctl_restart ovsdb-server
+              systemctl_dbus Restart ovsdb-server
             fi
             # Don't need to worry about restoring flows; this can only change if we've rebooted
             exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@dcbw
Copy link
Contributor

dcbw commented Oct 1, 2020

Most recent test failures are either Azure resource problems bringing up VMs, or a flake (apiserver terminating gracefully) on the SDN multi job, Windows job is the current AWS outage.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@trozet
Copy link
Contributor Author

trozet commented Oct 2, 2020

@dcbw @knobunc e2e upgrade is failing due to:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:296]: Oct 2 00:23:51.411: Service was unreachable during disruption for at least 2m11s of 1h27m10s (3%)

disruption_tests: [sig-network-edge] Application behind service load balancer with PDB is not disrupted expand_less 1h30m12s
Oct 2 00:23:51.411: Service was unreachable during disruption for at least 2m11s of 1h27m10s (3%):
: [sig-cluster-lifecycle] cluster upgrade should be fast expand_less 1h26m50s
Upgrade took too long: 86.83529349833333

I can see in all the ovs logs ovs is running as systemd. Not sure if we want to override this or wait a few more iterations to see if we get a pass.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Oct 2, 2020

@trozet: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-openstack 6899e83 link /test e2e-openstack

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit dfea1dc into openshift:master Oct 2, 2020
@openshift-ci-robot
Copy link
Contributor

@trozet: All pull requests linked via external trackers have merged:

Bugzilla bug 1884101 has been moved to the MODIFIED state.

In response to this:

Bug 1884101: Fixes systemd ovs check for ovn/sdn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants