Bug 1884101: Fixes systemd ovs check for ovn/sdn #816
@trozet: This pull request references Bugzilla bug 1884101, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 9df5674 to ac28599.
/lgtm
Yeah, I think the concern was that systemd can have those files put anywhere. But yeah, it's OK to move forward with this for now.
systemd will put the file wherever you choose the install target. I don't see how that is unreliable. I guess the unreliable part is that the service could change to install to a different place. That's not necessarily a problem for us because we control the ovs-configuration service as part of MCO. It would be riskier if it came from a package RPM in a repo somewhere.
I think there are still issues here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/816/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1311675419161792512/artifacts/e2e-gcp-ovn/pods/ The OVS pods think system OVS has started, but can't tail logs. And ovn-controller can't talk to OVS on the host.
/hold
which indicates we did not succeed in the check this PR changes:
Force-pushed from ac28599 to c6022ac.
Guessing you changed this to allow the service file to be a symlink to the file on disk at /usr/lib(64)/systemd?
Most things in the systemd directories are already symlinks, so -f wouldn't match those, which (we think) caused it to always double-start OVS.
Yep, that makes sense.
Force-pushed from c6022ac to 2fb87d6.
There is a period of time where MCO will lay down the files for 4.6 before it reboots the node into the new OS. If the ovs pods are coming up at this time they could accidentally think systemd ovs is running, which it is not. This patch modifies the check to look in systemd, where the ovs-configuration service will be enabled only when 4.6 is booted. Signed-off-by: Tim Rozet <trozet@redhat.com>
Force-pushed from 2fb87d6 to 6899e83.
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dcbw, knobunc, trozet The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
As @dcbw pointed out, using -e won't work because it dereferences the symlink. So -e /host/etc/systemd/system/network-online.target.wants/ovs-configuration.service will try to dereference to /etc/systemd/system/ovs-configuration.service in the container, while the file is really mounted at /host/etc/systemd/system/ovs-configuration.service. -L does not dereference the symlink, so that will work.
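To make the difference between the test operators concrete, here is a minimal sketch that reproduces the situation in a scratch directory. The layout is made up for the demo: $root stands in for the container's filesystem and $root/host for the host mount, so nothing here touches the real paths.

```shell
#!/bin/sh
# Demo only: all paths below live in a throwaway directory.
root="$(mktemp -d)"
wants="$root/host/etc/systemd/system/network-online.target.wants"
mkdir -p "$wants"

# The unit file exists on the host, i.e. under the /host mount.
touch "$root/host/etc/systemd/system/ovs-configuration.service"

# The symlink stores the path as the host sees it (no /host prefix), so
# from the container's point of view the target does not exist.
ln -s "$root/etc/systemd/system/ovs-configuration.service" \
  "$wants/ovs-configuration.service"

link="$wants/ovs-configuration.service"

# Run one test operator against the link and report yes or no.
probe() { if test "$1" "$link"; then echo yes; else echo no; fi; }

result_e="$(probe -e)"   # follows the link
result_f="$(probe -f)"   # follows the link
result_L="$(probe -L)"   # inspects the link itself
echo "-e: $result_e  -f: $result_f  -L: $result_L"

rm -rf "$root"
```

Run as written, -e and -f both report no because they follow the link to a path that is missing in the container's view, while -L reports yes because it only checks that the link itself is present.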
# Check to see if ovs is provided by the node:
- if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then
+ if [[ -L '/host/etc/systemd/system/network-online.target.wants/ovs-configuration.service' ]]; then
As a fail-safe, is it possible with dbus to try to start ovs-vswitchd regardless? Presumably org.freedesktop.systemd1.Unit.Start is a no-op if ovs-vswitchd is already started.
We tried that originally and the short answer is "it's more complicated than we thought, error-prone, and not possible to do yet". Mainly due to containers having different UID, PID space, and stuff like that.
The dbus API seems to work inside the ovs container: it starts a stopped ovs-vswitchd and doesn't affect a started ovs-vswitchd.
gdbus call --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1/unit/ovs_2dvswitchd_2eservice --method org.freedesktop.systemd1.Unit.Start "replace"
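The object path in that call comes from systemd escaping the unit name: each character that is not an ASCII letter or digit becomes an underscore followed by its two-digit hex code, so - becomes _2d and . becomes _2e. A rough sketch of that encoding follows. The helper name unit_to_dbus_path is hypothetical and not part of this PR, and the sketch skips the special cases for an empty name or a leading digit.

```shell
#!/bin/sh
# Hypothetical helper: turn a systemd unit name into its D-Bus object path.
# The body runs in a subshell, so its variables do not leak to the caller.
unit_to_dbus_path() (
  LC_ALL=C                      # keep the [a-zA-Z0-9] range ASCII-only
  rest="$1"
  out=""
  while [ -n "$rest" ]; do
    ch="${rest%"${rest#?}"}"    # first character of $rest
    rest="${rest#?}"            # drop that character
    case "$ch" in
      [a-zA-Z0-9]) out="${out}${ch}" ;;
      # printf with a leading quote yields the character's numeric code
      *) out="${out}$(printf '_%02x' "'$ch")" ;;
    esac
  done
  printf '/org/freedesktop/systemd1/unit/%s\n' "$out"
)

unit_to_dbus_path ovs-vswitchd.service
```

For ovs-vswitchd.service this prints /org/freedesktop/systemd1/unit/ovs_2dvswitchd_2eservice, the same path used in the gdbus call above.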
We don't want to start ovs here; we just want to detect whether it is running in systemd. Through the upgrade process we need container ovs to still work.
Ah, I meant after this line: once we detect ovs-configuration.service, we could add a failsafe ovs-vswitchd start.
If ovs-vswitchd is already running then systemd won't do anything, but otherwise we have a failsafe.
We seem to already have a failsafe start of ovsdb-server on line 63.
--- a/bindata/network/openshift-sdn/sdn-ovs.yaml
+++ b/bindata/network/openshift-sdn/sdn-ovs.yaml
@@ -38,12 +38,12 @@ spec:
# systemctl cannot be used in a separate PID namespace to reach
# the systemd running in PID 1. Therefore we need to use the dbus API
- systemctl_restart(){
+ systemctl_dbus(){
gdbus call \
--system \
--dest org.freedesktop.systemd1 \
- --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${1})"_2eservice \
- --method org.freedesktop.systemd1.Unit.Restart "replace"
+ --object-path /org/freedesktop/systemd1/unit/"$(svc_encode_name ${2})"_2eservice \
+ --method org.freedesktop.systemd1.Unit."${1}" "replace"
}
svc_encode_name(){
# systemd encodes some characters, so far we only need to encode
@@ -53,13 +53,14 @@ spec:
# Check to see if ovs is provided by the node:
if [[ -f '/host/usr/local/bin/configure-ovs.sh' ]]; then
echo "openvswitch is running in systemd"
+ systemctl_dbus Start ovs-vswitchd
# In some very strange corner cases, the owner for /run/openvswitch
# can be wrong, so we need to clean up and restart.
ovs_uid=$(chroot /host id -u openvswitch)
ovs_gid=$(chroot /host id -g openvswitch)
chown -R "${ovs_uid}:${ovs_gid}" /run/openvswitch
if [[ ! -S /run/openvswitch/db.sock ]]; then
- systemctl_restart ovsdb-server
+ systemctl_dbus Restart ovsdb-server
fi
# Don't need to worry about restoring flows; this can only change if we've rebooted
exec tail -F /host/var/log/openvswitch/ovs-vswitchd.log /host/var/log/openvswitch/ovsdb-server.log
/retest Please review the full test history for this PR and help us cut down flakes. |
Most recent test failures are either Azure resource problems bringing up VMs, or a flake (apiserver terminating gracefully) on the SDN multi job; the Windows job failure is the current AWS outage.
/retest Please review the full test history for this PR and help us cut down flakes. |
@dcbw @knobunc e2e upgrade is failing due to:
I can see in all the ovs logs that ovs is running in systemd. Not sure if we want to override this or wait a few more iterations to see if we get a pass.
/retest Please review the full test history for this PR and help us cut down flakes. |
@trozet: The following test failed, say
/retest Please review the full test history for this PR and help us cut down flakes. |
@trozet: All pull requests linked via external trackers have merged: Bugzilla bug 1884101 has been moved to the MODIFIED state. In response to this: