Bug 1886127: [4.5] sdn-ovs: fix liveness probe for downgrade case #837
@squeed: This pull request references Bugzilla bug 1886127, which is invalid:
@knobunc because the bug hierarchy doesn't match what the bot wants, this will probably need manual valid-bug tagging.
if /usr/bin/ovs-vsctl -t 5 br-exists br0; then /usr/bin/ovs-ofctl -t 5 -O OpenFlow13 probe br0; else true; fi
PID=$(cat /var/run/openvswitch/ovs-vswitchd.pid)
/usr/bin/ovs-appctl -t "/var/run/openvswitch/ovs-vswitchd.${PID}.ctl" -T 5 ofproto/list > /dev/null &&
/usr/bin/ovs-vsctl -t 5 show > /dev/null
we don't care for the br0 to exist anymore? I thought that was needed for liveness to make sure new pods can come up and flows be configured for them?
that doesn't affect openvswitch readiness though. It's not a useful check, really.
/retest
# 2. that ovsdb is responding to queries
#
# Need the manual target file because ovs-appctl doesn't like
# being in a different namespace from ovs-vswitchd.
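To illustrate the comment above: a minimal sketch of deriving the control-socket path from the PID file, so that `ovs-appctl` can be pointed at it explicitly with `-t`. The helper name `vswitchd_ctl_path` and its run-directory argument are hypothetical, added only so the logic can be exercised without a running OVS; the path layout is the one used in the probe in this PR.

```shell
#!/bin/sh
# Hypothetical helper: print the ovs-vswitchd control socket path.
# $1 is the OVS run directory (defaults to the path used in this PR).
vswitchd_ctl_path() {
    rundir="${1:-/var/run/openvswitch}"
    pidfile="${rundir}/ovs-vswitchd.pid"

    # No readable PID file means we cannot name the socket: fail.
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile")

    # Reject an empty or non-numeric PID instead of building a bogus path.
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac

    printf '%s/ovs-vswitchd.%s.ctl\n' "$rundir" "$pid"
}
```

A probe could then run `ctl=$(vswitchd_ctl_path) && /usr/bin/ovs-appctl -t "$ctl" -T 5 ofproto/list > /dev/null`, which fails cleanly when the PID file is missing rather than passing an empty PID into the socket path.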
If OVS is running in systemd we shouldn't be liveness-probing it anyway... I'd just do
if [ -f /host/var/run/ovs-config-executed ]; then
  exit 0
fi
for both probes
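A slightly fuller sketch of this suggestion, assuming the marker file path quoted above. The function name `ovs_managed_by_host` and its optional path argument are hypothetical and exist only so the check can be tried outside a real node.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the marker file exists, which is taken
# here to mean OVS is run by the host (systemd) and not by this container.
# $1 optionally overrides the marker path quoted in the comment above.
ovs_managed_by_host() {
    [ -f "${1:-/host/var/run/ovs-config-executed}" ]
}

# In a probe script the check would short-circuit the container-side tests:
#   if ovs_managed_by_host; then
#       exit 0
#   fi
```

Note that with this guard in place the probe reports success whenever the marker exists, whether or not OVS is actually healthy on the host.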
Right, I thought about that, but I want to halt rollouts if OVS isn't running, regardless of who should be running it.
But OVS should always be running (systemd has the unit set to "Restart=always") if it's running under systemd. I think that is what @danwinship was referring to.
And, as we've found, there are lots of cases where we got this wrong. The readiness probe protects us from taking down the cluster if we miss yet another corner case.
If the readiness probe gives a false negative, then the downgrade hangs and we get a phone call. If the probe gives a false positive, then we have instantly rolled out a cluster-killer.
I really think we should keep the readiness probe.
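The asymmetry described above argues for a readiness probe that fails closed: any check that errors, or cannot run at all, counts as not ready. A minimal sketch of that idea, using a hypothetical `all_checks_pass` helper that takes each check as a single command string (this is not the probe from the PR):

```shell
#!/bin/sh
# Hypothetical helper: run each argument as a shell command and return
# non-zero ("not ready") as soon as any one of them fails.
all_checks_pass() {
    for check in "$@"; do
        # A missing binary or a non-zero exit both count as failure.
        sh -c "$check" > /dev/null 2>&1 || return 1
    done
    return 0
}
```

For example, `all_checks_pass '/usr/bin/ovs-vsctl -t 5 show'` reports not ready both when ovsdb does not answer and when `ovs-vsctl` is absent, so a missed corner case stalls the rollout instead of passing silently.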
We should certainly remove the liveness probe anyway
(if not the readiness probe)
ovs-appctl doesn't like running in a different pid namespace, so the liveness probe fails when downgrading from 4.6 (which runs it in systemd).
OK, updated.
@squeed: The following tests failed, say
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc, squeed The full list of commands accepted by this bot can be found here. The pull request process is described here
/override ci/prow/e2e-metal-ipi-ovn-ipv6
@knobunc: Overrode contexts on behalf of knobunc: ci/prow/e2e-metal-ipi-ovn-ipv6 In response to this:
/retest
/test e2e-aws-sdn-multi
/retest
@squeed: The following test failed, say
/retest Please review the full test history for this PR and help us cut down flakes.
@squeed: All pull requests linked via external trackers have merged: Bugzilla bug 1886127 has been moved to the MODIFIED state. In response to this: