Bug 1958390: improve SDN's OVS healthcheck and logging #306
Conversation
@alexanderConstantinescu: This pull request references Bugzilla bug 1958390, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-aws
1 similar comment
/test e2e-aws
Force-pushed from 6251e67 to c5cfcbf
Damn, this is not enough... I just experienced a 4.5 to 4.6 upgrade that failed, using #308 + cluster-bot. SDN started, connected to OVS N-1, and once OVS upgraded it saw the following errors:
OVS started upgrading at:
I am lowering the
Force-pushed from c5cfcbf to 6a3c133
An orthogonal problem to the question of "what should the timeout be?" is: the OVS restore script seems to be using the wrong OpenFlow version when writing the flows... From the faulty node:
The output of the
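For context, a minimal hedged sketch in Go of what pinning the OpenFlow version during a flow restore could look like (hypothetical helper, not the restore script referenced above):

// Illustrative only: "-O OpenFlow13" forces ovs-ofctl to speak OpenFlow 1.3
// instead of whatever version it would otherwise negotiate, which matters if
// the flows were dumped with a different version than the default.
package main

import (
	"fmt"
	"os/exec"
)

func restoreFlows(bridge, flowFile string) error {
	out, err := exec.Command("ovs-ofctl", "-O", "OpenFlow13", "replace-flows", bridge, flowFile).CombinedOutput()
	if err != nil {
		return fmt.Errorf("restoring flows on %s failed: %v, output:\n%s", bridge, err, out)
	}
	return nil
}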
I've verified the behavior now using #308 and openshift/cluster-network-operator#1111. The one upgrade I did from 4.5 to 4.6 worked fine and I did not see any of the aforementioned issues. I will continue testing upgrades and try to hit the problem.
pkg/network/node/ovs/ovs.go (Outdated)
@@ -162,7 +162,7 @@ func (ovsif *ovsExec) execWithStdin(cmd string, stdinArgs []string, args ...string
 	output, err := kcmd.CombinedOutput()
 	if err != nil {
-		klog.V(2).Infof("Error executing %s: %s", cmd, string(output))
+		klog.Errorf("Error executing cmd: %s with args: %v, output: %s", cmd, args, string(output))
Hm... We don't want to call klog.Errorf if there's any chance that the caller expected to get an error... but I think maybe there aren't any such cases in ovs.go?

Do output:\n%s, since output itself may have newlines in it, so it's probably clearer to start it on a new line?

Also, the ovs logging changes should be a separate commit (not necessarily a separate PR) from the healthcheck changes.
> but I think maybe there aren't any such cases in ovs.go?

Yes, certain places accept an error, but never expect an error.

> Do output:\n%s, since output itself may have newlines in it, so it's probably clearer to start it on a new line?

OK

> Also, the ovs logging changes should be a separate commit (not necessarily a separate PR) from the healthcheck changes.

OK
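A minimal sketch of the resulting logging pattern after those two suggestions (hypothetical stand-alone function; the real change is inside ovs.go's execWithStdin):

// Illustrative only: log the command and its args so a failed exec is easy to
// pin down, and start the (possibly multi-line) output on a new line.
package main

import (
	"os/exec"

	"k8s.io/klog/v2"
)

func execWithLogging(cmd string, args ...string) ([]byte, error) {
	output, err := exec.Command(cmd, args...).CombinedOutput()
	if err != nil {
		klog.Errorf("Error executing %s with args %v, output:\n%s", cmd, args, string(output))
	}
	return output, err
}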
So, in the past there were lots of legitimate situations where OVS would restart while SDN was running; in some of those cases SDN was able to just keep going as though nothing had happened, but in other cases we needed to give up and start over. But in 4.6 and later, if OVS restarts while SDN is running, that is always an exceptional situation. So we should just kill almost the entirety of
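To make the "always an exceptional situation" point concrete, here is a hedged sketch (hypothetical function, not the sdn code) of treating any later loss of the OVS connection as fatal instead of trying to recover:

// Illustrative only: once OVS has been confirmed healthy at startup, any
// subsequent failed probe is treated as fatal, so the process exits and is
// restarted cleanly instead of limping along against a restarted OVS.
package main

import (
	"os/exec"
	"time"

	"k8s.io/klog/v2"
)

func monitorOVS(interval time.Duration) {
	for range time.Tick(interval) {
		if err := exec.Command("ovs-vsctl", "--timeout=2", "show").Run(); err != nil {
			klog.Fatalf("OVS is no longer reachable (%v); exiting rather than attempting to recover", err)
		}
	}
}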
Force-pushed from 6a3c133 to 7bc45af
Force-pushed from 7bc45af to a76e1f6
/bugzilla refresh
@alexanderConstantinescu: This pull request references Bugzilla bug 1958390, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(sorry, I wrote this review the other day but apparently forgot to click "Submit")
Force-pushed from a76e1f6 to 2392532
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Force-pushed from 2392532 to 1e02906
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alexanderConstantinescu, danwinship. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required Please review the full test history for this PR and help us cut down flakes.
/retest
/retest-required Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/retest-required Please review the full test history for this PR and help us cut down flakes.
@alexanderConstantinescu: All pull requests linked via external trackers have merged: Bugzilla bug 1958390 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.8
@alexanderConstantinescu: #306 failed to apply on top of branch "release-4.8":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This commit addresses three main issues:
- Improves the logging in ovs.go by adding the args to the log and indicating clearly which command is being executed. In case certain exec's fail, it will greatly help narrow down what command failed to execute.
- [...] commands, fail much more quickly.
- [...] after we are sure that OVS is running.

The main reason for doing this is solving the < OCP 4.6 problem, which was: SDN talking to an N-1 OVS and thinking everything is fine on startup. If we aren't strict when checking connection errors to OVS, we might miss the upgrade of OVS from N-1 to N and continue thinking everything is fine, while in fact OVS has been upgraded and had its DB wiped.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
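A rough sketch of the "fail much more quickly" and "after we are sure that OVS is running" ideas (assumed helper name and retry values, not the exact code in this PR):

// Illustrative only: probe OVS with a short per-attempt timeout so a dead
// ovsdb-server makes each check fail fast, and only proceed with startup
// once OVS has actually answered.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func waitForOVS(retries int, interval time.Duration) error {
	for i := 0; i < retries; i++ {
		// "--timeout=2" bounds each probe to two seconds instead of hanging.
		if exec.Command("ovs-vsctl", "--timeout=2", "show").Run() == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("OVS did not become ready after %d attempts", retries)
}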
I've gone back and had a look at a similar bug we tried to solve last year in PR #222. Looking at the logs from the critical moment that problem started occurring, we can see behavior similar to the BZ which this PR tries to solve.
In that BZ we had the following start-up logs captured by the SDN pod:
This clearly shows the SDN connecting to an OVS N - 1 version and thinking everything is fine, then experiencing a couple of ovs-ofctl errors which were very likely percolated down from sdn/pkg/network/node/sdn_controller.go, line 150 in 3ac5486 ([...] ovs-ofctl errors).

In this BZ we have very similar logs shown:
In this case we again start talking to OVS N - 1, we proceed to attach all the running pods, and we experience a moment of error when setting up openshift-multus/network-metrics-daemon-57f4w. But again, since we are too lenient w.r.t. OVS and the errors it produces, we finally end up successfully adding that pod, but miss that OVS has rolled out from under us and wiped its DB in the process.

/assign @danwinship @tssurya
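One hedged way to make "OVS rolled out from under us and wiped its DB" detectable (hypothetical check, not what this PR implements): verify that the bridge SDN set up still exists before treating an error during pod setup as transient.

// Illustrative only: "ovs-vsctl br-exists" exits non-zero when the bridge is
// missing, so a vanished br0 is a strong signal that the OVS database was
// wiped (for example by an upgrade) and SDN must re-initialize.
package main

import "os/exec"

func bridgeStillExists(bridge string) bool {
	return exec.Command("ovs-vsctl", "--timeout=2", "br-exists", bridge).Run() == nil
}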