Bug 1958390: improve SDN's OVS healthcheck and logging #306
Conversation
@alexanderConstantinescu: This pull request references Bugzilla bug 1958390, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-aws
1 similar comment
/test e2e-aws
Force-pushed from 6251e67 to c5cfcbf
Damn, this is not enough... I just experienced a 4.5 to 4.6 upgrade that failed, using #308 + cluster-bot. SDN started, connected to OVS N-1, and once OVS upgraded it saw the following errors:
OVS started upgrading at:
I am lowering the
Force-pushed from c5cfcbf to 6a3c133
An orthogonal problem to the question of "what should the timeout be?" is: the OVS restore script seems to be using the wrong OpenFlow version when writing the flows... From the faulty node:
The output of the
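For context, a minimal hedged sketch in Go of what pinning the OpenFlow version during a flow restore could look like (hypothetical helper, not the restore script referenced above):

// Illustrative only: "-O OpenFlow13" forces ovs-ofctl to speak OpenFlow 1.3
// instead of whatever version it would otherwise negotiate, which matters if
// the flows were dumped with a different version than the default.
package main

import (
	"fmt"
	"os/exec"
)

func restoreFlows(bridge, flowFile string) error {
	out, err := exec.Command("ovs-ofctl", "-O", "OpenFlow13", "replace-flows", bridge, flowFile).CombinedOutput()
	if err != nil {
		return fmt.Errorf("restoring flows on %s failed: %v, output:\n%s", bridge, err, out)
	}
	return nil
}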
I've verified the behavior now using #308 and openshift/cluster-network-operator#1111. The one upgrade I did from 4.5 to 4.6 worked fine and I did not see any of the aforementioned issues. I will continue testing upgrades and try to hit the problem.
pkg/network/node/ovs/ovs.go (Outdated)
@@ -162,7 +162,7 @@ func (ovsif *ovsExec) execWithStdin(cmd string, stdinArgs []string, args ...string
 	output, err := kcmd.CombinedOutput()
 	if err != nil {
-		klog.V(2).Infof("Error executing %s: %s", cmd, string(output))
+		klog.Errorf("Error executing cmd: %s with args: %v, output: %s", cmd, args, string(output))
Hm... We don't want to call klog.Errorf if there's any chance that the caller expected to get an error... but I think maybe there aren't any such cases in ovs.go?

Do output:\n%s, since output itself may have newlines in it, so it's probably clearer to start it on a new line?

Also, the ovs logging changes should be a separate commit (not necessarily a separate PR) from the healthcheck changes.
> but I think maybe there aren't any such cases in ovs.go?

Yes, certain places accept an error, but never expect an error.

> Do output:\n%s, since output itself may have newlines in it, so it's probably clearer to start it on a new line?

OK

> Also, the ovs logging changes should be a separate commit (not necessarily a separate PR) from the healthcheck changes.

OK
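A minimal sketch of the resulting logging pattern after those two suggestions (hypothetical stand-alone function; the real change is inside ovs.go's execWithStdin):

// Illustrative only: log the command and its args so a failed exec is easy to
// pin down, and start the (possibly multi-line) output on a new line.
package main

import (
	"os/exec"

	"k8s.io/klog/v2"
)

func execWithLogging(cmd string, args ...string) ([]byte, error) {
	output, err := exec.Command(cmd, args...).CombinedOutput()
	if err != nil {
		klog.Errorf("Error executing %s with args %v, output:\n%s", cmd, args, string(output))
	}
	return output, err
}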
So, in the past there were lots of legitimate situations where OVS would restart while SDN was running; in some of those cases SDN was able to just keep going as though nothing had happened, but in other cases we needed to give up and start over. But in 4.6 and later, if OVS restarts while SDN is running, that is always an exceptional situation. So we should just kill almost the entirety of
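To make the "always an exceptional situation" point concrete, here is a hedged sketch (hypothetical function, not the sdn code) of treating any later loss of the OVS connection as fatal instead of trying to recover:

// Illustrative only: once OVS has been confirmed healthy at startup, any
// subsequent failed probe is treated as fatal, so the process exits and is
// restarted cleanly instead of limping along against a restarted OVS.
package main

import (
	"os/exec"
	"time"

	"k8s.io/klog/v2"
)

func monitorOVS(interval time.Duration) {
	for range time.Tick(interval) {
		if err := exec.Command("ovs-vsctl", "--timeout=2", "show").Run(); err != nil {
			klog.Fatalf("OVS is no longer reachable (%v); exiting rather than attempting to recover", err)
		}
	}
}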
Force-pushed from 6a3c133 to 7bc45af
Force-pushed from 7bc45af to a76e1f6
/bugzilla refresh
@alexanderConstantinescu: This pull request references Bugzilla bug 1958390, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(sorry, I wrote this review the other day but apparently forgot to click "Submit")
Force-pushed from a76e1f6 to 2392532
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
Force-pushed from 2392532 to 1e02906
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alexanderConstantinescu, danwinship. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required Please review the full test history for this PR and help us cut down flakes.
/retest
/retest-required Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/retest-required Please review the full test history for this PR and help us cut down flakes.
@alexanderConstantinescu: All pull requests linked via external trackers have merged: Bugzilla bug 1958390 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherry-pick release-4.8
@alexanderConstantinescu: #306 failed to apply on top of branch "release-4.8":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This commit addresses three main issues:
- Improves the logging in ovs.go by adding the args to the log and indicating clearly which command is being executed. In case certain exec's fail, it will greatly help narrow down what command failed to execute.
- [...] commands, fail much more quickly.
- [...] after we are sure that OVS is running.

The main reason for doing this is solving the < OCP 4.6 problem, which was: SDN talking to an N-1 OVS and thinking everything is fine on startup. If we aren't strict when checking connection errors to OVS, we might miss the upgrade of OVS from N-1 to N and continue thinking everything is fine, while in fact OVS has been upgraded and had its DB wiped.

Signed-off-by: Alexander Constantinescu <aconstan@redhat.com>
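A rough sketch of the "fail much more quickly" and "after we are sure that OVS is running" ideas (assumed helper name and retry values, not the exact code in this PR):

// Illustrative only: probe OVS with a short per-attempt timeout so a dead
// ovsdb-server makes each check fail fast, and only proceed with startup
// once OVS has actually answered.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func waitForOVS(retries int, interval time.Duration) error {
	for i := 0; i < retries; i++ {
		// "--timeout=2" bounds each probe to two seconds instead of hanging.
		if exec.Command("ovs-vsctl", "--timeout=2", "show").Run() == nil {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("OVS did not become ready after %d attempts", retries)
}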
I've gone back and had a look at a similar bug we tried to solve last year in PR #222. Looking at the logs from the critical moment that problem started occurring, we can see behavior similar to the BZ which this PR tries to solve.
In that BZ we had the following start-up logs captured by the SDN pod:
This clearly shows the SDN connecting to an OVS N - 1 version and thinking everything is fine, then experiencing a couple of ovs-ofctl errors which were very likely percolated down from sdn/pkg/network/node/sdn_controller.go, line 150 in 3ac5486 ([...] ovs-ofctl errors).

In this BZ we have very similar logs shown:
In this case we again start talking to OVS N - 1, we proceed to attach all the running pods, and we experience a moment of error when setting up openshift-multus/network-metrics-daemon-57f4w. But again, since we are too lenient w.r.t. OVS and the errors it produces, we finally end up successfully adding that pod, but miss that OVS has rolled out from under us and wiped its DB in the process.

/assign @danwinship @tssurya
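One hedged way to make "OVS rolled out from under us and wiped its DB" detectable (hypothetical check, not what this PR implements): verify that the bridge SDN set up still exists before treating an error during pod setup as transient.

// Illustrative only: "ovs-vsctl br-exists" exits non-zero when the bridge is
// missing, so a vanished br0 is a strong signal that the OVS database was
// wiped (for example by an upgrade) and SDN must re-initialize.
package main

import "os/exec"

func bridgeStillExists(bridge string) bool {
	return exec.Command("ovs-vsctl", "--timeout=2", "br-exists", bridge).Run() == nil
}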