Bug 1840385: pkg/daemon: bubble up pivot errors #1868
@runcom: This pull request references Bugzilla bug 1840385, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Pending a few small tests I want to run.
/hold
/retest
d30e08c to d03c727
/retest
Hmm... I think once we have #1766 this would be heading towards obsolete, because we can directly change the code here to clearly pass errors between our own binaries, right?
Uhm, we're still running a unit in there (https://github.com/openshift/machine-config-operator/pull/1766/files#diff-95e83e4216073d5ba6d128c764d05756R325), so this might still be beneficial: all this does is report the error on disk, to be read by the MCD and then bubbled up the stack. Am I missing something? 🤔
And uhm, of course this can't work if we don't rebuild the MCD binary with the new pivot code 🤦‍♂️
I think |
|
That Fatal in there isn't even being triggered as the service does start but it later fails and we don't catch that with |
Also, let's make this blocked on #1766 (thanks Colin for pointing that out!), as that helps us with the MCD binary rebuild that we need now; with that PR, we won't need it :)
OK, so, let's take a step back. Today, if pivot fails with something like this:
The MCO and MCP won't know about it; what we get is just an unhelpful:
And thus, to debug, we have to ask for a must-gather and go look at each MCD to check why (the reason above is manifest unknown; the BZ reason was Quay not reachable). So this PR fixes that by reporting the actual pivot error to the MCD; the MCD reports it to the MCP, which reports it to the MCO cluster object.
I think yes then (https://github.com/openshift/machine-config-operator/pull/1766/files#diff-06961b075f1753956d802ba954d2cfb5R1294), even if we support the old pivot in the if/else branch, right? However, #1766 isn't bringing fixes to the actual thing that reports the error up, so this is still needed (https://github.com/openshift/machine-config-operator/pull/1868/files#diff-95e83e4216073d5ba6d128c764d05756R177-R180): we can't just "Fatal" and abort the MCD (https://github.com/openshift/machine-config-operator/pull/1868/files#diff-7df5ea6328c58701b1e764dd5ec480daL48).
Alright, I think we're on the same page then. Yes, #1766 is definitely needed before this, and once that lands we still need some bits from this to make it right (again, we can't Fatal in a server process like the MCD; we should error out nicely and report).
Sorry, I thought about this some more and you're right. We do want something like this in the MCD case, although...it might be simpler to take all lines from the journal output that start with |
We need to differentiate between errors and logs :/ I'm OK reworking this for that though, no biggie. Also, in the cluster case (https://github.com/openshift/machine-config-operator/pull/1766/files#diff-06961b075f1753956d802ba954d2cfb5R1286) we still run the service, so we still need something that bubbles things up. But again, I think we're on the same page now; I'll hold my breath until #1766 lands and then rework this the way we prefer :)
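A rough sketch of the "filter the journal output by prefix" idea, which also shows why errors and plain logs need to be told apart. The prefix actually suggested in the comment above did not survive in this thread, so `errorMarker` below is an invented placeholder, and `extractErrors` is a hypothetical helper rather than anything from the MCO code base.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// errorMarker is an invented placeholder; the real prefix is not known here.
const errorMarker = "pivot-error: "

// extractErrors keeps only the lines that start with marker and strips the
// marker, so ordinary log lines are not mistaken for failures.
// Scanner errors are ignored for brevity in this sketch.
func extractErrors(journal, marker string) []string {
	var errs []string
	scanner := bufio.NewScanner(strings.NewReader(journal))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, marker) {
			errs = append(errs, strings.TrimPrefix(line, marker))
		}
	}
	return errs
}

func main() {
	// Made-up journal text: two ordinary log lines and one marked error.
	journal := "pulling image\npivot-error: manifest unknown\ncleaning up\n"
	fmt.Println(strings.Join(extractErrors(journal, errorMarker), "; "))
}
```

With this approach the unit has to tag its failures consistently; anything it logs without the marker is treated as ordinary output.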
/retest |
@runcom: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1868. Bugzilla bug 1840385 has been moved to the MODIFIED state.
Taking a look at BZ 1840385, we can see that we had tons of useful logs in the MCD:
The sad reality is that what we were bubbling up was just:
This patch reports everything we get when pivot fails, so that we no longer need to ask for a must-gather to debug those situations; the error is visible at the MCP/MCO level.
@cgwalters ptal
The test I'm doing is:
oc edit cm/machine-config-osimageurl -n openshift-machine-config-operator
sha256:5eedf858762c5b42b4fbdef68a29ca2e47d81248c75874b14cb313de19f6b925 (also, any error from pivot is valid to test this PR)
oc describe mcp/<pool_name> (that gets later bubbled up to the MCO cluster object if it's master, especially on install like the BZ error)
Signed-off-by: Antonio Murdaca <runcom@linux.com>