
Bug 1840385: pkg/daemon: bubble up pivot errors #1868

Merged

Conversation

@runcom (Member) commented Jun 24, 2020

Taking a look at BZ 1840385, we can see that we had plenty of useful logs in the MCD:

```
2020-05-19T13:06:46.445346629Z I0519 13:06:46.085072 2996778 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2
2020-05-19T13:06:46.445346629Z Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2: error pinging docker registry quay.io: Get https://quay.io/v2/: EOF
2020-05-19T13:06:46.445346629Z W0519 13:06:46.222937 2996778 run.go:40] podman failed: exit status 125; retrying...
```

The sad reality is that what we were bubbling up was just:

```
failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1
```

This patch reports everything we get when pivot fails, so that we won't need to ask for a must-gather to debug those situations; instead, the error is visible at the MCP/MCO level.
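A minimal sketch of the general idea, with illustrative names (this is not the MCO's actual code): capture the failing command's combined output and wrap it into the returned error, instead of surfacing only the bare exit status.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runCapture runs a command and, on failure, returns an error carrying
// the command's combined stdout/stderr instead of just the exit status,
// so callers higher in the stack can surface the real cause.
func runCapture(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s failed: %v; output: %s", name, err, out)
	}
	return nil
}

func main() {
	// "false" exits non-zero; a real pivot failure would carry the
	// podman error text in the wrapped output.
	if err := runCapture("false"); err != nil {
		fmt.Println(err)
	}
}
```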

@cgwalters ptal

The test I'm doing is:

  • spin up a cluster with this PR
  • modify osImageURL manually to trigger an update (`oc edit cm/machine-config-osimageurl -n openshift-machine-config-operator`)
  • change just the sha256 part of the current osImageURL to something that triggers a "manifest unknown" error (e.g. sha256:5eedf858762c5b42b4fbdef68a29ca2e47d81248c75874b14cb313de19f6b925); any error from pivot is valid to test this PR
  • check that errors are bubbled up to the MCP with the usual `oc describe mcp/<pool_name>` (the error later gets bubbled up to the MCO cluster object if the pool is master, especially on install, as in the BZ)

Signed-off-by: Antonio Murdaca <runcom@linux.com>

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jun 24, 2020
@openshift-ci-robot (Contributor)

@runcom: This pull request references Bugzilla bug 1840385, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1840385: pkg/daemon: bubble up pivot errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jun 24, 2020
@runcom (Member, Author) commented Jun 24, 2020

pending a bunch of small tests I want to do

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2020
@openshift-ci-robot (Contributor)

@runcom: This pull request references Bugzilla bug 1840385, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1840385: pkg/daemon: bubble up pivot errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom (Member, Author) commented Jun 24, 2020

/retest


@runcom (Member, Author) commented Jun 24, 2020

/retest

@cgwalters (Member)

Hmm...I think once we have #1766 this would be heading towards obsolete because we can directly change the code here to clearly pass errors between our own binaries, right?

@runcom (Member, Author) commented Jun 24, 2020

> Hmm...I think once we have #1766 this would be heading towards obsolete because we can directly change the code here to clearly pass errors between our own binaries, right?

uhm, we're still running a unit in there (https://github.com/openshift/machine-config-operator/pull/1766/files#diff-95e83e4216073d5ba6d128c764d05756R325), so this might still be beneficial: all this does is report the error on disk to be read by the MCD and then bubble it up the stack. Am I missing something? 🤔


@runcom (Member, Author) commented Jun 24, 2020

> Hmm...I think once we have #1766 this would be heading towards obsolete because we can directly change the code here to clearly pass errors between our own binaries, right?

and uhm, ofc this can't work if we don't rebuild the MCD binary with the new pivot code 🤦‍♂️

@cgwalters (Member)

I think followPivotJournalLogs() is supposed to be solving this problem today but isn't. I bet it's that we're racing to read the results from the journal with systemd reporting the failure back to us.

@runcom (Member, Author) commented Jun 24, 2020

> I think followPivotJournalLogs() is supposed to be solving this problem today but isn't. I bet it's that we're racing to read the results from the journal with systemd reporting the failure back to us.

followPivotJournalLogs() is meant to stream the logs to the MCD logs (including whatever isn't an error as well). What this PR does is ensure that whatever pivot fails with (like in the attached BZ) is reported from the MCO, and that isn't done by followPivotJournalLogs, right?

@runcom (Member, Author) commented Jun 24, 2020

> followPivotJournalLogs() is meant to stream the logs to the MCD logs (including whatever isn't an error as well). What this PR does is ensure that whatever pivot fails with (like in the attached BZ) is reported from the MCO, and that isn't done by followPivotJournalLogs, right?

```go
func followPivotJournalLogs(stopCh <-chan time.Time) {
	cmd := exec.Command("journalctl", "-f", "-b", "-o", "cat",
		"-u", "rpm-ostreed",
		"-u", "pivot",
		"-u", "machine-config-daemon-host")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		glog.Fatal(err)
		MCDPivotErr.WithLabelValues("", err.Error()).SetToCurrentTime()
	}
```

That Fatal in there isn't even being triggered, as the service does start; it fails later, and we don't catch that with followPivotJournalLogs.

@runcom (Member, Author) commented Jun 24, 2020

Also, let's make this blocked on #1766 (thanks Colin for pointing that out!), as that helps with the MCD binary rebuild that we need now; with that PR, we won't need it :)

@cgwalters (Member)

Wait, you're talking about outside of the MCD? I think what is involved here is related to what I was fixing in #1802: there's a double indirection going on in current master in the systemd unit.

I rolled that PR into #1766. To say this another way, I'm 87.9% sure that PR will fix this.

@runcom (Member, Author) commented Jun 24, 2020

> Wait, you're talking about outside of the MCD? I think what is involved here is related to what I was fixing in #1802: there's a double indirection going on in current master in the systemd unit.
>
> I rolled that PR into #1766. To say this another way, I'm 87.9% sure that PR will fix this.

ok, so, let's take a step back. Today, if pivot fails with something like this:

```
I0624 14:14:26.714492   60233 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json registry.svc.ci.openshift.org/ci-ln-n5zx9bk/stable@sha256:3b143cb20cf3cc0fbd287d785863c4d4d17498bde34cf8ffdb76d507b895afbd
Error: error pulling image "registry.svc.ci.openshift.org/ci-ln-n5zx9bk/stable@sha256:3b143cb20cf3cc0fbd287d785863c4d4d17498bde34cf8ffdb76d507b895afbd": unable to pull registry.svc.ci.openshift.org/ci-ln-n5zx9bk/stable@sha256:3b143cb20cf3cc0fbd287d785863c4d4d17498bde34cf8ffdb76d507b895afbd: unable to pull image: Error initializing source docker://registry.svc.ci.openshift.org/ci-ln-n5zx9bk/stable@sha256:3b143cb20cf3cc0fbd287d785863c4d4d17498bde34cf8ffdb76d507b895afbd: Error reading manifest sha256:3b143cb20cf3cc0fbd287d785863c4d4d17498bde34cf8ffdb76d507b895afbd in registry.svc.ci.openshift.org/ci-ln-n5zx9bk/stable: manifest unknown: manifest unknown
W0624 14:14:27.736186   60233 run.go:40] podman failed: exit status 125; retrying...
```

The MCO and MCP won't know about it; what we get is just an unhelpful:

```
failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1
```

and thus, to debug, we have to ask for a must-gather and go look at each MCD to check why (the reason above is "manifest unknown"; the BZ reason was Quay not being reachable).

So this PR fixes that by reporting the actual pivot error to the MCD; the MCD reports it to the MCP, which reports it to the MCO cluster object.
Is your PR addressing the case above?

@runcom (Member, Author) commented Jun 24, 2020

> Is your PR addressing the case above?

I think yes, then: https://github.com/openshift/machine-config-operator/pull/1766/files#diff-06961b075f1753956d802ba954d2cfb5R1294

(even if we still support the old pivot in the if/else branch, right?)

Although #1766 isn't bringing fixes to the actual thing that reports the error up, so this is still needed: https://github.com/openshift/machine-config-operator/pull/1868/files#diff-95e83e4216073d5ba6d128c764d05756R177-R180. We can't just "Fatal" and abort the MCD: https://github.com/openshift/machine-config-operator/pull/1868/files#diff-7df5ea6328c58701b1e764dd5ec480daL48

Alright, I think we're on the same page then: yes, #1766 is definitely needed before this, and once that lands we still need some bits from this to make it right (again, we can't Fatal in a server process like the MCD; we need to nicely error out and report).

@cgwalters (Member)

Sorry, I thought about this some more and you're right. We do want something like this in the MCD case, although... it might be simpler to take all lines from the journal output that start with "error" or so?
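The "take the error lines from the journal output" idea could be sketched as follows (an editorial illustration with a heuristic filter; not the MCO's code, and real pivot logs may not follow this shape):

```go
package main

import (
	"fmt"
	"strings"
)

// extractErrorLines keeps only lines that look like errors; these could
// then be attached to the error the daemon reports upward.
func extractErrorLines(journal string) []string {
	var errs []string
	for _, line := range strings.Split(journal, "\n") {
		l := strings.ToLower(strings.TrimSpace(line))
		// Heuristic: lines starting with "error" or containing "failed:".
		if strings.HasPrefix(l, "error") || strings.Contains(l, "failed:") {
			errs = append(errs, strings.TrimSpace(line))
		}
	}
	return errs
}

func main() {
	sample := `Running: podman pull -q ...
Error: error pulling image "...": manifest unknown
W0624 podman failed: exit status 125; retrying...`
	for _, e := range extractErrorLines(sample) {
		fmt.Println(e)
	}
}
```

The obvious downside, as noted in the next reply, is that a prefix heuristic can't reliably distinguish errors from ordinary log lines.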

@runcom (Member, Author) commented Jun 24, 2020

> Sorry, I thought about this some more and you're right. We do want something like this in the MCD case, although... it might be simpler to take all lines from the journal output that start with "error" or so?

we need to differentiate between errors and logs :/ I'm OK reworking this for that though, no biggie. Also, per https://github.com/openshift/machine-config-operator/pull/1766/files#diff-06961b075f1753956d802ba954d2cfb5R1286, in the cluster case we still run the service, so we still need something that bubbles things up.

But again, I think we're on the same page now. I'll wait until #1766 lands and then rework this the way we prefer :)

@runcom (Member, Author) commented Jul 2, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom (Member, Author) commented Jul 2, 2020

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

21 similar comments

@openshift-ci-robot (Contributor) commented Jul 4, 2020

@runcom: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-scaleup-rhel7 | e8a5395 | link | `/test e2e-aws-scaleup-rhel7` |
| ci/prow/e2e-ovn-step-registry | e8a5395 | link | `/test e2e-ovn-step-registry` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 7cc0875 into openshift:master Jul 4, 2020
@openshift-ci-robot (Contributor)

@runcom: All pull requests linked via external trackers have merged: openshift/machine-config-operator#1868. Bugzilla bug 1840385 has been moved to the MODIFIED state.

In response to this:

Bug 1840385: pkg/daemon: bubble up pivot errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom runcom deleted the bubbleup-pivot-errors branch July 6, 2020 07:44
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.