pkg/daemon: log pending config to journal #711

Merged
merged 2 commits into openshift:master from log-pending-config on May 8, 2019

runcom
Member

@runcom runcom commented May 6, 2019

Let's try to avoid losing a file write...

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 6, 2019
@runcom runcom force-pushed the log-pending-config branch 2 times, most recently from cff0cb9 to 534dcca Compare May 6, 2019 22:57
}

pendingConfigStr := fmt.Sprintf(`MESSAGE_ID=34c7912c5dd2454286097f8f92a22e9e
MESSAGE=%s`, pending.GetName())
Member Author

TODO write bootid

@@ -702,6 +692,36 @@ func (dn *Daemon) updateOS(config *mcfgv1.MachineConfig) error {
return nil
}

func (dn *Daemon) readPendingConfig() (string, error) {
// TODO(runcom): msgid has been generated with journalctl --new-id128, move it to a const
journalOutput, err := exec.Command("journalctl", "-b", "-1", "-o", "cat", "MESSAGE_ID=34c7912c5dd2454286097f8f92a22e9e").CombinedOutput()
Member Author

Output as JSON so that we can read BOOT_ID and MESSAGE.
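
For illustration, here is a minimal sketch of what reading the JSON output could look like. It assumes `journalctl -o json` is used, which emits one JSON object per line and exposes the boot ID under the journald field name `_BOOT_ID`. The `journalEntry` type and the `parseJournalJSON` helper are hypothetical names and are not part of this PR.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// journalEntry holds the two fields of interest from one journal record.
// Hypothetical type; all other fields in the JSON object are ignored.
type journalEntry struct {
	Message string `json:"MESSAGE"`
	BootID  string `json:"_BOOT_ID"`
}

// parseJournalJSON decodes line-delimited JSON as produced by
// `journalctl -o json`. Blank lines are skipped; a malformed line
// aborts parsing with an error.
func parseJournalJSON(output string) ([]journalEntry, error) {
	var entries []journalEntry
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		var e journalEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			return nil, fmt.Errorf("decoding journal entry: %v", err)
		}
		entries = append(entries, e)
	}
	return entries, scanner.Err()
}
```

The caller could then compare `BootID` against the expected boot instead of relying only on `-b -1`, though whether that is needed is a design decision for the PR.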

Member Author

also, we might want to have a customized "machine-config-daemon-pending-config" as msgid - not sure about opinions on this?

return fmt.Errorf("failed to get stdin pipe: %v", err)
}

pendingConfigStr := fmt.Sprintf(`MESSAGE_ID=34c7912c5dd2454286097f8f92a22e9e
Member

This could use some abstraction into a wrapper that takes key=value strings as varargs or so.

(One other random thought I had is that we systemd-run a service on the host and send our logs to it and have it log them; then we can properly systemctl status openshift-machine-config or so - it would also basically act as a host-side mutex too to make doubly sure there aren't multiple MCDs)
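
A rough sketch of the kind of varargs wrapper suggested above might look like the following. The function name `formatJournalFields` and its validation rules are assumptions made for illustration; the PR itself builds the payload inline with `fmt.Sprintf`.

```go
package main

import (
	"fmt"
	"strings"
)

// formatJournalFields joins KEY=value strings into a newline-separated
// payload, one field per line. Hypothetical helper: it rejects fields
// without a non-empty key before "=" and fields containing a newline,
// since a newline would start a new field in this format.
func formatJournalFields(fields ...string) (string, error) {
	for _, f := range fields {
		if i := strings.Index(f, "="); i <= 0 {
			return "", fmt.Errorf("field %q is not in KEY=value form", f)
		}
		if strings.Contains(f, "\n") {
			return "", fmt.Errorf("field %q contains a newline", f)
		}
	}
	return strings.Join(fields, "\n"), nil
}
```

With such a helper, the call site could read `formatJournalFields("MESSAGE_ID=34c7912c5dd2454286097f8f92a22e9e", "MESSAGE="+pending.GetName())`, and adding a boot ID field later would be a one-argument change.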

@cgwalters
Member

I know I suggested this...but I'm wavering a bit. I'd be a lot happier if we had a stronger idea of what was happening...I really really want to get live access to an affected cluster.

Maybe the best bet is to keep throwing in more logging PRs and see what comes from that.

@runcom
Member Author

runcom commented May 7, 2019

> I know I suggested this...but I'm wavering a bit. I'd be a lot happier if we had a stronger idea of what was happening...I really really want to get live access to an affected cluster.
>
> Maybe the best bet is to keep throwing in more logging PRs and see what comes from that.

I concur with that. I opened this to validate whether it was something we might pursue.

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 7, 2019
@runcom
Member Author

runcom commented May 7, 2019

/retest

@runcom
Member Author

runcom commented May 7, 2019

console/authentication failures

/retest

@ashcrow
Member

ashcrow commented May 7, 2019

/retest

@ashcrow
Member

ashcrow commented May 7, 2019

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console: timed out waiting for the condition"

}
return "", nil
}
return "", fmt.Errorf("no pending config found in journal: %v", string(journalOutput))
Member

Could this end up being a very large error to return? That is, could journalOutput contain a lot of content?

Member Author

We're filtering the journal only on our message ID, so even if we reboot 100 times we only get 100 lines. This error is unreachable though, so I need to remove it.

}

func (dn *Daemon) logPendingConfig(pending *mcfgv1.MachineConfig, isPending int) error {
logger := exec.Command("logger", "--journald")
Member

It may be worth noting why writing to stdin is used rather than directly executing the command with the message. At first I was going to recommend simplifying via a direct call, until I gave logger a run and realized that calling it directly is awkward (double Enter to end input).
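
To make the stdin approach concrete, here is a minimal sketch of feeding a payload to a command's standard input from Go. The diff in this PR obtains a stdin pipe; this sketch instead assigns a reader to `cmd.Stdin`, which closes the input once the payload is consumed so the command sees end-of-input. The helper name `runWithStdin` is hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// runWithStdin runs the named command, supplies payload on its standard
// input, and returns the combined stdout/stderr. Because cmd.Stdin is a
// finite reader, the child process receives EOF after the payload, so no
// interactive "end of input" keystrokes are needed.
func runWithStdin(payload, name string, args ...string) ([]byte, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = strings.NewReader(payload)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return out, fmt.Errorf("running %s: %v", name, err)
	}
	return out, nil
}
```

For the case discussed here, a call might look like `runWithStdin("MESSAGE_ID=34c7912c5dd2454286097f8f92a22e9e\nMESSAGE=some-config\n", "logger", "--journald")`, where `some-config` is a placeholder value and `logger` must be available on the host.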

@ashcrow
Member

ashcrow commented May 7, 2019

/retest

@imcleod
Contributor

imcleod commented May 7, 2019

/retest

@runcom
Member Author

runcom commented May 7, 2019

/retest

@runcom runcom force-pushed the log-pending-config branch 2 times, most recently from 769b887 to b0cbf41 Compare May 7, 2019 23:00
@runcom runcom changed the title WIP pkg/daemon: log pending config to journal pkg/daemon: log pending config to journal May 7, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2019
@runcom
Member Author

runcom commented May 8, 2019

oh so nice this worked and we're now reading/writing to journal the pending config

@kikisdeliveryservice
Contributor

@runcom are we going this route for now then?

@runcom
Member Author

runcom commented May 8, 2019

> @runcom are we going this route for now then?

Let's hear back from @cgwalters, but I would love to merge this and validate it properly in the upgrade jobs (we need this to land in master in order for the upgrade to pick up this change as the starting point of the job itself).
I share Colin's hesitation about going this route, and I still favor finding the root cause of the Bugzilla bug and handling it properly. Having said that, however close we might be, if this turns out to be stable enough, why not have it?

@cgwalters
Member

I think what's pushing me towards merging this the most is that it will make auditing events easier.

The code design looks good to me.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@imcleod
Contributor

imcleod commented May 8, 2019

/retest

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@runcom
Member Author

runcom commented May 8, 2019

rebased and re-pushed.
/lgtm

@openshift-ci-robot
Contributor

@runcom: you cannot LGTM your own PR.

In response to this:

rebased and re-pushed.
/lgtm


@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, runcom


@runcom runcom added the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@runcom runcom added the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@runcom runcom added the lgtm Indicates that a PR is ready to be merged. label May 8, 2019
@openshift-merge-robot openshift-merge-robot merged commit fe5ae49 into openshift:master May 8, 2019
@runcom runcom deleted the log-pending-config branch May 8, 2019 11:05
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants