Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1708602: pkg/daemon: workaround old util-linux logger in rhel7.6 #734

Merged
merged 4 commits into from May 13, 2019

Conversation

Projects
None yet
8 participants
@runcom
Copy link
Member

commented May 10, 2019

- What I did

RHEL7.6 has an old util-linux's logger which doesn't support the --journald flag to log to journal in a structured way. Work that around by logging a json directly and fall back to that if we're on rhel7.6 then.

- How to verify it

- Description for the changelog

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/hold

Depends on #733

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/test e2e-rhel-scaleup

if !loggerWithoutJournalOption() {
return "", nil
}
cmd := "journalctl -o cat | grep OLD_LOGGER"

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

Eeek. Scraping all journal messages for that?

How about we detect on startup whether the host is rhel7, and add that as a flag in the daemon struct. That'd be useful for a lot of other things.

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

we scrape all journal only if --journald isn't supported so we need to do this anyway right? I didn't want to add a specific rhel7 flag as I was just testing for functionality which should be generally better. If rhel7.7 upgrades util linux to a logger that does have --journald, we'd be still using this awful ack if we just test RHEL versions

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

If rhel7.7 upgrades util linux to a logger that does have --journald, we'd be still using this awful ack if we just test RHEL versions

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

Eeek. Scraping all journal messages for that?

to reply specifically on that. With this patch, we're not always scraping all journal, if !loggerWithoutJournalOption() { takes care of telling us if we need to do that in case logger doesn't have --journald.

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

Even if it did upgrade which...seems very unlikely (right?) we'd still have an upgrade hazard where for previous versions we'd write one way and then later the other way. Seems better to just always live with one format.

Yeah, agree it's very unlikely indeed, but since we don't need to play with a rhel7 flag today, isn't testing for the logger functionality in itself enough? (I'm super happy to change it anyway)

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

OLD_LOGGER seems a bit too generic...while it's unlikely, that string could appear in log messages from any othe process.

(Ideally...we'd have the MCD logging under machine-config-daemon.service that we create on the host. That's another topic)

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ? 😉

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

How about calling it OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK ?

😂 I'll go for it, that's a good one indeed to avoid collisions

return loggerOutput, err
}
// Super hacky workaround for the above
oldLogger := exec.Command("logger", fmt.Sprintf(`{"MESSAGE": "%s", "BOOT_ID": "%s", "PENDING": "%d", "OLD_LOGGER": "1"}`, pending.GetName(), dn.bootID, isPending))

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

Then in this code we wouldn't try and fall back each time, we'd just do:

if (!dn.hostRhel7) {
   logger --journald
} else {
  logger with literal JSON as message
}

This comment has been minimized.

Copy link
@runcom
@@ -690,6 +690,17 @@ func (dn *Daemon) updateOS(config *mcfgv1.MachineConfig) error {
return nil
}

// RHEL 7.6 logger (util-linux) doesn't have the --journald flag
func loggerWithoutJournalOption() bool {
loggerOutput, err := exec.Command("logger", "--journald").CombinedOutput()

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

I think we should use logger --help...as running this command just hangs for me; maybe it works because we're not providing stdin?

Or better, do:

if dn.OperatingSystem == machineConfigDaemonOSRHCOS {
    return false
}

?

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

uhm, does it hang on RHCOS? works fine on rhel7.6 🤔

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

oh yeah I get it now thanks!

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

fixed

@@ -701,15 +712,25 @@ func (dn *Daemon) getPendingState() (string, error) {
return "", err
}
if len(journalOutput) == 0 {
return "", nil
if !loggerWithoutJournalOption() {

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 10, 2019

Contributor

note double negative here, probably better as loggerSupportsJournal

This comment has been minimized.

Copy link
@runcom

runcom May 10, 2019

Author Member

fixed

@runcom runcom force-pushed the runcom:fix-old-logger branch 2 times, most recently from 69f2122 to c37d648 May 10, 2019

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/test e2e-rhel-scaleup

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/retest

@runcom runcom force-pushed the runcom:fix-old-logger branch from c37d648 to bfbf2f6 May 10, 2019

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/test e2e-rhel-scaleup

Message string `json:"MESSAGE,omitempty"`
BootID string `json:"BOOT_ID,omitempty"`
Pending string `json:"PENDING,omitempty"`
OldLogger string `json:"OPENSHIFT_MACHINE_CONFIG_DAEMON_LEGACY_LOG_HACK,omitempty"` // unused today

This comment has been minimized.

Copy link
@kikisdeliveryservice

kikisdeliveryservice May 10, 2019

Member

love this. 😆

@ashcrow
Copy link
Member

left a comment

I feel like the logger shell out code should probably end up being encapsulated in it's own struct/func, but for now this seems sane to fix the bug.

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

@vrutkovs just for awareness, not sure who the owner is, but scaleup is broken somehow

EDIT: the last failure there seem to be:

level=fatal msg="failed to initialize the cluster: Working towards 0.0.1-2019-05-10-173304: 100% complete, waiting on authentication: timed out waiting for the condition"
2019/05/10 18:23:30 Container setup in pod e2e-rhel-scaleup failed, exit code 1, reason Error
Waiting for scaleup to complete...Another process exited
2019/05/10 18:23:40 Container test in pod e2e-rhel-scaleup failed, exit code 1, reason Error
@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/test e2e-rhel-scaleup

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/retest

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/refresh

@runcom runcom force-pushed the runcom:fix-old-logger branch from bfbf2f6 to acfb35f May 10, 2019

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

/test e2e-rhel-scaleup

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 10, 2019

aws limits.... will retest later...

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 11, 2019

/retest

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 11, 2019

/hold cancel

@@ -734,7 +754,17 @@ BOOT_ID=%s
PENDING=%d`, pendingStateMessageID, pending.GetName(), dn.bootID, isPending)))

logger.Stdin = &pendingState
return logger.CombinedOutput()
loggerOutput, err := logger.CombinedOutput()

This comment has been minimized.

Copy link
@cgwalters

cgwalters May 11, 2019

Contributor

I still think it'd be cleaner to call loggerSupportsJournal once or so on MCD startup rather than trying and failing each time.

This comment has been minimized.

Copy link
@runcom

runcom May 11, 2019

Author Member

oh I can definitely do that if it's for loggerSupportsJournal

This comment has been minimized.

Copy link
@runcom

runcom May 11, 2019

Author Member

should be fixed

@cgwalters

This comment has been minimized.

Copy link
Contributor

commented May 11, 2019

/approve
/retest

pkg/daemon: do not compare pointers
Signed-off-by: Antonio Murdaca <runcom@linux.com>

@runcom runcom force-pushed the runcom:fix-old-logger branch from 6c7df75 to 2948acb May 12, 2019

@openshift-ci-robot openshift-ci-robot added size/M and removed size/L labels May 12, 2019

pkg/daemon: write pending state to disk if old logger
Signed-off-by: Antonio Murdaca <runcom@linux.com>

@openshift-ci-robot openshift-ci-robot added size/L and removed size/M labels May 12, 2019

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 12, 2019

/test e2e-rhel-scaleup

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 12, 2019

ok, this is now stable in my testing against a real scaleup cluster with 3 rhcos masters + 2 centos 7 workers, the fallback to still write on disk made the trick and I'd leave it as a redundancy fallback. We should run our e2e-aws-op in scaleup job though or we risk these kinds of regressions everytime. What bothers me the most though is:

  • rhcos logs to journal just fine, we know it and CI jobs are on our side
  • RHEL7.6 logs to journal works from my testing
  • CentOS 7 logs aren't working, logger just pretends it logged something but it didn't at all
  • the 2 points above are the reason I've re-added the write-to-disk function and calling it now

Also, see openshift/release#3748 (comment)

Follow up from my points above: CentOS7 in scaleup job uses cri-o 1.12.x (wrong and old since kubelet is at 1.13)

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 12, 2019

/retest

The cluster and machineconfig operator go up in the scale up test but the job itself fails. This is not related to this change as it wasn't passing even before this (reason why it was never enabled by default)

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 12, 2019

/test e2e-rhel-scaleup

1 similar comment
@runcom

This comment has been minimized.

Copy link
Member Author

commented May 12, 2019

/test e2e-rhel-scaleup

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

scaleup passed this time..

@ashcrow

This comment has been minimized.

Copy link
Member

commented May 13, 2019

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

Let's strike while e2e is succeeding. @runcom is this ready for final review/merge?

it is, if @cgwalters @kikisdeliveryservice @LorbusChris can ack on this it would be ready to merge

@LorbusChris

This comment has been minimized.

Copy link
Member

commented May 13, 2019

/lgtm

@openshift-ci-robot

This comment has been minimized.

Copy link

commented May 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, LorbusChris, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LorbusChris,cgwalters,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

/test e2e-rhel-scaleup

@LorbusChris

This comment has been minimized.

Copy link
Member

commented May 13, 2019

😢

/test e2e-rhel-scaleup
/test e2e-aws

@runcom

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

/retest

@openshift-bot

This comment has been minimized.

Copy link

commented May 13, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

@ashcrow

This comment has been minimized.

Copy link
Member

commented May 13, 2019

/retest

@openshift-bot

This comment has been minimized.

Copy link

commented May 13, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot

This comment has been minimized.

Copy link

commented May 13, 2019

@runcom: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-rhel-scaleup 1817d82 link /test e2e-rhel-scaleup

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 6397877 into openshift:master May 13, 2019

7 of 8 checks passed

ci/prow/e2e-rhel-scaleup Job failed.
Details
ci/prow/e2e-aws Job succeeded.
Details
ci/prow/e2e-aws-op Job succeeded.
Details
ci/prow/e2e-aws-upgrade Job succeeded.
Details
ci/prow/images Job succeeded.
Details
ci/prow/unit Job succeeded.
Details
ci/prow/verify Job succeeded.
Details
tide In merge pool.
Details

@runcom runcom deleted the runcom:fix-old-logger branch May 13, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.