Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve health reporting #1064

Merged
merged 7 commits into from Apr 5, 2019

Conversation

Projects
None yet
3 participants
@lmm
Copy link
Member

commented Mar 25, 2019

Description

Fixes #973

The HealthAggregator now only logs the overall health report if it has changed and unhealthy reporters are logged at WARN.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Health aggregator now only logs if overall health status has changed. Details of unhealthy reporters are now logged.

@lmm lmm force-pushed the lmm:health-checks branch from de4e89b to 6c56231 Mar 25, 2019

@fasaxc
Copy link
Member

left a comment

Looking forward to having this change in place, will be very useful, thanks! I did find one bug and I've suggested a couple of changes that I think would improve the cleanliness of this code.

// Reset Live to false if that reporter is registered to report liveness and hasn't
// recently said that it is live.
if summary.Live && reporter.reports.Live && (!reporter.latest.Live ||
(reporter.timeout != 0 && time.Since(reporter.timestamp) > reporter.timeout)) {
notReportedLive := !reporter.latest.Live || (reporter.timeout != 0 && time.Since(reporter.timestamp) > reporter.timeout)

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 3, 2019

Member

It's usually a lot clearer to use "positive" naming for your variables. Once you name a variable notXXX then it tends to lead to double negatives later, which become hard to follow. I'd flip the sign of this variable and call it stillAlive or just live.

Also, (reporter.timeout != 0 && time.Since(reporter.timestamp) > reporter.timeout) is crying out for a utility method on the reporter. How about putting that behind reporter.TimedOut() bool. Then you'd have

stillAlive := reporter.latest.Live && !reporter.TimedOut()

which would be a lot cleaner.

})

if reporter.reports.Live && notReportedLive || reporter.reports.Ready && notReportedReady {
logFuncs = append(logFuncs, func() {

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 3, 2019

Member

Using a slice of funcs here works but deferred execution has quite a complicated mental model that I think it's best to avoid unless it's needed.

Why not collect a couple of slices:

var failedLivenessChecks []string
var failedReadinessChecks []string

and append the names of the failing checks to those.

Then, below when you come to log, do

if len(failingXXXChecks) > 0 {
    log.WithField("failedChecks", failingXXXChecks).Warn("Some XXX checks are failing")
}

This comment has been minimized.

Copy link
@lmm

lmm Apr 3, 2019

Author Member

I think it'll be nicer to split up the failing liveness vs readiness checks as you're hinting here too.

}

log.WithField("summary", summary).Info("Overall health")
// Summary status has changed so update previous status and log.
if summary.Live != aggregator.lastReport.Live && summary.Ready != aggregator.lastReport.Ready {

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 3, 2019

Member

I think this will only log if both flags change at the same time, but I think we want to log if either flag changed.

// Summary status has changed so update previous status and log.
if summary.Live != aggregator.lastReport.Live && summary.Ready != aggregator.lastReport.Ready {
aggregator.lastReport = summary
log.WithField("lastSummary", summary).Info("Overall health")

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 3, 2019

Member

Now you've put this behind a check, how about changing the log to say "Overall health status changed"

@lmm lmm force-pushed the lmm:health-checks branch from 5cc9134 to 6dea3ce Apr 4, 2019

switch {
case reporter.reports.Live && !stillLive:
failedLivenessChecks = append(failedLivenessChecks, name)
case reporter.reports.Ready && !stillReady:

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 4, 2019

Member

This case can only be triggered if the one above is not so a reporter that is neither live nor ready will only show up as not live. I guess that's OK since liveness is more severe than readiness but for debug purposes, might be better to have both bits of information.

// Summary status has changed so update previous status and log.
if summary.Live != aggregator.lastReport.Live || summary.Ready != aggregator.lastReport.Ready {
aggregator.lastReport = summary
log.WithField("lastSummary", summary).Info("Overall health status changed")

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 4, 2019

Member

It hink you're logging out the current summary, here, "lastSummary" hints that it's the previous value.

aggregator.lastReport = summary
log.WithField("lastSummary", summary).Info("Overall health status changed")

for _, name := range failedLivenessChecks {

This comment has been minimized.

Copy link
@fasaxc

fasaxc Apr 4, 2019

Member

I was thinking of logging a single log with a list of reporter names without the detailed state of each reporter. I guess logging out each reporter gives us a lot more to go on, though so let's keep your approach of one log per reporter. If you're going to do one log per reporter though, I'd suggest making failedLivenessChecks a []*reporterState so that you can just append the reporter instead of its name. They you won't need to look it up again.

@fasaxc

fasaxc approved these changes Apr 5, 2019

@fasaxc fasaxc merged commit 60b84d0 into projectcalico:master Apr 5, 2019

2 checks passed

license/cla Contributor License Agreement is signed.
Details
semaphoreci The build passed on Semaphore.
Details

@lmm lmm deleted the lmm:health-checks branch Apr 5, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.