
Reduce a number of error and warn messages to lower logging levels. #3101

Merged
7 commits merged into develop from logging/reduce-high-logging-levels on Jul 12, 2018

Conversation

@tboam (Contributor) commented Apr 19, 2018

Goals (and why): Reduce the number of messages we log at ERROR and WARN levels. These have caused unnecessary alarm for users and often obscure their own error messages.

Implementation Description (bullets): I've reduced the logging level for a number of log lines, in some places by quite a lot. I expect discussion and pushback on some of the changes, but I'd rather have those conversations than sit on this.

Concerns (what feedback would you like?): Where have I reduced the logging level too much?

Where should we start reviewing?: Each change is distinct, so anywhere you want.

Priority (whenever / two weeks / yesterday): It would be good to merge this next week.



@jeremyk-91 (Contributor) left a comment

A couple of things to think about:

  • Some services have debug logs by default (e.g. auth product) but not all do (e.g. shopping).
  • Large internal product does AtlasDB logging at warn or above only, apart from Sweep. We should consider lowering theirs to match internal data platform.

I'll schedule time to dive into this by the end of the week.

@diogoholanda (Contributor):

Yeah, I would assume that debug-level logs are not generally available for addressing past problems; they are most likely enabled manually after an event.

@jeremyk-91 (Contributor) left a comment

Took a look through; most changes are reasonable.

The main logic I used to guide me through these was:

  • For most Cassandra-specific usages, you can assume INFO logging is available.
  • For other usages, you can only assume WARN logging is available in general, apart from Background Sweep (look at the large internal product's default logging config for an explanation why).

I mostly agree with the changes. The ones I'm not so sure about would mainly be the Clock Skew Monitor, and possibly the Blacklist in CassandraKVS.

@@ -126,7 +126,7 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
return servers;
} catch (Exception e) {
log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
log.warn("Couldn't grab new token ranges for token aware cassandra mapping!", e);
Contributor:

The main implication of this is that we lose the perf optimisation where we talk directly to the correct nodes, right? If that is correct, warn is probably fine.

Contributor:

Is this something that can be/is retried? If so, then the problem is likely to fix itself, possibly without us even realising that it has done so (as I don't think we log when the token ranges were successfully updated, unless it's logged somewhere else).

Again, if refreshTokenRanges is only called occasionally, we could consider a separate PR where we periodically retry the mapping update in the event of a failure. This can then be an info, as human action will probably not be required to remediate the issue.

Contributor:

It's retried every 2 minutes, so I'm bumping it to info and adding that detail to the message.
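For reference, a minimal sketch of what the bumped line could look like with the retry detail attached. The helper class, constant, and argument name here are illustrative; the 120-second default is the value mentioned later in this thread.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.palantir.logsafe.SafeArg;

public final class TokenRangeRefreshLogging {
    private static final Logger log = LoggerFactory.getLogger(TokenRangeRefreshLogging.class);
    // Illustrative: the pool refreshes its token-range mapping every 120 seconds by default.
    private static final int REFRESH_INTERVAL_SECONDS = 120;

    private TokenRangeRefreshLogging() {}

    public static void logRefreshFailure(Exception e) {
        // INFO rather than WARN/ERROR: the refresh is retried automatically, so no human action is needed.
        log.info("Couldn't grab new token ranges for token aware cassandra mapping. We will retry in {} seconds.",
                SafeArg.of("retryIntervalSeconds", REFRESH_INTERVAL_SECONDS),
                e);
    }
}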

@@ -247,7 +247,7 @@ private Runnable createFlushTask() {
}
} catch (Throwable t) {
if (!Thread.interrupted()) {
log.error("Error occurred while flushing sweep stats: {}", t, t);
log.warn("Error occurred while flushing sweep stats: {}", t, t);
Contributor:

💯 this is bad but not service-haltingly-bad in most cases.

@@ -462,7 +462,7 @@ private static void checkInstallConfig(AtlasDbConfig config) {
|| config.getSweepReadLimit() != null
|| config.getSweepCandidateBatchHint() != null
|| config.getSweepDeleteBatchHint() != null) {
log.error("Your configuration specifies sweep parameters on the install config. They will be ignored."
log.warn("Your configuration specifies sweep parameters on the install config. They will be ignored."
Contributor:

This message (even at ERROR) didn't save one of our internal deployments from believing their install config worked. 😞

For the most part, I think of logs as something to guide diagnosis or investigation, not so much as something to be looked at in and of themselves...

Contributor:

Separate task: this should possibly be a healthcheck warning rather than a warning hidden in the logs (see the sketch after this list). It's not severe enough to force a user to look at it (e.g. by refusing to start), but it is still something that should be fixed, because ignoring it could lead people to believe their config is different from what it actually is. This would cause:

  • Best case: sweep is less performant than intended
  • Average case: time lost investigating an issue because the problem isn't obvious enough
  • Worst case: full-stack failure (if the configuration options were set in order to avoid a Cassandra OOM or similar)
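A rough sketch of what such a healthcheck could look like, assuming a Dropwizard-style com.codahale.metrics.health.HealthCheck can be registered by the service. The class name and the unhealthy message wording are made up; the config accessors mirror the (partial) condition visible in the diff above.

import com.codahale.metrics.health.HealthCheck;

public final class SweepInstallConfigHealthCheck extends HealthCheck {
    private final AtlasDbConfig config;

    public SweepInstallConfigHealthCheck(AtlasDbConfig config) {
        this.config = config;
    }

    @Override
    protected Result check() {
        // Mirrors the condition shown in the diff above; extend with the other deprecated
        // sweep parameters that checkInstallConfig inspects.
        boolean ignoredSweepParamsPresent = config.getSweepReadLimit() != null
                || config.getSweepCandidateBatchHint() != null
                || config.getSweepDeleteBatchHint() != null;
        if (ignoredSweepParamsPresent) {
            return Result.unhealthy("Sweep parameters are set on the install config and will be ignored;"
                    + " please move them to the runtime config.");
        }
        return Result.healthy();
    }
}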

@@ -75,7 +75,7 @@ public void clockSkew(String server, long skew, long minRequestInterval, long du
}

public void clockWentBackwards(String server, long amount) {
log.error("The clock for server {} went backwards by {} nanoseconds",
log.debug("The clock for server {} went backwards by {} nanoseconds",
Contributor:

For these three messages: I don't think timelock, where this is used, logs at DEBUG at all. If we still wanted to monitor incidences of clock skew (e.g. as a precursor to our very contentious #1412), it's probably better to keep these at INFO, or to change the default log level in timelock to DEBUG for this logger specifically.

Contributor:

+1 to "change the default log level in timelock to be DEBUG for this logger specifically."
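For what it's worth, a minimal sketch of a logger-specific override, assuming timelock's SLF4J backend is logback. The logger name is hypothetical, and in practice this would usually live in the service's logging configuration rather than in code.

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public final class ClockSkewLoggingOverride {
    private ClockSkewLoggingOverride() {}

    public static void enableClockSkewDebugLogging() {
        // Lower only the clock-skew logger to DEBUG, leaving the root logger's level untouched.
        // The logger name is illustrative; it should match the class that owns clockWentBackwards.
        Logger clockSkewLogger =
                (Logger) LoggerFactory.getLogger("com.palantir.atlasdb.timelock.clock.ClockSkewEvents");
        clockSkewLogger.setLevel(Level.DEBUG);
    }
}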

@@ -70,7 +70,7 @@ private boolean isHostHealthy(CassandraClientPoolingContainer container) {
container.runWithPooledResource(CassandraUtils.getValidatePartitioner(config));
return true;
} catch (Exception e) {
log.warn("We tried to add blacklisted host '{}' back into the pool, but got an exception"
log.debug("We tried to add blacklisted host '{}' back into the pool, but got an exception"
Contributor:

Very often this is a TimedOutException or various Thrift-based things; I don't know whether this one is still useful for debugging.

Contributor:

I would put this to info, as otherwise the information will just be lost at a point when we're trying to diagnose any blacklisting issues.

However, I would like the message to be improved - what does it mean that we're distrusting it further? When will we next attempt to contact this host? This could potentially be done as a separate, small(?) PR.
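A rough sketch of a more explicit message, assuming the caller knows the blacklist recheck interval. The helper, argument names, and safe/unsafe classification are illustrative, not the actual CassandraClientPool code.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.palantir.logsafe.SafeArg;
import com.palantir.logsafe.UnsafeArg;

public final class BlacklistLogging {
    private static final Logger log = LoggerFactory.getLogger(BlacklistLogging.class);

    private BlacklistLogging() {}

    public static void logHostStillBlacklisted(String host, long recheckIntervalSeconds, Exception ex) {
        // Spells out the consequence (the host stays blacklisted) and the next step (automatic recheck),
        // instead of the vaguer "distrusting it further" wording.
        log.info("We tried to add blacklisted host '{}' back into the pool, but it still appears unhealthy."
                        + " It will stay blacklisted and be rechecked in {} seconds.",
                UnsafeArg.of("host", host),
                SafeArg.of("recheckIntervalSeconds", recheckIntervalSeconds),
                ex);
    }
}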

@@ -87,7 +87,7 @@ public void decreaseMultiplier() {
// Cut batch size in half, always sweep at least one row.
reduceBatchSizeMultiplier();

log.warn("Sweep failed unexpectedly with candidate batch size {},"
log.info("Sweep failed unexpectedly with candidate batch size {},"
Contributor:

Probably outside the scope of this PR, but there is a case for a higher log level (WARN, I think) if all the parameters are already 1.
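A minimal sketch of that idea, escalating only when the batch parameters have already bottomed out. The helper and parameter names are illustrative and do not match the actual batch-size adjuster's fields.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.palantir.logsafe.SafeArg;

public final class SweepFailureLogging {
    private static final Logger log = LoggerFactory.getLogger(SweepFailureLogging.class);

    private SweepFailureLogging() {}

    public static void logUnexpectedSweepFailure(int candidateBatchSize, int deleteBatchSize, int readLimit) {
        // If every parameter is already 1, backing off further won't help, so a human should probably look.
        boolean cannotShrinkFurther = candidateBatchSize == 1 && deleteBatchSize == 1 && readLimit == 1;
        String message = "Sweep failed unexpectedly with candidate batch size {}, delete batch size {}, read limit {}.";
        if (cannotShrinkFurther) {
            log.warn(message,
                    SafeArg.of("candidateBatchSize", candidateBatchSize),
                    SafeArg.of("deleteBatchSize", deleteBatchSize),
                    SafeArg.of("readLimit", readLimit));
        } else {
            log.info(message,
                    SafeArg.of("candidateBatchSize", candidateBatchSize),
                    SafeArg.of("deleteBatchSize", deleteBatchSize),
                    SafeArg.of("readLimit", readLimit));
        }
    }
}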

@gsheasby (Contributor) left a comment

There are two cases where the level was switched to debug and I'm arguing for info.

Otherwise, I have some suggestions for improving our logging story, some of which are fairly trivial and some of which may require a "tiny task" worth of work. I'm happy for these to be pushed to a separate PR or GitHub issue, especially for the comments where I've mentioned that kind of thing.


SafeArg.of("numTries", numberOfAttempts),
SafeArg.of("maxTotalTries", maxTriesTotal),
SafeArg.of("exceptionClass", ex.getClass().getTypeName()),
UnsafeArg.of("exceptionMessage", ex.getMessage()));
} else {
log.info("Error occurred talking to cassandra. Attempt {} of {}.",
log.debug("Error occurred talking to cassandra. Attempt {} of {}.",
@gsheasby commented Apr 23, 2018:

This message has been useful for detecting ongoing issues, e.g. when sweep is misbehaving.

@gsheasby commented Jul 5, 2018:

Retracting this comment: we'll log when we error on the last attempt.

Contributor:

Do we log more visibly on the last attempt if it fails?

Contributor:

Yes, it's a WARN if the last attempt fails (see handleExceptionFromRequest and logAndThrowException).
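For readers following along, a minimal sketch of that pattern: quiet on intermediate attempts, WARN only once the retry budget is exhausted. This is not the actual handleExceptionFromRequest/logAndThrowException code; the class and method names are illustrative.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.palantir.logsafe.SafeArg;

public final class CassandraRetryLogging {
    private static final Logger log = LoggerFactory.getLogger(CassandraRetryLogging.class);

    private CassandraRetryLogging() {}

    public static void logAttemptFailure(Exception ex, int numberOfAttempts, int maxTriesTotal) {
        if (numberOfAttempts >= maxTriesTotal) {
            // Last attempt: the failure is about to surface to the caller, so make it visible.
            log.warn("Error occurred talking to cassandra. Exhausted all {} attempts; giving up.",
                    SafeArg.of("maxTotalTries", maxTriesTotal),
                    ex);
        } else {
            // Intermediate attempt: likely to succeed on retry, so keep it out of the default logs.
            log.debug("Error occurred talking to cassandra. Attempt {} of {}.",
                    SafeArg.of("numTries", numberOfAttempts),
                    SafeArg.of("maxTotalTries", maxTriesTotal),
                    ex);
        }
    }
}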

tableRefs.stream().map(TableReference::getQualifiedName).collect(Collectors.joining(", ")));
}

private void logTraceResults(long duration, Set<TableReference> tableRefs, ByteBuffer recvTrace, boolean failed) {
if (failed || duration > tracingPrefs.getMinimumDurationToTraceMillis()) {
log.error("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
log.warn("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
Contributor:

This looks like it could even be an info message.


@@ -182,13 +182,13 @@ InetSocketAddress getAddressForHost(String host) throws UnknownHostException {
.filter(predicate)
.collect(Collectors.toSet());
if (filteredHosts.isEmpty()) {
log.error("No hosts match the provided predicate.");
log.info("No hosts match the provided predicate.");
Contributor:

Suggestion for a quick improvement while we're here: we should also log the predicate, if plausible.
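One way this could look, assuming the caller can pass a human-readable description of the predicate (java.util.function.Predicate itself has no useful toString). The helper and argument names are illustrative.

import java.net.InetSocketAddress;
import java.util.Set;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.palantir.logsafe.SafeArg;
import com.palantir.logsafe.UnsafeArg;

public final class HostFilterLogging {
    private static final Logger log = LoggerFactory.getLogger(HostFilterLogging.class);

    private HostFilterLogging() {}

    public static void logNoMatchingHosts(String predicateDescription, Set<InetSocketAddress> hostsConsidered) {
        // Including what we filtered for and which hosts were rejected makes the empty result diagnosable.
        log.info("No hosts match the provided predicate ({}). Hosts considered: {}.",
                SafeArg.of("predicateDescription", predicateDescription),
                UnsafeArg.of("hostsConsidered", hostsConsidered));
    }
}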



@gsheasby self-assigned this May 30, 2018
@gsheasby dismissed their stale review July 5, 2018 13:11

I've applied my own comments, but @jeremyk-91 should check my commits.

@gsheasby assigned jeremyk-91 and unassigned gsheasby Jul 5, 2018
@gsheasby assigned sandorw and unassigned jeremyk-91 Jul 12, 2018
@@ -131,7 +131,9 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
return servers;
} catch (Exception e) {
log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
log.info("Couldn't grab new token ranges for token aware cassandra mapping. We will retry in {} seconds.",
Contributor:

This one's unclear to me: should it be warn or info?

Contributor:

The exception is thrown when:

  • The C* call to get the token ring fails
  • Something is borked and we invent a host that is not in the config (shouldn't happen)

Since we retry every 120 seconds by default, it seems very unlikely to me that any engineer will need to take action as a result of the log message, and therefore INFO seems appropriate to me.

@sandorw (Contributor) left a comment

Looks good

@gsheasby merged commit 2d9acd2 into develop on Jul 12, 2018
@gsheasby deleted the logging/reduce-high-logging-levels branch on July 12, 2018 at 13:51