Reduce a number of error and warn messages to lower logging levels. #3101
Conversation
Couple of things to think about:

- Some services have `debug` logs by default (e.g. auth product) but not all do (e.g. shopping).
- Large internal product does AtlasDB logging at warn or above only, apart from Sweep. We should consider lowering theirs to match internal data platform.
Will schedule diving into this by end of the week.
Yeah, I would assume that debug-level logs would not be available in general to address past problems. They are most likely manually enabled after an event.
Took a look through. Most changes are reasonable.
The main logic I used to guide me through these was:
- For most Cassandra-specific usages, you can assume INFO logging is available.
- For other usages, you can only assume WARN logging is available in general, apart from Background Sweep (look at the large internal product's default logging config for an explanation why).

Mostly agree with the changes. The ones I'm not so sure about would mainly be the Clock Skew Monitor, and possibly `Blacklist` in CassandraKVS.
```diff
@@ -126,7 +126,7 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
         tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
         return servers;
     } catch (Exception e) {
-        log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
+        log.warn("Couldn't grab new token ranges for token aware cassandra mapping!", e);
```
The main implication of this is that we lose the perf optimisation we have where we talk directly to the correct nodes, right? If that is correct, `warn` is probably fine.
Is this something that can be/is retried? If so, then the problem is likely to fix itself, possibly without us even realising that it has done so (as I don't think we log when the token ranges were successfully updated, unless it's logged somewhere else).

Again, if `refreshTokenRanges` is only called occasionally, we could consider a separate PR where we periodically retry the mapping update in the event of a failure. This can then be an `info`, as human action will probably not be required to remediate the issue.
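A minimal sketch of what that periodic retry could look like. The class name and wiring below are hypothetical, not AtlasDB's actual implementation; the 120-second interval matches the 2-minute retry cadence mentioned later in this thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: names and wiring are illustrative.
class TokenRangeRefresher {
    private static final long REFRESH_INTERVAL_SECONDS = 120;

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable refreshTokenRanges;

    TokenRangeRefresher(Runnable refreshTokenRanges) {
        this.refreshTokenRanges = refreshTokenRanges;
    }

    void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refreshTokenRanges.run();
            } catch (RuntimeException e) {
                // info rather than error: the next scheduled attempt will retry,
                // so human action is usually not required
                System.out.println("INFO: Couldn't grab new token ranges for token"
                        + " aware cassandra mapping. We will retry in "
                        + REFRESH_INTERVAL_SECONDS + " seconds.");
            }
        }, 0, REFRESH_INTERVAL_SECONDS, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a failed refresh is self-healing and never needs to page anyone, which is the argument for `info` above.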
It's retried every 2 minutes, so I'm bumping it to info and adding that detail to the message.
```diff
@@ -247,7 +247,7 @@ private Runnable createFlushTask() {
             }
         } catch (Throwable t) {
             if (!Thread.interrupted()) {
-                log.error("Error occurred while flushing sweep stats: {}", t, t);
+                log.warn("Error occurred while flushing sweep stats: {}", t, t);
```
💯 this is bad but not service-haltingly-bad in most cases.
```diff
@@ -462,7 +462,7 @@ private static void checkInstallConfig(AtlasDbConfig config) {
                 || config.getSweepReadLimit() != null
                 || config.getSweepCandidateBatchHint() != null
                 || config.getSweepDeleteBatchHint() != null) {
-            log.error("Your configuration specifies sweep parameters on the install config. They will be ignored."
+            log.warn("Your configuration specifies sweep parameters on the install config. They will be ignored."
```
This message (even at ERROR) didn't save one of our internal deployments from believing their install config worked. 😞
For the most part, I think of logs as something to guide diagnosis or investigation, not so much as something to be looked at in and of themselves...
Separate task: this should possibly be a healthcheck warning rather than a warning hidden in the logs. It's not severe enough to force a user to look at it (e.g. by refusing to start), but is still something that should be fixed, because ignoring it could lead people to believe their config is different to what it actually is. This would cause:
- Best case: less performant than intended sweep
- Average case: lost time in investigating an issue because it's not obvious enough what the problem is
- Worst case: full stack failure (in case the configuration options are set in order to avoid a Cassandra OOM or similar)
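A sketch of how such a healthcheck could surface the misconfiguration. All names here are hypothetical; the real version would hook into the service's healthcheck framework and read the actual `AtlasDbConfig` getters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect the sweep parameters that were set on the
// install config so a healthcheck can report them, instead of burying a
// warning in the logs. Parameter names are illustrative.
class SweepInstallConfigCheck {
    static List<String> misplacedSweepParams(Integer batchSize, Integer readLimit,
            Integer candidateBatchHint, Integer deleteBatchHint) {
        List<String> misplaced = new ArrayList<>();
        if (batchSize != null) misplaced.add("sweepBatchSize");
        if (readLimit != null) misplaced.add("sweepReadLimit");
        if (candidateBatchHint != null) misplaced.add("sweepCandidateBatchHint");
        if (deleteBatchHint != null) misplaced.add("sweepDeleteBatchHint");
        return misplaced;
    }

    static String check(Integer batchSize, Integer readLimit,
            Integer candidateBatchHint, Integer deleteBatchHint) {
        List<String> misplaced = misplacedSweepParams(
                batchSize, readLimit, candidateBatchHint, deleteBatchHint);
        if (misplaced.isEmpty()) {
            return "HEALTHY";
        }
        // warning-but-not-fatal: the service still starts, but the problem is
        // visible on the healthcheck page rather than only in the logs
        return "WARNING: sweep parameters " + misplaced
                + " are set on the install config and will be ignored;"
                + " move them to the runtime config";
    }
}
```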
```diff
@@ -75,7 +75,7 @@ public void clockSkew(String server, long skew, long minRequestInterval, long du
     }

     public void clockWentBackwards(String server, long amount) {
-        log.error("The clock for server {} went backwards by {} nanoseconds",
+        log.debug("The clock for server {} went backwards by {} nanoseconds",
```
For these 3 messages, I don't think timelock where this is used logs at DEBUG at all. If we still wanted to monitor incidences of clock skew (e.g. as a precursor to our very contentious #1412) it's probably better to keep these at INFO, or change the default log level in timelock to be DEBUG for this logger specifically.
+1 to "change the default log level in timelock to be DEBUG for this logger specifically."
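For reference, a per-logger override in a logback config might look like this. This is a sketch only: the logger name below is a placeholder, not the clock-skew logger's actual fully qualified name.

```xml
<configuration>
  <!-- timelock's default level stays as-is for everything else -->
  <root level="WARN">
    <appender-ref ref="FILE"/>
  </root>
  <!-- keep clock-skew events visible without raising the global level;
       replace the name below with the real clock-skew logger -->
  <logger name="com.palantir.atlasdb.ClockSkewEvents" level="DEBUG"/>
</configuration>
```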
```diff
@@ -70,7 +70,7 @@ private boolean isHostHealthy(CassandraClientPoolingContainer container) {
             container.runWithPooledResource(CassandraUtils.getValidatePartitioner(config));
             return true;
         } catch (Exception e) {
-            log.warn("We tried to add blacklisted host '{}' back into the pool, but got an exception"
+            log.debug("We tried to add blacklisted host '{}' back into the pool, but got an exception"
```
Very often this is `TimedOutException` or various Thrift-based things, though I don't know if this one is still useful for debugging?
I would put this to info, as otherwise the information will just be lost at a point when we're trying to diagnose any blacklisting issues.
However, I would like the message to be improved - what does it mean that we're distrusting it further? When will we next attempt to contact this host? This could potentially be done as a separate, small(?) PR.
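As a sketch of the kind of message that would answer both questions. The method name, wording, and backoff parameter here are hypothetical; the real implementation would use the safe-logging `Arg`s and the pool's actual backoff configuration.

```java
// Hypothetical sketch of a more informative blacklist log message.
class BlacklistMessages {
    static String hostStillUnhealthy(String host, long nextCheckSeconds) {
        return String.format(
                "Host '%s' is still unhealthy and will remain blacklisted;"
                        + " we will re-check its health in %d seconds.",
                host, nextCheckSeconds);
    }
}
```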
```diff
@@ -87,7 +87,7 @@ public void decreaseMultiplier() {
         // Cut batch size in half, always sweep at least one row.
         reduceBatchSizeMultiplier();

-        log.warn("Sweep failed unexpectedly with candidate batch size {},"
+        log.info("Sweep failed unexpectedly with candidate batch size {},"
```
Probably outside the scope of this PR, but there is a case for a higher log level (WARN, I think) if all params are 1.
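That escalation could be as simple as checking whether the batch parameters have bottomed out. This is a sketch with illustrative parameter names, not the actual sweep code:

```java
// Hypothetical sketch: pick a log level based on whether the batch parameters
// have already been reduced to their floor of 1, in which case sweep keeps
// failing even at minimum batch sizes and deserves a louder message.
class SweepFailureLogLevel {
    static String levelFor(int candidateBatchSize, int deleteBatchSize, int maxCellTsPairsToExamine) {
        boolean bottomedOut = candidateBatchSize == 1
                && deleteBatchSize == 1
                && maxCellTsPairsToExamine == 1;
        return bottomedOut ? "WARN" : "INFO";
    }
}
```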
There are two cases where the level was switched to `debug` and I'm arguing for `info`.
Otherwise, I have some suggestions for improving our logging story, some of which are fairly trivial, and some of which may require a "tiny task" worth of work. Happy for these to be pushed to a separate PR/github issue, especially those comments where I've mentioned that kind of thing.
```diff
                     SafeArg.of("numTries", numberOfAttempts),
                     SafeArg.of("maxTotalTries", maxTriesTotal),
                     SafeArg.of("exceptionClass", ex.getClass().getTypeName()),
                     UnsafeArg.of("exceptionMessage", ex.getMessage()));
         } else {
-            log.info("Error occurred talking to cassandra. Attempt {} of {}.",
+            log.debug("Error occurred talking to cassandra. Attempt {} of {}.",
```
This message has been useful for detecting ongoing issues, e.g. when sweep is misbehaving.
Retracting this comment: we'll log when we error at the last attempt.
Do we log more visibly on the last attempt if it fails?
Yes, it's a `WARN` if the last attempt fails (see `handleExceptionFromRequest` and `logAndThrowException`).
```diff
                     tableRefs.stream().map(TableReference::getQualifiedName).collect(Collectors.joining(", ")));
         }

     private void logTraceResults(long duration, Set<TableReference> tableRefs, ByteBuffer recvTrace, boolean failed) {
         if (failed || duration > tracingPrefs.getMinimumDurationToTraceMillis()) {
-            log.error("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
+            log.warn("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
```
This looks like it could even be an `info` message.
```diff
@@ -182,13 +182,13 @@ InetSocketAddress getAddressForHost(String host) throws UnknownHostException {
                 .filter(predicate)
                 .collect(Collectors.toSet());
         if (filteredHosts.isEmpty()) {
-            log.error("No hosts match the provided predicate.");
+            log.info("No hosts match the provided predicate.");
```
Suggestion for a quick improvement, while we're here: should also log the predicate, if plausible.
I've applied my own comments, but @jeremyk-91 should check my commits.
```diff
@@ -131,7 +131,9 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
         tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
         return servers;
     } catch (Exception e) {
-        log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
+        log.info("Couldn't grab new token ranges for token aware cassandra mapping. We will retry in {} seconds.",
```
This one's unclear to me: whether it should be `warn` or `info`.
The exception is thrown when:
- The C* call to get the token ring fails
- Something is borked and we invent a host that is not in the config (shouldn't happen)
Since we retry every 120 seconds by default, it seems very unlikely to me that any engineer will need to take action as a result of the log message, and therefore `INFO` seems appropriate to me.
Looks good
Goals (and why): Reduce the number of messages we log at ERROR and WARN levels. These have caused unnecessary alarm for users and often obscure their own error messages.
Implementation Description (bullets): I've reduced the logging level for a number of log lines, in places by quite a lot. I expect discussion and push-back on some of the changes, but I want to have the conversations rather than sitting on this.
Concerns (what feedback would you like?): Where have I reduced the logging level too much?
Where should we start reviewing?: Each change is distinct so anywhere you want.
Priority (whenever / two weeks / yesterday): would be good to merge this next week