Reduce a number of error and warn messages to lower logging levels. #3101
Conversation
Couple of things to think about:

- Some services have `debug` logs by default (e.g. auth product) but not all do (e.g. shopping).
- Large internal product does AtlasDB logging at warn or above only, apart from Sweep. We should consider lowering theirs to match internal data platform.
Will schedule diving into this by end of the week.
Yeah, I would assume that debug-level logs would not be available in general to address past problems. They are most likely manually enabled after an event.
Took a look through. Most changes are reasonable.
The main logic I used to guide me through these was:
- For most Cassandra-specific usages, you can assume INFO logging is available.
- For other usages, you can only assume WARN logging is available in general, apart from Background Sweep (look at the large internal product's default logging config for an explanation why).

Mostly agree with the changes. The ones I'm not so sure about would mainly be the Clock Skew Monitor, and possibly `Blacklist` in CassandraKVS.
```diff
@@ -126,7 +126,7 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
         tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
         return servers;
     } catch (Exception e) {
-        log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
+        log.warn("Couldn't grab new token ranges for token aware cassandra mapping!", e);
```
The main implication of this is that we lose the perf optimisation we have where we talk directly to the correct nodes, right? If that is correct, `warn` is probably fine.
Is this something that can be/is retried? If so, then the problem is likely to fix itself, possibly without us even realising that it has done so (as I don't think we log when the token ranges were successfully updated, unless it's logged somewhere else).

Again, if `refreshTokenRanges` is only called occasionally, we could consider a separate PR where we periodically retry the mapping update in the event of a failure. This can then be an `info`, as human action will probably not be required to remediate the issue.
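A minimal sketch of what that periodic retry could look like. The class name and wiring below are hypothetical, not AtlasDB's actual implementation; the 120-second interval matches the 2-minute retry cadence mentioned later in this thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: names and wiring are illustrative.
class TokenRangeRefresher {
    private static final long REFRESH_INTERVAL_SECONDS = 120;

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable refreshTokenRanges;

    TokenRangeRefresher(Runnable refreshTokenRanges) {
        this.refreshTokenRanges = refreshTokenRanges;
    }

    void start() {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refreshTokenRanges.run();
            } catch (RuntimeException e) {
                // info rather than error: the next scheduled attempt will retry,
                // so human action is usually not required
                System.out.println("INFO: Couldn't grab new token ranges for token"
                        + " aware cassandra mapping. We will retry in "
                        + REFRESH_INTERVAL_SECONDS + " seconds.");
            }
        }, 0, REFRESH_INTERVAL_SECONDS, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a failed refresh is self-healing and never needs to page anyone, which is the argument for `info` above.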
It's retried every 2 minutes, so I'm bumping it to info and adding that detail to the message.
```diff
@@ -247,7 +247,7 @@ private Runnable createFlushTask() {
             }
         } catch (Throwable t) {
             if (!Thread.interrupted()) {
-                log.error("Error occurred while flushing sweep stats: {}", t, t);
+                log.warn("Error occurred while flushing sweep stats: {}", t, t);
```
💯 this is bad but not service-haltingly-bad in most cases.
```diff
@@ -462,7 +462,7 @@ private static void checkInstallConfig(AtlasDbConfig config) {
                 || config.getSweepReadLimit() != null
                 || config.getSweepCandidateBatchHint() != null
                 || config.getSweepDeleteBatchHint() != null) {
-            log.error("Your configuration specifies sweep parameters on the install config. They will be ignored."
+            log.warn("Your configuration specifies sweep parameters on the install config. They will be ignored."
```
This message (even at ERROR) didn't save one of our internal deployments from believing their install config worked. 😞
For the most part, I think of logs as something to guide diagnosis or investigation, not so much as something to be looked at in and of themselves...
Separate task: this should possibly be a healthcheck warning rather than a warning hidden in the logs. It's not severe enough to force a user to look at it (e.g. by refusing to start), but is still something that should be fixed, because ignoring it could lead people to believe their config is different to what it actually is. This would cause:
- Best case: less performant than intended sweep
- Average case: lost time in investigating an issue because it's not obvious enough what the problem is
- Worst case: full stack failure (in case the configuration options are set in order to avoid a Cassandra OOM or similar)
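A sketch of how such a healthcheck could surface the misconfiguration. All names here are hypothetical; the real version would hook into the service's healthcheck framework and read the actual `AtlasDbConfig` getters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect the sweep parameters that were set on the
// install config so a healthcheck can report them, instead of burying a
// warning in the logs. Parameter names are illustrative.
class SweepInstallConfigCheck {
    static List<String> misplacedSweepParams(Integer batchSize, Integer readLimit,
            Integer candidateBatchHint, Integer deleteBatchHint) {
        List<String> misplaced = new ArrayList<>();
        if (batchSize != null) misplaced.add("sweepBatchSize");
        if (readLimit != null) misplaced.add("sweepReadLimit");
        if (candidateBatchHint != null) misplaced.add("sweepCandidateBatchHint");
        if (deleteBatchHint != null) misplaced.add("sweepDeleteBatchHint");
        return misplaced;
    }

    static String check(Integer batchSize, Integer readLimit,
            Integer candidateBatchHint, Integer deleteBatchHint) {
        List<String> misplaced = misplacedSweepParams(
                batchSize, readLimit, candidateBatchHint, deleteBatchHint);
        if (misplaced.isEmpty()) {
            return "HEALTHY";
        }
        // warning-but-not-fatal: the service still starts, but the problem is
        // visible on the healthcheck page rather than only in the logs
        return "WARNING: sweep parameters " + misplaced
                + " are set on the install config and will be ignored;"
                + " move them to the runtime config";
    }
}
```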
```diff
@@ -75,7 +75,7 @@ public void clockSkew(String server, long skew, long minRequestInterval, long du
     }

     public void clockWentBackwards(String server, long amount) {
-        log.error("The clock for server {} went backwards by {} nanoseconds",
+        log.debug("The clock for server {} went backwards by {} nanoseconds",
```
For these 3 messages, I don't think timelock where this is used logs at DEBUG at all. If we still wanted to monitor incidences of clock skew (e.g. as a precursor to our very contentious #1412) it's probably better to keep these at INFO, or change the default log level in timelock to be DEBUG for this logger specifically.
+1 to "change the default log level in timelock to be DEBUG for this logger specifically."
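For reference, a per-logger override in a logback config might look like this. This is a sketch only: the logger name below is a placeholder, not the clock-skew logger's actual fully qualified name.

```xml
<configuration>
  <!-- timelock's default level stays as-is for everything else -->
  <root level="WARN">
    <appender-ref ref="FILE"/>
  </root>
  <!-- keep clock-skew events visible without raising the global level;
       replace the name below with the real clock-skew logger -->
  <logger name="com.palantir.atlasdb.ClockSkewEvents" level="DEBUG"/>
</configuration>
```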
```diff
@@ -70,7 +70,7 @@ private boolean isHostHealthy(CassandraClientPoolingContainer container) {
             container.runWithPooledResource(CassandraUtils.getValidatePartitioner(config));
             return true;
         } catch (Exception e) {
-            log.warn("We tried to add blacklisted host '{}' back into the pool, but got an exception"
+            log.debug("We tried to add blacklisted host '{}' back into the pool, but got an exception"
```
Very often this is `TimedOutException` or various Thrift-based things, though I don't know if this one is still useful for debugging?
I would put this to info, as otherwise the information will just be lost at a point when we're trying to diagnose any blacklisting issues.
However, I would like the message to be improved - what does it mean that we're distrusting it further? When will we next attempt to contact this host? This could potentially be done as a separate, small(?) PR.
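As a sketch of the kind of message that would answer both questions. The method name, wording, and backoff parameter here are hypothetical; the real implementation would use the safe-logging `Arg`s and the pool's actual backoff configuration.

```java
// Hypothetical sketch of a more informative blacklist log message.
class BlacklistMessages {
    static String hostStillUnhealthy(String host, long nextCheckSeconds) {
        return String.format(
                "Host '%s' is still unhealthy and will remain blacklisted;"
                        + " we will re-check its health in %d seconds.",
                host, nextCheckSeconds);
    }
}
```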
```diff
@@ -87,7 +87,7 @@ public void decreaseMultiplier() {
         // Cut batch size in half, always sweep at least one row.
         reduceBatchSizeMultiplier();

-        log.warn("Sweep failed unexpectedly with candidate batch size {},"
+        log.info("Sweep failed unexpectedly with candidate batch size {},"
```
Probably outside the scope of this PR, but there is a case for a higher log level (WARN, I think) if all params are 1.
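That escalation could be as simple as checking whether the batch parameters have bottomed out. This is a sketch with illustrative parameter names, not the actual sweep code:

```java
// Hypothetical sketch: pick a log level based on whether the batch parameters
// have already been reduced to their floor of 1, in which case sweep keeps
// failing even at minimum batch sizes and deserves a louder message.
class SweepFailureLogLevel {
    static String levelFor(int candidateBatchSize, int deleteBatchSize, int maxCellTsPairsToExamine) {
        boolean bottomedOut = candidateBatchSize == 1
                && deleteBatchSize == 1
                && maxCellTsPairsToExamine == 1;
        return bottomedOut ? "WARN" : "INFO";
    }
}
```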
There are two cases where the level was switched to `debug` and I'm arguing for `info`.
Otherwise, I have some suggestions for improving our logging story, some of which are fairly trivial, and some of which may require a "tiny task" worth of work. Happy for these to be pushed to a separate PR/github issue, especially those comments where I've mentioned that kind of thing.
```diff
                     SafeArg.of("numTries", numberOfAttempts),
                     SafeArg.of("maxTotalTries", maxTriesTotal),
                     SafeArg.of("exceptionClass", ex.getClass().getTypeName()),
                     UnsafeArg.of("exceptionMessage", ex.getMessage()));
         } else {
-            log.info("Error occurred talking to cassandra. Attempt {} of {}.",
+            log.debug("Error occurred talking to cassandra. Attempt {} of {}.",
```
This message has been useful for detecting ongoing issues, e.g. when sweep is misbehaving.
Retracting this comment: we'll log when we error at the last attempt.
Do we log more visibly on the last attempt if it fails?
Yes, it's a `WARN` if the last attempt fails (see `handleExceptionFromRequest` and `logAndThrowException`).
```diff
                     tableRefs.stream().map(TableReference::getQualifiedName).collect(Collectors.joining(", ")));
         }

     private void logTraceResults(long duration, Set<TableReference> tableRefs, ByteBuffer recvTrace, boolean failed) {
         if (failed || duration > tracingPrefs.getMinimumDurationToTraceMillis()) {
-            log.error("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
+            log.warn("Traced a call to {} that {}took {} ms. It will appear in system_traces with UUID={}",
```
This looks like it could even be an `info` message.
```diff
@@ -182,13 +182,13 @@ InetSocketAddress getAddressForHost(String host) throws UnknownHostException {
                 .filter(predicate)
                 .collect(Collectors.toSet());
         if (filteredHosts.isEmpty()) {
-            log.error("No hosts match the provided predicate.");
+            log.info("No hosts match the provided predicate.");
```
Suggestion for a quick improvement, while we're here: should also log the predicate, if plausible.
I've applied my own comments, but @jeremyk-91 should check my commits.
```diff
@@ -131,7 +131,9 @@ public TokenRangeWritesLogger getTokenRangeWritesLogger() {
         tokenRangeWritesLogger.updateTokenRanges(tokenMap.asMapOfRanges().keySet());
         return servers;
     } catch (Exception e) {
-        log.error("Couldn't grab new token ranges for token aware cassandra mapping!", e);
+        log.info("Couldn't grab new token ranges for token aware cassandra mapping. We will retry in {} seconds.",
```
This one's unclear to me: whether it should be `warn` or `info`.
The exception is thrown when:
- The C* call to get the token ring fails
- Something is borked and we invent a host that is not in the config (shouldn't happen)
Since we retry every 120 seconds by default, it seems very unlikely to me that any engineer will need to take action as a result of the log message, and therefore `INFO` seems appropriate to me.
Looks good
Goals (and why): Reduce the number of messages we log at ERROR and WARN levels. These have caused unnecessary alarm for users and often obscure their own error messages.
Implementation Description (bullets): I've reduced the logging level for a number of log lines, in places by quite a lot. I expect discussion and push-back on some of the changes, but I want to have the conversations rather than sitting on this.
Concerns (what feedback would you like?): Where have I reduced the logging level too much?
Where should we start reviewing?: Each change is distinct so anywhere you want.
Priority (whenever / two weeks / yesterday): would be good to merge this next week