Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/suppress clientpool stacktrace spam #2373

Closed
wants to merge 9 commits into from

Conversation

gsheasby
Copy link
Contributor

@gsheasby gsheasby commented Sep 20, 2017

Goals (and why): Fixes #2280.

Implementation Description (bullets):

  • For retryable exceptions, logged exception class name and message instead of stack trace
  • (the stack trace is still logged if we run out of retries)
  • Added the query's toString to one of the log messages. The query string typically contains the type of Cassandra query, as well as the column family
  • Added SafeArg calls for arguments that are considered safe.

Concerns (what feedback would you like?):

  • We lose any information that would have been in the stack trace, for cases where queries are often timing out, but never running out of retries. (a) How often is this the case? (b) Will it be sufficient to add the query toString information, and examine the product's slow logs, in order to determine which query is putting a strain on Cassandra?
  • Should we still log this at warn?
  • Are all the safe args truly safe?

Where should we start reviewing?: CassandraClientPool

Priority (whenever / two weeks / yesterday): whenever

Logs before:

15:46:24.032 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 1 of 6.
java.net.SocketTimeoutException: null
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithPooledResourceRecordingMetrics(CassandraClientPool.java:617) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithRetryOnHost(CassandraClientPool.java:579) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.runNoopOnHostWithRetryWithException(CassandraClientPoolTest.java:283) [atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.shouldRetryOnSameNodeToFailureAndThenRedirect(CassandraClientPoolTest.java:151) [atlasdb-cassandra/:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_45]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_45]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_45]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137) [junit-4.12.jar:4.12]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) [junit-rt.jar:na]
15:46:24.095 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query intended for host 0.0.0.0/0.0.0.0:0.
java.net.SocketTimeoutException: null
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithPooledResourceRecordingMetrics(CassandraClientPool.java:617) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithRetryOnHost(CassandraClientPool.java:579) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.runNoopOnHostWithRetryWithException(CassandraClientPoolTest.java:283) [atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.shouldRetryOnSameNodeToFailureAndThenRedirect(CassandraClientPoolTest.java:151) [atlasdb-cassandra/:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_45]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_45]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_45]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137) [junit-4.12.jar:4.12]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) [junit-rt.jar:na]
15:46:25.318 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 2 of 6.
java.net.SocketTimeoutException: null
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithPooledResourceRecordingMetrics(CassandraClientPool.java:617) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithRetryOnHost(CassandraClientPool.java:579) ~[atlasdb-cassandra/:na]
(260 more lines)

Logs after:

15:44:40.029 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 1 of 6. Exception message was: java.net.SocketTimeoutException : null
15:44:40.032 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query, noop, intended for host 0.0.0.0/0.0.0.0:0. Exception message was: java.net.SocketTimeoutException : null
15:44:41.406 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 2 of 6. Exception message was: java.net.SocketTimeoutException : null
15:44:41.406 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query, noop, intended for host 0.0.0.0/0.0.0.0:0. Exception message was: java.net.SocketTimeoutException : null
15:44:43.413 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 3 of 6. Exception message was: java.net.SocketTimeoutException : null
15:44:43.413 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query, noop, intended for host 0.0.0.0/0.0.0.0:0. Exception message was: java.net.SocketTimeoutException : null
15:44:46.727 [main] WARN  c.p.a.k.c.CassandraClientPool - Randomly redirected a query intended for host 0.0.0.0/0.0.0.0:0.
15:44:46.745 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 4 of 6. Exception message was: java.net.SocketTimeoutException : null
15:44:46.745 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query, noop, intended for host 0.0.0.0/0.0.0.0:1. Exception message was: java.net.SocketTimeoutException : null
15:44:50.834 [main] WARN  c.p.a.k.c.CassandraClientPool - Randomly redirected a query intended for host 0.0.0.0/0.0.0.0:0.
15:44:50.835 [main] WARN  c.p.a.k.c.CassandraClientPool - Error occurred talking to cassandra. Attempt 5 of 6. Exception message was: java.net.SocketTimeoutException : null
15:44:50.835 [main] WARN  c.p.a.k.c.CassandraClientPool - Retrying with backoff a query, noop, intended for host 0.0.0.0/0.0.0.0:3. Exception message was: java.net.SocketTimeoutException : null
15:44:55.775 [main] WARN  c.p.a.k.c.CassandraClientPool - Randomly redirected a query intended for host 0.0.0.0/0.0.0.0:0.
15:44:55.780 [main] ERROR c.p.a.k.c.CassandraClientPool - Tried to connect to cassandra 6 times.
java.net.SocketTimeoutException: null
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithPooledResourceRecordingMetrics(CassandraClientPool.java:626) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPool.runWithRetryOnHost(CassandraClientPool.java:583) ~[atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.runNoopOnHostWithRetryWithException(CassandraClientPoolTest.java:283) [atlasdb-cassandra/:na]
	at com.palantir.atlasdb.keyvalue.cassandra.CassandraClientPoolTest.shouldRetryOnSameNodeToFailureAndThenRedirect(CassandraClientPoolTest.java:151) [atlasdb-cassandra/:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_45]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_45]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_45]
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137) [junit-4.12.jar:4.12]
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237) [junit-rt.jar:na]
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) [junit-rt.jar:na]

This change is Reviewable

These stack traces spam the logs (and were logged twice per attempt!).
Additionally, they provide minimal information.
This is an attempt to restore the pertinent information lost by the
stack trace. If a deployment is seeing many timeouts that are
successfully retried, they may be able to determine the culprit by
examining the query strings, and concurrently checking the product's
slow-logs (timed out queries will very likely be logged there).
The toString method is called in CassandraClientPool, if the function
is retried.
Exception messages are considered unsafe.
The query string, sadly, isn't safe yet - it mentions a column family,
and we can't guarantee safety of the column family name.
@gsheasby
Copy link
Contributor Author

Note that a refactor in this PR probably clashes with #2363.

@@ -329,15 +332,15 @@ private void checkAndUpdateBlacklist() {
if (isHostHealthy(host)) {
blacklistedHosts.remove(host);
log.error("Added host {} back into the pool after a waiting period and successful health check.",
host);
SafeArg.of("host", host));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change this to info?

}
}
}
}

private void addToBlacklist(InetSocketAddress badHost) {
blacklistedHosts.put(badHost, System.currentTimeMillis());
log.info("Blacklisted host '{}'", badHost);
log.info("Blacklisted host '{}'", SafeArg.of("badHost", badHost));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change this to warn?

@@ -348,7 +351,7 @@ private boolean isHostHealthy(InetSocketAddress host) {
return true;
} catch (Exception e) {
log.error("We tried to add {} back into the pool, but got an exception"
+ " that caused us to distrust this host further.", host, e);
+ " that caused us to distrust this host further.", SafeArg.of("host", host), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warn maybe? and not sure if we want the exception. maybe just the message?

@@ -582,7 +586,12 @@ private void refreshTokenRanges() {
triedHosts.add(hostPool.getHost());
this.<K>handleException(numTries, hostPool.getHost(), e);
if (isRetriableWithBackoffException(e)) {
log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e);
log.warn("Retrying with backoff a query, {}, intended for host {}. Exception message was: {} : {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log the sleep duration maybe? and maybe remove the exception?

throw (K) ex;
}
} else {
log.warn("Error occurred talking to cassandra. Attempt {} of {}.", numTries, MAX_TRIES_TOTAL, ex);
log.warn("Error occurred talking to cassandra. Attempt {} of {}. Exception message was: {} : {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log the exception the first time only

@codecov-io
Copy link

codecov-io commented Sep 22, 2017

Codecov Report

Merging #2373 into develop will decrease coverage by <.01%.
The diff coverage is 61.11%.

Impacted file tree graph

@@              Coverage Diff              @@
##             develop    #2373      +/-   ##
=============================================
- Coverage      59.84%   59.83%   -0.01%     
- Complexity      3328     4617    +1289     
=============================================
  Files            823      823              
  Lines          38558    38580      +22     
  Branches        4004     4005       +1     
=============================================
+ Hits           23075    23085      +10     
- Misses         14035    14045      +10     
- Partials        1448     1450       +2
Impacted Files Coverage Δ Complexity Δ
...lantir/atlasdb/keyvalue/cassandra/CqlExecutor.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...tlasdb/keyvalue/cassandra/CassandraClientPool.java 42.67% <73.33%> (+1.68%) 0 <0> (ø) ⬇️
...ain/java/com/palantir/leader/LeadershipEvents.java 83.33% <0%> (-8.34%) 0% <0%> (ø)
...sdb/keyvalue/impl/AbstractTableMappingService.java 69.81% <0%> (-3.78%) 14% <0%> (+14%)
.../palantir/leader/PaxosLeadershipEventRecorder.java 91.89% <0%> (-2.71%) 0% <0%> (ø)
...main/java/com/palantir/paxos/PaxosLearnerImpl.java 87.8% <0%> (-2.44%) 0% <0%> (ø)
...om/palantir/leader/PaxosLeaderElectionService.java 78.46% <0%> (-1.93%) 0% <0%> (ø)
...n/java/com/palantir/lock/impl/LockServiceImpl.java 83.3% <0%> (+0.35%) 92% <0%> (+1%) ⬆️
.../java/com/palantir/util/debug/StackTraceUtils.java 52.46% <0%> (+1.34%) 22% <0%> (ø) ⬇️
...ain/java/com/palantir/paxos/PaxosAcceptorImpl.java 76.78% <0%> (+3.57%) 0% <0%> (ø) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 063df8e...135c219. Read the comment docs.

int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500);
log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.",
SafeArg.of("sleepDuration", sleepDuration),
fn.toString(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should we wrapped in an Arg

throw (K) ex;
}
} else {
log.warn("Error occurred talking to cassandra. Attempt {} of {}.", numTries, MAX_TRIES_TOTAL, ex);
// Only log the actual exception the first time
if (numTries > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numTries == 1 as per the comment.

gsheasby added a commit that referenced this pull request Oct 5, 2017
* Copy-paste those changes
* Fix logic: numTries == 1 instead of numTries > 1
* Add SafeArg and remove exception from the fast failover message
@gsheasby
Copy link
Contributor Author

gsheasby commented Oct 5, 2017

Superseded by #2432.

@gsheasby gsheasby closed this Oct 5, 2017
gsheasby added a commit that referenced this pull request Oct 6, 2017
* Reapply changes from #2373

* Copy-paste those changes
* Fix logic: numTries == 1 instead of numTries > 1
* Add SafeArg and remove exception from the fast failover message

* newHost -> randomHost

* Fix miscellaneous nits

* Fix or suppress warnings
@gsheasby gsheasby deleted the fix/suppress-clientpool-stacktrace-spam branch October 18, 2017 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants