Fix/suppress ClientPoolImpl stacktrace spam #2432

gsheasby · 2017-10-05T12:34:11Z

Reapplied changes from #2373 after the Startup Ordering PR merged

Copy-paste those changes
Fix logic: numTries == 1 instead of numTries > 1
Add SafeArg and remove exception from the fast failover message
Added a release note

Goals (and why): Fixes #2280.

Implementation Description (bullets):

For retryable exceptions, logged exception class name and message instead of stack trace
(the stack trace is still logged if we run out of retries)
Added the query's toString to one of the log messages. The query string typically contains the type of Cassandra query, as well as the column family
Added SafeArg calls for arguments that are considered safe.

Concerns (what feedback would you like?):

Is there anything extra that I missed?

Where should we start reviewing?: CassandraClientPoolImpl

Priority (whenever / two weeks / yesterday): today - higher priority since one of our internal product gets indigestion when we feed them this spam. @ChrisAlice for SA.

This change is

* Copy-paste those changes * Fix logic: numTries == 1 instead of numTries > 1 * Add SafeArg and remove exception from the fast failover message

hsaraogi · 2017-10-05T12:41:39Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

-                        previousHostPool,
-                        hostPool.getHost());
+                        SafeArg.of("previousHost", previousHostPool),
+                        SafeArg.of("newHost", hostPool.getHost()));


Maybe: newHost -> randomHost

codecov-io · 2017-10-05T13:57:21Z

Codecov Report

Merging #2432 into develop will increase coverage by 0.02%.
The diff coverage is 57.89%.

@@              Coverage Diff              @@
##             develop    #2432      +/-   ##
=============================================
+ Coverage      60.33%   60.36%   +0.02%     
+ Complexity      4683     4447     -236     
=============================================
  Files            833      833              
  Lines          39391    39414      +23     
  Branches        4036     4037       +1     
=============================================
+ Hits           23768    23791      +23     
+ Misses         14135    14133       -2     
- Partials        1488     1490       +2

Impacted Files	Coverage Δ	Complexity Δ
...lantir/atlasdb/keyvalue/cassandra/CqlExecutor.java	`0% <0%> (ø)`	`0 <0> (ø)`	⬇️
...db/keyvalue/cassandra/CassandraClientPoolImpl.java	`45.37% <68.75%> (+1.29%)`	`0 <0> (ø)`	⬇️
...ain/java/com/palantir/paxos/PaxosStateLogImpl.java	`85.59% <0%> (-1.7%)`	`0% <0%> (ø)`
...alantir/lock/client/LockRefreshingLockService.java	`58.82% <0%> (-1.48%)`	`10% <0%> (-1%)`
...tir/timestamp/RequestBatchingTimestampService.java	`78.21% <0%> (-1%)`	`0% <0%> (ø)`
.../com/palantir/atlasdb/table/common/TableTasks.java	`57.69% <0%> (ø)`	`20% <0%> (ø)`	⬇️
...ain/java/com/palantir/paxos/PaxosAcceptorImpl.java	`73.21% <0%> (ø)`	`0% <0%> (ø)`	⬇️
...ain/java/com/palantir/paxos/PaxosProposerImpl.java	`88.33% <0%> (ø)`	`0% <0%> (ø)`	⬇️
...n/java/com/palantir/lock/impl/LockServiceImpl.java	`83.3% <0%> (+0.7%)`	`92% <0%> (+1%)`	⬆️
...rc/main/java/com/palantir/lock/LockCollection.java	`71.42% <0%> (+4.76%)`	`12% <0%> (+1%)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f66333...a15cebf. Read the comment docs.

tpetracca · 2017-10-05T16:57:37Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

-                    log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e);
+                    // And value between -500 and +500ms to backoff to better spread load on failover
+                    int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500);
+                    log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.",


nit: Retrying a query, {}, with backoff of {}ms,

yeah, there's a stray } in there after ms

schlosna · 2017-10-05T19:37:51Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

-                    log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e);
+                    // And value between -500 and +500ms to backoff to better spread load on failover
+                    int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500);
+                    log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.",


yeah, there's a stray } in there after ms

schlosna · 2017-10-05T19:38:51Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

+            log.warn("We tried to add {} back into the pool, but got an exception"
+                            + " that caused us to distrust this host further. Exception message was: {} : {}",
+                    SafeArg.of("host", host),
+                    SafeArg.of("exceptionClass", e.getClass().getTypeName()),


nit: getTypeName is a new one, should be fine though e.getClass().getName() or e.getClass().getCanonicalName() might be simpler

schlosna · 2017-10-05T19:40:20Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

+                    log.warn("Error occurred talking to cassandra. Attempt {} of {}.",
+                            SafeArg.of("numTries", numTries),
+                            SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL),
+                            ex);


shouldn't the exception be the last arg in the if block, not else? I think we could actually do the following to cover all cases so that we include the exception class & message all the time, but only include stacktrace on first occurrence.

log.warn("Error occurred talking to cassandra. Attempt {} of {}. Exception message was: {} : {}", SafeArg.of("numTries", numTries), SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL), SafeArg.of("exceptionClass", ex.getClass().getName()), UnsafeArg.of("exceptionMessage", ex.getMessage()), (numTries == 1) ? ex : null);

....d'oh, I was right the first time!

@gsheasby isn’t your logic here backwards if you’re trying to only log the exception on first failed attempt? I am pretty sure the whole of/else should be replaced with my comment above

Looks correct on develop. We don't need to log exceptionClass or message on the first try, since we log the whole exception.

schlosna · 2017-10-05T19:47:09Z

atlasdb-cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CqlExecutor.java

        } catch (TException e) {
            throw Throwables.throwUncheckedException(e);
        }
    }

+    private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
+        ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));


do we want to eagerly capture and retain both the query string & converted bytes & byte buffer or should this be inlined into apply below as:

public CqlResult apply(Cassandra.Client client) throws TException { ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8)); return client.execute_cql3_query(queryBytes, Compression.NONE, consistency); }

We need to override toString here to make the log message helpful - see CassandraClientPoolImpl.runWithRetryOnHost.

You can still override toStringas you have it, this just avoids keeping the byte buffer around after executing the query. If we are going to re-execute the query many times, then your original version of eagerly converting to bytes might be better tradeoff

I would expect the query to execute one time in 99.9% of cases.
However, I'm a little confused about the comment about "keeping the byte buffer around" - doesn't it go away when we exit getCqlFunction?

Relatedly - I've been biasing towards keeping the stuff inside clientPool.runWithRetryOnHost as small as possible, ideally just client.do_the_thing(the_arguments).

That way, it's more transparent which code might throw the exception we're passing, and it generally produces cleaner and more testable code - if we do that for long enough, eventually all of the clientPool-calling logic will live inside a small class which we can just mock for testing (and replace with CQL for production).

Does that practice lead to sub-optimal perf somehow?

I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes and String query references are captured as they are referenced in the apply and toString methods respectively. This means that both will live beyond the execution of the getCqlFunction method and until the executeCqlFunctionOnHost exits.

private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) { return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() { @Override public CqlResult apply(Cassandra.Client client) throws TException { ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8)); return client.execute_cql3_query(queryBytes, Compression.NONE, consistency); } @Override public String toString() { return query; } }; }

schlosna · 2017-10-05T19:52:01Z

...cassandra/src/test/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolTest.java

+        return new FunctionCheckedException<Cassandra.Client, Object, RuntimeException>() {
+            @Override
+            public Object apply(Cassandra.Client input) throws RuntimeException {
+                return null;


nit: this should probably return Void:

private FunctionCheckedException<Cassandra.Client, Void, RuntimeException> noOp() { return new FunctionCheckedException<Cassandra.Client, Void, RuntimeException>() { @Override public Void apply(Cassandra.Client input) throws RuntimeException { return null; } @Override public String toString() { return "no-op"; } }; }

hsaraogi

LGTM, should be good to merge if @schlosna gives a green signal.

schlosna

mostly just informative feedback for the future

schlosna · 2017-10-06T16:39:21Z

...cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CassandraClientPoolImpl.java

                    try {
-                        // And value between -500 and +500ms to backoff to better spread load on failover
-                        Thread.sleep(numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500));
+                        Thread.sleep(sleepDuration);
                    } catch (InterruptedException i) {
                        throw new RuntimeException(i);


Do we need to reset interrupt here? Could fix in separate PR

schlosna · 2017-10-06T17:20:25Z

atlasdb-cassandra/src/main/java/com/palantir/atlasdb/keyvalue/cassandra/CqlExecutor.java

        } catch (TException e) {
            throw Throwables.throwUncheckedException(e);
        }
    }

+    private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
+        ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));


I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes and String query references are captured as they are referenced in the apply and toString methods respectively. This means that both will live beyond the execution of the getCqlFunction method and until the executeCqlFunctionOnHost exits.

private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) { return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() { @Override public CqlResult apply(Cassandra.Client client) throws TException { ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8)); return client.execute_cql3_query(queryBytes, Compression.NONE, consistency); } @Override public String toString() { return query; } }; }

Reapply changes from #2373

8ef3392

* Copy-paste those changes * Fix logic: numTries == 1 instead of numTries > 1 * Add SafeArg and remove exception from the fast failover message

gsheasby assigned hsaraogi Oct 5, 2017

gsheasby requested review from fsamuel-bs and hsaraogi October 5, 2017 12:34

gsheasby mentioned this pull request Oct 5, 2017

Fix/suppress clientpool stacktrace spam #2373

Closed

hsaraogi reviewed Oct 5, 2017

View reviewed changes

newHost -> randomHost

f52c6a2

tpetracca approved these changes Oct 5, 2017

View reviewed changes

schlosna reviewed Oct 5, 2017

View reviewed changes

gsheasby added 2 commits October 6, 2017 11:25

Fix miscellaneous nits

88f85fb

Fix or suppress warnings

a15cebf

hsaraogi approved these changes Oct 6, 2017

View reviewed changes

schlosna approved these changes Oct 6, 2017

View reviewed changes

gsheasby merged commit 5f828fa into develop Oct 6, 2017

gsheasby deleted the fix/suppress-clientpoolimpl-stacktrace-spam branch October 6, 2017 20:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/suppress ClientPoolImpl stacktrace spam #2432

Fix/suppress ClientPoolImpl stacktrace spam #2432

gsheasby commented Oct 5, 2017 •

edited by jboreiko

hsaraogi Oct 5, 2017

codecov-io commented Oct 5, 2017 •

edited

tpetracca Oct 5, 2017

schlosna Oct 5, 2017

schlosna Oct 5, 2017

schlosna Oct 5, 2017

schlosna Oct 5, 2017

gsheasby Oct 6, 2017

schlosna Oct 18, 2017

gsheasby Oct 18, 2017

schlosna Oct 5, 2017

gsheasby Oct 6, 2017

schlosna Oct 6, 2017

gsheasby Oct 6, 2017

schlosna Oct 6, 2017

schlosna Oct 5, 2017

hsaraogi left a comment

schlosna left a comment

schlosna Oct 6, 2017

schlosna Oct 6, 2017

Fix/suppress ClientPoolImpl stacktrace spam #2432

Fix/suppress ClientPoolImpl stacktrace spam #2432

Conversation

gsheasby commented Oct 5, 2017 • edited by jboreiko

Choose a reason for hiding this comment

codecov-io commented Oct 5, 2017 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hsaraogi left a comment

Choose a reason for hiding this comment

schlosna left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gsheasby commented Oct 5, 2017 •

edited by jboreiko

codecov-io commented Oct 5, 2017 •

edited