Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/suppress ClientPoolImpl stacktrace spam #2432

Merged
merged 4 commits into from Oct 6, 2017

Conversation

gsheasby
Copy link
Contributor

@gsheasby gsheasby commented Oct 5, 2017

Reapplied changes from #2373 after the Startup Ordering PR merged

  • Copy-paste those changes
  • Fix logic: numTries == 1 instead of numTries > 1
  • Add SafeArg and remove exception from the fast failover message
  • Added a release note

Goals (and why): Fixes #2280.

Implementation Description (bullets):

  • For retryable exceptions, logged exception class name and message instead of stack trace
    (the stack trace is still logged if we run out of retries)
  • Added the query's toString to one of the log messages. The query string typically contains the type of Cassandra query, as well as the column family
  • Added SafeArg calls for arguments that are considered safe.

Concerns (what feedback would you like?):

  • Is there anything extra that I missed?

Where should we start reviewing?: CassandraClientPoolImpl

Priority (whenever / two weeks / yesterday): today - higher priority since one of our internal product gets indigestion when we feed them this spam. @ChrisAlice for SA.


This change is Reviewable

* Copy-paste those changes
* Fix logic: numTries == 1 instead of numTries > 1
* Add SafeArg and remove exception from the fast failover message
previousHostPool,
hostPool.getHost());
SafeArg.of("previousHost", previousHostPool),
SafeArg.of("newHost", hostPool.getHost()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe: newHost -> randomHost

@codecov-io
Copy link

codecov-io commented Oct 5, 2017

Codecov Report

Merging #2432 into develop will increase coverage by 0.02%.
The diff coverage is 57.89%.

Impacted file tree graph

@@              Coverage Diff              @@
##             develop    #2432      +/-   ##
=============================================
+ Coverage      60.33%   60.36%   +0.02%     
+ Complexity      4683     4447     -236     
=============================================
  Files            833      833              
  Lines          39391    39414      +23     
  Branches        4036     4037       +1     
=============================================
+ Hits           23768    23791      +23     
+ Misses         14135    14133       -2     
- Partials        1488     1490       +2
Impacted Files Coverage Δ Complexity Δ
...lantir/atlasdb/keyvalue/cassandra/CqlExecutor.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...db/keyvalue/cassandra/CassandraClientPoolImpl.java 45.37% <68.75%> (+1.29%) 0 <0> (ø) ⬇️
...ain/java/com/palantir/paxos/PaxosStateLogImpl.java 85.59% <0%> (-1.7%) 0% <0%> (ø)
...alantir/lock/client/LockRefreshingLockService.java 58.82% <0%> (-1.48%) 10% <0%> (-1%)
...tir/timestamp/RequestBatchingTimestampService.java 78.21% <0%> (-1%) 0% <0%> (ø)
.../com/palantir/atlasdb/table/common/TableTasks.java 57.69% <0%> (ø) 20% <0%> (ø) ⬇️
...ain/java/com/palantir/paxos/PaxosAcceptorImpl.java 73.21% <0%> (ø) 0% <0%> (ø) ⬇️
...ain/java/com/palantir/paxos/PaxosProposerImpl.java 88.33% <0%> (ø) 0% <0%> (ø) ⬇️
...n/java/com/palantir/lock/impl/LockServiceImpl.java 83.3% <0%> (+0.7%) 92% <0%> (+1%) ⬆️
...rc/main/java/com/palantir/lock/LockCollection.java 71.42% <0%> (+4.76%) 12% <0%> (+1%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f66333...a15cebf. Read the comment docs.

log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e);
// And value between -500 and +500ms to backoff to better spread load on failover
int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500);
log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Retrying a query, {}, with backoff of {}ms,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, there's a stray } in there after ms

log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e);
// And value between -500 and +500ms to backoff to better spread load on failover
int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500);
log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, there's a stray } in there after ms

log.warn("We tried to add {} back into the pool, but got an exception"
+ " that caused us to distrust this host further. Exception message was: {} : {}",
SafeArg.of("host", host),
SafeArg.of("exceptionClass", e.getClass().getTypeName()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getTypeName is a new one, should be fine though e.getClass().getName() or e.getClass().getCanonicalName() might be simpler

log.warn("Error occurred talking to cassandra. Attempt {} of {}.",
SafeArg.of("numTries", numTries),
SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL),
ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the exception be the last arg in the if block, not else? I think we could actually do the following to cover all cases so that we include the exception class & message all the time, but only include stacktrace on first occurrence.

            log.warn("Error occurred talking to cassandra. Attempt {} of {}. Exception message was: {} : {}",
                    SafeArg.of("numTries", numTries),
                    SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL),
                    SafeArg.of("exceptionClass", ex.getClass().getName()),
                    UnsafeArg.of("exceptionMessage", ex.getMessage()),
                    (numTries == 1) ? ex : null);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

....d'oh, I was right the first time!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gsheasby isn’t your logic here backwards if you’re trying to only log the exception on first failed attempt? I am pretty sure the whole of/else should be replaced with my comment above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct on develop. We don't need to log exceptionClass or message on the first try, since we log the whole exception.

} catch (TException e) {
throw Throwables.throwUncheckedException(e);
}
}

private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to eagerly capture and retain both the query string & converted bytes & byte buffer or should this be inlined into apply below as:

            public CqlResult apply(Cassandra.Client client) throws TException {
                ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
                return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
            }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to override toString here to make the log message helpful - see CassandraClientPoolImpl.runWithRetryOnHost.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still override toStringas you have it, this just avoids keeping the byte buffer around after executing the query. If we are going to re-execute the query many times, then your original version of eagerly converting to bytes might be better tradeoff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the query to execute one time in 99.9% of cases.
However, I'm a little confused about the comment about "keeping the byte buffer around" - doesn't it go away when we exit getCqlFunction?

Relatedly - I've been biasing towards keeping the stuff inside clientPool.runWithRetryOnHost as small as possible, ideally just client.do_the_thing(the_arguments).

That way, it's more transparent which code might throw the exception we're passing, and it generally produces cleaner and more testable code - if we do that for long enough, eventually all of the clientPool-calling logic will live inside a small class which we can just mock for testing (and replace with CQL for production).

Does that practice lead to sub-optimal perf somehow?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes and String query references are captured as they are referenced in the apply and toString methods respectively. This means that both will live beyond the execution of the getCqlFunction method and until the executeCqlFunctionOnHost exits.

    private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
        return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() {
            @Override
            public CqlResult apply(Cassandra.Client client) throws TException {
                ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
                return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
            }

            @Override
            public String toString() {
                return query;
            }
        };
    }

return new FunctionCheckedException<Cassandra.Client, Object, RuntimeException>() {
@Override
public Object apply(Cassandra.Client input) throws RuntimeException {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should probably return Void:

    private FunctionCheckedException<Cassandra.Client, Void, RuntimeException> noOp() {
        return new FunctionCheckedException<Cassandra.Client, Void, RuntimeException>() {
            @Override
            public Void apply(Cassandra.Client input) throws RuntimeException {
                return null;
            }

            @Override
            public String toString() {
                return "no-op";
            }
        };
    }

Copy link
Contributor

@hsaraogi hsaraogi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, should be good to merge if @schlosna gives a green signal.

Copy link
Contributor

@schlosna schlosna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly just informative feedback for the future

try {
// And value between -500 and +500ms to backoff to better spread load on failover
Thread.sleep(numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500));
Thread.sleep(sleepDuration);
} catch (InterruptedException i) {
throw new RuntimeException(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to reset interrupt here? Could fix in separate PR

} catch (TException e) {
throw Throwables.throwUncheckedException(e);
}
}

private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes and String query references are captured as they are referenced in the apply and toString methods respectively. This means that both will live beyond the execution of the getCqlFunction method and until the executeCqlFunctionOnHost exits.

    private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
        return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() {
            @Override
            public CqlResult apply(Cassandra.Client client) throws TException {
                ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
                return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
            }

            @Override
            public String toString() {
                return query;
            }
        };
    }

@gsheasby gsheasby merged commit 5f828fa into develop Oct 6, 2017
@gsheasby gsheasby deleted the fix/suppress-clientpoolimpl-stacktrace-spam branch October 6, 2017 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants