New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/suppress ClientPoolImpl stacktrace spam #2432
Conversation
* Copy-paste those changes * Fix logic: numTries == 1 instead of numTries > 1 * Add SafeArg and remove exception from the fast failover message
previousHostPool, | ||
hostPool.getHost()); | ||
SafeArg.of("previousHost", previousHostPool), | ||
SafeArg.of("newHost", hostPool.getHost())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe: newHost -> randomHost
Codecov Report
@@ Coverage Diff @@
## develop #2432 +/- ##
=============================================
+ Coverage 60.33% 60.36% +0.02%
+ Complexity 4683 4447 -236
=============================================
Files 833 833
Lines 39391 39414 +23
Branches 4036 4037 +1
=============================================
+ Hits 23768 23791 +23
+ Misses 14135 14133 -2
- Partials 1488 1490 +2
Continue to review full report at Codecov.
|
log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e); | ||
// And value between -500 and +500ms to backoff to better spread load on failover | ||
int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500); | ||
log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Retrying a query, {}, with backoff of {}ms,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, there's a stray }
in there after ms
log.warn("Retrying with backoff a query intended for host {}.", hostPool.getHost(), e); | ||
// And value between -500 and +500ms to backoff to better spread load on failover | ||
int sleepDuration = numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500); | ||
log.warn("Retrying with backoff ({}ms} a query, {}, intended for host {}.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, there's a stray }
in there after ms
log.warn("We tried to add {} back into the pool, but got an exception" | ||
+ " that caused us to distrust this host further. Exception message was: {} : {}", | ||
SafeArg.of("host", host), | ||
SafeArg.of("exceptionClass", e.getClass().getTypeName()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: getTypeName
is a new one, should be fine though e.getClass().getName()
or e.getClass().getCanonicalName()
might be simpler
log.warn("Error occurred talking to cassandra. Attempt {} of {}.", | ||
SafeArg.of("numTries", numTries), | ||
SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL), | ||
ex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the exception be the last arg in the if block, not else? I think we could actually do the following to cover all cases so that we include the exception class & message all the time, but only include stacktrace on first occurrence.
log.warn("Error occurred talking to cassandra. Attempt {} of {}. Exception message was: {} : {}",
SafeArg.of("numTries", numTries),
SafeArg.of("maxTotalTries", MAX_TRIES_TOTAL),
SafeArg.of("exceptionClass", ex.getClass().getName()),
UnsafeArg.of("exceptionMessage", ex.getMessage()),
(numTries == 1) ? ex : null);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
....d'oh, I was right the first time!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsheasby isn’t your logic here backwards if you’re trying to only log the exception on first failed attempt? I am pretty sure the whole of/else should be replaced with my comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct on develop. We don't need to log exceptionClass or message on the first try, since we log the whole exception.
} catch (TException e) { | ||
throw Throwables.throwUncheckedException(e); | ||
} | ||
} | ||
|
||
private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) { | ||
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to eagerly capture and retain both the query string & converted bytes & byte buffer or should this be inlined into apply
below as:
public CqlResult apply(Cassandra.Client client) throws TException {
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to override toString
here to make the log message helpful - see CassandraClientPoolImpl.runWithRetryOnHost
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can still override toStringas you have it, this just avoids keeping the byte buffer around after executing the query. If we are going to re-execute the query many times, then your original version of eagerly converting to bytes might be better tradeoff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect the query to execute one time in 99.9% of cases.
However, I'm a little confused about the comment about "keeping the byte buffer around" - doesn't it go away when we exit getCqlFunction
?
Relatedly - I've been biasing towards keeping the stuff inside clientPool.runWithRetryOnHost
as small as possible, ideally just client.do_the_thing(the_arguments)
.
That way, it's more transparent which code might throw the exception we're passing, and it generally produces cleaner and more testable code - if we do that for long enough, eventually all of the clientPool-calling logic will live inside a small class which we can just mock for testing (and replace with CQL for production).
Does that practice lead to sub-optimal perf somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes
and String query
references are captured as they are referenced in the apply
and toString
methods respectively. This means that both will live beyond the execution of the getCqlFunction
method and until the executeCqlFunctionOnHost
exits.
private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() {
@Override
public CqlResult apply(Cassandra.Client client) throws TException {
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
}
@Override
public String toString() {
return query;
}
};
}
return new FunctionCheckedException<Cassandra.Client, Object, RuntimeException>() { | ||
@Override | ||
public Object apply(Cassandra.Client input) throws RuntimeException { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this should probably return Void
:
private FunctionCheckedException<Cassandra.Client, Void, RuntimeException> noOp() {
return new FunctionCheckedException<Cassandra.Client, Void, RuntimeException>() {
@Override
public Void apply(Cassandra.Client input) throws RuntimeException {
return null;
}
@Override
public String toString() {
return "no-op";
}
};
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, should be good to merge if @schlosna gives a green signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly just informative feedback for the future
try { | ||
// And value between -500 and +500ms to backoff to better spread load on failover | ||
Thread.sleep(numTries * 1000 + (ThreadLocalRandom.current().nextInt(1000) - 500)); | ||
Thread.sleep(sleepDuration); | ||
} catch (InterruptedException i) { | ||
throw new RuntimeException(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to reset interrupt here? Could fix in separate PR
} catch (TException e) { | ||
throw Throwables.throwUncheckedException(e); | ||
} | ||
} | ||
|
||
private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) { | ||
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is actually a huge issue as I don't expect the queries to be that large, but my comment was more intended as context around potential issues with capturing references via lambda & anonymous inner classes. By "keeping the byte buffer around", what I mean is that in the current PR, the ByteBuffer queryBytes
and String query
references are captured as they are referenced in the apply
and toString
methods respectively. This means that both will live beyond the execution of the getCqlFunction
method and until the executeCqlFunctionOnHost
exits.
private FunctionCheckedException<Cassandra.Client, CqlResult, TException> getCqlFunction(String query) {
return new FunctionCheckedException<Cassandra.Client, CqlResult, TException>() {
@Override
public CqlResult apply(Cassandra.Client client) throws TException {
ByteBuffer queryBytes = ByteBuffer.wrap(query.getBytes(StandardCharsets.UTF_8));
return client.execute_cql3_query(queryBytes, Compression.NONE, consistency);
}
@Override
public String toString() {
return query;
}
};
}
Reapplied changes from #2373 after the Startup Ordering PR merged
Goals (and why): Fixes #2280.
Implementation Description (bullets):
(the stack trace is still logged if we run out of retries)
Concerns (what feedback would you like?):
Where should we start reviewing?:
CassandraClientPoolImpl
Priority (whenever / two weeks / yesterday): today - higher priority since one of our internal product gets indigestion when we feed them this spam. @ChrisAlice for SA.
This change is