New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes in RPC handling #4486
Fixes in RPC handling #4486
Conversation
@@ -243,7 +246,17 @@ public void configure() { | |||
if (rvrl != null) { | |||
rvrl.remoteValueNotFound(key); | |||
} | |||
return null; | |||
if (!hasSuccesfulResponse) { | |||
RpcException rpcException = new RpcException("No successful response: " + responses); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it's a bit weird that remote node will return SuccessfulResponse{null} when it does not find the entry locally, I think that UnsuccessfulResponse would make more sense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Actually we used to send an UnsureResponse
back in such cases (with a check in DistributionResponseGenerator
). But because ClusteredGetCommand
doesn't have a topology, the check we did there was almost guaranteed to be false, so I finally removed it with ISPN-5463.
Besides, it's weird to have a locality check in the response generator, I'd rather remove the ResponseGenerator
and create the Response
in the command itself. (Or maybe in the interceptors...)
Needs rebasing |
@@ -33,7 +33,7 @@ public ClusteredGetResponseValidityFilter(Collection<Address> targets, Address s | |||
public boolean isAcceptable(Response response, Address address) { | |||
if (targets.contains(address)) { | |||
missingResponses--; | |||
if (response instanceof SuccessfulResponse) { | |||
if (response instanceof SuccessfulResponse || response instanceof ExceptionResponse) { | |||
validResponses++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExceptionResponse
is not a ValidResponse
, so the validResponses
field name is no longer accurate.
@rvansa Are you sure the 3rd commit is needed because of the ISPN-6925 fix and not because of the ISPN-6795 fix? To me, it seems like the ISPN-6925 fix should be sufficient. And this PR could really use a test... |
@danberindei Hi, I've added the tests. |
@Override | ||
protected void createCacheManagers() throws Throwable { | ||
ConfigurationBuilder builder = getDefaultClusteredCacheConfig(CacheMode.DIST_SYNC); | ||
// cache stop takes quite long when the view splits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be waiting for remote txs to finish, because of cacheStopTimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's non-tx cache. This really waits for ST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I guess that's another bug then :)
@danberindei Rebased & addressed. |
LGTM, @danberindei May I integrate it? |
@@ -33,8 +33,8 @@ public ClusteredGetResponseValidityFilter(Collection<Address> targets, Address s | |||
public boolean isAcceptable(Response response, Address address) { | |||
if (targets.contains(address)) { | |||
missingResponses--; | |||
if (response instanceof SuccessfulResponse) { | |||
validResponses++; | |||
if (response instanceof SuccessfulResponse || response instanceof ExceptionResponse) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pruivo @tristantarrant @galderz @wburns Hi, so we have a stalemate with @danberindei we'd like to decide:
When a remote get is executed (to say, 2 owners), and first owner responds with an exception, there are two things we can do:
- ignore the exception (swallow it) and wait for the second response (possibly without exception) - that's what master does
- report the exception immediately (suggested change)
An exception can be thrown from two reasons:
a) bug in Infinispan code
b) AvailabilityException or similar
Note: AE can be thrown on one owner and not on another under certain timing; it should not happen in steady state
Pros:
- fail fast, both during development (for a)-style exceptions), and for b) as the user code is likely to hit AE more often (so the code will deal with it properly). Swallowing exceptions hides problems.
- when all owners would throw an exception, we won't get just RpcException without any other info (until this PR, it would end up with timeout, other fixes in the PR at least throw RpcException when there's no acceptable response)
- don't do another RPC when the first one already failed - just hoping for the best when we're likely to get the same response
Cons:
- the value can be available, and could be of use to the user (rather than the exception)
- it is a change of behaviour (in a situation when the exception can happen with different timing, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are talking about Get commands, I vote for option 1 (ignoring exception and waiting for second response).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about OutdatedTopologyExceptions ? (still not a steady state)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tristantarrant OTEs are not thrown with ClusteredGetCommand (yet), because it does not have topologyId. I'd like to change that in another PR (not finished yet), so you can consider it as well - it's quite similar to the AE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since others' opinion seems to be null
, I'll take the result as 1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tristantarrant Turns out our current behaviour is actually 2), with the caveat that we delay throwing the exception until we get a response from all the owners. We started behaving like 1) after the staggered gets work, but because of a bug that happened only when staggered gets were actually disabled.
Turns out supporting 1) and reporting meaningful exceptions is also pretty awkward with the ResponseFilter
API, so I agreed with @rvansa to converge to 2) everywhere, at least until we implement our own response handling on top of JChannel
.
Added one more commit, fixing an issue when a request times out but TimeoutException is not thrown. |
7116432
to
170587f
Compare
Rebased to master. |
* also fixes issue when a response marked as suspected was copied as received with null value. * when a timeout is reached, TimeoutException is thrown (possibly with suppressed exceptions)
* updated check in PartitionHandlingInterceptor to test exceptions instead of null return value.
* provided as separate commit to allow running the test easily on master
Integrated, thanks Radim! |
https://issues.jboss.org/browse/ISPN-6925
https://issues.jboss.org/browse/ISPN-6795
The third commit does not have a JIRA as it is a fix of consequences of ISPN-6925.