Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes in RPC handling #4486

Closed
wants to merge 4 commits into from
Closed

Fixes in RPC handling #4486

wants to merge 4 commits into from

Conversation

rvansa
Copy link
Member

@rvansa rvansa commented Aug 4, 2016

https://issues.jboss.org/browse/ISPN-6925
https://issues.jboss.org/browse/ISPN-6795

The third commit does not have a JIRA as it is a fix of consequences of ISPN-6925.

@@ -243,7 +246,17 @@ public void configure() {
if (rvrl != null) {
rvrl.remoteValueNotFound(key);
}
return null;
if (!hasSuccesfulResponse) {
RpcException rpcException = new RpcException("No successful response: " + responses);
Copy link
Member Author

@rvansa rvansa Aug 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's a bit weird that remote node will return SuccessfulResponse{null} when it does not find the entry locally, I think that UnsuccessfulResponse would make more sense...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Actually we used to send an UnsureResponse back in such cases (with a check in DistributionResponseGenerator). But because ClusteredGetCommand doesn't have a topology, the check we did there was almost guaranteed to be false, so I finally removed it with ISPN-5463.

Besides, it's weird to have a locality check in the response generator, I'd rather remove the ResponseGenerator and create the Response in the command itself. (Or maybe in the interceptors...)

@galderz
Copy link
Member

galderz commented Aug 23, 2016

Needs rebasing

@@ -33,7 +33,7 @@ public ClusteredGetResponseValidityFilter(Collection<Address> targets, Address s
public boolean isAcceptable(Response response, Address address) {
if (targets.contains(address)) {
missingResponses--;
if (response instanceof SuccessfulResponse) {
if (response instanceof SuccessfulResponse || response instanceof ExceptionResponse) {
validResponses++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExceptionResponse is not a ValidResponse, so the validResponses field name is no longer accurate.

@danberindei
Copy link
Member

@rvansa Are you sure the 3rd commit is needed because of the ISPN-6925 fix and not because of the ISPN-6795 fix? To me, it seems like the ISPN-6925 fix should be sufficient.

And this PR could really use a test...

@rvansa
Copy link
Member Author

rvansa commented Sep 2, 2016

@danberindei Hi, I've added the tests.

@Override
protected void createCacheManagers() throws Throwable {
ConfigurationBuilder builder = getDefaultClusteredCacheConfig(CacheMode.DIST_SYNC);
// cache stop takes quite long when the view splits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be waiting for remote txs to finish, because of cacheStopTimeout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-tx cache. This really waits for ST.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess that's another bug then :)

@rvansa
Copy link
Member Author

rvansa commented Sep 13, 2016

@danberindei Rebased & addressed.

@slaskawi
Copy link
Contributor

LGTM, @danberindei May I integrate it?

@@ -33,8 +33,8 @@ public ClusteredGetResponseValidityFilter(Collection<Address> targets, Address s
public boolean isAcceptable(Response response, Address address) {
if (targets.contains(address)) {
missingResponses--;
if (response instanceof SuccessfulResponse) {
validResponses++;
if (response instanceof SuccessfulResponse || response instanceof ExceptionResponse) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pruivo @tristantarrant @galderz @wburns Hi, so we have a stalemate with @danberindei we'd like to decide:

When a remote get is executed (to say, 2 owners), and first owner responds with an exception, there are two things we can do:

  1. ignore the exception (swallow it) and wait for the second response (possibly without exception) - that's what master does
  2. report the exception immediately (suggested change)

An exception can be thrown from two reasons:
a) bug in Infinispan code
b) AvailabilityException or similar

Note: AE can be thrown on one owner and not on another under certain timing; it should not happen in steady state

Pros:

  • fail fast, both during development (for a)-style exceptions), and for b) as the user code is likely to hit AE more often (so the code will deal with it properly). Swallowing exceptions hides problems.
  • when all owners would throw an exception, we won't get just RpcException without any other info (until this PR, it would end up with timeout, other fixes in the PR at least throw RpcException when there's no acceptable response)
  • don't do another RPC when the first one already failed - just hoping for the best when we're likely to get the same response

Cons:

  • the value can be available, and could be of use to the user (rather than the exception)
  • it is a change of behaviour (in a situation when the exception can happen with different timing, though)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are talking about Get commands, I vote for option 1 (ignoring exception and waiting for second response).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about OutdatedTopologyExceptions ? (still not a steady state)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tristantarrant OTEs are not thrown with ClusteredGetCommand (yet), because it does not have topologyId. I'd like to change that in another PR (not finished yet), so you can consider it as well - it's quite similar to the AE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other opinions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, my vote goes to 1. I guess we'd want the opinions of @wburns and @pruivo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since others' opinion seems to be null, I'll take the result as 1).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tristantarrant Turns out our current behaviour is actually 2), with the caveat that we delay throwing the exception until we get a response from all the owners. We started behaving like 1) after the staggered gets work, but because of a bug that happened only when staggered gets were actually disabled.

Turns out supporting 1) and reporting meaningful exceptions is also pretty awkward with the ResponseFilter API, so I agreed with @rvansa to converge to 2) everywhere, at least until we implement our own response handling on top of JChannel.

@rvansa
Copy link
Member Author

rvansa commented Sep 15, 2016

Added one more commit, fixing an issue when a request times out but TimeoutException is not thrown.
The test included in the PR would fail (without the fix) with RpcException.

@rvansa
Copy link
Member Author

rvansa commented Sep 15, 2016

Rebased to master.

* also fixes issue when a response marked as suspected was copied as received with null value.
* when a timeout is reached, TimeoutException is thrown (possibly with suppressed exceptions)
* updated check in PartitionHandlingInterceptor to test exceptions
  instead of null return value.
* provided as separate commit to allow running the test easily on master
@danberindei
Copy link
Member

Integrated, thanks Radim!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants