ISPN-1000 - Block new transactions while rehash is in progress #405

Closed

danberindei
Member

https://issues.jboss.org/browse/ISPN-1000
Master only

I updated the rebalance algorithm to block transactions in the entire cluster until all nodes have finished pushing state. The apply state commands will bypass the block since they use the SKIP_LOCKING flag.
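
To illustrate the blocking idea, here is a minimal single-node sketch. It is not the code from this pull request: `RehashGate` and its methods are hypothetical names, the `skipLocking` parameter only stands in for a command carrying the SKIP_LOCKING flag, and the cluster-wide coordination that decides when all nodes have finished pushing state is not modeled.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate for illustration only; not an Infinispan class. */
final class RehashGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition unblocked = lock.newCondition();
    private boolean rehashInProgress;

    /** Called when a rehash starts: new transactions must wait from now on. */
    void blockNewTransactions() {
        lock.lock();
        try {
            rehashInProgress = true;
        } finally {
            lock.unlock();
        }
    }

    /** Called once the rehash is complete: wakes up all waiting callers. */
    void unblockNewTransactions() {
        lock.lock();
        try {
            rehashInProgress = false;
            unblocked.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Returns true if the caller may proceed, false if the timeout elapsed
     * while a rehash was still in progress. Callers that skip locking
     * (assumption: this models SKIP_LOCKING) never wait.
     */
    boolean awaitPermission(boolean skipLocking, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (skipLocking) {
            return true;
        }
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            while (rehashInProgress) {
                if (nanos <= 0L) {
                    return false;
                }
                // awaitNanos returns the remaining wait time
                nanos = unblocked.awaitNanos(nanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch a write would call `awaitPermission(false, ...)` before starting its transaction, while a state-application command would pass `true` and go straight through.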

In an earlier commit for ISPN-1106 I was catching InterruptedException and wrapping it in an IllegalStateException. I have now moved the wait to the RehashControlCommand-related methods only, and they don't catch the InterruptedException.
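
The difference between the two styles of interrupt handling can be sketched as follows. `RehashWait` and its method names are hypothetical and only contrast the two approaches; they are not the methods touched by this pull request.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical helper for illustration only; not an Infinispan class. */
final class RehashWait {
    private final CountDownLatch rehashDone = new CountDownLatch(1);

    /** Signals that the rehash has finished and releases all waiters. */
    void rehashFinished() {
        rehashDone.countDown();
    }

    /** Earlier style: the interrupt is caught and wrapped in an unchecked exception. */
    void awaitWrapping() {
        try {
            rehashDone.await();
        } catch (InterruptedException e) {
            throw new IllegalStateException("Interrupted while waiting for rehash", e);
        }
    }

    /** Later style: the wait does not catch the interrupt, so the caller sees it. */
    void awaitPropagating() throws InterruptedException {
        rehashDone.await();
    }
}
```

With the propagating variant the caller decides how to react to the interrupt, instead of receiving an exception type that hides the original cause.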

@Sanne
Member

Sanne commented Jun 26, 2011

Wasn't it one of the goals of Infinispan to be always available, even during rehashes? I thought that at the London meeting Manik had suggested a design in which the transaction log would be used to apply the changes to the rehashing nodes. I didn't think much about it, but I was assuming a solution without blocking transactions was the goal.
I don't know if this was discussed, I'm just surprised; this might well be the best that can be implemented.

@danberindei
Member Author

Sanne, you mean even if one or more of the nodes is doing a rehash and is blocked, the rest of the nodes should still respond to requests? You are right, we did lose this possibility with these changes.

If you mean that every node should be available all the time, that was never the case. Before the push-based redesign we did indeed have a much shorter blocking period, only while draining the transaction log. But Manik suggested that we first make the new algorithm work without the non-blocking part, since concurrent state update was a big source of problems with the previous algorithm.

I admit blocking the entire cluster is yet another step back from the availability POV (note that this only applies to write operations, though). I think it is however a bigger step forward from the consistency POV. I have thought about blocking the transactions on just the sending/receiving nodes, but it's very hard for a node to know how long to wait for state or even if it will receive state from another node for a certain view change (JGroups AND our rehashing algorithm will sometimes skip views).

I have started implementing something that would block a single node and still give us consistency, but I couldn't make it work reliably (yet). In the end I think we will want to enable virtual nodes by default and then almost every rehash will involve all the nodes so it won't hurt that much.

I certainly plan to work on this again when we re-implement non-blocking state transfer, or even sooner if I get any good ideas.

@Sanne
Member

Sanne commented Jun 27, 2011

@danberindei thanks for the great explanation.

> Sanne, you mean even if one or more of the nodes is doing a rehash and is blocked, the rest of the nodes should still respond to requests? You are right, we did lose this possibility with these changes.

Yes that's what I meant.

> But Manik suggested that we first make the new algorithm work without the non-blocking part

+1000 ;)

So this affects only write operations? That's not that bad after all.

Dan Berindei added 2 commits June 27, 2011 13:56
Apply state commands and all commands using the SKIP_LOCKING flag are still allowed while new transactions are blocked.
Transactions are now blocked while rehashing is in progress, they are not resumed immediately after pushing state.