Fixes problem with re-electing master on message loss #2091

Merged
merged 1 commit into from Mar 18, 2014

Contributor

jakewins commented Mar 4, 2014

No description provided.

Contributor

rickardoberg commented Mar 5, 2014

ClusterInstanceId cannot be renamed as it is used in message serialization on the wire. Don't merge this before that is reverted.

Owner

tinwelint commented Mar 6, 2014

and also you need to rebase this

@jakewins jakewins Fix issue with cluster bricking if specific message lost when master leaves.

 o When a master leaves, if the leave learned message did not reach the machine with the
   lowest id, the other instances would wait indefinitely for that instance to start an
   election. The instance with the lowest id would realize it was out of date, but would
   try to contact the (now dead) master to catch up, and would not retry when that failed.

 o This implementation fixes this by preferring instances that are more likely to be live and
   to have the Paxos instance id we are interested in (see CommonContextStateTest and
   LearnerStateTest).

 o It also introduces proper retries if the catch-up fails.
e2876d8
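
For readers who want the idea in the commit message above in concrete form, here is a minimal Java sketch of the two behaviours it describes: ranking catch-up targets so that instances believed to be alive and already holding the wanted Paxos instance id come first, and retrying when a catch-up attempt fails. It is not taken from the Neo4j code base. The class `CatchUpSketch`, the `Member` type, the methods `rankCandidates` and `catchUp`, the `maxRounds` parameter, and the `askForValue` callback are all hypothetical names introduced for illustration, and the real logic exercised by CommonContextStateTest and LearnerStateTest may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative sketch only: every name here is hypothetical, not Neo4j's API.
final class CatchUpSketch {

    /** What one instance believes about another cluster member (assumed model). */
    static final class Member {
        final int id;
        final boolean believedAlive;     // e.g. not currently suspected as failed
        final long lastLearnedInstance;  // highest Paxos instance id it is known to hold

        Member(int id, boolean believedAlive, long lastLearnedInstance) {
            this.id = id;
            this.believedAlive = believedAlive;
            this.lastLearnedInstance = lastLearnedInstance;
        }
    }

    /**
     * Returns catch-up candidates in preference order. Members believed dead
     * (such as a master that just left) and the asking instance are skipped.
     * Members that already hold the wanted instance sort first, ties by id.
     */
    static List<Member> rankCandidates(List<Member> members, int selfId, long wantedInstance) {
        List<Member> candidates = new ArrayList<>();
        for (Member m : members) {
            if (m.id != selfId && m.believedAlive) {
                candidates.add(m);
            }
        }
        candidates.sort(Comparator
                // false ("is not behind") sorts before true ("is behind")
                .comparing((Member m) -> m.lastLearnedInstance < wantedInstance)
                .thenComparingInt(m -> m.id));
        return candidates;
    }

    /**
     * Tries each ranked candidate in turn and repeats the whole pass up to
     * maxRounds times instead of giving up after the first failure.
     * askForValue stands in for the network request and returns true on success.
     */
    static Optional<Member> catchUp(List<Member> members, int selfId, long wantedInstance,
                                    int maxRounds, Predicate<Member> askForValue) {
        for (int round = 0; round < maxRounds; round++) {
            for (Member candidate : rankCandidates(members, selfId, wantedInstance)) {
                if (askForValue.test(candidate)) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty(); // caller decides what to do after bounded retries
    }
}
```

In this sketch the instance with the lowest id never asks the departed master, because that member is filtered out as not alive, and a single lost reply no longer ends the catch-up, because the ranked list is walked again on the next round. A bounded `maxRounds` is a simplification; a timeout-driven retry would be an equally plausible design.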
Contributor

jakewins commented Mar 11, 2014

retest this please

Owner

tinwelint commented Mar 14, 2014

retest this please, since it could be just a flaky test:

Test Result (1 failure / +1)
    org.neo4j.kernel.ha.TxPushStrategyConfigIT.twoRoundRobin

@rickardoberg rickardoberg added a commit that referenced this pull request Mar 18, 2014

@rickardoberg rickardoberg Merge pull request #2091 from jakewins/1.9-learnit
Fixes problem with re-electing master on message loss
5d9e836

@rickardoberg rickardoberg merged commit 5d9e836 into neo4j:1.9-maint Mar 18, 2014

1 check passed

default Merged build finished.