Fixes problem with re-electing master on message loss #2091

rickardoberg merged 1 commit into neo4j:1.9-maint from jakewins/1.9-learnit on Mar 18, 2014


3 participants

jakewins commented Mar 4, 2014

No description provided.


rickardoberg commented Mar 5, 2014

ClusterInstanceId cannot be renamed as it is used in message serialization on the wire. Don't merge this before that is reverted.


tinwelint commented Mar 6, 2014

Also, you need to rebase this.

@jakewins jakewins Fix issue with cluster bricking if specific message lost when master …

 o When a master leaves, if the leave learned message did not reach the instance with the
   lowest id, the other instances would wait indefinitely for that instance to start an
   election. The instance with the lowest id would realize it was out of date, but would
   try to contact the (now dead) master to catch up, and would not retry when that failed.

 o This implementation fixes this by preferring instances that are more likely to be live and
   to have the Paxos instance id we are interested in (see CommonContextStateTest and
   LearnerStateTest).
 o It also introduces proper retries if the catch-up fails.
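
The two points above (prefer live, up-to-date instances, and retry when catch-up fails) can be illustrated with a minimal sketch. This is not the code from this commit: the class `CatchUpSketch`, the `Member` type, its fields, and the `tryLearnFrom` callback are all hypothetical names invented for illustration, and the real cluster code is structured differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch only; none of these names come from the Neo4j code base.
final class CatchUpSketch {

    /** Assumed, simplified view of a cluster member as seen by the learner. */
    static final class Member {
        final int serverId;
        final boolean alive;               // assumption: derived from heartbeats
        final long lastLearnedInstanceId;  // assumption: last Paxos instance id it reported

        Member(int serverId, boolean alive, long lastLearnedInstanceId) {
            this.serverId = serverId;
            this.alive = alive;
            this.lastLearnedInstanceId = lastLearnedInstanceId;
        }
    }

    /**
     * Keeps only members that are believed to be alive and that have already
     * learned the wanted instance id, most up to date first.
     */
    static List<Member> candidates(List<Member> members, long wantedInstanceId) {
        List<Member> result = new ArrayList<>();
        for (Member m : members) {
            if (m.alive && m.lastLearnedInstanceId >= wantedInstanceId) {
                result.add(m);
            }
        }
        result.sort(Comparator.comparingLong((Member m) -> m.lastLearnedInstanceId).reversed());
        return result;
    }

    /**
     * Asks each candidate in turn and repeats the whole pass up to maxRounds
     * times, instead of giving up after a single failed attempt.
     * tryLearnFrom stands in for the actual network request.
     */
    static Optional<Member> catchUp(List<Member> members, long wantedInstanceId,
                                    int maxRounds, Predicate<Member> tryLearnFrom) {
        for (int round = 0; round < maxRounds; round++) {
            for (Member m : candidates(members, wantedInstanceId)) {
                if (tryLearnFrom.test(m)) {
                    return Optional.of(m);
                }
            }
        }
        return Optional.empty();
    }
}
```

In this sketch a dead master is never picked as a catch-up source, even if it holds the wanted instance id, and a failed request leads to another round rather than a permanent stall. A real implementation would presumably space retries with a timeout rather than loop immediately.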

jakewins commented Mar 11, 2014

retest this please


tinwelint commented Mar 14, 2014

Retest this please, since it could just be a flaky test:

Test Result (1 failure / +1)

@rickardoberg rickardoberg added a commit that referenced this pull request Mar 18, 2014

@rickardoberg rickardoberg Merge pull request #2091 from jakewins/1.9-learnit
Fixes problem with re-electing master on message loss

@rickardoberg rickardoberg merged commit 5d9e836 into neo4j:1.9-maint Mar 18, 2014

1 check passed

default Merged build finished.