Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve replication over netsplit #111

Merged
merged 5 commits into from
Oct 17, 2018

Conversation

studzien
Copy link
Contributor

This is done on top of #108 and #109.

This PR is an attempt to fix issues discovered by the property based tests that can generate split and join network between certain nodes:

  1. This counterexample reproduced a case when joins are propagated via netsplit (node 1 stores information about sessions connected to node 3 internally), but leaves are not -- when the node 3 sessions are disconnected and the split is healed, these sessions still appear to be connected on node 1.
    The proposed solution makes the filter in observe_removes less strict. I'm not sure what performance implications this might have, though.

  2. This counterexample occasionally reproduces a situation when some of the sessions connected to node 3 never appear on node 2.
    It seems that sometimes a merge of two deltas was possible, when their ranges were not contiguous. The proposed solution makes the deltas adjacency check more strict.

cc @pzel

@chrismccord chrismccord merged commit cc6d675 into phoenixframework:master Oct 17, 2018
@chrismccord
Copy link
Member

Thank you so much! ❤️❤️❤️🐥🔥

@studzien
Copy link
Contributor Author

Our pleasure!
There will be one more coming up soon, fixing an initial sync race condition when more nodes are started, but there's still something failing the property tests sometimes in a non-deterministic way.

So most likely there will be something wrong with the test environment itself, but I want to make sure first before creating a PR :)

@chrismccord
Copy link
Member

Great! I have pushed 1.1.1 with your fixes and look forward to the next one! Thanks for patiently waiting for me to get these merged in. I have to devote the brain cycles anytime I revisit the tracker code :)

Thanks again!

@studzien
Copy link
Contributor Author

@chrismccord Actually the fix I've been referring to above has been already included in this PR :)

I've finally found the reason why the test harness was failing sometimes - simply the time we've been waiting for nodes to synchronize was too short given the Erlang Distributed failure detector time to detect that remote nodes were not reachable via netsplit.

It seems that current master passes the property tests we currently have.
I'm planning to improve their speed significantly, so it's feasible to run longer scenario that we currently do.

If you're interested, I've reorganized counterexamples in our repo and split them into ones caused by issues when starting and stopping the nodes and the ones related to netsplits.

Each of them has a quick note what the issue was related to, and a link to a commit that fixed it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants