Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove a random subset of validators in net_dynamic_hb #385

Merged
merged 10 commits into from Feb 27, 2019
Merged

Remove a random subset of validators in net_dynamic_hb #385

merged 10 commits into from Feb 27, 2019

Conversation

RicoGit
Copy link
Contributor

@RicoGit RicoGit commented Feb 26, 2019

related issue #374

do_drop_and_re_add chooses at random at least 1 node for removing from the network and then re-adding removed nodes back. Cluster always remain correct. In other words before and after removing validators network correctness condition (N=3f+1) is always satisfied.

Removed nodes can be as faulty nodes as well as correct ones. You can see in the log of the test how much is actually nodes will be removed.

Max number of nodes for removing is 8, was chosen 2 nodes
Will remove and re-add nodes {1, 2}

Copy link
Collaborator

@afck afck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far! 👍
I'd just like to make the test slightly more general (see comment at subset_for_remove).

tests/net/mod.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@afck afck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! (Just one more nit-pick below.)

tests/net_dynamic_hb.rs Outdated Show resolved Hide resolved
@RicoGit
Copy link
Contributor Author

RicoGit commented Feb 27, 2019

Running tests on the CI server caught an error:

thread 'drop_and_re_add' panicked at 'crank: node failed to process step: Fault { reported_by: 3, faulty_id: 7, fault_kind: HbFault(SubsetFault(BaFault(DuplicateBVal))) }', libcore/result.rs:1009:5

Trying to reproduce the error locally I caught another error:

thread 'drop_and_re_add' panicked at 'crank: network queue empty', src/libcore/option.rs:1008:5

Second error appears when only one node left in the network after removing nodes. I'm trying to figure out what's wrong.

@vkomenda
Copy link
Contributor

@RicoGit, to reproduce locally, add the proptest seed from the failing test run on Travis to your local seeds.

@RicoGit
Copy link
Contributor Author

RicoGit commented Feb 27, 2019

Should I add a static test case for the failed inputs? like this:

#[test]
fn drop_and_re_add_one_node_left() {
    let cfg = TestConfig {
        dimension: NetworkDimension::new(4, 0),
        total_txs: 20,
        batch_size: 10,
        contribution_size: 1,
        seed: [151, 234, 50, 31, 109, 65, 28, 122, 82, 93, 226, 143, 185, 27, 195, 133]
    };
    do_drop_and_re_add(cfg)
}

@vkomenda
Copy link
Contributor

No, just add the seed to net_dynamic_hb.proptest-regressions.

@vkomenda
Copy link
Contributor

@RicoGit
Copy link
Contributor Author

RicoGit commented Feb 27, 2019

I've got it. Awesome! Thanks! But local test (with the seed from CI) produces a different error.

@vkomenda
Copy link
Contributor

Have a look at crank error variants if your local failure is 'crank: network queue empty'. Check that the removed nodes finished sending all their messages before removal.

@afck
Copy link
Collaborator

afck commented Feb 27, 2019

Does the second error always happen whenever we're left with only a single node?
That one may be related to the test (or the test net framework) and not the production code: Of course there are no messages in the queue if there's only one node. A single node will always immediately deliver a batch whenever we provide input.
If it's difficult to make the test handle that case, I'd be happy with making that a TODO for a separate PR, and restricting the test to at least two nodes for now.

The first error is strange, though. We are a bit aggressive with our fault reports and sending the same BVal twice wouldn't break anything in production; but it still shouldn't happen in theory… 🤔

@RicoGit
Copy link
Contributor Author

RicoGit commented Feb 27, 2019

"restricting the test to at least two nodes for now." - Yes! It fixed all the tests. Thanks! I've run a test about ten times for all known seeds.
"I'd be happy with making that a TODO" - I'll make an issue for that. ok?

@afck
Copy link
Collaborator

afck commented Feb 27, 2019

Great, thanks!
@vkomenda: Feel free to merge once you're happy with it.

@vkomenda
Copy link
Contributor

@afck, are you OK with versioning tests/net_dynamic_hb.proptest-regressions? It's reasonable but wasn't done before.

@afck
Copy link
Collaborator

afck commented Feb 27, 2019

Ah, right, that file should be removed, since with a single node it always fails anyway. Good catch!

@vkomenda vkomenda merged commit 3336fa7 into poanetwork:master Feb 27, 2019
@RicoGit RicoGit deleted the net_dyn_hb_improvement branch February 27, 2019 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants