Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serialize churn #1864

Merged
merged 15 commits into from Oct 30, 2019

Conversation

@jeanphilippeD
Copy link
Contributor

jeanphilippeD commented Oct 23, 2019

Partial fix to allow work on DKG and also better handle Relocation as well.
Closes #1860

Re-enable multiple_joining_nodes test.

Implement serialization of events:
in multiple_joining_nodes the 5 joining nodes are added and no nodes should fail.

Fix multiple issues related to dropping messages
=> backlog messages when not yet an elder
=> Forward routing messages received as Adult to correct destination.
=> Increase Signature accumulator time out and log when incomplete message is dropped.
=> Provide correct proof when sending to a Prefix authority (multiple sections).

Note: The ConnectionRequest should no longer be needed but can be better addressed as part of the follow up PR completing PeerMap removal.

@@ -263,6 +276,7 @@ impl Chain {
.collect(),
};

self.process_churn = false;

This comment has been minimized.

Copy link
@madadam

madadam Oct 24, 2019

Contributor

Should we only flip this to false when the section info is for our own section?

This comment has been minimized.

Copy link
@jeanphilippeD

jeanphilippeD Oct 24, 2019

Author Contributor

Yes, I'll push the updated branch if you are looking at it.

This comment has been minimized.

Copy link
@madadam

madadam Oct 24, 2019

Contributor

Sure. Or if you want me to wait with the review a bit, let me know.

@jeanphilippeD

This comment has been minimized.

Copy link
Contributor Author

jeanphilippeD commented Oct 24, 2019

Sure. Or if you want me to wait with the review a bit, let me know.

Good to get early feedback. pushed the additional change I'm working with. still having issue with simultaneous_joining_nodes after a few 10s iterations.

@@ -304,6 +317,20 @@ impl Chain {
| AccumulatingEvent::Relocate(_) => (),
}

let start_churn_event = match event {
AccumulatingEvent::Online(_) | AccumulatingEvent::Offline(_) => true,

This comment has been minimized.

Copy link
@madadam

madadam Oct 24, 2019

Contributor

Also Relocate.

src/chain/chain.rs Outdated Show resolved Hide resolved
jeanphilippeD added 11 commits Oct 29, 2019
Messages were still dropped when not in Elder state.
Ensure the message are backlogged, and trace appropriatly.

Also improve error logging.

Test:
Clippy + soak test
Mostly allow multiple_joining_nodes, and also avoid issue when mutliple
churn happens together, this should fit well with Relocate.

Address some issues with ignored connection request that lead to message
being dropped when mutliple churn happen at once. This will not be an
issue once we do not use the ConnectionrRequest/Response pattern.

multiple_joining_nodes still fail during soak testing and so is still
disabled. These failure seem related to the connection not being
established as needed which is currently being worked on.

Test:
clippy / soak tests mock
A node may still be in the Adult state when it start receiving routing
hop messages. The code did not do the necessary forwarding and so a
message could be swallowed by these nodes.

Ensure message are sent appropriately.

Test:
Soak test + clippy
Fix a specific seed that failed because nodes were not added after
split because the first backloged event was ignored.

Test:
Run soak test
Direct message are used to accumulate signature for SMD.
If a node that still think itself as an adult or joining node receive
them, it means it is one of the delivery group and should process them.
 => guarantees of SMD: A good node that receive the message
    will forward it.

Similarly, if we receive parsec gossip as a joining node, we should
process it as an adult.

Test:
Run clippy + local soak tests.
We need to set handled_genesis_event always when we see the genesis
block so we can start handling backlog events.

Also add tracing to better understand failures.

Test:
Soak tests + clippy
simultaneous_joining_nodes_three_section_with_one_ready_to_split failed
with seed Some([3563455648, 3905152850, 354682460, 4089474830]) because
messages where processed before the node had caught up with parsec and
thus could not trust the message.

An additional issue not fixed here was also uncovered:
 => A section should use the last signature known by the section parent
    of both split sibbling until both sibling have send an
    acknoledgement. This would avoid late node to think they are the
    destination of a section they do not trust.
The multiple_joining_nodes was failing because message signature did
not accumulate as they were expiring. Add log to show this problem and
increase timeout to avoid it.

Test:
clippy + soak test.
Avoid slow down all non churn tests:
Local test
with n = 40 -> 3 seconds
with n = 25 -> less than 1 second

We still split into 4 sections or more.

Test:
Soak test.
multiple_joining_nodes failed with seed after 1000 iterations:
Some([3916974036, 312593624, 819085280, 2177096903])
panicked at 'Elder(550551..(01)) Untrusted SignedRoutingMessage {
content: RoutingMessage { src: Section(name: c00000..), dst: PrefixSection(prefix: Prefix(0)),
content: NeighbourInfo(EldersInfo(prefix: Prefix(11), ...}, prev_hash_len: 1, version: 16)) },

The problem is we pick the knowledge of the first item in the destination
in this case 00, and so 01 does not have enough info.

Ensure we handle prefix authority correctly by taking as much of the
proof as needed by any section covered by the prefix.

Test:
Verify seed fixed.
Soak test + clippy
@jeanphilippeD jeanphilippeD force-pushed the jeanphilippeD:serialize_churn branch from eb1653b to e7fe727 Oct 29, 2019
@@ -43,12 +43,13 @@ pub struct JoiningPeer {
network_service: NetworkService,
routing_msg_filter: RoutingMessageFilter,
msg_backlog: Vec<SignedRoutingMessage>,

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

I would rename this one to routing_msg_backlog now that we have direct_msg_backlog too, for explicitness.

@jeanphilippeD jeanphilippeD force-pushed the jeanphilippeD:serialize_churn branch from e7fe727 to ca6e4cb Oct 29, 2019
@jeanphilippeD jeanphilippeD changed the title WIP: serializing churn serialize churn Oct 29, 2019
@jeanphilippeD jeanphilippeD marked this pull request as ready for review Oct 29, 2019
@jeanphilippeD jeanphilippeD requested review from madadam and octol Oct 29, 2019
@jeanphilippeD

This comment has been minimized.

Copy link
Contributor Author

jeanphilippeD commented Oct 29, 2019

Fixed review comment.

Copy link
Contributor

madadam left a comment

Looks good. Left a couple of mostly minor points.

@@ -77,6 +77,8 @@ pub struct Chain {
/// Temporary. Counting the accumulated prune events. Only used in tests until tests that
/// actually tests pruning is in place.
parsec_prune_accumulated: usize,
/// Marker indicating we are processing churn event
process_churn: bool,

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

Nitpick: maybe rename to processing_churn or churn_in_progress ?

src/chain/chain.rs Show resolved Hide resolved
@@ -57,6 +59,8 @@ pub struct SharedState {
pub their_knowledge: BTreeMap<Prefix<XorName>, u64>,
/// Recent keys removed from their_keys
pub their_recent_keys: VecDeque<(Prefix<XorName>, SectionKeyInfo)>,
/// Backlog of completed event that need to be processed when churn complete.

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

Grammar: "Backlog of completed events that need to be processed when churn completes"

}
}

#[allow(clippy::cognitive_complexity)]

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

Could the cognitive complexity be reduced by extracting some of the repetitive bits?

src/states/adult.rs Outdated Show resolved Hide resolved
Err(err) => debug!("{} - {:?}", self, err),
}
} else {
self.direct_msg_backlog.push((pub_id, msg));

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

Is this in preparation to support demotions? Because currently the only transition from Elder is either to BootstrappingPeer or to Terminated, in neither of which we carry the backlogs over.

This comment has been minimized.

Copy link
@jeanphilippeD

jeanphilippeD Oct 29, 2019

Author Contributor

It seemed sensible to keep the same flow for the different state.
Mostly it is a way to handle seeing a Transition sensibly, i.e we should stop processing the message when seeing the first transition.

It is likely that it will make more sense for demotion, but it seem to make sense to push the message back as a final state, and it the the state transition that decide what to do with any backloged messages?

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

It's just currently if a message ever ended in the backlog here, it would stay there forever. But that might change when demotions are implemented, so I think this is fine.

@@ -721,7 +725,8 @@ pub fn verify_section_invariants_for_nodes(nodes: &[TestNode], elder_size: usize
}

pub fn verify_section_invariants_between_nodes(nodes: &[TestNode]) {
let mut sections: BTreeMap<Prefix<XorName>, (XorName, BTreeSet<XorName>)> = BTreeMap::new();
type NodeSectionInfo = (XorName, Prefix<XorName>, u64, BTreeSet<XorName>);

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

I'd prefer this to be an actual struct with properly named fields, for clarity.

@@ -1137,6 +1154,35 @@ impl Base for Elder {
&mut self.timer
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
debug!("{} - State changed finish to Elder.", self);

This comment has been minimized.

Copy link
@madadam

madadam Oct 29, 2019

Contributor

Weird wording. What about "State change to Elder finished" or just "State changed to Elder"?

@jeanphilippeD

This comment has been minimized.

Copy link
Contributor Author

jeanphilippeD commented Oct 29, 2019

Addressed review comments.

@octol
octol approved these changes Oct 29, 2019
Copy link
Contributor

madadam left a comment

One small typo, but let's merge this anyway to not waste time on trivial issues, We can fix it in a later PR.

@@ -284,27 +287,23 @@ impl Base for Adult {
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
debug!("{} - State changed to Adult.", self);
debug!("{} - State changed to Adult finished.", self);

This comment has been minimized.

Copy link
@madadam

madadam Oct 30, 2019

Contributor

"changed" -> "change"

@jeanphilippeD jeanphilippeD merged commit aa2eab6 into maidsafe:fleming Oct 30, 2019
16 checks passed
16 checks passed
Check (ubuntu-latest)
Details
Test (ubuntu-latest)
Details
Check (windows-latest)
Details
Test (windows-latest)
Details
Check (macOS-latest)
Details
Test (macOS-latest)
Details
Test (ubuntu-latest, --release --features mock)
Details
Test (ubuntu-latest, --release)
Details
Test (windows-latest, --release --features mock)
Details
Test (windows-latest, --release)
Details
Test (macOS-latest, --release --features mock)
Details
Test (macOS-latest, --release)
Details
Rustfmt
Details
Clippy
Details
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@jeanphilippeD jeanphilippeD deleted the jeanphilippeD:serialize_churn branch Oct 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.