Skip to content
This repository has been archived by the owner on Jun 25, 2021. It is now read-only.

serialize churn #1864

Merged
merged 15 commits into from Oct 30, 2019
Merged

serialize churn #1864

merged 15 commits into from Oct 30, 2019

Conversation

jeanphilippeD
Copy link
Contributor

@jeanphilippeD jeanphilippeD commented Oct 23, 2019

Partial fix to allow work on DKG and also better handle Relocation as well.
Closes #1860

Re-enable multiple_joining_nodes test.

Implement serialization of events:
in multiple_joining_nodes the 5 joining nodes are added and no nodes should fail.

Fix multiple issues related to dropping messages
=> backlog messages when not yet an elder
=> Forward routing messages received as Adult to correct destination.
=> Increase Signature accumulator time out and log when incomplete message is dropped.
=> Provide correct proof when sending to a Prefix authority (multiple sections).

Note: The ConnectionRequest should no longer be needed but can be better addressed as part of the follow up PR completing PeerMap removal.

@@ -263,6 +276,7 @@ impl Chain {
.collect(),
};

self.process_churn = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only flip this to false when the section info is for our own section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll push the updated branch if you are looking at it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Or if you want me to wait with the review a bit, let me know.

@jeanphilippeD
Copy link
Contributor Author

Sure. Or if you want me to wait with the review a bit, let me know.

Good to get early feedback. pushed the additional change I'm working with. still having issue with simultaneous_joining_nodes after a few 10s iterations.

@@ -304,6 +317,20 @@ impl Chain {
| AccumulatingEvent::Relocate(_) => (),
}

let start_churn_event = match event {
AccumulatingEvent::Online(_) | AccumulatingEvent::Offline(_) => true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also Relocate.

src/chain/chain.rs Outdated Show resolved Hide resolved
Jean-Philippe DUFRAIGNE added 11 commits October 29, 2019 11:16
Mostly allow multiple_joining_nodes, and also avoid issue when mutliple
churn happens together, this should fit well with Relocate.

Address some issues with ignored connection request that lead to message
being dropped when mutliple churn happen at once. This will not be an
issue once we do not use the ConnectionrRequest/Response pattern.

multiple_joining_nodes still fail during soak testing and so is still
disabled. These failure seem related to the connection not being
established as needed which is currently being worked on.

Test:
clippy / soak tests mock
Direct message are used to accumulate signature for SMD.
If a node that still think itself as an adult or joining node receive
them, it means it is one of the delivery group and should process them.
 => guarantees of SMD: A good node that receive the message
    will forward it.

Similarly, if we receive parsec gossip as a joining node, we should
process it as an adult.

Test:
Run clippy + local soak tests.
We need to set handled_genesis_event always when we see the genesis
block so we can start handling backlog events.

Also add tracing to better understand failures.

Test:
Soak tests + clippy
Fix a specific seed that failed because nodes were not added after
split because the first backloged event was ignored.

Test:
Run soak test
A node may still be in the Adult state when it start receiving routing
hop messages. The code did not do the necessary forwarding and so a
message could be swallowed by these nodes.

Ensure message are sent appropriately.

Test:
Soak test + clippy
Messages were still dropped when not in Elder state.
Ensure the message are backlogged, and trace appropriatly.

Also improve error logging.

Test:
Clippy + soak test
simultaneous_joining_nodes_three_section_with_one_ready_to_split failed
with seed Some([3563455648, 3905152850, 354682460, 4089474830]) because
messages where processed before the node had caught up with parsec and
thus could not trust the message.

An additional issue not fixed here was also uncovered:
 => A section should use the last signature known by the section parent
    of both split sibbling until both sibling have send an
    acknoledgement. This would avoid late node to think they are the
    destination of a section they do not trust.
The multiple_joining_nodes was failing because message signature did
not accumulate as they were expiring. Add log to show this problem and
increase timeout to avoid it.

Test:
clippy + soak test.
Avoid slow down all non churn tests:
Local test
with n = 40 -> 3 seconds
with n = 25 -> less than 1 second

We still split into 4 sections or more.

Test:
Soak test.
multiple_joining_nodes failed with seed after 1000 iterations:
Some([3916974036, 312593624, 819085280, 2177096903])
panicked at 'Elder(550551..(01)) Untrusted SignedRoutingMessage {
content: RoutingMessage { src: Section(name: c00000..), dst: PrefixSection(prefix: Prefix(0)),
content: NeighbourInfo(EldersInfo(prefix: Prefix(11), ...}, prev_hash_len: 1, version: 16)) },

The problem is we pick the knowledge of the first item in the destination
in this case 00, and so 01 does not have enough info.

Ensure we handle prefix authority correctly by taking as much of the
proof as needed by any section covered by the prefix.

Test:
Verify seed fixed.
Soak test + clippy
@@ -43,12 +43,13 @@ pub struct JoiningPeer {
network_service: NetworkService,
routing_msg_filter: RoutingMessageFilter,
msg_backlog: Vec<SignedRoutingMessage>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this one to routing_msg_backlog now that we have direct_msg_backlog too, for explicitness.

@jeanphilippeD jeanphilippeD changed the title WIP: serializing churn serialize churn Oct 29, 2019
@jeanphilippeD jeanphilippeD marked this pull request as ready for review October 29, 2019 14:16
@jeanphilippeD
Copy link
Contributor Author

Fixed review comment.

Copy link
Contributor

@madadam madadam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Left a couple of mostly minor points.

@@ -77,6 +77,8 @@ pub struct Chain {
/// Temporary. Counting the accumulated prune events. Only used in tests until tests that
/// actually tests pruning is in place.
parsec_prune_accumulated: usize,
/// Marker indicating we are processing churn event
process_churn: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: maybe rename to processing_churn or churn_in_progress ?

src/chain/chain.rs Show resolved Hide resolved
@@ -57,6 +59,8 @@ pub struct SharedState {
pub their_knowledge: BTreeMap<Prefix<XorName>, u64>,
/// Recent keys removed from their_keys
pub their_recent_keys: VecDeque<(Prefix<XorName>, SectionKeyInfo)>,
/// Backlog of completed event that need to be processed when churn complete.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: "Backlog of completed events that need to be processed when churn completes"

}
}

#[allow(clippy::cognitive_complexity)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the cognitive complexity be reduced by extracting some of the repetitive bits?

src/states/adult.rs Outdated Show resolved Hide resolved
Err(err) => debug!("{} - {:?}", self, err),
}
} else {
self.direct_msg_backlog.push((pub_id, msg));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in preparation to support demotions? Because currently the only transition from Elder is either to BootstrappingPeer or to Terminated, in neither of which we carry the backlogs over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed sensible to keep the same flow for the different state.
Mostly it is a way to handle seeing a Transition sensibly, i.e we should stop processing the message when seeing the first transition.

It is likely that it will make more sense for demotion, but it seem to make sense to push the message back as a final state, and it the the state transition that decide what to do with any backloged messages?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just currently if a message ever ended in the backlog here, it would stay there forever. But that might change when demotions are implemented, so I think this is fine.

@@ -721,7 +725,8 @@ pub fn verify_section_invariants_for_nodes(nodes: &[TestNode], elder_size: usize
}

pub fn verify_section_invariants_between_nodes(nodes: &[TestNode]) {
let mut sections: BTreeMap<Prefix<XorName>, (XorName, BTreeSet<XorName>)> = BTreeMap::new();
type NodeSectionInfo = (XorName, Prefix<XorName>, u64, BTreeSet<XorName>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer this to be an actual struct with properly named fields, for clarity.

@@ -1137,6 +1154,35 @@ impl Base for Elder {
&mut self.timer
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
debug!("{} - State changed finish to Elder.", self);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird wording. What about "State change to Elder finished" or just "State changed to Elder"?

@jeanphilippeD
Copy link
Contributor Author

Addressed review comments.

Copy link
Contributor

@madadam madadam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small typo, but let's merge this anyway to not waste time on trivial issues, We can fix it in a later PR.

@@ -284,27 +287,23 @@ impl Base for Adult {
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
debug!("{} - State changed to Adult.", self);
debug!("{} - State changed to Adult finished.", self);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"changed" -> "change"

@jeanphilippeD jeanphilippeD merged commit aa2eab6 into maidsafe:fleming Oct 30, 2019
@jeanphilippeD jeanphilippeD deleted the serialize_churn branch October 31, 2019 16:35
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make churn event handling sequential
3 participants