serialize churn #1864
Conversation
src/chain/chain.rs
Outdated
@@ -263,6 +276,7 @@ impl Chain {
        .collect(),
    };

    self.process_churn = false;
Should we only flip this to false
when the section info is for our own section?
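The suggested guard could be sketched roughly as below. This is illustrative only, with `String` prefixes standing in for the crate's `Prefix<XorName>` and a hypothetical `handle_section_info` method; the real types and call sites differ.

```rust
// Hypothetical sketch: only clear the churn flag when the accumulated
// SectionInfo belongs to our own section, not to a neighbour's.
struct Chain {
    our_prefix: String,
    process_churn: bool,
}

impl Chain {
    fn handle_section_info(&mut self, info_prefix: &str) {
        // A neighbour's section info does not mean our own churn round
        // is complete, so only our own prefix resets the flag.
        if info_prefix == self.our_prefix {
            self.process_churn = false;
        }
    }
}
```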
Yes, I'll push the updated branch if you are looking at it.
Sure. Or if you want me to wait with the review a bit, let me know.
Good to get early feedback. Pushed the additional change I'm working with. Still having an issue with simultaneous_joining_nodes after a few tens of iterations.
@@ -304,6 +317,20 @@ impl Chain {
        | AccumulatingEvent::Relocate(_) => (),
    }

    let start_churn_event = match event {
        AccumulatingEvent::Online(_) | AccumulatingEvent::Offline(_) => true,
Also Relocate
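The reviewer's point — that `Relocate` should also start a churn round — can be sketched with a simplified stand-in enum (not the crate's actual `AccumulatingEvent`, which has more variants and payloads):

```rust
// Simplified stand-in for the crate's AccumulatingEvent.
enum AccumulatingEvent {
    Online(u64),
    Offline(u64),
    Relocate(u64),
    SectionInfo,
}

// Churn starts on membership changes: Online, Offline, and (per the
// review comment) Relocate as well.
fn starts_churn(event: &AccumulatingEvent) -> bool {
    match event {
        AccumulatingEvent::Online(_)
        | AccumulatingEvent::Offline(_)
        | AccumulatingEvent::Relocate(_) => true,
        _ => false,
    }
}
```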
Mostly allow multiple_joining_nodes, and also avoid issues when multiple churn events happen together; this should fit well with Relocate. Address some issues with ignored connection requests that led to messages being dropped when multiple churn events happened at once. This will not be an issue once we no longer use the ConnectionRequest/Response pattern. multiple_joining_nodes still fails during soak testing and so is still disabled. These failures seem related to the connection not being established as needed, which is currently being worked on. Test: clippy / soak tests mock
Direct messages are used to accumulate signatures for SMD. If a node that still thinks of itself as an adult or joining node receives them, it means it is one of the delivery group and should process them. => Guarantees of SMD: a good node that receives the message will forward it. Similarly, if we receive parsec gossip as a joining node, we should process it as an adult. Test: Run clippy + local soak tests.
We need to always set handled_genesis_event when we see the genesis block so we can start handling backlogged events. Also add tracing to better understand failures. Test: Soak tests + clippy
Fix a specific seed that failed because nodes were not added after a split, since the first backlogged event was ignored. Test: Run soak test
A node may still be in the Adult state when it starts receiving routing hop messages. The code did not do the necessary forwarding, so a message could be swallowed by these nodes. Ensure messages are sent appropriately. Test: Soak test + clippy
Messages were still dropped when not in the Elder state. Ensure the messages are backlogged, and trace appropriately. Also improve error logging. Test: Clippy + soak test
simultaneous_joining_nodes_three_section_with_one_ready_to_split failed with seed Some([3563455648, 3905152850, 354682460, 4089474830]) because messages were processed before the node had caught up with parsec and thus could not trust the message. An additional issue not fixed here was also uncovered: => A section should use the last signature known by the parent section of both split siblings until both siblings have sent an acknowledgement. This would prevent late nodes from thinking they are the destination of a section they do not trust.
The multiple_joining_nodes test was failing because message signatures did not accumulate, as they were expiring. Add a log to show this problem and increase the timeout to avoid it. Test: clippy + soak test.
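The expiry problem can be illustrated with a toy accumulator. All names, durations, and the API here are invented for illustration; the point is only that a too-short timeout silently discards partial signatures, which is why the fix both logs the drop and lengthens the timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Toy signature accumulator: partial signatures for a message expire if
// the quorum is not reached within `timeout`. If the timeout is shorter
// than the time it takes enough signers to respond, accumulation never
// completes and the message is dropped without a trace.
struct SigAccumulator {
    timeout: Duration,
    partial: HashMap<u64, (Instant, usize)>, // msg hash -> (first seen, sig count)
}

impl SigAccumulator {
    // Returns true once `quorum` signatures accumulated before expiry.
    fn add_signature(&mut self, msg_hash: u64, quorum: usize, now: Instant) -> bool {
        let entry = self.partial.entry(msg_hash).or_insert((now, 0));
        if now.duration_since(entry.0) > self.timeout {
            // Expired: log the incomplete message and restart accumulation.
            eprintln!("dropping incomplete message {:x}", msg_hash);
            *entry = (now, 0);
        }
        entry.1 += 1;
        entry.1 >= quorum
    }
}
```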
Avoid slowing down all non-churn tests: local test with n = 40 -> 3 seconds; with n = 25 -> less than 1 second. We still split into 4 sections or more. Test: Soak test.
multiple_joining_nodes failed with seed after 1000 iterations: Some([3916974036, 312593624, 819085280, 2177096903]); panicked at 'Elder(550551..(01)) Untrusted SignedRoutingMessage { content: RoutingMessage { src: Section(name: c00000..), dst: PrefixSection(prefix: Prefix(0)), content: NeighbourInfo(EldersInfo(prefix: Prefix(11), ...}, prev_hash_len: 1, version: 16)) }'. The problem is that we pick the knowledge of the first item in the destination, in this case 00, and so 01 does not have enough info. Ensure we handle prefix authority correctly by taking as much of the proof as needed by any section covered by the prefix. Test: Verify seed fixed. Soak test + clippy
Force-pushed from eb1653b to e7fe727
src/states/joining_peer.rs
Outdated
@@ -43,12 +43,13 @@ pub struct JoiningPeer {
    network_service: NetworkService,
    routing_msg_filter: RoutingMessageFilter,
    msg_backlog: Vec<SignedRoutingMessage>,
I would rename this one to routing_msg_backlog now that we have direct_msg_backlog too, for explicitness.
Force-pushed from e7fe727 to ca6e4cb
Fixed review comment.
Looks good. Left a couple of mostly minor points.
src/chain/chain.rs
Outdated
@@ -77,6 +77,8 @@ pub struct Chain {
    /// Temporary. Counting the accumulated prune events. Only used in tests until tests that
    /// actually tests pruning is in place.
    parsec_prune_accumulated: usize,
    /// Marker indicating we are processing churn event
    process_churn: bool,
Nitpick: maybe rename to processing_churn or churn_in_progress?
src/chain/shared_state.rs
Outdated
@@ -57,6 +59,8 @@ pub struct SharedState {
    pub their_knowledge: BTreeMap<Prefix<XorName>, u64>,
    /// Recent keys removed from their_keys
    pub their_recent_keys: VecDeque<(Prefix<XorName>, SectionKeyInfo)>,
    /// Backlog of completed event that need to be processed when churn complete.
Grammar: "Backlog of completed events that need to be processed when churn completes"
src/chain/shared_state.rs
Outdated
    }
}

#[allow(clippy::cognitive_complexity)]
Could the cognitive complexity be reduced by extracting some of the repetitive bits?
        Err(err) => debug!("{} - {:?}", self, err),
    }
} else {
    self.direct_msg_backlog.push((pub_id, msg));
Is this in preparation to support demotions? Because currently the only transition from Elder is either to BootstrappingPeer or to Terminated, in neither of which we carry the backlogs over.
It seemed sensible to keep the same flow for the different states.
Mostly it is a way to handle seeing a Transition sensibly, i.e. we should stop processing messages when seeing the first transition.
It is likely that it will make more sense for demotion, but it seems sensible to push the message back as a final step, and it is the state transition that decides what to do with any backlogged messages?
It's just that currently, if a message ever ended up in the backlog here, it would stay there forever. But that might change when demotions are implemented, so I think this is fine.
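The backlog flow under discussion might look like the sketch below. Types are heavily simplified (a `u32` sender id and `String` message instead of the crate's `PublicId` and direct-message types), and `finish_handle_transition` is reduced to the drain step only.

```rust
// Simplified sketch: backlog direct messages while not yet an Elder,
// then drain and replay the backlog once the state transition finishes.
struct Node {
    is_elder: bool,
    direct_msg_backlog: Vec<(u32, String)>, // (sender id, message)
    handled: Vec<String>,
}

impl Node {
    fn handle_direct_message(&mut self, pub_id: u32, msg: String) {
        if self.is_elder {
            self.handled.push(msg);
        } else {
            // Not in the right state yet: keep the message for later
            // rather than dropping it on the floor.
            self.direct_msg_backlog.push((pub_id, msg));
        }
    }

    fn finish_handle_transition(&mut self) {
        self.is_elder = true;
        // Replay everything that arrived too early.
        for (pub_id, msg) in std::mem::take(&mut self.direct_msg_backlog) {
            self.handle_direct_message(pub_id, msg);
        }
    }
}
```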
tests/mock_network/utils.rs
Outdated
@@ -721,7 +725,8 @@ pub fn verify_section_invariants_for_nodes(nodes: &[TestNode], elder_size: usize
}

pub fn verify_section_invariants_between_nodes(nodes: &[TestNode]) {
    let mut sections: BTreeMap<Prefix<XorName>, (XorName, BTreeSet<XorName>)> = BTreeMap::new();
    type NodeSectionInfo = (XorName, Prefix<XorName>, u64, BTreeSet<XorName>);
I'd prefer this to be an actual struct with properly named fields, for clarity.
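The suggested change — replacing the tuple alias with a named struct — might look like this. The field names are guessed from the tuple's components, and `String` stands in for the real `XorName`/`Prefix<XorName>` types:

```rust
use std::collections::BTreeSet;

// Instead of:
//   type NodeSectionInfo = (XorName, Prefix<XorName>, u64, BTreeSet<XorName>);
// a struct with named fields is self-documenting at every use site.
struct NodeSectionInfo {
    node_name: String,          // stand-in for XorName
    prefix: String,             // stand-in for Prefix<XorName>
    version: u64,
    members: BTreeSet<String>,  // stand-in for BTreeSet<XorName>
}
```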
src/states/elder/mod.rs
Outdated
@@ -1137,6 +1154,35 @@ impl Base for Elder {
    &mut self.timer
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
    debug!("{} - State changed finish to Elder.", self);
Weird wording. What about "State change to Elder finished" or just "State changed to Elder"?
Test: Clippy + soak test.
Addressed review comments.
One small typo, but let's merge this anyway to not waste time on trivial issues. We can fix it in a later PR.
@@ -284,27 +287,23 @@ impl Base for Adult {
}

fn finish_handle_transition(&mut self, outbox: &mut dyn EventBox) -> Transition {
    debug!("{} - State changed to Adult.", self);
    debug!("{} - State changed to Adult finished.", self);
"changed" -> "change"
Partial fix to allow work on DKG, and also to better handle Relocation.
Closes #1860
Re-enable multiple_joining_nodes test.
Implement serialization of events:
In multiple_joining_nodes, the 5 joining nodes are added and no nodes should fail.
Fix multiple issues related to dropping messages
=> backlog messages when not yet an elder
=> Forward routing messages received as Adult to correct destination.
=> Increase the signature accumulator timeout and log when an incomplete message is dropped.
=> Provide correct proof when sending to a Prefix authority (multiple sections).
Note: The ConnectionRequest should no longer be needed, but that can be better addressed as part of the follow-up PR completing PeerMap removal.