Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomos node tree overlay #415

Merged
merged 12 commits into from
Oct 5, 2023
Merged

Nomos node tree overlay #415

merged 12 commits into from
Oct 5, 2023

Conversation

bacv
Copy link
Member

@bacv bacv commented Sep 18, 2023

Added tree overlay to nomos-node. Some configuration parameters were missing in example config.yaml, so the file was updated.

@bacv bacv self-assigned this Sep 18, 2023
@bacv bacv added the node label Sep 18, 2023
@bacv bacv added this to the Nomos testnet (playground) milestone Sep 18, 2023
Copy link
Collaborator

@danielSanchezQ danielSanchezQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the overlay config will have to go away eventually as it must be calculated from chain instead. Overall looks good. I just do not know we we didn't mix a couple branches in this PR.

Comment on lines 14 to 16
#[serde(default = "MixnetClientConfig::default_max_retries")]
pub max_retries: usize,
#[serde(default = "MixnetClientConfig::default_retry_delay")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this from another branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the another branch had defaults for MixnetNode, I've added these for MixnetClient config parsing.

Comment on lines 57 to 64
da:
da_protocol:
num_attestations: 1
backend:
max_capacity: 10
evicting_period:
secs: 3600
nanos: 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, this is not from this pr or is it?

Copy link
Member Author

@bacv bacv Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from this PR, had to add these changes for config to work. Mentioned it in the description:

Some configuration parameters were missing in example config.yaml, so the file was updated.

Added these changes to PR because they are small, these are separate commits:
image

@bacv
Copy link
Member Author

bacv commented Sep 19, 2023

@zeegomo - After changing FlatOverlay to TreeOverlay in nomos node and running integration tests, ten happy path nodes network had some panicking nodes and failed to progress with views. @danielSanchezQ helped debug it and error handling for this case was added. Please review.

I've also increased the unhappy case timeout, tree overlay added more latency.

Comment on lines +232 to +233
tracing::debug!("Failed to gather initial votes");
return Event::None;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zeegomo This should be ok right? I have been working with Gusto and in reality if failing to gather the initial votes nodes should catch up anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing to gather initial votes means we go to the timeout routine, why does this happen now?

@zeegomo
Copy link
Contributor

zeegomo commented Sep 19, 2023

I'll review it after our meeting 👍🏼

Copy link
Contributor

@zeegomo zeegomo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What were the panics you saw in tests? I suspect they were caused by a timeout triggering the closing of the vote stream.
Those changes seems ok (and we were looking to add them), but since the tree overlay with 1 committee is exactly like a flat one, I'd like to investigate why we need the additional time

I've also increased the unhappy case timeout, tree overlay added more latency.

As in, this shouldn't be the case for a 1 committee tree overlay

Comment on lines +232 to +233
tracing::debug!("Failed to gather initial votes");
return Event::None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing to gather initial votes means we go to the timeout routine, why does this happen now?

@danielSanchezQ
Copy link
Collaborator

What were the panics you saw in tests? I suspect they were caused by a timeout triggering the closing of the vote stream. Those changes seems ok (and we were looking to add them), but since the tree overlay with 1 committee is exactly like a flat one, I'd like to investigate why we need the additional time

I've also increased the unhappy case timeout, tree overlay added more latency.

As in, this shouldn't be the case for a 1 committee tree overlay

Yes, issue need to be investigated. Apparantly after some number of nodes, some get unresponsive 🤔 , starting with the first leader.

@zeegomo
Copy link
Contributor

zeegomo commented Sep 19, 2023

I think it's worth investigating this issue now that the playground is still small, debugging something like this at greater scales is almost impossible

@bacv
Copy link
Member Author

bacv commented Sep 19, 2023

What were the panics you saw in tests? I suspect they were caused by a timeout triggering the closing of the vote stream.
Those changes seems ok (and we were looking to add them), but since the tree overlay with 1 committee is exactly like a flat one, I'd like to investigate why we need the additional time

The panic was caused by executing unreachable code here: https://github.com/logos-co/nomos-node/pull/415/files#diff-12bf2e7f638b9635b046010a05a3ac31b6c7abf580576c6a6b04bca1f33a394fL48-L83

It's most likely caused by the tree overlay. The increase in the time it takes to complete the test cases is probably related to the added error handling that triggers unhappy path?

I think it's worth investigating this issue now that the playground is still small, debugging something like this at greater scales is almost impossible

Agree, I'll ping you how we could organize this.

@bacv
Copy link
Member Author

bacv commented Sep 19, 2023

Moved code unrelated to tree overlay to #423

@youngjoon-lee
Copy link
Contributor

I merged #425 to this branch, but still we need #426 to be merged to make the ten-nodes-happy test passed.

* report unhappy blocks in the happy path test
@youngjoon-lee
Copy link
Contributor

youngjoon-lee commented Oct 5, 2023

I merged all fixes to this PR. If CI passes, I think we can consider merging this.

@bacv bacv merged commit 99d77f7 into master Oct 5, 2023
7 checks passed
@bacv bacv deleted the nomos-node-tree-overlay branch October 5, 2023 08:56
@youngjoon-lee youngjoon-lee linked an issue Oct 20, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vote_stream is sometimes closed before the tally is met
5 participants