Allow to broadcast network messages in parallel #1409

Merged
merged 15 commits into from
Sep 11, 2023

Conversation

vstakhov
Contributor

@vstakhov vstakhov commented Sep 5, 2023

This PR addresses several pending issues:

  • Update orchestra to the recent version and test how the node performs
  • Add some useful metrics for the outbound network bridge
  • Try to send incoming network requests to all subsystems without blocking the dispatch loop on any particular subsystem
  • Fix all incompatibilities between orchestra and polkadot code (e.g. malus node)
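The non-blocking dispatch described in the list above could look roughly like the sketch below. It is not the code from this PR: it uses `std::sync::mpsc` bounded channels instead of the overseer senders, and `send_or_queue` and `flush_delayed` are hypothetical names chosen for illustration. The first pass never blocks on a full subsystem. The second pass waits for the delayed messages, so back pressure is preserved.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// First pass: try to deliver `event` to every subsystem without blocking.
/// Messages that do not fit into a full channel are returned for later delivery.
fn send_or_queue<E: Clone>(event: &E, subsystems: &[SyncSender<E>]) -> Vec<(usize, E)> {
    let mut delayed = Vec::new();
    for (idx, tx) in subsystems.iter().enumerate() {
        match tx.try_send(event.clone()) {
            Ok(()) => {}
            // Channel is full: remember the message instead of blocking here.
            Err(TrySendError::Full(ev)) => delayed.push((idx, ev)),
            // Receiver is gone: there is nothing to deliver to (assumption: skip it).
            Err(TrySendError::Disconnected(_)) => {}
        }
    }
    delayed
}

/// Second pass: block until every delayed message has been accepted.
fn flush_delayed<E>(delayed: Vec<(usize, E)>, subsystems: &[SyncSender<E>]) {
    for (idx, ev) in delayed {
        // An error here only means the receiver was dropped; ignored in this sketch.
        let _ = subsystems[idx].send(ev);
    }
}
```

With this split, a slow subsystem delays only its own message during the first pass, while the loop as a whole still cannot run ahead of the slowest receiver.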

@vstakhov vstakhov added A0-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix T0-node This PR/Issue is related to the topic “node”. A3-backport Pull request is already reviewed well in another branch. I4-refactor Code needs refactoring. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Sep 5, 2023
@vstakhov vstakhov marked this pull request as ready for review September 6, 2023 14:32
@vstakhov vstakhov force-pushed the vstakhov-network-bridge-refactor branch from 1e2dc02 to c900596 Compare September 7, 2023 09:46
polkadot/node/network/bridge/src/metrics.rs (outdated)
}

// Here we wait for all the delayed messages to be sent.
Contributor

I suggest adding a metric that tells us how long we wait here. This would help measure the level of back pressure.

Contributor Author

I think the best metric type here would be a histogram, though it would be quite expensive in terms of metrics space. WDYT about adding a histogram with a small number of buckets (like 0, 1, 5, 10)?
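A small-bucket histogram of the kind proposed here could be sketched as follows. This is a standalone, std-only illustration with cumulative buckets in the Prometheus style, not the metrics code of the node. `WaitHistogram` is a hypothetical name, and the unit of the bucket bounds is an assumption, since the comment does not state one.

```rust
/// Minimal cumulative histogram (hypothetical type, for illustration only).
struct WaitHistogram {
    /// Inclusive upper bounds of the buckets, in ascending order.
    bounds: Vec<f64>,
    /// counts[i] is the number of observations <= bounds[i];
    /// the extra last entry is the implicit +Inf bucket.
    counts: Vec<u64>,
    /// Sum of all observed values.
    sum: f64,
}

impl WaitHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, sum: 0.0 }
    }

    fn observe(&mut self, value: f64) {
        self.sum += value;
        // Cumulative buckets: a value is counted in every bucket whose bound it fits under.
        for (i, bound) in self.bounds.iter().enumerate() {
            if value <= *bound {
                self.counts[i] += 1;
            }
        }
        // The +Inf bucket counts every observation.
        *self.counts.last_mut().expect("the +Inf bucket always exists") += 1;
    }
}
```

With bounds `[0.0, 1.0, 5.0, 10.0]` the histogram exports only five bucket series plus the sum, which keeps the metrics footprint small at the cost of coarse resolution.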

Contributor

I'd say it would be sufficient to just have a counter that measures how much time is spent there.
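The counter alternative could be as simple as accumulating the elapsed wall-clock time around the wait. The sketch below is an assumption about how such a counter might look, using only `std::time`. `WaitTimeCounter` and its `time` method are hypothetical names, not part of the node's metrics API.

```rust
use std::time::{Duration, Instant};

/// Monotonically growing total of time spent waiting (hypothetical type).
#[derive(Default)]
struct WaitTimeCounter {
    total: Duration,
}

impl WaitTimeCounter {
    /// Run `wait`, add its wall-clock duration to the total, and return its result.
    fn time<T>(&mut self, wait: impl FnOnce() -> T) -> T {
        let started = Instant::now();
        let out = wait();
        self.total += started.elapsed();
        out
    }
}
```

A counter like this gives the total wait over time (and thus a rate), but unlike a histogram it cannot show how individual waits are distributed.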

Contributor Author

But we still need a histogram for time I suppose.

Contributor

we could also add a metric for the size of the delayed_messages queue

@eskimor
Member

eskimor commented Sep 7, 2023

  • Update orchestra to the recent version and test how the node performs

How does it perform?

@@ -1038,21 +1041,58 @@ fn dispatch_collation_event_to_all_unbounded(
}
}

fn try_send_validation_event<E>(
Member

I find the name a bit misleading. The `try` suggests that we don't wait when full, which is true, but we also don't give up. Maybe send_or_queue_validation_event?

@@ -1038,21 +1041,58 @@ fn dispatch_collation_event_to_all_unbounded(
}
}

fn try_send_validation_event<E>(
event: E,
sender: &mut (impl overseer::NetworkBridgeRxSenderTrait + overseer::SubsystemSender<E>),
Contributor

nit: add the trait bound to the where clause
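For illustration, the nit amounts to replacing the inline `impl Trait` bound with a named type parameter bounded in a `where` clause. The traits below are simplified, synchronous stand-ins for the overseer traits named in the diff, and `Recorder`, `send_inline` and `send_with_where` are hypothetical names. The real signatures in the PR may differ.

```rust
// Hypothetical stand-ins for the overseer traits referenced in the diff.
trait NetworkBridgeRxSenderTrait {}
trait SubsystemSender<E> {
    fn send_message(&mut self, event: E);
}

// Before: the bounds are written inline, in argument position.
fn send_inline<E>(event: E, sender: &mut (impl NetworkBridgeRxSenderTrait + SubsystemSender<E>)) {
    sender.send_message(event);
}

// After: a named type parameter with its bounds in a `where` clause.
fn send_with_where<E, S>(event: E, sender: &mut S)
where
    S: NetworkBridgeRxSenderTrait + SubsystemSender<E>,
{
    sender.send_message(event);
}

// Tiny sender used only to exercise both variants.
struct Recorder(Vec<u32>);
impl NetworkBridgeRxSenderTrait for Recorder {}
impl SubsystemSender<u32> for Recorder {
    fn send_message(&mut self, event: u32) {
        self.0.push(event);
    }
}
```

Both forms accept the same arguments. The `where` form keeps the parameter list short and lets callers name the sender type explicitly when needed.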

@vstakhov
Contributor Author

vstakhov commented Sep 8, 2023

  • Update orchestra to the recent version and test how the node performs

How does it perform?

I have tested it on Versi and it performed quite well with a cluster of malus nodes. However, I have not compared its performance with the baseline due to a clash with A/B testing. With the metrics added, we can probably check those metrics and decide.

@eskimor
Member

eskimor commented Sep 11, 2023

Ideally we would indeed know the impact this has, so yes I would appreciate some A/B testing.

@eskimor eskimor merged commit 44dbb73 into master Sep 11, 2023
114 checks passed
@eskimor eskimor deleted the vstakhov-network-bridge-refactor branch September 11, 2023 18:33
vstakhov added a commit that referenced this pull request Sep 17, 2023
@vstakhov vstakhov mentioned this pull request Sep 17, 2023
vstakhov added a commit that referenced this pull request Sep 18, 2023
The futures channels that are used by default have a side effect:
`Sender::clone` effectively increases the capacity of the bounded
channel by one. This PR fixes the undesired removal of backpressure
caused by #1409. The issue was discovered by @sandreim during Versi
testing and should be treated as critical: #1409 should not be
included in any release without this reversion.

This PR restores the original behaviour.
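The capacity effect can be illustrated with a toy model of the rule documented for `futures::channel::mpsc`, where a bounded channel's capacity equals the buffer size plus the number of senders. The sketch below is not the real implementation, and `ToyBoundedChannel` is a hypothetical type. It also simplifies by pooling the per-sender slots, whereas in the real channel each sender has its own guaranteed slot.

```rust
/// Toy model of the rule "capacity = buffer + number of senders".
/// Hypothetical type for illustration; not the futures implementation.
struct ToyBoundedChannel {
    buffer: usize,
    senders: usize,
    queued: usize,
}

impl ToyBoundedChannel {
    /// A new channel starts with exactly one sender.
    fn new(buffer: usize) -> Self {
        Self { buffer, senders: 1, queued: 0 }
    }

    fn capacity(&self) -> usize {
        self.buffer + self.senders
    }

    /// Cloning a sender adds one more slot to the channel.
    fn clone_sender(&mut self) {
        self.senders += 1;
    }

    /// Returns true if the message was accepted, false if the channel is full.
    fn try_send(&mut self) -> bool {
        if self.queued < self.capacity() {
            self.queued += 1;
            true
        } else {
            false
        }
    }
}
```

In this model, a channel that is already full accepts one more message after every `clone_sender` call, so code that clones a sender for each send never observes a full channel and the receiver's back pressure is lost.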

4 participants