-
Notifications
You must be signed in to change notification settings - Fork 317
feat(multipath): add back path congestion metrics #3669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: Frando/mp-metrics-basics
Are you sure you want to change the base?
Conversation
|
Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/3669/docs/iroh/ Last updated: 2025-11-18T10:49:06Z |
2f4092f to
21448f3
Compare
| @@ -220,6 +226,7 @@ impl EndpointStateActor { | |||
| trace!("actor started"); | |||
| let idle_timeout = MaybeFuture::None; | |||
| tokio::pin!(idle_timeout); | |||
| let mut metrics_interval = time::interval(METRICS_INTERVAL); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be cfged as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failed to get the cfg be happy with tokio::select!. Maybe should be forever-pending then if the cfg is not active?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, sth like that I have done in the past
flub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to limit this to figure out what the easy and interesting counters to add. Nothing more. If we need congestion metrics we can do that in a later pass.
| @@ -75,6 +78,9 @@ const APPLICATION_ABANDON_PATH: u8 = 30; | |||
| /// in a high frequency, and to keep data about previous path around for subsequent connections. | |||
| const ACTOR_MAX_IDLE_TIMEOUT: Duration = Duration::from_secs(60); | |||
|
|
|||
| /// Interval in which connection and path metrics are emitted. | |||
| const METRICS_INTERVAL: Duration = Duration::from_secs(10); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Soo, I'm going to argue that this is fundamentally doing metrics wrong. Metrics increment counters or record historgrams at discreet points. If you are recording something on a timer you were supposed to provide the metrics that can compute the thing, and the metrics backend can compute it on the fly over the range it wants, using sampling intervals it wants etc etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be discreet points to record connection latency at? Open/close path events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is metrics backend code. I'll have a much easier time with this PR if we don't try and do anything like that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fn record_metrics(&mut self) { | ||
| #[cfg(feature = "metrics")] | ||
| for state in self.connections.values_mut() { | ||
| state.record_metrics_periodic(&self.metrics, self.selected_path.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry if I am missing sth, but should this be the path for the specific connection that we pass in here, this seems like a generic one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a single selected path per ~~connection ~~remote endpoint. This is passed through here, because the metrics tracking code further down uses the RTT of the selected path as the connection's RTT. (There is no connection-level RTT exposed anymore with multipath on the quinn level).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why is it the same for all connections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the selected path logic works on the level of the remote endpoint, and the result is then applied to all connections.
This is present on main. It is not new, just ported from main to feat-multipath. See #3606 and #3491. Happy to not do this here and only the simple counters initially, but IIRC these are the metrics we are most interested in to compare feat-multipath to main via n0des? Correct me if I'm wrong please, @Arqu . |
112180e to
5bb726f
Compare
|
I've split this PR: The basic, simple counter metrics are now in #3672. This PR is now based on 3672 and adds back the congestion metrics. |
5bb726f to
281e84b
Compare
Description
Based on #3672
Add back some of the path congestion metrics added in #3491. @Arqu please check my port of these metrics and if the calculations are correct. I commented-out the metrics added in #3491 that I didn't know how to port. But I think I got the important ones.
Breaking Changes
Notes & open questions
Change checklist
quic-rpciroh-gossipiroh-blobsdumbpipesendme