Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Gossipsub] Add optional metrics #2235

Closed
wants to merge 13 commits into from
Closed

Conversation

AgeManning
Copy link
Contributor

@AgeManning AgeManning commented Sep 17, 2021

These adds a feature flag which enables a set of metrics for gossipsub which can be helpful in analysing the mesh behaviour and general health of the network.

This doesn't fit into the libp2p metrics framework (which relies on actual behaviour messages) instead we instrument various sections of the gossipsub code to report and update a general metrics struct.

The metrics keep track of various parameters, but importantly tracks events within the mesh, such as peer churn and causes for the churn. We also keep track of IWANT and duplicates being received which are important metrics for tuning gossipsub parameters for specific networks.

For those that don't compile with the new feature gossipsub-metrics Gossipsub should remain unchanged.

AgeManning and others added 3 commits September 13, 2021 08:43
* Added message count stuff

* Message count maps use MeshIndex instead of PeerId

* Added validated message count

* Count rejected messages too..

* simplified adding new counts significantly

* some cleanup and stop resetting counts

* Added more metrics and cleanup

* added peer slot to logging on score

* added more metrics for prune cases

* added churn_topic

* added debug logging for churn counts

* immutable borrow

* fixed log

* improved logging even more..

* ensure message counts structure exists

* Forgot to add prune case

* cleaned up debugging with macro

* fixed unsubscribed condition

* renamed some things for clarity

* Optimized Performance with Vec Instead of BTreeMap

* Updated comment for clarity

* removed redundant category and optimized

* fixed small edge case

* Unified slot-related HashMaps into 1 Structure

* Convert Fields to Array for Passing to Functions

* Added Method for Iterating over Metrics

* small fix

* ran cargo fmt

* ran clippy

* One last refactor using enums. Looking great now.

* Fixed bug with assert and improved error message

* fixed issue with release builds

* Make minimal changes to master

* Added tracking of assigning slot

* added slot_metrics_topics() & extra info to logs

* Made slot_metrics_topics() smarter

* First draft

* Update metric design

* Add new metric and cleanup

* Cleanup code & Remove Redundant map Lookups (#147)

* Simplify & Remove Redundant Map Lookups

* Cleanup code/Remove Redundant lookups in heartbeat

* Commented and renamed things for clarity

* More renaming / reorganizing for clarity

* Fixed clippy's unnecessary clone warning

* one last renaming..

Co-authored-by: Mark Mackey <ethereumdreamer@gmail.com>
Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Always in favor of more instrumentation. Sorry for the delay here.

Feature flag

I wonder whether the gossipsub-metrics and the metrics feature are needed. Off the top of my head, I would suggest putting all metrics related logic behind #[cfg(debug_assertions)]. In debug mode, no one would complain about the overhead of metric tracking. In release mode metric tracking would not happen. Noisiness of the metrics output in debug mode can be controlled via log levels.

Comparison to libp2p-metrics

This doesn't fit into the libp2p metrics framework

Correct. If I understand this pull request correctly, your goal is to be able to do ad-hoc debugging and not continuous monitoring, right? For the latter, I would very much like for libp2p-metrics to support gossipsub. Let me know in case you would like to collaborate on that.

As an example what is already possible today:

https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&var-data_source=Prometheus&var-instance=kademlia-exporter-ipfs%3A8080&var-instance=kademlia-exporter-kusama%3A8080&var-instance=kademlia-exporter-polkadot%3A8080&from=now-24h&to=now

/* For debug purposes
use env_logger::{Builder, Env};
// Add this line to relevant tests.
Builder::from_env(Env::default().default_filter_or("info")).init();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add let _ = env_logger::try_init(); as the first line of each test, like we do in most other crates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, i'll modify this in the next iteration, if we decide to go down the libp2p-metrics route

@AgeManning
Copy link
Contributor Author

Correct. If I understand this pull request correctly, your goal is to be able to do ad-hoc debugging and not continuous monitoring, right? For the latter, I would very much like for libp2p-metrics to support gossipsub. Let me know in case you would like to collaborate on that.

This is actually for optional continuous monitoring. I have set up a bunch of prometheus and grafana metrics for it also. When I skimmed the libp2p-metrics crate it looked like it was just attaching to messages sent to by the behaviour. I guess in order to incorporate these metrics, we'd have to add a bunch more enum types (which I guess can't hurt).

After looking at the kademlia metrics you have, I think something like that would be good for gossipsub, however i'm less interested in active connections more how the mesh and scoring is behaving. Some of the metrics added in this PR could probably be generalised to libp2p-metrics.

Anyway, yeah, would be good to collaborate, depending on how much time you have, otherwise, if you give me a very rough overview, i can go figure it out and integrate into libp2p-metrics.

@mxinden
Copy link
Member

mxinden commented Oct 6, 2021

After looking at the kademlia metrics you have, I think something like that would be good for gossipsub, however i'm less interested in active connections more how the mesh and scoring is behaving. Some of the metrics added in this PR could probably be generalised to libp2p-metrics.

Great. I think you know best what information is relevant. It would need to be exposed via GossipsubEvent.

Anyway, yeah, would be good to collaborate, depending on how much time you have, otherwise, if you give me a very rough overview, i can go figure it out and integrate into libp2p-metrics.

I created a first draft at #2277. Do you want to build on top of that?

Happy to help further. Also feel free to reach out with any Prometheus or Open Metrics questions. I have been part of the maintainer team of the Prometheus project in a past life.

@divagant-martian
Copy link
Contributor

divagant-martian commented Oct 6, 2021

Looking into the metrics that @AgeManning added I think those that would make sense to expose via a behaviour event are few. Maybe a middle ground would be to port these new internal metrics to open_metrics_client::registry::Registry under a feature flag (as in this PR). Not sure how to expose them to libp2p-metrics, but do you think it would be feasible somehow?

@AgeManning
Copy link
Contributor Author

I've had a bit of a look further into the libp2p-metrics and wanted to raise a few things.

This PR introduces a new concept of slots which helps track how a mesh is behaving. Essentially the mesh contains a fixed number of slots and peers can enter/leave for various reasons. We can track the churn and reasons for why peers are leaving the mesh to get an idea of how the scoring is working and general behaviour of peers in and out of our meshes.

The current design is to internally collect all this data and assemble it to some consumable form, which is currently exposed via gossipsub.metrics().

There's two things to consider I think:

  1. We dont really want all this metric overhead for users who do not care about monitoring the metrics. I went down a feature route because I didn't want to add any extra bloat to the ram in storing any state. However, it might be nicer to have these adjustable beyond compile-time and instead add them as part of the config. The metrics variable will still be initialized but will not be populated, I guess if the config is not set.
  2. Transitioning to the libp2p-metrics design is non-trivial and potentially sub-optimal I think. From what I can tell, we'd have to be firing quite a lot of events out from the behaviour (such as everytime a message is validated, or we receive a duplicate that gets filtered) which happen very often on a network. Then somewhere in the libp2p-metrics crate we'd probably have to aggregate this data and collect it back into a more succinct useable form. I guess its a matter of where the aggregation of data happens, it seems to make more intuitive sense to do it internally in the behaviour. I guess some of the metrics can work well (ones that happen less often, like broken promises). In summary I agree with @divagant-martian that potentially we dont want to expose all the events we are tracking in the new metrics.

Curious about your thoughts on these.

@mxinden
Copy link
Member

mxinden commented Oct 11, 2021

The current design is to internally collect all this data and assemble it to some consumable form, which is currently exposed via gossipsub.metrics().

What is your preferred way of exposing these metrics? Do you call gossipsub.metrics() and then pass the values to corresponding Prometheus metrics, where later on a Prometheus server scrapes some endpoint on the process? Or do you just log the values of gossipsub.metrics() to stderr?

There's two things to consider I think:

  1. We dont really want all this metric overhead for users who do not care about monitoring the metrics. I went down a feature route because I didn't want to add any extra bloat to the ram in storing any state. However, it might be nicer to have these adjustable beyond compile-time and instead add them as part of the config. The metrics variable will still be initialized but will not be populated, I guess if the config is not set.

👍 agreed. Also since Rust features are additive, thus running two gossipsub instances would either requiring metric collection in both or none.

  1. Transitioning to the libp2p-metrics design is non-trivial and potentially sub-optimal I think. From what I can tell, we'd have to be firing quite a lot of events out from the behaviour (such as everytime a message is validated, or we receive a duplicate that gets filtered) which happen very often on a network. Then somewhere in the libp2p-metrics crate we'd probably have to aggregate this data and collect it back into a more succinct useable form. I guess its a matter of where the aggregation of data happens, it seems to make more intuitive sense to do it internally in the behaviour. I guess some of the metrics can work well (ones that happen less often, like broken promises). In summary I agree with @divagant-martian that potentially we dont want to expose all the events we are tracking in the new metrics.

Agreed that bubbling up all the information via events is not feasible. Some middleground as suggested by @divagant-martian sounds like an approach worth exploring.

Looking into the metrics that @AgeManning added I think those that would make sense to expose via a behaviour event are few. Maybe a middle ground would be to port these new internal metrics to open_metrics_client::registry::Registry under a feature flag (as in this PR). Not sure how to expose them to libp2p-metrics, but do you think it would be feasible somehow?

I am in favor of this approach. GossipSub::new could take an Option<&mut Registry>. Users could then do:

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);

@divagant-martian
Copy link
Contributor

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);

This would require to make Gossipsub dependent on a non-static lifetime, making the type itself no longer static, which conflicts with the requirements of NetworkBehaviour: 'static. Working on an alternative

@mxinden
Copy link
Member

mxinden commented Oct 13, 2021

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);

This would require to make Gossipsub dependent on a non-static lifetime, making the type itself no longer static, which conflicts with the requirements of NetworkBehaviour: 'static. Working on an alternative

GossipSub::new would not capture the reference to Registry, but instead, only register the corresponding metrics and then drop the reference. Thus, there is no change to the lifetime requirements of the GossipSub struct itself.

Does that make sense @divagant-martian?

@divagant-martian
Copy link
Contributor

Yes, noted too late. Thanks 👍

@AgeManning
Copy link
Contributor Author

Closing in favour of #2346

@AgeManning AgeManning closed this Dec 13, 2021
@AgeManning AgeManning deleted the gossip-inspect branch February 15, 2022 05:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants