This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

R4R: prometheus exporter in substrate #4511

Closed

Conversation

nodebreaker0-0
Contributor

@nodebreaker0-0 nodebreaker0-0 commented Dec 30, 2019

We are B-Harvest, working on a W3F-granted project
(https://github.com/w3f/Web3-collaboration/blob/master/grants/speculative/substrate%20x%20(prometheus%20%2B%20grafana)%20by%20B-Harvest.md).

This PR is a draft implementation; we are requesting a review from the Parity team so it can be merged. We will make the necessary modifications if issues are raised.

Please refer to README.md for detailed function information.

Created anew due to an error in #4505.

@parity-cla-bot

It looks like @nodebreaker0-0 signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@expenses
Contributor

expenses commented Jan 2, 2020

Hi, thanks for your PR! Have you considered integrating these changes with the existing metrics recording set up? #4187

@expenses
Contributor

expenses commented Jan 2, 2020

To expand, it would be nice to be able to do

record_metrics!(
    a => 1,
    b => 2,
    ...
)

and have it go to either the built-in metrics server or the Prometheus server depending on a --metrics-server flag or something similar.

@nodebreaker0-0
Contributor Author

Integrating with the metrics server is a good idea, but your code would need to include Prometheus, and a bit more testing is needed because of the wasm and no_std compatibility issues.

The reason for using Prometheus is Grafana's extensibility, and to use its functionality completely you must use Prometheus code such as prometheus::register. Let's consider whether the basic metrics received through record_metrics! can be integrated.

But the heart of the project is v0.4, which will be integrated later:
https://github.com/nodebreaker0-0/substrate/blob/prometheus_v0.4/utils/prometheus/src/expansion.rs

Further consideration is needed for integrations that take extended metrics into account:
https://github.com/nodebreaker0-0/substrate/blob/prometheus_v0.4/utils/prometheus/README.md#seting-grafana

First, if the code runs without any problems, I think we should consider metrics and macro integration.

@nodebreaker0-0 nodebreaker0-0 changed the title WIP: prometheus exporter in substrate R4R: prometheus exporter in substrate Jan 2, 2020
@expenses
Contributor

expenses commented Jan 3, 2020

Your code must include Prometheus, and you will need to test a bit more because of the wasm and no_std compatibility issues.

This shouldn't be a problem, as I think that the metrics recording will only happen outside the runtime.

@Hyung-bharvest

Your code must include Prometheus, and you will need to test a bit more because of the wasm and no_std compatibility issues.

This shouldn't be a problem, as I think that the metrics recording will only happen outside the runtime.

Hi. I think the metric-serving part should be separated from grafana-data-source so that we can connect our metrics to the separated module. If you can separate it soon (within two weeks from now; we have already spent too much time on this project), we can migrate our metric part to the newly separated metric-serving module.

@gavofyork gavofyork added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jan 3, 2020
Contributor

@mxinden mxinden left a comment

I am one of the maintainers of the Prometheus project, thus naturally I am very happy about this effort.

I don't think we should expose both a Grafana endpoint and a Prometheus endpoint. Given that Grafana can connect to a Prometheus server, and given that the Prometheus metric format is adopted by many software projects and is pretty much the industry standard, I would prefer a Prometheus endpoint. But before we rush anything, this should be properly discussed.

Resolved review threads: client/service/src/config.rs; utils/prometheus/README.md (×4); utils/prometheus/src/lib.rs (×4); utils/prometheus/src/metrics.rs
Contributor

@expenses expenses left a comment

Looks decent in general. A few nitpicks and things to tidy up around gauge creation etc.

Resolved review threads: client/service/src/builder.rs; utils/prometheus/Cargo.toml; utils/prometheus/src/lib.rs (×3); utils/prometheus/src/metrics.rs (×5)
@expenses
Contributor

I've made a PR on this fork with a few changes :^) nodebreaker0-0#2

log::warn!("Unable to send metrics to grafana: {:?}", e);
}
fn send_prometheus(span_datum: SpanDatum) {
unimplemented!()
}
Contributor

@mxinden do you know of a way that we could implement this API? It seems like Prometheus gauges etc. are intended to be created as static refs.

Contributor Author

My guess is that you can't implement it without creating a macro.

Or you can simply make a static reference.
https://github.com/tikv/rust-prometheus/blob/master/examples/example_edition_2018.rs

Contributor

@mattrutherford mattrutherford Jan 14, 2020

I don't yet understand fully how prometheus works, but to properly support this I think we would need to be able to:

  • batch together all values (spans) collected since last scrape
  • partition by target and name (by the use of labels perhaps).

With all targets currently in the code enabled, we can be approaching 200 measurements per second just across the runtime. Is this feasible to do while preserving timestamps (which we would need to explicitly provide for this receiver)?

Contributor

Hmm. I'll have a go at this later.

Contributor

be approaching 200 measurements

@mattrutherford 200 measurements are not an issue for Prometheus. Prometheus can easily handle a couple of megabytes of metric data per scrape. See kubernetes/kube-state-metrics#498 for some numbers.

batch together all values (spans) collected since last scrape

This sounds more like we want a Histogram, right?


Would it be possible to tackle the effort within client/tracing as a follow up pull request? I would like to keep this one small. What do you think @mattrutherford @expenses @nodebreaker0-0?

Contributor

I agree it's good to be clear on terminology, but that wasn't my main concern. My concern is that if we're ripping out the Grafana server as part of introducing this Prometheus feature, we'd lose the ability to plot individual observations*; however, that seems unavoidable based on what you said.

*For my use-case it's OK, because I use the Telemetry receiver in substrate-tracing to send the data to substrate-analytics, which uses a PostgreSQL datasource that can in turn be queried by Grafana (because I want to archive the data). However, this is not trivial to set up, so the question is: do we really want to kill the substrate Grafana server yet? It's a relatively new feature, so maybe not many people use it, but I think it's something we should consider as part of our decision to remove it, particularly since the current primary use-case for substrate-tracing is profiling, and an easy way to look at that might be handy for developers.

Pinging @marcio-diaz @shawntabrizi @DarkEld3r in case you have opinion on this

Contributor

Imo the Grafana server in its current state isn't really good at doing anything. If Prometheus isn't going to work for tracing, we should probably find something that is. I'm happy enough to leave the Grafana server in for the moment until that happens.

Contributor

OK, good to know. Just to be clear, I have no problem removing it if it's not going to be a big detriment to people; I just wanted to make sure everyone was aware of the full implications.

@Hyung-bharvest Hyung-bharvest Jan 20, 2020

I think tracing other people's nodes is not (and should not be) the purpose of the Prometheus exporter. It is purely for monitoring an operator's own nodes. Extracting information from nodes operated by others does not sound right from an information-privacy perspective.

I guess tracing functionality should remain in telemetry, and it should also be off by default for the privacy of node operators.

And storing and querying historical data for one's own nodes can be done on the monitoring server (Grafana); it should not burden the node itself.

Please let me know if I misunderstood this context.

Contributor

@mattrutherford mattrutherford Jan 23, 2020

OK, I can see what the misunderstanding is here, and it warrants some more explanation. Tracing only captures local data, which we then have the option to send somewhere: Log (output tracing data via the logger), the built-in Grafana server, or Telemetry. When using telemetry to send the tracing data, it is not expected (or desired) to send to the default telemetry URL, so we override that via the CLI, e.g. --telemetry-url 'ws://localhost:8080 9', to send the data to an analytics server (often running on the same machine, but it can be anywhere you want).

@nodebreaker0-0
Contributor Author

Fixed input validation on /metrics and enabled --no-prometheus.

Contributor

@mxinden mxinden left a comment

Thanks for introducing the --no-prometheus flag and restricting exposition to /metrics!

.map_err(Error::Http)
} else {
Response::builder()
.status(StatusCode::NOT_FOUND)
Contributor

This line and the ones below would need to be indented. See Response::builder().status(...) above.

Contributor Author

complete

Resolved review thread: client/tracing/src/lib.rs
#[structopt(long = "prometheus-port", value_name = "PORT")]
pub prometheus_port: Option<u16>,

/// Disable connecting to the Substrate prometheus.
Contributor

Suggested change
/// Disable connecting to the Substrate prometheus.
/// Do not expose a Prometheus metric endpoint.

Contributor

One does not disable others to connect to it, but the exposition in itself.

Contributor Author

complete


/// Disable connecting to the Substrate prometheus.
///
/// prometheus is on by default on global chains.
Contributor

Suggested change
/// prometheus is on by default on global chains.
/// Prometheus metric endpoint is enabled by default.

Contributor Author

complete

@mxinden
Contributor

mxinden commented Feb 10, 2020

As far as I can tell ccb3179 is missing the following change in order to compile:

diff --git a/client/cli/src/params.rs b/client/cli/src/params.rs
index 128c66b30..1a4b28443 100644
--- a/client/cli/src/params.rs
+++ b/client/cli/src/params.rs
@@ -299,7 +299,6 @@ arg_enum! {
        pub enum TracingReceiver {
                Log,
                Telemetry,
-               Prometheus,
        }
 }
 
@@ -308,7 +307,6 @@ impl Into<sc_tracing::TracingReceiver> for TracingReceiver {
                match self {
                        TracingReceiver::Log => sc_tracing::TracingReceiver::Log,
                        TracingReceiver::Telemetry => sc_tracing::TracingReceiver::Telemetry,
-                       TracingReceiver::Prometheus => sc_tracing::TracingReceiver::Prometheus,
                }
        }
 }

Contributor

@mxinden mxinden left a comment

This looks good to me once:

  1. The branch compiles (see small diff above).

  2. The merge conflicts have been addressed.

Thanks for the hard work and again thanks for bearing with us.

@nodebreaker0-0
Contributor Author

This looks good to me once:

  1. The branch compiles (see small diff above).
  2. The merge conflicts have been addressed.

Thanks for the hard work and again thanks for bearing with us.

Done

Contributor

@mxinden mxinden left a comment

This is good to go from my side. (One small comment, but that can be deferred to a follow up.)

@expenses do you have any further comments?


let _ = to_spawn_tx.unbounded_send((
Box::pin(future),
From::from("prometheus-on-block")
Contributor

I am not sure what "prometheus-on-block" relates to. I would expect something like prometheus-server or prometheus-endpoint. Can you explain your thoughts?

Contributor Author

I didn't fully understand what this was, so I copied it from Telemetry, but I think it should be prometheus-endpoint.

Contributor

@expenses expenses left a comment

LGTM! I have a few nitpicks around style (missing newlines at EOF etc) but I don't want to delay the PR further.

@NikVolf
Contributor

NikVolf commented Feb 17, 2020

@nodebreaker0-0 you need to update Cargo.lock so that CI is happy.

run cargo test --all and commit the diff

@nodebreaker0-0
Contributor Author

Really done!
Thank you for joining me, see you again at Milestone 2.

@gnunicorn
Contributor

please merge master and we are ready to merge.

@gnunicorn gnunicorn added A7-looksgoodcantmerge and removed A3-in_progress Pull request is in progress. No review needed at this stage. labels Feb 18, 2020
@mxinden
Contributor

mxinden commented Feb 19, 2020

Closing here in favor of #4981 which includes the required master merge.

I am hoping to get #4981 into master before the code freeze on Friday.

@mxinden mxinden closed this Feb 19, 2020