Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add stagedsync headers metrics #498

Merged
merged 37 commits into from
Dec 22, 2022

Conversation

entropidelic
Copy link
Contributor

@entropidelic entropidelic commented Dec 16, 2022

This PR adds the following metrics:

  • stages.headers.counter: Number of headers successfully retrived
  • stages.headers.timeout_error: Number of timeout errors while requesting headers
  • stages.headers.validation_errors: Number of validation errores while requesting headers
  • stages.headers.unexpected_errors: Number of unexpected errors while requesting headers
  • stages.headers.request_time: Elapsed time of successful header requests

The prometheus exporter is modularized intro its own module. Also, a description for the metrics is added and the metrics docs is updated.

@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2022

Codecov Report

Merging #498 (5b27e7f) into main (80b34a9) will increase coverage by 0.36%.
The diff coverage is 37.50%.

@@            Coverage Diff             @@
##             main     #498      +/-   ##
==========================================
+ Coverage   72.53%   72.90%   +0.36%     
==========================================
  Files         243      247       +4     
  Lines       24584    24688     +104     
==========================================
+ Hits        17833    17998     +165     
+ Misses       6751     6690      -61     
Impacted Files Coverage Δ
bin/reth/src/lib.rs 100.00% <ø> (ø)
bin/reth/src/metrics_describer.rs 0.00% <0.00%> (ø)
bin/reth/src/node/mod.rs 0.00% <0.00%> (ø)
bin/reth/src/prometheus_exporter.rs 0.00% <0.00%> (ø)
crates/stages/src/lib.rs 100.00% <ø> (ø)
crates/stages/src/stages_metrics_describer.rs 0.00% <0.00%> (ø)
crates/stages/src/stages/headers.rs 97.68% <100.00%> (+0.03%) ⬆️
crates/stages/src/stages_metrics.rs 100.00% <100.00%> (ø)
crates/net/network/src/state.rs 51.03% <0.00%> (ø)
... and 13 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@entropidelic entropidelic marked this pull request as ready for review December 16, 2022 22:03
@Rjected Rjected added A-staged-sync Related to staged sync (pipelines and stages) C-enhancement New feature or request labels Dec 16, 2022
Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm supportive of this in general, but defer to @onbjerg here.

what are the implications if we run this without prometheus? are these just no-ops then?

use metrics_exporter_prometheus::PrometheusBuilder;
use metrics_util::layers::{PrefixLayer, Stack};

pub(crate) fn initialize_prometheus_exporter() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of panicking here, let's add a try_ option as well, we should have eyre as error handling here so letting this return a result instead is nicer.

Comment on lines 99 to 95
histogram!("stages.headers.request_time", start.elapsed());
counter!("stages.headers.counter", res.len() as u64);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably want these metrics, but it would be nicer to extract this to a separate call, so it doesn't clutter this function here.

@@ -107,20 +113,23 @@ impl<DB: Database, D: HeaderDownloader, C: Consensus, H: HeadersClient, S: Statu
}
Err(e) => match e {
DownloadError::Timeout => {
increment_counter!("stages.headers.timeout_errors");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


use metrics::{describe_counter, describe_histogram};

pub(crate) fn describe_metrics() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add brief, one-line, docs here.


pub(crate) fn describe_metrics() {
// Describe stagedsync headers metrics
describe_counter!("stages.headers.counter", "Number of headers successfully retrived");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not entirely sure if this should be here. we should find a way to describe the metrics in the crates they belong to for maximum reusability

@@ -92,6 +95,9 @@ impl<DB: Database, D: HeaderDownloader, C: Consensus, H: HeadersClient, S: Statu
while let Some(headers) = stream.next().await {
match headers.into_iter().collect::<Result<Vec<_>, _>>() {
Ok(res) => {
// Register elapsed time of request
histogram!("stages.headers.request_time", start.elapsed());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably be in the downloader itself since we might replace this with a concurrent downloader (and in that case this metric is wrong)

@entropidelic
Copy link
Contributor Author

thank you @mattsse and @onbjerg for the feeback. The PR is now ready for a second review.

I removed the headers request time metric for now, we will be addressing that in another PR, directly in the downloader itself as suggested by @onbjerg

Copy link
Member

@gakonst gakonst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this - let's just inline the functions so we don't pay overheads when the macro is not enabled

use metrics::{counter, increment_counter};
use reth_interfaces::p2p::error::DownloadError;

pub(crate) fn update_headers_metrics(n_headers: u64) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we inline(always) these calls?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of individual calls, I'd rather have a StageMetrics struct with Gauge/Counters as fields that are mapped to the registered metrics. And this struct then encapsulates all the metrics that we have, this is easier to maintain than standalone metric functions.

I'm not sure how the exact workflow looks like in metrics but imo it should be possible to register them and create a Gauge/Counter for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if this sounds good:
I will implement a struct HeaderMetrics like so,

struct HeaderMetrics {
    pub headers_counter: Counter,
    pub headers_timeout_errors: Counter,
    pub headers_validation_errors: Counter,
    pub headers_unexpected_errors: Counter,
}

This struct could be initialized inside the HeaderStage, in a field metrics and when a metric should be updated, e.g. the headers counter, in the code this could be done with

self.metrics.headers_counter.increment(res.len() as u64);

Is this consitent with your suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds right. @mattsse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored the metrics to encapsulate them in the HeaderMetrics struct now. Let me know what you think @mattsse!

@@ -53,6 +56,14 @@ How the metrics are exposed to the end-user is determined by the CLI.

[^1]: The top-level namespace is added by the CLI using [`metrics_util::layers::PrefixLayer`][metrics_util.PrefixLayer].

### Current metrics
#### StagedSync Headers
- `stages.headers.counter`: Number of headers successfully retrived
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `stages.headers.counter`: Number of headers successfully retrived
- `stages.headers.counter`: Number of headers successfully retrieved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

use metrics_util::layers::{PrefixLayer, Stack};
use std::net::SocketAddr;

pub(crate) fn initialize(listen_addr: SocketAddr) -> eyre::Result<()> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually, the exporter will be spawned with our TaskManager.

I'd like a separate function that returns the recorder and exporter.

perhaps even a type that bundles those and allows to configure the Stack, like a builder or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be ok to create a separate issue for this and tackle it in another PR?

use metrics::{counter, increment_counter};
use reth_interfaces::p2p::error::DownloadError;

pub(crate) fn update_headers_metrics(n_headers: u64) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of individual calls, I'd rather have a StageMetrics struct with Gauge/Counters as fields that are mapped to the registered metrics. And this struct then encapsulates all the metrics that we have, this is easier to maintain than standalone metric functions.

I'm not sure how the exact workflow looks like in metrics but imo it should be possible to register them and create a Gauge/Counter for it.

Copy link
Member

@rkrasiuk rkrasiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work overall! Supportive of @mattsse's suggestion to use structs for metrics.

In the future, we might want to start exporting the time spent on downloading each batch

docs/design/metrics.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smol nit

Comment on lines 19 to 26
pub fn new() -> Self {
Self {
headers_counter: register_counter!("stages.headers.counter"),
headers_timeout_errors: register_counter!("stages.headers.timeout_errors"),
headers_validation_errors: register_counter!("stages.headers.validation_errors"),
headers_unexpected_errors: register_counter!("stages.headers.unexpected_errors"),
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant, new, move to default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

Copy link
Collaborator

@mattsse mattsse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just needs rebase

@mattsse
Copy link
Collaborator

mattsse commented Dec 22, 2022

final blocker:

https://github.com/paradigmxyz/reth/actions/runs/3758567946/jobs/6387061410

replace new with default

@mattsse mattsse merged commit b12939d into paradigmxyz:main Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-staged-sync Related to staged sync (pipelines and stages) C-enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants