Merged
53 commits
8e6dc0b
add namespace to profiling
rodrigo-o Nov 7, 2025
5b5e3d6
initial testing of instumentation
rodrigo-o Nov 10, 2025
042b266
remove rpc namespaced instrumentation from breakdown panels
rodrigo-o Nov 10, 2025
e873b85
initial rpc dashboards
rodrigo-o Nov 10, 2025
cc97128
made some organization changes in the dashboard
rodrigo-o Nov 10, 2025
14dbeaf
make rpc dashboard repeatable by instance
rodrigo-o Nov 10, 2025
01bab1e
Merge branch 'main' into rpc-instrumentation-and-panels
rodrigo-o Nov 11, 2025
6dc47b9
separate Engine API from RPC API in instrumentation
rodrigo-o Nov 11, 2025
d0885bb
change total sum by average in pie charts of rpc
rodrigo-o Nov 11, 2025
6f82153
move from increase to rate for calculating the avg
rodrigo-o Nov 11, 2025
1b6cefa
go without any increase or rate
rodrigo-o Nov 11, 2025
c7d20d6
rework panels for engine api
rodrigo-o Nov 11, 2025
fe92933
reworked the RPC api panels
rodrigo-o Nov 11, 2025
bc00add
Added instances to the rpc and engine panels
rodrigo-o Nov 11, 2025
86fccd8
remove unintended overrides
rodrigo-o Nov 11, 2025
5968fd3
remove intemediary doc
rodrigo-o Nov 12, 2025
f75a0ac
initial rework of the instrumentation to avoid complex observers in p…
rodrigo-o Nov 12, 2025
6c4a9f5
go back to use just call to avoid most of rpc.rs changes made
rodrigo-o Nov 12, 2025
024b3f7
Add namespace to profiling and explicitly set it up in the previous i…
rodrigo-o Nov 12, 2025
c51eeea
fixed an issue with the way namespace is send through #instrument
rodrigo-o Nov 12, 2025
35d9cbd
Merge branch 'main' into rpc-instrumentation-and-panels
rodrigo-o Nov 13, 2025
28ec41f
enhanced record_debug and remove magic namespacing given targets
rodrigo-o Nov 13, 2025
6e5b041
format
rodrigo-o Nov 13, 2025
f6a401b
apply suggestions
rodrigo-o Nov 13, 2025
cd1534d
missed suggestions
rodrigo-o Nov 13, 2025
2948d97
updated cargo lock for l2 quote-gen
rodrigo-o Nov 13, 2025
39bd789
remove Cow given that it was already cloning
rodrigo-o Nov 13, 2025
5b167ab
Add comment about why we use String instead of &'static str in the na…
rodrigo-o Nov 13, 2025
7bece1b
Add rpc results and refactor rpc instrumentation to be aoutside of pr…
rodrigo-o Nov 13, 2025
20c57b3
first error rates panel addition
rodrigo-o Nov 13, 2025
609318b
Enhancement to panel placing across the RPC and engine rows
rodrigo-o Nov 13, 2025
b1a170e
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 13, 2025
ca2d22b
move some panels from rpc and engine from 5m to instead
rodrigo-o Nov 13, 2025
e890efa
limit the life of exetension to the let timer block
rodrigo-o Nov 14, 2025
f569eaf
Merge branch 'rpc-instrumentation-and-panels' into rpc_error_rates
rodrigo-o Nov 14, 2025
91223a7
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 17, 2025
bb58393
updated references to rpc_request instead of the oldi functions
rodrigo-o Nov 17, 2025
87b639c
Readded engine success/error rate and fixed some dashboard issues
rodrigo-o Nov 17, 2025
12b9b28
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 17, 2025
d4d70a1
removed registry and moved the function to mod
rodrigo-o Nov 17, 2025
e17ef12
fix an unnecesary diff
rodrigo-o Nov 17, 2025
7f34016
small change for consistency
rodrigo-o Nov 17, 2025
b609274
simplify record_async_duration documentation
rodrigo-o Nov 17, 2025
8fb43a8
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 18, 2025
0b6f697
Change function_name for method to avoid changes in future PRs
rodrigo-o Nov 18, 2025
73a355a
Change title to reflect panels correctly
rodrigo-o Nov 18, 2025
1329b03
updated documentation
rodrigo-o Nov 18, 2025
cdf40a3
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 19, 2025
2ffb1cf
Merge branch 'main' into rpc_error_rates
rodrigo-o Nov 20, 2025
20b971a
fix after merge
rodrigo-o Nov 20, 2025
ea64d05
formatting
rodrigo-o Nov 20, 2025
108536d
chore(l1): enhance error rate panels and promote them to its own row …
rodrigo-o Nov 21, 2025
9611ee5
use static str as error
rodrigo-o Nov 24, 2025
2 changes: 2 additions & 0 deletions cmd/ethrex/initializers.rs
@@ -11,6 +11,7 @@ use ethrex_common::types::Genesis;
 use ethrex_config::networks::Network;

 use ethrex_metrics::profiling::{FunctionProfilingLayer, initialize_block_processing_profile};
+use ethrex_metrics::rpc::initialize_rpc_metrics;
 use ethrex_p2p::rlpx::initiator::RLPxInitiator;
 use ethrex_p2p::{
     discv4::peer_table::PeerTable,
@@ -89,6 +90,7 @@ pub fn init_metrics(opts: &Options, tracker: TaskTracker) {
     );

     initialize_block_processing_profile();
+    initialize_rpc_metrics();

     tracker.spawn(metrics_api);
 }
9 changes: 4 additions & 5 deletions crates/blockchain/metrics/api.rs
@@ -1,9 +1,8 @@
 use axum::{Router, routing::get};

-use crate::profiling::gather_profiling_metrics;
-
 use crate::{
-    MetricsApiError, blocks::METRICS_BLOCKS, process::METRICS_PROCESS, transactions::METRICS_TX,
+    MetricsApiError, blocks::METRICS_BLOCKS, gather_default_metrics, process::METRICS_PROCESS,
+    transactions::METRICS_TX,
 };

 pub async fn start_prometheus_metrics_api(
@@ -32,10 +31,10 @@ pub(crate) async fn get_metrics() -> String {
     };

     ret_string.push('\n');
-    match gather_profiling_metrics() {
+    match gather_default_metrics() {
         Ok(string) => ret_string.push_str(&string),
         Err(_) => {
-            tracing::error!("Failed to register METRICS_PROFILING");
+            tracing::error!("Failed to gather default Prometheus metrics");
             return String::new();
         }
     };
23 changes: 23 additions & 0 deletions crates/blockchain/metrics/mod.rs
@@ -8,6 +8,8 @@ pub mod l2;
 pub mod process;
 #[cfg(feature = "api")]
 pub mod profiling;
+#[cfg(feature = "api")]
+pub mod rpc;
 #[cfg(any(feature = "api", feature = "transactions"))]
 pub mod transactions;

@@ -70,3 +72,24 @@ pub enum MetricsError {
     #[error("MetricsL2Error {0}")]
     FromUtf8Error(#[from] std::string::FromUtf8Error),
 }
+
+#[cfg(feature = "api")]
+/// Returns all metrics currently registered in Prometheus' default registry.
+///
+/// Both profiling and RPC metrics register with this default registry, and the
+/// metrics API surfaces them by calling this helper.
+pub fn gather_default_metrics() -> Result<String, MetricsError> {
+    use prometheus::{Encoder, TextEncoder};
+
+    let encoder = TextEncoder::new();
+    let metric_families = prometheus::gather();
+
+    let mut buffer = Vec::new();
+    encoder
+        .encode(&metric_families, &mut buffer)
+        .map_err(|e| MetricsError::PrometheusErr(e.to_string()))?;
+
+    let res = String::from_utf8(buffer)?;
+
+    Ok(res)
+}
48 changes: 5 additions & 43 deletions crates/blockchain/metrics/profiling.rs
@@ -1,17 +1,18 @@
-use prometheus::{Encoder, HistogramTimer, HistogramVec, TextEncoder, register_histogram_vec};
-use std::{future::Future, sync::LazyLock};
+use prometheus::{HistogramTimer, HistogramVec, register_histogram_vec};
+use std::sync::LazyLock;
 use tracing::{
     Subscriber,
     field::{Field, Visit},
     span::{Attributes, Id},
 };
 use tracing_subscriber::{Layer, layer::Context, registry::LookupSpan};

-use crate::MetricsError;
-
 pub static METRICS_BLOCK_PROCESSING_PROFILE: LazyLock<HistogramVec> =
     LazyLock::new(initialize_histogram_vec);

+// Metrics defined in this module register into the Prometheus default registry.
+// The metrics API exposes them by calling `gather_default_metrics()`.
+
 fn initialize_histogram_vec() -> HistogramVec {
     register_histogram_vec!(
         "function_duration_seconds",
@@ -111,45 +112,6 @@ where
     }
 }

-/// Records the duration of an async operation in the function profiling histogram.
-///
-/// This provides a lightweight alternative to the `#[instrument]` attribute when you need
-/// manual control over timing instrumentation, such as in RPC handlers.
-///
-/// # Parameters
-/// * `namespace` - Category for the metric (e.g., "rpc", "engine", "block_execution")
-/// * `function_name` - Name identifier for the operation being timed
-/// * `future` - The async operation to time
-///
-/// Use this function when you need to instrument an async operation for duration metrics,
-/// but cannot or do not want to use the `#[instrument]` attribute (for example, in RPC handlers).
-pub async fn record_async_duration<Fut, T>(namespace: &str, function_name: &str, future: Fut) -> T
-where
-    Fut: Future<Output = T>,
-{
-    let timer = METRICS_BLOCK_PROCESSING_PROFILE
-        .with_label_values(&[namespace, function_name])
-        .start_timer();
-
-    let output = future.await;
-    timer.observe_duration();
-    output
-}
-
-pub fn gather_profiling_metrics() -> Result<String, MetricsError> {
-    let encoder = TextEncoder::new();
-    let metric_families = prometheus::gather();
-
-    let mut buffer = Vec::new();
-    encoder
-        .encode(&metric_families, &mut buffer)
-        .map_err(|e| MetricsError::PrometheusErr(e.to_string()))?;
-
-    let res = String::from_utf8(buffer)?;
-
-    Ok(res)
-}
-
 pub fn initialize_block_processing_profile() {
     METRICS_BLOCK_PROCESSING_PROFILE.reset();
 }
85 changes: 85 additions & 0 deletions crates/blockchain/metrics/rpc.rs
@@ -0,0 +1,85 @@
+use prometheus::{CounterVec, HistogramVec, register_counter_vec, register_histogram_vec};
+use std::{future::Future, sync::LazyLock};
+
+pub static METRICS_RPC_REQUEST_OUTCOMES: LazyLock<CounterVec> =
+    LazyLock::new(initialize_rpc_outcomes_counter);
+
+pub static METRICS_RPC_DURATION: LazyLock<HistogramVec> =
+    LazyLock::new(initialize_rpc_duration_histogram);
+
+// Metrics defined in this module register into the Prometheus default registry.
+// The metrics API exposes them by calling `gather_default_metrics()`.
+
+fn initialize_rpc_outcomes_counter() -> CounterVec {
+    register_counter_vec!(
+        "rpc_requests_total",
+        "Total number of RPC requests partitioned by namespace, method, and outcome",
+        &["namespace", "method", "outcome", "error_kind"],
+    )
+    .unwrap()
+}
+
+fn initialize_rpc_duration_histogram() -> HistogramVec {
+    register_histogram_vec!(
+        "rpc_request_duration_seconds",
+        "Histogram of RPC request handling duration partitioned by namespace and method",
+        &["namespace", "method"],
+    )
+    .unwrap()
+}
+
+/// Represents the outcome of an RPC request when recording metrics.
+#[derive(Clone)]
+pub enum RpcOutcome {
+    Success,
+    Error(&'static str),
+}
+
+impl RpcOutcome {
+    fn as_label(&self) -> &'static str {
+        match self {
+            RpcOutcome::Success => "success",
+            RpcOutcome::Error(_) => "error",
+        }
+    }
+
+    fn error_kind(&self) -> &str {
+        match self {
+            RpcOutcome::Success => "",
+            RpcOutcome::Error(kind) => kind,
+        }
+    }
+}
+
+pub fn record_rpc_outcome(namespace: &str, method: &str, outcome: RpcOutcome) {
+    METRICS_RPC_REQUEST_OUTCOMES
+        .with_label_values(&[namespace, method, outcome.as_label(), outcome.error_kind()])
+        .inc();
+}
+
+pub fn initialize_rpc_metrics() {
+    METRICS_RPC_REQUEST_OUTCOMES.reset();
+    METRICS_RPC_DURATION.reset();
+}
+
+/// Records the duration of an async operation in the RPC request duration histogram.
+///
+/// This provides a lightweight alternative to the `#[instrument]` attribute.
+///
+/// # Parameters
+/// * `namespace` - Category for the metric (e.g., "rpc", "engine", "block_execution")
+/// * `method` - Name identifier for the operation being timed
+/// * `future` - The async operation to time
+///
+pub async fn record_async_duration<Fut, T>(namespace: &str, method: &str, future: Fut) -> T
+where
+    Fut: Future<Output = T>,
+{
+    let timer = METRICS_RPC_DURATION
+        .with_label_values(&[namespace, method])
+        .start_timer();
+
+    let output = future.await;
+    timer.observe_duration();
+    output
+}
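The wrapper above times a future by starting a timer, awaiting the future, and recording the elapsed time before handing the output back unchanged. A minimal std-only sketch of that pattern follows. It is not the PR's code: `DurationLog`, `timed`, `NoopWaker`, and `block_on` are hypothetical names, a plain `Vec` stands in for the Prometheus histogram, the method name is illustrative, and the tiny busy-polling executor exists only so the example runs without an async runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the histogram: raw samples keyed by
/// (namespace, method).
#[derive(Default)]
struct DurationLog {
    samples: Mutex<Vec<(String, String, Duration)>>,
}

/// Awaits `future`, records how long it took, and returns its output as-is.
async fn timed<Fut, T>(log: &DurationLog, namespace: &str, method: &str, future: Fut) -> T
where
    Fut: Future<Output = T>,
{
    let start = Instant::now();
    let output = future.await;
    log.samples
        .lock()
        .unwrap()
        .push((namespace.to_string(), method.to_string(), start.elapsed()));
    output
}

/// Waker that does nothing; sufficient for the busy-polling loop below.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor that polls in a loop until the future is ready.
/// Only meant to keep this sketch free of external crates.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = pin!(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let log = DurationLog::default();
    // The handler result passes through the wrapper untouched, so the caller
    // can still inspect it afterwards (for example to classify errors).
    let result = block_on(timed(&log, "rpc", "eth_blockNumber", async {
        Ok::<u64, &'static str>(42)
    }));
    println!("result = {:?}", result);
    println!("samples recorded = {}", log.samples.lock().unwrap().len());
}
```

Because the wrapper is generic over the future's output, it works for any handler return type and does not need to know about success or failure.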
42 changes: 37 additions & 5 deletions crates/networking/rpc/rpc.rs
@@ -55,7 +55,7 @@ use bytes::Bytes;
 use ethrex_blockchain::Blockchain;
 use ethrex_blockchain::error::ChainError;
 use ethrex_common::types::Block;
-use ethrex_metrics::profiling::record_async_duration;
+use ethrex_metrics::rpc::{RpcOutcome, record_async_duration, record_rpc_outcome};
 use ethrex_p2p::peer_handler::PeerHandler;
 use ethrex_p2p::sync_manager::SyncManager;
 use ethrex_p2p::types::Node;
@@ -196,16 +196,48 @@ pub trait RpcHandler: Sized {
             Ok(RpcNamespace::Engine) => "engine",
             _ => "rpc",
         };
+        let method = req.method.as_str();
+
+        let result =
+            record_async_duration(
+                namespace,
+                method,
+                async move { request.handle(context).await },
+            )
+            .await;
+
+        let outcome = match &result {
+            Ok(_) => RpcOutcome::Success,
+            Err(err) => RpcOutcome::Error(get_error_kind(err)),
+        };
+        record_rpc_outcome(namespace, method, outcome);

-        record_async_duration(namespace, req.method.as_str(), async move {
-            request.handle(context).await
-        })
-        .await
+        result
     }

     async fn handle(&self, context: RpcApiContext) -> Result<Value, RpcErr>;
 }

+fn get_error_kind(err: &RpcErr) -> &'static str {
+    match err {
+        RpcErr::MethodNotFound(_) => "MethodNotFound",
+        RpcErr::WrongParam(_) => "WrongParam",
+        RpcErr::BadParams(_) => "BadParams",
+        RpcErr::MissingParam(_) => "MissingParam",
+        RpcErr::TooLargeRequest => "TooLargeRequest",
+        RpcErr::BadHexFormat(_) => "BadHexFormat",
+        RpcErr::UnsuportedFork(_) => "UnsuportedFork",
+        RpcErr::Internal(_) => "Internal",
+        RpcErr::Vm(_) => "Vm",
+        RpcErr::Revert { .. } => "Revert",
+        RpcErr::Halt { .. } => "Halt",
+        RpcErr::AuthenticationError(_) => "AuthenticationError",
+        RpcErr::InvalidForkChoiceState(_) => "InvalidForkChoiceState",
+        RpcErr::InvalidPayloadAttributes(_) => "InvalidPayloadAttributes",
+        RpcErr::UnknownPayload(_) => "UnknownPayload",
+    }
+}
+
 pub const FILTER_DURATION: Duration = {
     if cfg!(test) {
         Duration::from_secs(1)
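The change above records one counter increment per request, labelled by namespace, method, outcome, and error kind, with an empty error kind for successes. The following std-only sketch illustrates how those four labels partition the counts. `OutcomeCounter`, `LabelSet`, and the method names are hypothetical; a `HashMap` stands in for the Prometheus `CounterVec`, and `RpcOutcome` is re-declared here only so the example is self-contained.

```rust
use std::collections::HashMap;

/// Mirrors the shape of the `RpcOutcome` enum in the diff.
#[derive(Clone)]
enum RpcOutcome {
    Success,
    Error(&'static str),
}

impl RpcOutcome {
    /// Value for the `outcome` label.
    fn as_label(&self) -> &'static str {
        match self {
            RpcOutcome::Success => "success",
            RpcOutcome::Error(_) => "error",
        }
    }

    /// Value for the `error_kind` label; empty for successful requests.
    fn error_kind(&self) -> &'static str {
        match self {
            RpcOutcome::Success => "",
            RpcOutcome::Error(kind) => kind,
        }
    }
}

/// (namespace, method, outcome, error_kind)
type LabelSet = (String, String, &'static str, &'static str);

/// Hypothetical stand-in for the labelled counter: one count per label set.
#[derive(Default)]
struct OutcomeCounter {
    counts: HashMap<LabelSet, u64>,
}

impl OutcomeCounter {
    fn record(&mut self, namespace: &str, method: &str, outcome: RpcOutcome) {
        let key = (
            namespace.to_string(),
            method.to_string(),
            outcome.as_label(),
            outcome.error_kind(),
        );
        *self.counts.entry(key).or_insert(0) += 1;
    }

    fn get(
        &self,
        namespace: &str,
        method: &str,
        outcome: &'static str,
        error_kind: &'static str,
    ) -> u64 {
        let key = (namespace.to_string(), method.to_string(), outcome, error_kind);
        self.counts.get(&key).copied().unwrap_or(0)
    }
}

fn main() {
    let mut counter = OutcomeCounter::default();
    counter.record("rpc", "eth_call", RpcOutcome::Success);
    counter.record("rpc", "eth_call", RpcOutcome::Error("BadParams"));
    counter.record("rpc", "eth_call", RpcOutcome::Error("BadParams"));
    counter.record("engine", "engine_getPayloadV4", RpcOutcome::Error("UnknownPayload"));

    println!("eth_call successes: {}", counter.get("rpc", "eth_call", "success", ""));
    println!(
        "eth_call BadParams errors: {}",
        counter.get("rpc", "eth_call", "error", "BadParams")
    );
}
```

One plausible reading of the `&'static str` error kind is that it keeps the label values to a fixed set known at compile time, so arbitrary error messages cannot inflate the number of time series; that rationale is an inference, not something the diff states.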
37 changes: 32 additions & 5 deletions docs/developers/l1/dashboards.md
@@ -94,16 +94,21 @@ Collapsed row that surfaces the `namespace="engine"` Prometheus timers so you ca

 ![Engine API row](img/engine_api_row.png)

-### Engine Request Rate by Method
-Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.
+### Engine Total Time per Method
+Pie chart that shows where Engine time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.

-![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
+![Engine Total Time per Method](img/engine_total_time_per_method.png)

 ### Engine Latency by Methods (Avg Duration)
 Bar gauge of the historical average latency per Engine method over the selected time range.

 ![Engine Latency by Methods](img/engine_latency_by_methods.png)

+### Engine Request Rate by Method
+Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.
+
+![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
+
 ### Engine Latency by Method
 Live timeseries that tries to correlate to the per-block execution time by showing real-time latency per Engine method with an 18 s lookback window.

@@ -117,10 +122,10 @@ Another collapsed row focused on the public JSON-RPC surface (`namespace="rpc"`)

 ![RPC API row](img/rpc_api_row.png)

-### RPC Time per Method
+### RPC Total Time per Method
 Pie chart that shows where RPC time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.

-![RPC Time per Method](img/rpc_time_per_method.png)
+![RPC Total Time per Method](img/rpc_total_time_per_method.png)

 ### Slowest RPC Methods
 Table listing the highest average-latency methods over the active dashboard range. Used to prioritise optimisation or caching efforts.
@@ -139,6 +144,28 @@ Live timeseries that tries to correlate to the per-block execution time by showi

 _**Limitations**: The RPC latency views inherit the same windowing caveats as the Engine charts: averages use the dashboard time range while the live chart relies on an 18 s window._

+## Engine and RPC Error rates
+
+Collapsed row showing error rates for the Engine and RPC APIs side by side, plus a disaggregated panel broken down by method and error kind. Each panel repeats per instance so you can compare behaviour across nodes.
+
+![Engine and RPC Error rates row](img/engine_and_rpc_error_rates_row.png)
+
+### Engine Success/Error Rate
+Shows the rate of successful vs. failed Engine API requests per second.
+
+![Engine Success/Error Rate](img/engine_success_error_rate.png)
+
+### RPC Success/Error Rate
+Shows the rate of successful vs. failed RPC API requests per second.
+
+![RPC Success/Error Rate](img/rpc_success_error_rate.png)
+
+### Engine and RPC Errors % by Method and Kind
+
+Disaggregated view of error percentages split by method and error kind for both the Engine and RPC APIs. Percentages are calculated against the total number of requests for each method, so the per-kind error percentages for a method add up to that method's overall error percentage.
+
+![Engine and RPC Errors % by Method and Kind](img/engine_and_rpc_errors_by_method_and_kind.png)
+
 ## Process and server info

 Row panels showing process-level and host-level metrics to help you monitor resource usage and spot potential issues.
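The "Errors % by Method and Kind" panel described in the docs above divides each error kind's count by the total requests for that method. The sketch below only reproduces that arithmetic with made-up counts; the dashboard presumably derives the same ratios from `rpc_requests_total` with Prometheus queries, which are not shown in this diff. `MethodCounts` and its methods are hypothetical names.

```rust
/// Request counts for one method (illustrative numbers, not real data).
struct MethodCounts {
    success: u64,
    errors_by_kind: Vec<(&'static str, u64)>,
}

impl MethodCounts {
    /// All requests for the method: successes plus every error kind.
    fn total(&self) -> u64 {
        self.success + self.errors_by_kind.iter().map(|(_, n)| n).sum::<u64>()
    }

    /// Percentage of the method's total requests that failed with each kind.
    fn error_pct_by_kind(&self) -> Vec<(&'static str, f64)> {
        let total = self.total();
        self.errors_by_kind
            .iter()
            .map(|&(kind, n)| {
                let pct = if total == 0 {
                    0.0
                } else {
                    n as f64 * 100.0 / total as f64
                };
                (kind, pct)
            })
            .collect()
    }

    /// Overall error percentage for the method.
    fn error_pct(&self) -> f64 {
        let total = self.total();
        if total == 0 {
            return 0.0;
        }
        let errors: u64 = self.errors_by_kind.iter().map(|(_, n)| n).sum();
        errors as f64 * 100.0 / total as f64
    }
}

fn main() {
    let counts = MethodCounts {
        success: 90,
        errors_by_kind: vec![("BadParams", 6), ("Internal", 4)],
    };

    for (kind, pct) in counts.error_pct_by_kind() {
        println!("{kind}: {pct:.1}% of requests");
    }
    // The per-kind percentages add up to the method's overall error percentage.
    println!("all errors: {:.1}% of requests", counts.error_pct());
}
```

Using the method's total requests as the denominator, rather than its total errors, is what makes the per-kind slices comparable across methods with very different traffic.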
Binary file modified docs/developers/l1/img/engine_api_row.png
Binary file modified docs/developers/l1/img/rpc_api_row.png
Binary file removed docs/developers/l1/img/rpc_time_per_method.png
Binary file not shown.