graph: Add endpoint metrics #4430

mangas · 2023-03-07T15:57:04Z

Add a generic mechanism to count errors based on host URL.

mangas · 2023-03-07T16:04:09Z

graph/Cargo.toml

@@ -17,7 +17,7 @@ chrono = "0.4.23"
 envconfig = "0.10.0"
 Inflector = "0.11.3"
 isatty = "0.1.9"
-reqwest = { version = "0.11.2", features = ["json", "stream", "multipart"] }
+reqwest = { version = "0.11.14", features = ["json", "stream", "multipart"] }


This is unrelated to the rest of the change; I was just looking into middleware options and noticed this was "outdated".

mangas · 2023-03-08T19:24:01Z

graph/src/firehose/endpoints.rs

    // we need to -1 because there will always be a reference
    // inside FirehoseEndpoints that is not used (is always cloned).
    pub fn has_subgraph_capacity(self: &Arc<Self>) -> bool {
        self.subgraph_limit
            .has_capacity(Arc::strong_count(self).saturating_sub(1))
    }

-    pub async fn get_block<M>(
+    fn new_client(


Not sure if it's useful to have a macro here or if it would hurt readability.

lutter

Generally looks good. One thing I didn't understand is why this needs a separate thread. Couldn't handling the per-host counters also be done inline from EndpointMetrics.success etc.? Ultimately, it's just changing an atomic counter. I would assume that the implementation of the channel involves more costly locks, like a Mutex.

chain/ethereum/examples/firehose.rs

lutter · 2023-03-08T23:07:07Z

graph/src/endpoint.rs

+/// HostCount is the underlying structure to keep the count,
+/// we require that all the hosts are known ahead of time, this way we can
+/// avoid locking since we don't need to modify the entire struture.
+type HostCount = Arc<HashMap<Host, AtomicU64>>;


For the comment, it would be good to say what constitutes a Host here. On first reading, that's not clear.

I've added the comment to Host instead
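For readers following the thread, here is a minimal sketch of the lock-free idea behind HostCount, using only the standard library. The `Host` and `HostCount` definitions mirror the diff above; `new_host_count`, `record_error` and `error_count` are hypothetical helper names made up for illustration and are not taken from the PR.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Mirrors the `Host` newtype from the diff: an immutable host string.
#[derive(Debug, Eq, PartialEq, Hash, Clone)]
pub struct Host(Box<str>);

impl From<&str> for Host {
    fn from(s: &str) -> Self {
        Host(s.into())
    }
}

type HostCount = Arc<HashMap<Host, AtomicU64>>;

/// Hypothetical constructor: every host is registered up front, so the map
/// itself is never mutated afterwards and needs no lock.
pub fn new_host_count(hosts: &[&str]) -> HostCount {
    Arc::new(
        hosts
            .iter()
            .map(|h| (Host::from(*h), AtomicU64::new(0)))
            .collect(),
    )
}

/// Hypothetical helper: bump the counter for a known host.
/// Returns `false` (and changes nothing) if the host was never registered.
pub fn record_error(counts: &HostCount, host: &Host) -> bool {
    match counts.get(host) {
        Some(counter) => {
            counter.fetch_add(1, Ordering::Relaxed);
            true
        }
        None => false,
    }
}

/// Hypothetical helper: read the current count, 0 for unknown hosts.
pub fn error_count(counts: &HostCount, host: &Host) -> u64 {
    counts
        .get(host)
        .map(|c| c.load(Ordering::Relaxed))
        .unwrap_or(0)
}
```

Only the `AtomicU64` values change after construction, which is why a shared `&HashMap` is enough and no `Mutex` or `RwLock` is involved.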

lutter · 2023-03-08T23:08:11Z

graph/src/endpoint.rs

+type HostCount = Arc<HashMap<Host, AtomicU64>>;
+
+#[derive(Debug, Eq, PartialEq, Hash, Clone)]
+pub struct Host(Box<str>);


You could use Word here which is also a Box<str>, but that's minor

I think Word is a terrible name. I copied a lot of the implementation, but I wouldn't use it anywhere else by choice. There is just no scenario where someone reads "Word" and understands what that means in the context of graph-node.

I am totally open to renaming that - that shouldn't be a reason to duplicate code

CheapString? BoxStr? BoxString? It's a small amount of code, and I wasn't really sure why it was named this way, so I didn't want to change it.

I left the rename out and made an alias. Let's discuss options for naming and then do that rename.
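A rough sketch of what "made an alias" could look like, assuming `Word` is a thin newtype over `Box<str>` as described above. The `Word` defined here is a local stand-in written for this example, not the real graph-node type, and the trait impls are illustrative.

```rust
use std::ops::Deref;

/// Local stand-in for graph-node's `Word`: an immutable heap string.
#[derive(Debug, Eq, PartialEq, Hash, Clone)]
pub struct Word(Box<str>);

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Word(s.into())
    }
}

impl Deref for Word {
    type Target = str;

    fn deref(&self) -> &str {
        &self.0
    }
}

/// The alias gives call sites a domain-specific name without
/// duplicating the implementation.
pub type Host = Word;
```

Because `Host` is only an alias, `Host::from("…")` and `Word::from("…")` produce the same type, so a later rename of `Word` would not affect code that only mentions `Host`.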

lutter · 2023-03-08T23:09:44Z

graph/src/endpoint.rs

+
+impl EndpointMetrics {
+    /// This should only be used for testing.
+    pub fn noop() -> Self {


To make sure that it is only used from tests, you can add #[cfg(debug_assertions)] and it will be ignored in release builds.

Bikeshedding, but dummy or test_metrics would be a better name

I had it, but we need to use it from the test crate, and in that case debug_assertions doesn't work, so I had to remove it.

Replaced with mock to be consistent with other parts of the code.
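For reference, a small sketch of the gating suggested above. The struct body and the `failure`/`error_count` methods are placeholders invented for the example; only the `#[cfg(debug_assertions)]` attribute and the `mock` name come from the discussion. As noted in the thread, this gating was not kept in the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Placeholder metrics type with a single counter.
#[derive(Debug, Default)]
pub struct EndpointMetrics {
    errors: AtomicU64,
}

impl EndpointMetrics {
    /// Test-only constructor. With this attribute the function is only
    /// compiled when `debug_assertions` is enabled, which is the default
    /// for dev builds and off by default for release builds.
    #[cfg(debug_assertions)]
    pub fn mock() -> Self {
        Self::default()
    }

    /// Record one failed request.
    pub fn failure(&self) {
        self.errors.fetch_add(1, Ordering::Relaxed);
    }

    /// Current number of recorded failures.
    pub fn error_count(&self) -> u64 {
        self.errors.load(Ordering::Relaxed)
    }
}
```

Any caller of `mock()` has to be compiled with `debug_assertions` enabled as well, otherwise the call does not resolve.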

graph/src/firehose/endpoints.rs

lutter · 2023-03-08T23:23:46Z

graph/src/firehose/endpoints.rs

-        }
+            .filter(|x| x.has_subgraph_capacity())
+            .sorted_by_key(|x| x.current_error_count())
+            .find(|x| x.has_subgraph_capacity())


At the point where find gets invoked, isn't x.has_subgraph_capacity() always true? I.e., this boils down to next()?

The non-error behavior we'll get from this, I think, is that we use up the first endpoint to capacity, then the next, etc., assuming that sorted_by_key is a stable sort.
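To illustrate the point, here is a std-only sketch, with a hypothetical `Endpoint` struct standing in for the real adapter type and `slice::sort_by_key` (which is a stable sort) standing in for itertools' `sorted_by_key`. Once the candidates have been filtered by capacity, `find` with the same predicate and `next()` return the same element.

```rust
/// Hypothetical stand-in for a Firehose endpoint.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub name: &'static str,
    pub has_capacity: bool,
    pub error_count: u64,
}

/// Keep only endpoints with capacity, fewest errors first.
/// `sort_by_key` is stable, so ties keep their original order.
fn candidates(endpoints: &[Endpoint]) -> Vec<&Endpoint> {
    let mut v: Vec<&Endpoint> = endpoints.iter().filter(|e| e.has_capacity).collect();
    v.sort_by_key(|e| e.error_count);
    v
}

/// Shape of the reviewed chain: filter, sort, then `find` again.
pub fn pick_with_find(endpoints: &[Endpoint]) -> Option<&Endpoint> {
    candidates(endpoints).into_iter().find(|e| e.has_capacity)
}

/// Equivalent form: the predicate already holds for every candidate.
pub fn pick_with_next(endpoints: &[Endpoint]) -> Option<&Endpoint> {
    candidates(endpoints).into_iter().next()
}
```

The equivalence only holds in this sketch because `has_capacity` is a plain field; if the real check can change between the `filter` and the `find`, the second check is not strictly redundant.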

I think there are two issues here:

1. has_subgraph_capacity now goes up and down over time because we only check whether we have capacity, not how much. If an adapter is close to its limit of, say, 100, it could report capacity at 99 but then fail due to capacity once we take it. I think I'll swap has_capacity for remaining_capacity() -> u64, since this would allow us to sort instead of filtering.

2. Due to 1., we would need this particular has_subgraph_capacity to return false so we can re-test the adapter that has errors.

I will try and address both of these

I think I have addressed all the concerns, let me know what you think.
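One way the remaining_capacity idea could be sketched, purely as an illustration. The `Endpoint` fields, the tie-break on remaining capacity and the `pick` function are assumptions made for this example; the thread does not show the reworked selection code.

```rust
use std::cmp::Reverse;

/// Hypothetical endpoint with a subgraph limit and a usage count.
#[derive(Debug)]
pub struct Endpoint {
    pub name: &'static str,
    pub limit: u64,
    pub in_use: u64,
    pub error_count: u64,
}

impl Endpoint {
    /// How many more subgraphs this endpoint can take, never negative.
    pub fn remaining_capacity(&self) -> u64 {
        self.limit.saturating_sub(self.in_use)
    }
}

/// Sort instead of filtering: fewest errors first and, among equals,
/// the most remaining capacity first. Then take the first endpoint
/// that can actually accept another subgraph.
pub fn pick(endpoints: &[Endpoint]) -> Option<&Endpoint> {
    let mut sorted: Vec<&Endpoint> = endpoints.iter().collect();
    sorted.sort_by_key(|e| (e.error_count, Reverse(e.remaining_capacity())));
    sorted.into_iter().find(|e| e.remaining_capacity() > 0)
}
```

Here the final `find` does real work, because no capacity filter ran before the sort.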

graph/src/firehose/interceptors.rs

mangas · 2023-03-09T10:46:11Z

Generally looks good. One thing I didn't understand is why this needs a separate thread. Couldn't handling the per-host counters also be done inline from EndpointMetrics.success etc.? Ultimately, it's just changing an atomic counter. I would assume that the implementation of the channel involves more costly locks, like a Mutex.

In the initial iteration the synchronisation was a bit different; with this design I think we could remove the background processor. I was wondering if this is worth doing, because we may want to do a few more things in here other than just counting the errors, like producing metrics and reporting tracing. So I thought I'd leave it like this, since it is probably easier to add these other operations without paying an additional cost on the hot path. If you think it's not worth the extra complexity, I'm happy to remove the asynchronous processing.

lutter · 2023-03-10T00:49:31Z

Since what we are doing now is very simple, and since we have plenty of places where we manipulate metrics inline on hot code paths, I would do that here, too, and remove the extra thread. We can always add it if we ever do need to do something that's more expensive than these two things.
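A minimal sketch of the inline approach, assuming a metrics type whose `success` and `failure` methods touch an atomic directly from the calling thread. The struct layout, the `String` keys, the reset-on-success behavior and the `hammer` helper are assumptions made for this example, not details confirmed by the thread.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical metrics type: one counter per known host.
#[derive(Debug, Default)]
pub struct EndpointMetrics {
    errors: HashMap<String, AtomicU64>,
}

impl EndpointMetrics {
    pub fn new(hosts: &[&str]) -> Self {
        let errors = hosts
            .iter()
            .map(|h| ((*h).to_owned(), AtomicU64::new(0)))
            .collect();
        Self { errors }
    }

    /// Assumption for this sketch: a success resets the error count.
    pub fn success(&self, host: &str) {
        if let Some(c) = self.errors.get(host) {
            c.store(0, Ordering::Relaxed);
        }
    }

    /// Called inline on the request path: a single atomic add.
    pub fn failure(&self, host: &str) {
        if let Some(c) = self.errors.get(host) {
            c.fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn error_count(&self, host: &str) -> u64 {
        self.errors
            .get(host)
            .map(|c| c.load(Ordering::Relaxed))
            .unwrap_or(0)
    }
}

/// Record failures from several threads with no channel or worker thread.
pub fn hammer(metrics: Arc<EndpointMetrics>, host: &'static str, threads: usize, per_thread: u64) {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let m = Arc::clone(&metrics);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    m.failure(host);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().expect("worker thread panicked");
    }
}
```

Every update is one atomic operation on shared state, so callers never wait on a queue or a background processor.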

neysofu · 2023-03-10T17:04:40Z

graph/src/firehose/endpoints.rs

+        let metrics = MetricsInterceptor {
+            metrics: self.endpoint_metrics.cheap_clone(),
+            service: self.channel.cheap_clone(),
+            host: self.host.clone(),
+        };


This block of code is repeated a few times; I would suggest a fn metrics_interceptor(&self) -> MetricsInterceptor<Channel> method.

I think it's going to be almost exactly the same code. Probably the best way here would be to write a macro to implement these 3 functions; they are effectively the same, just with a different generated type.
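A sketch of the helper-method variant suggested above, with local stand-in types so that it compiles on its own. `Channel`, `EndpointMetrics` and the `FirehoseEndpoint` fields here are simplified placeholders, and `Arc::clone` and `clone` stand in for `cheap_clone`; only the field names of `MetricsInterceptor` and the method signature follow the thread.

```rust
use std::sync::Arc;

/// Placeholder for the shared metrics handle.
#[derive(Debug, Default)]
pub struct EndpointMetrics;

/// Placeholder for the gRPC channel type; cloning is assumed to be cheap.
#[derive(Debug, Clone)]
pub struct Channel(Arc<str>);

#[derive(Debug, Clone)]
pub struct MetricsInterceptor<S> {
    pub metrics: Arc<EndpointMetrics>,
    pub service: S,
    pub host: String,
}

pub struct FirehoseEndpoint {
    endpoint_metrics: Arc<EndpointMetrics>,
    channel: Channel,
    host: String,
}

impl FirehoseEndpoint {
    pub fn new(host: &str) -> Self {
        Self {
            endpoint_metrics: Arc::new(EndpointMetrics),
            channel: Channel(Arc::from(host)),
            host: host.to_owned(),
        }
    }

    /// Builds the interceptor in one place so each client constructor
    /// does not have to repeat the three field initializers.
    pub fn metrics_interceptor(&self) -> MetricsInterceptor<Channel> {
        MetricsInterceptor {
            metrics: Arc::clone(&self.endpoint_metrics),
            service: self.channel.clone(),
            host: self.host.clone(),
        }
    }
}
```

This removes the repeated struct literal, while the macro idea from the reply targets the surrounding client constructors, which differ only in the generated client type.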

lutter

Nice! Looks great!

graph: Add endpoint metrics

0388978

Add a generic mechanism to count errors based on host url

mangas commented Mar 7, 2023

add and wire interceptor

ef9a318

mangas force-pushed the filipe/endpoint-metrics branch from f979bea to d076349 Compare March 8, 2023 15:50

mangas marked this pull request as ready for review March 8, 2023 19:22

mangas requested review from lutter and neysofu March 8, 2023 19:22

mangas commented Mar 8, 2023

mangas force-pushed the filipe/endpoint-metrics branch from 60fac76 to 79698e5 Compare March 8, 2023 19:26

Add interceptors for metrics and auth

06c4dee

mangas force-pushed the filipe/endpoint-metrics branch 3 times, most recently from 81880ec to 65fa2f9 Compare March 8, 2023 20:26

use error code to get firehose adapter

8e022c5

mangas force-pushed the filipe/endpoint-metrics branch from 65fa2f9 to 8e022c5 Compare March 8, 2023 21:56

lutter approved these changes Mar 9, 2023

address code review comments

fd77f00

mangas force-pushed the filipe/endpoint-metrics branch from 5a247c3 to fd77f00 Compare March 10, 2023 12:35

re-work adapter selection

faf558d

mangas force-pushed the filipe/endpoint-metrics branch 3 times, most recently from 3a768ac to 5abb889 Compare March 10, 2023 16:51

neysofu reviewed Mar 10, 2023

address code review comments

b6242bb

mangas force-pushed the filipe/endpoint-metrics branch from 5abb889 to b6242bb Compare March 10, 2023 17:20

lutter approved these changes Mar 10, 2023

mangas merged commit 8b1a524 into master Mar 11, 2023

mangas deleted the filipe/endpoint-metrics branch March 11, 2023 13:52
