record indexer errors #79

hopeyen · 2023-10-23T08:33:09Z

Briefly reviewed the current error docs. For further recording of the metrics, I'm thinking of moving the indexer_error metric and the metrics services into indexer_common while keeping the receipt and cost model metrics within indexer_service. For now I simply added recording of indexer_error in places without much refactoring.

New in indexer-service-rs

IE074: "Failed to resolve the release version",
IE075: "Failed to parse response body to query string",

Recorded in indexer-service-rs

IE018: "Failed to query indexing status API",                                // Graph node query failed to get indexingStatuses
IE022: "Failed to identify attestation signer for allocation",               // Failed to get attestation signer from the cache
IE029: "Invalid Scalar-Receipt header provided",                             // subgraph query failed to extract receipt header
IE030: "No Scalar-Receipt header provided",                                  // Queries without receipt for paid receipts
IE031: "Invalid Scalar-Receipt value provided",                              // ReceiptManager add receipts, fails allocation receipts validator data test (update error message)
IE032: "Failed to process paid query",                                       // generic fail to execute paid queries
IE033: "Failed to process free query",                                       // generic fail to execute free queries
IE053: "Failed to queue receipts for collecting",                            // query fee allocation receipt collector collect receipts error
IE073: "Failed to query subgraph features from indexing statuses endpoint",  // query subgraph features failed

TODO: add to indexer-service-rs

// validate at start up
IE002: "Invalid Ethereum URL",                                           // Network provider validation
IE024: "Failed to connect to indexing status API",                       // Graph node query failed to connect with a basic subgraphDeployment check; similar to IE018

// Server layer
IE035: "Unhandled promise rejection",                                    // Service generic unhandled promise rejection
IE036: "Unhandled exception",                                            // Service generic uncaught exception

// Should be added/reframed with respect to TAP
IE054: "Failed to collect receipts in exchange for query fee voucher",   // query fee allocation receipt collector obtain receipt vouchers error
IE055: "Failed to redeem query fee voucher",                             // query fee allocation receipt collector submit voucher error
IE056: "Failed to remember allocation for collecting receipts later",    // query fee allocation receipt collector remember allocations error

// With refactoring of indexer_errors metrics into indexer_common
IE009: "Failed to query subgraph deployments worth indexing",            // Network monitor subgraph query, update error message
IE010: "Failed to query indexer allocations",                            // Network monitor allocation query
IE063: "No active allocation with provided id found",                    // network monitor allocation query

Not needed in indexer-service

IE001: "Failed to run database migrations",
IE006: "Failed to cross-check allocation state with contracts",
IE007: "Failed to check for network pause",
IE008: "Failed to check operator status for indexer",
IE011: "Failed to query claimable indexer allocations",
IE012: "Failed to register indexer",
IE013: "Failed to allocate: insufficient free stake",
IE014: "Failed to allocate: allocation not created on chain",
IE015: "Failed to close allocation",
IE016: "Failed to claim allocation",
IE017: "Failed to ensure default global indexing rule",
IE019: "Failed to query proof of indexing",
IE020: "Failed to ensure subgraph deployment is indexing",
IE021: "Failed to migrate cost model",
IE025: "Failed to query indexer management API",
IE026: "Failed to deploy subgraph deployment",
IE027: "Failed to remove subgraph deployment",
IE028: "Failed to reassign subgraph deployment",
IE034: "Not authorized as an operator for the indexer",
IE037: "Failed to query disputable allocations",
IE038: "Failed to query epochs",
IE039: "Failed to store potential POI disputes",
IE040: "Failed to fetch POI disputes",
IE044: "Failed to collect query fees on chain",
IE050: "Transaction reverted due to gas limit being hit",
IE051: "Transaction reverted for unknown reason",
IE052: "Transaction aborted: maximum configured gas price reached",
IE057: "Transaction reverted due to failing assertion in contract",
IE058: "Transaction failed because nonce has already been used",
IE059: "Failed to check latest operator ETH balance",
IE060: "Failed to allocate: Already allocating to the subgraph deployment",
IE061: "Failed to allocate: Invalid allocation amount provided",
IE062: "Did not receive tx receipt, not authorized or network paused",
IE064: "Failed to unallocate: Allocation cannot be closed in the same epoch it was created",
IE065: "Failed to unallocate: Allocation has already been closed",
IE066: "Failed to allocate: allocation ID already exists on chain",
IE067: "Failed to query POI for current epoch start block",
IE068: "User-provided POI did not match reference POI from graph-node",
IE069: "Failed to query Epoch Block Oracle Subgraph",
IE070: "Failed to query latest valid epoch and block hash",
IE071: "Add Epoch subgraph support for non-protocol chains",
IE072: "Failed to execute batch tx (contract: staking)",

Not used anywhere in typescript impl -> Deprecate/remove/refactor

IE003: "Failed to index network subgraph",
IE004: "Failed to synchronize with network",
IE005: "Failed to reconcile indexer and network",
IE023: "Failed to handle state channel message",
IE041: "Failed to query transfers to resolve",
IE042: "Failed to add transfer to the database",
IE043: "Failed to mark transfer as resolved",
IE045: "Failed to queue transfers for resolving",
IE046: "Failed to resolve transfer",
IE047: "Failed to mark transfer as failed",
IE048: "Failed to withdraw query fees for allocation",
IE049: "Failed to clean up transfers for allocation",

resolves #75, part of #4

Jannis · 2023-10-30T19:41:48Z

service/src/query_processor.rs

+                metrics::INDEXER_ERROR
+                    .with_label_values(&[&IndexerErrorCode::IE033.to_string()])
+                    .inc();
+
+                e


It would be cool if indexer_error would automatically do this.

Ideally, we'd replicate what https://github.com/graphprotocol/indexer/blob/main/packages/indexer-common/src/errors.ts#L181-L183 does:

Any time there is an internal error, we call indexer_error (or an indexer_error! macro), passing in the code and an optional cause.

This internally constructs an IndexerError and also increments a metric.

Whenever we call indexer_error, we should log the error as well, like it's done here: https://github.com/graphprotocol/indexer/blob/f20152a28bea60b0f2ed8113dfad2b45b9a500d4/packages/indexer-common/src/indexer-management/allocations.ts#L172-L174

Basically, an indexer error should allow the indexer to detect it via metrics or logs.

Yeah cool idea!

I moved metrics recording and logs into IndexerError::new so they are automatically taken care of when an IndexerError is created.

I would also remove the QueryError enum if it makes sense to you, I think the current variants can simply live as part of IndexerError
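A minimal std-only sketch of the idea discussed in this thread: constructing an IndexerError increments a per-code counter and logs the error in one place. This is not the code from the PR. The static map stands in for the Prometheus counter, eprintln! stands in for the logger, and `label`, `message`, and `indexer_error_count` are hypothetical helper names.

```rust
use std::collections::BTreeMap;
use std::error::Error;
use std::fmt;
use std::sync::Mutex;

/// Stand-in for a Prometheus counter labeled by error code (assumption:
/// the real code would increment a counter vector instead).
static INDEXER_ERROR_COUNTS: Mutex<BTreeMap<&'static str, u64>> = Mutex::new(BTreeMap::new());

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexerErrorCode {
    IE032,
    IE033,
}

impl IndexerErrorCode {
    /// Code label used for the metric, e.g. "IE033".
    pub fn label(self) -> &'static str {
        match self {
            Self::IE032 => "IE032",
            Self::IE033 => "IE033",
        }
    }

    /// Human-readable explanation, taken from the error list in this PR.
    pub fn message(self) -> &'static str {
        match self {
            Self::IE032 => "Failed to process paid query",
            Self::IE033 => "Failed to process free query",
        }
    }
}

#[derive(Debug)]
pub struct IndexerError {
    code: IndexerErrorCode,
    cause: Option<Box<dyn Error + Send + Sync>>,
}

impl IndexerError {
    /// Builds the error and, as side effects, bumps the counter and logs.
    pub fn new(code: IndexerErrorCode, cause: Option<Box<dyn Error + Send + Sync>>) -> Self {
        let error = Self { code, cause };
        *INDEXER_ERROR_COUNTS
            .lock()
            .expect("counter mutex poisoned")
            .entry(code.label())
            .or_insert(0) += 1;
        // Stand-in for a structured log call.
        eprintln!("{error}");
        error
    }
}

impl fmt::Display for IndexerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}: {}", self.code.label(), self.code.message())?;
        match &self.cause {
            Some(cause) => write!(f, ": {cause}"),
            None => Ok(()),
        }
    }
}

/// Reads the current count for a code (hypothetical helper for inspection).
pub fn indexer_error_count(code: IndexerErrorCode) -> u64 {
    INDEXER_ERROR_COUNTS
        .lock()
        .expect("counter mutex poisoned")
        .get(code.label())
        .copied()
        .unwrap_or(0)
}
```

With this shape, a call site only needs `IndexerError::new(IndexerErrorCode::IE033, Some("...".into()))`; it cannot forget the metric or the log line because both happen inside the constructor.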

Jannis · 2023-10-30T19:53:06Z

service/src/util.rs

+        let dependencies = pkg
+            .get("dependencies")
+            .and_then(|d| d.as_table())
+            .expect("Parse package dependencies");
+        let indexer_native = dependencies.get("indexer-native").map(|d| {
+            d.as_str()
+                .expect("Parse indexer-service dependency version")
+                .to_string()
+        });


Since this is only called once on startup, I would just make any errors here fatal, i.e. use expect and no IndexerError. Only errors while running are worth tracking in my opinion.

I'm a bit confused 🧐 I didn't add IndexerError here, the only changes were from unwrap to expect for a better error message when it fails fatally

I kept the expect messages changes and removed IndexerError from helper fn read_manifest

github-actions · 2023-10-31T10:44:03Z

Pull Request Test Coverage Report for Build 6704892743

0 of 119 (0.0%) changed or added relevant lines in 10 files are covered.
4 unchanged lines in 3 files lost coverage.
Overall coverage decreased (-1.0%) to 49.241%

Changes Missing Coverage	Changed/Added Lines	%
service/src/metrics/mod.rs	1	0.0%
service/src/server/routes/deployment.rs	1	0.0%
service/src/server/routes/mod.rs	2	0.0%
service/src/main.rs	3	0.0%
common/src/indexer_errors.rs	6	0.0%
service/src/util.rs	13	0.0%
service/src/server/routes/status.rs	16	0.0%
common/src/metrics/mod.rs	19	0.0%
service/src/query_processor.rs	29	0.0%
service/src/server/routes/subgraphs.rs	29	0.0%

Files with Coverage Reduction	New Missed Lines	%
service/src/config.rs	1	0.0%
service/src/query_processor.rs	1	36.24%
service/src/server/routes/subgraphs.rs	2	0.0%

Totals
Change from base Build 6644835579:	-1.0%
Covered Lines:	1394
Relevant Lines:	2831

💛 - Coveralls

Jannis · 2023-11-01T19:54:33Z

common/src/metrics/mod.rs

+    let m = IntCounterVec::new(
+        Opts::new("indexer_error", "Indexer errors observed over time")
+            .namespace("indexer")
+            .subsystem("service"),
+        &["code"],
+    )
+    .expect("Failed to create indexer_error");
+    register(Box::new(m.clone())).expect("Failed to register indexer_error counter");
+    m


This would be shorter but is mostly equivalent. I think the namespace and subsystem (in other words, the indexer_service_ prefix that they result in) are different from what the current indexer-service uses, so I'd drop those either way.

register_int_counter_vec!("indexer_error", "Indexer errors observed over time", &["code"]);
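To make the prefix point concrete, here is a small std-only sketch of how namespace, subsystem, and name are conventionally joined into the exported metric name. `fully_qualified_name` is a hypothetical helper written for illustration, not a function from the prometheus crate, and it skips edge cases a real client handles.

```rust
/// Joins the non-empty parts with underscores, following the usual
/// Prometheus convention for namespace, subsystem, and metric name.
pub fn fully_qualified_name(namespace: &str, subsystem: &str, name: &str) -> String {
    [namespace, subsystem, name]
        .iter()
        .copied()
        .filter(|part| !part.is_empty())
        .collect::<Vec<&str>>()
        .join("_")
}
```

Under this convention, the diff above would export `indexer_service_indexer_error`, while dropping the namespace and subsystem leaves plain `indexer_error`.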

Jannis · 2023-11-01T20:00:13Z

common/src/metrics/mod.rs

+#[allow(dead_code)]
+pub static REGISTRY: Lazy<Registry> = Lazy::new(Registry::new);
+
+#[allow(dead_code)]
+pub fn register_metrics(registry: &Registry, metrics: Vec<Box<dyn Collector>>) {
+    for metric in metrics {
+        registry.register(metric).expect("Cannot register metrics");
+    }
+}
+
+/// Register indexer error metrics in Prometheus registry
+pub fn register_indexer_error_metrics() {
+    register_metrics(&REGISTRY, vec![Box::new(INDEXER_ERROR.clone())]);
+}


I think it would be better to just use the default registry instead of creating our own. I don't recall why we're using a custom one in indexer-service (and also graph-node), but I know that the default registry works well with autometrics which I believe is used in indexer-service-rs. That's why I think using register_int_counter_vec! is probably a good idea, because that uses the default registry.

Nice suggestion! I updated the labeling for both indexer_error and query metrics, used the macros, and removed the custom registry fb870a7. The code looks a lot cleaner than before.

The current custom registry should work fine with autometrics, though we are only using it for encode_global_metrics and global_metrics_exporter. I think it could be interesting to use autometrics for some functions to measure performance, like for execute_paid_query and execute_free_query. I don't think we have those measurements in TS but do lmk if you think adding some of those within this PR is a good idea

Jannis · 2023-11-01T20:02:40Z

service/src/metrics/mod.rs

 /// Start the basic metrics for indexer services
-#[allow(dead_code)]
-pub fn start_metrics() {
+pub fn register_query_metrics() {
    register_metrics(
        &REGISTRY,


Same here, I'd stick to the default registry.

Jannis · 2023-11-01T20:03:45Z

service/src/query_processor.rs

+            Err(e) => {
+                IndexerError::new(
+                    IndexerErrorCode::IE031,
+                    Some(IndexerErrorCause::new(
+                        "Failed to parse receipt for a paid query",
+                    )),
+                );
+
+                return Err(e);


This combination of QueryError and IndexerError makes sense — we're tracking the indexer error internally but return the query error to the client. 👍🏻
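The split described here can be sketched in isolation: record the indexer error code internally, then return the original query error untouched. This is a simplified, std-only illustration. `parse_receipt`, the `InvalidReceipt` variant, and the `recorded_codes` vector (standing in for IndexerError::new and its metric) are assumptions, not code from the PR.

```rust
/// Error returned to the client (hypothetical variant for illustration).
#[derive(Debug, PartialEq)]
pub enum QueryError {
    InvalidReceipt(String),
}

/// Hypothetical receipt parser: accepts only a decimal fee amount.
fn parse_receipt(raw: &str) -> Result<u64, QueryError> {
    raw.trim()
        .parse::<u64>()
        .map_err(|e| QueryError::InvalidReceipt(e.to_string()))
}

/// Records the indexer error code internally, then hands the original
/// `QueryError` back so the client-facing error is unchanged.
pub fn handle_paid_query(
    raw_receipt: &str,
    recorded_codes: &mut Vec<&'static str>,
) -> Result<u64, QueryError> {
    match parse_receipt(raw_receipt) {
        Ok(fee) => Ok(fee),
        Err(e) => {
            // Stand-in for `IndexerError::new(IndexerErrorCode::IE031, ...)`.
            recorded_codes.push("IE031");
            Err(e)
        }
    }
}
```

The operator sees IE031 in metrics and logs, while the client only ever receives the QueryError.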

Jannis · 2023-11-01T20:40:50Z

service/src/server/routes/subgraphs.rs

+                    Some(IndexerErrorCause::new(
+                        "Failed to execute a paid subgraph query to graph node",
+                    )),


Looking at this, where we essentially drop the error returned by Graph Node and just log a static string, I wonder if we should have an IndexerErrorCause that supports taking other errors (e.g. anything implementing the std::error::Error trait) that are then included in the error message.

The TypeScript indexer-service/-common does this by having IndexerErrorCause be an arbitrary type. Easy in TypeScript of course. Here, what we could do is something like

use std::fmt::{self};

use anyhow::anyhow; // 1.0.75

#[derive(Debug)]
enum IndexerErrorCode {
    IE001,
}

struct IndexerErrorCause<T: ToString>(T);

impl<T: ToString> From<T> for IndexerErrorCause<T> {
    fn from(t: T) -> Self {
        Self(t)
    }
}

impl<T: ToString> fmt::Display for IndexerErrorCause<T> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.0.to_string())
    }
}

struct IndexerError<T: ToString> {
    code: IndexerErrorCode,
    cause: IndexerErrorCause<T>,
}

impl<T> IndexerError<T>
where
    T: ToString + Into<IndexerErrorCause<T>>,
{
    pub fn new(code: IndexerErrorCode, cause: T) -> Self {
        Self {
            code,
            cause: cause.into(),
        }
    }
}

impl<T: ToString> fmt::Display for IndexerError<T> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}: {}", self.code, self.cause.to_string())
    }
}

fn main() {
    println!("{}", IndexerError::new(IndexerErrorCode::IE001, "str cause"));
    println!("{}", IndexerError::new(IndexerErrorCode::IE001, anyhow!("anyhow cause")));
}

You can see this in the playground here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5ff50fb1ffb9afe9574f1087ebb581fa

Using Display instead of ToString might be better, I'm not sure which one is more common in situations like this. Perhaps @Theodus knows?

Using ToString is generally preferred. When you implement Display, you get the impl of ToString automatically (docs)

But it seems like an alternative would be cause: anyhow::Error, since it looks like you're recreating a subset of what anyhow::Error does, but with the additional overhead of the error type not getting erased. Unless you want separate error types in IndexerError for some reason.

Thanks for the detailed review here 🧠 !!

Looking at the TS implementation of subgraph query responses, I realized that we have been incorrectly counting only 200 responses from graph node as successful queries, whereas TS counts all responses as long as the request doesn't error out (as far as I can tell).

When we get an error while executing the query, we get a QueryError, which can be passed into IndexerErrorCause directly because IndexerErrorCause is currently defined as pub struct IndexerErrorCause(Box<dyn Error + Send + Sync>);.
Updating it to anyhow::Error could be nice, but I think the current struct accomplishes what we want to achieve here.
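For illustration, a std-only sketch of why the boxed definition already covers the use case: one non-generic type accepts string literals, owned strings, and concrete error types. The struct matches the definition quoted above. The `new` signature and the `describe` helper are assumptions made for this example.

```rust
use std::error::Error;
use std::fmt;

/// Type-erased cause: any error (or plain string) fits without a
/// generic parameter on the struct.
#[derive(Debug)]
pub struct IndexerErrorCause(Box<dyn Error + Send + Sync>);

impl IndexerErrorCause {
    /// Assumed constructor: accepts anything convertible into a boxed error.
    pub fn new(cause: impl Into<Box<dyn Error + Send + Sync>>) -> Self {
        Self(cause.into())
    }
}

impl fmt::Display for IndexerErrorCause {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

/// Hypothetical helper: renders causes of mixed origin as strings.
pub fn describe(causes: &[IndexerErrorCause]) -> Vec<String> {
    causes.iter().map(ToString::to_string).collect()
}
```

Because the concrete type is erased, causes built from different sources can share one `Vec<IndexerErrorCause>`, which the generic `IndexerErrorCause<T>` variant would not allow.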

For now, I updated in e56b400 so the match statement will count FAILED_QUERIES only if execute_paid_query fails and it would log out and respond with "Failed to execute a paid subgraph query to graph node: " + QueryError back to the user

lmk what you think:)
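The counting rule described in this comment can be sketched as follows. This is a simplified, std-only illustration rather than the code in e56b400: `QueryCounters`, `GraphNodeResponse`, and `record_outcome` are hypothetical names, and a plain String stands in for QueryError.

```rust
/// Counters standing in for the successful/failed query metrics (assumption).
#[derive(Debug, Default, PartialEq)]
pub struct QueryCounters {
    pub successful: u64,
    pub failed: u64,
}

/// Simplified Graph Node response: HTTP status plus body.
pub struct GraphNodeResponse {
    pub status: u16,
    pub body: String,
}

/// Counts any response as successful, whatever its status; only an `Err`
/// from executing the query counts as failed. Returns the text sent back
/// to the client.
pub fn record_outcome(
    counters: &mut QueryCounters,
    outcome: Result<GraphNodeResponse, String>,
) -> String {
    match outcome {
        Ok(response) => {
            counters.successful += 1;
            response.body
        }
        Err(query_error) => {
            counters.failed += 1;
            format!("Failed to execute a paid subgraph query to graph node: {query_error}")
        }
    }
}
```

Note that a 404 from graph node still lands in `successful` here, which is the behavior attributed to the TS implementation above.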

Jannis · 2023-11-01T20:42:11Z

service/src/util.rs

+        let dependencies = pkg
+            .get("dependencies")
+            .and_then(|d| d.as_table())
+            .expect("Parse package dependencies");
+        let indexer_native = dependencies.get("indexer-native").map(|d| {
+            d.as_str()
+                .expect("Parse indexer-service dependency version")
+                .to_string()
+        });


I'm a bit confused 🧐 I didn't add IndexerError here, the only changes were from unwrap to expect for a better error message when it fails fatally

Ah, sorry. I sometimes comment on things even if they were not introduced in the PR itself. Sometimes, a PR is a good opportunity to improve code that was there before.

Jannis

I like how this adds better error reporting and still shaves off 20 lines of code 🔥

hopeyen requested a review from Jannis October 23, 2023 10:18

Jannis requested changes Oct 30, 2023

hopeyen added 3 commits October 31, 2023 10:50

refactor: move indexer errors to indexer-common

a85b78a

refactor: record some indexer_errors

8e1d06f

refactor: indexer_error metrics moved to indexer_common, auto inc and logs in new fn

68c4880

hopeyen force-pushed the hope/record-indexer-errors branch from 56c6517 to 68c4880 Compare October 31, 2023 09:55

refactor: remove unused ConfigError enum

23135e5

hopeyen requested a review from Jannis November 1, 2023 16:14

Jannis requested changes Nov 1, 2023

hopeyen added 2 commits November 2, 2023 22:00

refactor: default registry and match TS labeling

fb870a7

refactor: utilize QueryError for returning subgraph query errors

e56b400

hopeyen requested a review from Jannis November 2, 2023 20:27

Jannis approved these changes Nov 3, 2023

hopeyen merged commit 22f6f83 into main Nov 4, 2023
5 checks passed

hopeyen deleted the hope/record-indexer-errors branch November 4, 2023 13:35
