Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

record indexer errors #79

Merged
merged 6 commits into from
Nov 4, 2023
Merged

record indexer errors #79

merged 6 commits into from
Nov 4, 2023

Conversation

hopeyen
Copy link
Collaborator

@hopeyen hopeyen commented Oct 23, 2023

Briefly reviewed the current error docs. For further recoding of the metrics, I'm thinking to move indexer_error metric and metrics services into indexer_common and keep receipt and cost model metrics within indexer_service. For now I simply added recording of indexer_error at places without much refacotoring.

New in indexer-service-rs

IE074: "Failed to resolve the release version",
IE075: "Failed to parse response body to query string",

Recorded in indexer-service-rs

IE018: "Failed to query indexing status API",                                // Graph node query failed to get indexingStatuses
IE022: "Failed to identify attestation signer for allocation",               // Failed to get attestation signer from the cache
IE029: "Invalid Scalar-Receipt header provided",                             // subgraph query failed to extract receipt header
IE030: "No Scalar-Receipt header provided",                                  // Queries without receipt for paid receipts
IE031: "Invalid Scalar-Receipt value provided",                              // ReceiptManager add receipts, fails allocation receipts validator data test (update error message)
IE032: "Failed to process paid query",                                       // generic fail to execute paid queries
IE033: "Failed to process free query",                                       // generic fail to execute free queries
IE053: "Failed to queue receipts for collecting",                            // query fee allocation receipt collector collect receipts error
IE073: "Failed to query subgraph features from indexing statuses endpoint",  // query subgraph features failed

TODO: add to indexer-service-rs

// validate at start up
IE002: "Invalid Ethereum URL",                                           // Network provider validation
IE024: "Failed to connect to indexing status API",                       // Graph node query failed to connect with a basic subgraphDeployment check; similar to IE018

// Server layer
IE035: "Unhandled promise rejection",                                    // Service generic unhandled promise rejection
IE036: "Unhandled exception",                                            // Service generic uncaught exception

// Should be added/reframed with respect to TAP
IE054: "Failed to collect receipts in exchange for query fee voucher",   // query fee allocation receipt collector obtain receipt vouchers error
IE055: "Failed to redeem query fee voucher",                             // query fee allocation receipt collector submit voucher error
IE056: "Failed to remember allocation for collecting receipts later",    // query fee allocation receipt collector remember allocations error

// With refactoring of indexer_errors metrics into indexer_common
IE009: "Failed to query subgraph deployments worth indexing",            // Network monitor subgraph query, update error message
IE010: "Failed to query indexer allocations",                            // Network monitor allocation query
IE063: "No active allocation with provided id found",                    // network monitor allocation query

Not needed in indexer-serivce

IE001: "Failed to run database migrations",
IE006: "Failed to cross-check allocation state with contracts",
IE007: "Failed to check for network pause",
IE008: "Failed to check operator status for indexer",
IE011: "Failed to query claimable indexer allocations",
IE012: "Failed to register indexer",
IE013: "Failed to allocate: insufficient free stake",
IE014: "Failed to allocate: allocation not created on chain",
IE015: "Failed to close allocation",
IE016: "Failed to claim allocation",
IE017: "Failed to ensure default global indexing rule",
IE019: "Failed to query proof of indexing",
IE020: "Failed to ensure subgraph deployment is indexing",
IE021: "Failed to migrate cost model",
IE025: "Failed to query indexer management API",
IE026: "Failed to deploy subgraph deployment",
IE027: "Failed to remove subgraph deployment",
IE028: "Failed to reassign subgraph deployment",
IE034: "Not authorized as an operator for the indexer",
IE037: "Failed to query disputable allocations",
IE038: "Failed to query epochs",
IE039: "Failed to store potential POI disputes",
IE040: "Failed to fetch POI disputes",
IE044: "Failed to collect query fees on chain",
IE050: "Transaction reverted due to gas limit being hit",
IE051: "Transaction reverted for unknown reason",
IE052: "Transaction aborted: maximum configured gas price reached",
IE057: "Transaction reverted due to failing assertion in contract",
IE058: "Transaction failed because nonce has already been used",
IE059: "Failed to check latest operator ETH balance",
IE060: "Failed to allocate: Already allocating to the subgraph deployment",
IE061: "Failed to allocate: Invalid allocation amount provided",
IE062: "Did not receive tx receipt, not authorized or network paused",
IE064: "Failed to unallocate: Allocation cannot be closed in the same epoch it was created"
IE065: "Failed to unallocate: Allocation has already been closed",
IE066: "Failed to allocate: allocation ID already exists on chain",
IE067: "Failed to query POI for current epoch start block",
IE068: "User-provided POI did not match reference POI from graph-node",
IE069: "Failed to query Epoch Block Oracle Subgraph",
IE070: "Failed to query latest valid epoch and block hash",
IE071: "Add Epoch subgraph support for non-protocol chains",
IE072: "Failed to execute batch tx (contract: staking)",

Not used anywhere in typescript impl -> Deprecate/remove/refactor

IE003: "Failed to index network subgraph",
IE004: "Failed to synchronize with network",
IE005: "Failed to reconcile indexer and network",
IE023: "Failed to handle state channel message",
IE041: "Failed to query transfers to resolve",
IE042: "Failed to add transfer to the database",
IE043: "Failed to mark transfer as resolved",
IE045: "Failed to queue transfers for resolving",
IE046: "Failed to resolve transfer",
IE047: "Failed to mark transfer as failed",
IE048: "Failed to withdraw query fees for allocation",
IE049: "Failed to clean up transfers for allocation",

resolves #75, part of #4

@hopeyen hopeyen requested a review from Jannis October 23, 2023 10:18
Comment on lines 98 to 102
metrics::INDEXER_ERROR
.with_label_values(&[&IndexerErrorCode::IE033.to_string()])
.inc();

e
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool if indexer_error would automatically do this.

Ideally, we'd replicate what https://github.com/graphprotocol/indexer/blob/main/packages/indexer-common/src/errors.ts#L181-L183 does:

Basically, an indexer error should allow the indexer to detect it via metrics or logs.

Copy link
Collaborator Author

@hopeyen hopeyen Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah cool idea!

I moved metrics recording and logs into indexerError::new so they are automatically taken care of when an indexerError is created.

I would also remove the QueryError enum if it makes sense to you, I think the current variants can simply live as part of IndexerError

Comment on lines +46 to +51
let dependencies = pkg
.get("dependencies")
.and_then(|d| d.as_table())
.expect("Parse package dependencies");
let indexer_native = dependencies.get("indexer-native").map(|d| {
d.as_str()
.expect("Parse indexer-service dependency version")
.to_string()
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only called once on startup, I would just make any errors here fatal, i.e. use expect and no IndexerError. Only errors while running are worth tracking in my opinion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused 🧐 I didn't add IndexerError here, the only changes were from unwrap to expect for a better error message when it fails fatally

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept the expect messages changes and removed IndexerError from helper fn read_manifest

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused 🧐 I didn't add IndexerError here, the only changes were from unwrap to expect for a better error message when it fails fatally

Ah, sorry. I sometimes comment on things even if they were not introduced in the PR itself. Sometimes, a PR is a good opportunity to improve code that was there before.

@github-actions
Copy link

github-actions bot commented Oct 31, 2023

Pull Request Test Coverage Report for Build 6704892743

  • 0 of 119 (0.0%) changed or added relevant lines in 10 files are covered.
  • 4 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-1.0%) to 49.241%

Changes Missing Coverage Covered Lines Changed/Added Lines %
service/src/metrics/mod.rs 0 1 0.0%
service/src/server/routes/deployment.rs 0 1 0.0%
service/src/server/routes/mod.rs 0 2 0.0%
service/src/main.rs 0 3 0.0%
common/src/indexer_errors.rs 0 6 0.0%
service/src/util.rs 0 13 0.0%
service/src/server/routes/status.rs 0 16 0.0%
common/src/metrics/mod.rs 0 19 0.0%
service/src/query_processor.rs 0 29 0.0%
service/src/server/routes/subgraphs.rs 0 29 0.0%
Files with Coverage Reduction New Missed Lines %
service/src/config.rs 1 0.0%
service/src/query_processor.rs 1 36.24%
service/src/server/routes/subgraphs.rs 2 0.0%
Totals Coverage Status
Change from base Build 6644835579: -1.0%
Covered Lines: 1394
Relevant Lines: 2831

💛 - Coveralls

@hopeyen hopeyen requested a review from Jannis November 1, 2023 16:14
Comment on lines 9 to 17
let m = IntCounterVec::new(
Opts::new("indexer_error", "Indexer errors observed over time")
.namespace("indexer")
.subsystem("service"),
&["code"],
)
.expect("Failed to create indexer_error");
register(Box::new(m.clone())).expect("Failed to register indexer_error counter");
m
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be shorter but is mostly equivalent. I think the namespace and subsystem (in other words the indexer_service_ prefix that they result in) is different from what the current indexer-service does, so I'd drop those either way.

register_int_counter_vec!(
    "indexer_error",
    "Indexer errors observed over time",
    &["code"]
);

Comment on lines 20 to 33
#[allow(dead_code)]
pub static REGISTRY: Lazy<Registry> = Lazy::new(Registry::new);

#[allow(dead_code)]
pub fn register_metrics(registry: &Registry, metrics: Vec<Box<dyn Collector>>) {
for metric in metrics {
registry.register(metric).expect("Cannot register metrics");
}
}

/// Register indexer error metrics in Prometheus registry
pub fn register_indexer_error_metrics() {
register_metrics(&REGISTRY, vec![Box::new(INDEXER_ERROR.clone())]);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to just use the default registry instead of creating our own. I don't recall why we're using a custom one in indexer-service (and also graph-node), but I know that the default registry works well with autometrics which I believe is used in indexer-service-rs. That's why I think using register_int_counter_vec! is probably a good idea, because that uses the default registry.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice suggestion! I updated the labeling for both indexer_error and query metrics, used the macros, and removed the custom registry fb870a7. The code looks a lot cleaner than before.

The current custom registry should work fine with autometrics, though we are only using it for encode_global_metrics and global_metrics_exporter. I think it could be interesting to use autometrics for some functions to measure performance, like for execute_paid_query and execute_free_query. I don't think we have those measurements in TS but do lmk if you think adding some of those within this PR is a good idea

/// Start the basic metrics for indexer services
#[allow(dead_code)]
pub fn start_metrics() {
pub fn register_query_metrics() {
register_metrics(
&REGISTRY,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, I'd stick to the default registry.

Comment on lines +117 to +125
Err(e) => {
IndexerError::new(
IndexerErrorCode::IE031,
Some(IndexerErrorCause::new(
"Failed to parse receipt for a paid query",
)),
);

return Err(e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combination of QueryError and IndexerError makes sense — we're tracking the indexer error internally but return the query error to the client. 👍🏻

Comment on lines 132 to 134
Some(IndexerErrorCause::new(
"Failed to execute a paid subgraph query to graph node",
)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, where we essentially drop the error returned by Graph Node and just log a static string makes me wonder if we should have an IndexerErrorCause that supports taking other errors (e.g. the std::error::Error trait) that are then included in the error message.

The TypeScript indexer-service/-common does this by having IndexerErrorCause be an arbitraty type. Easy in TypeScript of course. Here, what we could do is something like

use std::fmt::{self};
use anyhow::anyhow; // 1.0.75

#[derive(Debug)]
enum IndexerErrorCode {
    IE001,
}

struct IndexerErrorCause<T: ToString>(T);

impl<T: ToString> From<T> for IndexerErrorCause<T> {
    fn from(t: T) -> Self {
        Self(t)
    }
}

impl<T: ToString> fmt::Display for IndexerErrorCause<T> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{}", self.0.to_string())
    }
}

struct IndexerError<T: ToString> {
    code: IndexerErrorCode,
    cause: IndexerErrorCause<T>
}

impl<T> IndexerError<T>
where T: ToString + Into<IndexerErrorCause<T>>
{
    pub fn new(code: IndexerErrorCode, cause: T) -> Self {
        Self { code, cause: cause.into() }
    }
}

impl<T: ToString> fmt::Display for IndexerError<T> {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}: {}", self.code, self.cause.to_string())
    }
}

fn main() {
    println!("{}", IndexerError::new(IndexerErrorCode::IE001, "str cause"));
    println!("{}", IndexerError::new(IndexerErrorCode::IE001, anyhow!("anyhow cause")));
}

You can see this in the playground here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5ff50fb1ffb9afe9574f1087ebb581fa

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Display instead of ToString might be better, I'm not sure which one is more common in situations like this. Perhaps @Theodus knows?

Copy link
Member

@Theodus Theodus Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ToString is generally preferred. When you implement Display, you get the impl of ToString automatically (docs)

Copy link
Member

@Theodus Theodus Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it seems like an alternative would be cause: anyhow::Error. Since it looks like you're recreating a subset of what anyhow::Error does, but with the additional overhead of the error type not getting erased. Unless you want separate error types in IndexerError for some reason.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed review here 🧠 !!

Looking at the TS implementation of subgraph query responses, I realized that we have been incorrectly only counting 200 response from graph node as Successful queries, whereas TS counts all responses when the request doesn't error out (It seems like to me).

When we get error for executing the query, we get QueryError which can be passed into IndexerErrorCause directly as IndexerErrorCause is currently defined as pub struct IndexerErrorCause(Box<dyn Error + Send + Sync>);.
Updating it to anyhow::Error could be nice, but I think the current struct accomplishes what we want to achieve here.

For now, I updated in e56b400 so the match statement will count FAILED_QUERIES only if execute_paid_query fails and it would log out and respond with "Failed to execute a paid subgraph query to graph node: " + QueryError back to the user

lmk what you think:)

Comment on lines +46 to +51
let dependencies = pkg
.get("dependencies")
.and_then(|d| d.as_table())
.expect("Parse package dependencies");
let indexer_native = dependencies.get("indexer-native").map(|d| {
d.as_str()
.expect("Parse indexer-service dependency version")
.to_string()
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused 🧐 I didn't add IndexerError here, the only changes were from unwrap to expect for a better error message when it fails fatally

Ah, sorry. I sometimes comment on things even if they were not introduced in the PR itself. Sometimes, a PR is a good opportunity to improve code that was there before.

@hopeyen hopeyen requested a review from Jannis November 2, 2023 20:27
Copy link
Collaborator

@Jannis Jannis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this adds better error reporting and still shaves of 20 lines of code 🔥

@hopeyen hopeyen merged commit 22f6f83 into main Nov 4, 2023
5 checks passed
@hopeyen hopeyen deleted the hope/record-indexer-errors branch November 4, 2023 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve utilization of IndexerError and tie them to prometheus metrics
3 participants