Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing time instrumentation #1338

Merged
merged 14 commits into from
Nov 14, 2019
Merged

Conversation

leoyvens
Copy link
Collaborator

@leoyvens leoyvens commented Nov 7, 2019

Resolves #1302.

This adds Prometheus metrics for measuring the total time to sync a subgraph, broken down by sections. This PR marks some sections and a convenient API is introduced for marking sections as needed. This also introduces a metric for measuring the impact of DB contention on performance. Grafana panels were added to display these metrics, which looks like this (timings for syncing erc20 up to about block 5400000):

Captura de Tela 2019-11-07 às 13 10 07

See https://github.com/graphprotocol/instrumentation/pull/1 for the panels.

@leoyvens leoyvens requested a review from a team November 7, 2019 16:15
@lutter
Copy link
Collaborator

lutter commented Nov 7, 2019

One thing I am wondering: it might be useful to also measure how much time we spend in block reversions. Would that be terribly hard to add?

@leoyvens
Copy link
Collaborator Author

leoyvens commented Nov 7, 2019

@lutter I made this stop registering metrics when the subgraph becomes synced, I think what we care about measuring right now is how fast a subgraph becomes synced. Since reverts usually happen after the subgraph has synced, it doesn't fit with the theme of this PR, though it would be a good standalone metric to add. Shouldn't be hard to add, we have separate gauges for other things including number of blocks reverted.

@Jannis
Copy link
Contributor

Jannis commented Nov 7, 2019

@leoyvens I'll review later but one quick comment: any chance we could benchmark the WASM memory allocation? You improved that recently but the time we spend there used to be a significant unknown between store.get returning and the mapping execution being able to use the value. Would be nice to know how long this takes now. What do you think?

@lutter
Copy link
Collaborator

lutter commented Nov 7, 2019

@leoyvens what made me mention reverts is that I have seen some queries related to reverts take a long time (something like 2.5s) and having a better understanding of how much time we spend there would be great. But I agree with your point about time-to-sync being more interesting right now.

@leoyvens
Copy link
Collaborator Author

@Jannis I've added a section for store_get_asc_new, it's about 10% of the total store.get time. It might become something we need to optimize but not an issue right now.

Copy link
Collaborator

@lutter lutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@leoyvens
Copy link
Collaborator Author

@Jannis do you want to take a look? Should I merge?

core/src/subgraph/instance_manager.rs Show resolved Hide resolved
graph/src/components/metrics/stopwatch.rs Outdated Show resolved Hide resolved
let mut inner = StopwatchInner {
total_counter: *registry
.new_counter(
format!("{}_total_secs", subgraph_id),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure total_secs is the right term. Maybe {}_secs_to_synced or {}_syncing_secs? Who knows what we might use total for in the future, maybe all time spent on a subgraph?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is fine actually, as the stopwatch metrics could be used in other places too. But then I'd maybe pass in a sync prefix here that allows this to become {}_sync_total_secs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I hadn't considered the reuse potential. Should all metrics be prefixed with _sync or just the one for the totals?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I thought we should prefix all but I think just the totals works. {subgraph}_total_secs is just too generic, but {subgraph}_store_get (or whatever) makes sense even if it's just capture during syncing for now. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I'll change just the total_secs to sync_total_secs.

graph/src/components/metrics/stopwatch.rs Outdated Show resolved Hide resolved
graph/src/components/metrics/stopwatch.rs Outdated Show resolved Hide resolved
graph/src/components/metrics/stopwatch.rs Outdated Show resolved Hide resolved
graph/src/components/metrics/stopwatch.rs Show resolved Hide resolved
runtime/wasm/src/module/mod.rs Show resolved Hide resolved
store/postgres/src/store.rs Outdated Show resolved Hide resolved
@leoyvens
Copy link
Collaborator Author

@Jannis thanks for the review, I've responded to all comments.

@Jannis
Copy link
Contributor

Jannis commented Nov 13, 2019

@leoyvens Ok, just the one comment about _total_secs remains.

@leoyvens leoyvens force-pushed the leo/indexing-time-instrumentation branch from 3866c0b to 97d47c4 Compare November 14, 2019 12:10
@leoyvens
Copy link
Collaborator Author

@Jannis renamed that.

Copy link
Contributor

@Jannis Jannis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@leoyvens leoyvens merged commit 107b768 into master Nov 14, 2019
@leoyvens leoyvens deleted the leo/indexing-time-instrumentation branch November 14, 2019 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve instrumentation to measure how much time all parts of indexing take
3 participants