analyzer: grab blocks in parallelism-friendly way #389
Conversation
Force-pushed from 388cd99 to df96e4a.
Is this meant to allow multiple indexer processes to operate on the same DB? Could we parallelize block grabbing only within a single process for now, without coordinating in the DB?
Force-pushed from 1990e8d to e707e2e.
Could. The ticket suggested using the DB to coordinate as a possible solution, and I think it does make sense in the long term.
Force-pushed from a6b6798 to 0c98d5f.
ugh, I'd better look at the ticket
oh dang, it really does call for concurrent access from multiple indexer instances
Thank you, Peter! I have some nits (it's probably too early for them, but oh well), and a larger comment/question around the core logic of how we grab blocks.
I'm working on a related change (https://app.clickup.com/t/8669qdzbh), and had a few more realizations in the process of thinking about it.
Force-pushed from 2efd412 to 0675ca3.
Great progress!
I went overboard with one of the comment threads again :|. It's just a lot of text, not a big comment in terms of work.
Force-pushed from 0675ca3 to a3a3af4.
Nice. It feels like we're quite close now 👍
Force-pushed from 9f9b9d4 to da79ad7.
Force-pushed from 380fb26 to d844be1.
@mitjat ready for final reviews. I have one more todo planned for running …
BTW: The block analyzer currently doesn't ensure that blocks are indexed in order even if only a single analyzer is used (e.g. restarting the analyzer process will likely leave some unprocessed blocks locked, and those will be skipped until the lock timeout expires). So we likely want to merge #405 first, and then also implement a non-fast-sync mode that is aware of there being a single analyzer, ignores locks, and always processes blocks in order.
```diff
-    processed_time TIMESTAMP WITH TIME ZONE NOT NULL,
+    processed_time TIMESTAMP WITH TIME ZONE, -- NULL if the block is not yet processed.
+    locked_time    TIMESTAMP WITH TIME ZONE, -- NULL if the block is not locked.
```
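Expressed as incremental DDL, the schema change above would be along these lines (a sketch assuming Postgres; the PR's actual migration file is not shown in this thread):

```sql
-- Allow processed_time to be NULL (NULL = block not yet processed).
ALTER TABLE chain.processed_blocks
    ALTER COLUMN processed_time DROP NOT NULL;

-- Track locks on blocks (NULL = block not locked).
ALTER TABLE chain.processed_blocks
    ADD COLUMN locked_time TIMESTAMP WITH TIME ZONE;
```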
The `UnlockBlockForProcessing` can set it to `NULL` (fast-unlock in case of block-processing failure). However, we could change the `NULL` there to be `-infinity` (or `0`). Not sure why I didn't think of this before. I'll change it, so I can remove the `OR ... IS NULL` check in at least one place.
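For concreteness, the simplification described here might look like this (a sketch; `UnlockBlockForProcessing` is named in the comment above, but the exact SQL is an assumption):

```sql
-- Fast-unlock after a block-processing failure: set locked_time to
-- -infinity (which sorts before every timestamp) instead of NULL.
UPDATE chain.processed_blocks
SET locked_time = '-infinity'
WHERE height = $1 AND processed_time IS NULL;

-- The block-picking predicate then needs only one comparison:
--   before:  (locked_time IS NULL OR locked_time < NOW() - INTERVAL '5 minutes')
--   after:    locked_time < NOW() - INTERVAL '5 minutes'
-- (assuming locked_time is also created with a -infinity default, so the
--  IS NULL branch never applies).
```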
Uh, good point. I don't want to block you on #405, because I got sidetracked with a higher-priority request of making sapphire testnet indexing work. What if this PR added a `fastSync` top-level constant, or an env-variable-based toggle? Then #405 or its followup (which will implement fast-sync everywhere) can rework that into a parameter to …
Did you have different thoughts about implementing slow sync?
Thank you for all the work on this!
This LGTM 🎉, with the exception of the point you bring up about slow sync, i.e. guaranteeing that blocks will be processed in order. Maybe you can add that as a separate commit?
Force-pushed from f697eb5 to 73e71ec.
Thanks for the suggestion!
I opted for a similar but slightly different solution. Basically, we use the same query in slow-sync mode, only ignoring locks (this is done by using …
I chose to ignore locks, as that is a much more fool-proof solution than relying on unlocking blocks. In unexpected crashes or restarts of the process/host, there is no way to ensure that locks are cleaned up.
I opted against doing this, because there are cases where locked blocks are expected to be present (left over due to a restart or a crash of the slow-sync analyzer mid-batch). Ensuring that all "fast-synced" blocks were processed, and that there is only a single instance of the "slow-sync" analyzer running, should be handled in the code setting up the analyzers; a sketch of the shared query follows below. See the second commit in the PR for the details.
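For illustration, one way the shared query could be parameterized so that slow-sync "ignores locks" (a sketch under the assumption that the lock-expiry interval is passed as a query parameter; the PR's actual mechanism may differ):

```sql
-- $1 = lock-expiry interval, $2 = batch size.
-- Fast-sync passes e.g. '5 minutes'; slow-sync passes '0 seconds', which
-- makes every existing lock count as expired, i.e. locks are ignored and
-- the lowest unprocessed block is always picked first.
SELECT height
FROM chain.processed_blocks
WHERE processed_time IS NULL
  AND (locked_time IS NULL OR locked_time < NOW() - $1::interval)
ORDER BY height
LIMIT $2;
```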
Force-pushed from 4501625 to 103cb7f.
I'd say if anything, it's the other way around. These particular locks expire with time, so the situation is self-healing. OTOH now that we're ignoring locks, we have to be careful to not ever run two slow-sync analyzers in parallel (even accidentally, e.g. during a k8s rolling upgrade); or we have to have some code that checks for this situation, as you suggested below.
I'd prefer not to have the controlling code know about the DB and the analyzers' internals. So I would opt to add this check into the slow-sync path of the BlockAnalyzer after all. But wherever it gets added, let's do it in a separate PR so we can wrap this one up. The current situation (before this PR) is no different: the enforcement of a single indexer running at a time has to be manual and external.
Thank you!
Thanks for all the reviews! I'll likely merge this tomorrow; I plan to run one more test.
I just don't like the idea that in slow-sync mode (the single-analyzer case) a non-graceful restart would cause 5 minutes (= the lock timeout) of downtime during which no blocks would be processed. The 5-minute timeout was chosen arbitrarily; I would feel differently about it if it were a minute or so, so it might make sense to use a lower timeout at least for slow-sync. Technically, the locks check could also miss other analyzers if the query ran right after another analyzer finished a batch, since there would be no locked blocks at that moment, so it's not an ideal solution. Also, there are other ways to detect whether multiple analyzers are running, e.g. checking that the last-processed block increases as expected. Anyway, agreed to revisit this afterwards.
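For reference, the lock-based check discussed here could be as simple as the following (a hypothetical query; as both sides note above, it is only a heuristic):

```sql
-- Count unprocessed blocks with unexpired locks. A nonzero count at
-- slow-sync startup *may* indicate another analyzer instance is running,
-- but leftover locks from a crashed batch look identical, and an analyzer
-- that is between batches leaves no locks at all.
SELECT COUNT(*)
FROM chain.processed_blocks
WHERE processed_time IS NULL
  AND locked_time >= NOW() - INTERVAL '5 minutes';
```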
Force-pushed from 103cb7f to e2edd74.
This PR adds support for running multiple consensus/runtime analyzers in parallel. Additionally, the previously duplicated "block-picking" code is unified between the consensus and runtime analyzers in `analyzer/block/block.go`.

"Locking" is implemented by adding a `locked_time TIMESTAMP WITH TIME ZONE` column to the `chain.processed_blocks` table; `processed_time` is updated to allow NULL values. The "fast-sync" semantics are: an analyzer grabs a batch of unprocessed blocks that are unlocked or whose locks have expired, locks them by setting `locked_time`, and processes them in no guaranteed order; failed blocks are fast-unlocked so they can be retried, and stale locks expire after a timeout.

The block analyzer also supports a "slow-sync" mode, where it is assumed that a single analyzer is processing the blocks, and it ensures that blocks are processed in order.
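To make those semantics concrete, a minimal sketch of the fast-sync grab (the query text is an assumption; the PR's real query also has to register heights the indexer has never seen, and bounds its search with the `highest_done_block` subquery mentioned in the TODO below, both omitted here):

```sql
-- Atomically lock a batch of up to 100 grabbable blocks and return them.
-- A block is grabbable if it is unprocessed and either unlocked or its
-- lock has expired.
UPDATE chain.processed_blocks
SET locked_time = NOW()
WHERE processed_time IS NULL
  AND (locked_time IS NULL OR locked_time < NOW() - INTERVAL '5 minutes')
  AND height IN (
      SELECT height
      FROM chain.processed_blocks
      WHERE processed_time IS NULL
        AND (locked_time IS NULL OR locked_time < NOW() - INTERVAL '5 minutes')
      ORDER BY height
      LIMIT 100
  )
RETURNING height;
```

Repeating the grabbability predicate in the outer WHERE matters under READ COMMITTED: if two analyzers race on the same candidate heights, the second re-evaluates the predicate after the first one's locks commit, and skips those rows.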
TODO:

- rebase on upstream
- add unit tests for the new `blockBasedAnalyzer` code
- sanity-check the mega-query performance on a huge `chain.processed_blocks` table. Two cases: in one, the query took ~10ms and indexes were used as expected; in the other, ~1sec, mostly due to the `highest_done_block` subquery, which needs to scan all expired locks. Note: the latter case is likely unable to happen with the current design of the indexer: even if all analyzers keep failing all blocks, the locks will expire (or be unlocked) and the analyzers will keep retrying the same blocks, so there is a limit on the number of failed blocks in the DB.
- run on mainnet (with parallelism=1), ensure blocks/sec stays relatively similar, and sanity-check the results (`reindex_and_run.sh`)