
analyzer: grab blocks in parallelism-friendly way #389

Merged
merged 2 commits into from May 13, 2023

Conversation

ptrus
Member

@ptrus ptrus commented Apr 25, 2023

This PR adds support for running multiple consensus/runtime analyzers in parallel. Additionally, the previously duplicated "block-picking" code is now shared between the consensus and runtime analyzers in analyzer/block/block.go.

"Locking" is implemented by adding a locked_time TIMESTAMP WITH TIME ZONE column to the chain.processed_blocks table, and processed_time is changed to allow NULL values. The "fast-sync" semantics are:

| locked_time | processed_time | state |
|---|---|---|
| "expired" (application-controlled: 5 minutes) | NULL | unlocked: block can be processed |
| set and not expired | NULL | locked: block is being processed (started at locked_time) |
| set | set | done: block was processed (at processed_time) |

The block analyzer also supports a "slow-sync" mode, which assumes that a single analyzer is processing the blocks and ensures that they are processed in order.

TODO:

  • rebase on upstream

  • add unit tests for the new blockBasedAnalyzer code

    • added two basic happy-path tests,
    • add one more test with some failure cases
  • sanity-check the mega-query performance on a huge table

    • tried locally with 10_000_000 rows in the chain.processed_blocks table. Two cases:
      • most blocks processed: expected case, queries ran in ~10ms and indexes were used as expected
      • most blocks unprocessed: a pathological case where there are 10 million locked and failed (expired) blocks in the DB. Queries took up to ~1 sec in this case, mostly due to the highest_done_block subquery, which needs to scan all expired locks. Note: this case is unlikely to happen with the current design of the indexer: even if all analyzers keep failing on all blocks, the locks will expire (or be unlocked) and the analyzers will keep retrying the same blocks, so there is a limit on the number of failed blocks in the DB.
  • run on mainnet (with parallelism=1) and ensure blocks/sec stays relatively similar, and sanity check results (reindex_and_run.sh)

    • blocks/sec stays consistent on my node (~18 blocks/sec)
    • sanity-check results: no regressions in first 100 blocks

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 388cd99 to df96e4a Compare April 25, 2023 20:31
Collaborator

@pro-wh pro-wh left a comment


is this meant to allow multiple indexer processes to operate on the same db? could we only parallelize block grabbing in a single process for now, without coordinating in the db

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 1990e8d to e707e2e Compare April 26, 2023 07:37
@ptrus
Member Author

ptrus commented Apr 26, 2023

could we only parallelize block grabbing in a single process for now, without coordinating in the db

Could. The ticket suggested using the DB to coordinate as a possible solution, and I think it does make sense in the long term.
The single-process parallelism would indeed be a bit simpler, though.
We can probably run some tests on how much in-process parallelism even helps with the #385 fix in. I would assume it's less effective than before.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 5 times, most recently from a6b6798 to 0c98d5f Compare April 26, 2023 15:58
@pro-wh
Collaborator

pro-wh commented Apr 26, 2023

ugh I'd better look at the ticket

@pro-wh
Collaborator

pro-wh commented Apr 26, 2023

oh dang, it really does call for concurrent access from multiple indexer instances

https://app.clickup.com/t/8669qeadp

Collaborator

@mitjat mitjat left a comment


Thank you Peter! I have some nits (it's probably too early for them, but oh well), and a larger comment/question around the core logic of how we grab blocks.

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go, analyzer/runtime/runtime.go, storage/oasis/consensus.go
Collaborator

@mitjat mitjat left a comment


I'm working on a related change (https://app.clickup.com/t/8669qdzbh), and had a few more realizations in the process of thinking about it.

Review comments on: analyzer/block/block.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 2efd412 to 0675ca3 Compare May 4, 2023 16:21
Collaborator

@mitjat mitjat left a comment


Great progress!
I went overboard with one of the comment threads again :|. It's just a lot of text, not a big comment in terms of work.

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go, storage/oasis/consensus.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 0675ca3 to a3a3af4 Compare May 5, 2023 12:41
Collaborator

@mitjat mitjat left a comment


Nice. It feels like we're quite close now 👍

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 9f9b9d4 to da79ad7 Compare May 8, 2023 16:30
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 380fb26 to d844be1 Compare May 11, 2023 19:11
@ptrus ptrus marked this pull request as ready for review May 11, 2023 21:14
@ptrus
Member Author

ptrus commented May 11, 2023

@mitjat ready for final reviews. I have one more todo planned for running reindex_and_run.sh and sanity checking results, other than that I think this is mostly ready. I have also updated the PR description.

@ptrus
Member Author

ptrus commented May 11, 2023

BTW: The block analyzer currently doesn't ensure that blocks are indexed in order even if only a single analyzer is used (e.g. restarting the analyzer process will likely leave some locked unprocessed blocks that will be skipped for the timeout period). So we likely want to merge #405 first, and then also implement a non-fast-sync mode that is aware there is only a single analyzer, ignores locks, and always processes in order.

- processed_time TIMESTAMP WITH TIME ZONE NOT NULL,
+ processed_time TIMESTAMP WITH TIME ZONE, -- NULL if the block is not yet processed.
+ locked_time TIMESTAMP WITH TIME ZONE, -- NULL if the block is not locked.
Member Author


UnlockBlockForProcessing can set it to NULL (fast unlock in case of a block-processing failure).

However, we could change the NULL there to -infinity (or 0). Not sure why I didn't think of this before. I'll change it, so I can remove the OR IS NULL check in at least one place.

@mitjat
Collaborator

mitjat commented May 11, 2023

BTW: The block analyzer currently doesn't ensure that blocks are indexed in order even if only a single analyzer is used

Uh, good point. I don't want to block you on #405, because I got sidetracked with a higher-priority request of making Sapphire Testnet indexing work.

What if this PR added a fastSync top-level constant, or env-variable-based toggle? Then #405 or its followup (which will implement fast-sync everywhere) can rework that into a parameter to Start(). I think for non-fast-sync mode, we need:

  • To run a check that there are no locked blocks right before the megaquery, under the same advisory lock
  • To change the continue in line 214 (after a failed block) into a break. Ideally we'd also unlock the remaining blocks in the batch at that point. Alternatively, replace continue with "retry loop at current height".

Did you have different thoughts about implementing slow sync?

Collaborator

@mitjat mitjat left a comment


Thank you for all the work on this!
This LGTM 🎉 , with the exception of the point you bring up about slow sync, i.e. guaranteeing that blocks will be processed in-order. Maybe you can add that as a separate commit?

Review comment on: analyzer/queries/queries.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from f697eb5 to 73e71ec Compare May 12, 2023 10:59
@ptrus
Member Author

ptrus commented May 12, 2023

Thanks for the suggestion!

Did you have different thoughts about implementing slow sync?

I opted for a similar but slightly different solution. Basically, we use the same query in slow-sync mode, only ignoring locks (done by passing 0 for lockExpiry in the query). As you suggested, we also stop processing the batch when encountering an error, but we don't worry about unlocking blocks, since the next iteration of picking blocks will treat all locks as expired.

Ideally we'd also unlock the remaining blocks in the batch at that point

I chose to ignore locks, as that is a much more fool-proof solution than relying on unlocking blocks. With unexpected crashes or restarts of the process/host, there is no way to guarantee that locks are cleaned up.

To run a check that there are no locked blocks right before the megaquery, under the same advisory lock

I opted against doing this, because there are cases where locked blocks are expected to be present (left over from a restart or a crash of the slow-sync analyzer mid-batch). Ensuring that all "fast-synced" blocks were processed and that only a single instance of the "slow-sync" analyzer is running should be handled by the code that sets up the analyzers.


See the second commit in the PR for the details.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 6 times, most recently from 4501625 to 103cb7f Compare May 12, 2023 18:34
@mitjat
Collaborator

mitjat commented May 12, 2023

I chose to ignore locks, as that is a much more fool-proof solution than relying on unlocking blocks.

I'd say if anything, it's the other way around. These particular locks expire with time, so the situation is self-healing. OTOH now that we're ignoring locks, we have to be careful to not ever run two slow-sync analyzers in parallel (even accidentally, e.g. during a k8s rolling upgrade); or we have to have some code that checks for this situation, as you suggested below.

Ensuring all "fast-synced" blocks were processed and that there is only a single instance of the "slow-sync" analyzer running should be handled in the code setting up the analyzers.

I'd prefer not to have the controlling code know about the DB and the analyzers' internals. So I would opt to add this check to the slow-sync path of the BlockAnalyzer after all. But wherever it gets added, let's do it in a separate PR so we can wrap this one up. The current situation (before this PR) is no different: the enforcement of a single indexer running at a time has to be manual and external.

Collaborator

@mitjat mitjat left a comment


Thank you!

@ptrus
Member Author

ptrus commented May 12, 2023

Thanks for all the reviews! I'll likely merge this tomorrow; I plan to run one more test.

I'd say if anything, it's the other way around. These particular locks expire with time, so the situation is self-healing

I just don't like the idea that in slow-sync mode (the single-analyzer case), a non-graceful restart would cause 5 minutes (= the lock timeout) of downtime during which no blocks would be processed. The 5-minute timeout was chosen arbitrarily; I would feel differently if it were a minute or so. It might make sense to use a lower timeout, at least for slow-sync.

Technically, the locks check could also miss other analyzers: if the query ran right after another analyzer finished a batch, there would be no locked blocks at that moment. So it's not an ideal solution either.

Also, there are other ways to detect whether multiple analyzers are running, e.g. checking that the last-processed block increases as expected.

Anyway, agree to revisiting this after.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 103cb7f to e2edd74 Compare May 13, 2023 12:38
@ptrus ptrus merged commit 46e5f0e into main May 13, 2023
5 checks passed
@ptrus ptrus deleted the ptrus/feature/parallelism-blocks branch May 13, 2023 12:51
@ptrus ptrus mentioned this pull request May 17, 2023