
analyzer: grab blocks in parallelism-friendly way #389

Merged
merged 2 commits into from May 13, 2023

Conversation

ptrus
Member

@ptrus ptrus commented Apr 25, 2023

This PR adds support for running multiple consensus/runtime analyzers in parallel. Additionally, the previously duplicated "block-picking" code is now shared between the consensus and runtime analyzers in analyzer/block/block.go.

"Locking" is implemented by adding a locked_time TIMESTAMP WITH TIME ZONE column to the chain.processed_blocks table, and processed_time is changed to allow NULL values. The "fast-sync" semantics are:

| locked_time | processed_time | state |
|---|---|---|
| "expired" (application-controlled: 5 minutes) | NULL | unlocked: block can be processed |
| set and not expired | NULL | locked: block is being processed (started at locked_time) |
| set | set | done: block was processed (at processed_time) |

The block analyzer also supports a "slow-sync" mode, which assumes that a single analyzer is processing the blocks and ensures that they are processed in order.

TODO:

  • rebase on upstream

  • add unit tests for the new blockBasedAnalyzer code

    • added two basic happy-path tests,
    • add one more test with some failure cases
  • sanity-check the mega-query performance on a huge table

    • tried locally with 10_000_000 rows in the chain.processed_blocks table. Two cases:
      • most blocks processed: expected case, queries ran in ~10ms and indexes were used as expected
      • most blocks unprocessed: a pathological case where there are 10 million locked and failed (expired) blocks in the DB. Queries took up to ~1 sec in this case, mostly due to the highest_done_block subquery, which needs to scan all expired locks. Note: this case is unlikely to happen with the current design of the indexer: even if all analyzers keep failing on all blocks, the locks will expire (or be unlocked) and the analyzers will keep retrying the same blocks, so there is a limit on the number of failed blocks in the DB.
  • run on mainnet (with parallelism=1) and ensure blocks/sec stays relatively similar, and sanity check results (reindex_and_run.sh)

    • blocks/sec stays consistent on my node (~18 blocks/sec)
    • sanity-check results: no regressions in first 100 blocks

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 388cd99 to df96e4a Compare April 25, 2023 20:31
Collaborator

@pro-wh pro-wh left a comment


is this meant to allow multiple indexer processes to operate on the same db? could we only parallelize block grabbing in a single process for now, without coordinating in the db

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 1990e8d to e707e2e Compare April 26, 2023 07:37
@ptrus
Member Author

ptrus commented Apr 26, 2023

could we only parallelize block grabbing in a single process for now, without coordinating in the db

Could. The ticket suggested using the DB to coordinate as a possible solution, and I think it does make sense in the long term.
The single-process parallelism would indeed be a bit simpler, though.
We can probably run some tests on how much in-process parallelism even helps with the #385 fix in. I would assume it's less effective than before.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 5 times, most recently from a6b6798 to 0c98d5f Compare April 26, 2023 15:58
@pro-wh
Collaborator

pro-wh commented Apr 26, 2023

ugh I'd better look at the ticket

@pro-wh
Collaborator

pro-wh commented Apr 26, 2023

oh dang, it really does call for concurrent access from multiple indexer instances

https://app.clickup.com/t/8669qeadp

Collaborator

@mitjat mitjat left a comment


Thank you Peter! I have some nits (it's probably too early for them, but oh well), and a larger comment/question around the core logic of how we grab blocks.

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go, analyzer/runtime/runtime.go, storage/oasis/consensus.go
Collaborator

@mitjat mitjat left a comment


I'm working on a related change (https://app.clickup.com/t/8669qdzbh), and had a few more realizations in the process of thinking about it.

Review comments on: analyzer/block/block.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 2efd412 to 0675ca3 Compare May 4, 2023 16:21
Collaborator

@mitjat mitjat left a comment


Great progress!
I went overboard with one of the comment threads again :|. It's just a lot of text, not a big comment in terms of work.

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go, storage/oasis/consensus.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 0675ca3 to a3a3af4 Compare May 5, 2023 12:41
Collaborator

@mitjat mitjat left a comment


Nice. It feels like we're quite close now 👍

Review comments on: analyzer/block/block.go, analyzer/queries/queries.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from 9f9b9d4 to da79ad7 Compare May 8, 2023 16:30
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 380fb26 to d844be1 Compare May 11, 2023 19:11
@ptrus ptrus marked this pull request as ready for review May 11, 2023 21:14
@ptrus
Member Author

ptrus commented May 11, 2023

@mitjat ready for final reviews. I have one more todo planned for running reindex_and_run.sh and sanity checking results, other than that I think this is mostly ready. I have also updated the PR description.

@ptrus
Member Author

ptrus commented May 11, 2023

BTW: The block analyzer currently doesn't ensure that blocks are indexed in order even if only a single analyzer is used (e.g. restarting the analyzer process will likely leave some locked unprocessed blocks that will be skipped for the timeout period). So we likely want to merge #405 first, and then also implement a non-fast-sync mode that is aware there is only a single analyzer, ignores locks, and always processes in order.

- processed_time TIMESTAMP WITH TIME ZONE NOT NULL,
+ processed_time TIMESTAMP WITH TIME ZONE, -- NULL if the block is not yet processed.
+ locked_time TIMESTAMP WITH TIME ZONE, -- NULL if the block is not locked.
Member Author


UnlockBlockForProcessing can set it to NULL (fast unlock in case of a block-processing failure).

However, we could change the NULL there to -infinity (or 0). Not sure why I didn't think of this before. I'll change it, so I can remove the OR IS NULL check in at least one place.

@mitjat
Collaborator

mitjat commented May 11, 2023

BTW: The block analyzer currently doesn't ensure that blocks are indexed in order even if only a single analyzer is used

Uh, good point. I don't want to block you on #405, because I got sidetracked with a higher-priority request of making Sapphire Testnet indexing work.

What if this PR added a fastSync top-level constant, or env-variable-based toggle? Then #405 or its followup (which will implement fast-sync everywhere) can rework that into a parameter to Start(). I think for non-fast-sync mode, we need:

  • To run a check that there are no locked blocks right before the megaquery, under the same advisory lock
  • To change the continue in line 214 (after a failed block) into a break. Ideally we'd also unlock the remaining blocks in the batch at that point. Alternatively, replace continue with "retry loop at current height".

Did you have different thoughts about implementing slow sync?

Collaborator

@mitjat mitjat left a comment


Thank you for all the work on this!
This LGTM 🎉 , with the exception of the point you bring up about slow sync, i.e. guaranteeing that blocks will be processed in-order. Maybe you can add that as a separate commit?

Review comment on: analyzer/queries/queries.go
@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 4 times, most recently from f697eb5 to 73e71ec Compare May 12, 2023 10:59
@ptrus
Member Author

ptrus commented May 12, 2023

Thanks for the suggestion!

Did you have different thoughts about implementing slow sync?

I opted for a similar but slightly different solution. Basically, we use the same query in slow-sync mode, only ignoring locks (done by passing 0 for lockExpiry in the query). As you suggested, we also stop processing the batch when encountering an error, but we don't worry about unlocking blocks, since the next iteration of picking blocks will treat all locks as expired.

Ideally we'd also unlock the remaining blocks in the batch at that point

I chose to ignore locks, as that is a much more fool-proof solution than relying on unlocking blocks. With unexpected crashes or restarts of the process/host, there is no way to guarantee that locks are cleaned up.

To run a check that there are no locked blocks right before the megaquery, under the same advisory lock

I opted against doing this, because there are cases where locked blocks are expected to be present (left over from a restart or a crash of the slow-sync analyzer mid-batch). Ensuring that all "fast-synced" blocks were processed and that only a single instance of the "slow-sync" analyzer is running should be handled by the code that sets up the analyzers.


See the second commit in the PR for the details.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch 6 times, most recently from 4501625 to 103cb7f Compare May 12, 2023 18:34
@mitjat
Collaborator

mitjat commented May 12, 2023

I chose to ignore locks, as that is a much more fool-proof solution than relying on unlocking blocks.

I'd say if anything, it's the other way around. These particular locks expire with time, so the situation is self-healing. OTOH now that we're ignoring locks, we have to be careful to not ever run two slow-sync analyzers in parallel (even accidentally, e.g. during a k8s rolling upgrade); or we have to have some code that checks for this situation, as you suggested below.

Ensuring all "fast-synced" blocks were processed and that there is only a single instance of the "slow-sync" analyzer running should be handled in the code setting up the analyzers.

I'd prefer not to have the controlling code know about the DB and the analyzers' internals. So I would opt to add this check to the slow-sync path of the BlockAnalyzer after all. But wherever it gets added, let's do it in a separate PR so we can wrap this one up. The current situation (before this PR) is no different: the enforcement of a single indexer running at a time has to be manual and external.

Collaborator

@mitjat mitjat left a comment


Thank you!

@ptrus
Member Author

ptrus commented May 12, 2023

Thanks for all the reviews! I'll likely merge this tomorrow; I plan to run one more test.

I'd say if anything, it's the other way around. These particular locks expire with time, so the situation is self-healing

I just don't like the idea that in slow-sync mode (the single-analyzer case), a non-graceful restart would cause 5 minutes (= the lock timeout) of downtime during which no blocks would be processed. The 5-minute timeout was chosen arbitrarily; I would feel differently if it were a minute or so. It might make sense to use a lower timeout, at least for slow-sync.

Technically, the locks check could also miss other analyzers: if the query ran right after another analyzer finished a batch, there would be no locked blocks at that moment. So it's not an ideal solution either.

Also, there are other ways to detect whether multiple analyzers are running, e.g. checking that the last-processed block increases as expected.

Anyway, agree to revisiting this after.

@ptrus ptrus force-pushed the ptrus/feature/parallelism-blocks branch from 103cb7f to e2edd74 Compare May 13, 2023 12:38
@ptrus ptrus merged commit 46e5f0e into main May 13, 2023
5 checks passed
@ptrus ptrus deleted the ptrus/feature/parallelism-blocks branch May 13, 2023 12:51
@ptrus ptrus mentioned this pull request May 17, 2023