[Feature] Tree Sequence Based Backfilling #150
Comments
This approach makes good sense. We have been experimenting with a new cl_audits schema that essentially corresponds to your tree_sequences table. If you can convert to this schema, then we can cooperate more easily on this feature. Here is the schema:
Another thing to consider is building a snapshotting mechanism so that you never need to re-index from the beginning. The hard part is snapshot verification.
The backfiller prototype: https://github.com/rpcpool/digital-asset-rpc-infrastructure/tree/triton-build/tree_backfiller
Looks pretty awesome. My two cents: originally we stored every sequence for every modification, meaning we had multiple copies of any filled tree nodes in the seq table. As usage goes up, this grows tremendously. Instead we started to overwrite, so we retain at most two copies, although the database table allows more.
Thanks @austbot for the review. I agree this approach does increase the storage requirements. However, the information is valuable to folks, as a getSignaturesForAddress method was added to the API to return signatures for a leaf_id, which you cannot do with Solana RPC because it only handles account addresses. The query is currently powered by cl_audits, which writes a complete revision history of the change-log state and was over-storing information. They have since switched to the schema in this PR. The storage increase feels justified since it helps with queries and not just backfilling. If sharding is required, doing so on the tree won't have cross-shard interactions. I'm a long-term advocate of dApp developers being able to run app-specific indexes that store a desired subset of the trees, so they pay for what they need.
The storage for cl_audits_v2 is not significant. Current trees take around 40-50 GB, which is less than other tables. The reason it's a lot smaller is that it uses enums and doesn't store all nodes; it only stores one row per seq. Given that most trees have depth 20, this is roughly a 20x reduction.
…On Sun, Dec 24, 2023, 2:46 a.m., Kyle Espinola wrote:
Issues
Inefficiency in Bootstrapping: The current backfiller struggles to bootstrap a fresh DAS installation effectively. During transaction processing, an overwhelming influx of backfill_items congests the table. This leads to a backlog, as the backfiller cannot catch up with the initial tree discoveries at the start of the run.
Ineffective Subsequent Backfill Attempts: After the initial setup, the backfiller identifies transaction gaps by sequentially scanning blocks. This method often processes blocks with no relevant transactions for the targeted tree, resulting in inefficient processing and an inability to reach a completion state.
Lack of Process Continuity: The backfiller currently cannot resume from its last state. If interrupted, it loses all progress and must restart from the beginning, further delaying backfill completion.
Goals
Proposal
To address the inefficiencies and improve the continuity of the backfill process, we propose the following enhancements:
Enhanced State Tracking for Trees: Develop a detailed recording system that captures every sequence change of a tree within the tree_sequences table. This table will serve as a ledger, logging each modification to the trees along with its associated metadata, such as the sequence number (seq), the public key of the tree (tree), the index of the leaf that triggered the change (leaf_idx), the transaction signature (signature), and the specific instruction identifier (instruction).
Optimized RPC Queries: Utilize the getSignaturesForAddress RPC call with the before and until parameters to precisely target the transaction gaps for each tree. This will ensure that only relevant data is fetched, reducing unnecessary processing.
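As an illustration of the gap-targeting described above, here is a minimal Python sketch of backward pagination with before/until bounds. The `fetch_signatures` callable is a hypothetical stand-in for the Solana getSignaturesForAddress RPC call, not part of the actual implementation:

```python
def backfill_gap(fetch_signatures, tree, before=None, until=None, limit=1000):
    """Collect all signatures for `tree` between `until` (older, exclusive)
    and `before` (newer, exclusive), paging backward in time.

    `fetch_signatures` returns signatures newest-first, mirroring the
    semantics of getSignaturesForAddress.
    """
    signatures = []
    while True:
        page = fetch_signatures(tree, before=before, until=until, limit=limit)
        if not page:
            break
        signatures.extend(page)
        if len(page) < limit:
            # Partial page: we have reached the `until` bound.
            break
        # Continue paging past the oldest signature seen so far.
        before = page[-1]
    return signatures
```

Because only signatures inside the (until, before) window are fetched, the backfiller avoids scanning blocks that contain no transactions for the targeted tree.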
Concurrent Processing: Introduce concurrent or parallel processing of backfill tasks to expedite the completion of the backfill process.
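One way to sketch the proposed concurrent processing, assuming backfill tasks are independent per tree (the thread pool, `backfill_one` callback, and worker count below are illustrative, not the actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def backfill_trees(trees, backfill_one, max_workers=8):
    """Run one backfill task per tree concurrently and collect results.

    `backfill_one` is a hypothetical callable that backfills a single
    tree and returns its result; trees are processed independently.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(backfill_one, tree): tree for tree in trees}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Since each tree's gaps are queried and filled independently, per-tree parallelism needs no cross-task coordination beyond writing to the shared database.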
Database Schema Update: Introduce a new table, tree_sequences, to track the sequence of changes for each tree. The schema for the tree_sequences table will be as follows:
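A minimal sketch of such a table, assuming PostgreSQL and inferring column types from the fields described above; the actual schema in the PR may differ:

```sql
-- Hypothetical sketch of tree_sequences; types and constraints are assumptions.
CREATE TABLE tree_sequences (
    tree        BYTEA  NOT NULL,  -- public key of the tree
    seq         BIGINT NOT NULL,  -- change-log sequence number
    leaf_idx    BIGINT,           -- index of the leaf that triggered the change
    signature   BYTEA  NOT NULL,  -- transaction signature
    instruction TEXT   NOT NULL,  -- specific instruction identifier
    PRIMARY KEY (tree, seq)
);
```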
By implementing these enhancements, we aim to achieve a more efficient, focused, and resilient backfill process.
References
A sequence diagram outlining the current backfill implementation, for those looking to ramp up on the topic.