[Feature] Tree Sequence Based Backfilling #150

Closed · kespinola opened this issue Dec 5, 2023 · 5 comments · Fixed by rpcpool/digital-asset-rpc-infrastructure#114 or #163

kespinola (Collaborator) commented Dec 5, 2023

Issues

  1. Inefficiency in Bootstrapping: The current backfiller struggles to bootstrap a fresh DAS installation effectively. During transaction processing, an overwhelming influx of backfill_items congests the table. This situation leads to a backlog, as the backfiller cannot catch up with the initial tree discoveries at the start of the run.

    SELECT COUNT(backfilled), backfilled FROM backfill_items GROUP BY backfilled;
      count  | backfilled
    ---------+------------
     2430983 | f
          45 | t
    
  2. Ineffective Subsequent Backfill Attempts: After the initial setup, the backfiller identifies transaction gaps by scanning blocks sequentially. This often fetches blocks containing no transactions relevant to the targeted tree, resulting in wasted processing and an inability to reach a completion state.

  3. Lack of Process Continuity: Currently, the backfiller lacks the capability to resume from its last state. If interrupted, it loses all progress and must restart from the beginning, further hindering the backfill completion.

Goals

  • Enable backfill tasks to resume from their last state, eliminating redundant work.
  • Optimize RPC queries to ensure data relevance and minimize discarded data.
  • Maximize task efficiency through concurrent or parallel execution.

Proposal

To address the inefficiencies and improve the continuity of the backfill process, we propose the following enhancements:

  1. Enhanced State Tracking for Trees: Develop a detailed recording system that captures every sequence change of a tree within the tree_sequences table. This table will serve as a ledger, logging each modification to the trees along with its associated metadata, such as the sequence number (seq), the public key of the tree (tree), the index of the leaf that triggered the change (leaf_idx), the transaction signature (signature), and the specific instruction identifier (instruction).

  2. Optimized RPC Queries: Utilize the getSignaturesForAddress RPC call with the before and until parameters to precisely target the transaction gaps for each tree. This ensures that only relevant data is fetched, reducing unnecessary processing (see the sketch after this list).

  3. Concurrent Processing: Introduce concurrent or parallel processing of backfill tasks to expedite completion of the backfill process (a sketch follows the proposal summary).

  4. Database Schema Update: Introduce a new table, tree_sequences, to track the sequence of changes for each tree. The schema for the tree_sequences table will be as follows:

CREATE TABLE tree_sequences (
    -- Composite primary key consisting of the sequence number and tree public key
    seq BIGINT NOT NULL,
    tree BYTEA NOT NULL,
    -- Index of the leaf causing the tree change
    leaf_idx BIGINT,
    -- Signature of the transaction associated with the instruction
    signature BYTEA NOT NULL,
    -- Identifier for the instruction
    instruction VARCHAR,
    PRIMARY KEY (seq, tree)
);
  5. Gap Identification and Filling: Perform a database query to identify sequence gaps for each tree, then call getSignaturesForAddress with the until and before parameters set to the known tree_sequences signatures bounding the gap, as shown in the sketch below. A tree is considered up to date when the active on-chain sequence number matches the count of rows in tree_sequences.
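To make items 2 and 5 concrete, here is a minimal sketch pairing the gap query with the bounded fetch. It assumes sqlx for Postgres access and a recent solana_client/solana_sdk; the fetch_gap_signatures helper and the exact query shape are illustrative assumptions, not the final implementation.

use solana_client::nonblocking::rpc_client::RpcClient;
use solana_client::rpc_client::GetConfirmedSignaturesForAddress2Config;
use solana_sdk::{pubkey::Pubkey, signature::Signature};
use sqlx::PgPool;

// Hypothetical sketch: find the sequence gaps for one tree, then fetch
// only the transactions inside each gap, bounded by the signatures of
// the tree_sequences rows that enclose it.
async fn fetch_gap_signatures(
    pool: &PgPool,
    rpc: &RpcClient,
    tree: &Pubkey,
) -> anyhow::Result<()> {
    // LEAD() pairs each row with the next seq for this tree; any jump
    // larger than 1 is a gap. The signatures of the two bounding rows
    // become the before/until arguments for getSignaturesForAddress.
    let gaps: Vec<(Vec<u8>, Vec<u8>)> = sqlx::query_as(
        r#"
        SELECT next_signature, signature
        FROM (
            SELECT seq, signature,
                   LEAD(seq) OVER (ORDER BY seq) AS next_seq,
                   LEAD(signature) OVER (ORDER BY seq) AS next_signature
            FROM tree_sequences
            WHERE tree = $1
        ) s
        WHERE next_seq - seq > 1
        "#,
    )
    .bind(tree.to_bytes().to_vec())
    .fetch_all(pool)
    .await?;

    for (before_bytes, until_bytes) in gaps {
        let config = GetConfirmedSignaturesForAddress2Config {
            // before: the signature just above the gap; until: just below.
            before: Some(Signature::try_from(before_bytes.as_slice())
                .map_err(|_| anyhow::anyhow!("bad signature bytes"))?),
            until: Some(Signature::try_from(until_bytes.as_slice())
                .map_err(|_| anyhow::anyhow!("bad signature bytes"))?),
            limit: None,
            commitment: None,
        };
        let signatures = rpc
            .get_signatures_for_address_with_config(tree, config)
            .await?;
        for sig_info in signatures {
            // Everything returned involves this tree and sits inside the
            // gap, so nothing is discarded; hand each signature to the
            // existing transaction pipeline (e.g. the TXNFILL stream).
            println!("backfill candidate: {}", sig_info.signature);
        }
    }
    Ok(())
}

The completion check from item 5 then reduces to comparing the on-chain sequence number with SELECT COUNT(*) FROM tree_sequences WHERE tree = $1.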

By implementing these enhancements, we aim to achieve a more efficient, focused, and resilient backfill process.
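As a sketch of item 3's concurrency, the per-tree work can be fanned out with a bounded number of in-flight tasks. This assumes tokio and the futures crate; backfill_tree is a hypothetical stand-in for the gap fill sketched above.

use futures::{stream, StreamExt};
use solana_sdk::pubkey::Pubkey;

// Hypothetical per-tree unit of work (e.g. the gap-fill sketch above).
async fn backfill_tree(tree: Pubkey) -> anyhow::Result<()> {
    // ... query gaps, fetch signatures, submit to the TXNFILL stream ...
    Ok(())
}

// Drive up to `concurrency` per-tree backfills at once. Trees are
// independent, so tasks never contend on the same tree_sequences rows.
async fn backfill_all(trees: Vec<Pubkey>, concurrency: usize) {
    stream::iter(trees)
        .map(|tree| async move {
            if let Err(e) = backfill_tree(tree).await {
                eprintln!("backfill failed for {tree}: {e:#}");
            }
        })
        .buffer_unordered(concurrency)
        .collect::<Vec<_>>()
        .await;
}

buffer_unordered keeps at most `concurrency` trees in flight while letting fast trees finish without waiting on slow ones.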

References

A Mermaid sequence diagram outlining the current backfill implementation, for those who are ramping up on the topic.

sequenceDiagram
    participant Filler as Backfiller Filler Task
    participant Finder as Backfiller Finder Task
    participant DB as Database
    participant RPC as RPC Service
    participant Messenger as Messaging Service

    note over Filler: Periodically checks for trees to backfill
    Filler->>DB: Request trees to backfill
    DB-->>Filler: Return trees
    loop For each tree
        Filler->>DB: Check backfill strategy (from seq 1 or plug gaps)
        DB-->>Filler: Strategy info
        alt If backfill_from_seq_1 is true
            note right of Filler: backfill_from_seq_1 flag is set based on the tree's backfill requirements, determined earlier in the workflow
            Filler->>RPC: Call backfill_tree_from_seq_1 (RPC queries for full backfill)
            RPC-->>Filler: Block data for full backfill
            Filler->>DB: Update backfill items from sequence 1
        else If backfill_from_seq_1 is false
            Filler->>RPC: Call fetch_and_plug_gaps (RPC queries for specific gaps)
            RPC-->>Filler: Block data for specific gaps
            loop For each gap
                Filler->>+DB: Plug gap with data
                loop For each block in gap
                    Filler->>RPC: Fetch block transactions
                    RPC-->>Filler: Transactions data
                    loop For each transaction
                        Filler->>Filler: Check if transaction involves backfill tree
                        alt If transaction involves backfill tree
                            Filler->>Messenger: Submit encoded transaction to TXNFILL stream
                        end
                    end
                end
                DB-->>-Filler: Gap plugged
            end
        end
        Filler->>Messenger: Submit transactions to backfill
    end
    alt On success
        Filler->>DB: Clean up backfilled tree
    else On error
        Filler->>DB: Mark tree as failed (after retries)
    end

    note over Finder: Periodically looks for missing trees
    Finder->>RPC: fetch_trees_by_gpa (Get trees by GPA)
    RPC-->>Finder: Trees data
    Finder->>DB: Request locked/failed trees for filtering
    DB-->>Finder: Locked/Failed trees data
    Finder->>DB: Request missing trees based on filtered data
    DB-->>Finder: Return missing trees
    loop For each missing tree
        Finder->>DB: Set tree to backfill from 0
    end
    note over Filler, Finder: Handle errors and retries appropriately
kespinola changed the title from [Feature] Tree Transaction Based Backfilling to [Feature] Tree Sequence Based Backfilling on Dec 21, 2023
NicolasPennie (Collaborator) commented:

This approach makes good sense. We have been experimenting with a new cl_audits schema that essentially corresponds to your tree_sequences table. If you can convert to this schema, then we can cooperate more easily on this feature.

Here is the schema:

#[derive(Clone, Debug, PartialEq, DeriveModel, DeriveActiveModel, Serialize, Deserialize)]
pub struct Model {
    pub id: i64,                  // surrogate primary key
    pub tree: Vec<u8>,            // tree public key (tree_sequences.tree)
    pub leaf_idx: i64,            // index of the changed leaf (tree_sequences.leaf_idx)
    pub seq: i64,                 // tree sequence number (tree_sequences.seq)
    pub created_at: DateTime,     // row creation timestamp
    pub tx: Vec<u8>,              // transaction signature (tree_sequences.signature)
    pub instruction: Instruction, // instruction that produced the change
}

Another thing to consider is building a snapshotting mechanism so that you never need to re-index from the beginning. The hard part is snapshot verification.

austbot (Collaborator) commented Dec 23, 2023


Looks pretty awesome.

My two cents: originally we stored every sequence for every modification, meaning we had multiple copies of any filled tree node in the seq table. As usage goes up, this grows tremendously.

Instead, we started to overwrite, so we contain at most two copies, although the database table allows more.

kespinola (Collaborator, Author) commented Dec 24, 2023


> Looks pretty awesome.
>
> My two cents: originally we stored every sequence for every modification, meaning we had multiple copies of any filled tree node in the seq table. As usage goes up, this grows tremendously.
>
> Instead, we started to overwrite, so we contain at most two copies, although the database table allows more.

Thanks @austbot for the review. I agree this approach increases the storage requirements. However, the information is valuable to folks: a getSignaturesForAddress method was added to the API to return signatures for a leaf_id, which you cannot do with the Solana RPC because it only handles account addresses. The query is currently powered by cl_audits, which writes a complete revision history of the change log state and was over-storing information; they have since switched to the schema in this PR. The storage increase feels justified since it helps with queries, not just backfilling.

If sharding is required, doing so on the tree won't create cross-shard interactions.

I'm a long-term advocate of dApp developers being able to run app-specific indexes that store a desired subset of the trees, so they pay for only what they need.
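For illustration, with the proposed schema the leaf-level lookup described above reduces to a single indexed query over tree_sequences. A minimal sketch assuming sqlx; the function name is an assumption:

// Hypothetical query behind a leaf-level getSignaturesForAddress: all
// signatures that touched one (tree, leaf_idx) pair, newest first.
async fn signatures_for_leaf(
    pool: &sqlx::PgPool,
    tree: &[u8],
    leaf_idx: i64,
) -> anyhow::Result<Vec<Vec<u8>>> {
    let rows: Vec<(Vec<u8>,)> = sqlx::query_as(
        "SELECT signature FROM tree_sequences \
         WHERE tree = $1 AND leaf_idx = $2 \
         ORDER BY seq DESC",
    )
    .bind(tree)
    .bind(leaf_idx)
    .fetch_all(pool)
    .await?;
    Ok(rows.into_iter().map(|(sig,)| sig).collect())
}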

NicolasPennie (Collaborator) commented Dec 24, 2023 via email
