[Feature] Tree Sequence Based Backfilling #150

Closed · kespinola opened this issue Dec 5, 2023 · 5 comments · Fixed by rpcpool/digital-asset-rpc-infrastructure#114 or #163

kespinola (Collaborator) commented Dec 5, 2023

Issues

  1. Inefficiency in Bootstrapping: The current backfiller struggles to bootstrap a fresh DAS installation effectively. During transaction processing, an overwhelming influx of backfill_items congests the table. This situation leads to a backlog, as the backfiller cannot catch up with the initial tree discoveries at the start of the run.

    SELECT COUNT(backfilled), backfilled FROM backfill_items GROUP BY backfilled;
      count  | backfilled
    ---------+------------
     2430983 | f
          45 | t
    
  2. Ineffective Subsequent Backfill Attempts: After the initial setup, the backfiller identifies transaction gaps by scanning blocks sequentially. This often fetches blocks containing no transactions relevant to the targeted tree, resulting in wasted processing and an inability to reach a completion state.

  3. Lack of Process Continuity: Currently, the backfiller lacks the capability to resume from its last state. If interrupted, it loses all progress and must restart from the beginning, further hindering the backfill completion.

Goals

  • Enable backfill tasks to resume from their last state, eliminating redundant work.
  • Optimize RPC queries to ensure data relevance and minimize discarded data.
  • Maximize task efficiency through concurrent or parallel execution.

Proposal

To address the inefficiencies and improve the continuity of the backfill process, we propose the following enhancements:

  1. Enhanced State Tracking for Trees: Develop a detailed recording system that captures every sequence change of a tree within the tree_sequences table. This table will serve as a ledger, logging each modification to the trees along with its associated metadata, such as the sequence number (seq), the public key of the tree (tree), the index of the leaf that triggered the change (leaf_idx), the transaction signature (signature), and the specific instruction identifier (instruction).

  2. Optimized RPC Queries: Utilize the getSignaturesForAddress RPC call with the before and until parameters to precisely target the transaction gaps for each tree. This ensures that only relevant data is fetched, reducing unnecessary processing (see the sketch after this list).

  3. Concurrent Processing: Introduce concurrent or parallel processing of backfill tasks to expedite completion of the backfill process (a sketch follows the proposal summary).

  4. Database Schema Update: Introduce a new table, tree_sequences, to track the sequence of changes for each tree. The schema for the tree_sequences table will be as follows:

CREATE TABLE tree_sequences (
    -- Composite primary key consisting of the sequence number and tree public key
    seq BIGINT NOT NULL,
    tree BYTEA NOT NULL,
    -- Index of the leaf causing the tree change
    leaf_idx BIGINT,
    -- Signature of the transaction associated with the instruction
    signature BYTEA NOT NULL,
    -- Identifier for the instruction
    instruction VARCHAR,
    PRIMARY KEY (seq, tree)
);
  5. Gap Identification and Filling: Perform a database query to identify sequence gaps for each tree, then call getSignaturesForAddress with the until and before parameters set to the known tree_sequences signatures bounding the gap, as shown in the sketch below. A tree is considered up to date when the active on-chain sequence number matches the count of rows in tree_sequences.
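To make items 2 and 5 concrete, here is a minimal sketch pairing the gap query with the bounded fetch. It assumes sqlx for Postgres access and a recent solana_client/solana_sdk; the fetch_gap_signatures helper and the exact query shape are illustrative assumptions, not the final implementation.

use solana_client::nonblocking::rpc_client::RpcClient;
use solana_client::rpc_client::GetConfirmedSignaturesForAddress2Config;
use solana_sdk::{pubkey::Pubkey, signature::Signature};
use sqlx::PgPool;

// Hypothetical sketch: find the sequence gaps for one tree, then fetch
// only the transactions inside each gap, bounded by the signatures of
// the tree_sequences rows that enclose it.
async fn fetch_gap_signatures(
    pool: &PgPool,
    rpc: &RpcClient,
    tree: &Pubkey,
) -> anyhow::Result<()> {
    // LEAD() pairs each row with the next seq for this tree; any jump
    // larger than 1 is a gap. The signatures of the two bounding rows
    // become the before/until arguments for getSignaturesForAddress.
    let gaps: Vec<(Vec<u8>, Vec<u8>)> = sqlx::query_as(
        r#"
        SELECT next_signature, signature
        FROM (
            SELECT seq, signature,
                   LEAD(seq) OVER (ORDER BY seq) AS next_seq,
                   LEAD(signature) OVER (ORDER BY seq) AS next_signature
            FROM tree_sequences
            WHERE tree = $1
        ) s
        WHERE next_seq - seq > 1
        "#,
    )
    .bind(tree.to_bytes().to_vec())
    .fetch_all(pool)
    .await?;

    for (before_bytes, until_bytes) in gaps {
        let config = GetConfirmedSignaturesForAddress2Config {
            // before: the signature just above the gap; until: just below.
            before: Some(Signature::try_from(before_bytes.as_slice())
                .map_err(|_| anyhow::anyhow!("bad signature bytes"))?),
            until: Some(Signature::try_from(until_bytes.as_slice())
                .map_err(|_| anyhow::anyhow!("bad signature bytes"))?),
            limit: None,
            commitment: None,
        };
        let signatures = rpc
            .get_signatures_for_address_with_config(tree, config)
            .await?;
        for sig_info in signatures {
            // Everything returned involves this tree and sits inside the
            // gap, so nothing is discarded; hand each signature to the
            // existing transaction pipeline (e.g. the TXNFILL stream).
            println!("backfill candidate: {}", sig_info.signature);
        }
    }
    Ok(())
}

The completion check from item 5 then reduces to comparing the on-chain sequence number with SELECT COUNT(*) FROM tree_sequences WHERE tree = $1.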

By implementing these enhancements, we aim to achieve a more efficient, focused, and resilient backfill process.
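As a sketch of item 3's concurrency, the per-tree work can be fanned out with a bounded number of in-flight tasks. This assumes tokio and the futures crate; backfill_tree is a hypothetical stand-in for the gap fill sketched above.

use futures::{stream, StreamExt};
use solana_sdk::pubkey::Pubkey;

// Hypothetical per-tree unit of work (e.g. the gap-fill sketch above).
async fn backfill_tree(tree: Pubkey) -> anyhow::Result<()> {
    // ... query gaps, fetch signatures, submit to the TXNFILL stream ...
    Ok(())
}

// Drive up to `concurrency` per-tree backfills at once. Trees are
// independent, so tasks never contend on the same tree_sequences rows.
async fn backfill_all(trees: Vec<Pubkey>, concurrency: usize) {
    stream::iter(trees)
        .map(|tree| async move {
            if let Err(e) = backfill_tree(tree).await {
                eprintln!("backfill failed for {tree}: {e:#}");
            }
        })
        .buffer_unordered(concurrency)
        .collect::<Vec<_>>()
        .await;
}

buffer_unordered keeps at most `concurrency` trees in flight while letting fast trees finish without waiting on slow ones.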

References

A Mermaid sequence diagram outlining the current backfill implementation, for those who are ramping up on the topic.

sequenceDiagram
    participant Filler as Backfiller Filler Task
    participant Finder as Backfiller Finder Task
    participant DB as Database
    participant RPC as RPC Service
    participant Messenger as Messaging Service

    note over Filler: Periodically checks for trees to backfill
    Filler->>DB: Request trees to backfill
    DB-->>Filler: Return trees
    loop For each tree
        Filler->>DB: Check backfill strategy (from seq 1 or plug gaps)
        DB-->>Filler: Strategy info
        alt If backfill_from_seq_1 is true
            note right of Filler: backfill_from_seq_1 flag is set based on the tree's backfill requirements, determined earlier in the workflow
            Filler->>RPC: Call backfill_tree_from_seq_1 (RPC queries for full backfill)
            RPC-->>Filler: Block data for full backfill
            Filler->>DB: Update backfill items from sequence 1
        else If backfill_from_seq_1 is false
            Filler->>RPC: Call fetch_and_plug_gaps (RPC queries for specific gaps)
            RPC-->>Filler: Block data for specific gaps
            loop For each gap
                Filler->>+DB: Plug gap with data
                loop For each block in gap
                    Filler->>RPC: Fetch block transactions
                    RPC-->>Filler: Transactions data
                    loop For each transaction
                        Filler->>Filler: Check if transaction involves backfill tree
                        alt If transaction involves backfill tree
                            Filler->>Messenger: Submit encoded transaction to TXNFILL stream
                        end
                    end
                end
                DB-->>-Filler: Gap plugged
            end
        end
        Filler->>Messenger: Submit transactions to backfill
    end
    alt On success
        Filler->>DB: Clean up backfilled tree
    else On error
        Filler->>DB: Mark tree as failed (after retries)
    end

    note over Finder: Periodically looks for missing trees
    Finder->>RPC: fetch_trees_by_gpa (Get trees by GPA)
    RPC-->>Finder: Trees data
    Finder->>DB: Request locked/failed trees for filtering
    DB-->>Finder: Locked/Failed trees data
    Finder->>DB: Request missing trees based on filtered data
    DB-->>Finder: Return missing trees
    loop For each missing tree
        Finder->>DB: Set tree to backfill from 0
    end
    note over Filler, Finder: Handle errors and retries appropriately
kespinola changed the title from [Feature] Tree Transaction Based Backfilling to [Feature] Tree Sequence Based Backfilling on Dec 21, 2023
NicolasPennie (Collaborator) commented:

This approach makes good sense. We have been experimenting with a new cl_audits schema that essentially corresponds to your tree_sequences table. If you can convert to this schema, then we can cooperate more easily on this feature.

Here is the schema:

#[derive(Clone, Debug, PartialEq, DeriveModel, DeriveActiveModel, Serialize, Deserialize)]
pub struct Model {
    pub id: i64,                  // surrogate primary key
    pub tree: Vec<u8>,            // tree public key (tree_sequences.tree)
    pub leaf_idx: i64,            // index of the changed leaf (tree_sequences.leaf_idx)
    pub seq: i64,                 // tree sequence number (tree_sequences.seq)
    pub created_at: DateTime,     // row creation timestamp
    pub tx: Vec<u8>,              // transaction signature (tree_sequences.signature)
    pub instruction: Instruction, // instruction that produced the change
}

Another thing to consider is building a snapshotting mechanism so that you never need to re-index from the beginning. The hard part is snapshot verification.

austbot (Collaborator) commented Dec 23, 2023


Looks pretty awesome.

My two cents: originally we stored every sequence for every modification, meaning we had multiple copies of any filled tree node in the seq table. As usage goes up, this grows tremendously.

Instead, we started to overwrite, so we contain at most two copies, although the database table allows more.

kespinola (Collaborator, Author) commented Dec 24, 2023


> Looks pretty awesome.
>
> My two cents: originally we stored every sequence for every modification, meaning we had multiple copies of any filled tree node in the seq table. As usage goes up, this grows tremendously.
>
> Instead, we started to overwrite, so we contain at most two copies, although the database table allows more.

Thanks @austbot for the review. I agree this approach increases the storage requirements. However, the information is valuable to folks: a getSignaturesForAddress method was added to the API to return signatures for a leaf_id, which you cannot do with the Solana RPC because it only handles account addresses. The query is currently powered by cl_audits, which writes a complete revision history of the change log state and was over-storing information; they have since switched to the schema in this PR. The storage increase feels justified since it helps with queries, not just backfilling.

If sharding is required, doing so on the tree won't create cross-shard interactions.

I'm a long-term advocate of dApp developers being able to run app-specific indexes that store a desired subset of the trees, so they pay for only what they need.
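For illustration, with the proposed schema the leaf-level lookup described above reduces to a single indexed query over tree_sequences. A minimal sketch assuming sqlx; the function name is an assumption:

// Hypothetical query behind a leaf-level getSignaturesForAddress: all
// signatures that touched one (tree, leaf_idx) pair, newest first.
async fn signatures_for_leaf(
    pool: &sqlx::PgPool,
    tree: &[u8],
    leaf_idx: i64,
) -> anyhow::Result<Vec<Vec<u8>>> {
    let rows: Vec<(Vec<u8>,)> = sqlx::query_as(
        "SELECT signature FROM tree_sequences \
         WHERE tree = $1 AND leaf_idx = $2 \
         ORDER BY seq DESC",
    )
    .bind(tree)
    .bind(leaf_idx)
    .fetch_all(pool)
    .await?;
    Ok(rows.into_iter().map(|(sig,)| sig).collect())
}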

NicolasPennie (Collaborator) commented Dec 24, 2023 via email
