
feat: rollup indexing #198

Draft
RequescoS wants to merge 20 commits into main

Conversation

@RequescoS RequescoS commented Jun 11, 2024

This PR aims to add indexing for the FinalizeTreeWithRoot instruction in DAS-API.

Notice: As we need to integrate this instruction into other packages (blockbuster, spl-compression, etc.), we are currently using local forks of these packages. In the future, these will be replaced by standard imports.

The FinalizeTreeWithRoot instruction has a unique characteristic that prevents it from being processed in the same way as other Bubblegum instructions. It represents a batch mint action, and the potential size of this batch can reach many millions of assets. Processing this instruction inline with others could block them in the queue until the FinalizeTreeWithRoot processing is complete, which can take a considerable amount of time.

To address this issue, we decided to create a separate queue for processing rollups, similar to our existing task processing system. When we receive a FinalizeTreeWithRoot instruction update, we add a new rollup to the queue, which is then processed in a separate process.

To represent the rollups queue, we created the rollup_to_verify table in Postgres, while downloaded rollups are stored in the rollup table. We store downloaded rollups for several reasons (a rough sketch of both row shapes follows this list):

  1. To avoid downloading the same rollup multiple times if it is received from several instructions, which is possible.
  2. In the future, DAS-API providers will be able to create rollups instead of users, and all providers will send some FinalizeTreeWithRoot transactions to the Solana network. At this step, they will store the rollup in the database and use it instead of downloading it from an external link.
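
Here is that sketch, with the rows as plain Rust structs. The struct names are illustrative; the field names come from this description and the code under review, and the two enums are sketched after the next paragraph.

struct RollupToVerify {
    url: String,                                    // external link from the transaction
    rollup_persisting_state: RollupPersistingState, // FSM state, see below
    rollup_fail_status: Option<RollupFailStatus>,   // set only in FailedToPersist
}

struct StoredRollup {
    url: String,                    // same key, so re-received instructions hit the cache
    rollup_binary_bincode: Vec<u8>, // the downloaded rollup, bincode-encoded
}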

For rollup processing, we implemented a finite state machine (FSM) with the following states: ReceivedTransaction, SuccessfullyDownload, SuccessfullyValidate, StoredUpdate, and FailedToPersist. StoredUpdate and FailedToPersist are the final states, representing successful and unsuccessful processing cases, respectively. If we encounter a FailedToPersist state, the rollup_fail_status column in the rollup_to_verify table will store an enum representing the reason for the failure. Possible values of this enum include: ChecksumVerifyFailed (hash calculated during processing and hash received from the transaction are different), DownloadFailed, FileSerialization (invalid JSON), and RollupVerifyFailed (invalid tree root, leaf pubkey, etc.).
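
As a minimal sketch (derive macros omitted, and the enum names themselves are assumptions based on the column names), the two enums described above map to:

pub enum RollupPersistingState {
    ReceivedTransaction,
    SuccessfullyDownload,
    SuccessfullyValidate,
    StoredUpdate,    // final: success
    FailedToPersist, // final: failure, reason recorded in rollup_fail_status
}

pub enum RollupFailStatus {
    ChecksumVerifyFailed, // processed hash differs from the hash in the transaction
    DownloadFailed,
    FileSerialization,    // invalid JSON
    RollupVerifyFailed,   // invalid tree root, leaf pubkey, etc.
}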

The initial state for a rollup is ReceivedTransaction. If the rollup is already stored in the database, we retrieve it, cast it to the Rollup structure in the code, and move to the next state. Otherwise, we need to download it using the URL received from the transaction.
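
A hedged sketch of that first step (the helper names fetch_stored_rollup and download_file are illustrative, not the PR's actual API):

async fn obtain_rollup(&self, url: &str) -> Result<Rollup, RollupFailStatus> {
    // Reuse a rollup that is already persisted in the `rollup` table...
    if let Some(row) = self.fetch_stored_rollup(url).await {
        return bincode::deserialize(&row.rollup_binary_bincode)
            .map_err(|_| RollupFailStatus::FileSerialization);
    }
    // ...otherwise download it from the URL carried by the transaction.
    let bytes = self
        .download_file(url)
        .await
        .map_err(|_| RollupFailStatus::DownloadFailed)?;
    serde_json::from_slice(&bytes).map_err(|_| RollupFailStatus::FileSerialization)
}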

The next state is SuccessfullyDownload. Here, we need to validate that the rollup contains a valid tree. For this purpose, we use the ITree trait, which abstracts ConcurrentMerkleTree<DEPTH, BUF_SIZE> regardless of the const generic parameters. This abstraction is necessary for working comfortably with ConcurrentMerkleTree without causing stack overflow (detailed reasons are described in the code comments in the merkle_tree_wrapper.rs file). If validation completes successfully, the rollup transitions to the SuccessfullyValidate state.
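
The following self-contained sketch shows the type-erasure idea only; the trait and method names are assumptions, Tree stands in for ConcurrentMerkleTree<DEPTH, BUF_SIZE>, and the actual stack-overflow reasoning lives in the merkle_tree_wrapper.rs comments mentioned above.

struct Tree<const DEPTH: usize, const BUF: usize> {
    roots: [[u8; 32]; BUF], // stand-in for the change-log ring buffer
}

trait ITree {
    fn depth(&self) -> usize;
    fn root(&self) -> [u8; 32];
}

impl<const DEPTH: usize, const BUF: usize> ITree for Tree<DEPTH, BUF> {
    fn depth(&self) -> usize { DEPTH }
    fn root(&self) -> [u8; 32] { self.roots[0] }
}

// Runtime (depth, buffer) values are mapped to concrete const-generic
// instantiations exactly once; everything downstream holds a Box<dyn ITree>
// and never repeats the combinatorial match.
fn make_tree(depth: u32, buffer: u32) -> Option<Box<dyn ITree>> {
    match (depth, buffer) {
        (14, 64) => Some(Box::new(Tree::<14, 64> { roots: [[0; 32]; 64] })),
        (20, 256) => Some(Box::new(Tree::<20, 256> { roots: [[0; 32]; 256] })),
        _ => None,
    }
}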

If the rollup is in the SuccessfullyValidate state, we can process all the assets inside it by iterating over them and calling the mint_v1 instruction handler. Once all assets are processed, the rollup transitions to the final StoredUpdate state. If a failure occurs at any step, the rollup enters the FailedToPersist state. Any step may be retried.
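
Condensed, the whole FSM driver has roughly this shape (a sketch that mirrors the loop/match quoted later in this review):

loop {
    match &rollup_to_verify.rollup_persisting_state {
        RollupPersistingState::ReceivedTransaction => {
            // download the file, or load it from the `rollup` table
        }
        RollupPersistingState::SuccessfullyDownload => {
            // verify the checksum and the merkle tree
        }
        RollupPersistingState::SuccessfullyValidate => {
            // call the mint_v1 handler for every asset in the rollup
        }
        RollupPersistingState::StoredUpdate | RollupPersistingState::FailedToPersist => {
            // final states: drop the entry from the queue and stop
            break;
        }
    }
}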

The persist_rollups process runs as a single worker in the nft_ingester/src/main.rs file. It could be run with multiple workers, but processing is very RAM-intensive (a single rollup may be many GB in size).
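
Hypothetically, the wiring in nft_ingester/src/main.rs amounts to something like the fragment below (variable and type names assumed):

// One dedicated task rather than a worker pool: a single in-flight rollup
// can hold many GB of RAM while it is downloaded and verified.
let persister = std::sync::Arc::new(rollup_persister);
tasks.push(tokio::spawn(async move { persister.persist_rollups().await }));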

@RequescoS RequescoS marked this pull request as draft June 11, 2024 15:41
@RequescoS RequescoS requested a review from danenbm June 12, 2024 06:24
@RequescoS RequescoS marked this pull request as ready for review June 12, 2024 06:45
@RequescoS RequescoS marked this pull request as draft June 12, 2024 06:49
Comment on lines 1 to 4
[submodule "mpl-bubblegum"]
path = mpl-bubblegum
url = git@github.com:n00m4d/mpl-bubblegum.git
branch = feat/rollup
Contributor

Assuming this is all just related to these packages not actually being updated yet.

Contributor

Assuming this move is temporary until packages are updated, just so you could add the submodule at the top blockbuster level

Comment on lines 290 to 291
for (i, item) in arr.iter_mut().enumerate().take(AE_CIPHERTEXT_LEN) {
*item = seq
for i in 0..AE_CIPHERTEXT_LEN {
arr[i] = seq
Contributor

I actually got a lint warning from this saying that if you are indexing arr with i you should consider using enumerate(). I think this change might actually be undoing a previous clippy fix?

#[allow(clippy::large_enum_variant)]
pub enum TokenExtensionsProgramAccount {
Contributor

Same here; I think this was added to fix a clippy issue and probably shouldn't be removed. Maybe the PR change was based on a slightly older version of the files?

Author

Yeap, you are right; it seems the PR change was based on a slightly older version of the files. Thank you.

Comment on lines 3 to 4
#![allow(unused_imports)]

Contributor

I think this should be left, since I believe it was manually added to fix a clippy error.

Comment on lines +553 to +568
bubblegum::mint_v1::mint_v1(
&rolled_mint.into(),
&InstructionBundle {
txn_id: &signature,
program: Default::default(),
instruction: None,
inner_ix: None,
keys: &[],
slot,
},
txn,
"CreateTreeWithRoot",
cl_audits,
)
.await?;
}
Contributor

So what happens if all these rolled mints go to mint_v1, but there's one missing? Seems like that shouldn't happen, because the tree is fully validated down to the merkle root.

But if it were ever possible, I think it would result in a missing entry in the cl_audits table, which would get detected by the backfilling process, which would try to get the missing transactions, but they wouldn't exist. So maybe we want to prevent all rollups from being entered in the cl_audits table and just hardcode this cl_audits bool to false?

But then again the cl_audits table might also be used for auditing purposes other than gap filling, so maybe we want to put entries? Probably needs some additional discussion.

Overall the state management is good so looks like you would have RollupPersistingState::FailedToPersist as an indicator that you need to retry or backfill rollups.

Author

@RequescoS RequescoS Jun 27, 2024

Hmm, for now I do not think it is possible to miss an asset inside a rollup, because of the validation. And if we find some setup that creates a gap, we can just add an additional validation check. But for the sake of a straightforward flow it may be convenient to set cl_audits == false, wdyt?


pub async fn persist_rollups(&self) {
loop {
tokio::time::sleep(Duration::from_secs(5)).await;
Contributor

Have you considered doing a trigger/notification when a row is added to the rollup_to_verify table? That would probably work better than a hardcoded delay, both when there are no rollups and when there are spikes of multiple new entries.
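
For illustration, a LISTEN/NOTIFY consumer with sqlx could look like this; the channel name is an assumption, and an AFTER INSERT trigger on rollup_to_verify would issue the NOTIFY:

use sqlx::postgres::PgListener;

async fn wait_for_rollups(db_url: &str) -> Result<(), sqlx::Error> {
    let mut listener = PgListener::connect(db_url).await?;
    listener.listen("rollup_added").await?; // assumed channel name
    loop {
        let notification = listener.recv().await?;
        // Wakes immediately on insert instead of sleeping 5s between polls;
        // the payload could carry the new row's key.
        println!("new rollup queued: {}", notification.payload());
    }
}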

Comment on lines +202 to +208
let Ok(rollup) = rollup
.map(|r| bincode::deserialize::<Rollup>(r.rollup_binary_bincode.as_slice()))
.transpose()
.map_err(|e| ProgramTransformerError::DeserializationError(e.to_string()))
else {
continue;
};
Contributor

In the case of an error, will it keep trying to deserialize this rollup on subsequent iterations of the loop, because it is stuck in RollupPersistingState::ReceivedTransaction and never deleted from the table?

Author

Great catch, thank you!

Comment on lines +220 to +221
loop {
match &rollup_to_verify.rollup_persisting_state {
Contributor

Nice state machine!

Author

Thanks :)

Comment on lines +254 to +261
if let Err(e) = self.drop_rollup_from_queue(&rollup_to_verify).await {
error!("failed to drop rollup from queue: {}", e);
};
info!(
"Finish processing {} rollup file with {:?} state",
&rollup_to_verify.url, &rollup_to_verify.rollup_persisting_state
);
return;
Contributor

Just a note: when we bring this to the wider RPC community, we'll probably want to add some specific metrics based on their ops preferences as well.

Comment on lines +96 to +101
solana-account-decoder = "1.18.11"
solana-client = "1.18.11"
solana-geyser-plugin-interface = "1.18.11"
solana-program = "1.18.11"
solana-sdk = "1.18.11"
solana-transaction-status = "1.18.11"
Contributor

Not sure the RPC community is ready to update to 1.18, FYI.

Author

The updated smart contracts migrated to this Solana version, so we must use it here, or find some workaround to avoid conflicts in import versions. Or maybe we can talk with the community and give notice that updating the Solana version will be necessary in the near future?

Comment on lines +222 to +225
&RollupPersistingState::ReceivedTransaction => {
if let Err(err) = self
.download_rollup(&mut rollup_to_verify, &mut rollup)
.await
Contributor

I think with most DAS processing there is an expectation that there will eventually be multiple concurrent processors of the data in order to handle volume. In that case, there would be multiple consumers of the same rollup_to_verify table row, and it seems like we would need to lock an entry, or have some other means to prevent a race condition where multiple processors both download the rollup and then advance it to the next state. The background task processing has a lock state, which may not be perfect. I may be misunderstanding, but it seems like this processing also needs a plan for locking.
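
One common Postgres pattern for this is to claim a row with FOR UPDATE SKIP LOCKED inside a transaction. A sketch (table and column names are from this PR, except created_at and the state literal, which are assumptions):

async fn claim_next_rollup(pool: &sqlx::PgPool) -> Result<Option<String>, sqlx::Error> {
    let mut tx = pool.begin().await?;
    // Only one worker can claim a given row; others skip it instead of blocking.
    let url: Option<String> = sqlx::query_scalar(
        "SELECT url FROM rollup_to_verify \
         WHERE rollup_persisting_state = 'received_transaction' \
         ORDER BY created_at LIMIT 1 \
         FOR UPDATE SKIP LOCKED",
    )
    .fetch_optional(&mut *tx)
    .await?;
    // ...download/validate/persist while the row lock is held, then:
    tx.commit().await?;
    Ok(url)
}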
