

wip: DataApi trait for querying/inserting data #29

Closed
wants to merge 1 commit

Conversation


@joshieDo joshieDo commented Oct 10, 2022

WIP

Ideally, mdbx, memory_db and static would have a common, well-defined API for querying data. Is this too naive an approach?
All 3 elements would implement the same DataApi trait.

Other crates (RPC, pool, executor, etc.) request data through an ApiManager (which implements the same trait) without knowing where the data actually lives. The ApiManager chooses the best course of action based on the parameters (e.g. if no block is passed, it's the tip, so it queries mdbx).


If we all agree this is the way to go, then I'd ask that you also add whatever API you need here.

This also helps with the schema design: knowing what's really needed in the DB/indexes, and how we can merge/structure data.
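To make that concrete, here is a minimal sketch of the idea (hypothetical names, placeholder types and an assumed routing rule; the actual trait in this PR has many more methods):

```rust
// Placeholders standing in for the PR's real types.
type BlockNumber = u64;
pub struct Header;
pub struct ApiError;

// A trimmed-down stand-in for the proposed DataApi trait.
pub trait DataApi {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError>;
}

// The ApiManager implements the same trait and routes each request to the
// store that actually holds the data, so callers (RPC, pool, executor, ...)
// never need to know where the data lives.
pub struct ApiManager<Hot: DataApi, Frozen: DataApi> {
    mdbx: Hot,
    static_files: Frozen,
    /// First block still kept in mdbx; everything older lives in static files.
    /// (This routing rule is an assumption for illustration only.)
    first_hot_block: BlockNumber,
}

impl<Hot: DataApi, Frozen: DataApi> DataApi for ApiManager<Hot, Frozen> {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError> {
        if number >= self.first_hot_block {
            self.mdbx.read_header(number)
        } else {
            self.static_files.read_header(number)
        }
    }
}
```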


@onbjerg onbjerg left a comment


I think having functions like this for every type and/or table will make the data model very inflexible/hard to change.

IMO we should look at just having some interface for reading, writing, and iterating over types that implement e.g. Encode/Decode.

Obviously they would need some sort of identifier, like a table ID.

The abstraction layer here would just mostly decide where to read and write the data to, and how. That way the surface area for the databases is small, and the surface area for types being written/read from the database is also small (just impl Encode/Decode).

We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.
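Something in this direction, as a hypothetical sketch (Table, DbRead and DbWrite are invented names; Encode, Decode and ApiError are assumed to be the types already in this PR):

```rust
// Each table declares its key/value types once; the database traits are
// generic over any such table instead of having one method per concrete type.
// Assumes the PR's Encode, Decode and ApiError are in scope.
pub trait Table {
    const NAME: &'static str; // the "table ID"
    type Key: Encode + Decode;
    type Value: Encode + Decode;
}

pub trait DbRead {
    fn get<T: Table>(&self, key: T::Key) -> Result<Option<T::Value>, ApiError>;
    // Returning a Vec keeps the sketch simple; a real design would hand out a cursor.
    fn iter<T: Table>(&self) -> Result<Vec<(T::Key, T::Value)>, ApiError>;
}

pub trait DbWrite {
    fn put<T: Table>(&mut self, key: T::Key, value: T::Value) -> Result<(), ApiError>;
}
```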


joshieDo commented Oct 10, 2022

> I think having functions like this for every type and/or table will make the data model very inflexible/hard to change.

This layer is above that, though. If I want an Account at block X, then db, memdb and staticfile should all be able to provide it. And if that's the case, don't they share that trait? Each one would need to piece together the Account and return it.

> The abstraction layer here would just mostly decide where to read and write the data to and how. That way the surface area for the databases is small, and the surface area for types being written/read from the database is also small.
> We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.

That's what I'm going for here. Why do you think that's not what's happening?


onbjerg commented Oct 10, 2022

My main point is that it should just be a trait that is generic over anything that is Encode/Decode. The way it is now, if we add a new database type, we have to implement all of these functions for the new database, right?

If it were more generic, it would be easier to implement for new databases/stores (the extreme case would be 2 functions, although that is unlikely).

If we want to add a new table/type, we need to add new functions to this trait + implement other traits for the type etc. + specify the table in the tables file and so on. If it were generic, this overhead (I think) would be lowered, since it's 1 less thing to implement.

Although I guess there is something to be said for per-store codecs... I would imagine memdb and staticfile would represent their data differently, maybe one is compressed and the other is not. Not entirely sure how that would be solved; maybe some wrapper types that specify how the data should ideally be encoded? See e.g. the Json<T>/Msgpack<T> stuff here: https://docs.rs/kv/latest/kv/

Edit: Another example: this one is a fully typed "high level" (not that much) wrapper around LMDB (which MDBX is based on): https://docs.rs/heed/latest/heed/. This one also sort of uses wrapper types (e.g. OwnedType); these are defined in https://docs.rs/heed-types/0.8.0/heed_types/index.html
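For illustration, the wrapper-type idea could look something like this (a hypothetical sketch; Plain, Compressed and the simplified Encode trait are assumptions, not what kv or heed actually define):

```rust
// Simplified codec trait for this sketch (the PR's Encode uses an associated
// Encoded type instead of raw bytes).
pub trait Encode {
    fn encode(&self) -> Vec<u8>;
}

/// Stored using the type's canonical encoding, e.g. for mdbx.
pub struct Plain<T>(pub T);
/// Stored compressed, e.g. how a static-file store might want its data.
pub struct Compressed<T>(pub T);

impl<T: Encode> Encode for Plain<T> {
    fn encode(&self) -> Vec<u8> {
        self.0.encode()
    }
}

impl<T: Encode> Encode for Compressed<T> {
    fn encode(&self) -> Vec<u8> {
        // Placeholder: a real store would plug in lz4/zstd/etc. here.
        compress(&self.0.encode())
    }
}

fn compress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec() // identity stand-in for an actual compression routine
}
```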


joshieDo commented Oct 10, 2022

> if we add a new database type we have to implement all of these functions for the new database, right?

Yes; as suggested on Discord, this trait could be split into smaller ones. libmdbx would probably implement all of them, while static would only implement the reading and appending ones.

> If we want to add a new table/type, we need to add new functions to this trait + implement other traits for the type etc. + specify the table in the tables file and so on.

It really depends on the kind of type and its purpose. If it's an index, you probably wouldn't need to, since it would be abstracted away anyway.

What I'm having trouble grasping at the moment is that much of the data doesn't have a straight encode <-> decode mapping. It has to be constructed from different places/indexes/tables/files depending on the store's specifics. Getting a Block, even without transactions, means hitting different tables (BlockBodies, Headers, Transactions, etc.) to stitch it up. Calling .write_blocks([Block, ..]) would hit multiple tables on mdbx, while static would most likely have to do some fancy compression + hash table work.

However, the type that RPC/executor/stages expect is supposed to be the same. If we don't have a common API (even if split among smaller traits), won't it just get kind of messy?

As for how each piece of data should ultimately be encoded before writing to disk, this is what I was going for: https://github.com/foundry-rs/reth/blob/joshie/kv_mdbx/crates/db/src/kv/table.rs. I'll take a look at those, thanks!

I feel like I'm misunderstanding your point. I'll think more about it.

Block example from erigon:
https://github.com/ledgerwatch/erigon/blob/6815504462235f12624afccb9e20a92a08580b0f/core/rawdb/accessors_chain.go#L1112
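In Rust terms, a hypothetical sketch of that stitching for an mdbx-backed store might look like this (MdbxStore, the helper methods and the placeholder types are all assumptions for illustration):

```rust
// Placeholders for the PR's real types and an mdbx-backed store.
type BlockNumber = u64;
struct Header;
struct BlockBody;
struct Transaction;
struct Block { header: Header, transactions: Vec<Transaction> }
struct ApiError;
struct MdbxStore;

impl MdbxStore {
    fn read_block(&self, number: BlockNumber) -> Result<Block, ApiError> {
        // A full Block is stitched together from several tables rather than
        // decoded from a single value.
        let header = self.read_header(number)?;            // Headers table
        let body = self.read_block_body(number)?;          // BlockBodies table
        let transactions = self.read_transactions(&body)?; // Transactions table
        Ok(Block { header, transactions })
    }

    // Per-table reads, stubbed out in this sketch.
    fn read_header(&self, _n: BlockNumber) -> Result<Header, ApiError> { Ok(Header) }
    fn read_block_body(&self, _n: BlockNumber) -> Result<BlockBody, ApiError> { Ok(BlockBody) }
    fn read_transactions(&self, _b: &BlockBody) -> Result<Vec<Transaction>, ApiError> { Ok(vec![]) }
}
```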

fn write_total_difficulty(&self, difficulty: U256) -> Result<(), ApiError>;
fn truncate_total_difficulty(&self, from: BlockNumber) -> Result<(), ApiError>;

fn read_receipt(&self, number: Option<TxNumber>) -> Result<Receipt, ApiError>;
Member


Let's call this TxIndex instead of number

Suggested change
fn read_receipt(&self, number: Option<TxNumber>) -> Result<Receipt, ApiError>;
fn read_receipt(&self, number: Option<TxIndex>) -> Result<Receipt, ApiError>;

Comment on lines +36 to +37
fn move_to_canonical_txes(&self) -> Result<(), ApiError>;
fn move_to_non_canonical_txes(&self) -> Result<(), ApiError>;
Member


What do you envision this doing?


fn read_canonical_hash(&self, number: u64) -> Result<H256, ApiError>;
fn write_total_difficulty(&self, difficulty: U256) -> Result<(), ApiError>;
fn truncate_total_difficulty(&self, from: BlockNumber) -> Result<(), ApiError>;
Member


What does this do?

fn read_account_code(&self, address: Address, tx_number: Option<TxNumber>) -> Result<Bytes, ApiError>;
fn read_account_code_size(&self, address: Address, tx_number: Option<TxNumber>) -> Result<u64, ApiError>;
fn read_account_storage(&self, address: Address, tx_number: Option<TxNumber>, key: H256) -> Result<Bytes, ApiError>;
fn read_account_incarnation(&self, address: Address, tx_number: Option<TxNumber>) -> Result<u64, ApiError>;
Member


Should incarnations be part of the struct Account or should they be in a separate table? I can imagine in Ethereum we'd want to make Incarnations a separate table and eventually move it out of MDBX, because SELFDESTRUCT is going away. Might save some space, but probably trivial?

(same message sent on Discord)

Collaborator Author


Yeah, leaning towards having it in a different table, tbh.

Comment on lines +13 to +15
/// Common API for querying and writing data.
#[rustfmt::skip]
pub trait DataApi {
Member


Agreed on splitting this into smaller, logically coupled traits which get implemented by each in-memory type.

One idea is that we could define trait HeaderReader / trait HeaderWriter, and do the same for blocks, transactions, and accounts.
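For example (hypothetical signatures, assuming the Header, BlockNumber and ApiError types from this PR are in scope):

```rust
// Each store implements only the pieces it supports, e.g. a static-file store
// could implement HeaderReader and an append-style writer, but not random writes.
pub trait HeaderReader {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError>;
}

pub trait HeaderWriter {
    fn write_header(&self, number: BlockNumber, header: Header) -> Result<(), ApiError>;
}

// The same split would repeat for blocks, transactions, and accounts.
```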

Comment on lines +155 to +163
fn encode(self) -> Self::Encoded {
todo!();
}
}

impl Decode for Vec<U256> {
fn decode(_value: &[u8]) -> Result<Self, KVError> {
todo!();
}
Member


I'd do

impl<T: Encode> Encode for Vec<T> {
    // fn encode:
    // serialize the length,
    // then call `encode` on each item in the vec
}

and the same for decode: read the len from the first number, then decode each element in 0..len.

Not sure if you could serialize length as u8/u16/u32 etc.
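A minimal sketch of that idea as a length-prefixed codec (the trait shapes below are simplified assumptions; the PR's actual Encode/Decode use an associated Encoded type):

```rust
#[derive(Debug)]
pub struct KVError;

pub trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

pub trait Decode: Sized {
    fn decode(buf: &mut &[u8]) -> Result<Self, KVError>;
}

impl<T: Encode> Encode for Vec<T> {
    fn encode(&self, out: &mut Vec<u8>) {
        // Prefix with the element count as a u32, then encode each element.
        out.extend_from_slice(&(self.len() as u32).to_be_bytes());
        for item in self {
            item.encode(out);
        }
    }
}

impl<T: Decode> Decode for Vec<T> {
    fn decode(buf: &mut &[u8]) -> Result<Self, KVError> {
        // Read the length prefix, then decode that many elements.
        if buf.len() < 4 {
            return Err(KVError);
        }
        let (len_bytes, rest) = buf.split_at(4);
        *buf = rest;
        let len = u32::from_be_bytes(len_bytes.try_into().map_err(|_| KVError)?) as usize;
        (0..len).map(|_| T::decode(buf)).collect()
    }
}
```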


@joshieDo joshieDo Oct 11, 2022


Ideally, the codec we choose should do that?

Comment on lines 101 to +102
#[allow(non_camel_case_types)]
type BlockNumber_BlockHash_TxId = Vec<u8>;
type BlockNumber_BlockHash_TxNumber = Vec<u8>;
Member


A bit of a weird typedef here; is this 3 things?

Collaborator Author


I wanted some representation of a concatenation of (BlockNumber, BlockHash, TxNumber); open to changing it 🤷‍♂️
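One possible alternative is a dedicated key struct whose byte encoding is that concatenation (a hypothetical sketch; the field types assume BlockNumber/TxNumber are u64 and BlockHash is 32 bytes):

```rust
/// Composite table key replacing the `Vec<u8>` typedef.
pub struct BlockNumberHashTxNumber {
    pub number: u64,
    pub hash: [u8; 32],
    pub tx_number: u64,
}

impl BlockNumberHashTxNumber {
    /// Big-endian concatenation so keys sort by block number first.
    pub fn to_key_bytes(&self) -> Vec<u8> {
        let mut key = Vec::with_capacity(8 + 32 + 8);
        key.extend_from_slice(&self.number.to_be_bytes());
        key.extend_from_slice(&self.hash);
        key.extend_from_slice(&self.tx_number.to_be_bytes());
        key
    }
}
```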


gakonst commented Oct 10, 2022

> Although I guess there is something to be said for per-store codecs... I would imagine memdb and staticfile would represent their data differently, maybe one is compressed and the other is not.

This is a great point, @onbjerg, and I think we'd want to maintain the optionality.

> We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.

I think it's likely our abstraction is leaky and requires that in-mem test dbs will need to implement some transaction-related logic. Do we consider that acceptable?

/// Returns the lowest and highest block number the data structure contains.
fn read_block_number_range(&self) -> Result<(BlockNumber, BlockNumber), ApiError>;
fn read_block_latest(&self) -> Result<Block, ApiError>;
fn read_block(&self, number: Option<BlockNumber>, hash: Option<BlockHash>) -> Result<Block, ApiError>;
Collaborator


Nit: maybe have an enum BlockId here, with variants Number and Hash?

Member


Yes, for sure, let's use the enum type instead of Option.
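A minimal sketch of that enum (hypothetical; assumes the PR's BlockNumber and BlockHash aliases are in scope):

```rust
/// Identifies a block by either its number or its hash, replacing the two Options.
pub enum BlockId {
    Number(BlockNumber),
    Hash(BlockHash),
}

// The reader then takes a single, unambiguous argument:
// fn read_block(&self, id: BlockId) -> Result<Block, ApiError>;
```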


rakita commented Oct 10, 2022

I don't see functions that give us a Cursor; is it needed for ETL?

Edit: additionally, transactions are created and cleared (committed); see this comment: https://github.com/foundry-rs/reth/blob/c41c6b99a66bdb662602d0941f353984cd37a052/crates/db/src/kv/mod.rs#L92-L93
We probably need some function that does that, especially for write transactions, assuming that open-write-close is not the most optimal pattern, and open, write x times, close is better.
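For illustration, a cursor abstraction on top of such a trait could look roughly like this (hypothetical names; assumes the PR's ApiError is in scope; not part of this PR):

```rust
// Iterates a table in key order, e.g. for ETL-style bulk reads.
pub trait Cursor<K, V> {
    /// Position the cursor at the first entry with key >= `key` and return it.
    fn seek(&mut self, key: &K) -> Result<Option<(K, V)>, ApiError>;
    /// Advance to the next entry, returning None once the table is exhausted.
    fn next(&mut self) -> Result<Option<(K, V)>, ApiError>;
}
```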


onbjerg commented Oct 11, 2022

> I think it's likely our abstraction is leaky and requires that in-mem test dbs will need to implement some transaction-related logic. Do we consider that acceptable?

I don't think we have any other option. If we want to have control over how things are committed to the database, which is essential to staged sync, then we need transactions. Otherwise how would e.g. an MDBX impl know when to commit?
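For example, a minimal write-transaction abstraction might look like this (a hypothetical sketch; assumes the PR's ApiError is in scope; an in-memory test DB could make commit a no-op):

```rust
// Batches several writes and makes them durable in a single commit, which is
// the control point staged sync needs.
pub trait DbTxMut {
    fn put(&mut self, table: &str, key: &[u8], value: &[u8]) -> Result<(), ApiError>;
    /// Flush all buffered writes; a no-op for an in-memory test database.
    fn commit(self) -> Result<(), ApiError>;
}
```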


onbjerg commented Oct 11, 2022

Re: common types/API etc., I still don't think having a read_block, read_account, read_xyz is the way to go. So much implementation overhead, and to be honest, I don't think much of the node will deal with a full block anyway, although I might be wrong here: the stages process headers and bodies separately, and transactions separately from that; the downloader does the same; the RPC does the same most of the time (except for e.g. eth_getBlock). Most of these aggregate types are just not super necessary if they are the main reason the abstractions become rigid, IMO.

The main difference in how data is treated between the different databases is how the data is indexed and colocated. Indexes are a function of table keys, and indexes largely also determine how data is colocated, so I'd assume a more generic abstraction is possible.


gakonst commented Oct 11, 2022

> I don't think we have any other option. If we want to have control over how things are committed to the database, which is essential to staged sync, then we need transactions. Otherwise how would e.g. an MDBX impl know when to commit?

Yep. OK sounds good to me. I'm good with that.

> The main difference in how data is treated between the different databases is how the data is indexed and colocated. Indexes are a function of table keys, and indexes largely also determine how data is colocated, so I'd assume a more generic abstraction is possible.

Let's have a call tomorrow with @joshieDo and iron this out?

@mattsse mattsse mentioned this pull request Oct 11, 2022
@onbjerg onbjerg added the C-enhancement (New feature or request) and A-db (Related to the database) labels Oct 11, 2022
@joshieDo
Collaborator Author

I think we can close this one.

Added it to the tracking issue.

@joshieDo joshieDo mentioned this pull request Oct 14, 2022
@gakonst gakonst closed this Oct 14, 2022
@gakonst gakonst deleted the joshie/api branch October 14, 2022 18:14
yutianwu pushed a commit to yutianwu/reth that referenced this pull request Jun 13, 2024