

wip: DataApi trait for querying/inserting data #29

Closed
wants to merge 1 commit

Conversation


@joshieDo joshieDo commented Oct 10, 2022

WIP

Ideally, mdbx, memory_db and static would have a common, well-defined API for querying data. Is this too naive an approach?
All 3 elements would implement the same DataApi trait.

Other crates (RPC, pool, executor, etc.) request data through an ApiManager (which implements the same trait) without knowing where the data actually lives. The ApiManager chooses the best course of action based on the parameters (e.g. if no block is passed, it's the tip, so it queries mdbx).


If we all agree this is the way to go, then I'd ask that you also add whatever API you need here.

This also helps with the schema design: knowing what's really needed in the DB/indexes, and how we can merge/structure data.
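To make that concrete, here is a minimal sketch of the idea (hypothetical names, placeholder types and an assumed routing rule; the actual trait in this PR has many more methods):

```rust
// Placeholders standing in for the PR's real types.
type BlockNumber = u64;
pub struct Header;
pub struct ApiError;

// A trimmed-down stand-in for the proposed DataApi trait.
pub trait DataApi {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError>;
}

// The ApiManager implements the same trait and routes each request to the
// store that actually holds the data, so callers (RPC, pool, executor, ...)
// never need to know where the data lives.
pub struct ApiManager<Hot: DataApi, Frozen: DataApi> {
    mdbx: Hot,
    static_files: Frozen,
    /// First block still kept in mdbx; everything older lives in static files.
    /// (This routing rule is an assumption for illustration only.)
    first_hot_block: BlockNumber,
}

impl<Hot: DataApi, Frozen: DataApi> DataApi for ApiManager<Hot, Frozen> {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError> {
        if number >= self.first_hot_block {
            self.mdbx.read_header(number)
        } else {
            self.static_files.read_header(number)
        }
    }
}
```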


@onbjerg onbjerg left a comment


I think having functions like this for every type and/or table will make the data model very inflexible/hard to change.

IMO we should look at just having some interface for reading, writing, and iterating over types that implement e.g. Encode/Decode.

Obviously they would need some sort of identifier, like a table ID.

The abstraction layer here would just mostly decide where to read and write the data to, and how. That way the surface area for the databases is small, and the surface area for types being written/read from the database is also small (just impl Encode/Decode).

We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.
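Something in this direction, as a hypothetical sketch (Table, DbRead and DbWrite are invented names; Encode, Decode and ApiError are assumed to be the types already in this PR):

```rust
// Each table declares its key/value types once; the database traits are
// generic over any such table instead of having one method per concrete type.
// Assumes the PR's Encode, Decode and ApiError are in scope.
pub trait Table {
    const NAME: &'static str; // the "table ID"
    type Key: Encode + Decode;
    type Value: Encode + Decode;
}

pub trait DbRead {
    fn get<T: Table>(&self, key: T::Key) -> Result<Option<T::Value>, ApiError>;
    // Returning a Vec keeps the sketch simple; a real design would hand out a cursor.
    fn iter<T: Table>(&self) -> Result<Vec<(T::Key, T::Value)>, ApiError>;
}

pub trait DbWrite {
    fn put<T: Table>(&mut self, key: T::Key, value: T::Value) -> Result<(), ApiError>;
}
```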


joshieDo commented Oct 10, 2022

> I think having functions like this for every type and/or table will make the data model very inflexible/hard to change.

This layer is above that, though. If I want an Account at block X, then db, memdb and staticfile should all be able to provide it. And if that's the case, don't they share that trait? Each one would need to piece together the Account and return it.

> The abstraction layer here would just mostly decide where to read and write the data to and how. That way the surface area for the databases is small, and the surface area for types being written/read from the database is also small.
> We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.

That's what I'm going for here. Why do you think that's not what's happening?


onbjerg commented Oct 10, 2022

My main point is that it should just be a trait that is generic over anything that is Encode/Decode. The way it is now, if we add a new database type, we have to implement all of these functions for the new database, right?

If it were more generic, it would be easier to implement for new databases/stores (the extreme case would be 2 functions, although that is unlikely).

If we want to add a new table/type, we need to add new functions to this trait + implement other traits for the type etc. + specify the table in the tables file and so on. If it were generic, this overhead (I think) would be lowered, since it's 1 less thing to implement.

Although I guess there is something to be said for per-store codecs... I would imagine memdb and staticfile would represent their data differently, maybe one is compressed and the other is not. Not entirely sure how that would be solved; maybe some wrapper types that specify how the data should ideally be encoded? See e.g. the Json<T>/Msgpack<T> stuff here: https://docs.rs/kv/latest/kv/

Edit: Another example: this one is a fully typed "high level" (not that much) wrapper around LMDB (which MDBX is based on): https://docs.rs/heed/latest/heed/. This one also sort of uses wrapper types (e.g. OwnedType); these are defined in https://docs.rs/heed-types/0.8.0/heed_types/index.html
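For illustration, the wrapper-type idea could look something like this (a hypothetical sketch; Plain, Compressed and the simplified Encode trait are assumptions, not what kv or heed actually define):

```rust
// Simplified codec trait for this sketch (the PR's Encode uses an associated
// Encoded type instead of raw bytes).
pub trait Encode {
    fn encode(&self) -> Vec<u8>;
}

/// Stored using the type's canonical encoding, e.g. for mdbx.
pub struct Plain<T>(pub T);
/// Stored compressed, e.g. how a static-file store might want its data.
pub struct Compressed<T>(pub T);

impl<T: Encode> Encode for Plain<T> {
    fn encode(&self) -> Vec<u8> {
        self.0.encode()
    }
}

impl<T: Encode> Encode for Compressed<T> {
    fn encode(&self) -> Vec<u8> {
        // Placeholder: a real store would plug in lz4/zstd/etc. here.
        compress(&self.0.encode())
    }
}

fn compress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec() // identity stand-in for an actual compression routine
}
```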


joshieDo commented Oct 10, 2022

> if we add a new database type we have to implement all of these functions for the new database, right?

Yes; as suggested on Discord, this trait could be split into smaller ones. libmdbx would probably implement all of them, while static would only implement the reading and appending ones.

> If we want to add a new table/type, we need to add new functions to this trait + implement other traits for the type etc. + specify the table in the tables file and so on.

It really depends on the kind of type and its purpose. If it's an index, you probably wouldn't need to, since it would be abstracted away anyway.

What I'm having trouble grasping at the moment is that much of the data doesn't have a straight encode <-> decode mapping. It has to be constructed from different places/indexes/tables/files depending on the store's specifics. Getting a Block, even without transactions, means hitting different tables (BlockBodies, Headers, Transactions, etc.) to stitch it up. Calling .write_blocks([Block, ..]) would hit multiple tables on mdbx, while static would most likely have to do some fancy compression + hash table work.

However, the type that RPC/executor/stages expect is supposed to be the same. If we don't have a common API (even if split among smaller traits), won't it just get kind of messy?

As for how each piece of data should ultimately be encoded before writing to disk, this is what I was going for: https://github.com/foundry-rs/reth/blob/joshie/kv_mdbx/crates/db/src/kv/table.rs. I'll take a look at those, thanks!

I feel like I'm misunderstanding your point. I'll think more about it.

Block example from erigon:
https://github.com/ledgerwatch/erigon/blob/6815504462235f12624afccb9e20a92a08580b0f/core/rawdb/accessors_chain.go#L1112
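In Rust terms, a hypothetical sketch of that stitching for an mdbx-backed store might look like this (MdbxStore, the helper methods and the placeholder types are all assumptions for illustration):

```rust
// Placeholders for the PR's real types and an mdbx-backed store.
type BlockNumber = u64;
struct Header;
struct BlockBody;
struct Transaction;
struct Block { header: Header, transactions: Vec<Transaction> }
struct ApiError;
struct MdbxStore;

impl MdbxStore {
    fn read_block(&self, number: BlockNumber) -> Result<Block, ApiError> {
        // A full Block is stitched together from several tables rather than
        // decoded from a single value.
        let header = self.read_header(number)?;            // Headers table
        let body = self.read_block_body(number)?;          // BlockBodies table
        let transactions = self.read_transactions(&body)?; // Transactions table
        Ok(Block { header, transactions })
    }

    // Per-table reads, stubbed out in this sketch.
    fn read_header(&self, _n: BlockNumber) -> Result<Header, ApiError> { Ok(Header) }
    fn read_block_body(&self, _n: BlockNumber) -> Result<BlockBody, ApiError> { Ok(BlockBody) }
    fn read_transactions(&self, _b: &BlockBody) -> Result<Vec<Transaction>, ApiError> { Ok(vec![]) }
}
```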

fn write_total_difficulty(&self, difficulty: U256) -> Result<(), ApiError>;
fn truncate_total_difficulty(&self, from: BlockNumber) -> Result<(), ApiError>;

fn read_receipt(&self, number: Option<TxNumber>) -> Result<Receipt, ApiError>;
Member


Let's call this TxIndex instead of number

Suggested change
fn read_receipt(&self, number: Option<TxNumber>) -> Result<Receipt, ApiError>;
fn read_receipt(&self, number: Option<TxIndex>) -> Result<Receipt, ApiError>;

Comment on lines +36 to +37
fn move_to_canonical_txes(&self) -> Result<(), ApiError>;
fn move_to_non_canonical_txes(&self) -> Result<(), ApiError>;
Member


What do you envision this doing?


fn read_canonical_hash(&self, number: u64) -> Result<H256, ApiError>;
fn write_total_difficulty(&self, difficulty: U256) -> Result<(), ApiError>;
fn truncate_total_difficulty(&self, from: BlockNumber) -> Result<(), ApiError>;
Member


What does this do?

fn read_account_code(&self, address: Address, tx_number: Option<TxNumber>) -> Result<Bytes, ApiError>;
fn read_account_code_size(&self, address: Address, tx_number: Option<TxNumber>) -> Result<u64, ApiError>;
fn read_account_storage(&self, address: Address, tx_number: Option<TxNumber>, key: H256) -> Result<Bytes, ApiError>;
fn read_account_incarnation(&self, address: Address, tx_number: Option<TxNumber>) -> Result<u64, ApiError>;
Member


Should incarnations be part of the struct Account or should they be in a separate table? I can imagine in Ethereum we'd want to make Incarnations a separate table and eventually move it out of MDBX, because SELFDESTRUCT is going away. Might save some space, but probably trivial?

(same message sent on Discord)

Collaborator Author


Yeah, leaning towards having it in a different table, tbh.

Comment on lines +13 to +15
/// Common API for querying and writing data.
#[rustfmt::skip]
pub trait DataApi {
Member


Agreed on splitting this into smaller, logically coupled traits which get implemented by each in-memory type.

One idea is that we could define trait HeaderReader / trait HeaderWriter, and do the same for blocks, transactions, and accounts.
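For example (hypothetical signatures, assuming the Header, BlockNumber and ApiError types from this PR are in scope):

```rust
// Each store implements only the pieces it supports, e.g. a static-file store
// could implement HeaderReader and an append-style writer, but not random writes.
pub trait HeaderReader {
    fn read_header(&self, number: BlockNumber) -> Result<Header, ApiError>;
}

pub trait HeaderWriter {
    fn write_header(&self, number: BlockNumber, header: Header) -> Result<(), ApiError>;
}

// The same split would repeat for blocks, transactions, and accounts.
```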

Comment on lines +155 to +163
fn encode(self) -> Self::Encoded {
todo!();
}
}

impl Decode for Vec<U256> {
fn decode(_value: &[u8]) -> Result<Self, KVError> {
todo!();
}
Member


I'd do

impl<T: Encode> Encode for Vec<T> {
    // fn encode:
    // serialize the length,
    // then call `encode` on each item in the vec
}

and the same for decode: read the len from the first number, then decode each element in 0..len.

Not sure if you could serialize length as u8/u16/u32 etc.
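A minimal sketch of that idea as a length-prefixed codec (the trait shapes below are simplified assumptions; the PR's actual Encode/Decode use an associated Encoded type):

```rust
#[derive(Debug)]
pub struct KVError;

pub trait Encode {
    fn encode(&self, out: &mut Vec<u8>);
}

pub trait Decode: Sized {
    fn decode(buf: &mut &[u8]) -> Result<Self, KVError>;
}

impl<T: Encode> Encode for Vec<T> {
    fn encode(&self, out: &mut Vec<u8>) {
        // Prefix with the element count as a u32, then encode each element.
        out.extend_from_slice(&(self.len() as u32).to_be_bytes());
        for item in self {
            item.encode(out);
        }
    }
}

impl<T: Decode> Decode for Vec<T> {
    fn decode(buf: &mut &[u8]) -> Result<Self, KVError> {
        // Read the length prefix, then decode that many elements.
        if buf.len() < 4 {
            return Err(KVError);
        }
        let (len_bytes, rest) = buf.split_at(4);
        *buf = rest;
        let len = u32::from_be_bytes(len_bytes.try_into().map_err(|_| KVError)?) as usize;
        (0..len).map(|_| T::decode(buf)).collect()
    }
}
```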


@joshieDo joshieDo Oct 11, 2022


Ideally, the codec we choose should do that?

Comment on lines 101 to +102
#[allow(non_camel_case_types)]
type BlockNumber_BlockHash_TxId = Vec<u8>;
type BlockNumber_BlockHash_TxNumber = Vec<u8>;
Member


A bit of a weird typedef here; is this 3 things?

Collaborator Author


I wanted some representation of a concatenation of (BlockNumber, BlockHash, TxNumber); open to changing it 🤷‍♂️
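One possible alternative is a dedicated key struct whose byte encoding is that concatenation (a hypothetical sketch; the field types assume BlockNumber/TxNumber are u64 and BlockHash is 32 bytes):

```rust
/// Composite table key replacing the `Vec<u8>` typedef.
pub struct BlockNumberHashTxNumber {
    pub number: u64,
    pub hash: [u8; 32],
    pub tx_number: u64,
}

impl BlockNumberHashTxNumber {
    /// Big-endian concatenation so keys sort by block number first.
    pub fn to_key_bytes(&self) -> Vec<u8> {
        let mut key = Vec::with_capacity(8 + 32 + 8);
        key.extend_from_slice(&self.number.to_be_bytes());
        key.extend_from_slice(&self.hash);
        key.extend_from_slice(&self.tx_number.to_be_bytes());
        key
    }
}
```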


gakonst commented Oct 10, 2022

> Although I guess there is something to be said for per-store codecs... I would imagine memdb and staticfile would represent their data differently, maybe one is compressed and the other is not.

This is a great point, @onbjerg, and I think we'd want to maintain the optionality.

> We also still need an abstraction for the notion of transactions though, even if the underlying database does not use transactions.

I think it's likely our abstraction is leaky and requires that in-mem test dbs will need to implement some transaction-related logic. Do we consider that acceptable?

/// Returns the lowest and highest block number the data structure contains.
fn read_block_number_range(&self) -> Result<(BlockNumber, BlockNumber), ApiError>;
fn read_block_latest(&self) -> Result<Block, ApiError>;
fn read_block(&self, number: Option<BlockNumber>, hash: Option<BlockHash>) -> Result<Block, ApiError>;
Collaborator


Nit: maybe have an enum BlockId here, with variants Number and Hash?

Member


Yes, for sure, let's use the enum type instead of Option.
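A minimal sketch of that enum (hypothetical; assumes the PR's BlockNumber and BlockHash aliases are in scope):

```rust
/// Identifies a block by either its number or its hash, replacing the two Options.
pub enum BlockId {
    Number(BlockNumber),
    Hash(BlockHash),
}

// The reader then takes a single, unambiguous argument:
// fn read_block(&self, id: BlockId) -> Result<Block, ApiError>;
```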


rakita commented Oct 10, 2022

I don't see functions that give us a Cursor; is it needed for ETL?

Edit: additionally, transactions are created and cleared (committed); see this comment: https://github.com/foundry-rs/reth/blob/c41c6b99a66bdb662602d0941f353984cd37a052/crates/db/src/kv/mod.rs#L92-L93
We probably need some function that does that, especially for write transactions, assuming that open-write-close is not the most optimal pattern, and open, write x times, close is better.
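For illustration, a cursor abstraction on top of such a trait could look roughly like this (hypothetical names; assumes the PR's ApiError is in scope; not part of this PR):

```rust
// Iterates a table in key order, e.g. for ETL-style bulk reads.
pub trait Cursor<K, V> {
    /// Position the cursor at the first entry with key >= `key` and return it.
    fn seek(&mut self, key: &K) -> Result<Option<(K, V)>, ApiError>;
    /// Advance to the next entry, returning None once the table is exhausted.
    fn next(&mut self) -> Result<Option<(K, V)>, ApiError>;
}
```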


onbjerg commented Oct 11, 2022

> I think it's likely our abstraction is leaky and requires that in-mem test dbs will need to implement some transaction-related logic. Do we consider that acceptable?

I don't think we have any other option. If we want to have control over how things are committed to the database, which is essential to staged sync, then we need transactions. Otherwise how would e.g. an MDBX impl know when to commit?
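For example, a minimal write-transaction abstraction might look like this (a hypothetical sketch; assumes the PR's ApiError is in scope; an in-memory test DB could make commit a no-op):

```rust
// Batches several writes and makes them durable in a single commit, which is
// the control point staged sync needs.
pub trait DbTxMut {
    fn put(&mut self, table: &str, key: &[u8], value: &[u8]) -> Result<(), ApiError>;
    /// Flush all buffered writes; a no-op for an in-memory test database.
    fn commit(self) -> Result<(), ApiError>;
}
```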


onbjerg commented Oct 11, 2022

Re: common types/API etc., I still don't think having a read_block, read_account, read_xyz is the way to go. So much implementation overhead, and to be honest, I don't think much of the node will deal with a full block anyway, although I might be wrong here: the stages process headers and bodies separately, and transactions separately from that; the downloader does the same; the RPC does the same most of the time (except for e.g. eth_getBlock). Most of these aggregate types are just not super necessary if they are the main reason the abstractions become rigid, IMO.

The main difference in how data is treated between the different databases is how the data is indexed and colocated. Indexes are a function of table keys, and indexes largely also determine how data is colocated, so I'd assume a more generic abstraction is possible.


gakonst commented Oct 11, 2022

> I don't think we have any other option. If we want to have control over how things are committed to the database, which is essential to staged sync, then we need transactions. Otherwise how would e.g. an MDBX impl know when to commit?

Yep. OK sounds good to me. I'm good with that.

> The main difference in how data is treated between the different databases is how the data is indexed and colocated. Indexes are a function of table keys, and indexes largely also determine how data is colocated, so I'd assume a more generic abstraction is possible.

Let's have a call tomorrow with @joshieDo and iron this out?

@mattsse mattsse mentioned this pull request Oct 11, 2022
@onbjerg onbjerg added the C-enhancement (New feature or request) and A-db (Related to the database) labels Oct 11, 2022
@joshieDo
Collaborator Author

I think we can close this one.

Added it to the tracking issue.

@joshieDo joshieDo mentioned this pull request Oct 14, 2022
@gakonst gakonst closed this Oct 14, 2022
@gakonst gakonst deleted the joshie/api branch October 14, 2022 18:14
yutianwu pushed a commit to yutianwu/reth that referenced this pull request Jun 13, 2024