Cargo is required to build/install (rustup is an easy way to install it). Then from the root of this repo, run
git submodule --init
cargo install --path chaindexer/
The query engine supports querying directly from an RPC node (i.e. without pre-indexing the entire chain). To spin up a SQL REPL and start querying data from your RPC, you can simply run:
ETH_RPC_API=<your rpc url> chaindexer sql
Running queries and building indices both require an RPC api that you can connect to. But it is possible to query chain data without building an index ahead of time. This can be done as long as the queries ran are highly selective in terms of block numbers accessed. In other words: since on-chain data is range partitioned by block number, queries must only access a specific range of block numbers--otherwise the entire chain state would have to be indexed (see next section for how to do that.)
For example, a valid query would be something like:
select * from eth.logs l
join eth.blocks b
on b.hash = l.block_hash
where b.number >= (current_block('eth') - 10) -- 10 most recent blocks for ethereum
Predicates that can be evaluated on block numbers will always be pushed down to the table scan. As long as queries have predicates like this that can be pushed down so that only a small subset of block numbers are needed to be retrieved, you should be able to query data directly via the RPC api (i.e. without a full index).
You can also have the query engine automatically filter out all but the last n
most
recent blocks by specifying. Running:
chaindexer sql --last-n-blocks 10
select * from eth.logs l
join eth.blocks b
on b.hash = l.block_hash;
is equivalent to the query shown previously. For doing interactive querying without doing
a pre-index, using the last-n-blocks
option is highly recommended.
NOTE: Other SQL interfaces
A JDBC compatible server is coming soon (this would allow you to use this in something like PyCharm or any other database client that accepts JDBC).
Schemas, tables and each table's columns can be shown via SQL:
SHOW TABLES; -- shows available tables, what tables are available will depend on your config
DESCRIBE eth.blocks; -- describe a single table
For more info on the CLI commands, run:
chaindexer help
To query the full chain, an index for that chain must be constructed and saved either on disk, or on a blob store (currently IPFS, and S3). An index is essentially just a file that contains mappings for tables and block range to the location of files containing the raw data for that table/block range. Each index is backed by a single storage config, which specifies the location of the index file as well as where to place the actual data partitions.
You'll probably want a config file to specify your storage layer before building an index:
chaindexer config
This will create a TOML file that you can edit. For example, to use S3 (other configuration
examples can be seen in the source code at src/storage/conf.rs
, datatype is named StorageConf
)
as your storage layer, you can open up the config file at ~/.chaindexer/config.toml
and change the stores.eth
section (s3 credentials are read from the environment):
[stores.eth]
type="s3"
bucket="<your s3 bucket>"
# s3 key prefix for where all partition files will be placed
prefix="/prefix"
# the root index will be stored here. the index file contains mappings to the partition
# files which will be loaded into prefix.
# this file will also be under prefix, so if your prefix is `/prefix`, the full s3 key
# of the root index file would be `/prefix/eth.db`
filename="eth.db"
Each chain has its own storage configuration, which is why the storage layer for eth is
specified under stores.eth
. If you'd rather store data on disk, the storage layer in the
default generated config file should be fine.
Now with an eth store layer specified, we can run the index
command to start
building the ethereum index. Currently running index
indexes the raw chain data
only (we are actively working to move some of our custom data pipelines
into this open source project).
chaindexer index --chain eth
Now non-selective queries can be run (if the partition index is not fully built, it will just treat the blocks that is built for as the only blocks that exist.)
We recognize that indexing an entire chain just to run queries is unrealistic for most users. We are currently working to put our pre-indexed ethereum onto IPFS, so that other users can use the query engine without needing to index the entire chain.
Chain data is indexed and then persisted using Apache Parquet on various storage layers. The query engine is leverages DataFusion, a highly flexible and extensible in-memory query engine which uses Apache Arrow as its in-memory columnar data format.
Each index is a SQLite database that describes to the query engine how to access the raw data during each table scan. For example, an ethereum index would have a row in the SQLite db mapping from the blocks table, for block range 0 - 1,000,000 to a parquet file on S3 (or other storage layers, including your disk). That parquet file would have 1,000,000 rows, representing all the blocks in that range.