
MultiversX ETL

ETL (extract, transform, load) tools for publishing MultiversX blockchain data, extracted from a standard MultiversX Elasticsearch instance, to Google BigQuery.

Published data

Mainnet

Set up the virtual environment

Create a virtual environment and install the dependencies:

python3 -m venv ./venv
source ./venv/bin/activate

pip install -r ./requirements.txt --upgrade
pip install -r ./requirements-dev.txt --upgrade

Run the tests

export PYTHONPATH=.
pytest -m "not integration"

Quickstart

This implementation copies the data from Elasticsearch in two parallel flows.

One flow copies the append-only indices (e.g. blocks, transactions, logs, receipts) into a staging BQ dataset. This process is incremental: it only copies the data added since the last run, and it is executed more often than the second flow (every 1 hour, by default). Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service.

The second flow copies the mutable indices (e.g. tokens, accounts) into a staging BQ dataset. This process is not incremental: tables are truncated and reloaded on each run. Once the staging dataset is loaded, the data is transferred to the main BQ dataset using the BigQuery Data Transfer Service. This flow is executed less often than the first (every 4 hours, by default).

To run the two flows, either use the Docker setup (see the next section) or invoke the following commands directly:

# First, set the following environment variables:
export GCP_PROJECT_ID=multiversx-blockchain-etl
export WORKSPACE=${HOME}/multiversx-etl/mainnet

# The first flow (for append-only indices):
python3 -m multiversxetl.app process-append-only-indices --workspace=${WORKSPACE} --sleep-between-iterations=3600

# The second flow (for mutable indices):
python3 -m multiversxetl.app process-mutable-indices --workspace=${WORKSPACE} --sleep-between-iterations=86400
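
To sanity-check a load, you can compare row counts between Elasticsearch and the BQ tables. A minimal sketch using the standard bq CLI (the staging dataset name below is an assumption; use the one configured for your deployment):

# Count the rows loaded into a staging table (dataset name is hypothetical):
bq query --use_legacy_sql=false \
    "SELECT COUNT(*) FROM \`${GCP_PROJECT_ID}.mainnet_staging.transactions\`"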

Rewinding

Sometimes, errors occur during the ETL process. For the append-only flow, it's recommended to rewind the BQ tables to the latest checkpoint (the last known good state) and only then re-run the process. Rewinding de-duplicates the data up front, through a simple data removal. Otherwise, the full de-duplication step would be employed (it runs automatically, after each bulk of tasks, whenever the row counts in BQ and Elasticsearch do not match), which is more expensive.

To rewind the BQ tables corresponding to the append-only indices to the latest checkpoint, run the following command:

python3 -m multiversxetl.app rewind --workspace=${WORKSPACE}
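
Conceptually, rewinding removes the rows appended after the checkpoint. A hypothetical illustration of what such a removal looks like (table, column, and timestamp value are assumptions; the rewind command does this for you):

# Illustrative only: delete rows newer than the checkpoint timestamp (value is hypothetical):
bq query --use_legacy_sql=false \
    "DELETE FROM \`${GCP_PROJECT_ID}.mainnet.transactions\` WHERE timestamp > 1690000000"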

If the checkpoint is not available or is suspected to be corrupted, find the latest good checkpoint by running the following command:

python3 -m multiversxetl.app find-latest-good-checkpoint --workspace=${WORKSPACE}

Docker setup

Build the Docker image:

docker build --network host -f ./docker/Dockerfile -t multiversx-etl:latest .

Set up the containers:

# mainnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/mainnet.env \
    --project-name multiversx-etl-mainnet up --detach

# devnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/devnet.env \
    --project-name multiversx-etl-devnet up --detach

# testnet
docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/testnet.env \
    --project-name multiversx-etl-testnet up --detach
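
To verify that a deployment is running, follow the container logs (standard Docker Compose usage; mainnet shown here):

docker compose --file ./docker/docker-compose.yml \
    --env-file ./docker/env/mainnet.env \
    --project-name multiversx-etl-mainnet logs --follow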

Generate schema files

Maintainers of this repository should re-generate the BigQuery schema files whenever the Elasticsearch schema is updated, by running the following command (make sure to check out mx-chain-tools-go in advance):

python3 -m multiversxetl.app regenerate-schema --input-folder=~/mx-chain-tools-go/elasticreindexer/cmd/indices-creator/config/noKibana/ --output-folder=./schema

The resulting files should be committed to this repository.

At times, the load step may fail for some tables, e.g. when new fields are added to the Elasticsearch indices that the BigQuery schema is not yet aware of. If so, re-generate the schema files (see above), update the BigQuery table schema with the bq command (the example below is for the tokens table), and restart the ETL flow:

bq update ${GCP_PROJECT_ID}:${BQ_DATASET}.tokens schema/tokens.json
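
To confirm the update took effect, you can inspect the table schema with bq's standard show command:

bq show --schema --format=prettyjson ${GCP_PROJECT_ID}:${BQ_DATASET}.tokens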

Management (Google Cloud Console)

Below are a few links useful for managing the ETL process. They are only accessible to the MultiversX team.
