log :: 2026‐04
Tip
Relay Readiness, Mempool & TxSubmission, Snapshots & Bootstrapping, Simulation Time, Trace/OTEL Tooling
These notes coordinate “clean relay” delivery: removing sync effects, using DB snapshots for consistent reads, re-enabling deterministic simulations (including time control), and fixing e2e performance regressions.
They advance transaction handling end-to-end by turning the mempool into a stage, wiring it to submit API + adopt-chain + txsubmission, and starting validation/invalidation (TTL, catch-up cost controls).
Snapshot production is being industrialized (db-analyser/mithril/bootstrap-hd, diff tooling, GH automation) to keep CI aligned with recent epochs and reduce timeouts.
Observability continues to mature (trace_span! migration, OTEL UI/dashboard, CBOR inspection tools) and is used to diagnose performance and protocol behavior.
Tactical planning highlights parallel workstreams (ledger rules/epoch transitions, UPLC VM optimization/fuzzing, peer selection/networking, Antithesis readiness, and partner/acceptance processes).
- Eric: Big PR review with refactoring of sync effects & database snapshots
  - Mempool logic PR
  - Mempool & tx submission protocol, working on simplifying the interaction
  - Observability & Public API
  - Simulation tests improvements (with Roland) - final step for end to end test
- Josh: Open PRs for all static checks
  - Balancing correctness & speed
  - Pending review with Matthias
  - Pairing time with Matthias on Friday (epoch boundary & context; + changes for more mechanical tests)
- Jonathan: working on script context & UPLC VM
  - Optimisation fuzzer & optimisation; benchmarks for UPLC VM (50% reduction of speed) from Pi improvements and shortcuts
  - Haskell comparison of UPLC and implementation differences
  - Full review of the massive optimisation PR (including Pi modifications)
  - Still need a pass on the rebase/merge of the optimisation PR
- Eric: Configuration: what do we need to "show", and tweaking the environment variables
  - Option 1: make a build with the basic configuration
  - Option 2: make the configuration available + provide a user guide and the implications of the available tweaks
  - Decision to be made on providing a basic setup so that people are aware of the configuration, then adding customisation parameters little by little
Making the acceptance process EDR worthy; Inviting all the monthly meetings for each contract; Contracts review and finalisation for Antithesis; First mapping of potential partners
Synchronizing for the release of the "clean" Amaru relay & making everything ready to start once we have it
- Josh & KtorZ: pairing on next steps for ledger
- Signing the Antithesis contract
- Using the acceptance process and revising it based on first tests
Amaru tactical planning alignment
- Eric: Transaction validation & Mempool
Huge PR removing sync effects (with access to the store), navigating the chain store with snapshots while mutations happen. The sync effects prevent us from finishing the simulation (currently turned off): protocols run all the time and the passage of time in the simulation is a problem, so we need to remove the sync effects (executed by pure stages)
- Eric: Investigation of the PR linked to the performance regression (bootstrapping) in the end-to-end snapshot test at epoch 185 (no good run since March)
Two outcomes possible: lower target epoch (in a PR) or increase the timeout (on main)
- Josh: ledger rules implementation: ppview hash; native script execution; outside of forecast for Plutus scripts (3 separate PRs)
- KtorZ: working on something at an interface and need to synchronize (potential refactoring)
- KtorZ: working on epoch transition and the boundaries (2000 blocks earlier than we want); the problem is that artifacts of the epoch transition have to be stored in memory and not on disk (i.e. rewards, because they are not stable enough at that point of the calculation)
  Ledger state lives on disk and in an "in between" environment
- KtorZ: mapping out the current state logic and the succession of steps to get it right
- KtorZ: working on integrating the rewards calculation related to the epoch transition in the most pragmatic way
- Julien: simplifying the creation of bootstrap snapshots to have a simple command to choose the epoch and they'll
- Julien: working on removing the protocol v9
- Julien: reviving the Pi project
- Jonathan: updating and polishing the fuzzer; ledger rules comparison between Haskell & Amaru with the UPLC turbo as well
- KtorZ: preparing the UPLC turbo re-re-release on crates.io; one was released by Lucas, one by Julien; now the sources are cleaned up and everything is renamed to amaru-uplc, managed by Amaru committers; only the core maintainers team can publish releases
- Roland: adding stages for tracking the peers to disconnect when they do something wrong; peer selection stage
- Roland: adding peer candidates from the ledger (rules, addresses, IP) and making a choice
Finalising the Antithesis contract; meeting with potential users/contributors of the relay node; preparing the node diversity workshop in Porto
Synchronizing for the release of the "clean" Amaru relay & making everything ready to start once we have it
- Josh & KtorZ: work on ledger rules not impacted for the refactoring
- Roland: looking for draft PR feedback about networking and peer selection
- Signing the Antithesis contract
- Drafting use case focused timelines and agreements
db-analyser is apparently broken in lsm mode at the moment and fails after some time with mainnet data.
The approach I am considering for automatic snapshot creation is the following:
- create an initial amaru snapshot from provided script, using mithril + db-analyser InMem + make bootstrap-hd
- upload this initial snapshot automatically
- then an automated process can fetch the latest amaru snapshot, fast-sync using amaru-ledger sync, upload new snapshot
This should run smoothly on GitHub Actions.
Once db-analyser is better supported with lsm we can consider other approaches.
- Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
-
revived the amaru-pi project so that it can be used as a local relay broadcasting a dashboard on an external monitor
- migrated to latest pi image infrastructure: now can create image programmatically on local machine
- made sure it supports both local screens and external display
- introduced ssh over usb (including internet sharing) so that it can be just plugged on a laptop
-
made a push towards testing Halo2/KZG on ESP32-S3; no luck so far
- implement GH flow pushing amaru snapshots automatically
- make sure bootstrap is amended so that it can leverage those
The PR using the mempool as a stage is now merged. The stage receives messages from the local submit API, from the adopt_chain stage when a block is adopted, and from the tx submission protocol.
I have now started a PR to validate / invalidate the mempool transactions. It does the simplest thing (just revalidating txs when the tip changes), but I would like to include the TTL check as well.
I also need to find a way to disable the tx submission protocol in catch-up mode. Otherwise each tip adoption will be too expensive.
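The revalidation idea can be sketched as follows. This is a minimal illustration only, not the code of the PR: the `Tx`, `SyncMode` and `Mempool` types, the `on_new_tip` function and the rule "a transaction is expired once the tip slot is past its TTL" are all assumptions made for the example, and skipping the work while catching up is just one possible reading of the cost control mentioned above.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified transaction: only the fields the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Tx {
    pub id: u64,
    /// Last slot at which the transaction is still accepted (None = no TTL).
    pub ttl: Option<u64>,
}

/// Hypothetical sync mode of the node.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SyncMode {
    CatchingUp,
    CaughtUp,
}

#[derive(Default)]
pub struct Mempool {
    txs: BTreeMap<u64, Tx>,
}

impl Mempool {
    pub fn insert(&mut self, tx: Tx) {
        self.txs.insert(tx.id, tx);
    }

    pub fn len(&self) -> usize {
        self.txs.len()
    }

    /// Revalidates every pending transaction when the tip changes and
    /// returns the ids of the evicted ones. `is_valid` stands in for the
    /// ledger validation. Nothing is done while catching up (assumption).
    pub fn on_new_tip<F>(&mut self, tip_slot: u64, mode: SyncMode, is_valid: F) -> Vec<u64>
    where
        F: Fn(&Tx) -> bool,
    {
        if mode == SyncMode::CatchingUp {
            return Vec::new();
        }
        let evicted: Vec<u64> = self
            .txs
            .values()
            .filter(|tx| tx.ttl.map_or(false, |ttl| ttl < tip_slot) || !is_valid(tx))
            .map(|tx| tx.id)
            .collect();
        for id in &evicted {
            self.txs.remove(id);
        }
        evicted
    }
}
```

The TTL check runs before the (more expensive) validation callback, so expired transactions are dropped without consulting the ledger.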
This is necessary to:
- Put the simulation tests back on, because external effects prevented us from driving the simulation strictly via the passing of time.
- Remove a whole class of bugs where we read inconsistent data.
I still have to tackle a few comments on that PR and we left some TODOs that will make things even better:
- Better deal with the case of the origin point that doesn't have a corresponding header in the store. That forces us to special case many calls.
- Store tips instead of hashes for the anchor and best chain. This will help for next point.
- Improve the test support for functions using pure-stage.
It seems that there are two main causes:
- Accessing the best tip by loading the corresponding header from the hash then getting its tip in catch-up mode.
- The Phase 2 validation makes catching up slower.
We decided to:
- Implement the storing of tips in the DB.
- Test with fewer epochs on PRs for e2e tests.
- Keep the same epochs for e2e tests on main with a longer timeout.
Amaru as a production relay.
- Finish addressing @rkuhn's comments on the effects + store PR
- Finish the mempool implementation
The PR integrating the mempool as a stage is now merged:
- The mempool stage serializes write accesses to the mempool, coming from
- The local submit API.
- The tx submission protocol.
- The ledger when there are roll forwards and rollbacks.
- Work has started to add validation / invalidation using the ledger.
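To illustrate how a single stage can serialize writes coming from several producers, here is a minimal sketch using a standard library channel. It does not use the pure-stage API; `MempoolMsg`, `MempoolState`, `step` and `run_stage` are hypothetical names, and re-adding transactions on rollback is an assumption of the example.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc;
use std::thread;

pub type TxId = u64;

/// Hypothetical messages, one variant per producer described above.
#[derive(Debug)]
pub enum MempoolMsg {
    SubmitLocal(TxId),      // from the local submit API
    FromPeer(TxId),         // from the tx submission protocol
    RollForward(Vec<TxId>), // ledger: these txs are now in a block
    Rollback(Vec<TxId>),    // ledger: these txs left the chain (assumed to be re-added)
}

#[derive(Debug, Default, PartialEq)]
pub struct MempoolState {
    pub pending: BTreeSet<TxId>,
}

/// Pure transition function: one message in, new state out.
pub fn step(mut state: MempoolState, msg: MempoolMsg) -> MempoolState {
    match msg {
        MempoolMsg::SubmitLocal(id) | MempoolMsg::FromPeer(id) => {
            state.pending.insert(id);
        }
        MempoolMsg::RollForward(included) => {
            for id in included {
                state.pending.remove(&id);
            }
        }
        MempoolMsg::Rollback(restored) => state.pending.extend(restored),
    }
    state
}

/// The stage owns the state; messages are applied strictly one at a time.
pub fn run_stage(rx: mpsc::Receiver<MempoolMsg>) -> MempoolState {
    let mut state = MempoolState::default();
    for msg in rx {
        state = step(state, msg);
    }
    state
}

/// Three producers share the same channel; the stage stops when all are dropped.
pub fn demo() -> MempoolState {
    let (tx, rx) = mpsc::channel();
    let stage = thread::spawn(move || run_stage(rx));
    let (api, peer, ledger) = (tx.clone(), tx.clone(), tx);
    api.send(MempoolMsg::SubmitLocal(1)).unwrap();
    peer.send(MempoolMsg::FromPeer(2)).unwrap();
    ledger.send(MempoolMsg::RollForward(vec![1])).unwrap();
    drop((api, peer, ledger));
    stage.join().unwrap()
}
```

Because only the stage touches the state, no lock is needed and the transition function stays easy to test in isolation.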
We currently deactivated the simulation tests because determining when a simulation must stop is non trivial. It will be possible to properly do it once we drive it purely with the passing of time (which also allows us to test timeouts). However, in order to do this we need to remove the "sync effects" support from pure-stage and make sure that every effect, store accesses in particular, can be suspended. This is also the opportunity to make sure that when we navigate the database to query data, we use a snapshot of the database so that we don't work on data that's being mutated by another process as we traverse it (that would be a major source of bugs).
This PR is quite large and touches a lot of "business" logic. It is now ready for review. I would be more confident if we had simulation tests to test it, but that's the very PR that will help us have the tests back!
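The idea of driving a simulation purely by the passing of time can be illustrated with a small virtual clock. This is only a sketch under assumptions: `SimClock`, `schedule` and `run_until` are hypothetical names, not part of pure-stage, and events are reduced to labels.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical virtual clock: time only moves when the simulation moves it.
#[derive(Default)]
pub struct SimClock {
    now: u64,
    seq: u64,
    // Min-heap of (due time, insertion order, label).
    timers: BinaryHeap<Reverse<(u64, u64, &'static str)>>,
}

impl SimClock {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn now(&self) -> u64 {
        self.now
    }

    /// Registers an event (a suspended effect, a timeout...) `delay` ticks from now.
    pub fn schedule(&mut self, delay: u64, label: &'static str) {
        self.timers.push(Reverse((self.now + delay, self.seq, label)));
        self.seq += 1;
    }

    /// Advances virtual time up to `deadline`, firing due events in order.
    /// The run always terminates: it stops at the deadline even if events remain.
    pub fn run_until(&mut self, deadline: u64) -> Vec<(u64, &'static str)> {
        let mut fired = Vec::new();
        while self
            .timers
            .peek()
            .map_or(false, |entry| (entry.0).0 <= deadline)
        {
            let Reverse((due, _, label)) = self.timers.pop().expect("peeked entry exists");
            self.now = due;
            fired.push((due, label));
        }
        self.now = deadline;
        fired
    }
}
```

With such a clock, the stopping condition is simply "virtual time has reached the deadline", and a timeout becomes an ordinary scheduled event that can be observed in a test.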
We have regressed since March. The preprod test cannot read epoch 185 in less than 15 minutes. It seems that there are 2 main causes:
- Some unnecessary accesses to the database during catch-up, in particular to retrieve the best tip. We are using the best hash, then the header for that hash, then we get a point. Getting a point directly would be faster.
- The phase 2 validation has slowed down the processing.
During our meeting today we decided to:
- Fix the database access.
- Target an earlier epoch when testing PRs.
- Use a longer timeout when testing main.
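The database access issue can be pictured with the sketch below, which compares the "hash, then header, then point" path with reading a stored tip directly. The `Store` type, its fields and the load counter are hypothetical and only serve to make the extra lookup visible; they do not reflect the actual chain store schema.

```rust
use std::cell::Cell;
use std::collections::HashMap;

/// Shortened hash type, for illustration only.
pub type Hash = [u8; 4];

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Point {
    pub slot: u64,
    pub hash: Hash,
}

#[derive(Debug, Clone)]
pub struct Header {
    pub slot: u64,
}

/// Hypothetical store keeping both the best hash and the best tip.
pub struct Store {
    headers: HashMap<Hash, Header>,
    best_hash: Option<Hash>,
    best_tip: Option<Point>,
    /// Counts header loads so the two access paths can be compared.
    pub header_loads: Cell<usize>,
}

impl Store {
    pub fn new() -> Self {
        Store {
            headers: HashMap::new(),
            best_hash: None,
            best_tip: None,
            header_loads: Cell::new(0),
        }
    }

    /// Records a new best header and keeps the denormalized tip in sync.
    pub fn set_best(&mut self, hash: Hash, header: Header) {
        self.best_tip = Some(Point { slot: header.slot, hash });
        self.best_hash = Some(hash);
        self.headers.insert(hash, header);
    }

    fn load_header(&self, hash: &Hash) -> Option<&Header> {
        self.header_loads.set(self.header_loads.get() + 1);
        self.headers.get(hash)
    }

    /// Slow path: best hash, then header lookup, then point.
    pub fn best_point_via_header(&self) -> Option<Point> {
        let hash = self.best_hash?;
        let header = self.load_header(&hash)?;
        Some(Point { slot: header.slot, hash })
    }

    /// Fast path: the tip is stored as-is, no header lookup needed.
    pub fn best_point_direct(&self) -> Option<Point> {
        self.best_tip
    }
}
```

The trade-off is that the stored tip must be updated together with the best hash on every write, which is why both are set in a single function here.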
Generating snapshots has been a bit more involved than expected, especially for mainnet. Memory requirements (even higher with db-analyser) pushed me to investigate more recent versions of cardano-node and experiment with less memory-intensive backends (lmdb, then lsm). This also means having proper support for parsing on-disk UTxOs.
Note that the new snapshots for preprod/preview make it possible to bootstrap amaru DBs identical to those produced by the current method.
- Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
Published a new release for uplc; now used in amaru.
Upgraded cardano-zkvms to latest OpenVM beta 2.0.
Shared an awesome repo for amaru: awesome-amaru
- push snapshots creation
Merged trace_span! migration PR. Worked on improving some more trace support.
Created a simple dashboard for amaru (https://github.com/jeluard/amaru-dashboard). It relies on OTEL traces and an OTEL websocket bridge. The goal is to improve the Pi experiment with nice monitor support.
Created a web app to easily analyze and compare the structure of CBOR files.
This makes it easier to analyze CBOR dumps from cardano-node.
https://jeluard.github.io/cbor-structure/
- Refine and document 100% of Amaru's traces
- Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
- Provide a dashboard for Amaru PI
- Continue work on snapshot generation
I expanded the PR testing the select chain stage:
- The data generation is more realistic. Actions are generated in response to previous outputs, the same way things would work in production. For example we would get a FetchBlock message after having processed a NewTip message.
- The data generation also models restarts, since the select_chain stage initial state depends on the state of the chain store.
- The property that is checked is stronger, akin to "we must always emit downstream what we currently know as the best tip".
- In the light of that property I fixed the code which sometimes was regressing too far in the past or not returning an alternate fork.
This has not yet been reviewed by @rkuhn, he still needs to validate that I'm not off with the property as it stands in that PR.
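The shape of such a property can be sketched with a toy model. Everything here is a simplification and an assumption: `SelectChain`, `Action`, `oracle` and `check_property` are hypothetical, "best" is reduced to "highest height, first seen wins ties", and actions come from a small deterministic generator rather than being derived from previous outputs as in the real test.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Tip {
    pub id: u32,
    pub height: u64,
}

#[derive(Debug, Clone, Copy)]
pub enum Action {
    NewTip(Tip),
    Invalidate(u32),
}

/// Highest tip; the first one seen wins ties.
fn best_of(tips: &[Tip]) -> Option<Tip> {
    tips.iter().copied().fold(None, |best: Option<Tip>, t| match best {
        Some(b) if b.height >= t.height => Some(b),
        _ => Some(t),
    })
}

/// Toy stage tracking its best tip incrementally.
#[derive(Default)]
pub struct SelectChain {
    candidates: Vec<Tip>,
    best: Option<Tip>,
}

impl SelectChain {
    /// Processes one action and returns the tip emitted downstream.
    pub fn step(&mut self, action: Action) -> Option<Tip> {
        match action {
            Action::NewTip(tip) => {
                self.candidates.push(tip);
                if self.best.map_or(true, |b| tip.height > b.height) {
                    self.best = Some(tip);
                }
            }
            Action::Invalidate(id) => {
                self.candidates.retain(|t| t.id != id);
                if self.best.map_or(false, |b| b.id == id) {
                    // Fall back to the best remaining candidate, which may be on another fork.
                    self.best = best_of(&self.candidates);
                }
            }
        }
        self.best
    }
}

/// Oracle: recomputes the best valid tip from the whole history.
pub fn oracle(history: &[Action]) -> Option<Tip> {
    let mut valid: Vec<Tip> = Vec::new();
    for action in history {
        match action {
            Action::NewTip(t) => valid.push(*t),
            Action::Invalidate(id) => valid.retain(|t| t.id != *id),
        }
    }
    best_of(&valid)
}

/// Property: after every action, the emitted tip equals the oracle's best tip.
pub fn check_property(seed: u64, steps: usize) -> bool {
    let mut rng = seed;
    let mut next = move || {
        // Small linear congruential generator, good enough for a sketch.
        rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (rng >> 33) as u32
    };
    let mut stage = SelectChain::default();
    let mut history = Vec::new();
    for i in 0..steps {
        let action = if i > 0 && next() % 4 == 0 {
            Action::Invalidate(next() % i as u32)
        } else {
            Action::NewTip(Tip { id: i as u32, height: (next() % 10) as u64 })
        };
        history.push(action);
        if stage.step(action) != oracle(&history) {
            return false;
        }
    }
    true
}
```

The oracle is deliberately naive and slow; the value of the test comes from comparing it with the incremental logic after every single action.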
The Mempool was previously modeled as a Resource only, directly accessed by the submit api or the txsubmission stages.
It is now a proper stage with this PR and it currently receives messages:
- From the submit API to insert new local transactions.
- From the txsubmission stages to insert transactions from blocks or wait for new transactions.
This will:
- Make the processing of effects more homogeneous (all are async).
- Support simulating the passing of time for all effects, which allows us to re-enable the simulation tests (https://github.com/pragma-org/amaru/issues/737).
This is unfortunately a large PR which does the following:
- Removes the notion of an "external sync effect". Now all effects are async (which allows them to be interrupted).
- Makes async calls to store and ledger effects.
- Moves some store operations to the chain store directly, like switching to a new fork.
- Makes sure that read operations that are navigating the chain store, for example to get points along the best chain, operate on a snapshot of the store.
There are still a few things to tweak and test in that PR so it is still a draft at the moment.
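The benefit of reading from a snapshot can be shown with an in-memory stand-in. This sketch does not use the actual database API: `ChainStore`, `Snapshot` and `switch_to_fork` are hypothetical, and a copy-on-write `Arc` plays the role of a database snapshot.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Best chain as slot -> header label (labels stand in for real hashes).
type Chain = BTreeMap<u64, &'static str>;

/// Hypothetical chain store; the `Arc` makes snapshots cheap.
#[derive(Default)]
pub struct ChainStore {
    best_chain: Arc<Chain>,
}

/// Immutable view taken at one point in time.
pub struct Snapshot {
    best_chain: Arc<Chain>,
}

impl ChainStore {
    /// Takes a consistent view; later mutations do not affect it.
    pub fn snapshot(&self) -> Snapshot {
        Snapshot { best_chain: Arc::clone(&self.best_chain) }
    }

    /// Drops every entry at or after `from_slot` and appends the fork.
    /// `Arc::make_mut` copies the map if a snapshot still refers to it.
    pub fn switch_to_fork(&mut self, from_slot: u64, fork: &[(u64, &'static str)]) {
        let chain = Arc::make_mut(&mut self.best_chain);
        chain.retain(|slot, _| *slot < from_slot);
        chain.extend(fork.iter().copied());
    }
}

impl Snapshot {
    /// Walks the best chain as it was when the snapshot was taken.
    pub fn points(&self) -> Vec<(u64, &'static str)> {
        self.best_chain.iter().map(|(slot, hash)| (*slot, *hash)).collect()
    }
}
```

A reader holding a `Snapshot` sees either the old chain or the new one in full, never a mix of both, even if a fork switch happens halfway through its traversal.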
Make the node ready as a production relay.
Next week is off for me but when I come back, I will work on:
- Finishing the removal of sync effects.
- Simulating time in simulation tests to make them terminate effectively: https://github.com/pragma-org/amaru/issues/737.
- Finishing the implementation of the mempool to validate transactions coming from an applied block or taken to make a new block.
Worked on re-creating new snapshots with db-analyser, using Docker images.
Implemented a new bootstrap-hd command that bootstraps from a cardano-hd ledger state.
To check that snapshots produce the expected result, we can now use rocksdb-diff to verify that the resulting DBs are equivalent.
cargo install rocksdb-diff
rocksdb-diff ledger.preprod.db.old/164 ledger.preprod.db/164 --prefix acct,comm,dlg,drep,pool,pots,prop,slot,utxo,vote
- Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
- Continue this work; ideally we would host all amaru meaningful snapshots
This week I updated the animation for the new consensus graph traces and helped to have it merged. There was a performance regression that made the end to end test fail because the trace buffer was serializing data even while being deactivated. This means that we will have to find a way to run this serialization asynchronously when we want to enable traces in production.
I also fixed the GitHub workflow, which had some environment variable bugs when running workflows manually.
Now, while rebasing the PR that was adding a property test for the new select_chain stage (#726), I noticed 2 potential issues:
- When a block is invalidated we fall back to the header at the best chain, but there might be headers with valid blocks after the best chain tip.
- Similarly there could be a fork with valid blocks that could become the next best chain.
I'm still trying to represent these cases in the data generation and oracle + trying to find a good fix for these cases. In itself, the issue is not catastrophic because we never send incorrect information downstream. However we break the contract for that stage which should return the best known tip all the time (and that could delay the transmission of that information downstream).
There are now 3 issues for the development of the mempool (#733, #734, #735):
- Make mempool a proper stage.
- Implement the validation / invalidation of transactions (including TTL).
- Add observability
I already started on the first one, but I haven't finished testing the chain selection yet, so I need to do that first.
This all supports the possibility to run amaru as a relay node with a fixed set of peers.
- Finish the property test for the select_chain stage.
- Finish rebasing / merging my PRs.
- Continue the mempool work.
The PR #721 fixes the consensus graph with a proper update of the ledger, back-pressure and so on. However the build was not passing because the end to end tests were too slow. I tried to adjust different parameters like the batch size for the fetch block request but that didn't help. Eventually Roland noticed that the trace buffer was always serializing data even when it was not supposed to be enabled. That fix made the tests pass and the PR is now green.
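The kind of guard that avoids this cost can be sketched as below. It is not the actual trace buffer code: `TraceBuffer` and `push_with` are hypothetical names, and the point is only that serialization is passed as a closure so that it never runs when tracing is disabled.

```rust
/// Hypothetical trace buffer storing already-serialized entries.
pub struct TraceBuffer {
    enabled: bool,
    entries: Vec<Vec<u8>>,
}

impl TraceBuffer {
    pub fn new(enabled: bool) -> Self {
        TraceBuffer { enabled, entries: Vec::new() }
    }

    /// Records an entry. The `serialize` closure is only called when the
    /// buffer is enabled, so a disabled buffer costs a single branch.
    pub fn push_with<F>(&mut self, serialize: F)
    where
        F: FnOnce() -> Vec<u8>,
    {
        if !self.enabled {
            return;
        }
        self.entries.push(serialize());
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn is_empty(&self) -> bool {
        self.entries.is_empty()
    }
}
```

Taking a closure instead of the serialized bytes moves the enabled check before the expensive work, which is the ordering that matters for the regression described above.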
We are now left with:
- Addressing the CodeRabbit comments
- Possibly going back to the fixes I did in #729 for the chainsync responder because I don't think that the one in #721 is robust enough
- Re-enabling the simulation tests on that PR.
Then, when it is merged, I will add on top:
- The extension of the simulation tests with the txsubmission property
- The updated animation of traces
At some stage we also need to revisit the simulation support to control finely how time is passing. Unfortunately this might require some heavy duty rework of how we currently handle synchronous effects, like the store effects.