
log :: 2026‐04

Matthias Benkort edited this page May 12, 2026 · 14 revisions

Tip

KEYWORDS

Relay Readiness, Mempool & TxSubmission, Snapshots & Bootstrapping, Simulation Time, Trace/OTEL Tooling

SUMMARY

These notes coordinate “clean relay” delivery: removing sync effects, using DB snapshots for consistent reads, re-enabling deterministic simulations (including time control), and fixing e2e performance regressions.

They advance transaction handling end-to-end by turning the mempool into a stage, wiring it to submit API + adopt-chain + txsubmission, and starting validation/invalidation (TTL, catch-up cost controls).

Snapshot production is being industrialized (db-analyser/mithril/bootstrap-hd, diff tooling, GH automation) to keep CI aligned with recent epochs and reduce timeouts.

Observability continues to mature (trace_span! migration, OTEL UI/dashboard, CBOR inspection tools) and is used to diagnose performance and protocol behavior.

Tactical planning highlights parallel workstreams (ledger rules/epoch transitions, UPLC VM optimization/fuzzing, peer selection/networking, Antithesis readiness, and partner/acceptance processes).

2026-04-31

Tactical team planning & Weekly update (@Dam-CZ)

  • Eric: Big PR review with refactoring sync effects & database snapshots

    • Mempool logic PR
    • Mempool & tx submission protocol, working on simplifying the interaction
    • Observability & Public API
    • Simulation tests improvements (with Roland) - final step for end to end test

  • Josh: Open PRs for every static check

    • Balancing correctness & speed
    • Pending review with Matthias
    • Pairing time with Matthias on Friday (epoch boundary & context; + changes for more mechanical tests)

  • Jonathan: working on script context & UPLC VM

    • Optimisation fuzzer & optimisation; benchmarks for UPLC VM (50% reduction of speed) from Pi improvements and shortcuts
    • Haskell comparison of UPLC and implementation differences
    • Full review of the massive PR of optimisation (including Pi modifications)
    • Still need a pass on the rebase/merge of the optimisation PR

  • Eric: Configuration what do we need to "show" and tweaking the environment variables

    • Option 1: make a build with the basic configuration
    • Option 2: make available the configuration + provide a user guide and the implications of the tweaks available
    • Decision to be made on providing a basic setup for people to be aware of configuration and add little by little customisation parameters

What did you work on this week?

Making the acceptance process EDR worthy; Inviting all the monthly meetings for each contract; Contracts review and finalisation for Antithesis; First mapping of potential partners

What outcome/key result did it support?

Synchronizing for the release of the "clean" Amaru relay & making everything ready to start once we have it

What's immediately next?

  • Josh & KtorZ: pairing on next steps for ledger
  • Signing the Antithesis contract
  • Using the acceptance process and revising it based on first tests

2026-04-24

Tactical team planning & Weekly update (@Dam-CZ)

Amaru tactical planning alignment

  • Eric: Transaction validation & Mempool

    • Huge PR removing sync effects (with access to the store): navigating the chain store with snapshots and having mutations
    • Sync effects prevent us from finishing the simulation (currently turned off): protocols run all the time and the passage of time in simulation is a problem, so we need to remove the sync effects (executed by pure stages)

  • Eric: Investigation of the PR linked to the performance (bootstrapping, regression) end to end snapshots epoch 185 (no good run until March)

Two outcomes possible: lower target epoch (in a PR) or increase the timeout (on main)

  • Josh: ledger rules implementation, ppview hash; native script execution; outside of forecast for plutus scripts (3 separate PR)

  • KtorZ: working on something at an interface and need to synchronize (potential refactoring)

  • KtorZ: working on epoch transition and the boundaries (2000 blocks earlier than we want); the problem is artifacts of the epoch transition that have to be stored in memory and not on disk (e.g. rewards, because they are not stable enough at that point of the calculation)

Ledger state lives on disk and in an "in between" environment

  • KtorZ: mapping out the current state logic and the succession of steps to get it right

  • KtorZ: working on integrating, in the most pragmatic way, the rewards calculation related to the epoch transition

  • Julien: simplifying the creation of bootstrap snapshots to have a simple command to choose the epoch and they'll

  • Julien: working on removing the protocol v9

  • Julien: Reviving the Pi project

  • Jonathan: updating and polishing the fuzzer; ledger rules comparison between Haskell & Amaru with the UPLC turbo as well

  • KtorZ: Preparing the UPLC turbo re-re-release on crates.io; 1 was released by Lucas; 1 was released by Julien; now the sources are cleaned up and everything is renamed to amaru-uplc; managed by Amaru committers; only the core maintainers team can publish releases

  • Roland: adding stages for tracking the peers to disconnect when they do something wrong; peer selection stage;

  • Roland: adding peer candidates from the ledger (rules, addresses, IP) and make a choice

What did you work on this week?

Finalising the Antithesis contract; meeting with potential users/contributors of the relay node; preparing the node diversity workshop in Porto

What outcome/key result did it support?

Synchronizing for the release of the "clean" Amaru relay & making everything ready to start once we have it

What's immediately next?

  • Josh & KtorZ: work on ledger rules not impacted by the refactoring
  • Roland: looking for draft PR feedback about networking and peer selection
  • Signing the Antithesis contract
  • Drafting use case focused timelines and agreements

2026-04-17

Weekly Update (@jeluard)

What did you work on this week?

Snapshots

db-analyser apparently is currently broken in lsm mode and fails after some time with mainnet data.

The approach I am considering for automatic snapshot creation is the following:

  • create an initial amaru snapshot from provided script, using mithril + db-analyser InMem + make bootstrap-hd
  • upload this initial snapshot automatically
  • then an automated process can fetch the latest amaru snapshot, fast-sync using amaru-ledger sync, upload new snapshot

This should run smoothly on GitHub Actions.

Once db-analyser is better supported with lsm we can consider other approaches.

What outcome/key result did it support?

  • Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending

Extra

  • revived the amaru-pi project so that it can be used as a local relay broadcasting a dashboard on an external monitor

    • migrated to latest pi image infrastructure: now can create image programmatically on local machine
    • made sure it supports both local screens and external display
    • introduced ssh over usb (including internet sharing) so that it can simply be plugged into a laptop
  • made a push towards testing Halo2/KZG on ESP32-S3; no luck so far

What's immediately next?

  • implement GH flow pushing amaru snapshots automatically
  • make sure bootstrap is amended so that it can leverage those

2026-04-24

Weekly Update (@etorreborre)

What did you work on this week?

Mempool as a stage + validation

The PR using the mempool as a stage is now merged. The stage receives messages from the local submit API, from the adopt_chain stage when a block is adopted, and from the tx submission protocol. I have now started a PR to validate / invalidate the mempool transactions. It does the simplest thing (just revalidating txs when the tip changes) but I would like to include the TTL check as well.

Also I need to find a way to disable the tx submission protocol in catch-up mode. Otherwise each tip adoption will be too expensive.
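As a rough illustration of "revalidate when the tip changes, plus a TTL check", here is a minimal sketch. Everything in it is hypothetical and not Amaru's actual API: `PendingTx`, `revalidate`, and the `is_valid` closure (a stand-in for ledger validation) are made up for the example, and treating the TTL as an inclusive upper bound is an assumption.

```rust
/// Hypothetical pending transaction; `ttl` is the last slot at which it is
/// still considered valid (inclusive bound is an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
struct PendingTx {
    id: u64,
    ttl: u64,
}

/// Re-checks every pending transaction against the new tip slot.
/// `is_valid` stands in for ledger validation (assumption).
/// Returns the ids of the transactions that were dropped.
fn revalidate(
    pending: &mut Vec<PendingTx>,
    tip_slot: u64,
    is_valid: impl Fn(&PendingTx) -> bool,
) -> Vec<u64> {
    let mut dropped = Vec::new();
    pending.retain(|tx| {
        // TTL check first: it is cheap and needs no ledger access.
        let keep = tx.ttl >= tip_slot && is_valid(tx);
        if !keep {
            dropped.push(tx.id);
        }
        keep
    });
    dropped
}
```

Running the TTL comparison before the validation closure means expired transactions never reach the more expensive check, which is one plausible way to keep the per-tip cost down.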

Remove external effects and use snapshot to read consistent chain data

This is necessary to:

  • Put the simulation tests back on, because external effects prevented us from driving the simulation strictly via the passing of time.
  • Remove a whole class of bugs where we read inconsistent data.

I still have to tackle a few comments on that PR and we left some TODOs that will make things even better:

  • Better deal with the case of the origin point that doesn't have a corresponding header in the store. That forces us to special case many calls.
  • Store tips instead of hashes for the anchor and best chain. This will help with the next point.
  • Improve the test support for functions using pure-stage.

Trying to find out why we have some regression on the performance for e2e tests

It seems that there are 2 main causes:

  • Accessing the best tip by loading the corresponding header from the hash then getting its tip in catch-up mode.
  • The Phase 2 validation makes catching up slower.

We decided to:

  • Implement the storing of tips in the DB.
  • Test with fewer epochs on PRs for e2e tests.
  • Keep the same epochs for e2e tests on main with a longer timeout.

What outcome/key result did it support?

Amaru as a production relay.

What's immediately next?

  • Finish addressing @rkuhn's comments on the effects + store PR
  • Finish the mempool implementation

2026-04-23

Mempool integration

The PR integrating the mempool as a stage is now merged:

  • The mempool stage serializes write accesses to the mempool, coming from
    • The local submit API.
    • The tx submission protocol.
    • The ledger when there are roll forwards and rollbacks.
  • Work has started to add validation / invalidation using the ledger.
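The serialization of writes can be pictured with a small sketch: one owner of the mempool state handles one message at a time, whatever the source. The names below (`MempoolMsg`, `Mempool`, `handle`) are hypothetical and not the actual Amaru types, and what a rollback does to the mempool (re-adding transactions) is an assumption made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical transaction id; the real type is richer.
type TxId = u64;

/// One variant per source of writes named in the list above.
#[derive(Debug)]
enum MempoolMsg {
    /// From the local submit API.
    SubmitLocal { id: TxId, body: Vec<u8> },
    /// From the tx submission protocol (a peer sent a transaction).
    FromPeer { id: TxId, body: Vec<u8> },
    /// From the ledger on roll forward: these txs are now in a block.
    RollForward { included: Vec<TxId> },
    /// From the ledger on rollback: these txs are pending again (assumption).
    Rollback { restored: Vec<(TxId, Vec<u8>)> },
}

#[derive(Default)]
struct Mempool {
    txs: BTreeMap<TxId, Vec<u8>>,
}

impl Mempool {
    /// Handles exactly one message. The stage calls this sequentially,
    /// which is what serializes the writes.
    fn handle(&mut self, msg: MempoolMsg) {
        match msg {
            MempoolMsg::SubmitLocal { id, body } | MempoolMsg::FromPeer { id, body } => {
                // Keep the first copy if the same tx arrives twice.
                self.txs.entry(id).or_insert(body);
            }
            MempoolMsg::RollForward { included } => {
                for id in included {
                    self.txs.remove(&id);
                }
            }
            MempoolMsg::Rollback { restored } => {
                for (id, body) in restored {
                    self.txs.entry(id).or_insert(body);
                }
            }
        }
    }
}
```

Because `handle` takes `&mut self`, no two sources can mutate the mempool at the same time; ordering between sources is simply the order in which their messages reach the stage.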

Removing sync effects and using database snapshots

We currently deactivated the simulation tests because determining when a simulation must stop is non trivial. It will be possible to properly do it once we drive it purely with the passing of time (which also allows us to test timeouts). However, in order to do this we need to remove the "sync effects" support from pure-stage and make sure that every effect, store accesses in particular, can be suspended. This is also the opportunity to make sure that when we navigate the database to query data, we use a snapshot of the database so that we don't work on data that's being mutated by another process as we traverse it (that would be a major source of bugs).

This PR is quite large and touches a lot of "business" logic. It is now ready for review. I would be more confident if we had simulation tests to test it, but that's the very PR that will help us have the tests back!
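To make the "navigate on a snapshot" idea concrete, here is a toy sketch using only the standard library. It is not how the chain store is implemented: `ChainStore`, `Snapshot`, `Header` and `ancestors` are hypothetical names, and the copy-on-write `Arc` stands in for whatever snapshot facility the real database offers.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

type Hash = u32;

/// Hypothetical header: only a link to its parent.
#[derive(Clone, Debug)]
struct Header {
    parent: Option<Hash>,
}

/// Toy store: readers get an immutable snapshot, writers copy-on-write.
#[derive(Default)]
struct ChainStore {
    headers: Arc<BTreeMap<Hash, Header>>,
}

/// A frozen view of the store; later writes do not affect it.
struct Snapshot(Arc<BTreeMap<Hash, Header>>);

impl ChainStore {
    fn insert(&mut self, hash: Hash, header: Header) {
        // Clones the map only if a snapshot still references it.
        Arc::make_mut(&mut self.headers).insert(hash, header);
    }

    fn remove(&mut self, hash: Hash) {
        Arc::make_mut(&mut self.headers).remove(&hash);
    }

    fn snapshot(&self) -> Snapshot {
        Snapshot(Arc::clone(&self.headers))
    }
}

impl Snapshot {
    /// Walks from `tip` back to the oldest ancestor present in this snapshot.
    fn ancestors(&self, tip: Hash) -> Vec<Hash> {
        let mut out = Vec::new();
        let mut cursor = Some(tip);
        while let Some(hash) = cursor {
            match self.0.get(&hash) {
                Some(header) => {
                    out.push(hash);
                    cursor = header.parent;
                }
                None => break,
            }
        }
        out
    }
}
```

A traversal started on a snapshot keeps seeing the chain as it was when the snapshot was taken, even if a header in the middle is removed concurrently; the same traversal on the live store would stop early.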

End to end tests performance

We have regressed since March. The preprod test cannot read epoch 185 in less than 15 minutes. It seems that there are 2 main causes:

  1. Some unnecessary accesses to the database during catch-up, in particular to retrieve the best tip. We are using the best hash, then the header for that hash, then we get a point. Getting a point directly would be faster.
  2. The phase 2 validation has slowed down the processing.

During our meeting today we decided to:

  1. Fix the database access.
  2. Target an earlier epoch when testing PRs.
  3. Use a longer timeout when testing main.
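The database access issue in cause 1 can be sketched as follows. This is a toy model with hypothetical names (`Store`, `best_tip_via_hash`, `best_tip_direct`); the read counter only exists to make the difference visible and says nothing about the real cost of each read.

```rust
use std::cell::Cell;
use std::collections::HashMap;

type Hash = [u8; 4];

/// A point on the chain: slot plus header hash.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Point {
    slot: u64,
    hash: Hash,
}

/// Toy store that counts reads.
struct Store {
    /// hash -> slot, a stand-in for loading and decoding a full header.
    headers: HashMap<Hash, u64>,
    /// Layout described above: only the best hash is stored.
    best_hash: Hash,
    /// Alternative layout: the best tip is stored directly.
    best_tip: Point,
    reads: Cell<u32>,
}

impl Store {
    /// Best hash, then the header for that hash, then the point.
    fn best_tip_via_hash(&self) -> Option<Point> {
        self.reads.set(self.reads.get() + 1); // read the best hash
        let hash = self.best_hash;
        self.reads.set(self.reads.get() + 1); // load the header
        let slot = *self.headers.get(&hash)?;
        Some(Point { slot, hash })
    }

    /// Single read of the stored tip.
    fn best_tip_direct(&self) -> Point {
        self.reads.set(self.reads.get() + 1);
        self.best_tip
    }
}
```

In catch-up mode this lookup happens for every adopted tip, so halving the number of reads (and skipping header decoding) adds up quickly.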

2026-04-17

Weekly Update (@jeluard)

What did you work on this week?

Snapshots

Generating snapshots has been a bit more involved than expected, especially for mainnet. Memory requirements (even stronger with db-analyser) pushed me to investigate more recent versions of cardano-node to experiment with less memory-intensive backends (lmdb, then lsm). This also means having proper support to parse on-disk utxos. Note that the new snapshots for preprod/preview allow bootstrapping amaru dbs identical to those produced by the current method.

What outcome/key result did it support?

  • Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending

Extra

Published a new release for uplc; now used in amaru.

Upgraded cardano-zkvms to latest OpenVM beta 2.0.

Shared an awesome repo for amaru: awesome-amaru

What's immediately next?

  • push snapshots creation

2026-04-10

Weekly Update (@jeluard)

What did you work on this week?

Observability

Merged trace_span! migration PR. Worked on improving some more trace support.

Created a simple dashboard for amaru (https://github.com/jeluard/amaru-dashboard). It relies on OTEL traces and an OTEL websocket bridge. The goal is to give the Pi experiment proper monitor support.

Snapshots

Created a web app that makes it easy to analyze and compare the structure of CBOR files, which in turn makes CBOR dumps from cardano-node easier to inspect.

https://jeluard.github.io/cbor-structure/

What outcome/key result did it support?

  • Refine and document 100% of Amaru's traces
  • Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
  • Provide a dashboard for Amaru PI

What's immediately next?

  • Continue work on snapshot generation

Weekly Update (@etorreborre)

What did you work on this week?

Property test the select chain stage

I expanded the PR testing the select chain stage:

  • The data generation is more realistic. Actions are generated in response to previous outputs, the same way things would work in production. For example we would get a FetchBlock message after having processed a NewTip message.
  • The data generation also models restarts since the select_chain stage initial state depends on the state of the chain store.
  • The property that is checked is stronger, akin to "we must always emit downstream what we currently know as the best tip".
  • In the light of that property I fixed the code which sometimes was regressing too far in the past or not returning an alternate fork.
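The shape of that property can be illustrated with a toy model and an oracle. This is only a sketch under strong simplifications: tips are `(height, header id)` pairs, "best" means highest pair, and `SelectChain`, `Event`, `oracle` and `holds_for` are hypothetical names unrelated to the real stage or its test generators.

```rust
use std::collections::HashSet;

/// Hypothetical tip: (block height, header id). The highest pair wins.
type Tip = (u64, u32);

/// Inputs the toy stage reacts to.
#[derive(Clone, Copy, Debug)]
enum Event {
    NewTip(Tip),
    /// The block behind this header id turned out to be invalid.
    Invalidated(u32),
}

/// Toy stage: remembers valid candidate tips and emits the best one.
#[derive(Default)]
struct SelectChain {
    candidates: Vec<Tip>,
    invalid: HashSet<u32>,
}

impl SelectChain {
    fn step(&mut self, event: Event) -> Option<Tip> {
        match event {
            Event::NewTip(tip) => {
                if !self.invalid.contains(&tip.1) {
                    self.candidates.push(tip);
                }
            }
            Event::Invalidated(id) => {
                self.invalid.insert(id);
                self.candidates.retain(|tip| tip.1 != id);
            }
        }
        self.candidates.iter().copied().max()
    }
}

/// Oracle: best tip recomputed from scratch from the history seen so far.
fn oracle(history: &[Event]) -> Option<Tip> {
    let invalid: HashSet<u32> = history
        .iter()
        .filter_map(|e| match e {
            Event::Invalidated(id) => Some(*id),
            _ => None,
        })
        .collect();
    history
        .iter()
        .filter_map(|e| match e {
            Event::NewTip(tip) if !invalid.contains(&tip.1) => Some(*tip),
            _ => None,
        })
        .max()
}

/// The property: after every event, the stage emits what the oracle
/// considers the best known tip.
fn holds_for(events: &[Event]) -> bool {
    let mut stage = SelectChain::default();
    events
        .iter()
        .enumerate()
        .all(|(i, e)| stage.step(*e) == oracle(&events[..=i]))
}
```

A property-testing library would feed `holds_for` with generated event sequences; the interesting cases are the ones where an invalidation forces a fall back to an alternate fork rather than to an older point on the same chain.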

This has not yet been reviewed by @rkuhn; he still needs to validate that I'm not off with the property as it stands in that PR.

Mempool as a stage

The Mempool was previously modeled as a Resource only, directly accessed by the submit api or the txsubmission stages. It is now a proper stage with this PR and it currently receives messages:

  • From the submit API to insert new local transactions.
  • From the txsubmission stages to insert transactions from blocks or wait for new transactions.

Remove sync effects and make reads on the chain store more consistent

This is unfortunately a large PR which does the following:

  1. Removes the notion of an "external sync effect". Now all effects are async (which allows them to be interrupted).
  2. Makes async calls to store and ledger effects.
  3. Moves some store operations to the chain store directly, like switching to a new fork.
  4. Makes sure that read operations that are navigating the chain store, for example to get points along the best chain, operate on a snapshot of the store.

There are still a few things to tweak and test in that PR so it is still a draft at the moment.

What outcome/key result did it support?

Make the node ready as a production relay.

What's immediately next?

Next week is off for me but when I come back, I will work on:

  • Finishing the removal of sync effects.
  • Simulating time in simulation tests to make them terminate effectively: https://github.com/pragma-org/amaru/issues/737.
  • Finishing the implementation of the mempool to validate transactions from an apply block or taken to make a new block.

2026-04-03

Weekly Update (@jeluard)

What did you work on this week?

Snapshots

Worked on re-creating new snapshots with db-analyser, run from Docker images. Implemented a new bootstrap-hd command that bootstraps from a cardano-hd ledger state.

To check that snapshots produce the expected result, we can now use rocksdb-diff to verify that the resulting DBs are equivalent.

cargo install rocksdb-diff
rocksdb-diff ledger.preprod.db.old/164 ledger.preprod.db/164 --prefix acct,comm,dlg,drep,pool,pots,prop,slot,utxo,vote

What outcome/key result did it support?

  • Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending

What's immediately next?

  • Continue this work; ideally we would host all amaru meaningful snapshots

Weekly Update (@etorreborre)

What did you work on this week?

Simulation, tests and new consensus graph

This week I updated the animation for the new consensus graph traces and helped to have it merged. There was a performance regression that made the end to end test fail because the trace buffer was serializing data even while being deactivated. This means that we will have to find a way to run this serialization asynchronously when we want to enable traces in production.

I also fixed the github workflow that had some env. variable bugs when running workflows manually.

Now, while rebasing the PR that was adding a property test for the new select_chain stage (#726), I noticed 2 potential issues:

  • When a block is invalidated we fall back to the header at the best chain, but there might be headers with valid blocks after the best chain tip.
  • Similarly there could be a fork with valid blocks that could become the next best chain.

I'm still trying to represent these cases in the data generation and oracle + trying to find a good fix for these cases. In itself, the issue is not catastrophic because we never send incorrect information downstream. However we break the contract for that stage which should return the best known tip all the time (and that could delay the transmission of that information downstream).

Mempool

There are now 3 issues for the development of the mempool (#733, #734, #735):

  1. Make mempool a proper stage.
  2. Implement the validation / invalidation of transactions (including TTL).
  3. Add observability

I have already started on 1., but I haven't yet finished testing the chain selection, so I need to do that first.

What outcome/key result did it support?

This all supports the possibility to run amaru as a relay node with a fixed set of peers.

What's immediately next?

  • Finish the property test for the select_chain stage.
  • Finish rebasing / merging my PRs.
  • Continue the mempool work.

2026-04-01

New consensus graph

The PR #721 fixes the consensus graph with a proper update of the ledger, back-pressure and so on. However the build was not passing because the end to end tests were too slow. I tried to adjust different parameters like the batch size for the fetch block request but that didn't help. Eventually Roland noticed that the trace buffer was always serializing data even when it was not supposed to be enabled. That fix made the tests pass and the PR is now green.
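The kind of fix involved can be sketched like this: check the enabled flag before doing any serialization work. The `TraceBuffer` below is a hypothetical stand-in, not the actual trace buffer, and passing the serialization as a closure is just one way to make the work lazy.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical trace buffer, disabled by default.
#[derive(Default)]
struct TraceBuffer {
    enabled: AtomicBool,
    entries: Mutex<Vec<String>>,
}

impl TraceBuffer {
    /// Takes a closure so the (potentially expensive) serialization only
    /// runs when the buffer is enabled.
    fn record(&self, serialize: impl FnOnce() -> String) {
        if !self.enabled.load(Ordering::Relaxed) {
            return; // disabled: skip serialization entirely
        }
        let entry = serialize();
        self.entries.lock().unwrap().push(entry);
    }
}
```

With this shape, a disabled buffer costs one atomic load per call; the closure that builds the serialized form is never invoked.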

We are now left with:

  • Addressing the CodeRabbit comments
  • Possibly going back to the fixes I did in #729 for the chainsync responder because I don't think that the one in #721 is robust enough
  • Re-enabling the simulation tests on that PR.

Then, when it is merged, I will add on top:

  • The extension of the simulation tests with the txsubmission property
  • The updated animation of traces

At some stage we also need to revisit the simulation support to control finely how time is passing. Unfortunately this might require some heavy duty rework of how we currently handle synchronous effects, like the store effects.
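To give an idea of what "controlling how time is passing" could look like, here is a minimal virtual clock sketch. It is not the pure-stage design: `SimClock`, `schedule`, `advance` and `run_until` are hypothetical, and effects are reduced to plain ids that get "resumed" when their due time is reached.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Virtual clock: time only moves when the simulation says so.
#[derive(Default)]
struct SimClock {
    now_ms: u64,
    /// Pending wakeups as (due time, effect id), earliest first.
    pending: BinaryHeap<Reverse<(u64, u32)>>,
}

impl SimClock {
    /// Schedules a (hypothetical) suspended effect to resume after `delay_ms`.
    fn schedule(&mut self, delay_ms: u64, effect: u32) {
        self.pending.push(Reverse((self.now_ms + delay_ms, effect)));
    }

    /// Jumps to the next wakeup and returns the effect to resume, if any.
    fn advance(&mut self) -> Option<u32> {
        let Reverse((due, effect)) = self.pending.pop()?;
        self.now_ms = due;
        Some(effect)
    }

    /// Runs until nothing is pending or the next wakeup is past `deadline_ms`.
    /// Returns the effects in the order they were resumed.
    fn run_until(&mut self, deadline_ms: u64) -> Vec<u32> {
        let mut resumed = Vec::new();
        while let Some(Reverse((due, _))) = self.pending.peek() {
            if *due > deadline_ms {
                break;
            }
            if let Some(effect) = self.advance() {
                resumed.push(effect);
            }
        }
        resumed
    }
}
```

The termination condition becomes explicit: the simulation is over when the queue is empty or the deadline is reached. This only works if every effect can be suspended and rescheduled, which is why synchronous effects such as store accesses get in the way.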
