
log :: 2026‐03

Matthias Benkort edited this page May 12, 2026 · 11 revisions

Tip

KEYWORDS

Consensus Graph, Simulation Properties, ChainSync Responder, TxSubmission, Observability Tooling

SUMMARY

These notes describe stabilizing the new consensus/stage graph (ledger updates, back-pressure, trace-buffer perf fix) and a set of follow-up tasks: responder robustness, re-enabling simulations, and expanding properties.

Simulation work is becoming end-to-end (full nodes + miniprotocols): defining reliable termination conditions, adding property tests for select_chain, and extending checks to txsubmission and later ledger validation.

Several protocol correctness fixes land from simulation findings: chainsync responder snapshotting to avoid races, early/buffered miniprotocol registration to prevent drops, and txsubmission behavior improvements (stream txs as available, responder readiness).

Observability matures into typed trace schemas with CI regression checks, improved OTEL tooling/UI (playback, multi-instance correlation), and new data/DB semantic traces—supporting relay demos and debugging flaky CI.

2026-04-01

New consensus graph

PR #721 fixes the consensus graph with a proper update of the ledger, back-pressure and so on. However, the build was not passing because the end-to-end tests were too slow. I tried adjusting different parameters, such as the batch size for the fetch block request, but that didn't help. Eventually Roland noticed that the trace buffer was always serializing data, even when it was not supposed to be enabled. Fixing that made the tests pass and the PR is now green.
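To illustrate the kind of guard involved in the trace buffer fix, here is a minimal sketch. The `TraceBuffer` type and its `push_with` method are hypothetical names chosen for this example, not the actual Amaru code; the only point is that serialization is deferred behind a closure so it is skipped when the buffer is disabled.

```rust
/// Hypothetical trace buffer; names and fields are illustrative only.
pub struct TraceBuffer {
    enabled: bool,
    entries: Vec<String>,
}

impl TraceBuffer {
    pub fn new(enabled: bool) -> Self {
        Self { enabled, entries: Vec::new() }
    }

    /// Takes a closure instead of an already-serialized value, so that the
    /// (potentially expensive) serialization only runs when the buffer is enabled.
    pub fn push_with(&mut self, serialize: impl FnOnce() -> String) {
        if self.enabled {
            self.entries.push(serialize());
        }
    }

    /// Number of entries recorded so far.
    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

With a disabled buffer, `push_with` returns without ever calling the closure, which is the behaviour the end-to-end tests needed.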

We are now left with:

  • Addressing the CodeRabbit comments
  • Possibly going back to the fixes I did in #729 for the chainsync responder because I don't think that the one in #721 is robust enough
  • Re-enabling the simulation tests on that PR.

Then, when it is merged, I will add on top:

  • The extension of the simulation tests with the txsubmission property
  • The updated animation of traces

At some stage we also need to revisit the simulation support to finely control how time passes. Unfortunately this might require some heavy-duty rework of how we currently handle synchronous effects, like the store effects.

2026-03-27

Weekly Update (@jeluard)

What did you work on this week?

Observability

Finalized trace PR; now waiting for feedback before merging. Some experimentation with amarost.

Snapshots

Work on tooling/scripts to automate and simplify snapshot creation.

What outcome/key result did it support?

  • Refine and document 100% of Amaru's traces
  • Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending

What's immediately next?

  • work on snapshot creation
  • merge trace PR

Weekly Update (@etorreborre)

What did you work on this week?

Simulation and tests

This week I investigated issues found by or related to the simulation:

  • I still had to tweak some code to try to find the best way to tell if a simulation is done or not.
  • I fixed some issues with the sending of headers in the chainsync responder: #729.
    • We need to snapshot the best chain state before returning headers downstream.
  • On top of that PR, I added some setup and checks for the tx submission protocol and found another issue: #728.
    • We should return transactions as soon as some of them are available and not wait until we fill the required number of transactions.
    • There was also a possible concurrency issue where the txsubmission initiator starts sending transactions while the responder is not yet registered.
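The first txsubmission point above (return transactions as soon as some are available) can be sketched as follows. This is a simplified model under stated assumptions: `TxId` and `take_available` are hypothetical names, and the real protocol exchanges richer messages than a queue of identifiers.

```rust
use std::collections::VecDeque;

/// Hypothetical transaction identifier, for illustration only.
pub type TxId = u64;

/// Returns up to `max` pending transactions without waiting for `max` of them
/// to be available. `None` means there is nothing to send yet (no pending
/// transaction, or `max` is zero).
pub fn take_available(pending: &mut VecDeque<TxId>, max: usize) -> Option<Vec<TxId>> {
    if pending.is_empty() || max == 0 {
        return None;
    }
    // Send what we have, capped by what was requested.
    let n = max.min(pending.len());
    Some(pending.drain(..n).collect())
}
```

For example, with three pending transactions and a request for ten, all three are returned immediately rather than blocking until ten are available.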

I also upgraded the animation: #731 to display the connection + miniprotocol stages and added many features.

What outcome/key result did it support?

This supports making sure that the node is robust enough to act as a relay in production.

What's immediately next?

  • I have tested my latest fixes on top of @rkuhn's PR for finishing the new consensus graph and I made the simulation tests (#721) pass after additional fixes (on the rk/fetch-blocks-simulation-tests-fix branch). When Roland is back next week we need to finish all this work and push it to main.

  • Then more simulation testing should be done + I should start having a look at the mempool.

2026-03-25

Chainsync responder fixes

The responder gets "NewTip" messages and must deduce what messages need to be sent to a downstream peer based on the current position of that peer compared to the current tip. The previous code was getting that information from the best chain stored in the ChainStore but that chain can be mutated while the information is retrieved which leads to incorrect headers being sent downstream.

The fix in PR #729 consists in returning headers from the database directly if they are contained in the immutable part of the best chain, and otherwise snapshotting a fragment of the current chain to feed the downstream peer. A similar treatment has been applied to the search for an intersection point between 2 peers when the protocol is initialized.
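A minimal sketch of that approach, assuming a much simplified store: `Header`, `ChainStore`, `headers_after` and `replace_volatile` are hypothetical names, headers are identified by slot only, and the "database" is a plain vector. The point is that the volatile fragment is copied once, so a later mutation of the best chain cannot affect the headers already selected for the downstream peer.

```rust
use std::sync::RwLock;

/// Hypothetical header: a slot number standing in for a real block header.
#[derive(Clone, Debug, PartialEq)]
pub struct Header {
    pub slot: u64,
}

/// Hypothetical store: `immutable` models headers persisted in the database,
/// `volatile` models the mutable end of the best chain.
pub struct ChainStore {
    immutable: Vec<Header>,
    volatile: RwLock<Vec<Header>>,
}

impl ChainStore {
    pub fn new(immutable: Vec<Header>, volatile: Vec<Header>) -> Self {
        Self { immutable, volatile: RwLock::new(volatile) }
    }

    /// Headers strictly after `slot`, to be sent to a downstream peer.
    pub fn headers_after(&self, slot: u64) -> Vec<Header> {
        let in_immutable = self.immutable.last().map_or(false, |h| slot < h.slot);
        if in_immutable {
            // Immutable headers can no longer change: read them directly.
            self.immutable.iter().filter(|h| h.slot > slot).cloned().collect()
        } else {
            // Copy the fragment while holding the lock, then work on the copy only.
            let snapshot: Vec<Header> = self.volatile.read().unwrap().clone();
            snapshot.into_iter().filter(|h| h.slot > slot).collect()
        }
    }

    /// Models a concurrent chain switch replacing the volatile fragment.
    pub fn replace_volatile(&self, headers: Vec<Header>) {
        *self.volatile.write().unwrap() = headers;
    }
}
```

A result obtained before `replace_volatile` is called stays internally consistent, while the next call to `headers_after` sees the new fragment.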

Miniprotocol registration fix

Running the simulation revealed one issue in the implementation of the miniprotocols. The initiator of the txsubmission protocol was sending a message at a time when the responder was not yet registered for that protocol. The fix consists in registering all the protocols early in buffering mode so that messages cannot be dropped (@rkuhn you will want to review this in case I did not interpret the Manager machinery correctly :-)).
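The idea of early registration in buffering mode can be sketched like this. The `Demux` type below is a hypothetical stand-in and makes no claim about how the Manager is actually structured: every known protocol gets a slot upfront, messages arriving before the handler is attached are queued, and they are replayed in order on activation.

```rust
use std::collections::HashMap;

pub type ProtocolId = u16;
pub type Message = Vec<u8>;

/// A protocol slot either buffers messages or delivers them to a handler.
enum Slot {
    Buffering(Vec<Message>),
    Active(Box<dyn FnMut(Message)>),
}

/// Hypothetical demultiplexer, for illustration only.
pub struct Demux {
    slots: HashMap<ProtocolId, Slot>,
}

impl Demux {
    /// Registers every known protocol upfront, in buffering mode.
    pub fn new(protocols: &[ProtocolId]) -> Self {
        let slots = protocols.iter().map(|p| (*p, Slot::Buffering(Vec::new()))).collect();
        Self { slots }
    }

    /// Returns `false` only for protocols that were never registered.
    pub fn receive(&mut self, protocol: ProtocolId, message: Message) -> bool {
        match self.slots.get_mut(&protocol) {
            Some(Slot::Buffering(queue)) => { queue.push(message); true }
            Some(Slot::Active(handler)) => { handler(message); true }
            None => false,
        }
    }

    /// Attaches the handler and replays buffered messages in arrival order.
    pub fn activate(&mut self, protocol: ProtocolId, mut handler: Box<dyn FnMut(Message)>) {
        if let Some(Slot::Buffering(queue)) = self.slots.remove(&protocol) {
            for message in queue {
                handler(message);
            }
        }
        self.slots.insert(protocol, Slot::Active(handler));
    }
}
```

In this model a message sent by an eager initiator is delivered once the responder is ready, instead of being dropped.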

2026-03-20

Weekly Update (@etorreborre)

What did you work on this week?

Simulation

Simulation termination

I worked again on finding the best way to stop a simulation before checking properties. I ended up using the generated trace to determine whether the only effects still being executed are keepalive and chainsync next effects.
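As a rough sketch of that termination check: the `Effect` enum and the `window` parameter below are assumptions made for the example (the actual trace format and stopping rule are not described here). The simulation is treated as quiescent once the most recent traced effects are all keepalive or chainsync next effects.

```rust
/// Hypothetical effect kinds as they could appear in a simulation trace.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Effect {
    KeepAlive,
    ChainSyncNext,
    FetchBlock,
    StoreHeader,
}

/// An effect is "idle" when it only keeps a connection alive or asks for the next header.
fn is_idle(effect: Effect) -> bool {
    matches!(effect, Effect::KeepAlive | Effect::ChainSyncNext)
}

/// The simulation is considered quiescent when the last `window` traced
/// effects are all idle. `window` is an assumption of this sketch.
pub fn is_quiescent(trace: &[Effect], window: usize) -> bool {
    window > 0
        && trace.len() >= window
        && trace[trace.len() - window..].iter().all(|e| is_idle(*e))
}
```

Choosing `window` is a trade-off: too small and the simulation may stop while real work is still pending, too large and every run pays for extra idle steps.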

Found bugs in the chainsync responder

Re-running the simulation triggered some issues with the chainsync responder implementation. In order to send headers to the initiator we query the best chain in the chain store but that best chain might be mutated at the same time. I worked towards some fixes but still need to refine this PR.

Tx submission property

The simulation has been extended with the injection of transactions in downstream nodes and a check that they are effectively transmitted upstream (PR).

New select_chain stage property test

This PR adds a property test for that stage to make sure that interleavings of new tips + validated/invalidated blocks yield the correct results.

Transaction submission API

Added an API to post CBOR-serialized transactions: #727.

What outcome/key result did it support?

This is working towards having a fully functioning relay node.

What's immediately next?

Next week I will refine the in-progress PRs above so that they are ready for review. Then I need to review PRs. After that I will extend the simulation to involve the ledger validations.

2026-03-20

Weekly Update (@jeluard)

What did you work on this week?

Observability

Added more traces for better applicative introspectability. Migrated traces to trace_span!: simpler, more readable, less potential to break things.

What we now have:

  • typed traces using trace_span! and associated schema
  • json schema file
  • CI process ensuring no regression to the schema
  • semantic traces based on OTEL standard
  • web page showing up trace schemas nicely
  • script providing analysis of traces
  • CI process ensuring no regression of traces

Extra

  • experiment with amarost: leverage amaru to build a provable alternative to blockfrost (commitment, authenticated data structure)

What outcome/key result did it support?

  • Refine and document 100% of Amaru's traces

What's immediately next?

  • data love
  • merge trace PR
  • last experiment with custom Trace ID and distributed traces

2026-03-17

Testing

Termination of tests during a simulation

Now running a simulation implies running full nodes, with their mini-protocols, including keep-alive. This makes the determination of when a test is over more difficult (an amaru node should never crash).

PR #724 fixes the termination of tests by draining all the effects that can be executed, except time-dependent ones, once all the test actions have been sent to the node under test. This means that stages sending events after a given amount of time, typically for keep-alive, will stop doing so.
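The draining step can be sketched as below. `Pending` and `drain_immediate` are hypothetical names, and the sketch ignores the fact that running an effect may enqueue new ones (a real loop would repeat until no immediate effect is left).

```rust
use std::collections::VecDeque;

/// Hypothetical pending effect in the simulation queue.
#[derive(Debug, PartialEq)]
pub enum Pending {
    /// Can run right away (e.g. delivering a message).
    Immediate(&'static str),
    /// Only runs once the simulated clock reaches `at_ms` (e.g. a keep-alive timer).
    Timed { at_ms: u64, name: &'static str },
}

/// Runs every immediate effect, leaves timed ones in the queue,
/// and returns the names of the effects that ran.
pub fn drain_immediate(queue: &mut VecDeque<Pending>) -> Vec<&'static str> {
    let mut ran = Vec::new();
    let mut kept = VecDeque::new();
    while let Some(effect) = queue.pop_front() {
        match effect {
            Pending::Immediate(name) => ran.push(name),
            timed @ Pending::Timed { .. } => kept.push_back(timed),
        }
    }
    *queue = kept;
    ran
}
```

After draining, only timed effects remain in the queue, so the test can end without waiting for keep-alive timers to fire.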

I am not convinced that this is the best way to determine the end of a test even if it seems to work for now. Maybe a better way would be to use header hashes as trace ids (cc @jeluard) and have a way to query for the termination of a trace. Then we would know if a test action has been fully "consumed" or not.

Property-testing the new chain selection stage

That stage is now responsible for receiving new tips from peers and block validations. From there it is able to tell if there is a new best tip or not. I have now added a property test that simulates a series of rollforwards sent by peers and marks some of the referenced blocks as invalid (so that tips including them in their chain should not be sent downstream).

Good news: no issue was found by the property test, even when varying chain depth, number of peers, number of tests!
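For readers unfamiliar with this style of test, here is a self-contained sketch of its shape. Everything in it is an assumption: `select_best` is a trivial model standing in for the real stage, chains are plain lists of block ids, and a tiny hand-rolled generator replaces a property-testing library. The checked property is the one described above: a chain containing an invalid block is never selected, and no valid chain is longer than the selected one.

```rust
use std::collections::HashSet;

pub type BlockId = u32;

fn is_valid(chain: &[BlockId], invalid: &HashSet<BlockId>) -> bool {
    chain.iter().all(|b| !invalid.contains(b))
}

/// Model of chain selection: the longest announced chain with no invalid block.
pub fn select_best<'a>(tips: &'a [Vec<BlockId>], invalid: &HashSet<BlockId>) -> Option<&'a Vec<BlockId>> {
    tips.iter().filter(|c| is_valid(c, invalid)).max_by_key(|c| c.len())
}

/// Tiny deterministic generator so that the sketch has no external dependency.
pub struct Lcg(pub u64);

impl Lcg {
    pub fn next(&mut self, bound: u64) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

/// Checks the property on `cases` generated scenarios; returns the failing seed, if any.
pub fn check_property(cases: u64, peers: u64, depth: u64) -> Result<(), u64> {
    assert!(peers > 0 && depth > 0, "peers and depth must be positive");
    for seed in 0..cases {
        let mut rng = Lcg(seed);
        // Each peer announces its own chain of random length between 1 and `depth`.
        let tips: Vec<Vec<BlockId>> = (0..peers)
            .map(|p| {
                let len = rng.next(depth) + 1;
                (0..len).map(|h| (p * depth + h) as BlockId).collect()
            })
            .collect();
        // Mark a few random blocks as invalid.
        let invalid: HashSet<BlockId> =
            (0..peers).map(|_| rng.next(peers * depth) as BlockId).collect();
        let ok = match select_best(&tips, &invalid) {
            Some(best) => {
                is_valid(best, &invalid)
                    && tips.iter().filter(|c| is_valid(c, &invalid)).all(|c| c.len() <= best.len())
            }
            None => tips.iter().all(|c| !is_valid(c, &invalid)),
        };
        if !ok {
            return Err(seed);
        }
    }
    Ok(())
}
```

Returning the failing seed mirrors the practice of logging the seed on failure, so that a failing scenario can be replayed deterministically.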

2026-03-13

Weekly Update (@jeluard)

What did you work on this week?

Observability

Added new traces based on db semantic conventions. This allows us to extract basic statistics related to amaru db access, improving our global understanding of its internal workings.

Extra

Experimented with leveraging CBOR structure for improved data access. The overall idea is to use CDDL schemas to enable direct byte array access and skip irrelevant fields (also removing the associated (de)serialization costs).

Also took this opportunity to benchmark the most prominent key/value stores in the Rust ecosystem. The benchmark mimics an actual amaru pathological path: account rewards update.

See cbor-db.

Improved mithril packaging stability.

What outcome/key result did it support?

  • Refine and document 100% of Amaru's traces

What's immediately next?

  • data love
  • more traces

2026-03-06

Weekly Update (@jeluard)

What did you work on this week?

Observability

Improve otel-ui

  • the bridge now persists traces
  • historical traces can be played back (slow-motion, step-by-step)
  • multiple instances are now supported; a span attribute can be used to correlate traces

PRS

Extra

  • first release. Now cargo +nightly install amaru works and installs the amaru binary
  • dig amarost

What outcome/key result did it support?

  • Refine and document 100% of Amaru's traces

What's immediately next?

  • still some traces love
  • experiment with optimized CBOR persistency

2026-03-02

Simulation test fixes (@etorreborre)

There are several issues related to the simulation:

  • The simulation is flaky on CI: https://github.com/pragma-org/amaru/actions/runs/22588131557/job/65438666562
    • I added more logging to display the seed and/or the node config when the actions stage is failing or the node can't be built for some reason.
    • I fixed some initialization/shrinking issues when there are many peers involved.
    • There is still a failing case in the simulation. I need to investigate it, but now I have more information.

  • The simulation revealed a concurrency bug:
    • Here is the PR fixing it: https://github.com/pragma-org/amaru/pull/700
    • Unfortunately I only saw that bug by looking at the logs; the main chainsync property was still passing. Maybe I could add a check that there should never be an ERROR log, but that's not exactly true if we start simulating faults.

Animations updates (@etorreborre)

  • I updated the actions.html page that shows the generated data for the simulation.
  • I started working on traces.html to show all the execution stages, but more work is required here.

Demo of amaru as a relay

I wanted to start packaging the demos done on Friday in a PR but before that:

  • I need to understand why the central node in the demo relay-1 doesn't show that it is accepting headers in the steady mode.
  • I need to take a look at the demo relay-3 (the more complex topology) where the central node is terminating on a failed rollback.
