log :: 2026‐03
Tip
Consensus Graph, Simulation Properties, ChainSync Responder, TxSubmission, Observability Tooling
These notes describe stabilizing the new consensus/stage graph (ledger updates, back-pressure, trace-buffer perf fix) and a set of follow-up tasks: responder robustness, re-enabling simulations, and expanding properties.
Simulation work is becoming end-to-end (full nodes + miniprotocols): defining reliable termination conditions, adding property tests for select_chain, and extending checks to txsubmission and later ledger validation.
Several protocol correctness fixes land from simulation findings: chainsync responder snapshotting to avoid races, early/buffered miniprotocol registration to prevent drops, and txsubmission behavior improvements (stream txs as available, responder readiness).
Observability matures into typed trace schemas with CI regression checks, improved OTEL tooling/UI (playback, multi-instance correlation), and new data/DB semantic traces—supporting relay demos and debugging flaky CI.
PR #721 fixes the consensus graph with a proper update of the ledger, back-pressure and so on. However, the build was not passing because the end-to-end tests were too slow. I tried adjusting different parameters, like the batch size for the fetch block request, but that didn't help. Eventually Roland noticed that the trace buffer was always serializing data, even when it was not supposed to be enabled. That fix made the tests pass and the PR is now green.
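As an illustration of the kind of fix described above, here is a minimal sketch of a trace buffer that only pays the serialization cost when it is enabled. The `TraceBuffer` type, its fields and `push_with` are hypothetical names for this note, not the actual Amaru code.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded trace buffer (illustrative names, not Amaru's types).
pub struct TraceBuffer {
    enabled: bool,
    capacity: usize,
    entries: VecDeque<Vec<u8>>,
}

impl TraceBuffer {
    pub fn new(enabled: bool, capacity: usize) -> Self {
        Self { enabled, capacity, entries: VecDeque::new() }
    }

    /// Records one entry. The caller passes a closure, so the serialization
    /// work only happens when the buffer is enabled.
    pub fn push_with<F: FnOnce() -> Vec<u8>>(&mut self, serialize: F) {
        if !self.enabled || self.capacity == 0 {
            return; // disabled: the closure is never called
        }
        if self.entries.len() == self.capacity {
            self.entries.pop_front(); // drop the oldest entry
        }
        self.entries.push_back(serialize());
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Passing a closure instead of already-serialized bytes is the design point here: a disabled buffer costs one branch per trace point.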
We are now left with:
- Addressing the CodeRabbit comments
- Possibly going back to the fixes I did in #729 for the chainsync responder because I don't think that the one in #721 is robust enough
- Re-enabling the simulation tests on that PR.
Then, when it is merged, I will add on top:
- The extension of the simulation tests with the txsubmission property
- The updated animation of traces
At some stage we also need to revisit the simulation support to finely control how time passes. Unfortunately this might require some heavy-duty rework of how we currently handle synchronous effects, like the store effects.
Finalized trace PR; now waiting for feedback before merging. Some experimentation with amarost.
Work on tooling/scripts to automate and simplify snapshots creation.
- Refine and document 100% of Amaru's traces
- Continuously validate Amaru against the latest available ledger snapshot within a day of an epoch ending
- work on snapshots creation
- merge trace PR
This week I investigated issues found by or related to the simulation:
- I still had to tweak some code to try to find the best way to tell if a simulation is done or not.
- I fixed some issues with the sending of header in the chainsync responder: #729.
- We need to snapshot the best chain state before returning headers downstream.
- On top of that PR, I added some setup and checks for the tx submission protocol and found another issue: #728.
- We should return transactions as soon as some of them are available and not wait until we fill the required number of transactions.
- There was also a possible concurrency issue where the txsubmission initiator starts sending transactions while the responder is not yet registered.
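The "return transactions as soon as some are available" behaviour from #728 can be sketched as follows. This is a simplified model under stated assumptions: `TxId` and `take_available` are hypothetical names, and the pending queue stands in for whatever the real responder reads from.

```rust
/// Stand-in for a transaction identifier (assumption for this sketch).
pub type TxId = u64;

/// Replies to a request for up to `requested` transactions.
/// Returns `None` when nothing is available yet (the caller keeps waiting),
/// and otherwise returns whatever is available, even if fewer than requested.
pub fn take_available(pending: &mut Vec<TxId>, requested: usize) -> Option<Vec<TxId>> {
    if pending.is_empty() || requested == 0 {
        return None;
    }
    let n = requested.min(pending.len());
    Some(pending.drain(..n).collect())
}
```

With two pending transactions and a request for five, the reply contains the two available ones instead of blocking until five have accumulated.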
I also upgraded the animation: #731 to display the connection + miniprotocol stages and added many features.
This supports making sure that the node is robust enough to act as a relay in production.
- I have tested my latest fixes on top of @rkuhn's PR for finishing the new consensus graph and I made the simulation tests (#721) pass after additional fixes (on the rk/fetch-blocks-simulation-tests-fix branch). When Roland is back next week we need to finish all this work and push it to main.
- Then more simulation testing should be done + I should start having a look at the mempool.
Chainsync responder fixes
The responder gets "NewTip" messages and must deduce what messages need to be sent to a downstream peer based on the current position of that peer compared to the current tip. The previous code was getting that information from the best chain stored in the ChainStore but that chain can be mutated while the information is retrieved which leads to incorrect headers being sent downstream.
The fix in PR #729 consists in returning headers from the database directly if they are contained in the immutable part of the best chain, and otherwise snapshotting a fragment of the current chain to feed the downstream peer. A similar treatment has been applied to the search for an intersection point between two peers when the protocol is initialized.
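A minimal sketch of the snapshotting idea, assuming a toy `ChainStore` where headers are plain numbers and the volatile part of the best chain sits behind a mutex. All names here are hypothetical; the real store and header types are different.

```rust
use std::sync::Mutex;

/// Stand-in for a header: only its position on the chain (assumption).
pub type Header = u64;

/// Toy chain store: an immutable prefix plus a volatile fragment that
/// chain selection may replace concurrently.
pub struct ChainStore {
    immutable: Vec<Header>,
    volatile: Mutex<Vec<Header>>,
}

impl ChainStore {
    pub fn new(immutable: Vec<Header>, volatile: Vec<Header>) -> Self {
        Self { immutable, volatile: Mutex::new(volatile) }
    }

    /// Chain selection may swap the volatile fragment at any time.
    pub fn replace_volatile(&self, fragment: Vec<Header>) {
        *self.volatile.lock().unwrap() = fragment;
    }

    /// Copies the fragment under the lock; the copy is unaffected by later mutations.
    pub fn snapshot_volatile(&self) -> Vec<Header> {
        self.volatile.lock().unwrap().clone()
    }

    /// Headers to send to a peer located at `peer_position`: immutable headers
    /// are read directly, the rest comes from a single consistent snapshot.
    pub fn headers_after(&self, peer_position: Header) -> Vec<Header> {
        let fragment = self.snapshot_volatile();
        self.immutable
            .iter()
            .chain(fragment.iter())
            .copied()
            .filter(|h| *h > peer_position)
            .collect()
    }
}
```

The point of the snapshot is that the responder computes its whole answer from one consistent view, instead of re-reading a chain that may change between two reads.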
Miniprotocol registration fix
Running the simulation found one issue in the implementation of the miniprotocols. The initiator of the txsubmission protocol was sending a message at a time when the responder was not yet registered for that protocol. The fix consists in registering all the protocols early in buffering mode so that messages cannot be dropped (@rkuhn you will want to review this in case I did not interpret the Manager machinery correctly :-)).
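The early registration in buffering mode can be pictured with the sketch below. It is not the Manager machinery itself: `Registry`, `Slot`, `Delivery` and the protocol number used in the example are all assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

pub type ProtocolId = u16;
pub type Message = Vec<u8>;

/// What happened to an incoming message.
#[derive(Debug, PartialEq)]
pub enum Delivery {
    Buffered,
    Delivered(Message),
    Dropped,
}

enum Slot {
    /// Registered early: messages are queued until a handler attaches.
    Buffering(VecDeque<Message>),
    /// A handler is attached: messages go straight through.
    Attached,
}

pub struct Registry {
    slots: HashMap<ProtocolId, Slot>,
}

impl Registry {
    /// Registers every known protocol in buffering mode up front.
    pub fn with_buffering(protocols: &[ProtocolId]) -> Self {
        let slots = protocols
            .iter()
            .map(|id| (*id, Slot::Buffering(VecDeque::new())))
            .collect();
        Self { slots }
    }

    pub fn receive(&mut self, id: ProtocolId, msg: Message) -> Delivery {
        match self.slots.get_mut(&id) {
            Some(Slot::Buffering(queue)) => {
                queue.push_back(msg);
                Delivery::Buffered
            }
            Some(Slot::Attached) => Delivery::Delivered(msg),
            None => Delivery::Dropped, // unregistered protocol: the original failure mode
        }
    }

    /// Attaches the handler and hands over the buffered messages in arrival order.
    pub fn attach(&mut self, id: ProtocolId) -> Vec<Message> {
        match self.slots.insert(id, Slot::Attached) {
            Some(Slot::Buffering(queue)) => queue.into_iter().collect(),
            _ => Vec::new(),
        }
    }
}
```

In this model a message that arrives before the responder is ready is kept and replayed on `attach`, whereas a message for a protocol that was never registered is dropped.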
Simulation termination
I worked again on finding the best way to stop a simulation before checking properties. I ended up using the generated trace to determine whether the only effects still being executed are keepalive and chainsync next effects.
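A rough sketch of that trace-based check, assuming the trace is a list of effect kinds and that "done" means the last `window` entries are all keepalive or chainsync next effects. The `Effect` enum and the `window` parameter are hypothetical.

```rust
/// Simplified effect kinds as they might appear in a simulation trace (assumption).
#[derive(Debug, Clone, PartialEq)]
pub enum Effect {
    KeepAlive,
    ChainSyncNext,
    FetchBlock,
    StoreHeader,
}

/// The simulation is considered finished when the last `window` effects are
/// only the ones a node keeps producing while idle.
pub fn is_quiescent(trace: &[Effect], window: usize) -> bool {
    window > 0
        && trace.len() >= window
        && trace[trace.len() - window..]
            .iter()
            .all(|e| matches!(e, Effect::KeepAlive | Effect::ChainSyncNext))
}
```

The window size is a trade-off: too small and the check can fire between two bursts of real work, too large and every simulation run gets longer.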
Found bugs in the chainsync responder
Re-running the simulation triggered some issues with the chainsync responder implementation. In order to send headers to the initiator we query the best chain in the chain store but that best chain might be mutated at the same time. I worked towards some fixes but still need to refine this PR.
Tx submission property
The simulation has been extended with the injection of transactions in downstream nodes and a check that they are effectively transmitted upstream (PR).
New select_chain stage property test
This PR adds a property test for that stage to make sure that interleavings of new tips + validated/invalidated blocks yield the correct results.
Added an API to post CBOR-serialized transactions: #727.
This is working towards having a fully functioning relay node.
Next week I will refine the in-progress PRs above so that they are ready for review. Then I need to review PRs. After that I will extend the simulation to involve the ledger validations.
Added more traces for better applicative introspectability.
Migrate traces to trace_span!: simpler, more readable, less potential to break things.
What we now have:
- typed traces using trace_span! and associated schema
- json schema file
- CI process ensuring no regression to the schema
- semantic traces based on OTEL standard
- web page displaying trace schemas nicely
- script providing analysis of traces
- CI process ensuring no regression of traces
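To give an idea of what a schema regression check can look like, here is a small sketch that compares two schema versions and reports removed spans or fields. It assumes a schema reduced to "span name to set of field names"; the actual JSON schema and CI process are richer, and the span names in the usage note are made up.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified schema: span name -> field names (assumption for this sketch).
pub type Schema = BTreeMap<String, BTreeSet<String>>;

/// Convenience constructor from string slices.
pub fn schema(spans: &[(&str, &[&str])]) -> Schema {
    spans
        .iter()
        .map(|(name, fields)| {
            (name.to_string(), fields.iter().map(|f| f.to_string()).collect())
        })
        .collect()
}

/// Lists everything present in `previous` that is missing from `current`.
/// Additions are allowed; removals are reported as regressions.
pub fn regressions(previous: &Schema, current: &Schema) -> Vec<String> {
    let mut problems = Vec::new();
    for (span, fields) in previous {
        match current.get(span) {
            None => problems.push(format!("span removed: {span}")),
            Some(now) => {
                for field in fields.difference(now) {
                    problems.push(format!("field removed: {span}.{field}"));
                }
            }
        }
    }
    problems
}
```

A CI job built on this idea would fail whenever `regressions` returns a non-empty list, for example when a hypothetical `ledger.apply` span disappears between two commits.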
- experiment with amarost: leverage amaru to build a provable alternative to blockfrost (commitment, authenticated data structure)
- Refine and document 100% of Amaru's traces
- data love
- merge trace PR
- last experiment with custom Trace ID and distributed traces
Now running a simulation implies running full nodes, with their mini-protocols, including keep-alive.
This makes the determination of when a test is over more difficult (an amaru node should never crash).
This PR: #724, fixes the termination of tests by draining all the effects that can be executed, except time-dependent ones, once all the test actions have been sent to the node under test. This means that stages sending events after a given amount of time, typically for keep-alive, will stop doing so.
I am not convinced that this is the best way to determine the end of a test even if it seems to work for now. Maybe a better way would be to use header hashes as trace ids (cc @jeluard) and have a way to query for the termination of a trace. Then we would know if a test action has been fully "consumed" or not.
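The draining strategy can be sketched as a loop that runs every pending effect except the time-dependent ones, which are parked. The `Pending` enum and the `run` callback are assumptions made for this sketch, not the simulation's actual API.

```rust
use std::collections::VecDeque;

/// Simplified pending effects (assumption): either immediate or time-dependent.
#[derive(Debug, Clone, PartialEq)]
pub enum Pending {
    Send(&'static str),
    Wait { millis: u64 },
}

/// Runs every non-timed effect, including the ones produced while draining.
/// Returns how many effects were executed and the timed effects left parked.
pub fn drain_non_timed(
    initial: Vec<Pending>,
    mut run: impl FnMut(&Pending) -> Vec<Pending>,
) -> (usize, Vec<Pending>) {
    let mut queue: VecDeque<Pending> = initial.into();
    let mut parked = Vec::new();
    let mut executed = 0;
    while let Some(effect) = queue.pop_front() {
        if matches!(effect, Pending::Wait { .. }) {
            parked.push(effect); // time-dependent: never executed during the drain
        } else {
            executed += 1;
            queue.extend(run(&effect)); // follow-up effects are drained too
        }
    }
    (executed, parked)
}
```

Because timers are never fired, a keep-alive stage waiting on a `Wait` stops producing new work, which is what lets the loop terminate.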
The select_chain stage is now responsible for receiving new tips from peers and block validations. From there it is able to tell if there is a new best tip or not. I have now added a property test that simulates a series of rollforwards sent by peers and marks some of the referenced blocks as invalid (so that tips including them in their chain should not be sent downstream).
Good news: no issue was found by the property test, even when varying chain depth, number of peers, number of tests!
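For readers unfamiliar with this kind of test, the shape of the property can be sketched with a toy model. Everything below is an assumption for illustration: blocks are plain numbers on a single shared chain, `select_chain` is a naive "longest chain without invalid blocks" model rather than the real stage, and a small hand-written generator replaces a property-testing library.

```rust
use std::collections::{BTreeMap, HashSet};

pub type BlockId = u32;

#[derive(Debug, Clone)]
pub enum Event {
    /// A peer announces a chain (oldest block first); its last element is the tip.
    RollForward { peer: usize, chain: Vec<BlockId> },
    /// Block validation reported this block as invalid.
    Invalid(BlockId),
}

/// Toy model: the longest announced chain that contains no invalid block.
pub fn select_chain(events: &[Event]) -> Option<Vec<BlockId>> {
    let mut chains: BTreeMap<usize, Vec<BlockId>> = BTreeMap::new();
    let mut invalid: HashSet<BlockId> = HashSet::new();
    for event in events {
        match event {
            Event::RollForward { peer, chain } => {
                chains.insert(*peer, chain.clone());
            }
            Event::Invalid(block) => {
                invalid.insert(*block);
            }
        }
    }
    chains
        .into_values()
        .filter(|c| !c.is_empty() && c.iter().all(|b| !invalid.contains(b)))
        .max_by_key(|c| c.len())
}

/// Property: the selected chain never contains a block already marked invalid.
pub fn holds(events: &[Event]) -> bool {
    let invalid: HashSet<BlockId> = events
        .iter()
        .filter_map(|e| match e {
            Event::Invalid(b) => Some(*b),
            _ => None,
        })
        .collect();
    select_chain(events).map_or(true, |chain| chain.iter().all(|b| !invalid.contains(b)))
}

/// Tiny linear congruential generator so the sketch needs no external crate.
pub struct Lcg(pub u64);

impl Lcg {
    /// Pseudo-random value in `0..bound`; `bound` must be non-zero.
    pub fn below(&mut self, bound: u64) -> u64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 33) % bound
    }
}

/// Random interleaving of rollforwards and invalidations (`peers`, `depth` non-zero).
pub fn random_events(rng: &mut Lcg, peers: u64, depth: u64, count: usize) -> Vec<Event> {
    (0..count)
        .map(|_| {
            if rng.below(4) == 0 {
                Event::Invalid(rng.below(depth) as BlockId)
            } else {
                let peer = rng.below(peers) as usize;
                let len = 1 + rng.below(depth) as BlockId;
                Event::RollForward { peer, chain: (0..len).collect() }
            }
        })
        .collect()
}
```

A test loop then generates many event sequences and checks `holds` on every prefix, varying the number of peers and the chain depth in the same spirit as the parameters mentioned above.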
Added new traces based on db semantic conventions. This allows us to extract basic statistics related to amaru db access, improving our global understanding of its internal workings.
Experimented with leveraging CBOR structure for improved data access. The overall idea is to use CDDL schemas to enable direct byte array access and skip irrelevant fields (also removing the associated (de)serialization costs).
Also took this opportunity to benchmark the most prominent key/value stores in the Rust ecosystem. The benchmark mimics actual amaru pathological paths: account rewards update.
See cbor-db.
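The direct byte access idea can be illustrated with a hand-rolled sketch that locates the n-th element of a CBOR array without decoding the other elements. This is not the cbor-db code: it is an assumption-laden example that only handles definite-length items and relies on the standard CBOR head encoding (3-bit major type, 5-bit additional information).

```rust
/// Reads a CBOR head at `pos`: (major type, argument, position after the head).
/// Indefinite-length and reserved encodings are rejected in this sketch.
fn head(bytes: &[u8], pos: usize) -> Option<(u8, u64, usize)> {
    let first = *bytes.get(pos)?;
    let (major, info) = (first >> 5, first & 0x1f);
    let extra = match info {
        0..=23 => 0,
        24 => 1,
        25 => 2,
        26 => 4,
        27 => 8,
        _ => return None,
    };
    if extra == 0 {
        return Some((major, info as u64, pos + 1));
    }
    let raw = bytes.get(pos + 1..pos + 1 + extra)?;
    let arg = raw.iter().fold(0u64, |acc, b| (acc << 8) | *b as u64);
    Some((major, arg, pos + 1 + extra))
}

/// Returns the position just after the item starting at `pos`, without decoding it.
fn skip(bytes: &[u8], pos: usize) -> Option<usize> {
    let (major, arg, mut next) = head(bytes, pos)?;
    match major {
        0 | 1 | 7 => Some(next), // integers, simple values, floats: head only
        2 | 3 => {
            // byte and text strings: head followed by `arg` bytes
            let end = next.checked_add(arg as usize)?;
            (end <= bytes.len()).then_some(end)
        }
        4 | 5 | 6 => {
            // arrays, maps (two items per entry) and tags (one item)
            let items = match major {
                4 => arg,
                5 => arg.checked_mul(2)?,
                _ => 1,
            };
            for _ in 0..items {
                next = skip(bytes, next)?;
            }
            Some(next)
        }
        _ => None,
    }
}

/// Returns the raw encoded bytes of element `index` of a top-level CBOR array.
pub fn nth_field(bytes: &[u8], index: u64) -> Option<&[u8]> {
    let (major, len, mut pos) = head(bytes, 0)?;
    if major != 4 || index >= len {
        return None;
    }
    for _ in 0..index {
        pos = skip(bytes, pos)?;
    }
    let end = skip(bytes, pos)?;
    bytes.get(pos..end)
}
```

Given a stored record encoded as a CBOR array, `nth_field(record, 3)` returns only the slice for the fourth field; the fields before it are skipped by length rather than deserialized.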
Improved mithril packaging stability.
- Refine and document 100% of Amaru's traces
- data love
- more traces
Improve otel-ui
- bridge persist traces
- historical traces can be played back (slow-motion, step-by-step)
- multiple instances are now supported; a span attribute can be used to correlate traces
PRS
- https://github.com/pragma-org/amaru/pull/710 (more semantic conventions)
- first release. Now cargo +nightly install amaru works and installs the amaru binary
- dig amarost
- Refine and document 100% of Amaru's traces
- still some traces love
- experiment with optimized CBOR persistency
There are several issues related to the simulation:
- The simulation is flaky on CI: https://github.com/pragma-org/amaru/actions/runs/22588131557/job/65438666562
  => I added more logging to display the seed and / or the node config when the actions stage is failing or the node can't be built for some reason.
  => I fixed some initialization/shrinking issues when there are many peers involved.
  => There is still a failing case in the simulation. I need to investigate it, but now I have more information.
- The simulation revealed a concurrency bug:
  => Here is the PR fixing it: https://github.com/pragma-org/amaru/pull/700
  => Unfortunately I only saw that bug by looking at the logs. The main chainsync property was still passing. Maybe I could add a property stating that there should never be an ERROR log, but that's not exactly true if we start simulating faults.
- I updated the actions.html page that shows the generated data for the simulation
  => I started working on traces.html to show all the execution stages but more work is required here.
I wanted to start packaging the demos done on Friday in a PR but before that:
- I need to understand why the central node in the demo relay-1 doesn't show that it is accepting headers in the steady mode.
- I need to take a look at the demo relay-3 (the more complex topology) where the central node is terminating on a failed rollback.