Added Marconi resuming strategy ADR
koslambrou committed Jun 2, 2023
1 parent 10d87b6 commit 48a9d13
Showing 2 changed files with 201 additions and 0 deletions.
200 changes: 200 additions & 0 deletions doc/read-the-docs-site/adr/0006-indexer-resuming-strategy.rst
@@ -0,0 +1,200 @@
.. _adr6:

ADR 6: Indexer resuming strategy
================================

Date: 2023-03-29

Authors
-------

koslambrou <konstantinos.lambrou@iohk.io>

Status
------

Draft

Context
-------

When building a Marconi indexer, you need to provide the points from which it can resume.
Typically, the latest point will be used to bootstrap the node-to-client chain-sync protocol.

However, the user will sometimes want to run *multiple* indexers in parallel.
Even in that scenario, the user will need to provide a single point to the node-to-client chain-sync protocol.

The initial resuming strategy was naive: ``resumeFromStorage`` returned *all* the points from which the indexer could resume, as a list.
The issue is that most indexers *can* resume from *any* point up to the last point they have indexed.
As a result, ``resumeFromStorage`` can return millions of points, and it takes a significant amount of time to run because the results need to be sorted in descending order.

Given this performance issue, we want to define an efficient resuming strategy which satisfies the
following general goals:

* fast
* low hardware resource consumption
* does not require indexers to re-index data they have already indexed
* indexers are *always* in a consistent state. They *must* delete any data at points that have been rolled back (even if the rollback happens while the indexers are stopped).

Decision
--------

* We will change the return type of ``resumeFromStorage`` from ``StorablePoint h`` to ``[StorablePoint h]``

* We will change the ``resumeFromStorage`` implementation of all existing indexers so that they return a limited set of resumable points.
  More specifically, they will return ``securityParam * onDiskBufferRatio + 1`` points (in the current Marconi interface, this is just ``securityParam + 1``, as we assume that all rollbackable blocks *can* be fully stored on disk).

* We will use the full set of points provided by ``resumeFromStorage`` of an indexer as resuming points for the node-to-client chain-sync protocol.
  These points need to be sorted in descending order so that the protocol prioritises the latest ones.

* We will change the ``Coordinator`` so that each indexer runs its own node-to-client chain-sync protocol instead of sharing the same connection between all indexers.

* We will have each indexer start the node-to-client chain-sync protocol from its own resuming point, which may differ between indexers.
  However, the ``Coordinator`` will make sure that all indexers advance at the same pace (i.e. an indexer can only request the next block once all the indexers have finished processing the current block).
  Additionally, an indexer can *always* process its next block if another indexer has already processed a later block.

* We will call ``rewind`` at the resuming point in order to make sure that we re-index the block at which we previously stopped.
  This step ensures that we remove any partially indexed information.
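
The limited resuming strategy can be illustrated with a minimal sketch. It assumes a hypothetical indexer whose resumable points are modelled as block numbers in a ``Data.Set`` (``BlockNo`` and ``resumePoints`` are illustrative names, not part of the Marconi interface), and it uses the ``securityParam + 1`` bound of the current interface. One plausible reading of the ``+ 1``: since at most ``securityParam`` blocks can be rolled back, a list of ``securityParam + 1`` distinct points should contain at least one point that survives any rollback.

.. code-block:: haskell

   import qualified Data.Set as Set

   -- Hypothetical stand-in for an indexer's StorablePoint: a block number.
   type BlockNo = Int

   -- | Sketch of the limited resuming strategy: keep only the latest
   -- @securityParam + 1@ resumable points and list them newest first, so the
   -- chain-sync intersection search tries the most recent points first.
   -- 'Set.drop' skips the oldest points without sorting the whole collection.
   resumePoints :: Int -> Set.Set BlockNo -> [BlockNo]
   resumePoints securityParam resumable =
     Set.toDescList (Set.drop toSkip resumable)
     where
       toSkip = max 0 (Set.size resumable - (securityParam + 1))

For example, ``resumePoints 5 (Set.fromList [1 .. 100])`` returns ``[100, 99, 98, 97, 96, 95]``: an indexer with a long history still hands only six points to the chain-sync protocol.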

Argument
--------

In order to justify the decision, we will present various use case scenarios and show how the decision satisfies them.
We assume two indexers: ``A`` and ``B`` which have started indexing information, and then were stopped.
The use cases will show what will happen when resuming them.

We use the notation ``[x..y]`` to denote a resumable interval.
Also note that we use the operator ``-`` for the difference between two intervals.
For example, ``[1..3] - [2..4] = [1..1]`` and ``[1..3] - [5..10] = [1..3]``.
In Haskell, that would look something like:

.. code-block:: haskell

   import qualified Data.Set as Set

   Set.fromList [1..3] `Set.difference` Set.fromList [2..4]

``A`` has a resumable interval other than genesis outside of the rollbackable chain point interval
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

               Rollbackable
             |--------------|
   1 2 3 4 5 6 7 8 9 10 11 12
     |---|                  |
       A                   Tip

``A``'s resumable points provided for the chain-sync protocol are ``[4]``.
The chain-sync protocol is started at point ``4``, thus ``A`` is rewound to point ``4``.
The rewind removes any data indexed at point ``4``, which ensures that no partially indexed information remains from the point at which the indexer was stopped.
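
A minimal sketch of this rewind, assuming a hypothetical in-memory storage that maps block numbers to indexed data (``Storage`` and ``rewindForResume`` are illustrative names, not the Marconi ``rewind`` API). Following the description above, everything indexed *at or after* the resuming point is dropped.

.. code-block:: haskell

   import qualified Data.Map.Strict as Map

   -- Hypothetical block number and in-memory indexer storage.
   type BlockNo = Int
   type Storage a = Map.Map BlockNo a

   -- | Drop all data indexed at or after the resuming point, so that no
   -- partially indexed block survives. 'Map.split' returns the entries strictly
   -- below and strictly above the key; we keep only the former.
   rewindForResume :: BlockNo -> Storage a -> Storage a
   rewindForResume resumePoint storage = fst (Map.split resumePoint storage)

In this scenario, applying ``rewindForResume 4`` to data stored at points ``2``, ``3`` and ``4`` keeps only the entries for ``2`` and ``3``.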

``A`` has a resumable interval fully included in the rollbackable chain point interval
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
                 |----|        |
                   A          Tip

``A``'s resumable points provided for the chain-sync protocol are ``[10, 9, 8]``.
The chain-sync protocol will try each of these points in order and select the first one known by the local node.
As rollbacks can occur between points ``[8..13]`` after the indexer was stopped, all the points ``[8..10]`` provided by the indexer could be invalid.
Thus, if none of those points is known by the node, the chain-sync protocol will start from genesis.
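
The intersection search can be modelled with a short sketch. This is a simplified model, not the ouroboros-network implementation: chain points are a block number plus a hash (the hypothetical ``ChainPoint`` type), the node's current chain is a plain list, and the candidates are tried in the order given.

.. code-block:: haskell

   import Data.List (find)
   import Data.Maybe (fromMaybe)

   -- Hypothetical, simplified chain point: a block number plus a block hash.
   data ChainPoint = Genesis | ChainPoint Int String
     deriving (Eq, Show)

   -- | Try each candidate in order (newest first) and pick the first one that
   -- is on the node's current chain; fall back to genesis if none is.
   findIntersection :: [ChainPoint] -> [ChainPoint] -> ChainPoint
   findIntersection nodeChain candidates =
     fromMaybe Genesis (find (`elem` nodeChain) candidates)

   -- | Example node chain: blocks 1..7 from the original fork ("a"), then a
   -- rollback to 7 followed by new blocks 8..13 on another fork ("b").
   forkedChain :: [ChainPoint]
   forkedChain =
     [ChainPoint n ('a' : show n) | n <- [1 .. 7]]
       ++ [ChainPoint n ('b' : show n) | n <- [8 .. 13]]

Against ``forkedChain``, the candidates ``[10, 9, 8]`` of this scenario (all on the old ``a`` fork) yield ``Genesis``, whereas also offering the non-rollbackable point ``7`` yields ``ChainPoint 7 "a7"``.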

``A`` has a resumable interval overlapping the rollbackable chain point interval
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
             |--------|        |
                 A            Tip

``A``'s resumable points provided for the chain-sync protocol are ``[10, 9, 8, 7]``.
The chain-sync protocol will try each of these points in order and select the first one known by the local node.
As rollbacks can occur between points ``[8..13]`` after the indexer was stopped, the points ``[8..10]`` provided by the indexer could be invalid.
Thus, if all of those rollbackable points are unknown to the node, we can guarantee that the chain-sync protocol will start at point ``7`` (unless the node database was deleted and the node re-sync did not get past point ``7``).

``A`` and ``B`` are resuming at different points
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
     |-|                       |
      B                       Tip
   |---------|
        A

``A`` and ``B``'s resumable points provided for the chain-sync protocol are ``[6]`` and ``[3]`` respectively.
The coordinator will block syncing of ``A`` until ``B`` reaches the same point (``6``).
Then, both indexers can only process the next block once the other has finished processing the current block.
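
The ``Coordinator``'s gating rule from the Decision section can be sketched as a pure predicate. This is a hypothetical model (``mayProcessNext`` is not an existing Marconi function) in which each indexer is represented only by the last block number it has processed.

.. code-block:: haskell

   -- Hypothetical block number.
   type BlockNo = Int

   -- | May an indexer that last processed block @self@ request its next block,
   -- given the last blocks processed by the other indexers?
   --   * yes if another indexer has already processed a later block
   --     (it is catching up), or
   --   * yes if every indexer has finished the same block (lockstep).
   mayProcessNext :: BlockNo -> [BlockNo] -> Bool
   mayProcessNext self others =
     any (> self) others || all (== self) others

In this scenario, ``mayProcessNext 6 [3]`` is ``False`` (``A`` waits) while ``mayProcessNext 3 [6]`` is ``True`` (``B`` catches up). Once both are at ``6``, either may move to ``7``, and the one that does so then waits for the other. A single indexer has no ``others`` and is never blocked.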

Alternative solutions
---------------------

Single node-to-client chain-sync protocol
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This was our initial implementation.
We started a single node-to-client chain-sync protocol and then the ``Coordinator`` would pass the ``ChainSyncEvent`` to all indexers.
Once *all* indexers have finished processing the event, the ``Coordinator`` would fetch the next ``ChainSyncEvent`` and propagate it to all indexers.

The major issue with this solution is that multiple indexers don't always share a common resumable point.
If they don't share any resumable point, all of the indexers are restarted from genesis (losing all the data they previously indexed).
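
To make the limitation concrete, the following hypothetical sketch (``sharedResumePoint`` is an illustrative name, not code from the initial implementation) computes the point a single shared connection could start from: the latest point in the intersection of every indexer's resumable set, or ``Nothing`` when the sets are disjoint and all indexers would restart from genesis.

.. code-block:: haskell

   import qualified Data.Set as Set

   -- Hypothetical block number.
   type BlockNo = Int

   -- | Latest point from which *every* indexer can resume, if any.
   sharedResumePoint :: [Set.Set BlockNo] -> Maybe BlockNo
   sharedResumePoint [] = Nothing
   sharedResumePoint (first : rest) =
     Set.lookupMax (foldr Set.intersection first rest)

With the resumable intervals ``[1..6]`` and ``[2..3]`` this yields ``Just 3``, so the indexer that had reached ``6`` is pulled back to ``3``; with ``[1..3]`` and ``[5..10]`` it yields ``Nothing``.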

A possible extension would have been to start from the ...

Have ``resumeFromStorage`` return the largest point instead of an interval
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's take the following situation.

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
               |---------------|
                      A       Tip

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
             |--------|        |
                 A            Tip

If we implemented this solution, then ``resumeFromStorage`` would return point ``10``.
However, that point is rollbackable, so it could be invalid when the indexer is restarted.
For example, say the node is rolled back to point ``8`` after the indexer was stopped, and the node then continues syncing until point ``13``.
In that scenario, resuming the indexer from point ``10`` would not yield an error, but it would leave the indexer in an inconsistent state with regard to the data it has indexed.

Of course, that problem would not occur if ``resumeFromStorage`` only returned the largest point that is outside the rollbackable interval.
However, that would require the indexer to be aware of the current node tip in order to derive the latest immutable point.
We think that this should *not* be a concern of the user writing an indexer, and that removing rollbackable points should be done outside the indexer.
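
The coupling shows up clearly in a hypothetical sketch of this alternative (``latestImmutablePoint`` is an illustrative name): to pick the largest point outside the rollbackable interval, the function needs the current node tip and the security parameter as extra inputs. It assumes that the rollbackable interval is ``[tip - securityParam .. tip]`` and that the resumable points are sorted newest first.

.. code-block:: haskell

   import Data.List (find)

   -- Hypothetical block number.
   type BlockNo = Int

   -- | Largest resumable point strictly below the rollbackable interval.
   -- Note the extra @nodeTip@ argument: the indexer alone cannot answer this.
   latestImmutablePoint :: BlockNo -> Int -> [BlockNo] -> Maybe BlockNo
   latestImmutablePoint nodeTip securityParam resumableDesc =
     find (< rollbackableStart) resumableDesc
     where
       rollbackableStart = nodeTip - securityParam

With the tip at ``13`` and ``securityParam = 5`` (so the rollbackable interval is ``[8..13]``), ``latestImmutablePoint 13 5 [10, 9, 8, 7, 6]`` returns ``Just 7``.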

::

                  Rollbackable
                 |-------------|
   1 2 3 4 5 6 7 8 9 10 11 12 13
               |---------------|
                      A       Tip

The implementation of ``resumeFromStorage`` of ``A`` should return a limited set of ``securityParam * onDiskBufferRatio + 1`` resumable points.
1 change: 1 addition & 0 deletions doc/read-the-docs-site/adr/index.rst
@@ -34,3 +34,4 @@ The general process for creating an ADR is:
0003-marconi-indexer-rollbacks
0004-marconi-query-interface
0005-marconi-indexers-query-synchronisation-primitive
0006-indexer-resuming-strategy
