Added Marconi resuming strategy ADR

input-output-hk · Jun 2, 2023 · 48a9d13
1 parent 10d87b6
commit 48a9d13
Showing 2 changed files with 201 additions and 0 deletions.
diff --git a/doc/read-the-docs-site/adr/0006-indexer-resuming-strategy.rst b/doc/read-the-docs-site/adr/0006-indexer-resuming-strategy.rst
@@ -0,0 +1,200 @@
+.. _adr6:
+
+ADR 6: Indexer resuming strategy
+================================
+
+Date: 2023-03-29
+
+Authors
+-------
+
+koslambrou <konstantinos.lambrou@iohk.io>
+
+Status
+------
+
+Draft
+
+Context
+-------
+
+When building a Marconi indexer, you need to provide the points from which it can resume.
+Typically, the latest point will be used to bootstrap the node-to-client chain-sync protocol.
+
+However, the user will sometimes want to run *multiple* indexers in parallel.
+Even in that scenario, the user will need to provide a single point to the node-to-client chain-sync protocol.
+
+The initial resuming strategy was naive.
+``resumeFromStorage`` would return, inside a list, *all* the points that the indexer can resume from.
+The issue is that most indexers *can* resume from *any* point up to the point they have indexed to.
+As a result, ``resumeFromStorage`` can return millions of points, and it takes a significant amount of time to run because the results need to be sorted in descending order.
+
+Given this performance issue, we want to define an efficient resuming strategy which satisfies the
+following general goals:
+
+* fast
+* low hardware resource consumption
+* does not require indexers to re-index data they have already indexed
+* indexers are *always* in a consistent state. They *must* delete any data in points that have been rolled back (even if the rollback happens while the indexers are stopped).
+
+Decision
+--------
+
+* We will change the return type of ``resumeFromStorage`` from ``StorablePoint h`` to ``[StorablePoint h]``
+
+* We will change the ``resumeFromStorage`` implementation of all existing indexers so that they return a limited set of resumable points.
+  More specifically, ``securityParam * onDiskBufferRatio + 1`` worth of points (in the current Marconi interface, it is actually just ``securityParam + 1``, as we assume that all rollbackable blocks *can* be fully stored on disk).
+
+* We will use the full set of points provided by ``resumeFromStorage`` of an indexer as resuming points for the node-to-client chain-sync protocol.
+  These points need to be sorted in descending order so that the protocol can prioritise the selection of the latest ones.
+
+* We will change the ``Coordinator`` so that each indexer runs its own node-to-client chain-sync protocol instead of sharing the same connection for all the indexers.
+
+* We will have each indexer start the node-to-client chain-sync protocol at different points in time.
+  However, the ``Coordinator`` will make sure that all indexers advance at the same speed (i.e. they can only request the next block once all the indexers have finished processing the current block).
+  Additionally, an indexer can *always* process the next block if there is another indexer that has already processed a later block.
+
+* We will call ``rewind`` at the resuming point in order to make sure that we re-index the block we previously stopped at.
+  That step is to ensure we remove partially indexed information.
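+
+The second and third points above can be sketched as follows.
+This is a minimal illustration, not the Marconi implementation: ``Point`` is a hypothetical stand-in for ``StorablePoint h`` (a bare slot number) and ``resumablePoints`` is a hypothetical helper.
+It sorts in memory for brevity; a real indexer would presumably apply the limit in its storage query instead.
+
+  .. code-block:: haskell
+
+    import Data.List (sortBy)
+    import Data.Ord (Down (..), comparing)
+
+    -- Hypothetical stand-in for 'StorablePoint h': a bare slot number.
+    type Point = Word
+
+    -- Keep only the latest (securityParam + 1) points, latest first,
+    -- so that the chain-sync protocol tries the most recent points first.
+    resumablePoints :: Int -> [Point] -> [Point]
+    resumablePoints securityParam =
+      take (securityParam + 1) . sortBy (comparing Down)
+
+    main :: IO ()
+    main = print (resumablePoints 2 [7, 8, 9, 10]) -- prints [10,9,8]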
+
+Argument
+--------
+
+In order to justify the decision, we will present various use case scenarios and show how the decision satisfies them.
+We assume two indexers: ``A`` and ``B`` which have started indexing information, and then were stopped.
+The use cases will show what will happen when resuming them.
+
+We use the notation ``[x..y]`` to define the resumable interval.
+Also note that we use the operator ``-`` for calculating the difference between two intervals.
+For example, ``[1..3] - [2..4] = [1..1]`` and ``[1..3] - [5..10] = [1..3]``.
+In Haskell, that would look something like:
+
+  .. code-block:: haskell
+
+    import qualified Data.Set as Set
+    Set.fromList [1..3] `Set.difference` Set.fromList [2..4] -- evaluates to fromList [1]
+
+``A`` has a resumable interval other than genesis outside of the rollbackable chain point interval
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+              Rollbackable
+            |--------------|
+  1 2 3 4 5 6 7 8 9 10 11 12
+    |---|                  |
+      A                   Tip
+
+``A``'s resumable points provided for the chain-sync protocol are ``[4]``.
+The chain-sync protocol is started at point ``4``, thus ``A`` is rewound to point ``4``.
+The rewind would remove any data indexed at point ``4`` in order to ensure that we remove partially indexed information at the point the indexer was stopped.
+
+``A`` has a resumable interval fully included in the rollbackable chain point interval
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+                 Rollbackable
+                |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+                |----|        |
+                   A         Tip
+
+``A``'s resumable points provided for the chain-sync protocol are ``[10, 9, 8]``.
+The chain-sync protocol will try each of these points and identify the first one which is known by the local node.
+As rollbacks can occur between points ``[8..13]`` after the indexer was stopped, the points ``[8..10]`` provided by the indexer could be invalid.
+Thus, if none of those points is still known to the node, the chain-sync protocol will start from genesis.
+
+``A`` has a resumable interval overlapping the rollbackable chain point interval
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+                 Rollbackable
+                |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+            |--------|        |
+                A            Tip
+
+``A``'s resumable points provided for the chain-sync protocol are ``[10, 9, 8, 7]``.
+The chain-sync protocol will try each of these points and identify the first one which is known by the local node.
+As rollbacks can occur between points ``[8..13]`` after the indexer was stopped, the points ``[8..10]`` provided by the indexer could be invalid.
+Thus, even if those rollbackable points fail, we can guarantee that the chain-sync protocol will start at point ``7`` (unless the node database was deleted and the node re-sync did not get past point ``7``).
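+
+The difference between the last two scenarios can be illustrated with a small sketch of the point selection.
+It is a simplified model under stated assumptions, not the chain-sync protocol itself: ``Point`` is a bare slot number, ``genesis`` is a hypothetical marker, ``findIntersection`` is a hypothetical helper, and the node is modelled as the set of points it still knows (blocks on a new fork are left out, since they would carry different hashes and never match the indexer's points).
+
+  .. code-block:: haskell
+
+    import Data.List (find)
+    import Data.Maybe (fromMaybe)
+    import qualified Data.Set as Set
+
+    -- Hypothetical stand-in for a chain point: a bare slot number.
+    type Point = Word
+
+    -- Hypothetical marker for genesis.
+    genesis :: Point
+    genesis = 0
+
+    -- First offered point (tried in list order, i.e. latest first)
+    -- that the node still knows; genesis if none matches.
+    findIntersection :: Set.Set Point -> [Point] -> Point
+    findIntersection known offered =
+      fromMaybe genesis (find (`Set.member` known) offered)
+
+    main :: IO ()
+    main = do
+      -- The node rolled back below point 8 while the indexer was stopped.
+      let known = Set.fromList [1 .. 7]
+      print (findIntersection known [10, 9, 8, 7]) -- overlapping interval: prints 7
+      print (findIntersection known [10, 9, 8])    -- fully rollbackable interval: prints 0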
+
+``A`` and ``B`` are resuming at different points
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+                 Rollbackable
+                |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+    |-|                       |
+     B                       Tip
+  |---------|
+       A
+
+``A`` and ``B``'s resumable points provided for the chain-sync protocol are ``[6]`` and ``[3]`` respectively.
+The coordinator will block syncing of ``A`` until ``B`` reaches the same point (``6``).
+Then, both indexers can only process the next block once the other has finished processing the current block.
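+
+The synchronisation rule can be sketched as a pure predicate.
+This is only an illustration under assumptions: ``canAdvance`` and the ``Map`` of positions are hypothetical and do not reflect how the ``Coordinator`` is implemented; each position is taken to be the last block the indexer has finished processing.
+
+  .. code-block:: haskell
+
+    import qualified Data.Map.Strict as Map
+
+    type Point = Word
+    type IndexerName = String
+
+    -- An indexer may request the next block when it is behind some other
+    -- indexer, or when every indexer has finished the same block.
+    canAdvance :: Map.Map IndexerName Point -> IndexerName -> Bool
+    canAdvance positions name =
+      case Map.lookup name positions of
+        Nothing -> False
+        Just p -> p < latest || all (== p) (Map.elems positions)
+      where
+        latest = maximum (Map.elems positions)
+
+    main :: IO ()
+    main = do
+      let resumed = Map.fromList [("A", 6), ("B", 3)]
+      print (canAdvance resumed "A") -- False: A waits for B
+      print (canAdvance resumed "B") -- True: B is catching up
+      let caughtUp = Map.fromList [("A", 6), ("B", 6)]
+      print (canAdvance caughtUp "A") -- True: both finished block 6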
+
+Alternative solutions
+---------------------
+
+Single node-to-client chain-sync protocol
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This was our initial implementation.
+We started a single node-to-client chain-sync protocol and then the ``Coordinator`` would pass the ``ChainSyncEvent`` to all indexers.
+Once *all* indexers have finished processing the event, the ``Coordinator`` would fetch the next ``ChainSyncEvent`` and propagate it to all indexers.
+
+The major issue with this solution is that multiple indexers don't always share the same resumable point.
+If they don't share any resumable points, all of the indexers are restarted from genesis (losing all data they previously indexed).
+
+A possible extension would have been to start from the ...
+
+Have ``resumeFromStorage`` return the largest point instead of an interval
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Let's take the following situation.
+
+::
+
+           Rollbackable
+          |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+        |---------------|
+                A      Tip
+
+::
+
+                 Rollbackable
+                |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+            |--------|        |
+                A            Tip
+
+If we implemented this solution, then ``resumeFromStorage`` would return point ``10``.
+However, that point is rollbackable, thus it could possibly be invalid when restarting the indexer.
+For example, let's say the node is rolled back to point ``8`` after the indexer was stopped, and the node continued syncing until point ``13``.
+In that scenario, resuming the indexer from point ``10`` would not yield an error, but it would put the indexer into an inconsistent state with regard to the data that it has indexed.
+
+Of course, that problem would not occur if ``resumeFromStorage`` only returned the largest point that is outside the rollbackable interval.
+That would imply that the indexer needs to be aware of the current node tip in order to derive the latest immutable point.
+However, we think that it should *not* be of concern to the user writing an indexer, and removing rollbackable points should be done outside the indexer.
+
+
+
+::
+
+           Rollbackable
+          |-------------|
+  1 2 3 4 5 6 7 8 9 10 11 12 13
+        |---------------|
+                A      Tip
+
+The implementation of ``resumeFromStorage`` of ``A`` should return a limited set of resumable points: ``securityParam * onDiskBufferRatio + 1`` worth of points.
diff --git a/doc/read-the-docs-site/adr/index.rst b/doc/read-the-docs-site/adr/index.rst
@@ -34,3 +34,4 @@ The general process for creating an ADR is:
    0003-marconi-indexer-rollbacks
    0004-marconi-query-interface
    0005-marconi-indexers-query-synchronisation-primitive
+   0006-indexer-resuming-strategy