
Reorg resistance #2320

Merged: 18 commits, Aug 10, 2023
Conversation

raphjaph (Collaborator) commented Aug 7, 2023

If a reorg is detected it rolls back the database 6 blocks and reindexes from there.

I had to disable some test optimisations, so now the tests run quite a bit longer. I'll try to find a way to get those back.
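Roughly, the detection boils down to checking whether each incoming block still builds on the block hash we last indexed; if it doesn't, the chain has reorged, so we restore a savepoint and reindex from there. A minimal sketch of that check (hypothetical helper name and signature, not the PR's exact code):

use bitcoin::{Block, BlockHash};

// Returns true if `incoming` does not extend the block we indexed for the
// previous height, i.e. the chain has reorged underneath the index.
// `indexed_prev` is the hash the index stored for height - 1, if any.
fn is_reorg(indexed_prev: Option<BlockHash>, incoming: &Block) -> bool {
  match indexed_prev {
    Some(hash) => hash != incoming.header.prev_blockhash,
    // Nothing indexed yet for the parent height, so nothing to contradict.
    None => false,
  }
}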

raphjaph mentioned this pull request Aug 7, 2023
raphjaph marked this pull request as ready for review August 7, 2023 17:01
raphjaph (Collaborator, Author) commented Aug 7, 2023

@victorkirov if you want to have a look here :)

victorkirov (Contributor) commented Aug 8, 2023

@raphjaph looks good overall. Some things to consider though:

  • if a reorg happens and it's more than 6 blocks deep, we won't be able to recover, so maybe it's worth keeping the reorged status
  • I think savepoints are pretty expensive to create and keep. I ended up getting an OOM exception on my cluster when I kept 10 savepoints, granted I took a savepoint every 5 blocks, so things were different. I'd suggest only taking savepoints when you're close to the head block. I also ended up making a savepoint every 5 blocks (block height % 5 == 0) and keeping just 2 savepoints. That gives you protection for 10-14 blocks and uses fewer savepoints.

veryordinally (Collaborator)

> @raphjaph looks good overall. Some things to consider though:
>
>   • if a reorg happens and it's more than 6 blocks deep, we won't be able to recover, so maybe it's worth keeping the reorged status

Outside of an attack scenario the likelihood of a more-than-6-block reorg is extremely low, and in Bitcoin's history we haven't seen one. But taking a snapshot only every few blocks seems like a good tradeoff. See https://blog.lopp.net/how-many-bitcoin-confirmations-is-enough/ for the math behind worst-case scenarios in an attack.

>   • I think savepoints are pretty expensive to create and keep. I ended up getting an OOM exception on my cluster when I kept 10 savepoints, granted I took a savepoint every 5 blocks, so things were different. I'd suggest only taking savepoints when you're close to the head block. I also ended up making a savepoint every 5 blocks (block height % 5 == 0) and keeping just 2 savepoints. That gives you protection for 10-14 blocks and uses fewer savepoints.

I'll do some more measurements, but I haven't seen a significant memory impact.

victorkirov (Contributor) commented Aug 8, 2023

> Outside of an attack scenario the likelihood of a more-than-6-block reorg is extremely low, and in Bitcoin's history we haven't seen one. But taking a snapshot only every few blocks seems like a good tradeoff. See https://blog.lopp.net/how-many-bitcoin-confirmations-is-enough/ for the math behind worst-case scenarios in an attack.

True, the only time there has been a reorg with a lot of blocks was during a fork. Having the status there would be in case something like that happens; otherwise, I think it would currently silently fail to continue indexing.

> I'll do some more measurements, but I haven't seen a significant memory impact.

Strange. It's possible that I did something wrong 😅 I need to read up on how the savepoints work a bit more.

…5. This gives us a savepoint every 5 blocks when initial indexing is completed, and every 5000 blocks during initial indexing. Only keep 2 savepoints, as that allows us under this scheme to go back by 10+ blocks.
veryordinally (Collaborator) left a comment


Reviewed and tested extensively. (Partially) incorporating suggestion from Victor.

veryordinally (Collaborator)

> > Outside of an attack scenario the likelihood of a more-than-6-block reorg is extremely low, and in Bitcoin's history we haven't seen one. But taking a snapshot only every few blocks seems like a good tradeoff. See https://blog.lopp.net/how-many-bitcoin-confirmations-is-enough/ for the math behind worst-case scenarios in an attack.
>
> True, the only time there has been a reorg with a lot of blocks was during a fork. Having the status there would be in case something like that happens; otherwise, I think it would currently silently fail to continue indexing.

I've incorporated your suggestion to do savepoints only every 5 blocks, and keep 2, so we can go back by 10+. If we get a longer reorg, I'd say we have a problem quite a bit more severe than the ord indexer stopping 😅 so I'm not sure it's worth handling that case. We could just panic at that point (I would personally panic myself, I think).

> > I'll do some more measurements, but I haven't seen a significant memory impact.
>
> Strange. It's possible that I did something wrong 😅 I need to read up on how the savepoints work a bit more.

How much RAM do you have on the machines in the cluster? This may be an issue in more memory-constrained environments.

victorkirov (Contributor)

> How much RAM do you have on the machines in the cluster? This may be an issue in more memory-constrained environments.

We had 32 GB with a bitcoin node, an electrs API, and 2 instances of Ord running. I upped it to 64 GB and it seems a lot more stable now.

Comment on lines 660 to 671
let savepoints = wtx.list_persistent_savepoints()?.collect::<Vec<u64>>();

if savepoints.len() >= 2 {
  wtx.delete_persistent_savepoint(savepoints.into_iter().min().unwrap())?;
}

Index::increment_statistic(&wtx, Statistic::Commits, 2)?;
wtx.commit()?;
let wtx = self.index.begin_write()?;
log::debug!("creating savepoint at height {}", self.height);
wtx.persistent_savepoint()?;
wtx.commit()?;
Contributor:


Just 2 things to note here:

  1. This first checks if there are more than 2 savepoints and deletes extra ones, then it creates a new one. That means that at the end you'd end up with 3 savepoints until it runs again. Maybe it should first create a savepoint and then do the greater than or equal to 2 check and clean.
  2. Since this is in the commit function, it will only run on a full commit of an update run and could potentially contain more than 1 block in the commit. In that case, the %5 check could potentially skip over a savepoint, but this would only realistically happen on initial indexing, or starting up after being down for a few blocks, or after recovering from a reorg.

Collaborator:


>   • This first checks if there are more than 2 savepoints and deletes extra ones, then it creates a new one. That means that at the end you'd end up with 3 savepoints until it runs again. Maybe it should first create a savepoint and then do the greater than or equal to 2 check and clean.

It first checks if there are 2 or more savepoints, and if so, deletes the oldest one. That means if somehow we ever get to more than 2 savepoints, we would never reduce them down to 2. But the normal case is that we have 0, 1, or 2. When we have 0 or 1, we just add a new one. When we already have 2, we delete the oldest one and add a new one. We could change the if statement in line 662 to a loop ... Not sure that's worth it, as the only way to get more than 2 savepoints in the first place is by manually creating them or running code that creates more than two.

>   • Since this is in the commit function, it will only run on a full commit of an update run and could potentially contain more than 1 block in the commit. In that case, the %5 check could potentially skip over a savepoint, but this would only realistically happen on initial indexing, or starting up after being down for a few blocks, or after recovering from a reorg.

Not sure I understand. In the initial indexing, commit is called every 5k blocks, so always on blocks divisible by 5000, which are also divisible by 5. In all other cases, commit() is called whenever the indexer reaches the tip, so it will eventually hit a block that meets the condition. I am not concerned about going back a few blocks more than absolutely required.

Contributor:


> It first checks if there are 2 or more savepoints, and if so, deletes the oldest one. That means if somehow we ever get to more than 2 savepoints, we would never reduce them down to 2. But the normal case is that we have 0, 1, or 2. When we have 0 or 1, we just add a new one. When we already have 2, we delete the oldest one and add a new one. We could change the if statement in line 662 to a loop ... Not sure that's worth it, as the only way to get more than 2 savepoints in the first place is by manually creating them or running code that creates more than two.

Sorry, I didn't describe the issue properly. It's not really an issue, just that in the normal case there will be 3 savepoints instead of 2. When it runs, if there are 3 savepoints, it'll delete one, bringing it down to 2, and then it'll create a new one. That results in there always being 3 savepoints in the normal case. If, instead, we first create a savepoint and then delete the oldest one, we would have 2 in the normal case. Not an issue, just bringing it up in case you want to have 2 savepoints in the normal case.
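A minimal sketch of that ordering (assuming savepoint IDs increase monotonically and that redb allows creating and deleting persistent savepoints in the same write transaction; if it doesn't, this would be split across two transactions like the current code):

// Create the new savepoint first...
log::debug!("creating savepoint at height {}", self.height);
wtx.persistent_savepoint()?;

// ...then prune the oldest savepoints so exactly two remain in the steady state.
let mut savepoints = wtx.list_persistent_savepoints()?.collect::<Vec<u64>>();
savepoints.sort();
while savepoints.len() > 2 {
  wtx.delete_persistent_savepoint(savepoints.remove(0))?;
}

wtx.commit()?;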

> Not sure I understand. In the initial indexing, commit is called every 5k blocks, so always on blocks divisible by 5000, which are also divisible by 5. In all other cases, commit() is called whenever the indexer reaches the tip, so it will eventually hit a block that meets the condition. I am not concerned about going back a few blocks more than absolutely required.

The user can do a ctrl+c on block 3 in the initial indexing (for argument's sake) and then the commits will happen on blocks 5003 + 5000n. Also, if they do a ctrl+c on block 802316 and then start up again on block 802321, a savepoint wouldn't be created. In the worst case for the web server, if a reorg happens and we jump back 5 blocks, then we end up reindexing those 5 blocks plus whatever is new, so we could potentially jump over a savepoint before we start doing single-block indexing. I know that the chances of 2 reorgs happening in such a short timespan are low, but it is possible.

The other argument for this is people who use Ord not in web-server mode with continuous indexing, but instead update their index on demand when they execute a command. If they happen to never execute a command on a block divisible by 5, they would never get a savepoint.

The solution is to move the savepoint logic into the update loop and create savepoints there, or to keep the logic here but also add a condition in that loop to commit on blocks divisible by 5 that are less than 10 (maybe a bit more) blocks away from the head block.

raphjaph (Collaborator, Author) commented Aug 9, 2023

@victorkirov @veryordinally

I've consolidated most of the logic into one file. I've added back the /status endpoint for unrecoverable reorgs (atm 14 blocks). Also added an error type for that. I haven't incorporated the comments from above yet. Looking at it now


let mut wtx = index.begin_write()?;

let oldest_savepoint =
Contributor:


For this, an alternative to consider is to roll back to the latest savepoint and delete it. That would only roll you back 1-4 blocks instead of 10-14. If it's not far enough, then when the updater runs again it will hit a reorg again and roll back to the older savepoint automatically.
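A minimal sketch of that alternative, reusing the redb savepoint calls already used in this PR (the max()-based selection and the follow-up deletion are the assumptions here, not the PR's actual code):

let mut wtx = index.begin_write()?;

// Pick the newest persistent savepoint so we only rewind a few blocks.
let newest_id = wtx
  .list_persistent_savepoints()?
  .max()
  .expect("a savepoint should exist once a reorg has been detected");

let newest_savepoint = wtx.get_persistent_savepoint(newest_id)?;
wtx.restore_savepoint(&newest_savepoint)?;
wtx.commit()?;

// Drop the savepoint we just restored in a follow-up transaction, so a
// second, deeper reorg would fall back to the remaining, older savepoint.
let wtx = index.begin_write()?;
wtx.delete_persistent_savepoint(newest_id)?;
wtx.commit()?;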

@@ -130,7 +117,7 @@ impl Updater {
self.commit(wtx, value_cache)?;
Contributor:


I think there should be a commit test as part of this if block. Something like:

let should_commit_for_savepoint =
  starting_height - self.height < 15 && (self.height + uncommitted) % 5 == 0;

if uncommitted == 5000 || should_commit_for_savepoint {
  self.commit(wtx, value_cache)?;
}

Disclaimer: Terrible variable naming and the logic might be off by 1 😋

Contributor:


Or maybe a helper function in the reorg file
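For example, something along these lines could live in the reorg module (the names and thresholds are assumptions that roughly mirror the suggestion above, not the PR's code):

// Decide whether this commit should also produce a savepoint: only bother
// near the chain tip, and only on heights divisible by the savepoint interval.
fn should_create_savepoint(current_height: u64, tip_height: u64) -> bool {
  const SAVEPOINT_INTERVAL: u64 = 5;
  const MAX_SAVEPOINT_LAG: u64 = 15;

  tip_height.saturating_sub(current_height) < MAX_SAVEPOINT_LAG
    && current_height % SAVEPOINT_INTERVAL == 0
}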

victorkirov (Contributor) commented Aug 9, 2023

> I've consolidated most of the logic into one file. I've added back the /status endpoint for unrecoverable reorgs (atm 14 blocks). Also added an error type for that. I haven't incorporated the comments from above yet. Looking at it now

Reorg in its own file is great 👍 Nice that it's contained in one central place

victorkirov (Contributor) commented Aug 10, 2023

Just a heads up, maybe run this branch for a few days to check memory usage. These are our 4 instances that have been running for 2 days. The top 2 are creating 2 savepoints, similar to this PR but with 10 blocks per savepoint, while the bottom 2 are not. Could be a memory issue in redb with savepoints?

[image: memory usage of the 4 instances over the last 2 days]

raphjaph (Collaborator, Author)

> Just a heads up, maybe run this branch for a few days to check memory usage. These are our 4 instances that have been running for 2 days. The top 2 are creating 2 savepoints, similar to this PR but with 10 blocks per savepoint, while the bottom 2 are not. Could be a memory issue in redb with savepoints?
>
> [image: memory usage of the 4 instances over the last 2 days]

To alleviate this a bit, we only create savepoints if we are within 100 blocks of the tip, so the initial sync should not incur a performance hit.

victorkirov (Contributor)

> To alleviate this a bit, we only create savepoints if we are within 100 blocks of the tip, so the initial sync should not incur a performance hit.

I'm doing that as well. I actually only start about 25 blocks away from the head.

Here's an even better example: I restarted all 4 instances about an hour ago, and this is their RAM usage over the last hour:

[image: RAM usage of the 4 instances over the last hour]

There are 2 instances per line, the top line being the 2 instances with savepoints and the bottom without. Maybe just run the same test with this branch and the main branch for an hour or two to confirm. I might've done something else wrong which isn't present in this PR.

raphjaph (Collaborator, Author) commented Aug 10, 2023

I think more RAM usage represents an acceptable trade-off for increased resilience. Would be great to see what you are doing exactly on your branch. I'm just going to merge this for now but if you find an improvement definitely open a PR!
