Rolling back stake pools take ages #1281

KtorZ · 2020-01-17T18:30:57Z

Context

Information	-
Version	v2020-12-23+
Platform	All
Installation	N/A

Steps to Reproduce

Create a standalone private blockchain (e.g. using the integration config).
Turn on the wallet server and wait for a few slots / epochs.
Turn the wallet off.
Change a setting in the genesis configuration and re-generate the genesis block (effectively creating another blockchain).
Restart the server on the same state directory.

Expected behavior

In cases of fork, the stake pool monitor can quickly rollback to a known point and keep syncing normally.

Actual behavior

The pool monitor rolls back roughly 10 by 10 until it finds a common ancestor.

[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:53.65 UTC] Monitoring stake pools. Currently at 351628d6-[96.3233#64202]
[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:53.89 UTC] Monitoring stake pools. Currently at 04fc30db-[96.3073#64193]
[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:54.66 UTC] Monitoring stake pools. Currently at 70bf9891-[96.2978#64184]
[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:55.17 UTC] Monitoring stake pools. Currently at edbf30a9-[96.2861#64175]
[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:55.68 UTC] Monitoring stake pools. Currently at 105f158c-[96.2785#64166]
[cardano-wallet.stake-pool-monitor:Info:362] [2020-01-17 18:09:56.19 UTC] Monitoring stake pools. Currently at 3c67a40b-[96.2681#64157]

Resolution

Why is this happening? Because:

a) The stake pool monitoring doesn't clean up old checkpoints.
b) In case where the chain follower is unable to find a suitable intersection, it'll instrument the client to enter in a "recovery" mode, by first, asking it to rollback to the oldest known checkpoint in the current window. If there's no older checkpoint, it recover from genesis.

Because the stake pool monitoring doesn't clean up old checkpoints and because a new window is used after every rollback, the monitoring algorithm will make small hops of 10 blocks by 10 blocks until it finds an intersection point, which could possibly be many many blocks in the past.

In theory, this shouldn't be an issue if the chain follower is at least of length k = epoch stability, but since Jörmungandr doesn't abide by the k parameter, an intersection point can be arbitrarily far in the past.

Possible resolutions:

a. ~~Clean up old checkpoints in the stake pool monitoring to allow rolling back to genesis when relevant~~ ==> hard to do correctly without deeper changes in the database schemas (we need the block height for this).

b. Tweak the chain follower such that the recovery doesn't first attempt to rollback to the latest known checkpoint, but instead, directly try to rollback to genesis (the rationale is that, this first attempt to rollback to the latest known checkpoint is more of a desperate move to avoid rolling back from genesis...) ==> not 100% sure about the consequences and UX for wallets.

c. Increase the size of the window for Jörmungandr to be arbitrarily bigger (haskell nodes uses k=2160). ==> #1285

QA

Reproducing steps as described aboved shows that resyncing back to genesis now happens with bigger gaps (10x bigger) and is faster.

The text was updated successfully, but these errors were encountered:

piotr-iohk · 2020-01-21T11:15:51Z

Looks good in terms of stake pools.

Question: is this expected? 👇

I've repeated steps above slightly modified:

Create a standalone private blockchain (e.g. using the integration config).
Turn on the wallet server and wait for a few slots / epochs.
Create a wallet
Turn the wallet off.
Change a setting in the genesis configuration and re-generate the genesis block (effectively creating another blockchain).
Restart the server on the same state directory.

For version v2020-12-23:
This works fine and I can then use the wallet after the rollback.

For current master
I get:

[cardano-wallet.wallet-engine:Error:29] [2020-01-21 11:04:31.30 UTC] Worker has exited unexpectedly: ErrCheckIntegrityDifferentGenesis (Hash {getHash = "\216\163\131S\189W\b]\213+\139U\169Ue\205W\215\158\SI\131}\151ed\ACK\250\186O~\233\&7"}) (Hash {getHash = "<\a\ETX\SO6\191\255\230~..\192\158R\147\211\132c|\210\240\EOT5n\243 \243\254lR\EOT\SUB"})

and then:

$ cardano-wallet-jormungandr wallet list
Something went wrong

KtorZ · 2020-01-21T11:27:00Z

That the wallet worker exits, is normal and the expected behavior. The problem is the following:

If the genesis hashes are different, it means that we are so-to-speak on a different blockchain. So we have two choices:

We could rollback the whole wallet state and silently resync to this new chain.
We could prevent the wallet from starting because it corresponds to a different chain.

I've chosen to go for the second (also seconded by @jonathanknowles on this decision). The rationale is that, if it's a different chain, it's also a different wallet even if the mnemonic is the same. It's also best to be explicit because, you may be surprise that nothing was reported on the server but it seems that transactions are missing from your wallet or whatever... Not realizing you are on the wrong chain can take some debugging time, whereas here, the worker explictely exits telling you that the genesis hashes were different.

Having said that.... the "Something went wrong" when listing wallets afterwards is not cool and not really expected. Seems like the workers being terminated has some effect on the API result which really shouldn't be the case. We should look into this, although pretty minor.

piotr-iohk · 2020-01-21T11:46:39Z

OK. Issue created -> #1292.

KtorZ added the BUG:CONFIRMED label Jan 17, 2020

KtorZ self-assigned this Jan 20, 2020

KtorZ mentioned this issue Jan 20, 2020

Make stake pools engine rollback faster. #1285

Merged

piotr-iohk mentioned this issue Jan 21, 2020

Cannot list wallets when using same state directory but different genesis #1292

Closed

piotr-iohk closed this as completed Jan 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rolling back stake pools take ages #1281

Rolling back stake pools take ages #1281

KtorZ commented Jan 17, 2020 •

edited

piotr-iohk commented Jan 21, 2020

KtorZ commented Jan 21, 2020

piotr-iohk commented Jan 21, 2020

Rolling back stake pools take ages #1281

Rolling back stake pools take ages #1281

Comments

KtorZ commented Jan 17, 2020 • edited

Context

Steps to Reproduce

Expected behavior

Actual behavior

Resolution

QA

piotr-iohk commented Jan 21, 2020

KtorZ commented Jan 21, 2020

piotr-iohk commented Jan 21, 2020

KtorZ commented Jan 17, 2020 •

edited