
Node Recovery


In the event of a consensus failure or app hash mismatch that cannot be recovered by simply restarting the node, there are a few possible resolutions.

Resync from scratch

If there are other healthy nodes on the network, the most straightforward resolution is to delete all node data: (1) recreate the PostgreSQL database, and (2) delete the data folders under ~/.kwild, keeping the private key and configuration files. Then start the node and let it sync blocks from the other nodes, starting from genesis.
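
A minimal sketch of this procedure on a host deployment, assuming a local PostgreSQL server (the database name, owner role, and data folder names are illustrative and may differ in your deployment):

# stop kwild first, then recreate the application database
psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;'

# remove the data folders under the node root, keeping the private key and configuration files
rm -rf ~/.kwild/abci/data

# restart the node; it will sync blocks from its peers starting at genesis
kwild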

This approach may require considerable time and compute resources, including network bandwidth.

Also, this is only applicable if the cause of corruption is not a bug in the code that affected the whole network.

Drop only the PostgreSQL database

kwild tracks several distinct heights:

  • block height -- the height of the block store and index internal to CometBFT
  • consensus engine state height -- the height of CometBFT's internal consensus state
  • application height -- the height of our application's state, reflecting the last block whose transactions we have executed and whose state changes we have committed

It is possible to reset the application height to zero while keeping CometBFT's data unchanged, which will signal to CometBFT to "reapply" all of the blocks with the application. This is different from "catch up" mode, in which all data is reset and resynchronized from network peers.

To do this, simply recreate the PostgreSQL database: either drop and recreate it, e.g. psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;', or delete the Docker volume that contains the PostgreSQL database cluster.
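
If PostgreSQL runs in Docker, resetting the cluster might look like the following sketch (the container and volume names here are hypothetical; find yours with docker ps and docker volume ls):

# stop and remove the container so its volume can be removed
docker stop kwil-postgres && docker rm kwil-postgres

# remove the volume holding the database cluster
docker volume rm kwil-postgres-data

# recreate the container; the postgres image's entrypoint initializes a fresh cluster on the new volume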

Rollback CometBFT blocks and drop PostgreSQL database

If the recovery actions above fail to fix the node, it may be necessary to roll back CometBFT's blocks and state by one or more blocks. The documentation for the cometbft rollback command explains this:

$ cometbft rollback -h

A state rollback is performed to recover from an incorrect application state transition,
when CometBFT has persisted an incorrect app hash and is thus unable to make
progress. Rollback overwrites a state at height n with the state at height n - 1.
The application should also roll back to height n - 1. If the --hard flag is not used, 
no blocks will be removed so upon restarting CometBFT the transactions in block n will be 
re-executed against the application. Using --hard will also remove block n. This can
be done multiple times.

Usage:
  cometbft rollback [flags]

Flags:
      --hard   remove last block as well as state
  -h, --help   help for rollback

Global Flags:
      --home string        directory for config and data (default "/home/jon/.cometbft")
      --log_level string   log level (default "info")
      --trace              print out full stack trace on errors

If this is needed, it is likely that there is a bug that must be fixed first, and that many other nodes on the network are affected by it (say, a determinism bug). In that event, recovering the network requires the following steps (a combined sketch of steps 2 through 5 follows the list):

  1. fix the bug
  2. rollback one or more blocks on the affected nodes
  3. reset the application state (PostgreSQL)
  4. deploy the new version of kwild
  5. have it reapply all the block data up to the point before the bug forked (and halted) the network.
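
On each affected node, steps 2 through 5 might look roughly like the following, mirroring the commands shown elsewhere on this page (flags and paths are illustrative; adjust them to your deployment):

# 2. roll back the incorrectly executed block(s)
cometbft rollback --hard --home ~/.kwild/abci

# 3. reset the application state by recreating the PostgreSQL database
psql -c 'DROP DATABASE IF EXISTS kwild;' -c 'CREATE DATABASE kwild OWNER kwild;'

# 4 and 5. start the patched kwild, which re-executes the stored blocks on startup
kwild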

To perform a rollback, we may use the cometbft command line application. Install it to $GOPATH/bin as follows:

go install -v github.com/cometbft/cometbft/cmd/cometbft@v0.38.12

Assuming $GOPATH/bin is on your PATH, it may then be used from any folder on the command line.

Using the rollback command as documented requires setting the --home folder to CometBFT's root, which is the abci subfolder in kwild's root directory:

$ cometbft rollback --hard --home ~/.kwild/abci
Rolled back both state and block to height 823 and hash 5C9824172FF1717C32671420A02E76B41779687A7642F3A19D9B5A56ACF3278F

Note that it is not possible to keep the application state intact after resetting or partially rolling back CometBFT's data. In that case, an error such as the following will be reported:

error on replay: app block height (824) is higher than core (823)

The CometBFT data may be ahead of the application, but not the reverse.

Using cometbft inspect

While not a recovery procedure itself, CometBFT's RPC-based inspect command is helpful for debugging, as it can query the databases of a stopped node:

$ cometbft inspect -h

	inspect runs a subset of CometBFT's RPC endpoints that are useful for debugging
	issues with CometBFT.

	When the CometBFT detects inconsistent state, it will crash the
	CometBFT process. CometBFT will not start up while in this inconsistent state.
	The inspect command can be used to query the block and state store using CometBFT
	RPC calls to debug issues of inconsistent state.

Usage:
  cometbft inspect [flags]

As with cometbft rollback, specify the path to the abci subfolder using the --home flag.

$ cometbft inspect --home ~/.kwild/abci
I[2024-03-14|16:44:58.830] starting inspect server                      module=main 
I[2024-03-14|16:44:58.831] RPC HTTP server starting                     address=tcp://127.0.0.1:26657
I[2024-03-14|16:44:58.831] serve                                        msg="Starting RPC HTTP server on 127.0.0.1:26657"

This starts the CometBFT RPC server (not ours), which exposes a different set of RPCs that are blind to the existence of the Kwil DB application. They are documented thoroughly, with examples, in the CometBFT RPC reference.

For instance, to use the /block endpoint to get the best block height:

$ curl -s --insecure http://localhost:26657/block | jq '.result.block.header.height'
"823"

Try jq '{hash: .result.block_id.hash, app_hash: .result.block.header.app_hash}' to get both the block hash and the app hash from the above endpoint.
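
For example, combined with the curl command above:

curl -s --insecure http://localhost:26657/block | jq '{hash: .result.block_id.hash, app_hash: .result.block.header.app_hash}'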

To view a formatted and colorized summary of block number 5, use jq and less:

curl --insecure 'http://localhost:26657/block?height=5' | jq -C | less -R

To list all of the transactions in block number 43241:

curl -s --insecure 'http://localhost:26657/tx_search?query="tx.height=43241"'
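
To simply count the matching transactions, pipe the response through jq; the CometBFT tx_search response includes a total_count field:

curl -s --insecure 'http://localhost:26657/tx_search?query="tx.height=43241"' | jq '.result.total_count'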