
Non-determinism due to incorrect parameter caching #2708

Closed
chainum opened this issue Feb 22, 2020 · 8 comments · Fixed by #2709
Labels
c:bug (Category: bug), c:consensus/tendermint (Category: Tendermint-based consensus), c:security (Category: security issues), p:0 (Priority: high - bugs, address immediately)

Comments

@chainum

chainum commented Feb 22, 2020

Reporting an attack as part of the ongoing "The Quest" challenge - I'm guessing this is the place to do it.

SUMMARY

I wrote a tx spammer in Go that spams the network with bogus transactions (with each tx having a payload as close as possible to the max tx size of 32768 bytes).
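
For context, here's a rough sketch of the payload-padding approach described above (this is not the actual oasis-spammer code; maxTxSize, padPayload, and submitTx are purely illustrative names):

```go
// Rough sketch of the payload-padding idea, assuming a 32768-byte tx limit.
// submitTx is a hypothetical stand-in for the real client call that signs
// and submits the transaction; it is not part of any actual API.
package main

import (
	"bytes"
	"fmt"
)

const maxTxSize = 32768 // assumed maximum transaction size in bytes

// padPayload pads data with zero bytes so that the payload plus an estimated
// envelope overhead lands just under the size limit.
func padPayload(data []byte, envelopeOverhead int) []byte {
	target := maxTxSize - envelopeOverhead
	if len(data) >= target {
		return data[:target]
	}
	return append(data, bytes.Repeat([]byte{0}, target-len(data))...)
}

// submitTx is a placeholder; the real spammer submits via the node's socket.
func submitTx(payload []byte) error {
	fmt.Printf("would submit a %d-byte payload\n", len(payload))
	return nil
}

func main() {
	image := []byte("base64-encoded picture goes here")
	payload := padPayload(image, 512) // 512 bytes of assumed envelope overhead
	for i := 0; i < 3; i++ {
		if err := submitTx(payload); err != nil {
			fmt.Println("submit failed:", err)
		}
	}
}
```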

Initially this forced roughly 30% of the nodes on the network to crash and stop syncing, ending up with the error:

{"caller":"indexer_service.go:69","height":152154,"level":"info","module":"tendermint:txindex","msg":"Indexed block","ts":"2020-02-22T08:36:14.948753672Z"}
panic: Failed to process committed block (152155:2A579845B98EA9BF9D699409BDD19D4240A48C1ED0E90268FF8AB13547E0D02B): Wrong Block.Header.AppHash. Expected F9A183178993CE3F8D10F83BB1CBD5F41650ECE387324E94BAE7116E1C54F4F3, got E3D8BA16B0321B3E1B3A706A6AAF0835803A30F7F842C2054A24324097E52A94

(Multiple people on Slack whose nodes got terminated also confirmed that they received the same error message).

The only way to recover from this error was to remove all *.db and *.wal files, restart the node and effectively resync from block 0. This didn't work every time though - sometimes it did, sometimes it didn't - so frustratingly enough you'd have to repeat this a couple of times in order to get the node to sync the chain properly again.

Here's how the top 25 leaderboard (using SmartStake's dashboard) looked after the initial wave of the attack:
The Quest Top 25 After the Attack

Outside of the top 25, plenty of other nodes were also affected. A lot of the nodes have since recovered (after performing full resyncs from block 0), but some have remained offline since the start of the attack.

ISSUE TYPE
  • Bug Report / Attack Report
COMPONENT NAME

go/consensus/tendermint

OASIS NODE VERSION
$ ./oasis-node --version
Software version: 20.3.1
Runtime protocol version: 0.11.0
Consensus protocol version: 0.23.0
Committee protocol version: 0.7.0
Tendermint core version: 0.32.8
ABCI library version: 0.16.1
OS / ENVIRONMENT
$ go version
go version go1.13.7 linux/amd64
STEPS TO REPRODUCE
mkdir -p oasis-spammer && cd oasis-spammer
bash <(curl -s -S -L https://raw.githubusercontent.com/SebastianJ/oasis-spammer/master/scripts/install.sh)
cd data && curl -LOs https://gist.githubusercontent.com/SebastianJ/c0844c22e582bf6c20787cc9744b6964/raw/f3730bd54c03c584d96511e05e581a1d8de2e309/bubzbase64.txt && mv bubzbase64.txt data.txt && cd ..
./oasis-spammer --genesis-file PATH/TO/genesis.json --entity-path PATH/TO/entity/folder --socket unix:PATH/TO/internal.sock --count 100000 --pool-size 100

The steps above will download the statically compiled Go binary (for Linux), the list of receiver addresses (to randomly send txs to) and the tx payload - a Base64-encoded picture of the one and only Mr Bubz:
Bubz

ACTUAL RESULTS

Several nodes in "The Quest" network were terminated and couldn't get back to syncing/joining consensus unless they removed all databases (*.db & *.wal), restarted their nodes and resynced from block 0.

The following error message (or similar alterations of it) was displayed:

{"caller":"node.go:735","err":"tendermint: failed to create node: error during handshake: error on replay: Wrong Block.Header.AppHash. Expected E8774EFDB10979C780E1C16BEAFD2B67AC170295E4DAC7CB978AF25D0E35C199, got DF92822F6870181DF2D62B0DFB33D7436A70B6B1C90A5033E3971F143D454108","level":"error","module":"oasis-node","msg":"failed to start tendermint service","ts":"2020-02-22T07:29:38.5682535Z"}
EXPECTED RESULTS

Nodes should be able to withstand an attack like this. They shouldn't get terminated or have to resync from block 0 to be able to participate in the consensus process again.

@kostko added the c:bug, c:security, and p:0 labels on Feb 22, 2020
@kostko
Member

kostko commented Feb 22, 2020

Thanks for your report.

From the error message it seems that the transaction somehow causes non-deterministic execution (e.g., the nodes that crashed diverged from the majority). The transaction payload includes an extra xfer_data field, but that should be ignored on all nodes equally.

@kostko
Member

kostko commented Feb 22, 2020

My current hypothesis is that the issue is not caused by the transaction spam directly; rather, it triggers a bug in nodes that restarted at any point after genesis and, due to that bug, never initialized maxTxSize correctly:
https://github.com/oasislabs/oasis-core/blob/e4132f54409a75892a6f37f892c3ace9b3959550/go/consensus/tendermint/abci/mux.go#L375-L377

Since this (incorrectly) only happens in InitChain, the parameter is never set if the node restarts at some point after genesis.

This is also the reason why wiping state fixed it (unless -- I assume -- you again restarted the node before you processed the oversized transactions).
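
To make the failure mode concrete, here's a minimal, self-contained sketch of that kind of caching bug (illustrative only, not the actual mux.go code; the types and method names are made up): a parameter cached only in InitChain stays at its zero value on a node restarted after genesis, so that node executes oversized transactions the rest of the network rejects, and its app hash diverges.

```go
// Minimal sketch of a "parameter cached only in InitChain" bug.
// None of these types correspond to real oasis-core code.
package main

import "fmt"

type appState struct {
	maxTxSize uint64 // persisted consensus parameter
}

type mux struct {
	state     *appState
	maxTxSize uint64 // in-memory cache consulted when executing transactions
}

// initChain runs only once, at genesis height.
func (m *mux) initChain() {
	m.maxTxSize = m.state.maxTxSize // the only place the cache was being set
}

// restart models a node coming back up after genesis: initChain is not
// called again, so without a fix the cache stays 0 ("no limit").
// The fix is to also reload the parameter from state on every startup.
func (m *mux) restart(applyFix bool) {
	if applyFix {
		m.maxTxSize = m.state.maxTxSize
	}
}

// accepts reports whether a transaction of the given size would be executed.
func (m *mux) accepts(size uint64) bool {
	return m.maxTxSize == 0 || size <= m.maxTxSize
}

func main() {
	state := &appState{maxTxSize: 32768}

	fresh := &mux{state: state}
	fresh.initChain() // node running since genesis

	restarted := &mux{state: state}
	restarted.restart(false) // node restarted after genesis, bug present

	oversized := uint64(301552) // size of the oversized payload from this report
	fmt.Println("fresh node accepts oversized tx:    ", fresh.accepts(oversized))     // false
	fmt.Println("restarted node accepts oversized tx:", restarted.accepts(oversized)) // true -> divergent app hash
}
```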

@kostko added the c:consensus/tendermint label on Feb 22, 2020
@psabilla

@sebastianj You rock.

@chainum
Author

chainum commented Feb 23, 2020

@kostko Hey - yeah, I wouldn't really call what I did "spam", since Tendermint's way of dealing with nonces kinda stops spam outright (it immediately rejects the ensuing transactions, whereas other chains queue them up and process them in nonce order).

Compared to when I've done these attacks on Harmony and Elrond, the actual tx spamming was significantly less effective on your chain since Tendermint's way of dealing with nonces effectively acts as a rate limiter.
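
A toy illustration of that nonce-handling difference (not Tendermint's or any other chain's actual mempool code; it only shows why an out-of-order burst mostly bounces off a reject-on-gap mempool):

```go
// Toy comparison of "reject on nonce gap" vs "queue future nonces".
package main

import "fmt"

// rejectingMempool drops any tx whose nonce isn't exactly the next expected
// one, so a burst of out-of-order txs mostly bounces.
type rejectingMempool struct{ expected uint64 }

func (m *rejectingMempool) add(nonce uint64) bool {
	if nonce != m.expected {
		return false
	}
	m.expected++
	return true
}

// queueingMempool buffers future nonces and replays them in order once the
// gap is filled, as many account-based chains do.
type queueingMempool struct {
	expected uint64
	pending  map[uint64]bool
}

func (m *queueingMempool) add(nonce uint64) {
	m.pending[nonce] = true
	for m.pending[m.expected] {
		delete(m.pending, m.expected)
		m.expected++
	}
}

func main() {
	burst := []uint64{3, 1, 0, 2, 5} // out-of-order burst of nonces

	r := &rejectingMempool{}
	accepted := 0
	for _, n := range burst {
		if r.add(n) {
			accepted++
		}
	}
	fmt.Println("rejecting mempool accepted:", accepted) // 1: only nonce 0 lands

	q := &queueingMempool{pending: map[uint64]bool{}}
	for _, n := range burst {
		q.add(n)
	}
	fmt.Println("queueing mempool processed up to nonce:", q.expected) // 4: nonces 0-3 processed, 5 still pending
}
```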

I also presumed that it was the actual tx spamming of valid txs that caused this - which doesn't seem to be the case (just like you concluded). I left the "spammer" running yesterday for several hours with txs just shy of 32768 bytes and that didn't crash my node.

I then remembered that I initially ran the spammer with invalid / large payloads when my node + other nodes started crashing.

So I just switched the tx payload over to https://gist.github.com/SebastianJ/70fbf825b0a98e420b35a0ca64c78c8d (301,552 bytes) (i.e. oversized tx payload) and after sending a couple of those txs my node went into the tendermint: failed to create node: error during handshake: error on replay: Wrong Block.Header.AppHash. Expected 759791ABF97F72ACCEB2A768974A9806173001C6881CFD6DA328C4A4D3B4A769, got 1459652747EB4D87204DC37851321E12EDA35ECC001935EFCA021A5509DBC19D mode again.

So yeah, seems you've definitely nailed down what's going on - and I have to resync from block 0 again 🤣

@chainum
Author

chainum commented Feb 23, 2020

@psabilla Thanks :)

@kostko changed the title from "The Quest - spam attack temporarily disabled roughly 30% of the nodes on the network" to "Non-determinism due to incorrect parameter caching" on Feb 23, 2020
@kostko
Member

kostko commented Feb 23, 2020

Successful reproduction in long-term E2E tests in #2709.

@mirrormirage0

@sebastianj , The Breaker of Chains ! 👍

@kostko
Member

kostko commented Feb 24, 2020

Fixed in master and backported to the 20.3.x branch (it will be in the next 20.3.2 release). Thanks again!
