
Non-determinism due to incorrect parameter caching #2708

Closed
chainum opened this issue Feb 22, 2020 · 8 comments · Fixed by #2709
Labels
c:bug (Category: bug), c:consensus/tendermint (Category: Tendermint-based consensus), c:security (Category: security issues), p:0 (Priority: high - bugs, address immediately)

Comments

@chainum

chainum commented Feb 22, 2020

Reporting an attack as part of the ongoing "The Quest" challenge - I'm guessing this is the place to do it.

SUMMARY

I wrote a tx spammer in Go that spams the network with bogus transactions (with each tx having a payload as close as possible to the max tx size of 32768 bytes).
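
For context, here's a rough sketch of the payload-padding approach described above (this is not the actual oasis-spammer code; maxTxSize, padPayload, and submitTx are purely illustrative names):

```go
// Rough sketch of the payload-padding idea, assuming a 32768-byte tx limit.
// submitTx is a hypothetical stand-in for the real client call that signs
// and submits the transaction; it is not part of any actual API.
package main

import (
	"bytes"
	"fmt"
)

const maxTxSize = 32768 // assumed maximum transaction size in bytes

// padPayload pads data with zero bytes so that the payload plus an estimated
// envelope overhead lands just under the size limit.
func padPayload(data []byte, envelopeOverhead int) []byte {
	target := maxTxSize - envelopeOverhead
	if len(data) >= target {
		return data[:target]
	}
	return append(data, bytes.Repeat([]byte{0}, target-len(data))...)
}

// submitTx is a placeholder; the real spammer submits via the node's socket.
func submitTx(payload []byte) error {
	fmt.Printf("would submit a %d-byte payload\n", len(payload))
	return nil
}

func main() {
	image := []byte("base64-encoded picture goes here")
	payload := padPayload(image, 512) // 512 bytes of assumed envelope overhead
	for i := 0; i < 3; i++ {
		if err := submitTx(payload); err != nil {
			fmt.Println("submit failed:", err)
		}
	}
}
```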

Initially this forced roughly 30% of the nodes on the network to crash and stop syncing, ending up with the error:

{"caller":"indexer_service.go:69","height":152154,"level":"info","module":"tendermint:txindex","msg":"Indexed block","ts":"2020-02-22T08:36:14.948753672Z"}
panic: Failed to process committed block (152155:2A579845B98EA9BF9D699409BDD19D4240A48C1ED0E90268FF8AB13547E0D02B): Wrong Block.Header.AppHash. Expected F9A183178993CE3F8D10F83BB1CBD5F41650ECE387324E94BAE7116E1C54F4F3, got E3D8BA16B0321B3E1B3A706A6AAF0835803A30F7F842C2054A24324097E52A94

(Multiple people on Slack whose nodes got terminated also confirmed that they received the same error message).

The only way to recover from this error was to remove all *.db and *.wal files, restart the node and effectively resync from block 0. This didn't work every time though - sometimes it did, sometimes it didn't - so frustratingly enough you'd have to repeat this a couple of times in order to get the node to sync the chain properly again.

Here's how the top 25 leaderboard (using SmartStake's dashboard) looked after the initial wave of the attack:
The Quest Top 25 After the Attack

Outside of the top 25, plenty of other nodes were also affected. A lot of the nodes have since recovered (after performing full resyncs from block 0), but some have remained offline since the start of the attack.

ISSUE TYPE
  • Bug Report / Attack Report
COMPONENT NAME

go/consensus/tendermint

OASIS NODE VERSION
$ ./oasis-node --version
Software version: 20.3.1
Runtime protocol version: 0.11.0
Consensus protocol version: 0.23.0
Committee protocol version: 0.7.0
Tendermint core version: 0.32.8
ABCI library version: 0.16.1
OS / ENVIRONMENT
$ go version
go version go1.13.7 linux/amd64
STEPS TO REPRODUCE
mkdir -p oasis-spammer && cd oasis-spammer
bash <(curl -s -S -L https://raw.githubusercontent.com/SebastianJ/oasis-spammer/master/scripts/install.sh)
cd data && curl -LOs https://gist.githubusercontent.com/SebastianJ/c0844c22e582bf6c20787cc9744b6964/raw/f3730bd54c03c584d96511e05e581a1d8de2e309/bubzbase64.txt && mv bubzbase64.txt data.txt && cd ..
./oasis-spammer --genesis-file PATH/TO/genesis.json --entity-path PATH/TO/entity/folder --socket unix:PATH/TO/internal.sock --count 100000 --pool-size 100

The steps above will download the statically compiled Go binary (for Linux), the list of receiver addresses (to randomly send txs to) and the tx payload - a Base64-encoded picture of the one and only Mr Bubz:
Bubz

ACTUAL RESULTS

Several nodes in "The Quest" network were terminated and couldn't get back to syncing/joining consensus unless they removed all databases (*.db & *.wal), restarted their nodes and resynced from block 0.

The following error message (or similar alterations of it) was displayed:

{"caller":"node.go:735","err":"tendermint: failed to create node: error during handshake: error on replay: Wrong Block.Header.AppHash. Expected E8774EFDB10979C780E1C16BEAFD2B67AC170295E4DAC7CB978AF25D0E35C199, got DF92822F6870181DF2D62B0DFB33D7436A70B6B1C90A5033E3971F143D454108","level":"error","module":"oasis-node","msg":"failed to start tendermint service","ts":"2020-02-22T07:29:38.5682535Z"}
EXPECTED RESULTS

Nodes should be able to withstand an attack like this. They shouldn't get terminated or have to resync from block 0 to be able to participate in the consensus process again.

@kostko added the c:bug, c:security, and p:0 labels on Feb 22, 2020
@kostko
Member

kostko commented Feb 22, 2020

Thanks for your report.

From the error message it seems that the transaction somehow causes non-deterministic execution (e.g., the nodes that crashed diverged from the majority). The transaction payload includes an extra xfer_data field, but that should be ignored on all nodes equally.

@kostko
Member

kostko commented Feb 22, 2020

My current hypothesis is that the issue is not caused by the transaction spam directly; rather, it triggers a bug in nodes that restarted at any point after genesis and, due to that bug, never initialized maxTxSize correctly:
https://github.com/oasislabs/oasis-core/blob/e4132f54409a75892a6f37f892c3ace9b3959550/go/consensus/tendermint/abci/mux.go#L375-L377

Since this (incorrectly) only happens in InitChain, the parameter is never set if the node restarts at some point after genesis.

This is also the reason why wiping state fixed it (unless -- I assume -- you again restarted the node before you processed the oversized transactions).
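
To make the failure mode concrete, here's a minimal, self-contained sketch of that kind of caching bug (illustrative only, not the actual mux.go code; the types and method names are made up): a parameter cached only in InitChain stays at its zero value on a node restarted after genesis, so that node executes oversized transactions the rest of the network rejects, and its app hash diverges.

```go
// Minimal sketch of a "parameter cached only in InitChain" bug.
// None of these types correspond to real oasis-core code.
package main

import "fmt"

type appState struct {
	maxTxSize uint64 // persisted consensus parameter
}

type mux struct {
	state     *appState
	maxTxSize uint64 // in-memory cache consulted when executing transactions
}

// initChain runs only once, at genesis height.
func (m *mux) initChain() {
	m.maxTxSize = m.state.maxTxSize // the only place the cache was being set
}

// restart models a node coming back up after genesis: initChain is not
// called again, so without a fix the cache stays 0 ("no limit").
// The fix is to also reload the parameter from state on every startup.
func (m *mux) restart(applyFix bool) {
	if applyFix {
		m.maxTxSize = m.state.maxTxSize
	}
}

// accepts reports whether a transaction of the given size would be executed.
func (m *mux) accepts(size uint64) bool {
	return m.maxTxSize == 0 || size <= m.maxTxSize
}

func main() {
	state := &appState{maxTxSize: 32768}

	fresh := &mux{state: state}
	fresh.initChain() // node running since genesis

	restarted := &mux{state: state}
	restarted.restart(false) // node restarted after genesis, bug present

	oversized := uint64(301552) // size of the oversized payload from this report
	fmt.Println("fresh node accepts oversized tx:    ", fresh.accepts(oversized))     // false
	fmt.Println("restarted node accepts oversized tx:", restarted.accepts(oversized)) // true -> divergent app hash
}
```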

@kostko added the c:consensus/tendermint label on Feb 22, 2020
@psabilla

@sebastianj You rock.

@chainum
Author

chainum commented Feb 23, 2020

@kostko Hey - yeah, I wouldn't really call what I did "spam", since Tendermint's way of dealing with nonces kinda stops spam outright (it immediately rejects the ensuing transactions, whereas other chains queue them up and process them in nonce order).

Compared to when I've done these attacks on Harmony and Elrond, the actual tx spamming was significantly less effective on your chain since Tendermint's way of dealing with nonces effectively acts as a rate limiter.
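
A toy illustration of that nonce-handling difference (not Tendermint's or any other chain's actual mempool code; it only shows why an out-of-order burst mostly bounces off a reject-on-gap mempool):

```go
// Toy comparison of "reject on nonce gap" vs "queue future nonces".
package main

import "fmt"

// rejectingMempool drops any tx whose nonce isn't exactly the next expected
// one, so a burst of out-of-order txs mostly bounces.
type rejectingMempool struct{ expected uint64 }

func (m *rejectingMempool) add(nonce uint64) bool {
	if nonce != m.expected {
		return false
	}
	m.expected++
	return true
}

// queueingMempool buffers future nonces and replays them in order once the
// gap is filled, as many account-based chains do.
type queueingMempool struct {
	expected uint64
	pending  map[uint64]bool
}

func (m *queueingMempool) add(nonce uint64) {
	m.pending[nonce] = true
	for m.pending[m.expected] {
		delete(m.pending, m.expected)
		m.expected++
	}
}

func main() {
	burst := []uint64{3, 1, 0, 2, 5} // out-of-order burst of nonces

	r := &rejectingMempool{}
	accepted := 0
	for _, n := range burst {
		if r.add(n) {
			accepted++
		}
	}
	fmt.Println("rejecting mempool accepted:", accepted) // 1: only nonce 0 lands

	q := &queueingMempool{pending: map[uint64]bool{}}
	for _, n := range burst {
		q.add(n)
	}
	fmt.Println("queueing mempool processed up to nonce:", q.expected) // 4: nonces 0-3 processed, 5 still pending
}
```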

I also presumed that it was the actual tx spamming of valid txs that caused this - which doesn't seem to be the case (just like you concluded). I left the "spammer" running yesterday for several hours with txs just shy of 32768 bytes and that didn't crash my node.

I then remembered that I initially ran the spammer with invalid / large payloads when my node + other nodes started crashing.

So I just switched the tx payload over to https://gist.github.com/SebastianJ/70fbf825b0a98e420b35a0ca64c78c8d (301,552 bytes) (i.e. oversized tx payload) and after sending a couple of those txs my node went into the tendermint: failed to create node: error during handshake: error on replay: Wrong Block.Header.AppHash. Expected 759791ABF97F72ACCEB2A768974A9806173001C6881CFD6DA328C4A4D3B4A769, got 1459652747EB4D87204DC37851321E12EDA35ECC001935EFCA021A5509DBC19D mode again.

So yeah, seems you've definitely nailed down what's going on - and I have to resync from block 0 again 🤣

@chainum
Author

chainum commented Feb 23, 2020

@psabilla Thanks :)

@kostko changed the title from "The Quest - spam attack temporarily disabled roughly 30% of the nodes on the network" to "Non-determinism due to incorrect parameter caching" on Feb 23, 2020
@kostko
Member

kostko commented Feb 23, 2020

Successful reproduction in long-term E2E tests in #2709.

@mirrormirage0

@sebastianj , The Breaker of Chains ! 👍

@kostko
Member

kostko commented Feb 24, 2020

Fixed in master and backported to the 20.3.x branch (it will be in the next 20.3.2 release). Thanks again!
