Save observed instances of ChunkStateWitness to the database for later analysis #11137

jancionear · 2024-04-23T12:12:25Z

Save all of the observed instances of ChunkStateWitness to the database.
They can later be fetched for debugging and analysis.

Sometimes things go wrong with witness validation, and we're not sure why. In such cases it would be good to take a look at the witness that failed validation, but currently there's no way to see it: the witness disappears after validation. Saving the witnesses in the database will allow us to inspect a witness after it failed validation and see exactly what is wrong, even after a node crash.

Adds a new database column DBCol::LatestChunkStateWitnesses, which keeps a set of the latest observed witnesses. The size of this set is limited to 4GB and 60*30 instances (30 min worth). When a limit is hit, the witness with the lowest height is removed from the set.
We can't really save all witnesses: at 4MB/s that would add up to 345GB/day, so we're forced to garbage collect the old witnesses. Having only the latest witnesses should be enough for debugging and analysis.
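
To make the eviction rule above concrete, here is a minimal in-memory sketch. It is not the code from this PR: `LatestWitnesses`, its fields, the sequence number used to keep keys unique, and `megabytes_per_day` are all hypothetical names, and a `BTreeMap` stands in for the database column. It assumes that a witness larger than the whole byte budget is simply skipped.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for the `LatestChunkStateWitnesses` column.
/// Keys are ordered by (height, shard_id, sequence number), so the first entry
/// is always the witness with the lowest height.
pub struct LatestWitnesses {
    entries: BTreeMap<(u64, u64, u64), Vec<u8>>,
    total_bytes: u64,
    next_seq: u64,
    max_count: usize,
    max_total_bytes: u64,
}

impl LatestWitnesses {
    pub fn new(max_count: usize, max_total_bytes: u64) -> Self {
        Self { entries: BTreeMap::new(), total_bytes: 0, next_seq: 0, max_count, max_total_bytes }
    }

    /// Stores a serialized witness, then evicts the lowest-height entries
    /// until both the count limit and the byte limit hold again.
    pub fn save(&mut self, height: u64, shard_id: u64, witness: Vec<u8>) {
        let size = witness.len() as u64;
        // Assumption: a witness bigger than the whole budget is not stored at all.
        if size > self.max_total_bytes {
            return;
        }
        let key = (height, shard_id, self.next_seq);
        self.next_seq += 1;
        self.total_bytes += size;
        self.entries.insert(key, witness);

        while self.entries.len() > self.max_count || self.total_bytes > self.max_total_bytes {
            // `pop_first` removes the entry with the smallest key, i.e. the lowest height.
            match self.entries.pop_first() {
                Some((_, removed)) => self.total_bytes -= removed.len() as u64,
                None => break,
            }
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    pub fn total_bytes(&self) -> u64 {
        self.total_bytes
    }

    pub fn lowest_height(&self) -> Option<u64> {
        self.entries.keys().next().map(|(height, _, _)| *height)
    }
}

/// Megabytes needed to keep every witness for one day at the given rate.
pub fn megabytes_per_day(megabytes_per_second: u64) -> u64 {
    megabytes_per_second * 60 * 60 * 24
}
```

With this sketch, `megabytes_per_day(4)` gives 345600 MB, which is where the 345GB/day figure comes from.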

Witness saving could potentially be an attack vector, as we save all witnesses, even the ones that failed validation. An attacker could spam the node with thousands of witnesses, which would all be saved to the database, which could cause denial of service.
Because of that the feature is guarded by a new config option: save_latest_witnesses. By default it's false, so production nodes won't save anything, and won't be vulnerable to such attacks. It can be selectively enabled on test/canary nodes when needed.
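
A rough sketch of how the guard behaves, assuming a trimmed-down config struct. Only the `save_latest_witnesses` name comes from this PR; `ClientConfig` as written here, `maybe_save_witness`, and the `Vec` used as storage are hypothetical.

```rust
/// Hypothetical, trimmed-down config: only the flag added in this PR is modeled.
#[derive(Debug, Clone, Default)]
pub struct ClientConfig {
    /// `bool::default()` is `false`, so nodes do not save witnesses unless asked to.
    pub save_latest_witnesses: bool,
}

/// Saves the witness only when the flag is enabled.
/// Returns `true` if the witness was stored.
pub fn maybe_save_witness(
    config: &ClientConfig,
    saved: &mut Vec<Vec<u8>>,
    witness: &[u8],
) -> bool {
    if !config.save_latest_witnesses {
        // Default path: nothing is written, so witness spam cannot fill the database.
        return false;
    }
    saved.push(witness.to_vec());
    true
}
```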

A new database command is added to access the saved witnesses:

Print observed ChunkStateWitnesses at the given block height (and shard id).
Observed witnesses are only saved when `save_latest_witnesses` is set to true in config.json

Usage: neard database show-latest-witnesses [OPTIONS] --height <HEIGHT>

Options:
      --height <HEIGHT>      Block height (required)
      --shard-id <SHARD_ID>  Shard id (optional)
      --pretty               Pretty-print using the "{:#?}" formatting
      --binary               Print the raw &[u8], can be pasted into rust code
  -h, --help                 Print help

The tool prints observed witnesses with the given height (and optionally shard id), either as a debug print {:?}/{:#?} or as a binary blob that can be pasted into Rust code.

I'm not sure if that's the best way to expose it for debugging, but it was the easiest one to implement. Maybe it should somehow be integrated with debug-ui? I'm not very familiar with it, idk if it'd make sense.

Fixes: #11110
Similar to: #10599

Save the latest observed instances of ChunkStateWitness in the database for later analysis and debugging. Sometimes witness validation goes wrong, and we're not sure why. It's impossible to debug such cases without the witness, and currently there's no way to view the offending witness. Saving the witnesses in the database will allow us to inspect weird witnesses, even if the node crashes.

Saving observed witnesses could be an attack vector. Someone could send thousands of witnesses and we'd save them all, which could overload the database and cause a denial of service. We want to save invalid witnesses, so we can't really filter anything out. Let's add a config option which controls whether the witnesses are saved. It's disabled by default, as saving witnesses poses a security risk.

Add a command that prints observed witnesses with the given height and shard id.

jancionear · 2024-04-23T12:30:00Z

I just saw that the issue description mentions that we already save witnesses in some cases x.x

According to @robin-near and @pugachAG , it seems we are storing a completely invalid state witness, but if a witness is non-deterministically invalid (e.g. it is identified as invalid under specific state of chunk validator, such as missing contract code from cache), the data is gone forever and we have to replay a large block span, hoping to reproduce the issue.

Is this correct? I didn't find anything like that in the code. We save StateTransitionData, but that's a different thing 0_o

codecov · 2024-04-23T12:43:16Z

Codecov Report

Attention: Patch coverage is 5.48780% with 155 lines in your changes are missing coverage. Please review.

Project coverage is 71.15%. Comparing base (f95087b) to head (dbbf460).
Report is 8 commits behind head on master.

Files	Patch %	Lines
chain/chain/src/store/latest_witnesses.rs	0.00%	121 Missing ⚠️
tools/state-viewer/src/latest_witnesses.rs	0.00%	27 Missing and 1 partial ⚠️
chain/chain/src/garbage_collection.rs	0.00%	3 Missing ⚠️
...nt/src/stateless_validation/chunk_validator/mod.rs	66.66%	1 Missing ⚠️
...src/stateless_validation/state_witness_producer.rs	66.66%	1 Missing ⚠️
tools/state-viewer/src/cli.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #11137      +/-   ##
==========================================
+ Coverage   71.10%   71.15%   +0.04%     
==========================================
  Files         773      775       +2     
  Lines      154176   154033     -143     
  Branches   154176   154033     -143     
==========================================
- Hits       109622   109596      -26     
+ Misses      40109    40003     -106     
+ Partials     4445     4434      -11

Flag	Coverage Δ
backward-compatibility	`0.24% <0.00%> (-0.01%)`	⬇️
db-migration	`0.24% <0.00%> (-0.01%)`	⬇️
genesis-check	`1.42% <0.60%> (-0.01%)`	⬇️
integration-tests	`36.98% <4.87%> (+0.09%)`	⬆️
linux	`69.55% <3.04%> (+0.02%)`	⬆️
linux-nightly	`70.64% <5.48%> (+0.08%)`	⬆️
macos	`54.33% <2.43%> (+0.08%)`	⬆️
pytests	`1.65% <0.60%> (-0.01%)`	⬇️
sanity-checks	`1.44% <0.60%> (-0.01%)`	⬇️
unittests	`66.79% <3.65%> (+0.02%)`	⬆️
upgradability	`0.29% <0.00%> (-0.01%)`	⬇️

tayfunelmas · 2024-04-23T15:54:32Z

chain/client/src/stateless_validation/state_witness_producer.rs

@@ -43,6 +43,10 @@ impl Client {
            transactions_storage_proof,
        )?;

+        if self.config.save_latest_witnesses {
+            self.chain.chain_store.save_lateset_chunk_state_witness(&state_witness)?;


nit typo, lateset->latest

tayfunelmas · 2024-04-23T15:57:45Z

chain/chain/src/store/latest_witnesses.rs

+    pub shard_id: u64,
+    /// Each witness has a random UUID to ensure that the key is unique.
+    /// It allows to store multiple witnesses with the same height and shard_id.
+    pub random_uuid: [u8; 16],


can this be the account id of the chunk producer that sent the witness? from chunk producer side it will be single, from chunk validator side it should expect a single witness from each chunk producer.

I thought @pugachAG was working on a PR where we only processed one state witness for a given height, epoch, shard? This shouldn't then be needed?

We should include epoch_id in the key as well

We could, but I'm not sure if there's any benefit to it.
Usually there'll be only one witness per (height, shard_id). The situations where there is more than one will be weird, where anything can happen, including sending multiple witnesses by the same chunk producer.
For debugging I think it's best to save everything that we receive, without deduplication. If I were debugging witness issues I'd like to know that the chunk producer is sending duplicate messages, and deduplicating by account id would hide that from me.

ok, yes, using account id would hide the case of the same chunk producer sending duplicate witnesses. and witness already has the account id for debugging.

> I thought @pugachAG was working on #11111 where we only processed one state witness for a given height, epoch, shard? This shouldn't then be needed?

I wanted to put saving witnesses before Anton's check, debug information should contain all witnesses, even the ones that were rejected.

> We should include epoch_id in the key as well

I think it's impossible to have multiple possible epochs at a height. There's exactly one valid block per height, and this block will always have a clearly defined epoch.

Ok, but we could receive witnesses on different epochs. I can add an epoch filter as well.

tayfunelmas · 2024-04-23T15:58:47Z

chain/chain/src/store/latest_witnesses.rs

+
+impl LatestWitnessesKey {
+    /// `LatestWitnessesKey` has custom serialization to ensure that the binary representation
+    /// starts with big-endan height and shard_id.


nit typo in big-endan

tayfunelmas · 2024-04-23T16:02:21Z

chain/chain/src/store/latest_witnesses.rs

+
+        // Go over witnesses with increasing (height, shard_id) and remove them until the limits are satisfied.
+        // Height and shard id are stored in big-endian representation, so sorting the binary representation is
+        // the same as sorting the integeres.


nit typo in integeres

chain/chain/src/store/latest_witnesses.rs

shreyan-gupta

oh crap, I forgot to post the comments... Had reviewed this some time ago.

shreyan-gupta · 2024-04-23T21:34:11Z

chain/chain/src/store/latest_witnesses.rs

+    pub shard_id: u64,
+    /// Each witness has a random UUID to ensure that the key is unique.
+    /// It allows to store multiple witnesses with the same height and shard_id.
+    pub random_uuid: [u8; 16],


I thought @pugachAG was working on a PR where we only processed one state witness for a given height, epoch, shard? This shouldn't then be needed?

shreyan-gupta · 2024-04-23T21:34:52Z

chain/chain/src/store/latest_witnesses.rs

+    pub shard_id: u64,
+    /// Each witness has a random UUID to ensure that the key is unique.
+    /// It allows to store multiple witnesses with the same height and shard_id.
+    pub random_uuid: [u8; 16],


We should include epoch_id in the key as well

shreyan-gupta · 2024-04-23T21:40:39Z

chain/client/src/stateless_validation/state_witness_producer.rs

@@ -43,6 +43,10 @@ impl Client {
            transactions_storage_proof,
        )?;

+        if self.config.save_latest_witnesses {
+            self.chain.chain_store.save_lateset_chunk_state_witness(&state_witness)?;


Shouldn't we also be calling this function at process_chunk_state_witness where we first receive the state witness from an external chunk producer?

It's also called in process_chunk_state_witness

shreyan-gupta · 2024-04-23T22:47:35Z

chain/chain/src/store/latest_witnesses.rs

+        witness: &ChunkStateWitness,
+    ) -> Result<(), std::io::Error> {
+        let serialized_witness = borsh::to_vec(witness)?;
+        let serialized_witness_size: u64 =


dumb question, why aren't we working with usize here?

I have to serialize the size, so I wanted it to have a clearly defined representation. The exact type of usize depends on the architecture, and I didn't want to do that in serialized data.
I guess it doesn't really matter, because no one is going to move the saved witnesses data to a different arch, but it feels more proper to use u64.
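
For illustration, a small sketch of the point about `u64` vs `usize`: converting the length to `u64` before serializing gives a fixed 8-byte encoding on every architecture. The helper names are hypothetical, and the big-endian byte order is just a choice made for this sketch, not necessarily what the PR does for the size field.

```rust
/// Encodes a payload length as exactly 8 bytes, independent of the platform's `usize` width.
/// The byte order (big-endian) is an arbitrary choice for this sketch.
pub fn encode_size(payload: &[u8]) -> [u8; 8] {
    let size = payload.len() as u64;
    size.to_be_bytes()
}

/// Reads the length back from its 8-byte encoding.
pub fn decode_size(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}
```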

jancionear · 2024-04-26T10:56:49Z

I just learned that there's a neard view-state command, I think it would be a better place for show-latest-witnesses than neard database, I'll move it there.

There should be a return there, we don't want to save big witnesses.

pugachAG · 2024-04-30T10:11:09Z

chain/chain/src/store/latest_witnesses.rs

+    /// `LatestWitnessesKey` has custom serialization to ensure that the binary representation
+    /// starts with big-endian height and shard_id.
+    /// This allows to query using a key prefix to find all witnesses for a given height (and shard_id).
+    pub fn serialized(&self) -> [u8; 64] {


why do we need manual ser/deser instead of just using borsh?
from what I understand we use borsh whenever possible when encoding data in the db

I want the rows to be ordered by (height, shard_id) in big endian representation. I'm not sure if borsh would do that, it was quicker and easier to manually serialize it.
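
A sketch of why big-endian matters here, using a simplified, hypothetical `WitnessKey` (the real `LatestWitnessesKey` serializes to 64 bytes and has more fields). With big-endian integers at the front of the key, comparing the raw bytes gives the same order as comparing (height, shard_id), and all keys for one height share a common prefix. Borsh encodes integers in little-endian, which as far as I know would not preserve that order.

```rust
/// Hypothetical simplified key; field names follow the PR, the layout is an assumption.
pub struct WitnessKey {
    pub height: u64,
    pub shard_id: u64,
    pub random_uuid: [u8; 16],
}

impl WitnessKey {
    /// Serializes as big-endian height, big-endian shard_id, then the UUID bytes.
    pub fn serialized(&self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[0..8].copy_from_slice(&self.height.to_be_bytes());
        out[8..16].copy_from_slice(&self.shard_id.to_be_bytes());
        out[16..32].copy_from_slice(&self.random_uuid);
        out
    }
}

/// Key prefix shared by every witness at the given height.
pub fn height_prefix(height: u64) -> [u8; 8] {
    height.to_be_bytes()
}
```

For example, heights 255 and 256 compare correctly as big-endian bytes, while their little-endian encodings compare the wrong way around.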

jancionear added 3 commits April 23, 2024 14:04

Add neard database show-latest-witnesses command

da740af

Add a command that prints observed witnesses with the given height and shard id.

jancionear requested a review from a team as a code owner April 23, 2024 12:12

jancionear requested review from akhi3030, pugachAG, Longarithm and robin-near and removed request for akhi3030 April 23, 2024 12:12

jancionear added the A-stateless-validation Area: stateless validation label Apr 23, 2024

tayfunelmas reviewed Apr 23, 2024

tayfunelmas reviewed Apr 23, 2024

fix typos

b5fbac5

tayfunelmas approved these changes Apr 24, 2024

Longarithm approved these changes Apr 24, 2024

shreyan-gupta reviewed Apr 25, 2024

jancionear added 6 commits April 26, 2024 11:53

Move latest-witnesses command to state-viewer

cbed5c5

Add epoch_id to LatestWitnessesKey

4174878

Add ability to filter by height, shard_id and epoch_id

bff38d9

Add pub to keep things consistent

9f53522

fix: avoid saving big witnesses, don't just print a message

a261d4d

There should be a return there, we don't want to save big witnesses.

Print extra space to separate printed witnesses

dbbf460

jancionear requested a review from shreyan-gupta April 26, 2024 12:29

pugachAG reviewed Apr 30, 2024

pugachAG approved these changes Apr 30, 2024

jancionear added this pull request to the merge queue Apr 30, 2024

Merged via the queue into near:master with commit c2fb494 Apr 30, 2024
27 of 29 checks passed

jancionear deleted the save-witness branch April 30, 2024 12:03
