Save observed instances of ChunkStateWitness to the database for later analysis #11137

Merged: 10 commits into near:master from save-witness on Apr 30, 2024

Contributor

@jancionear jancionear commented Apr 23, 2024

Save all of the observed instances of ChunkStateWitness to the database.
They can later be fetched for debugging and analysis.

Sometimes witness validation goes wrong and we're not sure why. In such cases it would help to look at the witness that failed validation, but currently there's no way to see it: the witness disappears after validation. Saving witnesses in the database lets us inspect them after they fail validation and see exactly what is wrong, even after a node crash.

Adds a new database column, DBCol::LatestChunkStateWitnesses, which keeps a set of the latest observed witnesses. The set is limited to 4GB in total size and to 60*30 instances (roughly 30 minutes' worth). When either limit is hit, the witness with the lowest height is removed from the set.
We can't really save all witnesses: at 4MB/s that would add up to about 345GB/day (4MB/s × 86,400 s), so we're forced to garbage collect the old witnesses. Having only the latest witnesses should be enough for debugging and analysis.
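
The eviction rule described above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the code from this PR: `LatestWitnesses`, `WitnessKey`, and `save` are hypothetical names, and a `BTreeMap` stands in for the database column.

```rust
use std::collections::BTreeMap;

/// Hypothetical key: (height, shard_id, unique id). Tuples sort lexicographically,
/// so the first entry in the map is always one with the lowest height.
type WitnessKey = (u64, u64, u128);

/// In-memory stand-in (an assumption) for the `LatestChunkStateWitnesses` column.
struct LatestWitnesses {
    entries: BTreeMap<WitnessKey, Vec<u8>>,
    total_bytes: u64,
    max_bytes: u64,
    max_count: usize,
}

impl LatestWitnesses {
    fn new(max_bytes: u64, max_count: usize) -> Self {
        Self { entries: BTreeMap::new(), total_bytes: 0, max_bytes, max_count }
    }

    /// Store a serialized witness, then evict lowest-height entries
    /// until both the size limit and the count limit hold again.
    fn save(&mut self, key: WitnessKey, serialized: Vec<u8>) {
        self.total_bytes += serialized.len() as u64;
        if let Some(replaced) = self.entries.insert(key, serialized) {
            self.total_bytes -= replaced.len() as u64;
        }
        while self.entries.len() > self.max_count || self.total_bytes > self.max_bytes {
            match self.entries.pop_first() {
                Some((_, evicted)) => self.total_bytes -= evicted.len() as u64,
                None => break,
            }
        }
    }
}
```

With the limits from the description, the set would be built as `LatestWitnesses::new(4_000_000_000, 60 * 30)`, reading 4GB as decimal gigabytes, which is itself an assumption.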

Witness saving could be an attack vector, because we save all witnesses, even the ones that failed validation. An attacker could spam the node with thousands of witnesses, all of which would be saved to the database, and that could cause a denial of service.
Because of that, the feature is guarded by a new config option: save_latest_witnesses. By default it's false, so production nodes won't save anything and won't be vulnerable to such attacks. It can be selectively enabled on test/canary nodes when needed.
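
A minimal sketch of that opt-in gate, assuming a hypothetical `ClientConfigSketch` struct and a plain `Vec` in place of the database. Only the option name `save_latest_witnesses` and its `false` default come from the description; everything else is illustrative.

```rust
/// Hypothetical subset of the node config. `bool::default()` is `false`,
/// which matches the "disabled by default" behaviour described above.
#[derive(Default)]
struct ClientConfigSketch {
    save_latest_witnesses: bool,
}

/// Persist an observed witness only when the operator opted in.
/// Returns whether the witness was stored. `store` is a stand-in for the database.
fn on_witness_observed(
    config: &ClientConfigSketch,
    store: &mut Vec<Vec<u8>>,
    serialized_witness: &[u8],
) -> bool {
    if !config.save_latest_witnesses {
        // Production default: nothing is written, so spamming witnesses cannot fill the disk.
        return false;
    }
    store.push(serialized_witness.to_vec());
    true
}
```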

A new database command is added to access the saved witnesses:

Print observed ChunkStateWitnesses at the given block height (and shard id).
Observed witnesses are only saved when `save_latest_witnesses` is set to true in config.json

Usage: neard database show-latest-witnesses [OPTIONS] --height <HEIGHT>

Options:
      --height <HEIGHT>      Block height (required)
      --shard-id <SHARD_ID>  Shard id (optional)
      --pretty               Pretty-print using the "{:#?}" formatting
      --binary               Print the raw &[u8], can be pasted into rust code
  -h, --help                 Print help

The tool prints the observed witnesses at a given height (and optionally shard id),
either as a debug print ({:?} or {:#?}) or as a binary blob that can be pasted into Rust code.
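
As a rough illustration of the "binary blob that can be pasted into Rust code" idea, the helper below renders bytes as a slice literal. `to_rust_literal` is a hypothetical name, and the exact text that `--binary` emits may differ from this.

```rust
/// Render bytes as a Rust slice literal, e.g. `&[1, 2, 255]`.
/// The `Debug` formatting of a byte slice already looks like an array literal,
/// so only the leading `&` has to be added.
fn to_rust_literal(bytes: &[u8]) -> String {
    format!("&{:?}", bytes)
}
```

The resulting text could then be pasted into a test as `let raw: &[u8] = &[1, 2, 255];` and deserialized there.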

I'm not sure if that's the best way to expose it for debugging, but it was the easiest one to implement. Maybe it should somehow be integrated with debug-ui? I'm not very familiar with it, idk if it'd make sense.

Fixes: #11110
Similar to: #10599

Save the latest observed instances of ChunkStateWitness in the database
for later analysis and debugging.

Sometimes witness validation goes wrong, and we're not sure why.
It's impossible to debug such cases without the witness, and currently
there's no way to view the offending witness. Saving the witnesses
in the database will allow us to inspect weird witnesses, even
if the node crashes.
Saving observed witnesses could be an attack vector. Someone
could send thousands of witnesses and we'd save them all,
which could overload the database and cause a denial of service.
We want to save invalid witnesses, so we can't really filter
anything out.

Let's add a config option which controls whether the witnesses
are saved. It's disabled by default, as saving witnesses poses
a security risk.
Add a command that prints the observed witnesses
with the given height and shard id.
@jancionear jancionear requested a review from a team as a code owner April 23, 2024 12:12
@jancionear jancionear requested review from akhi3030, pugachAG, Longarithm and robin-near and removed request for akhi3030 April 23, 2024 12:12
@jancionear jancionear added the A-stateless-validation Area: stateless validation label Apr 23, 2024
@jancionear
Contributor Author

I just saw that the issue description mentions that we already save witnesses in some cases x.x

> According to @robin-near and @pugachAG , it seems we are storing a completely invalid state witness, but if a witness is non-deterministically invalid (e.g. it is identified as invalid under specific state of chunk validator, such as missing contract code from cache), the data is gone forever and we have to replay a large block span, hoping to reproduce the issue.

Is this correct? I didn't find anything like that in the code. We save StateTransitionData, but that's a different thing 0_o

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 5.48780%, with 155 lines in your changes missing coverage. Please review.

Project coverage is 71.15%. Comparing base (f95087b) to head (dbbf460).
Report is 8 commits behind head on master.

Files                                                      Patch %   Lines
chain/chain/src/store/latest_witnesses.rs                  0.00%     121 Missing ⚠️
tools/state-viewer/src/latest_witnesses.rs                 0.00%     27 Missing and 1 partial ⚠️
chain/chain/src/garbage_collection.rs                      0.00%     3 Missing ⚠️
...nt/src/stateless_validation/chunk_validator/mod.rs      66.66%    1 Missing ⚠️
...src/stateless_validation/state_witness_producer.rs      66.66%    1 Missing ⚠️
tools/state-viewer/src/cli.rs                              0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11137      +/-   ##
==========================================
+ Coverage   71.10%   71.15%   +0.04%     
==========================================
  Files         773      775       +2     
  Lines      154176   154033     -143     
  Branches   154176   154033     -143     
==========================================
- Hits       109622   109596      -26     
+ Misses      40109    40003     -106     
+ Partials     4445     4434      -11     
Flag                     Coverage Δ
backward-compatibility   0.24% <0.00%> (-0.01%) ⬇️
db-migration             0.24% <0.00%> (-0.01%) ⬇️
genesis-check            1.42% <0.60%> (-0.01%) ⬇️
integration-tests        36.98% <4.87%> (+0.09%) ⬆️
linux                    69.55% <3.04%> (+0.02%) ⬆️
linux-nightly            70.64% <5.48%> (+0.08%) ⬆️
macos                    54.33% <2.43%> (+0.08%) ⬆️
pytests                  1.65% <0.60%> (-0.01%) ⬇️
sanity-checks            1.44% <0.60%> (-0.01%) ⬇️
unittests                66.79% <3.65%> (+0.02%) ⬆️
upgradability            0.29% <0.00%> (-0.01%) ⬇️


@@ -43,6 +43,10 @@ impl Client {
transactions_storage_proof,
)?;

if self.config.save_latest_witnesses {
self.chain.chain_store.save_lateset_chunk_state_witness(&state_witness)?;
Contributor

nit typo, lateset->latest

pub shard_id: u64,
/// Each witness has a random UUID to ensure that the key is unique.
/// It allows to store multiple witnesses with the same height and shard_id.
pub random_uuid: [u8; 16],
Contributor

Can this be the account id of the chunk producer that sent the witness? On the chunk producer's side there will be a single witness, and on the chunk validator's side we should expect a single witness from each chunk producer.

Contributor

I thought @pugachAG was working on a PR where we only processed one state witness for a given height, epoch, shard? This shouldn't then be needed?

Contributor

We should include epoch_id in the key as well

Contributor Author

We could, but I'm not sure if there's any benefit to it.
Usually there'll be only one witness per (height, shard_id). The situations where there is more than one will be weird, where anything can happen, including sending multiple witnesses by the same chunk producer.
For debugging I think it's best to save everything that we receive, without deduplication. If I were debugging witness issues I'd like to know that the chunk producer is sending duplicate messages, and deduplicating by account id would hide that from me.

Contributor

ok, yes, using account id would hide the case of the same chunk producer sending duplicate witnesses. And the witness already has the account id for debugging.

Contributor Author

> I thought @pugachAG was working on #11111 where we only processed one state witness for a given height, epoch, shard? This shouldn't then be needed?

I wanted to put saving witnesses before Anton's check. Debug information should contain all witnesses, even the ones that were rejected.

Contributor Author

> We should include epoch_id in the key as well

I think it's impossible to have multiple possible epochs at a height. There's exactly one valid block per height, and this block will always have a clearly defined epoch.

Contributor Author

Ok, but we could receive witnesses on different epochs. I can add an epoch filter as well.


impl LatestWitnessesKey {
/// `LatestWitnessesKey` has custom serialization to ensure that the binary representation
/// starts with big-endan height and shard_id.
Contributor

nit typo in big-endan


// Go over witnesses with increasing (height, shard_id) and remove them until the limits are satisfied.
// Height and shard id are stored in big-endian representation, so sorting the binary representation is
// the same as sorting the integeres.
Contributor

nit typo in integeres

@shreyan-gupta shreyan-gupta left a comment

oh crap, I forgot to post the comments... Had reviewed this some time ago.

@@ -43,6 +43,10 @@ impl Client {
transactions_storage_proof,
)?;

if self.config.save_latest_witnesses {
self.chain.chain_store.save_lateset_chunk_state_witness(&state_witness)?;
Contributor

Shouldn't we also be calling this function at process_chunk_state_witness where we first receive the state witness from an external chunk producer?

Contributor Author

It's also called in process_chunk_state_witness

witness: &ChunkStateWitness,
) -> Result<(), std::io::Error> {
let serialized_witness = borsh::to_vec(witness)?;
let serialized_witness_size: u64 =
Contributor

dumb question, why aren't we working with usize here?

Contributor Author

I have to serialize the size, so I wanted it to have a clearly defined representation. The exact width of usize depends on the architecture, and I didn't want that dependency in serialized data.
I guess it doesn't really matter, because no one is going to move the saved witnesses data to a different arch, but it feels more proper to use u64.
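
The point about a fixed-width representation can be sketched like this. `encode_size` and `decode_size` are hypothetical helpers, and the big-endian byte order is an assumption made for the example, not something this thread specifies.

```rust
/// Encode a byte length as exactly 8 bytes, regardless of the width of `usize`
/// on the machine that wrote it.
fn encode_size(len: usize) -> [u8; 8] {
    // `usize` is at most 64 bits wide on the targets Rust supports today,
    // so this conversion is not expected to fail in practice.
    let len = u64::try_from(len).expect("usize wider than 64 bits");
    len.to_be_bytes()
}

/// Decode the size again; the result is a `u64` on every architecture.
fn decode_size(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}
```

Had the size been written as a raw `usize`, a 32-bit build would produce 4 bytes and a 64-bit build 8 bytes for the same value, which is the ambiguity the reply wants to avoid.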

@jancionear
Contributor Author

I just learned that there's a neard view-state command. I think it would be a better place for show-latest-witnesses than neard database, so I'll move it there.

/// `LatestWitnessesKey` has custom serialization to ensure that the binary representation
/// starts with big-endian height and shard_id.
/// This allows to query using a key prefix to find all witnesses for a given height (and shard_id).
pub fn serialized(&self) -> [u8; 64] {
Contributor

why do we need manual ser/deser instead of just using borsh?
from what I understand we use borsh whenever possible when encoding data in the db

Contributor Author

I want the rows to be ordered by (height, shard_id) in big endian representation. I'm not sure if borsh would do that, it was quicker and easier to manually serialize it.
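
To illustrate why big-endian matters for the ordering, here is a simplified sketch. `SketchKey` and `key_prefix` are hypothetical: the real `LatestWitnessesKey` serializes to 64 bytes and carries more than the three fields shown here. For context, Borsh encodes integers in little-endian order, so the raw bytes of a Borsh-encoded height would not sort numerically.

```rust
/// Hypothetical, simplified key: big-endian height, big-endian shard_id, then a unique id.
struct SketchKey {
    height: u64,
    shard_id: u64,
    random_uuid: [u8; 16],
}

impl SketchKey {
    /// Big-endian puts the most significant byte first, so comparing the
    /// serialized bytes gives the same order as comparing (height, shard_id).
    fn serialized(&self) -> [u8; 32] {
        let mut out = [0u8; 32];
        out[0..8].copy_from_slice(&self.height.to_be_bytes());
        out[8..16].copy_from_slice(&self.shard_id.to_be_bytes());
        out[16..32].copy_from_slice(&self.random_uuid);
        out
    }
}

/// Prefix shared by every key at `height`, optionally narrowed down to one shard.
/// A prefix scan over the column would then return exactly those witnesses.
fn key_prefix(height: u64, shard_id: Option<u64>) -> Vec<u8> {
    let mut prefix = height.to_be_bytes().to_vec();
    if let Some(shard_id) = shard_id {
        prefix.extend_from_slice(&shard_id.to_be_bytes());
    }
    prefix
}
```

The random id at the end of the key keeps two witnesses with the same height and shard from overwriting each other, while the fixed-width prefix keeps them adjacent in iteration order.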

@jancionear jancionear added this pull request to the merge queue Apr 30, 2024
Merged via the queue into near:master with commit c2fb494 Apr 30, 2024
27 of 29 checks passed
@jancionear jancionear deleted the save-witness branch April 30, 2024 12:03
Successfully merging this pull request may close these issues.

Store N latest invalid state witnesses
5 participants