feat(o11y): Inter-process tracing #8004

nikurt · 2022-11-07T13:09:43Z

Serialize TraceId and SpanId into a new field of PeerMessage. This lets the receiving node link its traces to the trace that generated the network request.

TextMapPropagator is the interface designed to solve a similar problem, but given that:

- we have to invoke it manually
- our custom propagator is stateless

... following the TextMapPropagator interface doesn't add value. Setting a global text map propagator doesn't add value for the same reason.
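As a rough illustration of what carrying TraceId and SpanId in a message field could look like, here is a minimal sketch using only the standard library. The `TraceContext` struct, the `inject`/`extract` function names, and the big-endian byte encoding are assumptions made for this example; they are not the protobuf definition or the propagator from this PR.

```rust
use std::convert::TryInto;

/// Hypothetical wire form of the trace context attached to a message.
/// Assumption: ids travel as raw big-endian bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: Vec<u8>, // 16 bytes for a 128-bit trace id
    pub span_id: Vec<u8>,  // 8 bytes for a 64-bit span id
}

/// Hypothetical error type for malformed contexts received from a peer.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    BadTraceIdLen(usize),
    BadSpanIdLen(usize),
}

/// Stateless "inject": the sender calls this by hand when building a message.
pub fn inject(trace_id: u128, span_id: u64) -> TraceContext {
    TraceContext {
        trace_id: trace_id.to_be_bytes().to_vec(),
        span_id: span_id.to_be_bytes().to_vec(),
    }
}

/// Stateless "extract": the receiver recovers the ids or rejects bad input.
pub fn extract(ctx: &TraceContext) -> Result<(u128, u64), ExtractError> {
    let trace: [u8; 16] = ctx
        .trace_id
        .as_slice()
        .try_into()
        .map_err(|_| ExtractError::BadTraceIdLen(ctx.trace_id.len()))?;
    let span: [u8; 8] = ctx
        .span_id
        .as_slice()
        .try_into()
        .map_err(|_| ExtractError::BadSpanIdLen(ctx.span_id.len()))?;
    Ok((u128::from_be_bytes(trace), u64::from_be_bytes(span)))
}
```

Because both functions are free-standing and keep no state, nothing would be gained in this sketch by wrapping them in a trait object or registering them globally, which matches the reasoning above.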

Because we want the inter-process tracing to be enabled at the debug level:

- The handling of SendMessage needs to enable tracing at the debug level.
- The corresponding PeerManagerActor tracing needs to be enabled at the debug level as well.

https://pagodaplatform.atlassian.net/browse/ND-172

Move the check for is_height_processed before process_block_header. Previously, this check happened afterwards, which meant the node would re-process the block header (which takes a few ms) and re-broadcast an invalid block before dropping it. When many invalid blocks are circulating in the network, this can make the node too busy.
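The effect of the reordering can be sketched as follows. This is a simplified model, not the client code: the `Node` struct, the `Outcome` enum, `receive_block`, and the counter are hypothetical, and the two method names are borrowed from the description only to show the order of the calls.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Accepted,
    AlreadyProcessed,
}

/// Hypothetical node state: which heights were seen, and how often the
/// (comparatively expensive) header processing actually ran.
pub struct Node {
    processed_heights: HashSet<u64>,
    pub header_checks: u64,
}

impl Node {
    pub fn new() -> Self {
        Node { processed_heights: HashSet::new(), header_checks: 0 }
    }

    fn is_height_processed(&self, height: u64) -> bool {
        self.processed_heights.contains(&height)
    }

    fn process_block_header(&mut self, _height: u64) {
        // Stand-in for the work that takes a few ms per header.
        self.header_checks += 1;
    }

    pub fn receive_block(&mut self, height: u64) -> Outcome {
        // Cheap check first: bail out before doing any header work.
        if self.is_height_processed(height) {
            return Outcome::AlreadyProcessed;
        }
        self.process_block_header(height);
        self.processed_heights.insert(height);
        Outcome::Accepted
    }
}
```

With the check placed first, a repeated block at the same height costs one set lookup instead of a full header pass.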

Display the list of peers stored in the peer store, together with information on when we attempted to connect to them. You can see it working at http://34.147.53.32:3030/debug/pages/network_info. It is at the bottom of the page, and you have to click the button to fetch this info (there are often over 10k peers, so loading takes a while).

Instead of checking the number of values and their sizes, the caches are now limited by the actual (approximated) memory consumption. This changes what `total_size` in `TrieCacheInner` means, which is also observable through Prometheus metrics. Existing configuration works with slightly altered effects. The number of entries converts to an implicit size limit. Since the explicit default size limit is currently 3GB and the default max entries is set to 50k, the implicit limit of 50k * 1000B = 50MB is the stronger one. This still limits the number of largest entries to 50k but allows the cache to be filled with more entries when the values are smaller. For shard 3, however, where the number of entries is set to 45M in code, the memory limit of 3GB is active. Since we change how this limit is calculated, we will see fewer entries cached with this change. Shard 3 should still be okay since we now have a prefetcher in place that works even when the cache is empty.
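The arithmetic above can be checked with a small sketch. It assumes, as the numbers in the description suggest, that each configured entry counts as 1000 bytes towards the implicit limit and that the limit in effect is the smaller of the implicit and explicit ones. `effective_limit_bytes` and the constant are hypothetical names, not identifiers from the codebase, and GB/MB are taken as decimal units.

```rust
/// Assumed per-entry allowance, taken from the "50k * 1000B" arithmetic.
const ASSUMED_BYTES_PER_ENTRY: u64 = 1_000;

/// Hypothetical helper: returns the limit that actually binds, given a
/// configured entry count and an explicit byte limit.
pub fn effective_limit_bytes(max_entries: u64, explicit_limit_bytes: u64) -> u64 {
    // The entry count converts to an implicit byte limit.
    let implicit_limit_bytes = max_entries.saturating_mul(ASSUMED_BYTES_PER_ENTRY);
    // Whichever limit is smaller is the one the cache hits first.
    implicit_limit_bytes.min(explicit_limit_bytes)
}
```

Under these assumptions the default configuration (50k entries, 3GB) is bound by the implicit 50MB, while the shard 3 configuration (45M entries, 3GB) is bound by the explicit 3GB, since 45M * 1000B = 45GB.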

This adds code that mirrors traffic from a source chain (e.g. mainnet or testnet) to a test chain with genesis state forked from the source chain. The goal is to produce traffic that looks like source chain traffic. So in a mocknet test where we fork mainnet state for example, we can then actually observe what happens when we subsequently get traffic equivalent to mainnet traffic after the fork point. For more info, see the README in this commit.

`anyhow` is the type to return from `main`; we don't get any value here from preserving well-typed errors, and we create more work down the line to add all future error variants: *surely* we can fail due to more than these two errors, right?

This will print easier-to-understand info on which source chain transactions are making it into the target chain. For now we just log them to debug logs, but it would be nice to have some HTTP debug page that shows an easy-to-understand summary.

@posvyatokum

Add a `cold_store` Cargo feature which enables the option to configure the node with cold storage. At the moment, all this does is open the cold database and doesn’t enable any other features. The idea is that this can now allow experimenting with code that needs access to the cold storage.

…k processing to client (#7898) This PR is a pure refactoring. The context is that any processing details should be put in Client instead of ClientActor. ClientActor should just serve as a coordinator class that handles messages, checks triggers, and immediately passes them to Client. This is better for testing, since we can't write unit tests for any logic in ClientActor, and also better for code readability, as the logic is not scattered across two classes. This PR only moves the part around block processing. The rest is tracked by #7899.

chain/network/src/network_protocol/propagator.rs

chain/network/src/network_protocol/network.proto

chain/network/src/network_protocol/mod.rs

pompon0 · 2022-11-09T13:41:58Z

chain/network/src/network_protocol/mod.rs

+                if proto_msg.trace_context.is_some() {
+                    let propagator = NodePropagator::new();
+                    if let Ok(extracted_span_context) =
+                        propagator.extract_span_context(&proto_msg.trace_context)


shouldn't the parsing error be logged here?

I'm worried about spamming the logs.
I wish I had LOG_EVERY_N_SECONDS().
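For what it's worth, a time-based gate along those lines is small enough to sketch with the standard library alone. Everything here is hypothetical (`LogEvery` and `should_log` are made-up names, and nothing like this is claimed to exist in the codebase); it only shows the idea of emitting at most one log line per interval.

```rust
use std::time::{Duration, Instant};

/// Hypothetical rate limiter: lets a log statement through at most once
/// per `interval`.
pub struct LogEvery {
    interval: Duration,
    last: Option<Instant>,
}

impl LogEvery {
    pub fn new(interval: Duration) -> Self {
        LogEvery { interval, last: None }
    }

    /// Returns true if the caller should log now. The first call always
    /// passes; later calls pass only once `interval` has elapsed since the
    /// last call that passed.
    pub fn should_log(&mut self, now: Instant) -> bool {
        match self.last {
            Some(prev) if now.saturating_duration_since(prev) < self.interval => false,
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}
```

The parse-error branch could then guard its warning with something like `if gate.should_log(Instant::now()) { ... }`, where `gate` would have to live somewhere that outlasts a single message, which is the part this sketch leaves open.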

chain/network/src/network_protocol/proto_conv/trace_context.rs

nikurt and others added 30 commits October 25, 2022 14:36

Prototype

b1af20a

Do not use flat storage for storage_write (#7885)

34f847e

As an intermediate step, we will only enable flat storage for storage_read, but not storage_write.

doc: fix typos (#7904)

2b1a8bb

* doc: fix typo Acton -> Action
* doc: fix typo falied -> failed
* doc: fix typo recieve -> receive
* doc: fix typo infomation -> information
* Update tools/delay-detector/README.md

Co-authored-by: Michal Nazarewicz <mina86@mina86.com>

crypto: Remove unused randomness module (#7907)

445ce05

The module has been introduced in commit cbcf678: ‘Cryptographic code for randomness beacon’ and then never used. Get rid of it.

doc: update logo (#7905)

2f69ec3

Update near logo in README.md #7875

[Debug UI] Fixed bug in network html when syncing (#7906)

3cd74cd

List of peers wasn't printed if we were in sync mode (especially during header/state sync)

removed messages of unimplemented EpochSync (#7911)

90c943b

EpochSync was never implemented, there is just a bunch of stubs left here and there. Removing them.

Prefer implementing Display to From<T> for String (#7914)

2667e4d

There’s still `impl From<AccountId> for String`. It’s left intentionally as it avoids string allocation when used compared to using Display.
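A minimal sketch of the difference, assuming a simplified `AccountId` that owns its text as a `String` (the real type may store it differently; this stand-in and its `new`/`as_str` helpers are hypothetical):

```rust
use std::fmt;

/// Hypothetical stand-in for AccountId, assumed to own a String.
pub struct AccountId(String);

impl AccountId {
    pub fn new(s: &str) -> Self {
        AccountId(s.to_string())
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

impl fmt::Display for AccountId {
    // Going through Display (e.g. `to_string()`) formats into a new buffer.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

impl From<AccountId> for String {
    // Consuming the id hands over the existing buffer; nothing is copied.
    fn from(id: AccountId) -> String {
        id.0
    }
}
```

In this sketch `id.to_string()` has to allocate a fresh buffer, while `String::from(id)` reuses the one the id already owns, which is the saving the commit message refers to.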

replaced Client struct with async_trait (#7913)

7698da9

The concrete implementation wrapping ClientActor and ViewClientActor has been moved to near_client crate. Network(View)ClientMessage will be moved to near_client crate in a separate PR.

chain: remove TxStatusError::InvalidTx variant (#7915)

9683b6f

The TxStatusError::InvalidTx variant is never constructed so get rid of it.

core: add chain Error → TxStatusError conversion (#7912)

cb50b86

Having a conversion from near_chain_primitives::Error to TxStatusError eliminates a handful of trivial map_err calls.
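To illustrate why such a conversion removes boilerplate, here is a self-contained sketch. The two enums, their variants, and the `load_tx_height`/`tx_status` functions are hypothetical stand-ins rather than the real definitions; only the pattern (a `From` impl picked up by the `?` operator) is the point.

```rust
/// Hypothetical stand-in for the chain error type.
#[derive(Debug, PartialEq, Eq)]
pub enum ChainError {
    NotFound(String),
    Other(String),
}

/// Hypothetical stand-in for TxStatusError.
#[derive(Debug, PartialEq, Eq)]
pub enum TxStatusError {
    MissingTransaction(String),
    InternalError(String),
}

impl From<ChainError> for TxStatusError {
    fn from(err: ChainError) -> Self {
        match err {
            ChainError::NotFound(hash) => TxStatusError::MissingTransaction(hash),
            ChainError::Other(msg) => TxStatusError::InternalError(msg),
        }
    }
}

/// Hypothetical lookup that fails with the chain error type.
fn load_tx_height(hash: &str) -> Result<u64, ChainError> {
    if hash == "known" {
        Ok(7)
    } else {
        Err(ChainError::NotFound(hash.to_string()))
    }
}

pub fn tx_status(hash: &str) -> Result<u64, TxStatusError> {
    // `?` applies the From impl, so no explicit `.map_err(...)` is needed.
    let height = load_tx_height(hash)?;
    Ok(height)
}
```

Without the `From` impl, every such call site would need its own `.map_err(...)` closure that spells out the same mapping.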

doc: gas cost parameter chapter (#7918)

e2856cf

Changelog: include o11y changes (#7889)

b8fefd3

cc: @posvyatokum

moved Network(View)Client(Messages/Responses) to near_client (#7908)

390f52b

These actix messages are an implementation detail of near_client crate.

doc: Minor grammar fixes (#7922)

1d283bf

Fixes some minor grammar issues from #7918.

core: remove unused to_base58 function (#7920)

1e621ce

chain: remove unnecessary mut from self reference (#7924)

21a82cf

store: Update cold storage with one column (Block) #7744 (#7745)

4777e77

Fix proposals shuffling implementation (#7921)

c45f615

[Debug UI] Improve last-blocks debug page (#7902)

78836a9

* New last-blocks debug page.
* Use JSX with babel
* Minor fix
* Minor fix 2
* Rename is_block_missing

doc: fix gas section links and other small fixes (#7931)

df22fc7

- links for gas_param sections in summary didn't work
- rename gas_param to just gas
- pin link to commit
- also mention that we spend gas on tx other than wasm

moved PeerStore from PeerManagerActor to NetworkState. (#7890)

c58fb7d

Also removed some actix messages which are not needed any more.

nikurt and others added 3 commits November 8, 2022 13:31

Use protobuf instead of serializing u128 and u64 to strings.

81f933a

Merge branch 'master' into nikurt-interprocess

e218664

changelog

f63326d

nagisa approved these changes Nov 8, 2022

pompon0 reviewed Nov 8, 2022

chain/network/src/network_protocol/propagator.rs (outdated, resolved)

nikurt and others added 2 commits November 8, 2022 15:31

order

3daf58c

Merge branch 'master' into nikurt-interprocess

4f7912e

pompon0 reviewed Nov 8, 2022

chain/network/src/network_protocol/propagator.rs (outdated, resolved)

nikurt and others added 2 commits November 9, 2022 14:10

Move enum to proto

da24895

Merge branch 'master' into nikurt-interprocess

739d41f

nikurt requested a review from pompon0 November 9, 2022 13:12