feat(o11y): Configure opentelemetry filter at runtime #7701

nikurt · 2022-09-27T11:58:55Z

Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows:

echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard

…he_ and near_shard_cache_ Crate `itoa` provides the fastest way of converting ints as strings.

Turns out `CryptoHash::hash_borsh(&[foo])` doesn’t quite do what one might expect. Rather than hashing one-element slice (which is serialised as length of one plus serialisation of the element) it hashes one-element array (which is serialised as just the element). Introduce `CryptoHash::hash_borsh_slice` to provide less error-prone method and fix tests which fell victim of the confusion. The tests were broken by commit d5f650a: ‘Prefer CryptoHash::hash_borsh to try_to_vec+hash combo’.

Part of #6910 The grand plan is to separate EpochManager out of NightshadeRuntime. As a first step, I want to separate the *types* of runtime and epoch manager, while keeping them in the same object. I want to do this via the following series of steps: ``` trait RuntimeAdapter {} ``` ``` trait EpochManagerAdapter {} trait RuntimeAdapter: EpochManagerAdapter {} ``` ``` trait EpochManagerAdapter {} trait RuntimeAdapter {} ``` This PR introduces the EpochManagerAdapter infra. You can see the rough idea of the plan in #7606

Part of #6910, depends on #7652 Must be reviewed per commit.

#7656) Part of #6910, depends on #7655 Must be reviewed per commit.

What this test is intended to do is to spawn a local neard node and to run near-api-js tests against that. The purpose is to detect unintentional changes to near public API. We have disabled actually running the test a while ago, and today the build broke b/c api-js moved from yarn to pnpm. I think at this it's safe to kill this infra completely, as we weren't running it for a while. It would of course be nice to have some in-repo tests covering public JSON API, but I think we are currently server well-enough by betanet and testnet anyway. Testing public APIs via running tests of downstream projects upstream is clearly a wrong way to do that!

Use a faster way of converting `shard_id` to a string. Crate `itoa` provides the fastest way of converting ints to strings. Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`. ``` 1948843343 near_chunk_cache_hits # Not hot 399446179 near_chunk_cache_misses # Not hot 376782133 near_shard_cache_hits # Not hot 276622570 near_peer_manager_messages_time_count 239378614 near_applied_trie_deletions # Optimized 206898552 near_database_op_latency_by_op_and_column_count 184492507 near_peer_message_sent_by_type_total 179740890 near_peer_message_received_total 179344494 near_peer_message_received_by_type_total 168098152 near_applied_trie_insertions # Optimized 160684011 near_shard_cache_pop_hits # Not hot 100536077 near_client_triggers_time_count 100064991 near_requests_count_by_type_total 98868135 near_client_messages_processing_time_count 98868135 near_client_messages_count 97408226 near_peer_client_message_received_by_type_total 94021301 near_network_routed_msg_latency_count 80880283 partial_encoded_chunk_request_delay_count 78394032 near_shard_cache_gc_pop_misses 51277767 partial_encoded_chunk_response_delay_count 43248133 near_partial_encoded_chunk_request_processing_time_count 22664046 near_shard_cache_misses # Not hot ```

Adding an explicit "test fixture" struct makes call-sites more readable (named args!) and makes making further changes easier. No particular grand plan here, just figured this small change to be a marginal improvement!

Part of #6910, depends on #7656 This is slighly less trivial than pure move -- the API is adjusted to remove direct dependency on doomslug which lives in a different crate

Part of #6910, depends on #7657 Must be reviewed per commit.

Part of #6910, depends on #7660 Must be reviewed per commit.

Part of #6910, depends on #7665 Previously, this was only available on NightshadeRuntime, but seems fine to expose everywhere.

https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html Notable changes: * rust-analyzer is now available via rustup :o) * cargo supports "workspace inheritance". We can use this to cut some repeated deps and avoid repating rust-version everywhere. I'll handle that in a separate PR * poll_fn and ready! are stabilized, we might use them for some of our async stuff * add Nintendo Switch as tier 3 target so I guess we can run neard on switch now?

* monitor_peers_max_period - controls how ofen we think about connecting to a new peer (default is 60 seconds) * peer_expiration_duration - controls when we remove the peer from candidate set (default - 7 days since we heard from it) Also hooked up peer_stats_period to use the value from config if set.

See https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/chunk.20management.20tests

Prefetch receipt meta data (account and access keys) ahead of time. This recent performance optimization has been disabled by default. In lab settings, performance improvement is confirmed. Using the estimator to measure the time it takes to process empty receipts, on a DB with 2 million accounts, on a local SSD, with enabled shard caches. The result is as follows. - sender = receiver: 737us -> 386 us - sender != receiver: 1014us -> 644us - overhead per block: 6.9us -> 7.4us Note that this is with 100 empty receipts in the same block, with all different accounts. In real traffic it usually does not happen that so many different accounts are accessed in the same block. But it is allowed and we must be able process this case in reasonable time. So even if it might not help in the average case, it makes sense to activate this feature to speed up the worst-case. Currently we use 8 IO threads per shard. Repeated experiments with more threads showed no difference. Decreasing it to 4 threads is about equal to 8 threads. Going lower is significantly worse. Thus, overall, 8 threads seems reasonable here. Canary nodes in testnet and mainnet with the feature enabled show that the feature also works as expected on real traffic. The memory impact is minimal, usually less than 40MB of reserved capacity, which is less than 8MB actual memory because 8 threads reserve 4MB each ahead of actually fetching the data.

#7670

* doc: wordsmithing * Update book/src/README.md Co-authored-by: Akhilesh Singhania <akhi@near.org> Co-authored-by: Akhilesh Singhania <akhi@near.org>

This PR bulk-moves the content from docs to book, and reformats it according to our code style (mostly wrapping). I very specifically don't want to proof-read the contents, fix broken links, etc. I want to do that initially, but I'd love to start with just moving the stuff over.

docs feels like a much more self-explanatory name than book. I originally put the stuff into book folder to not mess up existing docs, but now that those are moved into the book, we can set up proper name. This was the plan all along! This also finalizes, for the time being * the name "Guide to Nearcore Development" (unispiringly stolen from https://github.com/rust-lang/rustc-dev-guide/blob/master/book.toml) * the url https://near.github.io/nearcore/. We can add some fancier DNS here, but I don't see a big need here. I also hope that most links would be *internal* to the guide, so I wouldnt' worry about URL stability too much. This is by design not the reference doc!

It has been a while since the last time we cut the crates releases. Let's do that with 1.29.0 release of nearcore. Tagged: https://github.com/near/nearcore/releases/tag/crates-0.15.0 Notable changes: - #6924 - #6941 - (BREAKING CHANGE) #7308 Check here for a full changelog: crates-0.14.0...crates-0.15.0

Make it clear what kind of thing is missing from the DB. Ideally, we'd add a db_col field to the error, but that's going to be a much larger change.

Also see #7669

…nd at runtime.

resolves #7004, see that issue for reasoning behind the change

See #7674 for more details.

This PR implements FlatStorageState, the struct that manages deltas in flat storage. After this PR, the main implementation of flat storage is mostly ready. The next step is to add more tests.

* Upgrade tracing and opentelemetry crates. * Avoid touching Cargo.toml's for minior version upgrade of `tracing`. * HTTP exporter is not the best but it: a) links b) doesn't require protoc * Undo opentelemtry upgrade, because it introduces an external dependency on protoc or link-time errors. Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>

nagisa · 2022-09-27T14:16:53Z

core/o11y/src/lib.rs

    Handle<
-        Filtered<
-            tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>,
-            EnvFilter,
+        LevelFilter,
+        Layered<
+            Filtered<
+                tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>,
+                reload::Layer<EnvFilter, Registry>,
+                Registry,
+            >,
            Registry,
        >,
-        Registry,
    >,


Yikes. I wish we were able to use TAIT.

core/o11y/src/lib.rs

nagisa

Overall seems good.

neard/src/log_config_watcher.rs

core/o11y/src/lib.rs

Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node. Tested that all combinations of logs verbosity and opentelemetry verbosity work well together. Tested by reconfiguring logging as follows: ``` echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard ```

nikurt and others added 30 commits September 21, 2022 13:29

Optimize two metrics which are incremented most often: near_chunk_cac…

effb47b

…he_ and near_shard_cache_ Crate `itoa` provides the fastest way of converting ints as strings.

Dynamic configuration of opentelemetry level

03d8423

refactor: move validator accessor to EpochManagerHandle (#7655)

919ff26

Part of #6910, depends on #7652 Must be reviewed per commit.

refactor: move signature verification functions to EpochManagerAdapter (

043cbb6

#7656) Part of #6910, depends on #7655 Must be reviewed per commit.

refactor: prettify chunk management test (#7659)

60a7e06

Adding an explicit "test fixture" struct makes call-sites more readable (named args!) and makes making further changes easier. No particular grand plan here, just figured this small change to be a marginal improvement!

refactor: move verify_approvals_and_threshold_orphan to EMA (#7657)

3287c48

Part of #6910, depends on #7656 This is slighly less trivial than pure move -- the API is adjusted to remove direct dependency on doomslug which lives in a different crate

refactor: move shard layout API to EpochManagerAdapter (#7660)

0a01cb1

Part of #6910, depends on #7657 Must be reviewed per commit.

move block->epoch APIs to EpochManagerHandle (#7665)

beada8d

Part of #6910, depends on #7660 Must be reviewed per commit.

refactor: expose get_epoch_id as a method of EpochManagerAdapter (#7666)

a38d4ba

Part of #6910, depends on #7665 Previously, this was only available on NightshadeRuntime, but seems fine to expose everywhere.

fix: actually test dropping chunk forwarding (#7664)

7fe60db

See https://near.zulipchat.com/#narrow/stream/295558-pagoda.2Fcore/topic/chunk.20management.20tests

feat: scaffold for dev guide (#7668)

bf59f61

doc: book skeloton (#7675)

ae4bb2c

#7670

doc: improve wording (#7678)

ee801d7

* doc: wordsmithing * Update book/src/README.md Co-authored-by: Akhilesh Singhania <akhi@near.org> Co-authored-by: Akhilesh Singhania <akhi@near.org>

refactor: more useful error message (#7684)

ac95593

Make it clear what kind of thing is missing from the DB. Ideally, we'd add a db_col field to the error, but that's going to be a much larger change.

Move chunk completion out of ShardsManager. (#7622)

201a24f

chore: Update a couple of more crates to 1.64 (#7688)

a2171e4

Also see #7669

Upgrade tracing and opentelemetry crates.

d5cd375

Dynamic configuration of opentelemetry level

d9e8127

Prototyping separate reloadable layers for filters of logs and traces

08ae5ef

Log and opentelemetry layers can reload their filters independently a…

ce38f2d

…nd at runtime.

jakmeier and others added 10 commits September 27, 2022 13:56

feat: always display uncertain estimation warnings (#7690)

724d8ea

resolves #7004, see that issue for reasoning behind the change

chore: Use inheritance to reduce copy-paste in Cargo.tomls (#7687)

c12b29a

See #7674 for more details.

Fix upgradable.py db migration flags too; like #7607 (#7694)

9dd83c7

Implement FlatStorageState (#7663)

ca00ea1

This PR implements FlatStorageState, the struct that manages deltas in flat storage. After this PR, the main implementation of flat storage is mostly ready. The next step is to add more tests.

Remove DBCol::ChunkPerHeightShard (#7671)

f14956a

fix: logo URL in README.md (#7696)

a17dfc8

chore: remove unused dependency (#7699)

bcba369

Upgrade tracing and opentelemetry crates.

677f7a1

Merge

1e20b6c

nikurt requested a review from a team as a code owner September 27, 2022 11:58

nikurt requested a review from mm-near September 27, 2022 11:58

nikurt and others added 2 commits September 27, 2022 14:02

serde

7cd1bbd

Merge branch 'master' into nikurt-nodekey

3d72c51

nikurt requested a review from nagisa September 27, 2022 13:13

Merge branch 'master' into nikurt-nodekey

4d227ef

nagisa reviewed Sep 27, 2022

View reviewed changes

core/o11y/src/lib.rs Outdated Show resolved Hide resolved

nagisa reviewed Sep 27, 2022

View reviewed changes

core/o11y/src/lib.rs Outdated Show resolved Hide resolved

nagisa reviewed Sep 27, 2022

View reviewed changes

neard/src/log_config_watcher.rs Outdated Show resolved Hide resolved

core/o11y/src/lib.rs Outdated Show resolved Hide resolved

core/o11y/src/lib.rs Outdated Show resolved Hide resolved

nikurt and others added 2 commits September 28, 2022 14:52

Review comments

dab9da9

Merge branch 'master' into nikurt-nodekey

235366e

nagisa reviewed Sep 28, 2022

View reviewed changes

core/o11y/src/lib.rs Outdated Show resolved Hide resolved

nagisa approved these changes Sep 28, 2022

View reviewed changes

Generic type alias

c8a97d7

nikurt added the S-automerge label Sep 28, 2022

near-bulldozer bot merged commit c6f0059 into master Sep 28, 2022

near-bulldozer bot deleted the nikurt-nodekey branch September 28, 2022 18:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(o11y): Configure opentelemetry filter at runtime #7701

feat(o11y): Configure opentelemetry filter at runtime #7701

nikurt commented Sep 27, 2022 •

edited

nagisa Sep 27, 2022

nagisa left a comment

feat(o11y): Configure opentelemetry filter at runtime #7701

feat(o11y): Configure opentelemetry filter at runtime #7701

Conversation

nikurt commented Sep 27, 2022 • edited

nagisa Sep 27, 2022

Choose a reason for hiding this comment

nagisa left a comment

Choose a reason for hiding this comment

nikurt commented Sep 27, 2022 •

edited