feat(o11y): Export basic node config values as static opentelemetry attributes #7711

nikurt · 2022-09-28T12:02:49Z

This PR lets us make sure that chain_id, node_id and account_id of the node are attached as attributes to every root span. That lets us easily distinguish traces from different nodes when multiple nodes export traces to the same backend.

To make it possible, the PR refactors creation of the layers. The new setup:

Initialize a logging-only subscriber at the very beginning of neard starting. Make it a thread default subscriber.
After NearConfig is read, initialize a different subscriber, which has logging, tracing and iotracing enabled, and make that subscriber the global default.

Note that with this PR only neard run command can enable tracing. We definitely don't need tracing for init, localnet and recompress-storage commands. However, view-state may benefit from tracing in some cases.

…he_ and near_shard_cache_ Crate `itoa` provides the fastest way of converting ints as strings.

Use a faster way of converting `shard_id` to a string. Crate `itoa` provides the fastest way of converting ints to strings. Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`. ``` 1948843343 near_chunk_cache_hits # Not hot 399446179 near_chunk_cache_misses # Not hot 376782133 near_shard_cache_hits # Not hot 276622570 near_peer_manager_messages_time_count 239378614 near_applied_trie_deletions # Optimized 206898552 near_database_op_latency_by_op_and_column_count 184492507 near_peer_message_sent_by_type_total 179740890 near_peer_message_received_total 179344494 near_peer_message_received_by_type_total 168098152 near_applied_trie_insertions # Optimized 160684011 near_shard_cache_pop_hits # Not hot 100536077 near_client_triggers_time_count 100064991 near_requests_count_by_type_total 98868135 near_client_messages_processing_time_count 98868135 near_client_messages_count 97408226 near_peer_client_message_received_by_type_total 94021301 near_network_routed_msg_latency_count 80880283 partial_encoded_chunk_request_delay_count 78394032 near_shard_cache_gc_pop_misses 51277767 partial_encoded_chunk_response_delay_count 43248133 near_partial_encoded_chunk_request_processing_time_count 22664046 near_shard_cache_misses # Not hot ```

…nd at runtime.

nagisa · 2022-09-28T14:09:37Z

neard/src/cli.rs

+        let _subscriber_guard = default_subscriber(
+            make_env_filter(neard_cmd.opts.verbose_target())?,
+            &neard_cmd.opts.o11y,
+        )
+        .local();


An approach I had used in the past was instead:

Save all the events, spans into a some sort of buffer;

Once the configuration is loaded, initialize the logging system;

Replay all of the saved events, spans again, now that the subscriber has been initialized.

While the current approach works for now, it is probably going to be quite inconvenient if there's a need to track down some consequential information being emit during the prep stage.

I actually like the current approach, because it:

lets us easily and reliably configure logging for simple uses cases, such as tools/storage-usage-delta-calculator/src/main.rs . Previously those use cases created tokio runtime and then initialized logging. I remember seeing that one of the tools was not able to log anything at all because of mis-managing that tokio runtime, but I didn't have time to fix it.

Logging from the main()-thread directly is good enough and unlikely that we need non-blocking logging from that thread. Ideally a node should be up and running after a restart as soon as possible, which means it is supposed to do as little work as possible in the main()-thread.

Maybe I don't understand how buffering is supposed to work, but here is my understanding. The main()-thread buffers the spans and logs until nearconfig is available, which means anything logged before loading nearconfig will only be shown to the user after nearconfig is loaded. For example, if an error is encountered when reading nearconfig, that error will not be able to be shown to the user.

I think that’s fine, yeah. The only thing I’m not seeing is this subscriber guard being dropped sometime around the point in time where the global subscriber is initialized.

This, I believe, would mean that the main thread uses the simplified logger indefinitely, which doesn’t seem quite right to me.

uses the simplified logger indefinitely

Which is good, isn't it? It lets us log messages after actix is shutdown. For example, such messages as the RocksDB shutdown message:

info!(target: "neard", "Waiting for RocksDB to gracefully shutdown");

Well, once there is a problem, being able to figure out what version a broken node was running is useful, isn’t it?

Though, again, this is kinda back to the discussion about whether we consider needing to hunt down and look at the local logs for the particular broken instance. I personally don’t think that’s super convenient, and would rather if everything eventually made into the same place (for regular neard validator deployments, anyway.)

It isn’t necessarily the case that this needs to be moved per se, I could see this being solved by duplication (i.e. log the version again after initializing the otel logger.)

The caveat is that I’m not an operating a validator, so all I’m saying is basically just me armchairing.

Part of #6910

Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node. Tested that all combinations of logs verbosity and opentelemetry verbosity work well together. Tested by reconfiguring logging as follows: ``` echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard ```

We create FlatStorageState on Chain creation, but it currently happens for any RuntimeAdapter impl. Because KeyValueRuntime doesn't and shouldn't support flat storage, it leads to tests failures. Here we move FSS creation inside RuntimeAdapter impl. ## Testing https://buildkite.com/nearprotocol/nearcore-flat-state/builds/60#018383c5-7838-4e5b-86f1-da2baa381789 - number of failing tests drops from 77 to 4 - which are failing due to other reasons

* Use the underlying reproducible random impls. * Improve testing * fix deny * fmt Co-authored-by: Michal Nazarewicz <mina86@mina86.com>

…nd at runtime.

nagisa

Overall LGTM, but needs to be fixed to pass the CI.

nagisa · 2022-09-29T10:17:27Z

neard/src/cli.rs

@@ -58,7 +58,7 @@ impl NeardCmd {
            }
        }


Should we move these, or at least the version log, to after the global subscriber has been initialized?

Why do you want the version log to be in the global subscriber? Restarts are relatively rare, and I don't expect to find this specific log line in Grafana Loki or Tempo.

Overall, I'd still want to keep this version message logged to the simple thread-default subscrtiber:

only the run command initializes the global subscriber.

this info can and should be printed as soon as the binary starts. I don't see a point in delaying this until after the NearConfig is available.

nagisa · 2022-09-29T10:19:46Z

neard/src/cli.rs

+        let _subscriber_guard = default_subscriber(
+            make_env_filter(neard_cmd.opts.verbose_target())?,
+            &neard_cmd.opts.o11y,
+        )
+        .local();


I think that’s fine, yeah. The only thing I’m not seeing is this subscriber guard being dropped sometime around the point in time where the global subscriber is initialized.

This, I believe, would mean that the main thread uses the simplified logger indefinitely, which doesn’t seem quite right to me.

…ttributes (#7711) This PR lets us make sure that `chain_id`, `node_id` and `account_id` of the node are attached as attributes to every root span. That lets us easily distinguish traces from different nodes when multiple nodes export traces to the same backend. To make it possible, the PR refactors creation of the layers. The new setup: * Initialize a logging-only subscriber at the very beginning of `neard` starting. Make it a **thread default** subscriber. * After NearConfig is read, initialize a different subscriber, which has logging, tracing and iotracing enabled, and make that subscriber the **global default**. Note that with this PR only `neard run` command can enable tracing. We definitely don't need tracing for `init`, `localnet` and `recompress-storage` commands. However, `view-state` may benefit from tracing in some cases.

I think IO traces were not working in the state viewer since #7711 changed how initialization works. Using state viewer to generate these traces is one of the main use cases of the IO traces, so we want this to work. This PR adds initialization to the default logger, which is used in the state viewer.

nikurt and others added 9 commits September 28, 2022 13:09

Optimize two metrics which are incremented most often: near_chunk_cac…

006bcab

…he_ and near_shard_cache_ Crate `itoa` provides the fastest way of converting ints as strings.

Dynamic configuration of opentelemetry level

18e4ea3

Upgrade tracing and opentelemetry crates.

0a57b71

Dynamic configuration of opentelemetry level

208f3d7

Prototyping separate reloadable layers for filters of logs and traces

5f37144

Log and opentelemetry layers can reload their filters independently a…

f21600d

…nd at runtime.

Init subscriber in tokio, but this can't shutdown.

bd43150

Initialize Opentelemetry layer after the nearconfig is available.

c68a8a7

nikurt requested a review from a team as a code owner September 28, 2022 12:02

nikurt requested review from mm-near and nagisa September 28, 2022 12:02

nagisa reviewed Sep 28, 2022

View reviewed changes

nikurt and others added 14 commits September 28, 2022 16:58

Simplify

8ba173e

refactor: move shard_layout_from_prev_block to EMA (#7673)

87706b5

Part of #6910

Use the underlying reproducible random impls instead of StdRng. (#7717)

5e6f94b

* Use the underlying reproducible random impls. * Improve testing * fix deny * fmt Co-authored-by: Michal Nazarewicz <mina86@mina86.com>

Upgrade tracing and opentelemetry crates.

59836a1

Prototyping separate reloadable layers for filters of logs and traces

90ec71a

Log and opentelemetry layers can reload their filters independently a…

459b311

…nd at runtime.

Init subscriber in tokio, but this can't shutdown.

0d06b11

Initialize Opentelemetry layer after the nearconfig is available.

f534e9b

Merge

f29c909

Merge

098195d

Merge branch 'master' into nikurt-nodekey-delay

eb6a4ea

Merge

44c46bd

nikurt requested a review from nagisa September 29, 2022 09:03

nikurt added 2 commits September 29, 2022 11:11

Remove example

61c1f6b

Fix doc example

be7a1c6

nagisa approved these changes Sep 29, 2022

View reviewed changes

nikurt and others added 3 commits October 3, 2022 10:41

Merge branch 'master' into nikurt-nodekey-delay

469602d

Fix init_cluster. Now tracing::info!() gets logged to stdout.

d4caab0

Merge branch 'master' into nikurt-nodekey-delay

2c11f25

nikurt added the S-automerge label Oct 3, 2022

Merge master into nikurt-nodekey-delay

2306c7e

near-bulldozer bot merged commit bacef84 into master Oct 3, 2022

near-bulldozer bot deleted the nikurt-nodekey-delay branch October 3, 2022 12:06

jakmeier mentioned this pull request Feb 21, 2023

fix: enable io trace in state viewer #8615

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(o11y): Export basic node config values as static opentelemetry attributes #7711

feat(o11y): Export basic node config values as static opentelemetry attributes #7711

nikurt commented Sep 28, 2022

nagisa Sep 28, 2022

nikurt Sep 28, 2022

nagisa Sep 29, 2022

nikurt Sep 29, 2022

nagisa Sep 29, 2022

nagisa left a comment

nagisa Sep 29, 2022

nikurt Sep 29, 2022

nagisa Sep 29, 2022

feat(o11y): Export basic node config values as static opentelemetry attributes #7711

feat(o11y): Export basic node config values as static opentelemetry attributes #7711

Conversation

nikurt commented Sep 28, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nagisa left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment