Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some debug-span and debug-log changes to help with filtering during tracing analysis #11289

Merged
merged 20 commits into from
May 17, 2024

Conversation

tayfunelmas
Copy link
Contributor

@tayfunelmas tayfunelmas commented May 10, 2024

This is not the complete list of changes I plan to do but wanted to get a first batch reviewed first.

There are two kinds of changes applied to debug logs and traces that would help filtering the them when analyzing traces (either through the new tracing UI or through the log files):

  1. Simplify the name, eg. instead of full sentence (unless sentence makes more sense), replace with an identifier-like name (in most cases reflecting the type of operation or function). Move the params substituted in the string to separate log/span fields.

  2. Add missing fields and attempt to standardize the naming for the fields so that when it comes to filtering traces for a certain analysis we know what kind of fields are available across the logs/traces for the same kind of entity/operation. For example:

  • shard_id for shard id (also prefer shard_id over UID as it has extra version and it can be obtain by other means).
  • sync_hash for hash of the state-sync
  • height for the block height (use last_has and prev_hash if it is not clear from the context)
  • block_hash for hash of the block
  • chunk_hash/chunk_hashes for hash of the chunk
  • part_id for the id of the state-sync part
  • error for the error type or message
  • sync_type for the sync type (eg. block, head, state)
  • height_included for the block height a chunk is included
  • height_created for the block height a chunk is created

@tayfunelmas tayfunelmas requested a review from a team as a code owner May 10, 2024 19:44
@tayfunelmas tayfunelmas requested review from saketh-are and robin-near and removed request for saketh-are May 10, 2024 19:44
Copy link

codecov bot commented May 10, 2024

Codecov Report

Attention: Patch coverage is 57.33333% with 32 lines in your changes are missing coverage. Please review.

Project coverage is 71.10%. Comparing base (6e96382) to head (53febb4).
Report is 1 commits behind head on master.

Files Patch % Lines
chain/chain/src/chain.rs 35.71% 5 Missing and 4 partials ⚠️
.../stateless_validation/chunk_endorsement_tracker.rs 0.00% 2 Missing and 5 partials ⚠️
...alidation/partial_witness/partial_witness_actor.rs 0.00% 5 Missing and 1 partial ⚠️
...idation/partial_witness/partial_witness_tracker.rs 0.00% 0 Missing and 2 partials ⚠️
...src/stateless_validation/state_witness_producer.rs 50.00% 0 Missing and 2 partials ⚠️
chain/client/src/sync_jobs_actor.rs 60.00% 1 Missing and 1 partial ⚠️
chain/chain-primitives/src/error.rs 0.00% 0 Missing and 1 partial ⚠️
chain/chain/src/chain_update.rs 80.00% 0 Missing and 1 partial ⚠️
...nt/src/stateless_validation/chunk_validator/mod.rs 50.00% 1 Missing ⚠️
...client/src/stateless_validation/shadow_validate.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11289      +/-   ##
==========================================
+ Coverage   71.08%   71.10%   +0.01%     
==========================================
  Files         783      783              
  Lines      156875   156813      -62     
  Branches   156875   156813      -62     
==========================================
- Hits       111517   111495      -22     
+ Misses      40522    40495      -27     
+ Partials     4836     4823      -13     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (+<0.01%) ⬆️
db-migration 0.24% <0.00%> (+<0.01%) ⬆️
genesis-check 1.39% <0.00%> (+<0.01%) ⬆️
integration-tests 37.12% <57.33%> (-0.04%) ⬇️
linux 68.82% <53.33%> (+0.03%) ⬆️
linux-nightly 70.53% <57.33%> (+<0.01%) ⬆️
macos 52.22% <37.33%> (+0.05%) ⬆️
pytests 1.61% <0.00%> (+<0.01%) ⬆️
sanity-checks 1.40% <0.00%> (+<0.01%) ⬆️
unittests 65.52% <40.00%> (+0.01%) ⬆️
upgradability 0.29% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

chain/chain/src/chain.rs Outdated Show resolved Hide resolved
chain/chain/src/chain.rs Outdated Show resolved Hide resolved
chain/chain/src/chain.rs Outdated Show resolved Hide resolved
chain/chain/src/chain.rs Outdated Show resolved Hide resolved
@@ -2525,7 +2519,7 @@ impl Chain {
"get_state_response_part",
shard_id,
part_id,
%sync_hash)
?sync_hash)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this change necessary? By default Display is intended for anything reaches user eyes, including logs...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they are the same

impl fmt::Debug for CryptoHash {
    fn fmt(&self, fmtr: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Display::fmt(self, fmtr)
    }
}

impl fmt::Display for CryptoHash {
    fn fmt(&self, fmtr: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.to_base58_impl(|encoded| fmtr.write_str(encoded))
    }
}

(Though for hashes in particular, I have a habit of using Debug, because I think the ethereum libraries implement Display by displaying the hash with ellipsis in the middle and it's really annoying. I guess Near code doesn't do that but the habit and fear has already formed :) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for the consistency because in majority of other places hashes are logged with Debug.

chain/chain/src/chain_update.rs Outdated Show resolved Hide resolved
chain/client/src/client.rs Outdated Show resolved Hide resolved
chain/client/src/client.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@robin-near robin-near left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the significance of the "target" parameter in the tracing analysis? How should we choose the target going forward?

@@ -2525,7 +2519,7 @@ impl Chain {
"get_state_response_part",
shard_id,
part_id,
%sync_hash)
?sync_hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they are the same

impl fmt::Debug for CryptoHash {
    fn fmt(&self, fmtr: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Display::fmt(self, fmtr)
    }
}

impl fmt::Display for CryptoHash {
    fn fmt(&self, fmtr: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.to_base58_impl(|encoded| fmtr.write_str(encoded))
    }
}

(Though for hashes in particular, I have a habit of using Debug, because I think the ethereum libraries implement Display by displaying the hash with ellipsis in the middle and it's really annoying. I guess Near code doesn't do that but the habit and fear has already formed :) )

chain/chain/src/chain_update.rs Outdated Show resolved Hide resolved
chain/client/src/client.rs Outdated Show resolved Hide resolved
@nagisa
Copy link
Collaborator

nagisa commented May 13, 2024

What's the significance of the "target" parameter in the tracing analysis?

It aids the ability to filter (out) the (un)interesting parts of the code. I personally think that a single word that's used in too many spans/events is insufficiently precise, but this is something that each of us needs to discover independently, unfortunately.

@robin-near
Copy link
Contributor

@nagisa I understand that it aids filtering at the logging level, but this is talking about tracing where we have the ability to take in all traces and then filter/analyze them later. So I wanted to know how @tayfunelmas is using the target tag in his analysis, and if anyone changes the tags in the future, what impact it would have on the tracing analysis tooling.

@nagisa
Copy link
Collaborator

nagisa commented May 14, 2024

but this is talking about tracing where we have the ability to take in all traces and then filter/analyze them later.

It isn't actually practical to send everything that's traced off to somewhere in many cases. It is already a constant >500KiB/s of ingest with the few things we trace at the debug level and that's already enough for grafana tempo to start dropping some of the ingest traffic at our fairly conservative default ingest limits. And with an increase of those limits, so would increase our monthly cost (this holds true regardless of where we ship the traces off to, unless they are held locally for immediate inspection, such as with your tool; but even then storage ain't free…)

If we didn't filter our traces at the emitter we'd be looking at potentially dozens of megabytes of traces per second. Components like the compiler or hyper are particularly chatty. So even for traces it is important to have well thought-out targets and levels. Otherwise even just gathering an useful trace becomes a chore (as we recently found out -- enabling host function tracing ended up with traces so truncated that they were largely incomprehensible…)

tayfunelmas and others added 4 commits May 14, 2024 14:00
Co-authored-by: Simonas Kazlauskas <github@kazlauskas.me>
Co-authored-by: Simonas Kazlauskas <github@kazlauskas.me>
Co-authored-by: Simonas Kazlauskas <github@kazlauskas.me>
…r.rs

Co-authored-by: Simonas Kazlauskas <github@kazlauskas.me>
@tayfunelmas
Copy link
Contributor Author

What's the significance of the "target" parameter in the tracing analysis? How should we choose the target going forward?

Actually both useful and not. First, I am planning to primarily use the span/log names and the fields. The target field will only help to filter down the logs, even though the data that needs to be processed may be still large, but better than having everything. I am expecting that for certain analysis (eg. chunk/block production), the interesting pieces of the logs will be coming from certain targets such as "client" and "network", this is why I did some changes to the target names for the events that I think are interesting.

@nagisa
Copy link
Collaborator

nagisa commented May 16, 2024

How should we choose the target going forward?

Our style guide, by the way, has some guidance on that. It reads as such:

Always specify the target explicitly. A good default value to use is the crate name, or the module path (e.g. chain::client) so that events and spans common to a topic can be grouped together. This grouping can later be used for customizing which events to output.

@tayfunelmas tayfunelmas added this pull request to the merge queue May 17, 2024
Merged via the queue into near:master with commit 7d5a6c5 May 17, 2024
27 of 29 checks passed
@tayfunelmas tayfunelmas deleted the more-tracing branch May 17, 2024 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants