feat(o11y): Inter-process tracing #8004

Merged
merged 109 commits into from Nov 11, 2022

Contributor

@nikurt nikurt commented Nov 7, 2022

Serialize TraceId and SpanId to a new field of PeerMessage. This lets the receiving node link its traces to the trace that generated the network request.

TextMapPropagator is the interface designed to solve a similar problem, but given that:

  1. we have to invoke it manually
  2. our custom propagator is stateless

... following the TextMapPropagator interface doesn't add value. Setting a global text map propagator doesn't add value for the same reason.
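A minimal sketch of the idea, not the code from this PR: the `TraceContext` type, its byte layout, and the `inject`/`extract` names are assumptions for illustration. The only facts it relies on are the OpenTelemetry id widths (a 16-byte trace id and an 8-byte span id, with all-zero ids treated as invalid).

```rust
/// Hypothetical stand-in for the trace context carried in a PeerMessage.
/// Assumed layout: 16-byte trace id followed by 8-byte span id.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: [u8; 16],
    pub span_id: [u8; 8],
}

/// Reasons the receiving side may refuse a serialized context.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    /// The payload does not have the expected 24 bytes.
    WrongLength(usize),
    /// OpenTelemetry treats all-zero trace or span ids as invalid.
    InvalidId,
}

impl TraceContext {
    pub const ENCODED_LEN: usize = 24;

    /// Sender side: serialize the ids into the bytes that travel with the message.
    pub fn inject(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(Self::ENCODED_LEN);
        out.extend_from_slice(&self.trace_id);
        out.extend_from_slice(&self.span_id);
        out
    }

    /// Receiver side: parse the bytes back so a new span can be linked to the sender's trace.
    pub fn extract(bytes: &[u8]) -> Result<Self, ExtractError> {
        if bytes.len() != Self::ENCODED_LEN {
            return Err(ExtractError::WrongLength(bytes.len()));
        }
        let mut trace_id = [0u8; 16];
        let mut span_id = [0u8; 8];
        trace_id.copy_from_slice(&bytes[..16]);
        span_id.copy_from_slice(&bytes[16..]);
        if trace_id == [0u8; 16] || span_id == [0u8; 8] {
            return Err(ExtractError::InvalidId);
        }
        Ok(Self { trace_id, span_id })
    }
}
```

Because both functions are pure and hold no state, there is nothing for a globally registered propagator to carry, which is the point made above about the TextMapPropagator interface.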

Because we want the inter-process tracing to be enabled at the debug level:

  • Handling of SendMessage needs to enable tracing at the debug level.
  • The corresponding PeerManagerActor tracing also needs to be enabled at the debug level.

https://pagodaplatform.atlassian.net/browse/ND-172

nikurt and others added 30 commits October 25, 2022 14:36
Move the check for is_height_processed before process_block_header. Previously, this check happened afterwards, which meant the node would re-process the block header (which takes a few ms) and re-broadcast an invalid block before dropping it. When many invalid blocks are circulating in the network, this can make the node too busy.
Display a list of peers stored in peer store - together with information on when we attempted to connect to them.

You can see it working in: http://34.147.53.32:3030/debug/pages/network_info

This is at the bottom of the page - and you have to click the button to fetch this info (as this is often over 10k peers - and loading takes a while).
As an intermediate step, we will only enable flat storage for storage_read, but not storage_write.
Instead of checking the number of values and their sizes, the caches are
now limited by the actual (approximated) memory consumption.

This changes what `total_size` in `TrieCacheInner` means, which is also
observable through Prometheus metrics.

Existing configuration works with slightly altered effects.
The number of entries converts to an implicit size limit. Since the explicit
default size limit currently is 3GB and the default max entries is set
to 50k, the implicit limit = 50k * 1000B = 50MB is stronger. This still
limits the number of largest entries to 50k but allows the cache to
be filled with more entries when the values are smaller.

For shard 3, however, where the number of entries is set to 45M in code,
the memory limit of 3GB is active. Since we change how this limit is
calculated we will see fewer entries cached with this change.
Shard 3 should still be okay since we have a prefetcher in place now
that works even when the cache is empty.
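The arithmetic in this commit message can be checked with a small sketch. The function name and the constant are hypothetical; the 1000 B per entry figure is taken from the "50k * 1000B" calculation above, and 3GB is assumed to mean 3 * 10^9 bytes.

```rust
/// Assumed per-entry size in bytes used to turn an entry count into a byte budget.
/// Taken from the "50k * 1000B" arithmetic in the commit message.
const ASSUMED_BYTES_PER_ENTRY: u64 = 1_000;

/// Returns the limit that actually binds: the smaller of the implicit limit
/// (derived from the configured entry count) and the explicit byte limit.
pub fn effective_limit_bytes(max_entries: u64, explicit_limit_bytes: u64) -> u64 {
    let implicit = max_entries.saturating_mul(ASSUMED_BYTES_PER_ENTRY);
    implicit.min(explicit_limit_bytes)
}
```

With the defaults, `effective_limit_bytes(50_000, 3_000_000_000)` gives 50 MB, so the implicit limit binds. With the 45M entries configured for shard 3, the implicit limit would be 45 GB, so the explicit 3GB limit binds instead.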
This adds code that mirrors traffic from a source chain (e.g. mainnet
or testnet) to a test chain with genesis state forked from the source
chain. The goal is to produce traffic that looks like source chain
traffic. So in a mocknet test where we fork mainnet state for example,
we can then actually observe what happens when we subsequently get
traffic equivalent to mainnet traffic after the fork point. For more
info, see the README in this commit.
`anyhow` is the type to return from `main`; we don't get any value here from preserving well-typed errors, and we would be creating more work down the line to add all future error variants: *surely* we can fail due to more than these two errors, right?
* doc: fix typo Acton -> Action

* doc: fix typo falied -> failed

* doc: fix typo recieve -> receive

* doc: fix typo infomation -> information

* Update tools/delay-detector/README.md

Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
The module has been introduced in commit cbcf678: ‘Cryptographic
code for randomness beacon’ and then never used.  Get rid of it.
Update near logo in README.md

#7875
List of peers wasn't printed if we were in sync mode (especially during header/state sync)
EpochSync was never implemented, there is just a bunch of stubs left here and there. Removing them.
There’s still `impl From<AccountId> for String`.  It’s left
intentionally as it avoids string allocation when used compared
to using Display.
The concrete implementation wrapping ClientActor and ViewClientActor has been moved to near_client crate.
Network(View)ClientMessage will be moved to near_client crate in a separate PR.
This will print easier-to-understand info on which source chain transactions are making it into the target chain. For now we just log them to debug logs, but it would be nice to have an HTTP debug page that shows an easy-to-understand summary.
The TxStatusError::InvalidTx variant is never constructed
so get rid of it.
Having a conversion from near_chain_primitives::Error to TxStatusError
eliminates a handful of trivial map_error calls.
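A minimal sketch of why a `From` conversion removes those calls. The types and functions below are hypothetical stand-ins, not the real near_chain_primitives or TxStatusError definitions.

```rust
/// Hypothetical stand-in for the chain-level error type.
#[derive(Debug, PartialEq)]
pub enum ChainError {
    DbNotFound(String),
}

/// Hypothetical stand-in for the transaction status error type.
#[derive(Debug, PartialEq)]
pub enum TxStatusError {
    ChainError(ChainError),
}

/// With this impl in place, `?` converts ChainError into TxStatusError automatically.
impl From<ChainError> for TxStatusError {
    fn from(err: ChainError) -> Self {
        TxStatusError::ChainError(err)
    }
}

/// Hypothetical lookup that fails with the chain-level error.
fn load_tx_height(hash: &str) -> Result<u64, ChainError> {
    if hash == "known" {
        Ok(42)
    } else {
        Err(ChainError::DbNotFound(hash.to_string()))
    }
}

/// No explicit `.map_err(TxStatusError::ChainError)` is needed here:
/// the `?` operator calls `From::from` on the error path.
pub fn tx_status(hash: &str) -> Result<u64, TxStatusError> {
    let height = load_tx_height(hash)?;
    Ok(height)
}
```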
Add a `cold_store` Cargo feature which enables the option to configure
the node with cold storage.  At the moment, all this does is open the
cold database and doesn’t enable any other features.  The idea is that
this can now allow experimenting with code that needs access to the
cold storage.
These actix messages are an implementation detail of near_client crate.
Fixes some minor grammar issues from #7918.
…k processing to client (#7898)

This PR is a pure refactoring. The context is that any processing details should be put in Client instead of ClientActor. ClientActor should just serve as a coordinator class that handles messages, checks triggers, and immediately passes them to Client. This is better for testing, since we can't write unit tests for any logic in ClientActor, and also better for code readability, as the logic is not scattered across two classes.

This PR only moves the part around block processing. The rest is tracked by #7899
* New last-blocks debug page.

* Use JSX with babel

* Minor fix

* Minor fix 2

* Rename is_block_missing
- links for gas_param sections in summary didn't work
- rename gas_param to just gas
- pin link to commit
- also mention that we spend gas on tx other than wasm
Also removed some actix messages which are not needed any more.
@nikurt nikurt requested a review from pompon0 November 9, 2022 13:12
if proto_msg.trace_context.is_some() {
    let propagator = NodePropagator::new();
    if let Ok(extracted_span_context) =
        propagator.extract_span_context(&proto_msg.trace_context)
Contributor

shouldn't the parsing error be logged here?

Contributor Author

I'm worried about spamming the logs.
I wish I had LOG_EVERY_N_SECONDS().
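
One way such a helper could look, as a sketch only: `RateLimiter` and `should_log` are hypothetical names, not an existing API in this codebase or in the `tracing` crate. The caller keeps one limiter per log site and emits the message only when `should_log` returns true.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that lets a log site emit at most one message per interval.
pub struct RateLimiter {
    min_interval: Duration,
    last_logged: Option<Instant>,
}

impl RateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_logged: None }
    }

    /// Returns true if enough time has passed since the last emitted message.
    /// `now` is passed in so the behaviour can be tested without sleeping.
    pub fn should_log(&mut self, now: Instant) -> bool {
        match self.last_logged {
            // Too soon after the previous message: suppress this one.
            Some(prev) if now.saturating_duration_since(prev) < self.min_interval => false,
            // First message, or the interval has elapsed: emit and remember the time.
            _ => {
                self.last_logged = Some(now);
                true
            }
        }
    }
}
```

At the call site this would wrap the error branch of the parse, for example `if limiter.should_log(Instant::now()) { /* log the parsing error */ }`, so a peer sending malformed trace contexts produces at most one log line per interval.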

@nikurt nikurt requested a review from pompon0 November 10, 2022 14:25
@near-bulldozer near-bulldozer bot merged commit 844297e into master Nov 11, 2022
@near-bulldozer near-bulldozer bot deleted the nikurt-interprocess branch November 11, 2022 15:30