
Networking simplification #2264

Merged · 20 commits · May 17, 2022

Conversation

@tomaka (Contributor) commented May 4, 2022

This big, unreviewable PR refactors the networking part of the code, more precisely the layers that coordinate all the TCP/WebSocket connections together. The code concerning individual connections has barely been touched.

The main file that has been modified is collection.rs. This file contains a data structure that represents a set of connections.

Before this PR, this data structure was "atomic". All of its methods took &self as a parameter, meaning that multiple methods could be called at the same time, and many of them were asynchronous and could be interrupted. If you needed to inject data into a connection, this had to be done through the data structure.
This design made the code extremely difficult to understand because of all the corner cases to handle, most notably the fact that futures can be cancelled by the user before they complete.
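To illustrate the cancellation hazard (a toy sketch with made-up names, not smoldot's actual API): any async method taking &self can be dropped at any .await point, and the collection then has to cope with the half-finished operation.

```rust
use std::time::Duration;
use futures::{future::{self, Either}, pin_mut};

/// Toy stand-in for the old "atomic" collection: one async method on &self.
struct Collection;

impl Collection {
    async fn inject_data(&self, _data: &[u8]) {
        // Imagine several .await points here; the caller can drop this
        // future at any of them, so the collection would need internal
        // buffering in order to resume the half-done injection later.
        futures_timer::Delay::new(Duration::from_secs(5)).await;
    }
}

async fn demo() {
    let collection = Collection;
    let inject = collection.inject_data(b"hello");
    pin_mut!(inject);
    let timeout = futures_timer::Delay::new(Duration::from_millis(100));
    pin_mut!(timeout);
    // The timeout wins the race: inject_data is silently cancelled
    // mid-operation, leaving the collection in an intermediate state.
    if let Either::Right(_) = future::select(inject, timeout).await {
        println!("inject_data was cancelled before completion");
    }
}
```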

After this PR, this data structure has been split in two: one with the set of connections, and one ConnectionTask for each individual connection. The set of connections has a queue of messages destined for the ConnectionTasks, and the ConnectionTasks have a queue of messages destined for the set. This is exposed in the APIs of these objects, and it is the role of the user to do the message passing. Before this PR, the whole "locking/multithreading strategy" was handled internally by the collection; after this PR it needs to be handled by the user.

Thanks to this change in paradigm, the data structures are no longer atomic and are now simple mutable state machines. You have getters and you have methods that modify the state, and that's it. This considerably simplifies the implementation.
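As a rough sketch of the new shape (names are illustrative, not smoldot's real API): the set and each task are plain synchronous state machines with explicit message queues, and the caller shuttles messages between them.

```rust
use std::collections::VecDeque;

// Illustrative message types; the real ones carry substreams, requests, etc.
enum ToConnection { StartRequest(Vec<u8>), Close }
enum FromConnection { Response(Vec<u8>), Closed }

/// The set of connections: purely synchronous mutable state.
struct ConnectionsSet {
    /// Messages waiting to be delivered to the per-connection tasks.
    to_tasks: VecDeque<(usize, ToConnection)>,
}

impl ConnectionsSet {
    fn start_request(&mut self, conn: usize, payload: Vec<u8>) {
        self.to_tasks.push_back((conn, ToConnection::StartRequest(payload)));
    }

    /// The caller pulls queued messages and delivers them to the tasks.
    fn next_message(&mut self) -> Option<(usize, ToConnection)> {
        self.to_tasks.pop_front()
    }

    /// The caller feeds back messages produced by the tasks. No .await
    /// anywhere: nothing can be cancelled halfway through.
    fn inject_message(&mut self, _conn: usize, _msg: FromConnection) {
        // ...update the state machine...
    }
}

/// One per connection; also a plain state machine with its own queues.
struct ConnectionTask {
    from_set: VecDeque<ToConnection>,
    to_set: VecDeque<FromConnection>,
}
```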

In the same vein, peers.rs and service.rs, which contain data structures built on top of collection.rs, have been modified in the same way.

This new paradigm is, in theory, slightly less optimal than the previous one. Before this PR, locking was fine-grained: if multiple threads wanted to access the set of connections at the same time, they could each call a method, and if their changes didn't overlap they would actually run concurrently. After this PR, any thread that wants to access the set needs to lock a mutex around the entire set.
However, the complexity of the previous implementation, notably around cancellable futures, led to a lot of overhead as well; for example, operations in progress needed to be buffered so that they could be resumed later in case the user interrupted them.
Overall I think that this change is more than worth it.
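For concreteness, the caller-side pattern looks roughly like this (a sketch reusing the toy ConnectionsSet above): the coarse mutex is taken once, used only for quick synchronous state-machine calls, and released before any I/O is awaited.

```rust
use futures::lock::Mutex;

/// Drain the set's outgoing queue and hand each message to its task.
async fn pump_messages(set: &Mutex<ConnectionsSet>) {
    let mut guarded = set.lock().await;
    while let Some((_conn, _msg)) = guarded.next_message() {
        // ...send the message to the corresponding ConnectionTask,
        // e.g. over a channel owned by that task...
    }
    // The guard is dropped here; I/O is awaited without the lock held.
}
```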

@tomaka (Contributor, Author) commented May 4, 2022

This PR isn't completely finished; the only thing that remains is to update the code of the full node.

@mergify (bot) left a comment:

Automatically approving tomaka's pull requests. This auto-approval will be removed once more maintainers are active.

@github-actions (bot) commented May 4, 2022

twiggy diff report

Difference in .wasm size before and after this pull request.


 Delta Bytes │ Item
─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      -46124 ┊ smoldot::executor::host::ReadyToRun::run_once::h9fbe42d9b906904d
      +46124 ┊ smoldot::executor::host::ReadyToRun::run_once::hfc452f475e091cdd
      +41864 ┊ smoldot::json_rpc::methods::MethodCall::from_defs::h7cb5a357dee63457
      -41864 ┊ smoldot::json_rpc::methods::MethodCall::from_defs::h87eda0c7c25e9df4
      -12051 ┊ smoldot::network::service::ChainNetwork<TNow>::next_event::{{closure}}::h4eea15e8e5e22b4f
      +11616 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h190dfa9030081709
      -11616 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h50df569abb293116
      -11199 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h4019dbba31ff7cf1
      +11199 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::ha6738d4012a39afe
      -11113 ┊ smoldot_light_base::Client<TChain,TPlat>::add_chain::h009c2eec1e98d69c
      +11113 ┊ smoldot_light_base::Client<TChain,TPlat>::add_chain::ha0522f1d6faf3be7
      +10473 ┊ smoldot::network::service::ChainNetwork<TNow>::next_event::h8c3fbc17c9efa90a
      +10402 ┊ <parity_wasm::elements::ops::Instruction as parity_wasm::elements::Deserialize>::deserialize::h2cf195427700fab0
      -10402 ┊ <parity_wasm::elements::ops::Instruction as parity_wasm::elements::Deserialize>::deserialize::he072ea64b5fc2251
      +10168 ┊ smoldot::json_rpc::methods::Response::to_json_response::h29508715e5d54042
      -10168 ┊ smoldot::json_rpc::methods::Response::to_json_response::h2c80cc85b760bea0
      +10089 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::h4507f5ca3ea840f8
      -10089 ┊ <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::hd43ea00e82f8eccc
       -9880 ┊ smoldot::libp2p::connection::established::substream::Substream<TNow,TRqUd,TNotifUd>::read_write2::h1848d8501371c422
       +9713 ┊ smoldot::libp2p::connection::established::substream::Substream<TNow,TRqUd,TNotifUd>::read_write2::h517e999dc3d40e80
      +34739 ┊ ... and 26650 more.
      +59531 ┊ Σ [26670 Total Rows]

@tomaka (Contributor, Author) commented May 4, 2022

As part of this PR, I've removed the node-info feature of the full node that would print the PeerId given an IP address.
It led to too much code duplication, and it is unclear right now how to properly de-duplicate this code.

@tomaka marked this pull request as ready for review on May 4, 2022, 13:54.
@tomaka (Contributor, Author) commented May 4, 2022

There is a high chance that this PR introduces some bugs. I admit that I don't have the courage to chase bugs for days, and would prefer to merge this so that I can do other networking-related changes on top of it.
However, everything seems to work completely fine for the light node, and the changes introduced by this PR reduce the chances of bugs in the long term by making the code so much simpler.

@tomaka (Contributor, Author) commented May 16, 2022

@melekes Do you want to review that? You can completely say no (and actually I'd expect you to say no)

@melekes (Contributor) left a comment:

man this PR is massive 🙈 I've reviewed some code, but not everything. Probably okay to merge. I can review the logic later.

It would also be interesting to compare how introducing a global lock affected performance. Can tracing or some other instrument (similar to what go tool pprof provides) give insight into lock contention in Rust?

In bin/full-node/src/run/network_service.rs:
```rust
        fnv::FnvBuildHasher,
    >,

    messages_from_connections_tx:
```
@melekes (Contributor):
how can one receive messages from sender? 😐 messages_to_connections_tx ?

@tomaka (Contributor, Author):
You get the messages from messages_from_connections_rx; not sure I get the question.
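For readers following along, the convention in play appears to be this (a hedged sketch; only the messages_from_connections name comes from the diff, the rest is illustrative): the channel is named for the direction of the data, so the service keeps the receiver and holds the sender only to clone it into each spawned connection task.

```rust
use futures::channel::mpsc;

fn setup() {
    // One channel, two ends: the name describes where the data comes from.
    let (messages_from_connections_tx, messages_from_connections_rx) =
        mpsc::channel::<(usize, Vec<u8>)>(32);

    // Each spawned connection task receives a clone of the sender...
    let tx_for_task = messages_from_connections_tx.clone();

    // ...while the service's event loop reads from the single receiver.
    let _service_side = messages_from_connections_rx;
    let _ = tx_for_task;
}
```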

In bin/full-node/src/run/network_service.rs:
```diff
@@ -421,29 +319,14 @@ impl NetworkService {
     futures_timer::Delay::new(next_discovery).await;
     next_discovery = cmp::min(next_discovery * 2, Duration::from_secs(120));

-    match inner
+    let mut lock = inner.guarded.lock().await;
```
@melekes (Contributor):
Suggested change:
```diff
-let mut lock = inner.guarded.lock().await;
+let mut guarded = inner.guarded.lock().await;
```

Don't you think lock is too general? Just by looking at the line below, you probably won't understand what lock is (because it could represent anything).

@tomaka (Contributor, Author) May 16, 2022:
Well, guarded isn't much better. guarded here is just supposed to mean "protected by a mutex".
In the context of this module, there's only one mutex in the entire code, so to me it's not a problem to just call that lock.

@melekes (Contributor):
Except one can find a Guarded struct above, but there's no Lock.

Other review threads (resolved):
- bin/full-node/src/run/network_service.rs (outdated)
- bin/light-base/src/network_service.rs (outdated, ×3)
- bin/light-base/src/network_service.rs
- src/libp2p/collection.rs
@tomaka (Contributor, Author) commented May 16, 2022

> It would also be interesting to compare how introducing a global lock affected performance. Can tracing or some other instrument (similar to what go tool pprof provides) give insight into lock contention in Rust?

I don't think we can actually measure the lock contention at the moment, for what it's worth.
In principle there could be lock contention issues once you have GrandPa running, plus parachains, plus broadcasting transactions, and maybe other things I'm not thinking of.
But right now we only ever access the lock to process messages/events and to start block requests. Starting a block request happens once every 6 seconds, and outside of this there's only ever one task accessing the lock at any given time.
In other words, the contention is going to be completely negligible.

As for the light node, it's single-threaded, so contention isn't measurable there either.
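One low-tech way to get a rough contention signal (a sketch, not something this PR adds; Rust has no built-in pprof-style mutex profiler) is to wrap lock acquisition and log slow waits:

```rust
use std::time::{Duration, Instant};
use futures::lock::{Mutex, MutexGuard};

/// Acquire the lock, logging whenever acquisition had to wait; the wait
/// time is a direct (if crude) measure of contention on this mutex.
async fn timed_lock<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    let start = Instant::now();
    let guard = mutex.lock().await;
    let waited = start.elapsed();
    if waited > Duration::from_millis(1) {
        eprintln!("lock contention: waited {:?}", waited);
    }
    guard
}
```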

@tomaka (Contributor, Author) commented May 17, 2022

I think I've addressed everything 👍

@melekes (Contributor) left a comment.

@tomaka added the automerge label (automatically merge pull request as soon as possible) on May 17, 2022.
@mergify (bot) merged commit e7768f5 into paritytech:main on May 17, 2022.
@tomaka deleted the net-cleanup branch on May 17, 2022, 09:43.