protocols/kad: Improve options to efficiently retrieve #2712
Conversation
Integrates libp2p/rust-libp2p#2712 to fetch providers more quickly. For widely available providers this reduces the fetch time considerably.
I am guessing that this is already significantly speeding up your query times @dignifiedquire? Thanks for providing a patch with the discussion right away.
decide which queries should be transitioned to the new progress based api
Without having looked into every one, I would appreciate the consistency across the query types, thus in favor of this.
potentially remove Quorum based on the new functionality, as this now could be done by the caller if using the progression based api
I would be in favor of that for the sake of simplicity. I don't think anyone depends on the Quorum concept specifically. Not offering Quorum comes at the expense of potentially doing extra unneeded work, e.g. when contacting another peer immediately before a finish call by the user. I doubt this is an issue though.
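The caller-side replacement for Quorum can be modeled without any libp2p types. The following is a hypothetical, self-contained sketch (ProviderCollector and the u64 stand-in for PeerId are inventions for illustration, not the rust-libp2p API): the caller counts distinct providers and signals when the query should be finished, which is what a quorum previously decided internally.

```rust
use std::collections::HashSet;

// Hypothetical caller-side stand-in for `Quorum`: stop once `wanted`
// distinct providers have been observed. `u64` stands in for `PeerId`.
struct ProviderCollector {
    wanted: usize,
    found: HashSet<u64>,
}

impl ProviderCollector {
    fn new(wanted: usize) -> Self {
        Self { wanted, found: HashSet::new() }
    }

    // Returns true when the caller should finish the query.
    fn on_provider(&mut self, peer: u64) -> bool {
        self.found.insert(peer);
        self.found.len() >= self.wanted
    }
}

fn main() {
    let mut collector = ProviderCollector::new(2);
    assert!(!collector.on_provider(1));
    assert!(!collector.on_provider(1)); // duplicates don't count
    assert!(collector.on_provider(2));  // enough providers found
    println!("ok");
}
```

The extra unneeded work mentioned above corresponds to any peer contacted between the last useful event and the caller's finish call.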
protocols/kad/src/behaviour.rs
Outdated
    pub key: record::Key,
    pub providers: HashSet<PeerId>,
    pub provider: PeerId,
Why return them one by one? Why not as a batch? One always discovers them in batches, no?
Yeah, I went back and forth on this one. The nice thing about this is that I can reuse the sequence number as the count of results, and the processing is a little more streamlined. But batching is definitely more efficient, given they come in as batches.
Do we have any idea about the difference in the performance profile?
From an API PoV, it seems nicer to return them one-by-one given that we are already returning a list of events now. For example, if I want to write a log line for each progress entry, being given a set makes this slightly more cumbersome.
Do we have any idea about the difference in the performance profile?
I don't know; the cost is 1 vs n events per discovery. From my understanding that cost should be quite low, but I might be mistaken.
Intuitively I am also assuming that neither has a performance benefit.
Returning them one by one does lose information. Returning them batched gives the user the information that all records within the same batch came from the same source.
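The information loss can be shown with a toy model. This sketch is hypothetical (ProvidersEvent and the u64 stand-in for PeerId are made up for illustration): flattening batched events into per-provider events is easy, but the grouping — which providers shared a source — cannot be recovered afterwards.

```rust
use std::collections::HashSet;

// Hypothetical event shapes for the two designs discussed.
enum ProvidersEvent {
    Batch(HashSet<u64>), // one event per peer response
    Single(u64),         // one event per discovered provider
}

// Convert batched events into per-provider events, dropping the grouping.
fn flatten(events: Vec<ProvidersEvent>) -> Vec<ProvidersEvent> {
    let mut out = Vec::new();
    for ev in events {
        match ev {
            ProvidersEvent::Batch(set) => {
                for p in set {
                    out.push(ProvidersEvent::Single(p));
                }
            }
            single => out.push(single),
        }
    }
    out
}

fn main() {
    let batched = vec![
        ProvidersEvent::Batch([1, 2].into_iter().collect()),
        ProvidersEvent::Batch([3].into_iter().collect()),
    ];
    // Two batched events become three per-provider events; the fact that
    // providers 1 and 2 shared a source is no longer recoverable.
    assert_eq!(flatten(batched).len(), 3);
    println!("ok");
}
```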
@mxinden updated, please check it out
protocols/kad/src/behaviour.rs
Outdated
        result: QueryResult,
        /// Execution statistics from the query.
        stats: QueryStats,
    },

    /// An outbound query has produced a result.
    OutboundQueryProgressed {
Just a slight preference for now. Should probably put more thought into this.
How about the following:
- Remove OutboundQueryProgressed.
- Rename OutboundQueryCompleted to OutboundQueryProgressed.
- Add a field index: Index to OutboundQueryProgressed (formerly OutboundQueryCompleted), where enum Index { Continuing(usize), Final(usize) } indicates the sequence number and whether it is the final update for the query.
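The proposed shape can be sketched as follows. This follows the suggestion above and is not necessarily the final rust-libp2p API; the helper is_final is made up for illustration.

```rust
// Sketch of the proposed `Index` design: each query update carries its
// sequence number plus whether it is the last update for the query.
#[derive(Debug, PartialEq)]
enum Index {
    // Further events will follow; payload is the sequence number.
    Continuing(usize),
    // Last update for this query, with its sequence number.
    Final(usize),
}

fn is_final(index: &Index) -> bool {
    matches!(index, Index::Final(_))
}

fn main() {
    let updates = [Index::Continuing(0), Index::Continuing(1), Index::Final(2)];
    // Exactly one update per query is final, and it is the last one.
    assert_eq!(updates.iter().filter(|i| is_final(i)).count(), 1);
    assert!(is_final(updates.last().unwrap()));
    println!("ok");
}
```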
I don't have a strong preference for my solution over this, happy to change it, but let's decide on a version before I go change it again 😅
Unless anyone of the recently tagged people (or anyone else) raises any objections, I suggest going with the above.
I am unfamiliar with the internals of kademlia, but do we know ahead of time how many we will get? If yes, then always returning both numbers could be another design.
What does the number within Final mean? Would it work to include a boolean completed in the event and only have count otherwise?
The one thing I am worried about with your design suggestion @mxinden is that it introduces more nesting which can be cumbersome to deal with.
I am unfamiliar with the internals of kademlia but do we know ahead of time, how many we will get?
At least not in the providers case.
you were right, I was being stupid, please check out the new structure
@mxinden I started expanding this to GetRecord and am running into issues with this construction. There is a need to differentiate the type and information between the progress events and the last event; e.g. for GetRecord the last event will not contain an actual record, only the optional cache.
Thoughts?
How about instead returning the cache_candidates on each progression? I would expect this BTreeMap to be small and thus cloning to be cheap. Would you agree @dignifiedquire? Would that solve the issue?
Not entirely, as I would still need to make the record optional, which is pretty confusing as a consumer.
Or we return an event for each contacted peer directly with an optional record.
yes very much
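The "event per contacted peer with an optional record" idea can be sketched in isolation. This is a hypothetical shape, not the rust-libp2p API; u64 stands in for PeerId and String for Record:

```rust
// Hypothetical per-peer GetRecord progress event: `record` is `None`
// for peers that returned no record, so a final cache-only event fits
// the same shape without a separate event type.
struct GetRecordProgress {
    peer: u64,
    record: Option<String>,
}

fn main() {
    let events = vec![
        GetRecordProgress { peer: 1, record: Some("value".into()) },
        GetRecordProgress { peer: 2, record: None },
    ];
    // Two peers were contacted, but only one actually held the record.
    assert_eq!(events.iter().map(|e| e.peer).count(), 2);
    let records: Vec<_> = events.iter().filter_map(|e| e.record.as_ref()).collect();
    assert_eq!(records.len(), 1);
    println!("ok");
}
```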
Thanks for the ping @mxinden, but I've been out of the loop for a while again, so I cannot really comment on anything. Will try to keep up to date with this and any follow-up release.

I am not sure how many changes this will require. I looked through the code and found 0 usages of Providers.

@kpp the API changes would as well apply to
@dignifiedquire let me know once this is ready for another review. Excited for this to eventually land.
Thank you for making the relevant changes and thanks for propagating them across the code-base.
I left a couple of comments. Overall the direction looks great to me.
I am of the opinion that we should apply this pattern to all query types (e.g. as well to get_closest_peers). As far as I understand, no one is objecting.
I would like to keep the master branch consistent at all times. Thus I would prefer the changes to the other query types to happen in this pull request as well. Would you be willing to do that work as well @dignifiedquire?
protocols/kad/src/behaviour.rs
Outdated
    pub providers: HashSet<PeerId>,
    /// How many providers have been discovered so far.
    pub providers_so_far: usize,
Why is this needed? In case a user is interested in all of the providers found, would they not need to keep track of all of them anyways?
My use case was that I need it, and it is a lot easier to track it here than to track it externally. We might also need it internally if we want to do any early cancellations.
Ok. I am fine with leaving it as is.
I actually realized it doesn't help me much after all, so I think we can safely remove this.
Please remove it then. Thanks.
protocols/kad/src/behaviour.rs
Outdated
@@ -2472,6 +2557,25 @@ pub enum KademliaEvent {
    PendingRoutablePeer { peer: PeerId, address: Multiaddr },
}

/// Information about progress events.
#[derive(Debug, Clone)]
pub struct Index {
While I proposed the name Index, I don't think it is intuitive.
In case anyone has more intuitive naming suggestions, that would be appreciated. //CC @thomaseizinger who usually has good ideas.
I also dislike the name, but haven't had any better ideas 😓
What about ProgressStep?

pub struct ProgressStep {
    index: usize,
    last: bool,
}
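Consuming this shape is straightforward. The following is a sketch under the ProgressStep proposal above (not the exact rust-libp2p API): the caller accumulates results until an update arrives with last set.

```rust
// Sketch: the caller processes progress updates until one is marked
// `last`, at which point the query is done.
struct ProgressStep {
    index: usize,
    last: bool,
}

fn main() {
    let steps = [
        ProgressStep { index: 0, last: false },
        ProgressStep { index: 1, last: false },
        ProgressStep { index: 2, last: true },
    ];
    let mut processed = 0;
    for step in &steps {
        processed += 1;
        if step.last {
            // `index` doubles as a running count of updates seen so far.
            assert_eq!(step.index + 1, processed);
            break;
        }
    }
    assert_eq!(processed, 3);
    println!("ok");
}
```

Compared with the earlier Index enum idea, this flat struct avoids nesting at the cost of allowing the (meaningless) combination of a final update being followed by more updates, which the producer must rule out.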
Sounds good to me. I find ProgressStep a lot more intuitive than Index.
updated
protocols/kad/src/behaviour.rs
Outdated
    /// An outbound query has produced a result.
    OutboundQueryCompleted {
    /// An outbound query has finished.
    OutboundQueryProgressed {
Is the doc comment still up to date? Should it not be "An outbound query progressed"?
it is not
fixed
protocols/kad/src/behaviour.rs
Outdated
    /// The ID of the query that finished.
    id: QueryId,
    /// The result of the query.
    /// The optional final result of the query.
Would "The interim result of the query" not be more descriptive? I find this doc comment confusing.
fixed
protocols/kad/src/behaviour.rs
Outdated
    result: QueryResult,
    /// Execution statistics from the query.
    stats: QueryStats,
    /// Indicates which event this is, if therer are multiple responses for a single request.

Suggested change:
/// Indicates which event this is, if therer are multiple responses for a single request.
→ /// Indicates which event this is, if there are multiple responses for a single query.
I think we refer to it as query everywhere else, right?
fixed
        }
    });
}
QuickCheck::new().tests(10).quickcheck(prop as fn(_))
}

fn get_providers_limit<const N: usize>() {
I think our first const generic 🎉
Yeah, I'll work on it, just wanted to get the structure right first.
@mxinden if we remove the quorum from

I would expect the

Do you think this is too implicit / not intuitive for users? In other words, should we consider the potentially unnecessary work significant and thus the whole API a footgun?
Friendly ping @dignifiedquire. Would be unfortunate for this to go stale. Anything you would need from my end? Do I understand correctly that this is the only reason why iroh depends on a fork of rust-libp2p?

yes, that is correct
I am a little afraid that could happen, to be honest.
Honestly, I am not sure how much sense it really makes to change the other query types' behaviour at the moment, other than the changes I made so far. I will likely have a better understanding down the line, once I start using the rest of the API as intensively as I use provider finding today. So I would love it if we can find something close to what is there that we can merge, and I (or others) can work on improving the other pieces of the API as needed.
Do you see any issues with applying the new mechanism to

I understand your reasoning on only doing this change to the
@mxinden fair enough, I updated the
Co-authored-by: Max Inden <mail@max-inden.de>
thanks a lot, merged them all
Looks good to me. Thanks for bearing with me @dignifiedquire!
Hmm, this is quite a lot to read: now that I’m facing a somewhat non-obvious upgrade for ipfs-embed (as I’ve never used Kademlia), what is the upgrade path for the quorum removal from

@rkuhn can you ping me on your upgrade pull request? Happy to guide or provide patches. I am assuming that you already read the changelog entry. Unfortunately that is not a step-by-step instruction manual.

Here’s what I came up with, hope it isn’t too far off :-)
With #2712 merged, this can be marked as done.
* upgrade libp2p to 0.50.0
* on_swarm_event and on_connection_handler_event
* replace `Swarm::new` with `Swarm::with_threadpool_executor`
* on_swarm_event and on_connection_handler_event part 2
* on_swarm_event and on_connection_handler_event part 3
* on_swarm_event and on_connection_handler_event part 4
* update libp2p
* libp2p 0.50.0
* rename OutboundQueryCompleted to OutboundQueryProgressed refs libp2p/rust-libp2p#2712
* remove unused var
* accumulate outbound_query_records until query is finished
* format code
* use p_handler instead of new_handler #12734 (comment)
* pass ListenFailure to kademlia #12734 (comment)
* use tokio executor in tests #12734 (comment)
* use chrono Local::now instead of deprecated Local::today
* remove unused vars from request_responses tests
* attempt to fix pallet UI tests
* restart CI (×8)
Before libp2p 0.50.0 we used a quorum of one to fetch records from the DHT. In the PR that upgraded to libp2p 0.50.0 we accidentally changed this behavior. This PR brings back the old behavior of using a quorum of one and thus faster discovery: after finding the first value, we directly finish the query. There was also another behavior change in libp2p: automatic caching on remote nodes was removed. This PR also brings back remote caching on the nodes that are nearest to the key from our point of view of the network. The PR that changed the behavior in libp2p: libp2p/rust-libp2p#2712
Description

The end goal is to allow the following:

- get_providers: give the caller results back in real time, not collect all provider records
- get_providers: allow the total number of records to be of a fixed limit, such that not necessarily 20 peers need to be contacted (as is the case in the current impl)

Implementation

There are currently two different commits. The first one implements the basic limit functionality, but this gets replaced with a more general solution in 2868e7b, which streams the results and allows the caller to .finish() the query when they have enough.

Things Left To Do

- decide which queries should be transitioned to the new progress based api
- potentially remove Quorum based on the new functionality, as this now could be done by the caller if using the progression based api
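The stream-and-finish flow described above can be sketched without libp2p. This is a hypothetical model (the Query struct, its next/finish methods, and the u64 provider IDs are stand-ins, not the real query handle): results are consumed one at a time, and a caller-chosen limit replaces both the fixed 20-peer behaviour and the built-in Quorum.

```rust
// Toy model of a streaming query that the caller can cut short.
struct Query {
    pending: Vec<u64>, // providers still to be delivered
    finished: bool,
}

impl Query {
    // Deliver the next provider, or nothing once finished/exhausted.
    fn next(&mut self) -> Option<u64> {
        if self.finished { None } else { self.pending.pop() }
    }

    // Caller-side early termination, as with `.finish()` above.
    fn finish(&mut self) {
        self.finished = true;
    }
}

fn main() {
    let mut q = Query { pending: vec![5, 4, 3, 2, 1], finished: false };
    let limit = 2;
    let mut got = Vec::new();
    while let Some(p) = q.next() {
        got.push(p);
        if got.len() >= limit {
            q.finish(); // stop early: the caller has enough providers
        }
    }
    // Only `limit` results were consumed despite 5 being available.
    assert_eq!(got.len(), 2);
    println!("ok");
}
```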