
feat(iroh): downloader #1420

Merged (97 commits) on Sep 13, 2023

Conversation

@divagant-martian (Contributor) commented Aug 28, 2023

Description

Adds the Downloader as specified in #1334, plus some back-channel conversations.
Features include:

  • Support for collections
  • Delays before starting downloads
  • Download retries with incremental backoff
  • Keeping peers around a bit longer than strictly necessary, in the hope they will be useful again
  • The concept of intents, and deduplication of download efforts
  • Cancellation of download intents
  • A limit on the total number of concurrent requests
  • A limit on the number of concurrent requests per peer
  • A limit on the total number of open connections
  • Basic error management, in the form of deciding whether a peer should be dropped, the request should be dropped, or the request should be retried
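For a rough picture of how the three limits above fit together, here is a minimal sketch in Rust. The struct, field names, and default values are made up for illustration and are not the PR's actual code:

```rust
/// Hypothetical sketch of the concurrency limits described above.
/// Names and defaults are illustrative, not the ones used in the PR.
#[derive(Debug, Clone)]
pub struct ConcurrencyLimits {
    /// Maximum number of requests in flight across all peers.
    pub max_concurrent_requests: usize,
    /// Maximum number of requests in flight to a single peer.
    pub max_concurrent_requests_per_peer: usize,
    /// Maximum number of open connections in total.
    pub max_open_connections: usize,
}

impl Default for ConcurrencyLimits {
    fn default() -> Self {
        Self {
            max_concurrent_requests: 50,
            max_concurrent_requests_per_peer: 4,
            max_open_connections: 25,
        }
    }
}

impl ConcurrencyLimits {
    /// Whether no new request may start, given the current total.
    pub fn at_requests_capacity(&self, active_requests: usize) -> bool {
        active_requests >= self.max_concurrent_requests
    }

    /// Whether this peer may not take another request.
    pub fn peer_at_request_capacity(&self, active_peer_requests: usize) -> bool {
        active_peer_requests >= self.max_concurrent_requests_per_peer
    }

    /// Whether no new connection may be opened.
    pub fn at_connections_capacity(&self, open_connections: usize) -> bool {
        open_connections >= self.max_open_connections
    }
}
```

The service would consult these checks before dialing a peer or dispatching a queued download.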

Notes & open questions

TODOs

  • A remaining TODO in the code: should anything special be done when dropping QUIC connections?
  • Should downloads have a timeout?
  • I know I've said this a hundred times about a hundred different things, but I would love to test this under stress scenarios with a large number of peers (don't hate me).
    In reality, after abstracting away all the IO, most scenarios can be simulated easily. What remains for a much later time, when the need and opportunity for real-case testing arise, is tuning the concurrency parameters.

Future work

Downloading Ranges

There was a requirement to download a Hash, a range of a Hash, a collection, and (not mentioned, but potentially implied) ranges of collections. There is no support for ranges right now because of the heavy duplication of the get code needed to take advantage of the proper errors added in #1362. In principle, adding ranges should be easy: it is an extension of DownloadKind that would simply calculate the missing ranges based not on the difference between what we have and the whole blob, but on the given range. I would prefer to find a way to deduplicate the get code before making this extension.
Also, as far as I can tell, there is no need for this yet.
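As a purely illustrative sketch of the extension described above (the type names are stand-ins, not the real iroh types), a ranged variant could be added to DownloadKind along these lines:

```rust
/// Stand-in for iroh's blake3 hash type, for illustration only.
type Hash = [u8; 32];

/// Sketch: the existing kinds plus a hypothetical ranged variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DownloadKind {
    /// Download a single blob entirely.
    Blob { hash: Hash },
    /// Download a collection entirely.
    Collection { hash: Hash },
    /// Possible extension: download only the given byte ranges of a blob.
    BlobRanges {
        hash: Hash,
        ranges: Vec<std::ops::Range<u64>>,
    },
}

impl DownloadKind {
    /// The hash this download is keyed on, regardless of kind, so
    /// deduplication of intents keeps working across all variants.
    pub fn hash(&self) -> &Hash {
        match self {
            DownloadKind::Blob { hash }
            | DownloadKind::Collection { hash }
            | DownloadKind::BlobRanges { hash, .. } => hash,
        }
    }
}
```

The missing-ranges calculation would then intersect what the store already has with `ranges` instead of with the whole blob.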

Prioritizing candidates per role: Provider and Candidate

A nice extension, as discussed at some point, is to differentiate candidates we know have the data from those that might have it. This has the added benefit that when a peer becomes available for another download under the concurrency limits, a hash we know they have can be downloaded right away instead of waiting for the delay. At this point this doesn't make sense, because we would likely attempt a download before the peer has retrieved the data themselves. To implement it, we would first need gossip to announce fully downloaded hashes as available.
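A minimal sketch of the distinction discussed above, with hypothetical names (`Role` and `best_peer` are not in the PR); deriving `Ord` with `Provider` last makes "known to have the data" sort higher:

```rust
/// Sketch of the provider/candidate distinction; names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    /// The peer might have the data (e.g. it is subscribed to the topic).
    Candidate,
    /// The peer is known to have the data.
    Provider,
}

/// Pick the best peer for a hash: prefer providers over candidates.
pub fn best_peer<P: Copy>(peers: &[(P, Role)]) -> Option<P> {
    peers.iter().max_by_key(|(_, role)| *role).map(|(p, _)| *p)
}
```

A provider could then skip the start delay entirely, while candidates keep the current delayed behavior.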

Leveraging the info from gossip

When declaring that a hash X should be downloaded, it's also an option to query gossip for peers subscribed to the topic X belongs to, and use them as candidates. This could be done by connecting the ProviderMap to gossip. For now I don't see the need for this.

Open questions about Future work

  • In line with the work described above, the registry only allows querying for peer candidates for a hash, since that's as good as it gets in terms of what we currently know about a remote. It's not clear to me whether we would want to change this to get better availability information, with feat(iroh-bytes): add initial query request #1413 in progress.
  • More future work: downloading a large data set/blob from multiple peers would most likely require a three-step process:
    1. understanding the data layout/size;
    2. splitting the download;
    3. actually performing the separate downloads.
      I'm generally curious how this will end. My question here is whether we should do this for every download, or only for data we expect to be big. Is there any way to obtain such a hint without relying on a query every single time?
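Step 2 of the process above could be sketched as follows; this is an assumption-laden illustration that ignores blake3 chunk alignment and any real iroh API:

```rust
/// Hypothetical sketch: split a blob of `size` bytes into at most `peers`
/// contiguous ranges of roughly equal length, one per peer. A real
/// implementation would align boundaries to the blake3 chunk size.
pub fn split_ranges(size: u64, peers: u64) -> Vec<std::ops::Range<u64>> {
    if size == 0 || peers == 0 {
        return Vec::new();
    }
    // Never create more ranges than there are bytes.
    let parts = peers.min(size);
    // Ceiling division so the last range is the short one.
    let chunk = (size + parts - 1) / parts;
    (0..parts)
        .map(|i| {
            let start = i * chunk;
            let end = (start + chunk).min(size);
            start..end
        })
        .filter(|r| !r.is_empty())
        .collect()
}
```

Each range would then become a separate ranged download dispatched to a different peer under the usual concurrency limits.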

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@b5 b5 added this to the v0.6.0-alpha2 milestone Aug 29, 2023
@Frando (Member) left a comment:
This is a very nice PR! The docs make it quite straightforward to read and understand. Thanks for that!

I did not get through to the end, will continue tomorrow or later tonight. I did not see anything that would block a merge, apart from minor nits. Will try to read through the tests still.

I think we will want to move away from the default 500ms delay ASAP, this just doesn't feel good - but that can come after merge with #1470 I'd say.

iroh/src/downloader.rs (resolved review thread):
collection_parser,
};

let service = Service::new(getter, dialer, concurrency_limits, msg_rx);
Member:

nit: I usually call the Service an Actor. I think both terms work well, but maybe let's bikeshed and then (later) unify across the Iroh codebase? It makes the codebase easier to understand when reading/learning if these "structs that have an async run(self) method and process events from different sources in a loop" are named similarly.

Contributor (author):

I prefer Service because it doesn't tie us to the Actor model and is clear enough in intent, but I'm fine with adjusting if the team agrees.

iroh/src/downloader.rs (further resolved review threads)
) {
// this is simply INITIAL_REQUEST_DELAY * attempt_num where attempt_num (as an ordinal
// number) is maxed at INITIAL_RETRY_COUNT
let delay = INITIAL_REQUEST_DELAY
Member:

I've seen exponential back-off calculations used in places like this (e.g. for reconnects). Not sure if we'd gain much by that.

Contributor (author):

Also not sure. The current maximum accumulated delay amounts to 10 seconds, if my math is right. This is accumulated delay only, not counting the time the downloads themselves take. A max total delay of 10 seconds sounds OK to me in general.
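For reference, the linear backoff from the snippet can be sketched with assumed constants; the real values live in downloader.rs and may differ, so the 10-second figure is not reproduced exactly here:

```rust
use std::time::Duration;

// Assumed values for illustration only; the real constants are in downloader.rs.
const INITIAL_REQUEST_DELAY: Duration = Duration::from_millis(500);
const INITIAL_RETRY_COUNT: u8 = 4;

/// Delay before the given attempt (1-based): linear backoff,
/// capped at INITIAL_RETRY_COUNT multiples of the initial delay.
fn request_delay(attempt_num: u8) -> Duration {
    INITIAL_REQUEST_DELAY * u32::from(attempt_num.min(INITIAL_RETRY_COUNT))
}

/// Total delay accumulated across the first `attempts` attempts,
/// not counting the time the downloads themselves take.
fn accumulated_delay(attempts: u8) -> Duration {
    (1..=attempts).map(request_delay).sum()
}
```

With these assumed constants the accumulated delay over four attempts is 500 ms + 1 s + 1.5 s + 2 s = 5 s; doubling either constant roughly reaches the 10-second ballpark mentioned above.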

iroh/src/downloader.rs (further resolved review threads)
Err(FailureAction::DropPeer(reason)) => {
debug!(%peer, ?kind, %reason, "peer will be dropped");
if let Some(_connection) = peer_info.conn.take() {
// TODO(@divma): this will fail open streams, do we want this?
Contributor:

Let's at least log this in a way that lets us see whether it is an issue.

Contributor (author):

We log it in the previous line. Is there something else you think we should log, or that should be added?

Contributor:

Maybe the connection id?

}

/// Get a blob or collection
pub async fn get<D: Store, C: CollectionParser>(
Contributor:

@rklaehn it would be good if you could review the get_* functions below, given you are the most familiar with fetching blobs.

@dignifiedquire (Contributor) left a comment:

Very nice work. Left all the useful comments I can give for this round 😅

@Frando (Member) left a comment:

Looks great, and I did not find anything that @dignifiedquire did not yet mention.

I really like the test setup with check_invariants and the mocked impls.

Let's get this in :-)

iroh/src/downloader/test.rs (resolved review thread)
@divagant-martian divagant-martian added this pull request to the merge queue Sep 13, 2023
Merged via the queue into n0-computer:main with commit c217283 Sep 13, 2023
15 checks passed
@divagant-martian divagant-martian deleted the download-manager branch September 13, 2023 14:34
@b5 b5 mentioned this pull request Sep 14, 2023