
feat(iroh): downloader #1420

Merged (97 commits) on Sep 13, 2023

Conversation

@divagant-martian (Contributor) commented Aug 28, 2023

Description

Adds the Downloader as specified in #1334, plus some back-channel conversations.
Features include:

  • Support for collections
  • Delays before starting downloads
  • Download retries with incremental backoff
  • Keeping peers around a bit longer than strictly necessary, in the hope they will be useful again
  • The concept of intents, and deduplication of download efforts
  • Cancellation of download intents
  • A limit on the total number of concurrent requests
  • A limit on the number of concurrent requests per peer
  • A limit on the total number of open connections
  • Basic error management, in the form of deciding whether a peer should be dropped, the request should be dropped, or the request should be retried
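For a rough picture of how the three limits above fit together, here is a minimal sketch in Rust. The struct, field names, and default values are made up for illustration and are not the PR's actual code:

```rust
/// Hypothetical sketch of the concurrency limits described above.
/// Names and defaults are illustrative, not the ones used in the PR.
#[derive(Debug, Clone)]
pub struct ConcurrencyLimits {
    /// Maximum number of requests in flight across all peers.
    pub max_concurrent_requests: usize,
    /// Maximum number of requests in flight to a single peer.
    pub max_concurrent_requests_per_peer: usize,
    /// Maximum number of open connections in total.
    pub max_open_connections: usize,
}

impl Default for ConcurrencyLimits {
    fn default() -> Self {
        Self {
            max_concurrent_requests: 50,
            max_concurrent_requests_per_peer: 4,
            max_open_connections: 25,
        }
    }
}

impl ConcurrencyLimits {
    /// Whether no new request may start, given the current total.
    pub fn at_requests_capacity(&self, active_requests: usize) -> bool {
        active_requests >= self.max_concurrent_requests
    }

    /// Whether this peer may not take another request.
    pub fn peer_at_request_capacity(&self, active_peer_requests: usize) -> bool {
        active_peer_requests >= self.max_concurrent_requests_per_peer
    }

    /// Whether no new connection may be opened.
    pub fn at_connections_capacity(&self, open_connections: usize) -> bool {
        open_connections >= self.max_open_connections
    }
}
```

The service would consult these checks before dialing a peer or dispatching a queued download.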

Notes & open questions

TODOs

  • A remaining TODO in the code: should anything special be done when dropping QUIC connections?
  • Should downloads have a timeout?
  • I know I've said this a hundred times about a hundred different things, but I would love to test this under stress scenarios with a large number of peers (don't hate me).
    In reality, after abstracting away all the IO, most scenarios can be simulated easily. What remains for a much later time, when the need and opportunity for real-case testing arise, is tuning the concurrency parameters.

Future work

Downloading Ranges

There was a requirement to download a Hash, a range of a Hash, a collection, and (not mentioned, but potentially implied) ranges of collections. There is no support for ranges right now because of the heavy duplication of the get code needed to take advantage of the proper errors added in #1362. In principle, adding ranges should be easy: it is an extension of DownloadKind that would simply calculate the missing ranges based not on the difference between what we have and the whole blob, but on the given range. I would prefer to find a way to deduplicate the get code before making this extension.
Also, as far as I can tell, there is no need for this yet.
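As a purely illustrative sketch of the extension described above (the type names are stand-ins, not the real iroh types), a ranged variant could be added to DownloadKind along these lines:

```rust
/// Stand-in for iroh's blake3 hash type, for illustration only.
type Hash = [u8; 32];

/// Sketch: the existing kinds plus a hypothetical ranged variant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DownloadKind {
    /// Download a single blob entirely.
    Blob { hash: Hash },
    /// Download a collection entirely.
    Collection { hash: Hash },
    /// Possible extension: download only the given byte ranges of a blob.
    BlobRanges {
        hash: Hash,
        ranges: Vec<std::ops::Range<u64>>,
    },
}

impl DownloadKind {
    /// The hash this download is keyed on, regardless of kind, so
    /// deduplication of intents keeps working across all variants.
    pub fn hash(&self) -> &Hash {
        match self {
            DownloadKind::Blob { hash }
            | DownloadKind::Collection { hash }
            | DownloadKind::BlobRanges { hash, .. } => hash,
        }
    }
}
```

The missing-ranges calculation would then intersect what the store already has with `ranges` instead of with the whole blob.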

Prioritizing candidates per role: Provider and Candidate

A nice extension, as discussed at some point, is to differentiate candidates we know have the data from those that might have it. This has the added benefit that when a peer becomes available for another download under the concurrency limits, a hash we know they have can be downloaded right away instead of waiting for the delay. At this point this doesn't make sense, because we would likely attempt a download before the peer has retrieved the data themselves. To implement it, we would first need gossip to announce fully downloaded hashes as available.
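A minimal sketch of the distinction discussed above, with hypothetical names (`Role` and `best_peer` are not in the PR); deriving `Ord` with `Provider` last makes "known to have the data" sort higher:

```rust
/// Sketch of the provider/candidate distinction; names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    /// The peer might have the data (e.g. it is subscribed to the topic).
    Candidate,
    /// The peer is known to have the data.
    Provider,
}

/// Pick the best peer for a hash: prefer providers over candidates.
pub fn best_peer<P: Copy>(peers: &[(P, Role)]) -> Option<P> {
    peers.iter().max_by_key(|(_, role)| *role).map(|(p, _)| *p)
}
```

A provider could then skip the start delay entirely, while candidates keep the current delayed behavior.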

Leveraging the info from gossip

When declaring that a hash X should be downloaded, it's also an option to query gossip for peers subscribed to the topic X belongs to, and use them as candidates. This could be done by connecting the ProviderMap to gossip. For now I don't see the need for this.

Open questions about Future work

  • In line with the work described above, the registry only allows querying for peer candidates for a hash, since that's as good as it gets in terms of what we currently know about a remote. It's not clear to me whether we would want to change this to get better availability information, with feat(iroh-bytes): add initial query request #1413 in progress.
  • More future work: downloading a large data set/blob from multiple peers would most likely require a three-step process:
    1. understanding the data layout/size;
    2. splitting the download;
    3. actually performing the separate downloads.
      I'm generally curious how this will end. My question here is whether we should do this for every download, or only for data we expect to be big. Is there any way to obtain such a hint without relying on a query every single time?
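Step 2 of the process above could be sketched as follows; this is an assumption-laden illustration that ignores blake3 chunk alignment and any real iroh API:

```rust
/// Hypothetical sketch: split a blob of `size` bytes into at most `peers`
/// contiguous ranges of roughly equal length, one per peer. A real
/// implementation would align boundaries to the blake3 chunk size.
pub fn split_ranges(size: u64, peers: u64) -> Vec<std::ops::Range<u64>> {
    if size == 0 || peers == 0 {
        return Vec::new();
    }
    // Never create more ranges than there are bytes.
    let parts = peers.min(size);
    // Ceiling division so the last range is the short one.
    let chunk = (size + parts - 1) / parts;
    (0..parts)
        .map(|i| {
            let start = i * chunk;
            let end = (start + chunk).min(size);
            start..end
        })
        .filter(|r| !r.is_empty())
        .collect()
}
```

Each range would then become a separate ranged download dispatched to a different peer under the usual concurrency limits.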

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@b5 b5 added this to the v0.6.0-alpha2 milestone Aug 29, 2023
@Frando (Member) left a comment:
This is a very nice PR! The docs make it quite straightforward to read and understand. Thanks for that!

I did not get through to the end, will continue tomorrow or later tonight. I did not see anything that would block a merge, apart from minor nits. Will try to read through the tests still.

I think we will want to move away from the default 500ms delay ASAP, this just doesn't feel good - but that can come after merge with #1470 I'd say.

iroh/src/downloader.rs (resolved review thread):
collection_parser,
};

let service = Service::new(getter, dialer, concurrency_limits, msg_rx);
Member:

nit: I usually call the Service an Actor. I think both terms work well, but maybe let's bikeshed and then (later) unify across the Iroh codebase? It makes the codebase easier to understand when reading/learning if these "structs that have an async run(self) method and process events from different sources in a loop" are named similarly.

Contributor (author):

I prefer Service because it doesn't tie us to the Actor model and is clear enough in intent, but I'm fine with adjusting if the team agrees.

iroh/src/downloader.rs (further resolved review threads)
) {
// this is simply INITIAL_REQUEST_DELAY * attempt_num where attempt_num (as an ordinal
// number) is maxed at INITIAL_RETRY_COUNT
let delay = INITIAL_REQUEST_DELAY
Member:

I've seen exponential back-off calculations used in places like this (e.g. for reconnects). Not sure if we'd gain much by that.

Contributor (author):

Also not sure. The current maximum accumulated delay amounts to 10 seconds, if my math is right. This is accumulated delay only, not counting the time the downloads themselves take. A max total delay of 10 seconds sounds OK to me in general.
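For reference, the linear backoff from the snippet can be sketched with assumed constants; the real values live in downloader.rs and may differ, so the 10-second figure is not reproduced exactly here:

```rust
use std::time::Duration;

// Assumed values for illustration only; the real constants are in downloader.rs.
const INITIAL_REQUEST_DELAY: Duration = Duration::from_millis(500);
const INITIAL_RETRY_COUNT: u8 = 4;

/// Delay before the given attempt (1-based): linear backoff,
/// capped at INITIAL_RETRY_COUNT multiples of the initial delay.
fn request_delay(attempt_num: u8) -> Duration {
    INITIAL_REQUEST_DELAY * u32::from(attempt_num.min(INITIAL_RETRY_COUNT))
}

/// Total delay accumulated across the first `attempts` attempts,
/// not counting the time the downloads themselves take.
fn accumulated_delay(attempts: u8) -> Duration {
    (1..=attempts).map(request_delay).sum()
}
```

With these assumed constants the accumulated delay over four attempts is 500 ms + 1 s + 1.5 s + 2 s = 5 s; doubling either constant roughly reaches the 10-second ballpark mentioned above.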

iroh/src/downloader.rs (further resolved review threads)
Err(FailureAction::DropPeer(reason)) => {
debug!(%peer, ?kind, %reason, "peer will be dropped");
if let Some(_connection) = peer_info.conn.take() {
// TODO(@divma): this will fail open streams, do we want this?
Contributor:

Let's at least log this in a way that lets us see whether it is an issue.

Contributor (author):

We log it in the previous line. Is there something else you think we should log, or that should be added?

Contributor:

Maybe the connection id?

}

/// Get a blob or collection
pub async fn get<D: Store, C: CollectionParser>(
Contributor:

@rklaehn it would be good if you could review the get_* functions below, given you are the most familiar with fetching blobs.

@dignifiedquire (Contributor) left a comment:

Very nice work. Left all the useful comments I can give for this round 😅

@Frando (Member) left a comment:

Looks great, and I did not find anything that @dignifiedquire did not yet mention.

I really like the test setup with check_invariants and the mocked impls.

Let's get this in :-)

iroh/src/downloader/test.rs (resolved review thread)
@divagant-martian divagant-martian added this pull request to the merge queue Sep 13, 2023
Merged via the queue into n0-computer:main with commit c217283 Sep 13, 2023
15 checks passed
@divagant-martian divagant-martian deleted the download-manager branch September 13, 2023 14:34
@b5 b5 mentioned this pull request Sep 14, 2023