feat(iroh-bytes)!: refactor downloader queue and add progress reporting #2085

Merged
66 commits merged into main from feat/downloader-next on Apr 22, 2024

Conversation

@Frando (Member) commented Mar 14, 2024

Description

This PR contains changes to the downloader:

  • Remove the Role::Provider vs. Role::Candidate distinction. We added this back when docs had no content propagation, and it no longer fits our architecture: either we know (or assume) that a node has the content, or we don't. The in-between state no longer makes sense, IMO.
  • Rework the queuing logic around a simple queue of pending downloads. Previously, if a download could not be started because the concurrency limits were reached, it was considered failed and inserted, with a delay, into the retry queue. Now, if a download cannot be started, we simply wait until a slot frees up and then start it. Note that the queue is, for now, deliberately simple: if the download at the head of the queue cannot be started (e.g. because all of its provider nodes are currently busy with other downloads), we do not try to start the next download in the queue, but wait until a slot frees up. We could optimize this by "jumping the queue" in certain cases, but that would need extra logic to ensure downloads cannot be "forgotten". For now, the queue is therefore handled strictly in order (a simplified sketch of the queue and parking logic follows this list).
  • The retry behavior is refactored: we now retry nodes (not downloads, as before) up to a retry limit, with an increasing timeout. If a download can only proceed with a node that is currently retrying, it is parked and the next item in the queue is processed. The download is unparked once the retrying node successfully connects.
  • Add progress reporting for downloads managed through the downloader. For this I wrote a SharedProgress handler that allows subscribing to already running downloads: if an intent is registered for hash A, the download is started, and another intent is registered for the same hash while it is running, the second intent now receives a DownloadProgress::InitialState containing a TransferProgress, which acts as a reducer for the progress events. This can be used from the client as well, and further events can be reduced/merged into that struct. The PR contains a test for this concurrent progress reporting.
  • Expose the downloader in the iroh node. Download requests via the RPC API can now set a DownloadMode enum to either Direct or Queued: the former behaves as before (issues an iroh-bytes request directly, without a queue or concurrency limits), while the latter adds the download to the downloader queue.
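
For illustration, here is a minimal, self-contained sketch of the queue/park behaviour described above. All names (ToyQueue, NextStep, DownloadId, on_slot_free, unpark) are stand-ins for this example, not the actual downloader types, and the real implementation tracks much more state (per-node connections, intents, tags, concurrency limits).

```rust
// Toy model of the strict in-order queue with parking, for illustration only.
use std::collections::{HashSet, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct DownloadId(u64);

/// What to do with the download at the head of the queue.
#[derive(Debug)]
enum NextStep {
    /// A provider node and a transfer slot are available: start it.
    Start,
    /// Concurrency limits are reached: wait until a slot frees up.
    Wait,
    /// Only retrying nodes could serve it: park it and look at the next entry.
    Park,
}

struct ToyQueue {
    main: VecDeque<DownloadId>,
    parked: HashSet<DownloadId>,
}

impl ToyQueue {
    /// Called whenever a transfer slot frees up. Only the head of the queue is
    /// considered; there is no "queue jumping".
    fn on_slot_free(&mut self, next_step: impl Fn(DownloadId) -> NextStep) -> Option<DownloadId> {
        let head = *self.main.front()?;
        match next_step(head) {
            NextStep::Start => {
                self.main.pop_front();
                Some(head)
            }
            // The head cannot start yet: do nothing until the next free slot.
            NextStep::Wait => None,
            // The head depends on retrying nodes only: park it and evaluate
            // the new head in the same pass.
            NextStep::Park => {
                self.main.pop_front();
                self.parked.insert(head);
                self.on_slot_free(next_step)
            }
        }
    }

    /// Called when a retrying node reconnects: downloads that were parked
    /// waiting for it go back to the front of the main queue.
    fn unpark(&mut self, ids: impl IntoIterator<Item = DownloadId>) {
        for id in ids {
            if self.parked.remove(&id) {
                self.main.push_front(id);
            }
        }
    }
}

fn main() {
    let mut queue = ToyQueue {
        main: VecDeque::from([DownloadId(1), DownloadId(2)]),
        parked: HashSet::new(),
    };
    // Download 1 can only proceed with a retrying node, download 2 can start.
    let started = queue.on_slot_free(|id| {
        if id == DownloadId(1) { NextStep::Park } else { NextStep::Start }
    });
    assert_eq!(started, Some(DownloadId(2)));
    // The retrying node for download 1 reconnects: it is queued again.
    queue.unpark([DownloadId(1)]);
    assert_eq!(queue.main.front(), Some(&DownloadId(1)));
}
```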

Breaking changes

Changes in iroh:

  • The BlobDownloadRequest has a new field mode to select between direct and queued downloads, and now contains a list of nodes instead of a single node (an illustrative sketch of the request shape follows below)
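
As an illustration of the new request shape and of direct mode trying the listed nodes in order (described in n0-computer#2128, quoted later in this thread): only the mode field, the node list and the Direct/Queued variants come from this PR; everything else here (NodeAddr, download_direct, the omitted hash/format/tag fields) is assumed for the sake of the example.

```rust
// Illustrative only: the real BlobDownloadRequest has more fields (hash,
// format, tag, ...) and NodeAddr carries real addressing info.
#[derive(Debug, Clone)]
struct NodeAddr(String);

/// Selects how a download request is executed.
#[derive(Debug, Clone, Copy)]
enum DownloadMode {
    /// Issue the iroh-bytes request directly, without queue or concurrency limits.
    Direct,
    /// Hand the request to the downloader queue.
    Queued,
}

/// Sketch of the new request shape: a list of candidate nodes plus a mode.
#[derive(Debug)]
struct BlobDownloadRequestSketch {
    nodes: Vec<NodeAddr>,
    mode: DownloadMode,
}

/// Direct mode per n0-computer#2128: try the provider nodes in order until one
/// succeeds. `fetch` is a stand-in for the actual transfer.
fn download_direct<E>(
    req: &BlobDownloadRequestSketch,
    mut fetch: impl FnMut(&NodeAddr) -> Result<(), E>,
) -> Result<(), Vec<E>> {
    let mut errors = Vec::new();
    for node in &req.nodes {
        match fetch(node) {
            Ok(()) => return Ok(()),
            Err(err) => errors.push(err),
        }
    }
    Err(errors)
}

fn main() {
    let req = BlobDownloadRequestSketch {
        nodes: vec![NodeAddr("node-a".into()), NodeAddr("node-b".into())],
        mode: DownloadMode::Direct,
    };
    // node-a fails to dial, node-b succeeds.
    let result = download_direct(&req, |node| {
        if node.0 == "node-a" { Err("dial failed") } else { Ok(()) }
    });
    assert!(result.is_ok());
    println!("mode: {:?}", req.mode);
}
```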

Changes in iroh_bytes:

  • Role enum is removed
  • Downloader::queue now takes a DownloadRequest with more options than before
  • DownloadProgress has a new variant InitialState, which is emitted when attaching to an already-running download (a toy sketch of the reducer idea follows this list)
  • ConcurrencyLimits gained a new field
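
As a rough illustration of the InitialState / reducer idea (a late subscriber receives a snapshot and keeps folding further events into it), here is a toy sketch; the types and fields are illustrative, not the actual iroh-bytes DownloadProgress or TransferProgress definitions.

```rust
// Toy reducer, for illustration only.

/// Simplified progress events.
#[derive(Debug, Clone)]
enum ProgressEvent {
    /// Snapshot sent to intents that attach to an already-running download.
    InitialState(TransferSnapshot),
    /// Some bytes were received.
    Progress { bytes: u64 },
    /// The transfer finished.
    Done,
}

/// Stand-in for the reducer struct: accumulates events into a current state.
#[derive(Debug, Clone, Default)]
struct TransferSnapshot {
    bytes_received: u64,
    done: bool,
}

impl TransferSnapshot {
    /// Fold one event into the snapshot (the "reducer" step).
    fn update(&mut self, event: &ProgressEvent) {
        match event {
            ProgressEvent::InitialState(state) => *self = state.clone(),
            ProgressEvent::Progress { bytes } => self.bytes_received += bytes,
            ProgressEvent::Done => self.done = true,
        }
    }
}

fn main() {
    // A second intent attaches while the download is running: it first
    // receives the initial state, then keeps folding further events into it.
    let events = [
        ProgressEvent::InitialState(TransferSnapshot { bytes_received: 1024, done: false }),
        ProgressEvent::Progress { bytes: 512 },
        ProgressEvent::Done,
    ];
    let mut snapshot = TransferSnapshot::default();
    for event in &events {
        snapshot.update(event);
    }
    assert_eq!(snapshot.bytes_received, 1536);
    assert!(snapshot.done);
}
```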

Other changes:

  • SetTagOption was moved from iroh to iroh-bytes

Notes & open questions

  • Another follow-up improves content downloading in docs: refactor: improve content downloading in docs #2127.

  • A few more tests around the queuing behavior would be great

  • I have a partially done followup which adds a hook for content discovery to the downloader

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@Frando Frando force-pushed the feat/downloader-next branch 2 times, most recently from 96cbd60 to ae3154f on March 15, 2024 00:06
@Frando Frando changed the title (wip) feat: downloader queue handling and progress reporting feat(iroh-bytes): refactor downloader queue handling and add progress reporting Mar 15, 2024
iroh-bytes/src/downloader/invariants.rs (outdated review thread, resolved)
use super::DownloadKind;

/// The channel that can be used to subscribe to progress updates.
pub type ProgressSubscriber = FlumeProgressSender<DownloadProgress>;
Contributor:

Once it is merged we can make this a BoxedProgressSender so we don't have flume in here...

iroh/src/node/rpc.rs (review thread, resolved)

let ActiveRequestInfo { node, temp_tag, .. } = active_request_info;

// get node info
let node_info = self
    .nodes
    .get_mut(&node)
    .expect("node exists in the mapping");
Member Author (Frando):

@ppodolsky triggered this expect when using this branch. I have a suspicion and am writing a test to confirm.

@Frando (Member Author), Apr 18, 2024:

This is now fixed. I added the test, it failed, and now passes. The test is fail_while_running.

## Description

adds retries to the downloader:

* after a node fails to dial, or fails during a transfer with IO errors,
we move the node into a delay queue, with an increasing timeout. Once
the delay expires, the node may be dialed again (a backoff sketch follows
below).
* if the head of the download queue can only proceed with nodes which
are currently in the retry queue, we *park* the download in a separate
set from the main queue, and proceed with the next download
* once a retrying node succeeds in reconnecting, we unpark all parked
hashes for which this node is a candidate
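
A minimal sketch of the per-node retry bookkeeping described above, assuming a linear backoff and an arbitrary retry limit; the delay formula and limit used by the actual downloader may differ.

```rust
// Illustrative retry/backoff bookkeeping per node; constants are assumptions.
use std::time::Duration;

const MAX_RETRIES: u32 = 5; // assumed limit
const INITIAL_RETRY_DELAY: Duration = Duration::from_millis(500); // assumed base delay

#[derive(Debug, Default)]
struct RetryState {
    attempts: u32,
}

impl RetryState {
    /// Called after a failed dial or a failed transfer: returns how long to
    /// wait before the node may be dialed again, or `None` once the node has
    /// exhausted its retries and should be dropped.
    fn next_delay(&mut self) -> Option<Duration> {
        self.attempts += 1;
        if self.attempts > MAX_RETRIES {
            return None;
        }
        // Linearly increasing delay; the real implementation may scale differently.
        Some(INITIAL_RETRY_DELAY * self.attempts)
    }

    /// Called after a successful connection: the retry counter is reset.
    fn on_success(&mut self) {
        self.attempts = 0;
    }
}

fn main() {
    let mut state = RetryState::default();
    assert_eq!(state.next_delay(), Some(Duration::from_millis(500)));
    assert_eq!(state.next_delay(), Some(Duration::from_millis(1000)));
    state.on_success();
    assert_eq!(state.next_delay(), Some(Duration::from_millis(500)));
}
```
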
@divagant-martian (Contributor) left a comment:

Minor comments. TBH this was a less thorough review, due to reviewer exhaustion, which I assume you as PR owner are suffering from as well. Taking into account that we just did a release, I think we have time to deal with any bugs in future PRs.

for entry in self.queue.iter_parked() {
    assert!(
        matches!(self.next_step(entry), NextStep::Park),
        "next step for parked node ist WaitForNodeRetry"
Contributor:

Suggested change:
- "next step for parked node ist WaitForNodeRetry"
+ "next step for parked node is WaitForNodeRetry"

Contributor:

Writing this without first checking the enum changes, but this assertion message is confusing. I'm reading it as:

for each parked item in the queue, check that the next step is Park but when it's not, say that it's WaitForNodeRetry ?

Member Author (Frando):

The message is wrong; the enum variant was renamed earlier but the assertion message was not updated. Also sloppy language. Rephrased to "all parked downloads evaluate to the correct next step".

async move {
    if kind == blob_fail.into() {
        tokio::time::sleep(Duration::from_millis(10)).await;
        Err(FailureAction::DropPeer(anyhow!("bad!")))
Contributor:

🤣

@@ -345,3 +361,134 @@ async fn long_queue() {
        res.expect("all downloads to succeed");
    }
}

#[tokio::test]
async fn fail_while_running() {
Contributor:

all the behaviour here makes sense but I don't understand why the test was needed, or let's say.. what particular scenario it's testing.

My wild guess is that it tests that.. concurrent downloads being handled by the same peer are not cancelled if one fails? Does it matter whether or not the peer was removed (DropPeer) after both are done?

In any case, could you add some docs to the test so that it can be updated more easily when we forget why it was added?

@Frando (Member Author), Apr 19, 2024:

Your intuition was correct. The branch previously had a bug (spotted by @ppodolsky) where DropPeer would drop the peer unconditionally, without first checking whether other downloads were still running. The test recreates that scenario and passes now that the bug is fixed.
I added this comment to the test:

/// If a download errors with [`FailureAction::DropPeer`], make sure that the peer is not dropped
/// while other transfers are still running.

Does it matter whether or not the peer was removed (DropPeer) after both are done?

Currently, a peer will only be dropped after a FailureAction::DropPeer if no other transfer is running. My intuition for that choice was:

  • If the other transfers after the failed one succeed without error, the condition that caused DropPeer was likely transient, so we just keep the peer and ignore the previous DropPeer
  • If the failure condition remains, the other transfers will fail with DropPeer as well, and after the last transfer fails, the peer will be dropped.

As I'm typing this, the following scenario would be possible in the current implementation, but likely should be avoided:

  • Queue hashA, hashB with nodeA. hashA, hashB are started
  • hashA fails with DropPeer -> peer is not dropped because hashB is still running
  • Queue hashC with nodeA. hashC is started
  • hashB fails with DropPeer -> peer is not dropped because hashC is still running
  • Queue hashD with nodeA. hashD is started
  • ... repeat as above

In plain words: if I keep queuing new downloads to a bad node at some frequency, the peer might never be dropped even though it is clearly faulty. Downloads will still succeed if other providers are available, but cycles will be wasted and resource capacity needlessly consumed.

I think I will add a test for this case. A fix could be to mark a node as failed (or count its failures) and not queue new downloads to nodes marked this way (but not drop them while transfers are running). After a successful transfer, the failed marker is cleared. A sketch of this idea follows below.
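
A sketch of that proposed fix (a per-node failure marker, cleared on success); this is purely illustrative and not part of this PR.

```rust
// Count DropPeer failures per node and skip marked nodes when assigning *new*
// downloads, without dropping the connection while transfers still run.
// All names are illustrative.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId(u8);

#[derive(Default)]
struct NodeHealth {
    /// Number of DropPeer failures since the last successful transfer.
    drop_peer_failures: HashMap<NodeId, u32>,
}

impl NodeHealth {
    fn on_drop_peer_failure(&mut self, node: NodeId) {
        *self.drop_peer_failures.entry(node).or_default() += 1;
    }

    fn on_transfer_success(&mut self, node: NodeId) {
        // A successful transfer clears the failure marker.
        self.drop_peer_failures.remove(&node);
    }

    /// New downloads are only assigned to nodes that are not marked as failing.
    fn may_assign_new_download(&self, node: NodeId) -> bool {
        self.drop_peer_failures.get(&node).copied().unwrap_or(0) == 0
    }
}

fn main() {
    let (node_a, node_b) = (NodeId(0), NodeId(1));
    let mut health = NodeHealth::default();
    health.on_drop_peer_failure(node_a);
    assert!(!health.may_assign_new_download(node_a));
    assert!(health.may_assign_new_download(node_b));
    health.on_transfer_success(node_a);
    assert!(health.may_assign_new_download(node_a));
}
```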

iroh-bytes/src/downloader/test.rs (outdated review thread, resolved)
iroh-bytes/src/downloader/test.rs (outdated review thread, resolved)
iroh-bytes/src/downloader/test.rs (review thread, resolved)
/// Queue to manage dropping nodes.
/// Nodes to which we have an active or idle connection
connected_nodes: HashMap<NodeId, ConnectionInfo<D::Connection>>,
/// Nodes that failed and for which we track retries
Contributor:

failed.. dialing? or had a failed download? both?

Member Author (Frando):

It's both. Adjusted the comment.

@Frando Frando changed the title feat(iroh-bytes): refactor downloader queue and add progress reporting feat(iroh-bytes)!: refactor downloader queue and add progress reporting Apr 19, 2024
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
* refactor: downloader

* test: add test for concurrent progress reporting

* chore: cleanup

* refactor: rename some things for clarity

* feat: limit concurrent dials per download

* chore: cleanup

* refactor: respect progress sender IDs to allow reusing the senders in subsequent operations

* chore: fmt & clippy

* cleanup: shared download progress

* feat: handle tags in the downloader

* fix: visibility

* fixup

* fixup

* chore: clippy

* fixup

* refactor: make export a seperate operation from download

* fixup after merge

* refactor: remove id mapping

* refactor: move BlobId up to the progress events

* fix tests

* cleanup

* fix: invariants

* fixes and cleanups

* fix: nodes_have should only add to queued downloads

* chore: fmt

* chore: clippy!

* fix: do not queue downloads with no providers

* feat: set more than one node on download requests (n0-computer#2128)

## Description

Based on n0-computer#2085 

The downloader already has the feature to try multiple nodes for a
single hash or hashseq. With n0-computer#2085 we expose the downloader over the RPC
protocol, by adding a `DownloadMode { Queued, Direct }` enum to the
`BlobDownloadRequest`.
This PR modifies the `BlobDownloadRequest` to include a list of provider
nodes for a hash instead of just a single node. For queued requests that
go to the downloader, the nodes are just passed to the downloader. For
direct requests, the nodes are tried in-order, until one succeeds.

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.

* fix: address review by divma

* fix: remove now obsolete into_send method from ProgressSender

* chore: remove unused code

* fix: use Display for DownloadKind

* fix: first round of changes after review

* docs: downloader next_step

* more review

* more review

* fix: only remove hashes from provider map if they are really not needed

* refactor: move TagSet into util

* refactor: store intents in request info

* cleanup

* refactor: no allocation in next_step and simpler code

* fix: test after merge

* refactor: better method

* fix(iroh-bytes): do not log redundant file delete error (n0-computer#2199)

## Description

fix(iroh-bytes): do not log when a file can not be deleted...

because it is not there. this makes the delete op idempotent, and also
helps in case the file was not there in the first place.

Fixes n0-computer#2142

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.

* fix(iroh-dns-server): fix bug in pkarr name parsing (n0-computer#2200)

## Description

The order of labels when parsing a DNS query as pkarr name was reversed.
This means all queries for second-level subdomains under a pkarr pubkey
failed.

* First commit is the actual fix. 
* Second commit adds a test for various record name and type variations
* Third and fourth commits improve the debug and info logging

## Notes & open questions

Was discovered by @rklaehn in n0-computer#2188 and fixes the wrong behavior
observed there.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

---------

Co-authored-by: Franz Heinzmann (Frando) <frando@unbiskant.org>
Co-authored-by: Rüdiger Klaehn <rklaehn@protonmail.com>
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
@Frando Frando added this pull request to the merge queue Apr 22, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 22, 2024
@dignifiedquire dignifiedquire added this pull request to the merge queue Apr 22, 2024
Merged via the queue into main with commit 93290e3 Apr 22, 2024
20 of 21 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Apr 23, 2024
## Description

based on #2085 

* do not start downloads without any known node
* track missing content hashes in the live sync session
* start downloads after content propagation properly
* reenable the sync_big test, let's see what CI says this time (was
flakey before)

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.