feat(iroh-bytes)!: refactor downloader queue and add progress reporting #2085

Merged
66 commits merged into main from feat/downloader-next on Apr 22, 2024

Conversation

@Frando (Member) commented Mar 14, 2024

Description

This PR contains changes to the downloader:

  • Remove the Role::Provider vs. Role::Candidate distinction. We added this back when docs had no content propagation, and it no longer fits our architecture: either we know (or assume) that a node has the content, or we don't. The in-between state no longer makes sense, IMO.
  • Rework the queuing logic around a simple queue of pending downloads. Previously, if a download could not be started because the concurrency limits were reached, it was considered failed and inserted, with a delay, into the retry queue. Now, if a download cannot be started, we simply wait until a slot frees up and then start it. Note that the queue is, for now, deliberately simple: if the download at the head of the queue cannot be started (e.g. because all of its provider nodes are currently busy with other downloads), we do not try to start the next download in the queue, but wait until a slot frees up. We could optimize this by "jumping the queue" in certain cases, but that would need extra logic to ensure downloads cannot be "forgotten". For now, the queue is therefore handled strictly in order (a simplified sketch of the queue and parking logic follows this list).
  • The retry behavior is refactored: we now retry nodes (not downloads, as before) up to a retry limit, with an increasing timeout. If a download can only proceed with a node that is currently retrying, it is parked and the next item in the queue is processed. The download is unparked once the retrying node successfully connects.
  • Add progress reporting for downloads managed through the downloader. For this I wrote a SharedProgress handler that allows subscribing to already running downloads: if an intent is registered for hash A, the download is started, and another intent is registered for the same hash while it is running, the second intent now receives a DownloadProgress::InitialState containing a TransferProgress, which acts as a reducer for the progress events. This can be used from the client as well, and further events can be reduced/merged into that struct. The PR contains a test for this concurrent progress reporting.
  • Expose the downloader in the iroh node. Download requests via the RPC API can now set a DownloadMode enum to either Direct or Queued: the former behaves as before (issues an iroh-bytes request directly, without a queue or concurrency limits), while the latter adds the download to the downloader queue.
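
For illustration, here is a minimal, self-contained sketch of the queue/park behaviour described above. All names (ToyQueue, NextStep, DownloadId, on_slot_free, unpark) are stand-ins for this example, not the actual downloader types, and the real implementation tracks much more state (per-node connections, intents, tags, concurrency limits).

```rust
// Toy model of the strict in-order queue with parking, for illustration only.
use std::collections::{HashSet, VecDeque};

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct DownloadId(u64);

/// What to do with the download at the head of the queue.
#[derive(Debug)]
enum NextStep {
    /// A provider node and a transfer slot are available: start it.
    Start,
    /// Concurrency limits are reached: wait until a slot frees up.
    Wait,
    /// Only retrying nodes could serve it: park it and look at the next entry.
    Park,
}

struct ToyQueue {
    main: VecDeque<DownloadId>,
    parked: HashSet<DownloadId>,
}

impl ToyQueue {
    /// Called whenever a transfer slot frees up. Only the head of the queue is
    /// considered; there is no "queue jumping".
    fn on_slot_free(&mut self, next_step: impl Fn(DownloadId) -> NextStep) -> Option<DownloadId> {
        let head = *self.main.front()?;
        match next_step(head) {
            NextStep::Start => {
                self.main.pop_front();
                Some(head)
            }
            // The head cannot start yet: do nothing until the next free slot.
            NextStep::Wait => None,
            // The head depends on retrying nodes only: park it and evaluate
            // the new head in the same pass.
            NextStep::Park => {
                self.main.pop_front();
                self.parked.insert(head);
                self.on_slot_free(next_step)
            }
        }
    }

    /// Called when a retrying node reconnects: downloads that were parked
    /// waiting for it go back to the front of the main queue.
    fn unpark(&mut self, ids: impl IntoIterator<Item = DownloadId>) {
        for id in ids {
            if self.parked.remove(&id) {
                self.main.push_front(id);
            }
        }
    }
}

fn main() {
    let mut queue = ToyQueue {
        main: VecDeque::from([DownloadId(1), DownloadId(2)]),
        parked: HashSet::new(),
    };
    // Download 1 can only proceed with a retrying node, download 2 can start.
    let started = queue.on_slot_free(|id| {
        if id == DownloadId(1) { NextStep::Park } else { NextStep::Start }
    });
    assert_eq!(started, Some(DownloadId(2)));
    // The retrying node for download 1 reconnects: it is queued again.
    queue.unpark([DownloadId(1)]);
    assert_eq!(queue.main.front(), Some(&DownloadId(1)));
}
```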

Breaking changes

Changes in iroh:

  • The BlobDownloadRequest has a new field mode to select between direct and queued downloads, and now contains a list of nodes instead of a single node (an illustrative sketch of the request shape follows below)
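
As an illustration of the new request shape and of direct mode trying the listed nodes in order (described in n0-computer#2128, quoted later in this thread): only the mode field, the node list and the Direct/Queued variants come from this PR; everything else here (NodeAddr, download_direct, the omitted hash/format/tag fields) is assumed for the sake of the example.

```rust
// Illustrative only: the real BlobDownloadRequest has more fields (hash,
// format, tag, ...) and NodeAddr carries real addressing info.
#[derive(Debug, Clone)]
struct NodeAddr(String);

/// Selects how a download request is executed.
#[derive(Debug, Clone, Copy)]
enum DownloadMode {
    /// Issue the iroh-bytes request directly, without queue or concurrency limits.
    Direct,
    /// Hand the request to the downloader queue.
    Queued,
}

/// Sketch of the new request shape: a list of candidate nodes plus a mode.
#[derive(Debug)]
struct BlobDownloadRequestSketch {
    nodes: Vec<NodeAddr>,
    mode: DownloadMode,
}

/// Direct mode per n0-computer#2128: try the provider nodes in order until one
/// succeeds. `fetch` is a stand-in for the actual transfer.
fn download_direct<E>(
    req: &BlobDownloadRequestSketch,
    mut fetch: impl FnMut(&NodeAddr) -> Result<(), E>,
) -> Result<(), Vec<E>> {
    let mut errors = Vec::new();
    for node in &req.nodes {
        match fetch(node) {
            Ok(()) => return Ok(()),
            Err(err) => errors.push(err),
        }
    }
    Err(errors)
}

fn main() {
    let req = BlobDownloadRequestSketch {
        nodes: vec![NodeAddr("node-a".into()), NodeAddr("node-b".into())],
        mode: DownloadMode::Direct,
    };
    // node-a fails to dial, node-b succeeds.
    let result = download_direct(&req, |node| {
        if node.0 == "node-a" { Err("dial failed") } else { Ok(()) }
    });
    assert!(result.is_ok());
    println!("mode: {:?}", req.mode);
}
```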

Changes in iroh_bytes:

  • Role enum is removed
  • Downloader::queue now takes a DownloadRequest with more options than before
  • DownloadProgress has a new variant InitialState, which is emitted when attaching to an already-running download (a toy sketch of the reducer idea follows this list)
  • ConcurrencyLimits gained a new field
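
As a rough illustration of the InitialState / reducer idea (a late subscriber receives a snapshot and keeps folding further events into it), here is a toy sketch; the types and fields are illustrative, not the actual iroh-bytes DownloadProgress or TransferProgress definitions.

```rust
// Toy reducer, for illustration only.

/// Simplified progress events.
#[derive(Debug, Clone)]
enum ProgressEvent {
    /// Snapshot sent to intents that attach to an already-running download.
    InitialState(TransferSnapshot),
    /// Some bytes were received.
    Progress { bytes: u64 },
    /// The transfer finished.
    Done,
}

/// Stand-in for the reducer struct: accumulates events into a current state.
#[derive(Debug, Clone, Default)]
struct TransferSnapshot {
    bytes_received: u64,
    done: bool,
}

impl TransferSnapshot {
    /// Fold one event into the snapshot (the "reducer" step).
    fn update(&mut self, event: &ProgressEvent) {
        match event {
            ProgressEvent::InitialState(state) => *self = state.clone(),
            ProgressEvent::Progress { bytes } => self.bytes_received += bytes,
            ProgressEvent::Done => self.done = true,
        }
    }
}

fn main() {
    // A second intent attaches while the download is running: it first
    // receives the initial state, then keeps folding further events into it.
    let events = [
        ProgressEvent::InitialState(TransferSnapshot { bytes_received: 1024, done: false }),
        ProgressEvent::Progress { bytes: 512 },
        ProgressEvent::Done,
    ];
    let mut snapshot = TransferSnapshot::default();
    for event in &events {
        snapshot.update(event);
    }
    assert_eq!(snapshot.bytes_received, 1536);
    assert!(snapshot.done);
}
```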

Other changes:

  • SetTagOption was moved from iroh to iroh-bytes

Notes & open questions

  • Another follow-up improves content downloading in docs: refactor: improve content downloading in docs #2127.

  • A few more tests around the queuing behavior would be great

  • I have a partially done followup which adds a hook for content discovery to the downloader

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.

@Frando Frando force-pushed the feat/downloader-next branch 2 times, most recently from 96cbd60 to ae3154f on March 15, 2024 00:06
@Frando Frando changed the title (wip) feat: downloader queue handling and progress reporting feat(iroh-bytes): refactor downloader queue handling and add progress reporting Mar 15, 2024
iroh-bytes/src/downloader/invariants.rs (outdated review thread, resolved)
use super::DownloadKind;

/// The channel that can be used to subscribe to progress updates.
pub type ProgressSubscriber = FlumeProgressSender<DownloadProgress>;
Contributor:

Once it is merged we can make this a BoxedProgressSender so we don't have flume in here...

iroh/src/node/rpc.rs (review thread, resolved)

let ActiveRequestInfo { node, temp_tag, .. } = active_request_info;

// get node info
let node_info = self
    .nodes
    .get_mut(&node)
    .expect("node exists in the mapping");
Member Author (Frando):

@ppodolsky triggered this expect when using this branch. I have a suspicion and am writing a test to confirm.

@Frando (Member Author), Apr 18, 2024:

This is now fixed. I added the test, it failed, and now passes. The test is fail_while_running.

## Description

adds retries to the downloader:

* after a node fails to dial, or fails during a transfer with IO errors,
we move the node into a delay queue, with an increasing timeout. Once
the delay expires, the node may be dialed again (a backoff sketch follows
below).
* if the head of the download queue can only proceed with nodes which
are currently in the retry queue, we *park* the download in a separate
set from the main queue, and proceed with the next download
* once a retrying node succeeds in reconnecting, we unpark all parked
hashes for which this node is a candidate
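
A minimal sketch of the per-node retry bookkeeping described above, assuming a linear backoff and an arbitrary retry limit; the delay formula and limit used by the actual downloader may differ.

```rust
// Illustrative retry/backoff bookkeeping per node; constants are assumptions.
use std::time::Duration;

const MAX_RETRIES: u32 = 5; // assumed limit
const INITIAL_RETRY_DELAY: Duration = Duration::from_millis(500); // assumed base delay

#[derive(Debug, Default)]
struct RetryState {
    attempts: u32,
}

impl RetryState {
    /// Called after a failed dial or a failed transfer: returns how long to
    /// wait before the node may be dialed again, or `None` once the node has
    /// exhausted its retries and should be dropped.
    fn next_delay(&mut self) -> Option<Duration> {
        self.attempts += 1;
        if self.attempts > MAX_RETRIES {
            return None;
        }
        // Linearly increasing delay; the real implementation may scale differently.
        Some(INITIAL_RETRY_DELAY * self.attempts)
    }

    /// Called after a successful connection: the retry counter is reset.
    fn on_success(&mut self) {
        self.attempts = 0;
    }
}

fn main() {
    let mut state = RetryState::default();
    assert_eq!(state.next_delay(), Some(Duration::from_millis(500)));
    assert_eq!(state.next_delay(), Some(Duration::from_millis(1000)));
    state.on_success();
    assert_eq!(state.next_delay(), Some(Duration::from_millis(500)));
}
```
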
@divagant-martian (Contributor) left a comment:

Minor comments. TBH this was a less thorough review, due to reviewer exhaustion, which I assume you as PR owner are suffering from as well. Taking into account that we just did a release, I think we have time to deal with any bugs in future PRs.

for entry in self.queue.iter_parked() {
    assert!(
        matches!(self.next_step(entry), NextStep::Park),
        "next step for parked node ist WaitForNodeRetry"
Contributor:

Suggested change:
- "next step for parked node ist WaitForNodeRetry"
+ "next step for parked node is WaitForNodeRetry"

Contributor:

Writing this without first checking the enum changes, but this assertion message is confusing. I'm reading it as:

for each parked item in the queue, check that the next step is Park but when it's not, say that it's WaitForNodeRetry ?

Member Author (Frando):

The message is wrong; the enum variant was renamed earlier but the assertion message was not updated. Also sloppy language. Rephrased to "all parked downloads evaluate to the correct next step".

async move {
    if kind == blob_fail.into() {
        tokio::time::sleep(Duration::from_millis(10)).await;
        Err(FailureAction::DropPeer(anyhow!("bad!")))
Contributor:

🤣

@@ -345,3 +361,134 @@ async fn long_queue() {
        res.expect("all downloads to succeed");
    }
}

#[tokio::test]
async fn fail_while_running() {
Contributor:

all the behaviour here makes sense but I don't understand why the test was needed, or let's say.. what particular scenario it's testing.

My wild guess is that it tests that.. concurrent downloads being handled by the same peer are not cancelled if one fails? Does it matter whether or not the peer was removed (DropPeer) after both are done?

In any case, could you add some docs to the test so that it can be updated more easily when we forget why it was added?

@Frando (Member Author), Apr 19, 2024:

Your intuition was correct. The branch previously had a bug (spotted by @ppodolsky) where DropPeer would drop the peer unconditionally, without first checking whether other downloads were still running. The test recreates that scenario and passes now that the bug is fixed.
I added this comment to the test:

/// If a download errors with [`FailureAction::DropPeer`], make sure that the peer is not dropped
/// while other transfers are still running.

Does it matter whether or not the peer was removed (DropPeer) after both are done?

Currently, a peer will only be dropped after a FailureAction::DropPeer if no other transfer is running. My intuition for that choice was:

  • If the other transfers after the failed one succeed without error, the condition that caused DropPeer was likely transient, so we just keep the peer and ignore the previous DropPeer
  • If the failure condition remains, the other transfers will fail with DropPeer as well, and after the last transfer fails, the peer will be dropped.

As I'm typing this, the following scenario would be possible in the current implementation, but likely should be avoided:

  • Queue hashA, hashB with nodeA. hashA, hashB are started
  • hashA fails with DropPeer -> peer is not dropped because hashB is still running
  • Queue hashC with nodeA. hashC is started
  • hashB fails with DropPeer -> peer is not dropped because hashC is still running
  • Queue hashD with nodeA. hashD is started
  • ... repeat as above

In plain words: if I keep queuing new downloads to a bad node at some frequency, the peer might never be dropped even though it is clearly faulty. Downloads will still succeed if other providers are available, but cycles will be wasted and resource capacity needlessly consumed.

I think I will add a test for this case. A fix could be to mark a node as failed (or count its failures) and not queue new downloads to nodes marked this way (but not drop them while transfers are running). After a successful transfer, the failed marker is cleared. A sketch of this idea follows below.
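
A sketch of that proposed fix (a per-node failure marker, cleared on success); this is purely illustrative and not part of this PR.

```rust
// Count DropPeer failures per node and skip marked nodes when assigning *new*
// downloads, without dropping the connection while transfers still run.
// All names are illustrative.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct NodeId(u8);

#[derive(Default)]
struct NodeHealth {
    /// Number of DropPeer failures since the last successful transfer.
    drop_peer_failures: HashMap<NodeId, u32>,
}

impl NodeHealth {
    fn on_drop_peer_failure(&mut self, node: NodeId) {
        *self.drop_peer_failures.entry(node).or_default() += 1;
    }

    fn on_transfer_success(&mut self, node: NodeId) {
        // A successful transfer clears the failure marker.
        self.drop_peer_failures.remove(&node);
    }

    /// New downloads are only assigned to nodes that are not marked as failing.
    fn may_assign_new_download(&self, node: NodeId) -> bool {
        self.drop_peer_failures.get(&node).copied().unwrap_or(0) == 0
    }
}

fn main() {
    let (node_a, node_b) = (NodeId(0), NodeId(1));
    let mut health = NodeHealth::default();
    health.on_drop_peer_failure(node_a);
    assert!(!health.may_assign_new_download(node_a));
    assert!(health.may_assign_new_download(node_b));
    health.on_transfer_success(node_a);
    assert!(health.may_assign_new_download(node_a));
}
```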

iroh-bytes/src/downloader/test.rs (outdated review thread, resolved)
iroh-bytes/src/downloader/test.rs (outdated review thread, resolved)
iroh-bytes/src/downloader/test.rs (review thread, resolved)
/// Queue to manage dropping nodes.
/// Nodes to which we have an active or idle connection
connected_nodes: HashMap<NodeId, ConnectionInfo<D::Connection>>,
/// Nodes that failed and for which we track retries
Contributor:

failed.. dialing? or had a failed download? both?

Member Author (Frando):

It's both. Adjusted the comment.

@Frando Frando changed the title feat(iroh-bytes): refactor downloader queue and add progress reporting feat(iroh-bytes)!: refactor downloader queue and add progress reporting Apr 19, 2024
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
* refactor: downloader

* test: add test for concurrent progress reporting

* chore: cleanup

* refactor: rename some things for clarity

* feat: limit concurrent dials per download

* chore: cleanup

* refactor: respect progress sender IDs to allow reusing the senders in subsequent operations

* chore: fmt & clippy

* cleanup: shared download progress

* feat: handle tags in the downloader

* fix: visibility

* fixup

* fixup

* chore: clippy

* fixup

* refactor: make export a seperate operation from download

* fixup after merge

* refactor: remove id mapping

* refactor: move BlobId up to the progress events

* fix tests

* cleanup

* fix: invariants

* fixes and cleanups

* fix: nodes_have should only add to queued downloads

* chore: fmt

* chore: clippy!

* fix: do not queue downloads with no providers

* feat: set more than one node on download requests (n0-computer#2128)

## Description

Based on n0-computer#2085 

The downloader already has the feature to try multiple nodes for a
single hash or hashseq. With n0-computer#2085 we expose the downloader over the RPC
protocol, by adding a `DownloadMode { Queued, Direct }` enum to the
`BlobDownloadRequest`.
This PR modifies the `BlobDownloadRequest` to include a list of provider
nodes for a hash instead of just a single node. For queued requests that
go to the downloader, the nodes are just passed to the downloader. For
direct requests, the nodes are tried in-order, until one succeeds.

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.

* fix: address review by divma

* fix: remove now obsolete into_send method from ProgressSender

* chore: remove unused code

* fix: use Display for DownloadKind

* fix: first round of changes after review

* docs: downloader next_step

* more review

* more review

* fix: only remove hashes from provider map if they are really not needed

* refactor: move TagSet into util

* refactor: store intents in request info

* cleanup

* refactor: no allocation in next_step and simpler code

* fix: test after merge

* refactor: better method

* fix(iroh-bytes): do not log redundant file delete error (n0-computer#2199)

## Description

fix(iroh-bytes): do not log when a file can not be deleted...

because it is not there. this makes the delete op idempotent, and also
helps in case the file was not there in the first place.

Fixes n0-computer#2142

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.

* fix(iroh-dns-server): fix bug in pkarr name parsing (n0-computer#2200)

## Description

The order of labels when parsing a DNS query as pkarr name was reversed.
This means all queries for second-level subdomains under a pkarr pubkey
failed.

* First commit is the actual fix. 
* Second commit adds a test for various record name and type variations
* Third and fourth commits improve the debug and info logging

## Notes & open questions

Was discovered by @rklaehn in n0-computer#2188 and fixes the wrong behavior
observed there.

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.

---------

Co-authored-by: Franz Heinzmann (Frando) <frando@unbiskant.org>
Co-authored-by: Rüdiger Klaehn <rklaehn@protonmail.com>
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
ppodolsky added a commit to izihawa/iroh that referenced this pull request Apr 20, 2024
@Frando Frando added this pull request to the merge queue Apr 22, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 22, 2024
@dignifiedquire dignifiedquire added this pull request to the merge queue Apr 22, 2024
Merged via the queue into main with commit 93290e3 Apr 22, 2024
20 of 21 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Apr 23, 2024
## Description

based on #2085 

* do not start downloads without any known node
* track missing content hashes in the live sync session
* start downloads after content propagation properly
* reenable the sync_big test, let's see what CI says this time (was
flakey before)

## Notes & open questions


## Change checklist

- [ ] Self-review.
- [ ] Documentation updates if relevant.
- [ ] Tests if relevant.