Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(iroh-bytes): Batch blob api #2339

Closed
wants to merge 23 commits into from
Closed

feat(iroh-bytes): Batch blob api #2339

wants to merge 23 commits into from

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Jun 3, 2024

Description

This adds a new API for creating blobs.

You create a batch. Within a batch you got all the usual operations to add stuff, like add_bytes, add_stream, add_from_path etc. Notable differences to the existing api:

  • All operations work on individual blobs, so no way to add entire subdirectories with add_from_path
  • All operations return a TempTag instead of having a set tag option

The way to use the API is to just perform a complex operation within a batch, and then at the end assign a (non temporary) tag to the root(s) of the created data before dropping the batch.

It is possible to scan a directory and create a collection purely on the client side, so the code to traverse a directory can be removed from the node.

To allow the workflow described above, the tags client has been extended to allow manually setting a tag.

Ideally this API would entirely replace the current blobs API, so all read ops would always happen within the context of a batch.

Breaking Changes

At this point mostly adding stuff, but changing the RPC api for setting tags. I might decide to remove the entire non batch mutation API in this PR.

How it works is to leave a streaming RPC call open for each batch, then do operations in the context of an unique identifier for this RPC call.

Notes & open questions

Note: if things work out every add operation refers to a single blob, and the aggregation of many blobs can be driven from the client. That means that a lot of the complexity of the progress events like ids etc. can be removed from the rpc. This still needs to exist, but can be confined to the client.

Todo

  • Add back fine grained progress
    fine grained progress is not needed for all ops, but definitely for add_file and add_dir. Possibly for add_bytes and add_reader. Not sure if it is OK to have it in all cases despite it typically not being used.
  • Purge all tag setting stuff from the blobs API and the downloader.

I would propose that we merge this initially as an addition, and do the stripdown of the other APIs in a subsequent PR. Also, if this is an addition we can do the fine grained progress for add_dir in a subsequent PR as well.

Change checklist

  • Self-review.
  • Documentation updates if relevant.
  • Tests if relevant.
  • All breaking changes documented.

I found that in most places it is not needed, and if somebody needs them
to be clone they can always wrap them in an Arc.

Also, this complicates extending the concept of temp tags via the rpc
boundary.

With this change if will be possible to use the same TempTag type both in
the low level blobs store API and in the higher level blobs API of iroh.
@rklaehn
Copy link
Contributor Author

rklaehn commented Jun 4, 2024

Here is the preliminary API for batches:

pub async fn add_bytes(&self, bytes: impl Into<Bytes>, format: BlobFormat) -> Result<TempTag> {
pub async fn add_file(&self, path: PathBuf, import_mode: ImportMode, format: BlobFormat) -> Result<(TempTag, u64)> {
pub async fn add_dir(&self, root: PathBuf, import_mode: ImportMode, wrap: WrapOption) -> Result<TempTag> {
pub async fn add_collection(&self, collection: Collection) -> Result<TempTag> {
pub async fn add_stream(&self, mut input: impl Stream<Item = io::Result<Bytes>> + Send + Unpin + 'static, format: BlobFormat) -> Result<TempTag> {
pub async fn add_blob_seq(&self, iter: impl Iterator<Item = Bytes>) -> Result<TempTag> {
pub async fn temp_tag(&self, content: HashAndFormat) -> Result<TempTag> {

Basically very similar to the normal blobs api, but there are no options to create tags. Instead every fn returns a temp tag for the thing that has been created, that the user can then later assign to a permanent tag (or not).

The tags API has been extended to allow creating a tag given a hash and format.

Many of these functions are convenience functions. Probably most notably, add_dir is now traversing the file system on the client side and doing multiple add_file calls.

@dignifiedquire
Copy link
Contributor

should there be the equivalent delete versions as well`

@rklaehn
Copy link
Contributor Author

rklaehn commented Jun 4, 2024

should there be the equivalent delete versions as well`

WDYM? You delete stuff by ensuring that it is no longer tagged in some way, then GC will take care of it. There is blob delete, but that is really a low level function that you should rarely use directly.

delete_blob will just do it's thing no matter what temp tags there are, so it can live in the blobs API not in the batch API.

iroh/src/node.rs Outdated Show resolved Hide resolved
@rklaehn rklaehn marked this pull request as ready for review June 5, 2024 11:07
@@ -700,7 +703,7 @@ pub enum ImportProgress {
/// does not make any sense. E.g. an in memory implementation will always have
/// to copy the file into memory. Also, a disk based implementation might choose
/// to copy small files even if the mode is `Reference`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)]
pub enum ImportMode {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reexport this from the blobs api since I don't want to newtype it...

@@ -177,6 +208,41 @@ impl ServerStreamingMsg<RpcService> for BlobValidateRequest {
type Response = ValidateProgress;
}

/// Get the status of a blob
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is mostly unrelated to the PR. Just a way to check if you have a thing without downloading it all via the rpc.

move around the BlobStatus, use async lock
github-merge-queue bot pushed a commit that referenced this pull request Jun 6, 2024
…e client side (#2349)

## Description

A collection is just one particular way to use a hashseq, so it feels a
bit weird to have it baked in to the iroh node. With this we can move
some of it into the client.

This is a part of #2272 . We can
make more similar changes once we have the batch API
#2339 .

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

Note: I closed #2272 because half of the changes in that PR are here,
the other half will be part of the batch PR, and moving collections into
iroh I am not convinced of yet...

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/node/rpc.rs
#	iroh/src/rpc_protocol.rs
@rklaehn rklaehn force-pushed the batch-blob-api-2 branch 2 times, most recently from b13a1e9 to b8b0d80 Compare June 6, 2024 16:15
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/client/tags.rs
#	iroh/src/node/rpc.rs
#	iroh/src/rpc_protocol.rs
@dignifiedquire dignifiedquire added this to the v0.19.0 milestone Jun 13, 2024
@dignifiedquire dignifiedquire modified the milestones: v0.19.0, v0.20.0 Jun 24, 2024
# Conflicts:
#	iroh/src/client/blobs.rs
#	iroh/src/client/tags.rs
#	iroh/src/node.rs
#	iroh/src/node/builder.rs
#	iroh/src/node/rpc.rs
@dignifiedquire
Copy link
Contributor

@rklaehn what's the state of this?

@rklaehn
Copy link
Contributor Author

rklaehn commented Jun 28, 2024

@rklaehn what's the state of this?

Just merged with main. I want to do another self-review, but currently trying to keep all the stuff up to date with main so it does not bitrot...

iroh-blobs/src/provider.rs Show resolved Hide resolved
iroh-blobs/src/provider.rs Show resolved Hide resolved
/// Keep the item alive until the end of the process
pub fn leak(mut self) {
// set the liveness tracker to None, so that the refcount is not decreased
// during drop. This means that the refcount will never reach 0 and the
// item will not be gced until the end of the process.
// item will not be gced until the end of the process, unless you manually
// invoke on_drop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reads like it should be moved from a comment to documentation

quic_rpc::RpcClient<RpcService, quic_rpc::transport::boxed::Connection<RpcService>>;
pub(crate) type RpcConnection = quic_rpc::transport::boxed::Connection<RpcService>;

/// Iroh rpc client - boxed so that we can have a concrete type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not seeing the boxed here from just looking at the type, maybe the docs could be more clear in that regard?

iroh/src/client/blobs.rs Show resolved Hide resolved

/// Set a tag to a value, overwriting any existing value.
///
/// This is a convenience wrapper around `set_opt`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to suggest [set_opt] to ensure docs always refer an existing method but I can't find it. I assume this is meant to be

Suggested change
/// This is a convenience wrapper around `set_opt`.
/// This is a convenience wrapper around [`set_with_opt`].

/// Delete a tag.
///
/// This is a convenience wrapper around `set_opt`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

Comment on lines +137 to +138
for (content, count) in scope.tags {
if let Some(tag_drop) = tag_drop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inverting this produces the same logic but a single conditional check

}

/// Remove an entire batch.
fn remove(&mut self, batch: BatchId, tag_drop: Option<&dyn TagDrop>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leak is always called on store. What happens if remove and remove_one are called without a TagDrop?

@dignifiedquire dignifiedquire modified the milestones: v0.20.0, v0.21.0 Jul 8, 2024
@dignifiedquire
Copy link
Contributor

closing in favor of #2545

@dignifiedquire dignifiedquire removed this from the v0.21.0 milestone Jul 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 15, 2024
## Description

This is the third attempt to add a batch API for adding blobs. Previous
one was #2339

The basic idea is the following: all changes to the store happen in the
context of a _batch_. All write operations within a batch produce temp
tags. These temp tags are scoped to the batch and keep the data alive as
long as the batch exists.

At some point, the API user has to upgrade one or more temp tags to
permanent tags.

All non-batch operations would long term be implemented in terms of
batch operations.

In a second step, the following rpc calls would be replaced by their
batch equivalent.
- AddStream
- AddPath
- CreateCollection

The third one is very nice, since it means that the notion of a
collection (as in a special kind of hashseq) no longer has to even exist
in the node code.

## Breaking Changes

- iroh::client::blobs::BlobStatus has a new case NotFound
- iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize
instead of a u64

All other public changes are adding of new APIs.

## Notes & open questions

Note: in the previous version I had an optimisation to avoid storing
TempTags in the case where there are multiple TempTags with the same
hash. I removed this to keep things simple. We can add it back later.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>
rklaehn added a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
…e client side (#2349)

## Description

A collection is just one particular way to use a hashseq, so it feels a
bit weird to have it baked in to the iroh node. With this we can move
some of it into the client.

This is a part of n0-computer/iroh#2272 . We can
make more similar changes once we have the batch API
n0-computer/iroh#2339 .

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

Note: I closed #2272 because half of the changes in that PR are here,
the other half will be part of the batch PR, and moving collections into
iroh I am not convinced of yet...

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
rklaehn added a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
## Description

This is the third attempt to add a batch API for adding blobs. Previous
one was n0-computer/iroh#2339

The basic idea is the following: all changes to the store happen in the
context of a _batch_. All write operations within a batch produce temp
tags. These temp tags are scoped to the batch and keep the data alive as
long as the batch exists.

At some point, the API user has to upgrade one or more temp tags to
permanent tags.

All non-batch operations would long term be implemented in terms of
batch operations.

In a second step, the following rpc calls would be replaced by their
batch equivalent.
- AddStream
- AddPath
- CreateCollection

The third one is very nice, since it means that the notion of a
collection (as in a special kind of hashseq) no longer has to even exist
in the node code.

## Breaking Changes

- iroh::client::blobs::BlobStatus has a new case NotFound
- iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize
instead of a u64

All other public changes are adding of new APIs.

## Notes & open questions

Note: in the previous version I had an optimisation to avoid storing
TempTags in the case where there are multiple TempTags with the same
hash. I removed this to keep things simple. We can add it back later.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>
rklaehn added a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
…e client side (#2349)

## Description

A collection is just one particular way to use a hashseq, so it feels a
bit weird to have it baked in to the iroh node. With this we can move
some of it into the client.

This is a part of n0-computer/iroh#2272 . We can
make more similar changes once we have the batch API
n0-computer/iroh#2339 .

## Breaking Changes

<!-- Optional, if there are any breaking changes document them,
including how to migrate older code. -->

## Notes & open questions

Note: I closed #2272 because half of the changes in that PR are here,
the other half will be part of the batch PR, and moving collections into
iroh I am not convinced of yet...

## Change checklist

- [x] Self-review.
- [x] Documentation updates if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
rklaehn added a commit to n0-computer/iroh-blobs that referenced this pull request Oct 22, 2024
## Description

This is the third attempt to add a batch API for adding blobs. Previous
one was n0-computer/iroh#2339

The basic idea is the following: all changes to the store happen in the
context of a _batch_. All write operations within a batch produce temp
tags. These temp tags are scoped to the batch and keep the data alive as
long as the batch exists.

At some point, the API user has to upgrade one or more temp tags to
permanent tags.

All non-batch operations would long term be implemented in terms of
batch operations.

In a second step, the following rpc calls would be replaced by their
batch equivalent.
- AddStream
- AddPath
- CreateCollection

The third one is very nice, since it means that the notion of a
collection (as in a special kind of hashseq) no longer has to even exist
in the node code.

## Breaking Changes

- iroh::client::blobs::BlobStatus has a new case NotFound
- iroh::client::blobs::BlobStatus::Partial: size is now a BaoBlobSize
instead of a u64

All other public changes are adding of new APIs.

## Notes & open questions

Note: in the previous version I had an optimisation to avoid storing
TempTags in the case where there are multiple TempTags with the same
hash. I removed this to keep things simple. We can add it back later.

## Change checklist

- [ ] Self-review.
- [ ] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [ ] Tests if relevant.
- [ ] All breaking changes documented.

---------

Co-authored-by: Philipp Krüger <philipp.krueger1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants