Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP-0445: Option to Skip Raw Blocks in Gateway Responses #445

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Jorropo
Copy link
Contributor

@Jorropo Jorropo commented Oct 11, 2023

For #444

@Jorropo Jorropo requested a review from hacdias October 11, 2023 04:19
@Jorropo Jorropo changed the title ipip(1337): add skip-leaves IPIP ipip(0445): add skip-leaves IPIP Oct 11, 2023
src/ipips/ipip-0445.md Outdated Show resolved Hide resolved
src/ipips/ipip-0445.md Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Oct 11, 2023

I can see the rationale for this; although it's quite similar to dag-scope=entity, so I wouldn't mind a treatment in the documentation why entity can't properly cover the desired use case, and if it can't, why dag-scope=skip-leaves (or similar) wouldn't be a better approach?

I think we should have a conversation about evolution of trustless now that it's deployed and being actively used in a number of scopes. We have multiple implementations of trustless now, and this adds an extension that is not necessarily in the scope or interests of classes of users or servers, or even if it is, implementation may be delayed or deferred for various reasons. It might be helpful to have some method of feature sniffing built into the spec, perhaps with an OPTIONS pattern so that clients that know they are using newer, less widely deployed, features of trustless, can first sniff before getting a rejection. Alternatively, we could just do a reject with 415 and send a pre-defined header, such as X-Supported-Content-Types, that lists variants of Accept that won't error. If that were in the spec it would be easy to implement; I'd probably take existing Accept parsing that's lax and make it strict, because currently the 2 trustless server implementations I'm responsible for would ignore this new parameter and give you the whole thing; I wouldn't mind tightening that up but we need a better signalling mechanism.

@Jorropo
Copy link
Contributor Author

Jorropo commented Oct 11, 2023

I can see the rationale for this; although it's quite similar to dag-scope=entity, so I wouldn't mind a treatment in the documentation why entity can't properly cover the desired use case, and if it can't, why dag-scope=skip-leaves (or similar) wouldn't be a better approach?

Because you need to be able to express a request that is both dag-scope=entity&entity-bytes=42:1337 and has skip-leaves set. (/ entity-bytes implies dag-scope=entity, so can't do dag-scope=skip-leaves&entity-bytes=42:1337):

For example let's say I have a .webm file stored in deserialized form on a non IPFS HTTP server and I am implementing a client that downloads the proofs from a gateway and the leaves from the non IPFS server.
Even in this situation features like seeking and ReadAt needs to work if the user wants to seek in the video.

This is great feedback I'll add that to the docs later thx.


I agree on multiplicity of options that could become an issue and we need a way to do signaling. I think #425 is a good place to have this discussion.
Imo this IPIP is fine to do without #425 because it's generally useful, the implementation difficulty range from trivial to easy and we are still at a point where the number of implementations is low.
It is also not very hard for clients to workaround a server that does not implement this feature by doing block by block requests. In my webseed usecase this is less worst than it sounds since the bulk of the traffic wont go through the gateway and thus wont be done block by block.

Comment on lines 91 to 94
An alternative approach would be to request blocks individually.
However it adds extra round trips and more per HTTP request overhead
and thus is undesireable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
An alternative approach would be to request blocks individually.
However it adds extra round trips and more per HTTP request overhead
and thus is undesireable.
1. An alternative approach would be to request blocks individually.
However it adds extra round trips and more per HTTP request overhead
and thus is undesireable.
2. An alternative implementation may be to either specify codecs to
include, or codecs to exclude. e.g. `exclude-codec=0x55` or
`include-codec=0x70` to achieve similar results. There may be
broader utility in such an approach, but the use-cases beyond that
proposed above for `skip-leaves` are not clear.

🤷 this is an option, is it better? not sure, I'm not even sure it'd be very different on the implementation level and may afford some other possibilities. It also clears up the "leaves" terminology, which is very unixfs+CIDv1 specific.

Copy link
Contributor Author

@Jorropo Jorropo Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intresting idea. I don't think the current API is unixfs specific. I hope raw blocks are children-less bytes in all formats that use them.
I guess you could have an ADL that parse the bytes node out of the file using cbor but then you are using IPLD wrong IMO, should really be a cbor cid then.

exclude-codec and include-codec are interesting however they complexify the server implementation.
I have an idea on how we could combine a small index mapping offsets into the unixfs file into offset into a car as well as offsets from childrens to offset into the car and a CARv1 DFS body with offloading and zerocopy to make a 400Gbps gateway with extremely low CPU usage implementation and having user input surface complexify the index this situation because I don't just need to map ranges in the unixfs file to ranges in the car, I also need to mark and compute prefix sums for all the codecs in the car.
So given this might anoy me one day and I don't need it, I wouldn't do it unless someone else wants this feature right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also one weird thing, if I have a dag that is dag-json → dag-cbor → raw and I exclude the dag-cbor codec, is the gateway expected to send me 1 or 2 blocks ? Because if the answer is 2 blocks then the gateway has to process every skipped blocks which means the gateway does work which does not reflect on the client. And if the answer is 1 then it does not follow least surprise principle.

Copy link
Member

@lidel lidel Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bit unprocessed thought) If we are framing this around DAG traversal, exclude-codec=0x55 could be a sensible attempt at future-proofing, but we can always add it later.

Rationale:

The raw codec provides an interoperable way to act as traversal stop: ensure the child block that was referred will not be used for any further traversal / deserialization, but returned as an opaque blob. Allows for creating non-UnixFS dags that are compatible with existing IPFS ecosystem.

If we add generic exclude-codec and its semantic meaning also control when the traversal stops, then there could be a real world utility for excluding DAG branches based on their root codec, e.g. when they can't be retrieved with usual transports and require special handling. Enabling partial retrieval of data that is partially encoded in proprietary codecs sounds useful, but now sure how real world need there is for this right now. People building modern things seem to be happy with things put on top of UnixFS or DAG-CBOR.

Only thing that comes to mind is Filecoin which iirc can't be pinned or fetched recursively with standard IPFS tools due to the use of unsupported codecs (cc @hsanjuan @aschmahmann if i misremembered this, I think we've hit that problem a while ago).

Is there any other use case that could benefit from open-ended exclude-codec?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, silence suggests we don't have any immediate need for exclude-codec.
Let's keep this IPIP limited to skip-raw-blocks, open-ended filtering can be proposed in future IPIP.

@hannahhoward
Copy link
Contributor

hannahhoward commented Oct 12, 2023

FWIW, this webseed like functionality would be really useful for the Saturn case.

If we can let the browser make a request for say an image in flat format, but then verify it with this sort of request, that would be a big win for us.

I do agree about adding new parameters with expectation to serve everywhere being a heavy lift. Love to get DAGHouse folks in here early -- cc: @alanshaw

I would push for something less dag-structure specific -- I'd name the parameter something like "proof-only" -- cause that's what you're asking for, the proof part of the dag, without the data. That makes it more applicable to both Blake3 or eventual non-UnixFS data.

@hannahhoward
Copy link
Contributor

hannahhoward commented Oct 12, 2023

All that said, I'm fundamentally a yes on this IPIP, with the suggestion to change the parameter name to "proof-only" instead of "skip-leaves"

@Jorropo
Copy link
Contributor Author

Jorropo commented Oct 12, 2023

All that said, I'm fundamentally a yes on this IPIP, with the suggestion to change the parameter name to "proof-only" instead of "skip-leaves"

Many formats store data in non raw leaves, so this would be making it really unixfs but with --raw-leaves enabled specific given that would be wrong for a dag-cbor HAMT for example.

What do people think about skip-raw instead ? It's self descriptive and correct.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting this IPIP @Jorropo
Quick initial feedback inline.

Comment on lines 316 to 322
:::note Notes for implementers

A request which is rooted at a `raw` block and has `skip-leaves=y` does not
make sense and SHOULD NOT be sent by clients, it is fair for servers to
error in this situation.

:::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion is to reduce guesswork and probing a client has to do, which means this behavior should not be left for implementers, but a clear MUST / MUST NOT in the spec, including the HTTP error code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something 4XX something.
My lazy side tell me that I don't want to do this because then I save on code and I can return an empty car in boxo gateway instead of implementing one more edge case.
I can do it if it feels important to someone but given sending a 200 and returning an empty car is fine outcome I'm not sure we need to do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not OK with empty CARs for the error case, and I don't particularly like that boxo/gateway is already doing this for other error cases and would rather not embed it as a standard practice here. 415 please.

Copy link
Contributor Author

@Jorropo Jorropo Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's different from the other cases in boxo/gateway. If you request a raw block with skip-leaves, an empty car is a correct answer. It is the complete dag minus all the raw blocks, which happen to be an empty set but that is the client's problem and there is nothing the gateway can do about it. This is not a transient error either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this section in f96a92a and made HTTP 400 (Bad Request) a MUST behavior that we will also enforce via conformance test.

(error is not 415 as we moved skip-raw-blocks from content type to URL query params – see "Alternatives" / "Why not CAR content type parameter ?" section in IPIP).

src/ipips/ipip-0445.md Outdated Show resolved Hide resolved
src/http-gateways/trustless-gateway.md Outdated Show resolved Hide resolved
@Jorropo Jorropo changed the title ipip(0445): add skip-leaves IPIP ipip(0445): add skip-raw-blocks Oct 13, 2023
@lidel lidel changed the title ipip(0445): add skip-raw-blocks IPIP-0445: Option to Skip Raw Blocks in Gateway Responses Oct 25, 2023
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed some editorial changes, namely rename from skip-leaves content type param to skip-raw-blocks URL query param. (rationale explained in "Alternatives" → "Why not CAR content type parameter" section)

Ratification requires reference implementation and conformance tests (details inline) but it is good enough to mention during IPFS Implementers call this Thursday for final reviews from wider ecosystem.

Comment on lines +150 to +172
:::issue

TODO: update below section with CIDs or CARs from conformance tests

Scenarios we should check:
- [ ] request for `/ipfs/cid` where CID has `raw` codec MUST return HTTP 400 (Bad Request)
- [ ] reuse existing UnixFS DAG that has raw-leaves, request it with
`skip-raw-blocks=n`, confirm the response includes expected raw leaves' CIDs
- [ ] create a new CAR fixture that only have non-raw blocks. Request it with
`skip-raw-blocks=y`, confirm the response includes expected CIDs and does not
include raw blocks referenced by parents.
- important part is creating CAR fixture by hand, and ensure the raw blocks are
NEVER announced anywhere (generate fixture with random data, add to ipfs
with raw-leaves option, then export DAG without `raw` blocks (use go-car's
[`filter`](https://github.com/ipld/go-car/tree/master/cmd/car#readme) or
similar)
- Why? This goes extra mile, but ensures every conformant gateway
implementation is not doing useless work of fetching raw blocks which are
not required for fulfilling `skip-raw-blocks=y` requests). We did
similar thing for `entity-bytes` and it was the only way we could show
bugs in Saturn project's cache implementation at the time.

:::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jorropo this is the minimal set of tests I've identified for this IPIP, lmk if you think it is sufficient, or if we need more.

The way we did this in the past, was to update test fixtures section at the very end:

  1. create reference implementation in Boxo & Kubo PRs
  2. then creating a gateway-conformance PR with relevant fixtures that pass against branches from (2)
  3. after reference implementation and tests are merged in respective repos, we update IPIP with final CID or link to a CAR fixture (examples: 428, 412)

backward compatibility with existing clients and systems that are unaware
of this new flag.

### Alternatives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ updated this section with paths not taken, lmk if anything requires more elaboration.

src/http-gateways/trustless-gateway.md Show resolved Hide resolved
Jorropo added a commit to ipfs/boxo that referenced this pull request Nov 2, 2023
@BigLep BigLep mentioned this pull request Nov 9, 2023
11 tasks
Comment on lines +72 to +78
4. **Non-IPFS HTTP Mirrors Become Useful:** Legacy data that is already exposed
over HTTP in deserialized form can now act as sources for specific block
byte ranges, without having to support any IPFS specific APIs. Plain HTTP
Range Requests can be used for fetching remaining raw block data, and the
metadata read via `skip-raw-blocks=y` is enough for a client to verify the
remaining raw block byte ranges fetched from non-IPFS system match expected
CIDs.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to hint at that in 1 Verification Flexibility, this text goes in much more detail and I think should be merged with the first entry.

@kernelogic
Copy link

kernelogic commented Nov 18, 2023

Would like to see this implemented as well (in lassie). This can be useful to quickly inspect (and validate, for Fil+) the files in a sector. Without downloading the whole 32GB sector.

The current dag-scope=entity only covers the first level, currently will have to recursively drill down to get more information - involving a lot more requests.

@hannahhoward
Copy link
Contributor

One other thing to consider, though probably not in this IPIP -- option to not traverse files while recursively following a UnixFS Directory tree.

@rvagg
Copy link
Member

rvagg commented Dec 12, 2023

well, that could be included in here because it's quite similar - skip=none|files|leaves, being able to ls -alR and only get the directory entries would be pretty neat

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏃 In Progress
Status: 🏃‍♀️ In Progress
Development

Successfully merging this pull request may close these issues.

5 participants