
fix: skips exporting duplicate blocks as they are encountered #557

Open

wants to merge 8 commits into base: main
Conversation

jtsmedley

Title

Skip exporting duplicate blocks in @helia/car

Description

When calling export on an @helia/car instance, the yielded export contains duplicate blocks. This PR skips yielding duplicate blocks, resulting in more compact CAR files.

Notes & open questions

Added a Set to track which CIDs have already been written to the stream. We might need to add an option to toggle this behavior if any users are relying on duplicate blocks in the output of CAR exports.
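In rough terms, the change does something like the sketch below (names here are illustrative, not the exact @helia/car internals):

// illustrative sketch only: track stringified CIDs in a Set and
// skip any block whose CID has already been yielded
async function * dedupe (blocks) {
  const writtenBlocks = new Set()

  for await (const { cid, bytes } of blocks) {
    if (writtenBlocks.has(cid.toString())) {
      continue // duplicate block, already written to the CAR
    }

    writtenBlocks.add(cid.toString())
    yield { cid, bytes }
  }
}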

Change checklist

  • [Y] I have performed a self-review of my own code
  • [Y] I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • [N] I have added tests that prove my fix is effective or that my feature works

@jtsmedley jtsmedley requested a review from a team as a code owner June 12, 2024 17:38
@jtsmedley jtsmedley changed the title Skips exporting duplicate blocks as they are encountered fix: skips exporting duplicate blocks as they are encountered Jun 12, 2024
@jtsmedley
Author

Fixes #556

@SgtPooki (Member) left a comment

Thanks so much for submitting this. The code changes look fine to me, but CI was failing. I would love to get thoughts from @lidel and @achingbrain on this one before proceeding. Additional thoughts below:

Some CAR specifications

carv1 - https://ipld.io/specs/transport/car/carv1/#duplicate-blocks
carv2 - https://ipld.io/specs/transport/car/carv2/

The specs don't seem to say whether duplicate blocks should or should not be included; the only reason I can think of to include them would be the determinism concerns around https://ipld.io/specs/transport/car/carv1/#determinism.


Maybe CARv1 would require duplicate blocks for the final CAR to match what folks expect (and to rebuild the DAGs a CAR may represent), but with v2 it should be okay to remove them?

I'm not sure that we're clear which spec we're attempting to adhere to with @helia/car.

@jtsmedley
Author

For the sake of consistency: Kubo does not export duplicate blocks, at least in my limited testing.

@lidel
Member

lidel commented Jun 12, 2024

Dropping a drive-by comment with some historical context; hopefully it's useful and not noise :)

What do the specs say?

Last time we looked into this, the default behavior was left unspecified, and trustless gateway responses have an explicit note about this:

Gateway's CAR responses in Kubo and Rainbow (both backed by boxo/gateway) do not include duplicates by default.

There is a dups parameter that can be used to make this explicit, signaling the presence of duplicates in a response via Content-Type header parameters:

  • Kubo/Rainbow use it to signal that the response does NOT have duplicates.
  • AFAIK no client requests CARs with a specific dups value. If there are no duplicates, bandwidth is saved; if they are present, they are a no-op.
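For illustration, the signaling looks roughly like this (shapes per the trustless gateway spec):

Content-Type: application/vnd.ipld.car; version=1; order=dfs; dups=n

and a client explicitly opting in to duplicates would request:

Accept: application/vnd.ipld.car; dups=y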

Sidenote: Ok, but when are duplicates useful?

Duplicates have niche utility when you have a light client with limited memory that can't cache blocks for later use, and has to “consume” blocks by unpacking data on the fly and then discarding them (no blockstore, no cache, not even in-memory).

Such a client would explicitly opt in to receiving duplicates via dups=y.
FWIW, as of today I am not aware of any clients that require this behavior; every client is usually smart enough to issue a follow-up application/vnd.ipld.raw request for any missing blocks, as that works with every gateway.

On this PR

I think the only concern here is that writtenBlocks could grow in size and become a DoS/OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

Having the ability to control this potential memory leak at the library level (disabling deduplication, and with it the memory cost) would be nice, but it can be added once this becomes an actual rather than a theoretical problem. Up to the Helia maintainers.

@achingbrain
Member

> I think the only concern here is that writtenBlocks could grow in size and become a DoS/OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

This can be mitigated by using a filter instead of a set; that way we don't need to store every CID encountered.

See createScalableCuckooFilter in @libp2p/utils

import { createScalableCuckooFilter } from '@libp2p/utils/filters'

const filter = createScalableCuckooFilter(maxItems, errorRate)

filter.has(bytes) // returns boolean
filter.add(bytes) // `.has` will probably now return `true`

The only thing to consider is the reliability of the filter. False positives are possible but false negatives are not: if .has returns false, the item is definitely not present; if it returns true, the probability that the item is actually present is governed by the errorRate.

The implementation should be structured so that there may be the (very) occasional duplicate block but there should never be a missed block.
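Sketched out (illustrative only, not a prescribed implementation):

// same dedupe loop as before, but memory-bounded via the filter.
// note that a false positive from `.has` would skip a block that was
// never written, so the errorRate needs to be chosen conservatively
// for this use case
async function * dedupe (blocks, filter) {
  for await (const { cid, bytes } of blocks) {
    if (filter.has(cid.bytes)) {
      continue // almost certainly a duplicate
    }

    filter.add(cid.bytes)
    yield { cid, bytes }
  }
}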

@achingbrain
Member

Good to know about the Kubo behaviour too; we should align with that, though an allowDuplicateBlocks: boolean option would be good to have.

@jtsmedley
Author

I need the allowDuplicateBlocks: boolean option anyway, so I will go ahead and add it to this PR to round it out.

@jtsmedley jtsmedley requested a review from SgtPooki June 14, 2024 18:35
@jtsmedley
Author

jtsmedley commented Jun 17, 2024

In this PR I am sticking with a Set, as I need the resulting CAR files to contain no duplicates, and I need that behavior to be consistent.

@achingbrain (Member) left a comment

This needs to use a filter instead of a set, as the set will cause the process to OOM for very large or streaming CAR files.

If making the filter reliable enough for this use case is a concern, you may wish to expose a way to configure it, or otherwise allow passing a preconfigured one in, since its size will largely be dictated by the size of the DAGs in the CAR being exported.

Can you please also add some tests that ensure there are no regressions around the allowDuplicateBlocks option?

@achingbrain
Member

If you merge main, the current CI error will go away.

@jtsmedley
Author

Switched to passing in an external filter, so the consumer can choose how blocks are filtered without the library prescribing a specific implementation.

blocksFilter: Filter (from @libp2p/utils/filters)
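For reference, consuming this might look roughly like the following (assuming the blocksFilter option as named above and the existing CarWriter-based export; exact names and signatures may differ):

import { car } from '@helia/car'
import { CarWriter } from '@ipld/car'
import { createScalableCuckooFilter } from '@libp2p/utils/filters'

const c = car(helia) // `helia` is an existing Helia node

const { writer, out } = await CarWriter.create([rootCid])

// the consumer picks and sizes the filter for the DAG being exported
const done = c.export(rootCid, writer, {
  blocksFilter: createScalableCuckooFilter(100_000, 0.001)
})

for await (const chunk of out) {
  // `out` yields the deduplicated CAR bytes; write them wherever needed
}

await done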

@jtsmedley
Author

Are any other changes needed?
