
fix: skips exporting duplicate blocks as they are encountered #557

Open

wants to merge 8 commits into base: main
Conversation

jtsmedley

Title

Skip exporting duplicate blocks in @helia/car

Description

When calling export on an @helia/car instance, the yielded export contains duplicate blocks. This PR skips yielding duplicate blocks, resulting in more compact CAR files.

Notes & open questions

Added a Set to track which CIDs have already been written to the stream. We might need to add an option to toggle this behavior if any users are relying on duplicate blocks in the output of CAR exports.
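In rough terms, the change does something like the sketch below (names here are illustrative, not the exact @helia/car internals):

// illustrative sketch only: track stringified CIDs in a Set and
// skip any block whose CID has already been yielded
async function * dedupe (blocks) {
  const writtenBlocks = new Set()

  for await (const { cid, bytes } of blocks) {
    if (writtenBlocks.has(cid.toString())) {
      continue // duplicate block, already written to the CAR
    }

    writtenBlocks.add(cid.toString())
    yield { cid, bytes }
  }
}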

Change checklist

  • [Y] I have performed a self-review of my own code
  • [Y] I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • [N] I have added tests that prove my fix is effective or that my feature works

@jtsmedley jtsmedley requested a review from a team as a code owner June 12, 2024 17:38
@jtsmedley jtsmedley changed the title Skips exporting duplicate blocks as they are encountered fix: skips exporting duplicate blocks as they are encountered Jun 12, 2024
@jtsmedley
Author

Fixes #556

@SgtPooki (Member) left a comment

Thanks so much for submitting this. The code changes look fine to me, but CI was failing. I would love to get thoughts from @lidel and @achingbrain on this one before proceeding. Additional thoughts below:

Some CAR specifications

carv1 - https://ipld.io/specs/transport/car/carv1/#duplicate-blocks
carv2 - https://ipld.io/specs/transport/car/carv2/

The specs don't seem to say whether duplicate blocks should or should not be included; the only reason I can think of to include them would be the determinism concerns around https://ipld.io/specs/transport/car/carv1/#determinism.


Maybe CARv1 would require duplicate blocks for the final CAR to match what folks expect (and to rebuild the DAGs a CAR may represent), but with v2 it should be okay to remove them?

I'm not sure that we're clear which spec we're attempting to adhere to with @helia/car.

@jtsmedley
Author

For the sake of consistency: Kubo does not export duplicate blocks, at least in my limited testing.

@lidel
Member

lidel commented Jun 12, 2024

Dropping a drive-by comment with some historical context; hopefully it's useful and not noise :)

What do the specs say?

Last time we looked into this, the default behavior was left unspecified, and trustless gateway responses have an explicit note about this:

Gateway's CAR responses in Kubo and Rainbow (both backed by boxo/gateway) do not include duplicates by default.

There is a dups parameter that can be used to make this explicit, signaling the presence of duplicates in a response via Content-Type header parameters:

  • Kubo/Rainbow use it to signal that the response does NOT have duplicates.
  • AFAIK no client requests CARs with a specific dups value. If there are no duplicates, bandwidth is saved; if they are present, they are a no-op.
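For illustration, the signaling looks roughly like this (shapes per the trustless gateway spec):

Content-Type: application/vnd.ipld.car; version=1; order=dfs; dups=n

and a client explicitly opting in to duplicates would request:

Accept: application/vnd.ipld.car; dups=y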

Sidenote: Ok, but when are duplicates useful?

Duplicates have niche utility when you have a light client with limited memory that can't cache blocks for later use, and has to “consume” blocks by unpacking data on the fly and then discarding them (no blockstore, no cache, not even in-memory).

Such a client would explicitly opt in to receiving duplicates via dups=y.
FWIW, as of today I am not aware of any clients that require this behavior; every client is usually smart enough to issue a follow-up application/vnd.ipld.raw request for any missing blocks, as that works with every gateway.

On this PR

I think the only concern here is that writtenBlocks could grow in size and become a DoS/OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

Having the ability to control this potential memory leak at the library level (disabling deduplication, and with it the memory cost) would be nice, but it can be added once this becomes an actual rather than a theoretical problem. Up to the Helia maintainers.

@achingbrain
Member

> I think the only concern here is that writtenBlocks could grow in size and become a DoS/OOM vector if a user tricks the system into exporting a big DAG with many duplicated CIDs.

This can be mitigated by using a filter instead of a set; that way we don't need to store every CID encountered.

See createScalableCuckooFilter in @libp2p/utils

import { createScalableCuckooFilter } from '@libp2p/utils/filters'

const filter = createScalableCuckooFilter(maxItems, errorRate)

filter.has(bytes) // returns boolean
filter.add(bytes) // `.has` will probably now return `true`

The only thing to consider is the reliability of the filter. False positives are possible but false negatives are not: if .has returns false, the item is definitely not present; if it returns true, the probability that the item is actually present is governed by the errorRate.

The implementation should be structured so that there may be the (very) occasional duplicate block but there should never be a missed block.
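Sketched out (illustrative only, not a prescribed implementation):

// same dedupe loop as before, but memory-bounded via the filter.
// note that a false positive from `.has` would skip a block that was
// never written, so the errorRate needs to be chosen conservatively
// for this use case
async function * dedupe (blocks, filter) {
  for await (const { cid, bytes } of blocks) {
    if (filter.has(cid.bytes)) {
      continue // almost certainly a duplicate
    }

    filter.add(cid.bytes)
    yield { cid, bytes }
  }
}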

@achingbrain
Member

Good to know about the Kubo behaviour too; we should align with that, though an allowDuplicateBlocks: boolean option would be good to have.

@jtsmedley
Author

I need the allowDuplicateBlocks: boolean option anyway, so I will go ahead and add it to this PR to round it out.

@jtsmedley jtsmedley requested a review from SgtPooki June 14, 2024 18:35
@jtsmedley
Author

jtsmedley commented Jun 17, 2024

In this PR I am sticking with a Set, as I need the resulting CAR files to contain no duplicates, and I need that behavior to be consistent.

@achingbrain (Member) left a comment

This needs to use a filter instead of a set, as the set will cause the process to OOM for very large or streaming CAR files.

If making the filter reliable enough for this use case is a concern, you may wish to expose a way to configure it, or otherwise allow passing a preconfigured one in, since its size will largely be dictated by the size of the DAGs in the CAR being exported.

Can you please also add some tests that ensure there are no regressions around the allowDuplicateBlocks option?

@achingbrain
Member

If you merge main, the current CI error will go away.

@jtsmedley
Author

Switched to passing in an external filter, so the consumer can choose how blocks are filtered without the library prescribing a specific implementation.

blocksFilter: Filter (from @libp2p/utils/filters)
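For reference, consuming this might look roughly like the following (assuming the blocksFilter option as named above and the existing CarWriter-based export; exact names and signatures may differ):

import { car } from '@helia/car'
import { CarWriter } from '@ipld/car'
import { createScalableCuckooFilter } from '@libp2p/utils/filters'

const c = car(helia) // `helia` is an existing Helia node

const { writer, out } = await CarWriter.create([rootCid])

// the consumer picks and sizes the filter for the DAG being exported
const done = c.export(rootCid, writer, {
  blocksFilter: createScalableCuckooFilter(100_000, 0.001)
})

for await (const chunk of out) {
  // `out` yields the deduplicated CAR bytes; write them wherever needed
}

await done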

@jtsmedley
Author

Are any other changes needed?
