This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Ensure that Rabin fingerprinting works with large datasets #134

Closed
flyingzumwalt opened this issue Jan 29, 2017 · 5 comments

Comments

@flyingzumwalt
Contributor

From https://botbot.me/freenode/ipfs/2017-01-29/?msg=80105342&page=1

[ani 10:03 pm] Rabin seems to be failing at larger jobs. Will try with more memory and CPU but it's stalling around 5GB/49

@whyrusleeping Could you re-add the test dataset from #126 using Rabin fingerprinting to make sure it doesn't choke?

@Kubuxu
Contributor

Kubuxu commented Jan 29, 2017

From what I understand, Rabin sharding isn't that cheap, and that is the reason it chokes on large jobs.

Before we invest time in it, I would recommend checking whether it gives any benefit over normal chunking in several areas: within a file, across files (a directory), and across datasets.

From what I understand, it might also prevent some other deduplication from happening.
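
The comparison suggested above could be measured with something like the following. This is only a minimal sketch, not code from go-ipfs: `fixed_chunks` and `dedup_savings` are hypothetical helper names, SHA-256 stands in for whatever hash the real block store uses, and the inputs are assumed to be small enough to hold in memory.

```python
import hashlib
from typing import Callable, Iterable, List

# A chunker is any function that splits a byte string into chunks (hypothetical type alias).
Chunker = Callable[[bytes], List[bytes]]


def fixed_chunks(data: bytes, size: int = 256 * 1024) -> List[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedup_savings(blobs: Iterable[bytes], chunker: Chunker = fixed_chunks) -> float:
    """Return the fraction of bytes that would not need to be stored again
    because an identical chunk was already seen (0.0 means no savings)."""
    seen = set()
    total = 0
    stored = 0
    for blob in blobs:
        for chunk in chunker(blob):
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return 0.0 if total == 0 else 1 - stored / total
```

Passing a single file's bytes measures in-file savings, passing every file of a directory measures cross-file savings, and passing files from two datasets measures cross-dataset savings. Running it once per chunker on the same inputs gives the numbers to compare.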

@flyingzumwalt
Contributor Author

flyingzumwalt commented Jan 29, 2017

I appreciate your caution. In theory, Rabin fingerprinting should be beneficial for exactly this case, where many people have downloaded the same datasets from the same sources but might have slight variations in their copies. Our default chunking algorithm (fixed-size 256 KB chunks) prevents them from even trying to deduplicate those files, since a small insertion or deletion shifts every later chunk boundary. People like @20zinnm are motivated to test how the code performs for this use case. I want to make sure that the code is ready for them to proceed.

Keep in mind:

  • We can't do the tests you've suggested if the chunking function fails to even process the input files.
  • Though Rabin fingerprinting is not the default chunking algorithm, it is an officially supported one. It should work. If it's not working, we should at least record and diagnose the bug.
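
To illustrate why content-defined chunking helps with slightly different copies, here is a toy sketch. It is not the Rabin polynomial fingerprint and not the go-ipfs implementation: it uses a simple Rabin-Karp style rolling hash, and `WINDOW`, `BASE`, `MOD`, `MASK`, the size limits, and all function names are assumptions chosen for the example.

```python
import random
from typing import Callable, List

WINDOW = 16           # number of trailing bytes that decide a boundary (assumed)
BASE = 257            # multiplier of the rolling hash (assumed)
MOD = (1 << 61) - 1   # prime modulus of the rolling hash (assumed)
MASK = (1 << 8) - 1   # cut when the low 8 bits are zero: a candidate about every 256 bytes


def fixed_chunks(data: bytes, size: int = 256) -> List[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, min_size: int = 64, max_size: int = 1024) -> List[bytes]:
    """Cut where a rolling hash of the last WINDOW bytes matches MASK,
    so boundaries follow the content instead of absolute offsets."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # shift in the newest byte
        length = i + 1 - start
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes that never hit a boundary
    return chunks


def shared_chunk_count(a: bytes, b: bytes, chunker: Callable[[bytes], List[bytes]]) -> int:
    """Count distinct chunks that appear in both inputs under the given chunker."""
    return len(set(chunker(a)) & set(chunker(b)))


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.randrange(256) for _ in range(64 * 1024))
    shifted = b"\x00" + original  # the same data with one byte inserted at the front
    print("fixed-size shared chunks:", shared_chunk_count(original, shifted, fixed_chunks))
    print("content-defined shared chunks:",
          shared_chunk_count(original, shifted, content_defined_chunks))
```

With fixed-size chunks, the inserted byte moves every later boundary, so the two copies share essentially nothing. With content-defined boundaries, the chunkers are expected to fall back into step shortly after the change. The per-byte hashing is also where the extra CPU cost comes from.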

@meyerzinn

See: #136

@flyingzumwalt
Contributor Author

I'll take #136 as confirmation that Rabin fingerprinting works for large datasets. Great work @20zinnm. I'll open a new issue for the tests that @Kubuxu suggested.

@meyerzinn

meyerzinn commented Jan 30, 2017

@flyingzumwalt It's still very heavy in terms of performance and needs high specs for anything > a few gigs. But yes, in principle it should work.
