
Proposal: Storage of sha for files in the tree #203

Closed
martinheidegger opened this issue Feb 21, 2018 · 16 comments

@martinheidegger
Contributor

User problem

Many files are (and have been) shared between users as part of a set, so many users (particularly within the same organization) maintain copies of the same file. When those users create new dats for every file, they replicate that file even though they wouldn't need to.

This is particularly important when using a central tool like datBase or hyperdrive that (also) acts as a backup for the data. Such central storage could then download each shared file only once, instead of once per client.

Code problem

Hyperdrive doesn't store SHA information in the tree, which makes it impossible to identify a file's content without downloading it entirely.

Solution

  1. Optionally (default: true) add a sha property to the file-tree item: https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L585-L595
  2. Add an optional API that allows looking up files by their SHA hash in a cache: `opt.cache.get(hash, cb)` (?) (see the sketch below)

Would you accept a PR that implements this?
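
For illustration, a rough sketch of the proposal (nothing here exists in hyperdrive yet; all field and option names are suggestions only):

```js
// Sketch of the proposal; none of this exists in hyperdrive yet.

// 1. A file-tree entry (hyperdrive's Stat-like value) gains an
//    optional sha property, written by default:
var entry = {
  mode: 33188,
  size: 12,
  offset: 0,
  byteOffset: 0,
  mtime: 1519171200000,
  ctime: 1519171200000,
  sha: '0a4d55a8d778e5022fab701977c5d840bbc486d0' // hex digest of the content
}

// 2. An optional cache in opts lets a client check whether it
//    already holds content for a given hash before downloading:
var opts = {
  cache: {
    get: function (hash, cb) {
      // look the hash up in some local store; call back with a
      // pointer to existing content, or an error if unknown
      cb(new Error('not found'))
    }
  }
}
```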

@bnewbold

This idea has come up before (dat-ecosystem-archive/datproject-discussions#77 (comment)) and I'd be in favor of it. There's some more thinking to go into it, though: which hash functions should we support? We already use BLAKE2b elsewhere, but md5/sha1 are much more commonly used today for this sort of de-duplication (as opposed to cryptographic assurance). How do we make this feature "future proof", allowing the set of supported hashes to evolve? What is the specific additional overhead incurred to calculate this hash for large files? Is this even useful if we don't also record the "inner" hashes of compressed files?

My current opinion is that multiple hash types should be allowed, but that we default to just SHA-1 to start. By default, clients would compute and add the hash but not verify it on download (relying on the Merkle tree for that).
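
To make the overhead question concrete, here is a minimal sketch (plain Node.js crypto/fs, not hyperdrive's API) of streaming a file through a SHA-1 digest the way a client would at add-time:

```js
// Sketch: compute a SHA-1 digest while streaming a file, using only
// Node's built-in modules. The file path is a placeholder.
const crypto = require('crypto')
const fs = require('fs')

function hashFile (path, algo, cb) {
  const hash = crypto.createHash(algo)
  fs.createReadStream(path)
    .on('data', function (chunk) { hash.update(chunk) })
    .on('error', cb)
    .on('end', function () { cb(null, hash.digest('hex')) })
}

hashFile('./big-file.bin', 'sha1', function (err, digest) {
  if (err) throw err
  console.log('sha1:', digest)
})
```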

I think the cache API/lookup should be done by an external library, at least to start.

@martinheidegger
Contributor Author

martinheidegger commented Feb 21, 2018

has come up before

Thanks! 😍

How do we make this feature "future proof"?

My thinking is to add `{hash: "blah", hashFn: "sha-1"}` to the tree.

What is the specific additional overhead incurred to calculate this hash for large files?

I can see several different overheads:

  1. CPU overhead for calculating the hash
  2. CPU overhead for feeding the chunks to the hashing function
  3. Memory overhead for storing the hash + function name
  4. Transfer overhead for transmitting the hash

Since all of this is added behind an option, it will remain possible to transfer data at full speed.

the cache API/lookup should be done by an external library

Yes, but how to trigger and weave it in? My approach would be to add a cache object to opts:
https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L25

where cache would have a cache.get(`${hashFn}/${hash}`, cb) method, to keep it partially compatible with leveldb.
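
For example, a minimal sketch of such a cache backed by the `level` package, using the `${hashFn}/${hash}` key layout from above (the stored value shape is hypothetical):

```js
// Sketch: a leveldb-backed cache with the proposed get interface.
// The value shape (where the content can be found) is hypothetical.
var level = require('level')
var cache = level('./hash-cache')

var key = 'sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0'

cache.put(key, JSON.stringify({ path: '/hello.txt', size: 11 }), function (err) {
  if (err) throw err

  // The cache.get(`${hashFn}/${hash}`, cb) call described above:
  cache.get(key, function (err, value) {
    if (err) return console.error('not cached:', err)
    console.log('cached at:', JSON.parse(value))
  })
})
```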

@pfrazee
Contributor

pfrazee commented Feb 21, 2018

This kind of metadata is also useful in applications for establishing happens-before precedence. If you can reference the hash of a file, then you know it was created prior to the hash's publication.

@mafintosh
Contributor

mafintosh commented Feb 21, 2018 via email

@martinheidegger
Contributor Author

@mafintosh a few questions:

  1. As mentioned in the comments above: should it also store which hashing method was used?
  2. Is it okay to be optional? With a flag in the opts?
  3. What about the cache-lookup API?

@mafintosh
Contributor

@martinheidegger

  1. Yes, or just support BLAKE2b for now (your call).
  2. Yes, it should be optional.
  3. The cache API is out of scope for hyperdrive, but definitely something that could have value by itself.

@martinheidegger
Contributor Author

@mafintosh Thanks for the reply. I guess it's easy to get started working on it :)

@damons

damons commented Mar 9, 2018

What I think about when I think about future-proofing this: can this be parallelized over GPUs when I REALLY need to sync?

I'm curious, too, if using factoradics as an index/hash is a way to optimize dividing up the indexing work. See: https://www.researchgate.net/publication/292950384_A_GPU-based_Branch-and-Bound_algorithm_using_Integer-Vector-Matrix_data_structure

@damons

damons commented Mar 9, 2018

Also, Bela Ban and the team at jgroups.org have done a lot of the practical groundwork around optimizing state sync over TCP and UDP. Yes, it's Java, but the hashing and syncing logic has been optimized for decades now and includes a decent amount of wisdom to consider. Not as new as a permutahedron approach, though. ;)

@damons

damons commented Mar 9, 2018

See: http://jgroups.org/demos.html

ReplicatedHashMap

@martinheidegger
Contributor Author

I spent some brain cycles rethinking this issue, and the following bugged me: having the sha in the metadata means we need to download the whole metadata tree before we can look anything up by hash. To allow random access, the index needs to be a trie structure, probably in a separate hypercore. If the hashes were additionally stored together with the metadata, they would just be duplicated (once in the metadata and once in the trie).

Thinking about this, I believe that the solution I proposed in this issue is not a good idea. Should I close this issue?

@RangerMauve
Contributor

I thought metadata was already being fully synced when you connect to a peer, even in sparse mode.

@martinheidegger
Contributor Author

@RangerMauve There is a sparseMetadata option in hyperdrive that treats metadata as sparse, which is already useful (and used) when you have a lot of files. I assume this will become even more relevant when hyperdb is used as the backend for hyperdrive. Basically: there is no reason to replicate the metadata of a million files from a previous version; let's only replicate the metadata of the hundred thousand files required for the current file tree. (Downloading all that metadata might also take some time...)
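
For illustration, opening an archive with both options might look like this (hyperdrive 9-style API; the key is a placeholder):

```js
// Sketch: open an archive without eagerly replicating content or
// metadata; the key below is a placeholder, not a real dat key.
var hyperdrive = require('hyperdrive')

var key = '<64-char hex dat key>'
var archive = hyperdrive('./my-archive', key, {
  sparse: true,         // fetch content blocks only on demand
  sparseMetadata: true  // fetch metadata blocks only on demand
})
```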

@RangerMauve
Contributor

Huh, TIL. 😛

I'm 100% into having another hypercore with SHA hashes. It'll make interop with stuff like git easier and maybe help with deduplication and merges / forks.
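
A rough sketch of what such an index could look like as its own hypertrie, using the `${hashFn}/${hash}` key layout suggested earlier (the value shape is hypothetical):

```js
// Sketch: a separate hypertrie mapping content hashes to drive paths,
// stored alongside (not inside) the drive's metadata.
var hypertrie = require('hypertrie')

var index = hypertrie('./hash-index', { valueEncoding: 'json' })

index.put('sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0', { path: '/hello.txt' }, function (err) {
  if (err) throw err

  // Random-access lookup without replicating the full metadata feed:
  index.get('sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0', function (err, node) {
    if (err) throw err
    console.log(node && node.value) // -> { path: '/hello.txt' }
  })
})
```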

@martinheidegger
Contributor Author

martinheidegger commented Jan 17, 2019

Reading through the code a bit, it seems like this could be done either in hyperdb or hyperdrive. It seems more reasonable to do it in hyperdrive (using hypertrie), as that could even be backwards compatible (if implemented in both branches).

Though I think having an index (a hash index as the first variant) on a hyperdb would be an awesome general thing.

@damons

damons commented Feb 18, 2019

I'm curious to see if a permutahedron could be used for this.
