Proposal: Storage of sha for files in the tree #203
Comments
This idea has come up before (dat-ecosystem-archive/datproject-discussions#77 (comment)) and I'd be in favor of it. There's some more thinking to go into it though:
- Which hash functions to support? We already use BLAKE2b elsewhere, but md5/sha1 are much more commonly used today for this sort of de-dupe (as opposed to cryptographic assurance).
- How do we make this feature "future proof", allowing evolution of which hashes are used?
- What is the specific additional overhead incurred to calculate this hash for large files?
- Is this even useful if we don't also find the "inner" hashes of compressed files?

My current opinion is that multiple hash types should be allowed, but that we default to just SHA-1 to start. By default clients would compute and add the hash, but not verify it on download (relying on the Merkle tree for that). I think the cache API/lookup should be done by an external library, at least to start.
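As a minimal sketch of the "multiple hash types, SHA-1 by default" idea on the client side: the digest map keyed by algorithm name below is an assumption made up for illustration, not anything hyperdrive defines.

```js
// Sketch: compute a SHA-1 digest while streaming a file, and keep it in an
// { algorithm: digest } map so other hash types can be added later.
// The `hashes` shape is an assumption for illustration, not hyperdrive's API.
const crypto = require('crypto')
const fs = require('fs')

function hashFile (path, algo, cb) {
  const hash = crypto.createHash(algo) // e.g. 'sha1'
  fs.createReadStream(path)
    .on('error', cb)
    .on('data', chunk => hash.update(chunk))
    .on('end', () => cb(null, hash.digest('hex')))
}

hashFile('./big-file.bin', 'sha1', (err, digest) => {
  if (err) throw err
  // A future-proof metadata entry could carry several digests side by side.
  const hashes = { sha1: digest } // later: hashes.blake2b = ..., hashes.sha256 = ...
  console.log(hashes)
})
```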
Thanks! 😍
My thinking is to add
I can see several different overheads:
Since all of this is added through an option, it will stay possible to transfer data at an unobstructed speed.
Yes, but how to trigger and weave it in? My approach would be to add a where
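Purely as an illustration of how an opt-in hash could be "weaved in": a transform stream could sit between the caller and the archive's write stream. Everything named here (`createHashingStream`, the way the digest is reported) is invented for this sketch, not hyperdrive internals.

```js
// Sketch: an opt-in hashing transform that observes chunks as they are written.
// `archive.createWriteStream` is hyperdrive's real API; the `onhash` callback
// and how the digest would be attached to metadata are assumptions.
const crypto = require('crypto')
const { Transform } = require('stream')

function createHashingStream (algo, onhash) {
  const hash = crypto.createHash(algo)
  return new Transform({
    transform (chunk, enc, cb) {
      hash.update(chunk)           // hash the chunk as it passes through
      cb(null, chunk)              // and forward it unchanged
    },
    flush (cb) {
      onhash(hash.digest('hex'))   // report the finished digest
      cb()
    }
  })
}

// Usage (assuming `archive` is a hyperdrive instance):
// fs.createReadStream('./photo.jpg')
//   .pipe(createHashingStream('sha1', digest => console.log('sha1:', digest)))
//   .pipe(archive.createWriteStream('/photo.jpg'))
```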
This kind of metadata is also useful in applications for establishing happens-before precedence. If you can reference the hash of a file, then you know it was created prior to the hash's publication.
I have no problem adding a flat hash of the file. Someone wants to do a PR that adds a blake2b?
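For what a flat blake2b of a file could look like, here is a sketch using the `blake2b` npm module; the update/digest API shown is assumed from that module's README, and Node's built-in crypto also exposes `blake2b512` when built against OpenSSL 1.1.1+.

```js
// Sketch: a flat blake2b digest of a whole file using the `blake2b` npm module.
// API shape assumed from the module's README; not tied to hyperdrive itself.
const blake2b = require('blake2b')
const fs = require('fs')

function blake2bFile (path, cb) {
  const hash = blake2b(32) // 32-byte digest, the same length hypercore uses
  fs.createReadStream(path)
    .on('error', cb)
    .on('data', chunk => hash.update(chunk))
    .on('end', () => cb(null, hash.digest('hex')))
}

blake2bFile('./big-file.bin', (err, digest) => {
  if (err) throw err
  console.log('blake2b:', digest)
})
```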
@mafintosh a few questions:
@mafintosh Thanks for the reply. I guess it's easy to get started working on it :)
What I think about when I think about future-proofing this: can this be parallelized over GPUs when I REALLY need to sync? I'm curious, too, whether using factoradics as an index/hash is a way to optimize dividing up the indexing work. See: https://www.researchgate.net/publication/292950384_A_GPU-based_Branch-and-Bound_algorithm_using_Integer-Vector-Matrix_data_structure
Also, Bela Ban and the team at jgroups.org have done a lot of the practical groundwork around optimizing state sync over TCP and UDP. Yes, it's Java, but the hashing and syncing logic has been optimized for decades now and includes a decent amount of wisdom to consider. Not as new as a permutahedron approach, though. ;)
See the ReplicatedHashMap demo: http://jgroups.org/demos.html
I spent some brain-cycles rethinking this issue and the following bugged me: having the sha in the metadata means that we would need to download the whole metadata tree before we could look anything up by hash. In order to allow random access, this needs to be in a trie structure and probably a separate hypercore. If it were stored together with the metadata, it would just result in duplicate storage of the hashes (once in the metadata and once in the trie). Thinking about this, I believe that the solution I proposed in this issue is not a good idea. Should I close this issue?
I thought metadata was already being fully synched up when you connect to a peer, even in sparse mode. |
@RangerMauve There is a sparseMetadata option in hyperdrive that treats metadata as sparse, which is useful (and used) right now when you have a lot of files. Though I assume this will become even more prevalent when using hyperdb as the backend for hyperdrive. Basically: there is no reason to replicate the metadata of a million files from a previous version, let's only replicate the metadata of the hundred thousand files required for the current file-tree. (Also, the download of this metadata might take some time...)
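For context, a sketch of how sparse metadata replication is opted into, assuming hyperdrive's documented constructor options (`sparse` for the content feed, `sparseMetadata` for the metadata feed); the key below is a placeholder.

```js
// Sketch: open an existing archive and only pull the metadata that is actually
// touched, instead of replicating the full history of the metadata feed.
// (Replication via e.g. hyperdiscovery is assumed to be running elsewhere.)
const hyperdrive = require('hyperdrive')

const archive = hyperdrive('./storage', '<archive key>', {
  sparse: true,         // only download content blocks that are requested
  sparseMetadata: true  // only download metadata blocks that are requested
})

archive.readFile('/hello.txt', 'utf-8', (err, data) => {
  if (err) throw err
  console.log(data) // only the metadata and blocks for this file were fetched
})
```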
Huh, TIL. 😛 I'm 100% into having another hypercore with SHA hashes. It'll make interop with stuff like git easier and maybe help with deduplication and merges / forks.
Reading through the code a bit, it seems like this could be done either in hyperdb or hyperdrive. But it kinda seems more reasonable to do it in hyperdrive (using hypertrie), as it could even be backwards compatible (if implemented in both branches). Though I think having an index (a hash index as the first variant) on a hyperdb would be an awesome general thing.
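A sketch of the "separate hypercore with a hash index" idea, using hypertrie to map a content hash to the path that carries it; the `/by-hash/...` key scheme and the value shape are made up for this illustration.

```js
// Sketch: a standalone hash -> path index kept in its own hypertrie,
// next to (not inside) the drive's metadata feed.
const hypertrie = require('hypertrie')

const index = hypertrie('./hash-index', { valueEncoding: 'json' })

// Record that a given sha1 digest is stored at a path in the drive.
function addEntry (sha1, path, cb) {
  index.put('/by-hash/' + sha1, { path }, cb)
}

// Random access: look a file up by its hash without walking the whole tree.
function lookup (sha1, cb) {
  index.get('/by-hash/' + sha1, (err, node) => {
    if (err) return cb(err)
    cb(null, node ? node.value.path : null)
  })
}

addEntry('2aae6c35c94fcfb415dbe95f408b9ce91ee846ed', '/hello.txt', err => {
  if (err) throw err
  lookup('2aae6c35c94fcfb415dbe95f408b9ce91ee846ed', (err, path) => {
    if (err) throw err
    console.log('found at', path) // -> '/hello.txt'
  })
})
```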
I'm curious to see if a permutahedron could be used for this.
User-Problem
Many files are shared between users as part of a set, so many users (particularly within the same organization) maintain copies of the same file. If each of those users creates a new dat containing such a file, the file gets stored and replicated again for every dat, even though it wouldn't need to be.
This is particularly important when using a central tool like datBase or hyperdrive that (also) acts as a backup for the data: those central storages can easily end up downloading the same file once for each client.
Code-Problem
Hyperdrive doesn't store the sha information in the tree, which makes it impossible to identify a file's content before downloading it entirely.
Solution
- Add an (optional) sha property to the file-tree item: https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L585-L595
- A cache lookup hook, e.g. opt.cache.get(hash, cb) (?)

Would you accept a PR that implements this?
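A sketch of how the proposed opt.cache.get(hash, cb) hook might be consulted on the reading side, checking a local store by content hash before falling back to downloading from peers; the option name comes from this proposal, and everything else here is invented for illustration.

```js
// Sketch: consult a user-supplied cache by content hash before downloading.
// `entry.sha` is the proposed new tree property; `downloadFromPeers` stands in
// for hyperdrive's normal content replication and is not a real function.
function readWithCache (archive, name, opt, cb) {
  archive.stat(name, (err, entry) => {
    if (err) return cb(err)
    if (opt.cache && entry.sha) {
      // Ask the cache first: cb(null, buffer) on a hit, cb(null, null) on a miss.
      return opt.cache.get(entry.sha, (err, cached) => {
        if (err) return cb(err)
        if (cached) return cb(null, cached)   // dedupe: no download needed
        downloadFromPeers(archive, name, cb)  // hypothetical fallback
      })
    }
    downloadFromPeers(archive, name, cb)
  })
}

// Hypothetical fallback used above; in reality this would be hyperdrive's
// own readFile / content replication.
function downloadFromPeers (archive, name, cb) {
  archive.readFile(name, cb)
}
```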