
Proposal: Storage of sha for files in the tree #203

Closed
martinheidegger opened this issue Feb 21, 2018 · 16 comments

@martinheidegger
Contributor

User problem

Many files are (and have been) shared between users as part of a set, so many users (particularly within the same organization) maintain copies of the same file. When those users create new dats for every file, they replicate that file even though they wouldn't need to.

This is particularly important when using a central tool like datBase or hyperdrive that (also) acts as a backup for the data. Such central storage could then download each shared file only once, instead of once per client.

Code problem

Hyperdrive doesn't store SHA information in the tree, which makes it impossible to identify a file's content without downloading it entirely.

Solution

  1. Optionally (default: true) add a sha property to the file-tree item: https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L585-L595
  2. Add an optional API that allows looking up files by their SHA hash in a cache: `opt.cache.get(hash, cb)` (?) (see the sketch below)

Would you accept a PR that implements this?
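
For illustration, a rough sketch of the proposal (nothing here exists in hyperdrive yet; all field and option names are suggestions only):

```js
// Sketch of the proposal; none of this exists in hyperdrive yet.

// 1. A file-tree entry (hyperdrive's Stat-like value) gains an
//    optional sha property, written by default:
var entry = {
  mode: 33188,
  size: 12,
  offset: 0,
  byteOffset: 0,
  mtime: 1519171200000,
  ctime: 1519171200000,
  sha: '0a4d55a8d778e5022fab701977c5d840bbc486d0' // hex digest of the content
}

// 2. An optional cache in opts lets a client check whether it
//    already holds content for a given hash before downloading:
var opts = {
  cache: {
    get: function (hash, cb) {
      // look the hash up in some local store; call back with a
      // pointer to existing content, or an error if unknown
      cb(new Error('not found'))
    }
  }
}
```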

@bnewbold

This idea has come up before (dat-ecosystem-archive/datproject-discussions#77 (comment)) and I'd be in favor of it. There's some more thinking to go into it, though: which hash functions should we support? We already use BLAKE2b elsewhere, but md5/sha1 are much more commonly used today for this sort of de-duplication (as opposed to cryptographic assurance). How do we make this feature "future proof", allowing the set of supported hashes to evolve? What is the specific additional overhead incurred to calculate this hash for large files? Is this even useful if we don't also record the "inner" hashes of compressed files?

My current opinion is that multiple hash types should be allowed, but that we default to just SHA-1 to start. By default, clients would compute and add the hash but not verify it on download (relying on the Merkle tree for that).
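
To make the overhead question concrete, here is a minimal sketch (plain Node.js crypto/fs, not hyperdrive's API) of streaming a file through a SHA-1 digest the way a client would at add-time:

```js
// Sketch: compute a SHA-1 digest while streaming a file, using only
// Node's built-in modules. The file path is a placeholder.
const crypto = require('crypto')
const fs = require('fs')

function hashFile (path, algo, cb) {
  const hash = crypto.createHash(algo)
  fs.createReadStream(path)
    .on('data', function (chunk) { hash.update(chunk) })
    .on('error', cb)
    .on('end', function () { cb(null, hash.digest('hex')) })
}

hashFile('./big-file.bin', 'sha1', function (err, digest) {
  if (err) throw err
  console.log('sha1:', digest)
})
```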

I think the cache API/lookup should be done by an external library, at least to start.

@martinheidegger
Contributor Author

martinheidegger commented Feb 21, 2018

has come up before

Thanks! 😍

How do we make this feature "future proof"?

My thinking is to add `{hash: "blah", hashFn: "sha-1"}` to the tree.

What is the specific additional overhead incurred to calculate this hash for large files?

I can see several different overheads:

  1. CPU overhead for calculating the hash
  2. CPU overhead for feeding the chunks to the hashing function
  3. Memory overhead for storing the hash + function name
  4. Transfer overhead for transmitting the hash

Since all of this is added behind an option, it will remain possible to transfer data at full speed.

the cache API/lookup should be done by an external library

Yes, but how to trigger and weave it in? My approach would be to add a cache object to opts:
https://github.com/mafintosh/hyperdrive/blob/2a585c9a7906bd6d432ab9b3f7fd959fb8c3417c/index.js#L25

where cache would have a cache.get(`${hashFn}/${hash}`, cb) method, to keep it partially compatible with leveldb.
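
For example, a minimal sketch of such a cache backed by the `level` package, using the `${hashFn}/${hash}` key layout from above (the stored value shape is hypothetical):

```js
// Sketch: a leveldb-backed cache with the proposed get interface.
// The value shape (where the content can be found) is hypothetical.
var level = require('level')
var cache = level('./hash-cache')

var key = 'sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0'

cache.put(key, JSON.stringify({ path: '/hello.txt', size: 11 }), function (err) {
  if (err) throw err

  // The cache.get(`${hashFn}/${hash}`, cb) call described above:
  cache.get(key, function (err, value) {
    if (err) return console.error('not cached:', err)
    console.log('cached at:', JSON.parse(value))
  })
})
```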

@pfrazee
Contributor

pfrazee commented Feb 21, 2018

This kind of metadata is also useful in applications for establishing happens-before precedence. If you can reference the hash of a file, then you know it was created prior to the hash's publication.

@mafintosh
Contributor

mafintosh commented Feb 21, 2018 via email

@martinheidegger
Contributor Author

@mafintosh a few questions:

  1. As mentioned in the comments above: should it also store which hashing method was used?
  2. Is it okay to be optional? With a flag in the opts?
  3. What about the cache-lookup API?

@mafintosh
Contributor

@martinheidegger

  1. Yes, or just support BLAKE2b for now (your call).
  2. Yes, it should be optional.
  3. The cache API is out of scope for hyperdrive, but definitely something that could have value by itself.

@martinheidegger
Contributor Author

@mafintosh Thanks for the reply. I guess it's easy to get started working on it :)

@damons

damons commented Mar 9, 2018

What I think about when I think about future-proofing this: can this be parallelized over GPUs when I REALLY need to sync?

I'm curious, too, if using factoradics as an index/hash is a way to optimize dividing up the indexing work. See: https://www.researchgate.net/publication/292950384_A_GPU-based_Branch-and-Bound_algorithm_using_Integer-Vector-Matrix_data_structure

@damons

damons commented Mar 9, 2018

Also, Bela Ban and the team at jgroups.org have done a lot of the practical groundwork around optimizing state sync over TCP and UDP. Yes, it's Java, but the hashing and syncing logic has been optimized for decades now and includes a decent amount of wisdom to consider. Not as new as a permutahedron approach, though. ;)

@damons

damons commented Mar 9, 2018

See: http://jgroups.org/demos.html

ReplicatedHashMap

@martinheidegger
Contributor Author

I spent some brain cycles rethinking this issue, and the following bugged me: having the sha in the metadata means we need to download the whole metadata tree before we can look anything up by hash. To allow random access, the index needs to be a trie structure, probably in a separate hypercore. If the hashes were additionally stored together with the metadata, they would just be duplicated (once in the metadata and once in the trie).

Thinking about this, I believe that the solution I proposed in this issue is not a good idea. Should I close this issue?

@RangerMauve
Contributor

I thought metadata was already being fully synced when you connect to a peer, even in sparse mode.

@martinheidegger
Contributor Author

@RangerMauve There is a sparseMetadata option in hyperdrive that treats metadata as sparse, which is already useful (and used) when you have a lot of files. I assume this will become even more relevant when hyperdb is used as the backend for hyperdrive. Basically: there is no reason to replicate the metadata of a million files from a previous version; let's only replicate the metadata of the hundred thousand files required for the current file tree. (Downloading all that metadata might also take some time...)
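
For illustration, opening an archive with both options might look like this (hyperdrive 9-style API; the key is a placeholder):

```js
// Sketch: open an archive without eagerly replicating content or
// metadata; the key below is a placeholder, not a real dat key.
var hyperdrive = require('hyperdrive')

var key = '<64-char hex dat key>'
var archive = hyperdrive('./my-archive', key, {
  sparse: true,         // fetch content blocks only on demand
  sparseMetadata: true  // fetch metadata blocks only on demand
})
```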

@RangerMauve
Contributor

Huh, TIL. 😛

I'm 100% into having another hypercore with SHA hashes. It'll make interop with stuff like git easier and maybe help with deduplication and merges / forks.
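
A rough sketch of what such an index could look like as its own hypertrie, using the `${hashFn}/${hash}` key layout suggested earlier (the value shape is hypothetical):

```js
// Sketch: a separate hypertrie mapping content hashes to drive paths,
// stored alongside (not inside) the drive's metadata.
var hypertrie = require('hypertrie')

var index = hypertrie('./hash-index', { valueEncoding: 'json' })

index.put('sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0', { path: '/hello.txt' }, function (err) {
  if (err) throw err

  // Random-access lookup without replicating the full metadata feed:
  index.get('sha-1/0a4d55a8d778e5022fab701977c5d840bbc486d0', function (err, node) {
    if (err) throw err
    console.log(node && node.value) // -> { path: '/hello.txt' }
  })
})
```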

@martinheidegger
Contributor Author

martinheidegger commented Jan 17, 2019

Reading through the code a bit, it seems like this could be done either in hyperdb or hyperdrive. It seems more reasonable to do it in hyperdrive (using hypertrie), as that could even be backwards compatible (if implemented in both branches).

Though I think having an index (a hash index as the first variant) on a hyperdb would be an awesome general thing.

@damons

damons commented Feb 18, 2019

I'm curious to see if a permutahedron could be used for this.
