
Hash conversion for import/export and long term archival #1953

Closed
btrask opened this issue Nov 9, 2015 · 26 comments

@btrask

btrask commented Nov 9, 2015

I've talked to Juan about this before but I figured it would be good to have an issue for it. I'm not trying to beat a dead horse though. This is separate from the URI format debate (#1678).

There is a lot of existing software that might want to move on top of IPFS or exchange data with it, and that currently uses hashes of files (e.g. images) or other data. This software typically uses MD5, SHA-1, or SHA-256.

There are three aspects of conversion that pose a problem for compatibility outside the IPFS ecosystem:

The multihash format
This can be solved with better tooling. The simplest would be a CLI tool (probably written in Go) that accepts hashes in various standard formats and encodings (e.g. SHA-256 hex) and outputs a multihash; given a multihash, it could emit the same hash in multiple standard encodings (e.g. hex and base-64). It might also produce multihashes from files directly. (Perhaps a tool like this exists, but even the JS multihash lib doesn't produce fully base-58 encoded multihashes for you.) A rough sketch of this conversion follows these points.

Documenting and reproducing the IPFS DAG format, and generating it on the fly while hashing
This is a big problem for any software that wants to interface with IPFS. IPFS addresses are hashes of the DAG data, but the DAG format is... whatever IPFS decides to stick in its protobufs (e.g. #1925). It isn't formally documented AFAIK, and there's no other software that can create IPFS DAGs besides IPFS itself. Even worse, it seems difficult to do this efficiently in a low-overhead streaming hasher (without retaining extra intermediate data).

Flexibility in the IPFS DAG format, for example the trickle-DAG mode
IPFS itself can't agree on the hash for any given file. Add a file with ipfs add myfile and ipfs add -t myfile and you get two different hashes. The trickle-DAG is a cool and useful feature that allows efficient streaming and seeking (#713) over the IPFS network. However it shouldn't result in different file addresses. (Even worse, I think other DAG structures are possible and could be used by alternate front-ends, all resulting in incompatible hashes.)

Converting hashes without recomputing them from the underlying file is impossible
Because IPFS hashes include the IPFS DAG structure, they are effectively a different hash algorithm altogether. You can convert a SHA-1 between hex and base-64, and even to a multihash, without re-reading the underlying data, but you can't get a full IPFS hash because the DAG data must be "mixed in" during hashing. But this is no worse than defining a new set of hash algorithms if the other problems can be addressed.
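
For the multihash piece specifically, here is a minimal sketch (in Go, since that's what such a tool would likely be written in) of converting a SHA-256 hex digest into a base-58 multihash: prepend the sha2-256 function code (0x12) and the digest length (0x20), then base58-encode. This only covers the encoding layer; the result is the multihash of the raw digest, not an IPFS file address, because the DAG wrapping described above is still missing.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"math/big"
	"os"
)

// base58btc alphabet, as used for displaying multihashes.
const alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// base58Encode converts raw bytes to a base58btc string.
func base58Encode(b []byte) string {
	n := new(big.Int).SetBytes(b)
	radix, mod := big.NewInt(58), new(big.Int)
	var out []byte
	for n.Sign() > 0 {
		n.DivMod(n, radix, mod)
		out = append([]byte{alphabet[mod.Int64()]}, out...)
	}
	for _, v := range b { // preserve leading zero bytes as '1'
		if v != 0 {
			break
		}
		out = append([]byte{'1'}, out...)
	}
	return string(out)
}

func main() {
	// Usage: go run mh.go <sha256-digest-in-hex>
	digest, err := hex.DecodeString(os.Args[1])
	if err != nil || len(digest) != 32 {
		fmt.Fprintln(os.Stderr, "expected a 64-char hex SHA-256 digest")
		os.Exit(1)
	}
	// multihash = <fn code><digest length><digest>; 0x12 = sha2-256, 0x20 = 32 bytes.
	mh := append([]byte{0x12, 0x20}, digest...)
	fmt.Println(base58Encode(mh))
}
```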

Right now, IPFS hashes are effectively proprietary, and I can't trust them for long-term use/storage. Sure, the IPFS code is open, but work needs to be done to make it practically possible for other software to generate and validate IPFS hashes. The bare minimum is documenting the DAG format and writing a simple, portable algorithm for generating IPFS hashes (synthesizing whatever DAG metadata is necessary). IPFS itself should also standardize on one representation (which doesn't mean getting rid of the trickle-DAG mode). Ideally, file hashes would be the raw hash of the file content, not including anything related to how it's stored in the DAG (but still using the multihash format).

Thank you!

@jbenet
Member

jbenet commented Nov 10, 2015

@btrask probably worth moving to notes repo -- https://github.com/ipfs/notes/ as it's more general stuff than go-ipfs specific

@jbenet
Member

jbenet commented Nov 10, 2015

(will respond there)

@jbenet
Member

jbenet commented Nov 10, 2015

-- also, i think you misunderstand IPFS. the whole point is creating this graph. you could have the hash of just the data as a secondary attribute, but it will never be the address of the objects.

@jbenet
Member

jbenet commented Nov 10, 2015

See also the growing ipld spec, which is a standalone format: https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md

also, the word "proprietary" does not make sense here at all.

proprietary. adjective

  1. relating to an owner or ownership.
    "the company has a proprietary right to the property"
  2. (of a product) marketed under and protected by a registered trade name.
    "proprietary brands of insecticide"

i think you mean application-specific.

@btrask
Author

btrask commented Nov 10, 2015

I just wrote up another use case for being able to generate hashes in advance of adding them to IPFS (#1957).

"Application-specific" implies that other applications/software don't have a legitimate need to be able to compute IPFS-compatible hashes, which I disagree with. Since IPFS hashes include the DAG data, that means the DAG must be fully specified, because applications need to be able to generate DAG data that is bit-for-bit identical. That includes how blocks are broken up and even how IPFS formats its whitespace.

Do you agree that having multiple hashes for a single (otherwise identical) file is a problem? Because IPFS has this problem by itself, with the trickle-DAG.

@jbenet
Member

jbenet commented Nov 16, 2015

Do you agree that having multiple hashes for a single (otherwise identical) file is a problem? Because IPFS has this problem by itself, with the trickle-DAG.

No, i don't think this is a problem at all. the hashes are not hashes of a file. they are hashes of a graph. The graphs represent the files AND access patterns AND deduplication AND compression AND ....

Repeating:

you could have the hash of just the data as a secondary attribute

@btrask
Author

btrask commented Nov 16, 2015

Interesting that you mention compression, since Git went through exactly the same confusion. When Git was first written by Linus, it used SHA-1s of the compressed data (using zlib). But later on they realized that was limiting, because it made them dependent on the exact compression used. So they switched and now they generate the hash of the uncompressed data, then compress it for storage. (I think I read a really good article about this a few years ago, but the best citation I could find is this Stack Overflow question).

Similarly in IPFS, this would grant you more flexibility to improve the DAG in the future.

It makes sense to me that the "file hash" should be a secondary attribute from the perspective of IPFS. However, it should be the primary hash from the perspective of the user. The user cares about the file, not IPFS' internal representation of it.

What do you think?

@btrask
Author

btrask commented Nov 16, 2015

To be clear, this might fit the plumbing/porcelain split. Plumbing = hash of top node in the merkle DAG. Porcelain = hash of the original file content.

@jbenet
Member

jbenet commented Nov 20, 2015

The user cares about the file, not IPFS' internal representation of it.

You (and some subset of users) care more about the file hash. but many others care more about the dag.

the dag is really important. a dag optimized for streaming video and a dag that gives a compressed delta representation from previous version are critically different to applications.

let's put this another way.

An image can be encoded as jpg or a png, but their hashes are different! horrible! let's not hash the image files, let's come up with some canonical representations of the image -- maybe take image fingerprints based on the full bitmap or vectorized form -- and hash those. after all, the user cares about the image, not the file system's internal representation of it.

Interesting that you mention compression, since Git went through exactly the same confusion.

I am well aware, and i disagree. This compression (what i talk about) is completely separate from compression on repo (disk) and on the wire as optimization around storing + transferring the dag (what git talks about), which yes, its best to take advantage of better representations as they come. The compression i'm talking about is for finding the most compact way to represent the data and let that be the canonical form. Attempt to approach low kolmogorov complexity states, use (hash) those, and compute it out. sure, can find better, but no, i dont want to content address Pi, i want to content address a program that computes Pi.

exactly the same confusion.

it would be nice if you stopped assuming that people are wrong when you haven't yet understood them.

@jbenet
Member

jbenet commented Nov 20, 2015

everything is a relative projection, and there are no absolutes, only the illusion of one from your relative vantage point. there is truly no canonical representation of any piece of information. even representations with the smallest kolmogorov complexity may have isomorphisms of exactly the same size. every version you pick, file or dag, is just one projection of many possible. yes, there will be many hashes for "yielding the same information". but different ways of yielding it.

@btrask
Author

btrask commented Nov 20, 2015

You (and some subset of users) care more about the file hash. but many others care more about the dag.

I care about both. IPFS has a very good interface for letting users/applications manipulate the DAG to store all sorts of data and do cool things. However, when a user uses ipfs add to add a file, the file is what the user cares about in that instance.

The compression i'm talking about is for finding the most compact way to represent the data and let that be the canonical form.

This is precisely the mistake that Git made.

everything is a relative projection, and there are no absolutes, only the illusion of one from your relative vantage point. there is truly no canonical representation of any piece of information. even representations with the smallest kolmogorov complexity may have isomorphisms of exactly the same size. every version you pick, file or dag, is just one projection of many possible. yes, there will be many hashes for "yielding the same information". but different ways of yielding it.

At the boundary of a file system (like IPFS), the file is the Platonic ideal. While it's true that there are semantic equivalences above that, they are unknowable without AI or lots of manual intervention. On the other hand I don't see how obfuscating which files started out as equivalent is anything but a step backwards.

Traditional file systems use many different types of internal representation and caching. Regardless of how a file is fragmented, whether it is sparse, or what extended attributes it has, the file name doesn't change.

@robcat

robcat commented Dec 8, 2015

I agree with @jbenet: ipfs has the ambition to be a universal archive of dags, and to succeed it needs flexibility in the hash function, chunking algorithms, and serialization of the wrapping that a simple file requires. That effectively requires a new hash, incompatible with the ones already in wide use.

But the result is that it is tricky to convert the already existing archival infrastructure of plain files (e.g. linux distribution repositories) if ipfs doesn't make it easy to rehash them using its own hashing algorithm.

For example: on my machine, running ipfs add -n <file> is about three times slower than sha256sum <file> (both ultimately use the same SHA-256 hash primitive).

What I would like to have:

  • documentation for the construction of the ipfs-style hash of a file (i.e. description of the data structure and serialization)
  • a public recommendation on a default chunking strategy and a default hashing algorithm; their stability will ease the migration of data to ipfs
  • a standalone tool (possibly called ipfssum) that just computes the ipfs-style hash of a file according to the recommendation above; being standalone makes it possible to optimize it earlier and more aggressively, so that it can reach a speed comparable to the standard hashing tools (e.g. sha256sum)
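
For what it's worth, the chunking half of such a tool is straightforward to sketch; it's the DAG-node serialization that needs the spec. Below is a rough sketch assuming a fixed-size 256 KiB chunker (an assumption; the actual default chunker and any rabin parameters are exactly what the recommended stability guarantee would pin down). It prints one sha2-256 digest per chunk, which is not the final ipfs-style hash, since that also covers the protobuf-encoded nodes linking the chunks together.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// chunkSize is an assumed default; the real value belongs in the spec.
const chunkSize = 256 * 1024

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: chunksum <file>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, chunkSize)
	for i := 0; ; i++ {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			// Digest of one chunk; the final address would also hash the
			// serialized DAG nodes that link these chunks together.
			fmt.Printf("chunk %d: sha2-256 %x\n", i, sha256.Sum256(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			panic(err)
		}
	}
}
```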

@jbenet
Member

jbenet commented Dec 8, 2015

Have you tried --only-hash?

@jbenet
Member

jbenet commented Dec 8, 2015

(Am fine making tools but they will need the importers (what decides how files are broken up))

@btrask
Author

btrask commented Dec 8, 2015

@jbenet It sounds like the importers are what needs to be standardized anyway?

A stand-alone package for --only-hash would be great.

@robcat

robcat commented Dec 8, 2015

@jbenet Yes of course, I used --only-hash (in the short -n form, as in my previous message). But it does not make a big difference anyway.

I tried the three possibilities and these are my totally nonscientific results (~1 GB video file, Sandy Bridge i3):

  • sha256sum: 4.636s
  • ipfs add -n: 11.320s (2.5x)
  • ipfs add: 11.687s

@jbenet
Member

jbenet commented Dec 9, 2015

@jbenet It sounds like the importers are what needs to be standardized anyway?

A stand-alone package for --only-hash would be great.

agreed on both!

@jbenet Yes of course, I used --only-hash (in the short -n form, as in my previous message). But it does not make a big difference anyway.

it does not make a difference? huh! we may have broken it. cc @whyrusleeping o/

@robcat this is definitely a bug. we should reach sha256 perf. (unless you're using rabin)

@jbenet
Member

jbenet commented Dec 9, 2015

  • we should use this as a moment to break {importers, chunkers, layouts, etc} out of go-ipfs into a separate repo.

@robcat

robcat commented Dec 9, 2015

we should use this as a moment to break off {importers, chunkers, layouts, etc} out of go-ipfs into a separate repo.

@jbenet: this modularity would be very cool, but just publishing the ipfs low level specifications could go a long way...

This is a use case I was thinking about:
look_at_how_many_of_your_files_already_are_on_ipfs.sh
It would be a bash script that hashes all your files and queries an ipfs gateway to tell you the percentage of files that are already published on ipfs. (This would be great for advertising ipfs and convincing people to run the daemon.)

Of course, people would not trust a fat opaque binary to go through all their files (possibility of data corruption). But given the ipfs specification, such a script could be written in a few lines using just bash and the standard unix utilities (e.g. sha256sum and split), making it much more trustworthy.
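
As a rough illustration of the querying half of that script (the hashing half still needs the spec, or a hypothetical ipfssum-style tool), here is a sketch that reads one ipfs-style hash per line from hashes.txt and asks a public gateway whether each one resolves within a timeout. The gateway URL, the file name, and the HEAD-request heuristic are all assumptions for illustration; a slow or unreachable object simply times out, so this will undercount.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	f, err := os.Open("hashes.txt") // one ipfs-style hash per line (assumed input)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	found, total := 0, 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		hash := sc.Text()
		if hash == "" {
			continue
		}
		total++
		// Ask a public gateway whether the object resolves; 200 counts as "already on ipfs".
		resp, err := client.Head("https://ipfs.io/ipfs/" + hash)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				found++
			}
		}
	}
	fmt.Printf("%d of %d files appear to be on ipfs already\n", found, total)
}
```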

@jbenet
Member

jbenet commented Dec 9, 2015

But using the ipfs specification, such a script could be written in a few lines just using bash and the standard unix utilities

not likely, not with more sophisticated chunking using things like rabin fingerprinting with specific parameters. in the end reading our modular code will be easier.

@btrask
Author

btrask commented Dec 11, 2015

Hey @jbenet, I talked to @mekarpeles and he suggested we talk. If you're interested and have some time it might be useful. I prefer Skype but can also use Hangouts.

@jbenet
Member

jbenet commented Dec 11, 2015

hey @btrask yeah, let's. i think it will help align a lot of our perspectives. we have our community hangouts on mondays, maybe after those. else tue is possible. let's maybe schedule by mail or irc

@btrask
Author

btrask commented Dec 17, 2015

Thanks for chatting @jbenet! I opened a couple issues to try to document some of our conversation.

@mekarpeles

👍

@OlegGirko

Recently I found this thread, and I'm very disappointed that IPFS, which was advertised as a content-addressed network, cannot be used for addressing files by their content.

It's a very basic and obvious use case: find a file by its SHA-1 or SHA-256 hash. There are many existing centralised systems that use their own Merkle DAG (with objects addressed by their standard hashes) and that could benefit from IPFS as storage, making these systems distributed and decentralised with minimal effort.

For example, imagine distributed Git. It would be great to be able to retrieve a commit by its SHA-1 hash, and then retrieve all the objects this commit contains by their respective SHA-1 hashes, recursively.

Or, for example, a repository of RPM packages used by the yum, dnf and zypper package managers is essentially a Merkle DAG, but it uses SHA-256. It would be great to be able to retrieve this DAG's root (a small XML file) from the official repository and then access all necessary packages by their SHA-256 checksums.

Unfortunately, I've found from this thread (and another thread linked from this one) that IPFS can't do this with its current design. It looks very strange to me that this basic and obvious use case was not included in the initial design, and that for now it's just a remote goal that can be achieved only by writing a separate service that maps from file hashes to internal IPFS hashes.

But it's somewhat relieving to find that IPFS is still very useful for posting cat photos: they can be compressed using different lossy algorithms, so they don't need to have stable hashes.

@whyrusleeping
Member

Closing, please move further discussion into the ipfs/notes repo
