Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identity hash #4697

Closed
ivan386 opened this issue Feb 13, 2018 · 18 comments
Closed

identity hash #4697

ivan386 opened this issue Feb 13, 2018 · 18 comments
Assignees
Labels
kind/feature A new feature status/in-progress In progress topic/perf Performance

Comments

@ivan386
Copy link
Contributor

ivan386 commented Feb 13, 2018

Version information:

go-ipfs version: 0.4.13-
Repo version: 6
System version: amd64/windows
Golang version: go1.9.2

Type: Feature

Description:

With identity hash we can inline data in link. Useful for short or zero length data.

Trivial hash function

If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this "trivial" (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value.

The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly; whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method.

Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in extended ASCII[7] or ISO Latin 1), the table has only 28 = 256 entries; in the case of Unicode characters, the table would have 17×216 = 1114112 entries.

The same technique can be used to map two-letter country codes like "us" or "za" to country names (262 = 676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code 00000) may be left undefined in the table or mapped to some appropriate "null" value.

Additional information:

z3vosKrgsh -> 01 55 00 03 00 00 00 is valid 3 bytes length "identity" hash of raw block with same length and data (00 00 00)

>ipfs block get z3vosKrgsh
Error: blockservice: key not found

>ipfs get z3vosKrgsh
Error: merkledag: not found
@ghost
Copy link

ghost commented Feb 13, 2018

All hashes in IPFS must be based on cryptographic hash functions. Multihash can (and does) support non-cryptographic hashes as well, but IPFS and IPLD hashes MUST be cryptographically verifiable. See #4371.

@ivan386
Copy link
Contributor Author

ivan386 commented Feb 13, 2018

It data itself. No need to publish it and no need to retrieve it. Data already there. See here why we need it : #4680 (comment)

@kevina kevina self-assigned this Feb 13, 2018
@kevina
Copy link
Contributor

kevina commented Feb 13, 2018

This is something I could work on assuming it is something that is wanted. cc @Stebalien

@Stebalien
Copy link
Member

Yes, this is definitely something we want (well, something I want).

@kevina
Copy link
Contributor

kevina commented Feb 13, 2018

Okay I work on in within a week.

@jbenet
Copy link
Member

jbenet commented Feb 22, 2018

All hashes in IPFS must be based on cryptographic hash functions. Multihash can (and does) support non-cryptographic hashes as well, but IPFS and IPLD hashes MUST be cryptographically verifiable. See #4371.

I think it's OK to allow the identity hash function. You could argue it's cryptographically safe for the purposes of #4371 -- we may only need CRH?

@Kubuxu
Copy link
Member

Kubuxu commented Feb 23, 2018

There are two main issue with not safe hashes: collisions allowing to fake the data and breaking principle of DAG (weak hashes allow creation of cyclic graphs by collisions). Those issues are not present in case of identity hash so I also think it is safe to use.

@mib-kd743naq
Copy link
Contributor

Reposting a question from the internets:

mib_kd743naq: hi! it's been a while
              the identity "hash" thing, allowing for embedding of very short data directly as the "pointer"
              is this available in a release or at least in master, or it is still WiP?

@Stebalien
Copy link
Member

@mib-kd743naq Multihash supports it but it isn't supported in go-ipfs for storing blocks (yet).

@Stebalien Stebalien added topic/perf Performance status/deferred Conscious decision to pause or backlog kind/feature A new feature labels Apr 1, 2018
@mib-kd743naq
Copy link
Contributor

@Stebalien does the backlog label imply this is downgraded to "nice to have... someday"? Without making promises, do you have a ballpark realistic guess which version of go-ipfs may contain this feature?

@Stebalien
Copy link
Member

@mib-kd743naq it means "we want this but nobody's actively working on it". Honestly, it shouldn't be that hard to implement, someone just needs to take it up and do it (I guess it warrants a "help wanted" in that case as well).

@Stebalien Stebalien added the help wanted Seeking public contribution on this issue label Apr 1, 2018
@kevina
Copy link
Contributor

kevina commented Apr 1, 2018

I have thought a bit about this and can likely implement it in the next one or two weeks if it is something that is wanted now.

@mib-kd743naq
Copy link
Contributor

@kevina It'd be amazing if you can do this. Here is a simple test case for you:

xxd -r -p <<<0a0408011804121f0a080155000444415441121153686f7274203420627974652066696c651804 \
  | ipfs block put --format=protobuf

This should be a directory with a single 4 byte file containing DATA as its sole line

@kevina
Copy link
Contributor

kevina commented Apr 2, 2018

I think the best way to implement this would be at the blockstore level. In particular I propose this is implemented as a wrapper blockstore (in the same way the cacheing blockstores are implemented) with the following semantics (when I say "pass on the request" I mean pass on the request to the underlying blockstore):

  • Get(*cid.Cid) (blocks.Block, error) - if an id-hash extract the contents from the hash and return that, otherwise pass on the request
  • Has(*cid.Cid) (bool, error) - if an id-hash then always return true, otherwise pass on the request
  • Put(blocks.Block) error - - if an id-hash and HashOnRead is enabled verify that the contents match the hash, if HashOnRead is not enabled then a no-op; if not a id-hash pass on the request
  • PutMany([]blocks.Block) error - extract the id-hashes and call Put on each, pass on the rest
  • DeleteBlock(*cid.Cid) error - if an id-hash then a no-op, otherwise pass on the request
  • AllKeysChan(ctx context.Context) (<-chan *cid.Cid, error) - pass on the request (it is impossible to list the identify hashes as they are never stored)
  • HashOnRead(enabled bool) - toggle the HashOnRead flag and pass on the request

If this sounds good I will go ahead and implement it.

CC @Stebalien

@Stebalien
Copy link
Member

SGTM.

@mib-kd743naq
Copy link
Contributor

if an id-hash and HashOnRead is enabled verify that the contents match the hash

But the content is the hash, that line should be "noop if id-hash"...

@Stebalien
Copy link
Member

But the content is the hash, that line should be "noop if id-hash"...

Yeah, that threw me as well. He's talking about verifying that the bytes returned by Block.Data() match the bytes in the identity hash CID returned by Block.Cid(). Someone could theoretically construct a Block where the CID doesn't match the data (even with the identity hash).

However, we can define special block type for identity Cids so we can check this invariant by doing a simple type check and only fall back on comparing bytes if we get a type that we're not expecting.

@kevina
Copy link
Contributor

kevina commented Apr 3, 2018

Note: Working on this right now but have not pushed anything.

Note that HashOnRead only applied when reading a hash from the BlockStore not storing it, so the new semantics of Put and PutMany is:

  • Put(blocks.Block) error - - if an id-hash then a no-op; if not a id-hash pass on the request
  • PutMany([]blocks.Block) error - remove the id-hashes and pass on the rest

Sorry for the confusion.

@ghost ghost added status/in-progress In progress and removed status/deferred Conscious decision to pause or backlog labels Apr 3, 2018
@kevina kevina removed the help wanted Seeking public contribution on this issue label Apr 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature A new feature status/in-progress In progress topic/perf Performance
Projects
No open projects
ipfs/go-ipfs
In Progress
Development

No branches or pull requests

7 participants