Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The SHA1 hash of maps should not depend on the key ordering #362

Closed
asciiansi opened this issue Oct 19, 2019 · 5 comments
Closed

The SHA1 hash of maps should not depend on the key ordering #362

asciiansi opened this issue Oct 19, 2019 · 5 comments
Assignees
Labels
1.x bug Something isn't working

Comments

@asciiansi
Copy link

asciiansi commented Oct 19, 2019

(original question below)

Currently, the content/value hashes for a map depend on the key ordering - with different orders (possibly arising from different concrete map implementations), we generate different content/value hashes for the 'same' map, which means:

  • Entities with a map key (useful, but so far undocumented) aren't always returned/joined correctly.
  • (Lesser impact, but) some documents that are essentially the same (varying only in key order) are stored under multiple content-hashes (sub-optimal, but no undesired behaviour)

(old proposed solution, for posterity, see #474 for what I ended up doing):

  • The SHA1 hashes of maps should not depend on the key ordering - we'll take inspiration from java.util.Map's hashcodes here.
  • Users that want to can preserve the old behaviour for document content-hashes
  • Users that want to can preserve the old behaviour for map-keys (independently from the previous content-hash behaviour)
  • Users that want to can migrate their Kafka streams via a tool.

(original question)

Is it intended that map keys for a crux.db/id always must be in the same order, despite having the same content/hash? Querying for an entity using (crux/entity db {:a 1 :b 1}) is not the same as (crux/entity db {:b 1 :a 1}) for example. The query only seems to work for me using the same key order as the transacted entity.

I can semi-understand why this is case but nonetheless it's a bit surprising for keys and lead to some brief debugging. Specifically, it seems a bit strange as maps are by definition unordered and often manipulating programatically.

Moreover, it's typical in a larger system when dealing with dynamically constructed keys to make use of assoc, update, etc. which becomes very dangerous and instead requires construction of a new map, creating even more garbage. I currently work around this simply by using a function every time I create keys dynamically, but there are a few cases where this is a bit annoying.

I find maps in general useful as keys since they allow a composite and additionally remove some overhead for reaching into an entity via a query using crux/q. I would prefer other options for key and would love to discuss further if you are willing along with use-cases and reasoning. For now, I'm mostly using maps to workaround dissatisfaction with other key types which aren't really suitable for my apps.

For context, I am using 100% RocksDB on both ends (though I prefer/will eventually switch to LMDB). Here are my deps:

juxt/crux-core                      {:mvn/version "19.09-1.5.0-alpha"}
juxt/crux-rocksdb                   {:mvn/version "19.09-1.5.0-alpha"}

At the very least, this should be documented given I found it surprising if it's not considered a bug. Thanks.

@hraberg hraberg added bug Something isn't working ingest labels Nov 4, 2019
@hraberg hraberg added this to To do in XTDB Development via automation Nov 4, 2019
@hraberg hraberg added this to the Beta milestone Nov 4, 2019
danmason added a commit that referenced this issue Nov 7, 2019
@danmason danmason self-assigned this Nov 7, 2019
@danmason danmason moved this from To do to In progress in XTDB Development Nov 7, 2019
@jarohen jarohen moved this from In progress to Selected in XTDB Development Nov 13, 2019
@jarohen jarohen removed this from the Beta milestone Nov 15, 2019
@t-taylor t-taylor moved this from Selected to In progress in XTDB Development Dec 2, 2019
@t-taylor t-taylor self-assigned this Dec 2, 2019
@t-taylor t-taylor moved this from In progress to Awaiting Merge in XTDB Development Dec 2, 2019
@t-taylor t-taylor moved this from Awaiting Merge to In progress in XTDB Development Dec 3, 2019
@t-taylor t-taylor moved this from In progress to Awaiting Merge in XTDB Development Dec 6, 2019
@jarohen jarohen moved this from Awaiting Merge to Selected in XTDB Development Dec 12, 2019
@jarohen jarohen changed the title map keys must always be in the same order The SHA1 hash of maps should not depend on the key ordering Dec 13, 2019
@jarohen jarohen moved this from Selected to In progress in XTDB Development Dec 13, 2019
@jarohen jarohen self-assigned this Dec 13, 2019
jarohen added a commit to jarohen/xtdb that referenced this issue Dec 13, 2019
jarohen added a commit to jarohen/xtdb that referenced this issue Dec 13, 2019
jarohen added a commit to jarohen/xtdb that referenced this issue Dec 13, 2019
@jarohen jarohen moved this from In progress to Awaiting Merge in XTDB Development Dec 13, 2019
@jarohen jarohen moved this from Awaiting Merge to Awaiting release in XTDB Development Dec 18, 2019
@jarohen jarohen closed this as completed Dec 18, 2019
@refset
Copy link
Contributor

refset commented Dec 18, 2019

Thanks for reporting this @asciiansi - I hope you find the fix satisfactory. We will be releasing imminently.

@adam0z
Copy link

adam0z commented Apr 16, 2020

As mentioned in docs, maps are allowed as valid :crux.db/id. Do I understand it properly that there is no difference in performance using maps as _id?
I would like to create my own _id like:

#crux/id {:type :person :source_id 42 :key "age"}

instead of uuid, or string like "person_42_age"
and use it also as a foreign key in relations.

@jarohen
Copy link
Member

jarohen commented Apr 17, 2020

Yep, that's correct - we convert entity ids, attributes and values to content hashes early on and then use those content hashes throughout the query engine.

@adam0z
Copy link

adam0z commented Apr 18, 2020

Do you use SHA1 hashes and if so do you have any collision handle?
https://pthree.org/2014/03/06/the-reality-of-sha1/
Or is there an assumption that collisions would not become likely?

@jarohen
Copy link
Member

jarohen commented Apr 20, 2020

Yes - we currently use SHA1 without a collision handle, with the assumption that collisions are sufficiently unlikely.

jarohen added a commit to jarohen/xtdb that referenced this issue Jul 29, 2020
Since xtdb#362 we sort the collections before we freeze them, to get a deterministic hash
Unfortunately this doesn't work if the collections aren't `Comparable`
So, if we can't sort the collections, we sort by the hashes of the elements, which is deterministic.
Much as we'd like to apply this across the board, we also take care not to change the hash of any data structure that was correctly hashed previously.
jarohen added a commit to jarohen/xtdb that referenced this issue Jul 29, 2020
Since xtdb#362 we sort the collections before we freeze them, to get a deterministic hash
Unfortunately this doesn't work if the collections aren't `Comparable`
So, if we can't sort the collections, we sort by the hashes of the elements, which is deterministic.
Much as we'd like to apply this across the board, we also take care not to change the hash of any data structure that was correctly hashed previously.
jarohen added a commit that referenced this issue Aug 6, 2020
Since #362 we sort the collections before we freeze them, to get a deterministic hash
Unfortunately this doesn't work if the collections aren't `Comparable`
So, if we can't sort the collections, we sort by the hashes of the elements, which is deterministic.
Much as we'd like to apply this across the board, we also take care not to change the hash of any data structure that was correctly hashed previously.
@jarohen jarohen added the 1.x label Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.x bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

7 participants