Hashing, Equality, and UUIDs in OPS 2.0 #1080

Open · dwhswenson opened this issue Oct 6, 2021 · 2 comments
Labels: 2.0 (issues and PRs for the 2.0 release), discussion

@dwhswenson (Member) wrote:
One of the major changes in 2.0 should be in the way objects in OPS are hashed. Although this is the sort of thing that seems like a subtle point that only programmers care about, we've seen situations where it causes headaches for real users.

NOTE: This supersedes issue #701. See some initial thoughts there; we didn't make changes in 1.x for backward compatibility reasons, but this is a significant issue in real use that we should address in 2.0.

Python background

Equality and hashing

Just as a reminder, we have 3 types of "equality" that we can use in Python:

  1. is-equality: Same object in memory, obj1 is obj2. No way to override.
  2. hash-equality: hash(obj1) == hash(obj2), defined by __hash__, and used for performance in dicts, etc. The hash must be immutable over the object's lifetime.
  3. ==-equality: obj1 == obj2, defined by __eq__

These are in decreasing order of strictness, i.e.

(obj1 is obj2) ==> (obj1 == obj2) ==> (hash(obj1) == hash(obj2))

where ==> is "implies" and hash is only relevant if the objects are hashable. Two objects that compare equal must have the same hash. (Obviously, a specific object in memory must have the same hash as itself and must be equal to itself.) However, two objects can compare equal or hash to the same value without being the same object in memory. More subtly, two objects can hash to the same value without comparing equal.
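As a minimal illustration (the class and field names here are purely for the example), a frozen dataclass gives value-based == and a matching __hash__:

from dataclasses import dataclass

@dataclass(frozen=True)
class Interface:
    cv_name: str
    lambda_max: float

a = Interface("phi", 0.5)
b = Interface("phi", 0.5)

print(a is b)              # False: two distinct objects in memory
print(a == b)              # True: field-by-field equality
print(hash(a) == hash(b))  # True: objects that compare equal must hash equal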

How hashes are used

Hashes are used to speed up dict/set lookup in Python, but after a candidate is found, a further check with == is also done. This means that two objects with the same hash can simultaneously be keys in a dict if and only if they are not ==.

The __hash__ method should return an integer, usually accomplished by wrapping other hashables in hash. Importantly, the behavior of hash is an implementation detail, and may not be preserved across different Python implementations or versions of Python.
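To see the hash-then-== lookup in action, here is a small sketch (the class is purely illustrative) where every instance deliberately returns the same hash:

class Key:
    def __init__(self, label):
        self.label = label
    def __hash__(self):
        return 42  # deliberately give every instance the same hash
    def __eq__(self, other):
        return isinstance(other, Key) and self.label == other.label

d = {Key("a"): 1, Key("b"): 2}
print(len(d))        # 2: same hash, but not ==, so both remain separate keys
print(d[Key("a")])   # 1: after the hash match, lookup falls back to ==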

Further reading: this gist of mine and this article.

Currently existing problems

We violate the Python contract on hashes and equality

In OPS, we sometimes override the __eq__ operator while continuing to use the UUID as the hash. This means that we have objects that compare equal but give different hashes. This becomes a problem if you have objects that are == and that you want to put into a hashed container such as a set or a dict.
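A toy sketch of the failure mode (this is not the real OPS class, just a stand-in that hashes by a per-instance UUID while comparing by value):

import uuid

class ToyVolume:
    def __init__(self, lambda_min, lambda_max):
        self.__uuid__ = uuid.uuid4().int   # unique per instance
        self.lambda_min = lambda_min
        self.lambda_max = lambda_max
    def __hash__(self):
        return hash(self.__uuid__)         # hash by UUID...
    def __eq__(self, other):               # ...but compare by value
        return (isinstance(other, ToyVolume)
                and (self.lambda_min, self.lambda_max)
                    == (other.lambda_min, other.lambda_max))

v1 = ToyVolume(0.0, 0.1)
v2 = ToyVolume(0.0, 0.1)
print(v1 == v2)              # True
print(hash(v1) == hash(v2))  # False: breaks the hash/eq contract
print(len({v1, v2}))         # 2: "equal" objects both end up in the set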

Hashing by UUID causes pain for analysis

The fact that many objects hash based on UUID has been a headache for users who create equivalent objects in multiple scripts, and then try to combine them. For example, consider a case where a user runs TIS, but discovers later that they need to sample an additional interface. Currently, it is extremely difficult to analyze those results, because we make use of the hashes/UUIDs of ensembles for analysis, and equivalent ensembles created at different times have different UUIDs.

So right now the workflow is something like this (@gyorgy-hantal had to deal with this, and might be able to give details):

  1. Sample extra interface.
  2. Create new network, probably by hacking the serialization dicts until from_dict allows you to create a network that looks like it was created the way you wanted.
  3. I think there's also other hackery required to remap equivalent ensembles to some single UUID.
  4. Run analysis with the network that took you hours to assemble.

The workflow should be:

  1. Sample extra interface.
  2. Create network that includes ensembles equivalent to all your sampled ensembles (even with totally different UUIDs).
  3. Run analysis using that total network just as you normally would.

Cases to consider

Below are a few situations I'm thinking about while proposing ways to fix this. I'd appreciate any additional cases/examples that others can think of!

When are OPS objects used as keys?

Most of the situations I can think of where OPS objects are used as keys are situations related to facilitating analysis. A few examples:

  • Mapping the tuple (initial_state, final_state) to a specific transition within a network
  • Mapping ensemble to analysis results when analyzing TIS
  • Mapping tuple (state, interface_volume) when calculating flux

There may also be examples where we use a set to ensure unique copies of various objects.

We need to ensure that anything that uses OPS objects as keys will still be able to do so -- that the objects will be hashable, and that they won't have spurious equality collisions.

When are UUIDs used as keys?

  • UUIDs are used as keys extensively in SimStore. If two objects have the same UUID, SimStore will only store the first one it sees.

We need to ensure that UUIDs (or something like them) are available for storage purposes. It might be possible to replace UUIDs with some sort of hash, but I don't currently see a way to do that via the standard Python __hash__.
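A hypothetical sketch of the UUID-keyed deduplication behavior (not SimStore's actual API, and assuming objects carry a __uuid__ attribute):

class ToyUUIDStore:
    def __init__(self):
        self._by_uuid = {}
    def save(self, obj):
        # first-seen wins: a later object with the same UUID is silently skipped
        self._by_uuid.setdefault(obj.__uuid__, obj)
    def load(self, uuid_):
        return self._by_uuid[uuid_]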

Proposals

In the following, when I refer to "immutable attributes", the actual implementation would be based on whatever is returned by to_dict (plus the class name). These are set on initialization; if that is not the case, the class may need a custom __hash__ method based on its immutable attributes.
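As a hedged sketch of that default (the helper and mixin names are made up; it assumes to_dict returns hashable values and is stable after __init__):

def _identity_key(obj):
    # class name plus the (sorted) contents of to_dict
    return (obj.__class__.__name__,
            tuple(sorted(obj.to_dict().items())))

class HashByContentsMixin:
    def __hash__(self):
        # based on attributes fixed at initialization; cached so the
        # hash cannot drift over the object's lifetime
        if not hasattr(self, '_cached_hash'):
            self._cached_hash = hash(_identity_key(self))
        return self._cached_hash
    def __eq__(self, other):
        return (isinstance(other, self.__class__)
                and _identity_key(self) == _identity_key(other))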

Proposal 1: Equality and hash based on UUID

This would break some of our current equality for, e.g., ensembles and volumes, but would at least bring us in line with Python expectations.
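A minimal sketch of what that would mean, assuming every object already carries a __uuid__ attribute:

class UUIDIdentityMixin:
    def __hash__(self):
        return hash(self.__uuid__)
    def __eq__(self, other):
        return self.__uuid__ == getattr(other, '__uuid__', None)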

Advantages:

  • Fixes breaking of Python contract
  • Easy to implement

Disadvantages:

  • Doesn't solve problems with using UUIDs for analysis
  • Breaks some existing improvements on equality definitions

Proposal 2: Treat snapshots, non-named objects, and named objects differently

  • Snapshots: hash and equality by UUID
  • Non-named objects: hash and equality from the immutable attributes from to_dict
  • Named objects: hash the immutable attributes from to_dict, equality by hash + the .name attribute.

The idea here is that UUIDs may be better at remaining unique than hashes (more bits), so snapshots (of which there are a lot) use the UUID. We already have a lot of functionality around snapshot UUIDs and mapping time-reversed snapshots back and forth.

For some objects, we could also implement custom hashing/equality. For example, in volume/ensemble combinations, we should have A & B == B & A (and should have the same hash), even though A and B have different keys in to_dict.
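For instance, a symmetric combination could hash a frozenset of its operands so that A & B and B & A collide on purpose (a sketch, not the real OPS class, and assuming the operands themselves hash by contents):

class ToyIntersection:
    def __init__(self, vol_a, vol_b):
        self.vol_a = vol_a
        self.vol_b = vol_b
    def __hash__(self):
        # frozenset ignores operand order, so A & B hashes like B & A
        return hash(('ToyIntersection',
                     frozenset((hash(self.vol_a), hash(self.vol_b)))))
    def __eq__(self, other):
        return (isinstance(other, ToyIntersection)
                and {self.vol_a, self.vol_b} == {other.vol_a, other.vol_b})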

Advantages:

  • Fixes breaking of Python contract
  • No longer use UUID as hash, so should fix analysis problems

Disadvantages:

  • If you have several equivalent unnamed objects, only one will show up in a dict/set.

Proposal 3: Completely replace UUIDs with unique hashes

I didn't work this one all the way through, because I think it will be hard to implement. But it might have the advantage of reducing the number of objects stored, because equal objects would share these unique hashes. However, there are challenges with it, in particular because we name objects after creation, and a hash should not change after object creation.


I'm leaning toward proposal 2, which I'd probably do through a series of several PRs aimed at the 2.0 branch. However, I'm definitely open to other proposals, and if anyone sees significant potential issues (especially @sroet or @hejung, who have probably dealt with needing to dance around these problems) I'd definitely appreciate the heads up!

@sroet (Member) commented Oct 7, 2021:

I'm leaning toward proposal 2, which I'd probably do through a series of several PRs aimed at the 2.0 branch.

Yeah, I am in favor of this as well; it seems like the cleanest solution.

If you have several equivalent unnamed objects, only one will show up in a dict/set.

I can't think of an unnamed object where that would be an issue. Is there a reason why we don't just make everything named, but allow empty strings for the currently unnamed objects?

if anyone sees significant potential issues (especially @sroet or @hejung, who have probably dealt with needing to dance around these problems) I'd definitely appreciate the heads up!

I don't see any potential issues so far. We sometimes use a spurious is to check for equality, like in #919, so we should probably be on the lookout for those. One thing that I found surprising is that hashes are fixed-width integers (as opposed to the default Python integer, which silently grows as needed) that will wrap around. However, the wrapping also does something wonky that I am not sure how to explain:

>>> class a():
...     _i = 60
...     def __hash__(self):
...         return 2**self._i
... 
>>> b = a()
>>> hash(b)
1152921504606846976
>>> hash(2**b._i) # all identical as expected
1152921504606846976
>>> b._i = 61
>>> hash(b)
2305843009213693952
>>> hash(2**b._i) # hashing integers has less width? ok, weird but possible
1
>>> b._i = 62
>>> hash(b)
4611686018427387904
>>> hash(2**b._i)
2
>>> b._i = 63
>>> hash(b) # Dafuck, where are 1 and 2?
4
>>> hash(2**b._i) 
4

@dwhswenson (Member, Author) replied:
I can't think of an unnamed object where that would be an issue.

Relevant example I can think of:

C_7eq = (
    CVDefinedVolume(phi, -np.pi, 0)  # Unnamed object A
    & CVDefinedVolume(psi, 100 * np.pi / 180, 200 * np.pi / 180)
).named("C_7eq")
alpha_R = (
    CVDefinedVolume(phi, -np.pi, 0)  # Unnamed object B
    & CVDefinedVolume(psi, -100 * np.pi / 180, 0)
).named("alpha_R")

Under this proposal, "Unnamed object A" and "Unnamed object B," which are both CVDefinedVolume(phi, -np.pi, 0), would hash the same and show up equal. But they're not the same object in memory, and they have different UUIDs. This could lead to unexpected behavior when trying to put them into a dict (I noticed a related issue when playing with GUI ideas for an OPS storage browser -- I was using a dict to map string representations (or names when available) to objects in storage as lines in a list widget).

However, in practice, those volumes will behave identically. They just aren't is-identical. The main issue is that we need to be careful not to lose them during storage (another reason to keep UUIDs for now, but to not treat UUIDs as part of the "immutable" identity definition).

Is there a reason why we don't just make everything named, but allow empty strings for the currently unnamed objects?

This touches on a point of class naming that I've occasionally argued about: the names StorableObject and StorableNamedObject are misleading. In my opinion, StorableObjects are better referred to as "data objects," while StorableNamedObjects are what I refer to as "simulation objects." In SimStore, they follow completely different serialization paths (data objects are optimized for the fact that you have a lot of them, but each object tends to have the same structure; simulation objects are optimized for the fact that you have fewer of them, but the structure varies widely [so we use JSON]). The fact that you can't name a data object (except by sticking it in a tag, which uses the slower simulation-object serialization method) is secondary. But basically there's no reason to carry (and store) a non-fixed-width name attribute on every snapshot.

StorableNamedObjects do default to the empty string for the name attribute, although I think a property overrides it with [ClassName]. I'd need to dive into the code to remember exactly how all of that works. Importantly, we need to be able to discourage renaming -- the name is a set-once-and-set-before-storage attribute. It cannot be changed once stored (at least, not without significant changes to our storage infrastructure), and we do have code in place to discourage renaming.

The only difference I propose between the hash/equality checks for data objects (which are all immutable -- although we might want to enforce that more strictly) and for simulation objects is that we add an equality check on the name of the simulation object (and any other distinguishing attributes that aren't set on creation).

We sometimes use a spurious is to check for equality like in #919, so we should probably be on the lookout for those.

Agreed. These tend to be in very old corners of the code. That particular one fixed an error made in May 2015 (git blame doesn't put it on me -- but that means that not catching it in review was on me). I'm pretty sure I could have made is/== mistakes up through about the beginning of 2016. Code from around 2016 on is more mature.

One thing that I found surprising is that hashes are fixed width integers (as opposed to the default integer of python that will silently grow to a long when needed) that will loop around.

This part is easy to explain. The purpose of the hash is fast lookups -- so you can easily see how a fixed number of bits is useful for the underlying C code.

As for the hash math, it's a matter of the implementation. See this comment in the CPython source for details, but it's basically this:

# hash_bits is _PyHASH_BITS, defined in pyhash.h
def int_hash(x, hash_bits=61):
    return x % (2**hash_bits - 1)

for exp in [60, 61, 62, 63]:
    print(exp, hash(2**exp), int_hash(2**exp))

# 60 1152921504606846976 1152921504606846976
# 61 1 1
# 62 2 2
# 63 4 4

I think you were expecting it to be x % 2**hash_bits. Note that the hash implementation is a CPython detail, which was another reason not to replace the UUID with __hash__.
