-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hashing, Equality, and UUIDs in OPS 2.0 #1080
Comments
Yeah I am in favor of this as well, seems like the cleanest solution.
I can't think of an unnamed object where that would be an issue. Is there a reason why we don't just make everything named, but allow empty strings for the currently unnamed objects?
I don't see any potential issues so far. We sometimes use a spurious >>> class a():
... _i = 60
... def __hash__(self):
... return 2**self._i
...
>>> b = a()
>>> hash(b)
1152921504606846976
>>> hash(2**b._i) # all identical as expected
1152921504606846976
>>> b._i = 61
>>> hash(b)
2305843009213693952
>>> hash(2**b._i) # hashing integers has less width? ok, weird but possible
1
>>> b._i = 62
>>> hash(b)
4611686018427387904
>>> hash(2**b._i)
2
>>> b._i = 63
>>> hash(b) # Dafuck, where are 1 and 2?
4
>>> hash(2**b._i)
4 |
Relevant example I can think of: C_7eq = (
CVDefinedVolume(phi, -np.pi, 0) # Unnamed object A
& CVDefinedVolume(psi, 100 * np.pi / 180, 200 * np.pi / 180)
).named("C_7eq")
alpha_R = (
CVDefinedVolume(phi, -np.pi, 0) # Unnamed object B
& CVDefinedVolume(psi, -100 * np.pi / 180, 0)
).named("alpha_R") Under this proposal, "Unnamed object A" and "Unnamed object B," which are both However, in practice, those volumes will behave identically. They just aren't
This is a point of class naming that I occasionally argued about.
The only difference I propose between hash/equality checks for data objects (which are all immutable -- although we might want to enforce that more strictly) and for simulation objects is that we add an equality check on the name of the simulation object (and any other distinguishing attributes that aren't set on creation.)
Agreed. These tend to be in very old corners of the code. That particular one fixed an error made in May 2015 (git blame doesn't put it on me -- but that means that not catching it in review was on me). I'm pretty sure I could have made
This part is easy to explain. The purpose of the hash is fast lookups -- so you can easily see how fixed number of bits is useful for the underlying C code. As for the hash math, it's a matter of the implementation. See this comment in CPython source for details, but it's basically this: # hash_bits is _PyHASH_BITS, defined in pyhash.h
def int_hash(x, hash_bits=61):
return x % (2**hash_bits - 1)
for exp in [60, 61, 62, 63]:
print(exp, hash(2**exp), int_hash(2**exp))
# 60 1152921504606846976 1152921504606846976
# 61 1 1
# 62 2 2
# 63 4 4 I think you were expecting it to be |
One of the major changes in 2.0 should be in the way objects in OPS are hashed. Although this is the sort of thing that seems like a subtle point that only programmers care about, we've seen situations where it causes headaches for real users.
NOTE: This supercedes issue #701. See some initial thoughts there; we didn't make changes in 1.x for backward compatibility, but this is a significant issue in real use that we should address in 2.0.
Python background
Equality and hashing
Just as a reminder, we have 3 types of "equality" that we can use in Python:
is
-equality: Same object in memory,obj1 is obj2
. No way to override.hash
-equality:hash(obj1) == hash(obj2)
, defined by__hash__
, and used in for performance in dicts, etc. Must be immutable over object lifetime.==
-equality:obj1 == obj2
, defined by__eq__
These are in decreasing order of strictness, i.e.
where
==>
is "implies" and hash is only relevant is the objects are hashable. Two objects that compare equal must have the same hash. (Obviously, a specific object in memory must have the same hash as itself and must be equal to itself.) However, two objects can compare equal or hash to the same value without being the same object in memory. More subtle, two objects can hash to the same value without comparing equal.How hashes are used
Hashes are used to speed up
dict
/set
lookup in Python, but after a candidate is found, a further check with==
is also done. This means that two objects with the samehash
can simultaneously be keys in adict
if and only if they are not==
.The
__hash__
method should return an integer, usually accomplished by wrapping other hashables inhash
. Importantly, the implementation ofhash
is an implementation detail, and may not be preserved across different Python implementations or versions of Python.Further reading: this gist of mine and this article.
Currently existing problems
We violate the Python contract on hashes and equality
In OPS, we sometimes override the
__eq__
operator while continuing to use the UUID as thehash
. This means that we have objects that compare equal but give different hashes. This becomes a problem if you have objects that are==
and that you want to put into a hashed container such as aset
or adict
.Hashing by UUID causes pain for analysis
The fact that many objects hash based on UUID has been a headache for users who create equivalent objects in multiple scripts, and then try to combine them. For example, consider a case where a user runs TIS, but discovers later that they need to sample an additional interface. Currently, it is extremely difficult to analyze those results, because we make use of the hashes/UUIDs of ensembles for analysis, and equivalent ensembles created at different times have different UUIDs.
So right now the workflow is something like this (@gyorgy-hantal had to deal with this, and might be able to give details):
from_dict
allows you to create a network that looks like it was created the way you wanted.The workflow should be:
Cases to consider
Below are a few situations I'm thinking about while proposing ways to fix this. I'd appreciate any additional cases/examples that others can think of!
When are OPS objects used as keys?
Most of the situations I can think of where OPS objects are used as keys are situations related to facilitating analysis. A few examples:
(intitial_state, final_state)
to a specific transition within a networkensemble
to analysis results when analyzing TIS(state, interface_volume)
when calculating fluxThere may also be examples where we use a
set
to ensure unique copies of various objects.We need to ensure that anything that uses OPS objects as keys will still be able to -- that the objects will be hashable, and that they won't have spurious equality collisions.
When are UUIDs used as keys?
We need to ensure that UUIDs (or something like them) are available for storage purposes. It might be possible to replace UUIDs with some sort of hash, but I don't currently see a way to do that via the standard Python
__hash__
.Proposals
In the following, when I refer to "immutable attributes", the actual implementation would be based on whatever is returned by
to_dict
(plus class name). These are set on initialization; if that is not the case, the class may need a custom__hash__
method to based on its immutable attributes.Proposal 1: Equality and hash based on UUID
This would break some of our current equality for, e.g., ensembles and volumes, but would at least bring us in line with with Python expectations.
Advantages:
Disadvantages:
Proposal 2: Treat snapshots, non-named objects, and named objects differently
to_dict
to_dict
, equality by hash +.name
attribute.The idea here is that UUIDs may be better at remaining unique than hashes (more bits), so snapshots (which there are a lot of) use UUID. We already have a lot of functionality around snapshot UUIDs and mapping time reversed snapshots back and forth.
For some objects, we could also implement custom hashing/equality. For example, in volume/ensemble combinations, we should have
A & B == B & A
(and should have the same hash), even thoughA
andB
have different keys into_dict
.Advantages:
hash
, so should fix analysis problemsDisadvantages:
dict
/set
.Proposal 3: Completely replace UUIDs with unique hashes
I didn't work this one all the way through, because I think it will be hard to implement. But it might have the advantage that it means that we reduce the number of objects stored, because equal objects would have these unique hashes. However, there are challenges with it, in particular because we name objects after creation, and hash should not change after object creation.
I'm leaning toward proposal 2, which I'd probably do through a series of several PRs aimed at the 2.0 branch. However, I'm definitely open to other proposals, and if anyone sees significant potential issues (especially @sroet or @hejung, who have probably dealt with needing to dance around these problems) I'd definitely appreciate the heads up!
The text was updated successfully, but these errors were encountered: