Design of HWAddress equality and hashing #3
That's quite a train of thoughts you have there. I'll try my best to respond to all of them, so if something is missing or needs clarification, please let me know.
Please ignore me accidentally closing this issue
I believe that comparison should only be possible for concrete base classes and their subclasses, but not for abstract base classes, if there is one. To conclude: comparisons should only be implemented starting from the concrete class rather than the abstract base class, to prevent leaking logic into other concrete classes. That is where I draw the line. Hashing is a different story, though. As far as I'm aware, the main purpose of hashing is to differentiate objects when they are used in data structures such as `dict` and `set`.
Oh I see, sorry, I didn't make this clear enough: in Python, it is not allowed for two objects to both compare equal and have different hashes.

From the Python documentation for `__hash__`: "The only required property is that objects which compare equal have the same hash value."

Basically, if `a == b` is true, then `hash(a) == hash(b)` must also be true. Otherwise you can get a bug that is effectively random and rare, where those two objects usually count as different objects, but sometimes count as the same object, in things like dictionaries or sets. (This makes sense if you consider that the normal way to implement things like dictionaries and sets is with a hash table. Hash tables internally have some number of "buckets", which are lists in which elements are stored, and the number of buckets changes dynamically depending on things like how many objects are stored in the table. The hash of an object controls which bucket an object gets placed in, and it is normal for many hash values to map to the same bucket. A simple version would be to do `hash(key) % number_of_buckets`.)
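To make the failure mode concrete, here is a minimal self-contained demonstration (the class name is made up purely for illustration) of equal objects with inconsistent hashes misbehaving as dictionary keys:

```python
class BrokenKey:
    """Compares equal to any other BrokenKey, but hashes by identity.

    This violates the rule that equal objects must have equal hashes.
    """
    def __eq__(self, other):
        return isinstance(other, BrokenKey)

    def __hash__(self):
        return id(self)  # WRONG: two equal BrokenKeys get different hashes


a = BrokenKey()
b = BrokenKey()
print(a == b)  # True - they compare equal...

d = {a: 'value'}
print(b in d)  # ...but b's hash selects a different bucket, so the
               # lookup misses even though b == a
```

In CPython the lookup via `b` fails even though `a == b`, because the dict compares stored hashes before ever calling `__eq__`.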
You may notice that hash tables could actually absorb the "hash must be consistent with equality" requirement themselves. A hash table implementation could simply treat the full hash as part of a key's identity, matching an item only when `hash(stored_key) == hash(key) and stored_key == key`. This would of course require recalculating or caching the hash value for every item in the bucket. That's probably one reason why it's not done: it adds overhead to the implementation, just to enable one more notion of "is this thing the same as that thing" in the language. (Although it seems possible for the caching to be worth doing anyway, for a speedup in cases where the hash table does a lot of rebalancing (changing the number of buckets), if the extra space overhead doesn't cause the whole thing to get big enough to lose more speed to cache misses.)

Of course, there is some design benefit to minimizing the number of different but similar "is this thing the same as that thing" concepts in a language. The fewer distinct concepts and details developers have to understand and remember just to use something, the less room there is for errors in code due to edge cases that don't match human expectations. Plus, we don't need set membership and dict keys to be based on something other than equality, because it's very easy to wrap equal objects to add a dimension of difference: for example, keying by the tuple `(type(x), x)` instead of `x`.
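Here is a small self-contained sketch of that wrapping trick. The class names are invented for illustration, not taken from the library:

```python
class Address:
    """Toy address: equal across subclasses when values match."""
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Address) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


class MAC(Address):
    pass


class EUI48(Address):
    pass


a = MAC(0x1234)
b = EUI48(0x1234)
assert a == b  # equal across subclasses

plain = {a: 'first'}
plain[b] = 'second'       # equal keys collapse into one entry
assert len(plain) == 1

# Wrapping in a tuple adds the type as an extra dimension of difference:
wrapped = {(type(a), a): 'first'}
wrapped[(type(b), b)] = 'second'   # distinct tuples: two entries
assert len(wrapped) == 2
```

The tuples hash and compare on both elements, so equal addresses of different classes become distinct keys without touching the address classes' own equality.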
Thank you for enlightening me more on how hashed collections work. Do you have a reference link that I can indulge myself in? Now that I know that two objects cannot be equal and produce different hashes at the same time, I'm leaning more toward the "different class, different hash" side. As in your original thought, there will be a problem with that case. Other than that, I don't see any further problem, assuming that you still use both the integer address value and the format structure of the address as hashing components. Also, I don't think there is a need to create additional subclasses for what the library already provides. Honestly, there shouldn't be. And if there is, that's insane!
Well, the implementation of `dict` in CPython is in C, if you ever want to read the real thing.
I'm not comfortable with C at all, so it will probably take a while for me to digest it and dip into the low-level stuff, but thanks for the tip. Anyway, I'm not well versed in networking, so it's your call how you want your library to be. But with the exploding number of networking devices due to IoT, I would prefer following the IEEE specification and, maybe, guiding users to a more updated standard?
Verdict (when+if I have time+motivation, I will post more of my reasoning for these choices):
I'm closing this issue because I like open/closed status to mean "does this need resolution?" and my mind is made up now to my satisfaction, but further comments or arguments are always welcome, even if an issue is closed.
Re: the EUI-48 vs "MAC" issue specifically: I created #9 for further discussion of that topic in its own right. As far as it concerns this issue, I saw the cases like this:
Options 1 and 2 are equivalent for practical purposes with regard to having this library provide a MAC vs EUI-48 distinction out-of-the-box. (Either way, if you want that distinction, you have to take a conscious subclassing/composition step to create an address class which does not compare equal and does not hash equal. In one of them you would be forced to explicitly cause inequality.) To a large extent I share @gytqby's suggestion/preference for guiding users towards the "EUI-48" name.

If not option 1, then option 2 seems by far more useful than option 3 to me. More on that later.
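For what it's worth, that "conscious subclassing step" could look something like this hedged sketch (all class names here are invented for illustration; this is not the library's code):

```python
class EUI48:
    """A simplified stand-in for a concrete hardware address class."""
    def __init__(self, value):
        self._value = int(value)

    def __eq__(self, other):
        if isinstance(other, EUI48):
            return self._value == other._value
        return NotImplemented

    def __hash__(self):
        return hash(self._value)


class StrictMAC(EUI48):
    """Deliberately never equal to EUI48 or its other subclasses."""
    def __eq__(self, other):
        # Returning a plain bool (not NotImplemented) keeps the
        # comparison symmetric: Python tries the subclass's __eq__
        # first in both `StrictMAC() == EUI48()` and the reverse.
        return type(other) is type(self) and self._value == other._value

    def __hash__(self):
        # Hash on the type too, so unequal objects tend to hash unequal.
        return hash((type(self), self._value))


assert EUI48(1) == EUI48(1)
assert StrictMAC(1) == StrictMAC(1)
assert StrictMAC(1) != EUI48(1)
assert EUI48(1) != StrictMAC(1)
```

Note that because `StrictMAC` is a subclass, Python gives its `__eq__` priority in mixed comparisons, which is what makes the opt-out symmetric in both directions.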
Re:
Yeah, I think in the typical case no one needs any other addresses. However, there is the occasional good use-case. For example, I looked around in the projects listed by GitHub's "used by" feature on this repo, and found a real-world usage of subclassing to support custom formats. (In that situation, the subclass is just being used for validation, but we could easily imagine a similar use-case where the constructed address class is used throughout the program to represent the address, and equality and so on are actually relied upon.)
@gytqby I made a simplified Python sketch of how a mapping type like `dict` works on the inside:

```python
class Map:
    def __init__(self):
        # At its core, a hash table is just a list of
        # "buckets" - each bucket is itself a small
        # list that holds a few of the items in the
        # hash table.
        self._bucket_list = []

    def __setitem__(self, key, value):
        number_of_buckets = len(self._bucket_list)
        # If there are no buckets yet, we need to add
        # one so that the new item has a place to go.
        if number_of_buckets < 1:
            self._bucket_list.append([])
            number_of_buckets = 1
        # So here's the key point about hash tables:
        # they try to evenly distribute items into
        # the buckets. This is where the speedup of
        # hash tables over simple lists comes from:
        # if we break up the list of all the items
        # into many little lists, and have some way
        # to pick the right little list for an item,
        # we never need to work with one big list.
        # And a hash function is one way to pick.
        bucket_index = hash(key) % number_of_buckets
        bucket = self._bucket_list[bucket_index]
        # This is why:
        #
        # 1. The hash function is supposed to produce
        #    evenly distributed numbers. If it did
        #    not, there would be a bias in which
        #    bucket things go into. For example, if
        #    the hash is always zero, then all keys
        #    will end up going in the first bucket.
        #
        # 2. The hash function needs to be immutable
        #    and deterministic. If the hash for the
        #    same key object changes, we could get
        #    into situations where a key is already
        #    in one bucket, but its hash function
        #    currently maps it to another bucket.
        #    So gets, sets, and deletes would have
        #    to search every bucket anyway, losing
        #    all the speedups relative to a list.
        #    This is also why mutable objects are
        #    either not hashable, or do not hash on
        #    their mutable parts.
        # (By the way, notice how Python's `hash` and
        # `__hash__` make it so that our hash table
        # code doesn't need to know how to hash each
        # possible type of key. Instead, each class
        # brings its own appropriate hashing.)
        # We store both the key and the value in the
        # bucket. The next step will explain why.
        item = (key, value)
        # In case other items have already been added to
        # the hash table, we need to check if the key is
        # already in the bucket, so that we can update
        # that item instead of adding a duplicate item:
        index = _find_key_in_bucket(bucket, key)
        if index == -1:
            # `key` not in bucket:
            bucket.append(item)
        else:
            bucket[index] = item
        # If we keep adding items, the buckets get
        # bigger, and we get closer to performance
        # of a simple list, because loops like the
        # one in `_find_key_in_bucket` have to
        # look at each item in the bucket.
        # The solution is to "rebalance" the hash table
        # whenever it gets too big: create more buckets
        # and evenly redistribute the items across the
        # new number of buckets - more on that later.
        self._rebalance_if_needed()

    def __getitem__(self, key):
        number_of_buckets = len(self._bucket_list)
        if number_of_buckets < 1:
            raise KeyError('key not in map')
        bucket_index = hash(key) % number_of_buckets
        bucket = self._bucket_list[bucket_index]
        index = _find_key_in_bucket(bucket, key)
        if index == -1:
            raise KeyError('key not in map')
        item = bucket[index]
        key, value = item
        return value

    def __delitem__(self, key):
        number_of_buckets = len(self._bucket_list)
        if number_of_buckets < 1:
            raise KeyError('key not in map')
        bucket_index = hash(key) % number_of_buckets
        bucket = self._bucket_list[bucket_index]
        index = _find_key_in_bucket(bucket, key)
        if index == -1:
            raise KeyError('key not in map')
        del bucket[index]
        # If we keep removing items, the buckets get
        # empty, and we end up wasting space, both
        # for the bucket list reference and for any
        # memory that the bucket has allocated which
        # it is still holding onto.
        # The solution is to "rebalance" the hash table
        # whenever it gets too small: reduce the number
        # of buckets and evenly redistribute the items.
        self._rebalance_if_needed()

    def items(self):
        # To get at all items we can just go into
        # each bucket and grab everything in it -
        # we don't need to bother with hashes.
        for bucket in self._bucket_list:
            for item in bucket:
                yield item
        # So for example, if `self._bucket_list`
        # is `[[foo, bar], [qux]]`, then `items`
        # yields `foo`, `bar`, `qux`.

    def keys(self):
        for key, _ in self.items():
            yield key

    def values(self):
        for _, value in self.items():
            yield value

    def __len__(self):
        return sum(map(len, self._bucket_list))

    def _rebalance_if_needed(self):
        # For this teaching example we'll target one
        # bucket for each item, and rebalance on any
        # deviation from that. But since rebalancing
        # is costly, in a mature real-world hash
        # table we would use some logic to minimize
        # the amount of rebalancings and to do them
        # when it brings the most benefit.
        desired_bucket_amount = len(self)
        current_bucket_amount = len(self._bucket_list)
        if desired_bucket_amount == current_bucket_amount:
            return
        new_bucket_list = [[] for _ in range(desired_bucket_amount)]
        for item in self.items():
            key, _ = item
            bucket_index = hash(key) % desired_bucket_amount
            bucket = new_bucket_list[bucket_index]
            bucket.append(item)
        self._bucket_list = new_bucket_list

    def __repr__(self):
        return '<Map ' + repr(self._bucket_list) + '>'


def _find_key_in_bucket(bucket, key):
    for index, preexisting_item in enumerate(bucket):
        preexisting_key, preexisting_value = preexisting_item
        if preexisting_key == key:
            return index
    return -1
```

You may find this more helpful as an introductory reference than the CPython source.
@gytqby I came across a nice real-world example of the same equality and hashing problem: https://bidict.readthedocs.io/en/main/learning-from-bidict.html#python-surprises
I finally wrote up the rationale for "the docstring's claim about subclasses being equal, which was what I originally intended but which is inconsistent with the behavior I actually had in every release so far, will be treated as mistaken" in #24 .

Now here's a short rationale for "equality will be changed so that any HWAddress instance, including all subclasses, can be equal to any other, if the address size and address value are equal": I think, per something like the subclass substitution principle, a subclass instance should compare equal unless deliberately made to not compare equal. It's not exactly violating Liskov substitutability to have subclasses compare not-equal by default (after all, even if you change the answer, you're still fulfilling the interface of subclasses being comparable with `==`). This point does depend on subclassing being supported:
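That chosen rule can be sketched roughly like this. This is a hedged, standalone sketch with assumed attribute names, not the library's actual implementation:

```python
class HWAddress:
    """Toy model of the rule: any two instances are equal iff their
    size and integer value match, regardless of subclass."""
    size = 48  # bits; subclasses may define a different size

    def __init__(self, value):
        self._value = int(value)

    def __eq__(self, other):
        if isinstance(other, HWAddress):
            return (self.size, self._value) == (other.size, other._value)
        return NotImplemented

    def __hash__(self):
        # Must be consistent with __eq__: hash the same (size, value) pair.
        return hash((self.size, self._value))


class MAC(HWAddress):
    pass


class EUI48(HWAddress):
    pass


assert MAC(0x0A0B0C0D0E0F) == EUI48(0x0A0B0C0D0E0F)
assert hash(MAC(1)) == hash(EUI48(1))
```

Because hashing is derived from exactly the fields equality compares, equal instances of different subclasses can safely coexist as one key in dicts and sets.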
I'm going to consider the design choices explained enough now, unless someone has further questions.
Actually, two more rationale pieces from my old draft notes:
I'm finding myself really drawn to a middle ground:
Unfortunately, there's no easy way in Python to implement this - not without either
If I could have a working version of "same class, or one is a subclass of the other" compatibility, I would do that (possibly while retaining the same size as an additional sanity-check constraint). But here's a somewhat not-awful compromise that gets us most of the way there:
(I am also starting to take for granted that subclassing-for-formatting will not be the main way to do custom formats in the future, and also that subclasses with different formats are perhaps best thought of as not interchangeable with other arbitrary subclasses.) I need to sleep on it, but this feels like the correct way to go now.
Oh, I forgot to reply to this by @duynhaaa
Yeah, I've always 100% agreed that this is the ideal, and conceptually I totally see the appeal. (Although... the original ergonomics bar was "hopefully we can think of something that Just Works automatically, because that's what's normally desirable", but now, since I'm leaning towards keeping our initial equality behavior by default, and have convinced myself that it's actually the right behavior most of the time, the bar is lower.)
Slightly annoying nit: if equality is based on type, but comparison/ordering is just based on size, then I think it's not, strictly speaking, a "total order", because there can be addresses such that neither `a < b`, nor `a == b`, nor `b < a` holds.
The flip side is that I could tighten ordering by adding a type-based tiebreaker to the ordering tuple.
Actually, it would sort fine enough - I don't know what I was thinking (though I do know that I was very mentally tired at the time I wrote that, enough that simulating the cases was getting seriously hard): if I just add the id of the type as the last value of the ordering tuple, then this will restore consistency of ordering with equality. The only sorting quirks would be when there are addresses of exactly the same length and numerical/binary value, all of different classes, and one of them has a different name than the others - which should come up approximately never. (And if anyone ever does have a need for better sorting of equal-but-for-type address classes, we can also add the class name to the sorting tuple, after the aligned address integer and size and before the class id, and that would eliminate most problems. This is also an extremely self-serviceable problem for the approximately zero users who will ever have this edge case.)
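Concretely, the tiebreaker idea could look like this hedged, standalone sketch (attribute and class names are assumptions for illustration, not the library's code):

```python
class HWAddress:
    """Toy model of ordering on (size, value) with a type tiebreaker."""
    size = 48  # bits

    def __init__(self, value):
        self._value = int(value)

    def _sort_key(self):
        # id(type(self)) is arbitrary but stable within one process,
        # which is enough to keep ordering consistent with a
        # type-sensitive equality.
        return (self.size, self._value, id(type(self)))

    def __lt__(self, other):
        if isinstance(other, HWAddress):
            return self._sort_key() < other._sort_key()
        return NotImplemented


class MAC(HWAddress):
    pass


class EUI48(HWAddress):
    pass


addrs = [EUI48(2), MAC(1), EUI48(1)]
addrs.sort()  # sorts by size, then value; ties broken by type id
assert [a._value for a in addrs] == [1, 1, 2]
```

Two equal-valued addresses of different classes get a deterministic (if arbitrary) relative order within a run, instead of comparing as neither less than nor greater than each other.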
I do still think that the ideal is for hardware address classes to compare and hash equal when they are subclasses of the one common concrete base class. So if I went down the road of making this real/enforced in the code, I would do it there. Of course, I imagine the size becoming "locked" within each concrete base class - a subclass would not be able to change the size it inherits. But in order to do that, I'd need to implement some mutating machinery in the base class.
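One possible shape for that size-locking, as a hedged standalone sketch (the class names and the `size` attribute are assumptions, not the library's API):

```python
class HWAddress:
    size = None  # None means "no size fixed yet" (abstract)

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # If any ancestor already fixed a size, a subclass
        # may not declare a different one.
        for base in cls.__mro__[1:]:
            locked = getattr(base, 'size', None)
            if (locked is not None
                    and 'size' in cls.__dict__
                    and cls.size != locked):
                raise TypeError(
                    cls.__name__ + ' may not change the size locked by '
                    + base.__name__
                )


class EUI48(HWAddress):
    size = 48  # first class to fix a size: allowed


class SubEUI48(EUI48):
    pass       # inherits size 48: allowed


try:
    class Bad(EUI48):
        size = 64  # tries to change a locked size: rejected
except TypeError:
    print('size change rejected')
```

`__init_subclass__` runs at class-definition time, so a size-changing subclass fails fast at import rather than producing inconsistent equality later.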
This issue is the continuation of the discussion that happened within pull request #2. Here's @mentalisttraceur's dump of thoughts: