-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Proposal] Improve the hash module #2250
Conversation
proposals/imporved-hash-module.md
Outdated
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int): | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment saying that when Mojo supports it, we could have a default implementation for this?
Something like:
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int): | |
... | |
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int): | |
for i in range(length): | |
self.update(pointer[i]) |
Then a user can override the method to implement something more efficient if they wish to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above, I think it is a detail not really important for this proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was just a suggestion, fair enough
Nice proposal 👍🏻 |
proposals/imporved-hash-module.md
Outdated
|
||
fn __init__(inout self): | ||
self.hash = 42 | ||
fn update[T: DType](inout self, value: SIMD[T, 1]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I remember, you have an example implementation of this trait in one of your repos right? It would help a lot readers if a link was provided. A POC is always welcome :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation of the Hasher is a detail we should not focus on in this proposal. Given Implementation actually works and returns 42 for anything you will pass to it. If this proposal will be accepted I guess the default hasher will be based on DJBX33A hash algorithm already implemented in standard library. After this proposal is implemented I intend to create another proposal where I will argue about more suitable hash algorithms, which should be part of standard library and maybe replacing DJBX33A as the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the focus I agree, it's just helpful for readers like me who are not very familiar with hashing algorithms. Thanks for the proposal though. I think it's really good :)
proposals/imporved-hash-module.md
Outdated
# How to combine a hash of hashes ??? | ||
``` | ||
As you can see above we, computed hashes for all of the struct fields, but we are uncertain how to combine those values in a way which produces a good (non compromised) hash value. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe to be fair, we should add an example of what the computing of the hash would look like if mojo evolves without this proposal. By following the docs it would look like this
@value
struct Person(Hashable):
var name: String
var age: UInt8
var friends_names: List[String]
fn __hash__(self) -> Int:
var final_tuple = (self.name, self.age) + tuple(self.friends)
return hash(final_tuple)
And this has a lot of flaw(lots of copying, possible memory allocation, unpredictable types possible for the tuple, use of the object
struct, a struct must know how to hash its members, etc...) but at least the proposal gives a fair chance to the "python's way" of doing the hashing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned the "python's way" in the proposal now.
proposals/improved-hash-module.md
Outdated
# self.name.hash_with(hasher), when String is Hashable, otherwise: | ||
hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name)) | ||
# self.age.hash_with(hasher), when SIMD is hashable, otherwise | ||
hasher.update(self.age) | ||
# self.friends_names.hash_with(hasher), when List of Hashable types is Hashable, otherwise: | ||
for friend in self.friends_names: | ||
hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose an alternative here, as it allows the implementer of "hash_with" to use only one type of call instread of two, thus simplifying the API.
If we modify the Hasher
trait to include a default implementation:
trait Hasher:
fn update(inout self, value: Hashable):
# this is a default implementation and should not be changed
value.hash_with[Self](self)
then, if the struct attribute has a hash_with
method or not, the same method is called and users don't have to remember two different ways of calling the hasher:
fn hash_with[H: Hasher](self, inout hasher: H):
# hasher.update(self.name), when String is Hashable, otherwise:
hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name))
hasher.update(self.age)
# hasher.update(self.friends_names), when List of Hashable types is Hashable, otherwise:
for friend in self.friends_names:
hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[]))
Note that this isn't possible with the current compiler for two reasons:
- The compiler doesn't like cyclical dependencies in traits
- The compiler doesn't allow default implementations in traits
Those are two issues which I hope will be resolved in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the API and provide a working example in https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa
Let me know if it resolves your concerns.
proposals/improved-hash-module.md
Outdated
var age: UInt8 | ||
var friends_names: List[String] | ||
|
||
fn hash_with[H: Hasher](self, inout hasher: H): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python and Mojo both use dunder __something__
to signify that methods have a special meaning in the stdlib and the language. It avoids naming conflicts with user methods in general. I don't have a strong opinion about this, I'm just pointing this out.
This method in this proposal shouldn't be a second class citizen compared to __hash__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I created a small POC, where I changed the method names. https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa
If it is ok with everyone I will change the snippets in the proposal and reference the gist at the end.
Thanks so much for going through the proposal process! We as a team really appreciate the thought and motivation presented. It makes things easier when discussing the design of the proposal. 🎉 🎉 🎉 Decision
Some open questions to consider
|
I created a small POC to address all the outstanding questions:
I suggest to parametrize the return type of trait Hasher[hash_value_type: DType]:
fn __init__(inout self):
...
fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
...
fn finish(owned self) -> Scalar[hash_value_type]:
... I did not do it at this point in time mainly because the compiler does not allow default parameters on traits. (
I prefer it too now. I though of factory approach, because some Hashers might have complex initialization needs, e.g. often they need a random seed and a secret. But as you can see here I was able to come up with a strategy where users can influence the hasher behavior with compile time parameters and environment variables.
Yes, as you can see here it is trivial to implement the hash method when standard library implements hash methods for the basic types. As I mentioned here it would be very easy to synthesize the method as well. The only concern I have right now is, what happens with references and do we want to provide a solution for circular references. We could say that structs with references do a shallow hash, where they only hash the reference address. Or we say we allow deep hash, where we would need to be able to check if the referenced value was already visited by the Hasher. This would be possible if we add
To be honest I did not consider it till now. After examining the method docs I think it's a bit redundant. The collection implementer can add the length of the collection as an Int to the hasher, in order to solve the issue of |
4c46b8f
to
7c3c0a4
Compare
I had a call with @gabrieldemarmiesse yesterday where we discussed the proposal in depth. As a result I commit some improvements to the proposal today. |
Looks good! We can improve on the API and Python compatibility when the compiler has more flexibility :) |
fa9b222
to
119320b
Compare
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
119320b
to
1447717
Compare
Thanks for updating this — I'll take a look at the updates early this week. Very exciting to see this moving along!! 🚀 |
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions, click here to learn more. |
Landed in today's nightly: #2615. Thanks for the well thought out proposal! Happy to see this moving along. |
[External] [Proposal] Improve the hash module This proposal is based on discussion started in modularml#1744 Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com> Closes modularml#2250 MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941 Signed-off-by: Lukas Hermann <lukashermann28@gmail.com>
This proposal is based on discussion started in #1744