[Proposal] Improve the hash module #2250

mzaks · 2024-04-09T15:09:03Z

This proposal is based on discussion started in #1744

gabrieldemarmiesse · 2024-04-10T07:16:06Z

proposals/imporved-hash-module.md

+    fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
+        ...


Can we add a comment saying that when Mojo supports it, we could have a default implementation for this?
Something like:

Suggested change

-    fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
-        ...
+    fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
+        for i in range(length):
+            self.update(pointer[i])

Then a user can override the method to implement something more efficient if they wish to do so.

See above, I think it is a detail not really important for this proposal.

It was just a suggestion, fair enough
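A minimal Python sketch of the suggestion above, not the proposal's Mojo code: a base class supplies a byte-by-byte default for the bulk `update`, and a subclass overrides it with something faster. The names `update_byte`, `SumHasher`, and `BulkSumHasher` are hypothetical, and the additive "hash" is a toy chosen only so the two paths are easy to compare.

```python
MASK64 = (1 << 64) - 1


class Hasher:
    """Hypothetical stand-in for the proposed Hasher trait."""

    def update_byte(self, byte):
        raise NotImplementedError

    def update(self, data):
        # Default implementation: feed the buffer one byte at a time.
        for byte in data:
            self.update_byte(byte)


class SumHasher(Hasher):
    """Toy hasher that relies on the default bulk update."""

    def __init__(self):
        self.state = 0

    def update_byte(self, byte):
        self.state = (self.state + byte) & MASK64


class BulkSumHasher(SumHasher):
    """Same toy hasher, overriding update with a bulk version."""

    def update(self, data):
        self.state = (self.state + sum(data)) & MASK64


slow, fast = SumHasher(), BulkSumHasher()
slow.update(b"mojo")
fast.update(b"mojo")
```

Both objects end in the same state, so an override only changes how fast the bytes are consumed, not the result.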

theopomies · 2024-04-10T09:10:09Z

Nice proposal 👍🏻
Please note there appears to be a typo in the filename: imporved-hash-module.md when it should actually be improved-hash-module.md.

gabrieldemarmiesse · 2024-04-10T07:49:24Z

proposals/imporved-hash-module.md

+
+    fn __init__(inout self):
+        self.hash = 42
+    fn update[T: DType](inout self, value: SIMD[T, 1]):


From what I remember, you have an example implementation of this trait in one of your repos, right? It would help readers a lot if a link were provided. A POC is always welcome :)

I think the implementation of the Hasher is a detail we should not focus on in this proposal. The given implementation actually works: it returns 42 for anything you pass to it. If this proposal is accepted, I guess the default hasher will be based on the DJBX33A hash algorithm already implemented in the standard library. After this proposal is implemented, I intend to create another proposal arguing for more suitable hash algorithms, which should be part of the standard library and might replace DJBX33A as the default.

It's not the focus, I agree; it's just helpful for readers like me who are not very familiar with hashing algorithms. Thanks for the proposal though. I think it's really good :)
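For readers unfamiliar with the two hashers mentioned in this thread, here is a rough Python sketch. `ConstantHasher` mirrors the placeholder from the diff that always returns 42. `Djbx33aHasher` is the textbook scalar form of DJBX33A ("times 33 with addition"); the initial value 5381 and the 64-bit wrap-around are assumptions of this sketch, and the standard library's actual implementation may differ in both.

```python
MASK64 = (1 << 64) - 1


class ConstantHasher:
    """Placeholder hasher: ignores its input and always yields 42."""

    def __init__(self):
        self.hash = 42

    def update(self, data):
        pass  # input is deliberately ignored

    def finish(self):
        return self.hash


class Djbx33aHasher:
    """Textbook DJBX33A: state = state * 33 + byte, wrapped to 64 bits."""

    def __init__(self):
        self.hash = 5381  # conventional starting value (assumed here)

    def update(self, data):
        for byte in data:
            self.hash = (self.hash * 33 + byte) & MASK64

    def finish(self):
        return self.hash


def hash_with(hasher_type, data):
    """Run the init / update / finish cycle for one buffer."""
    hasher = hasher_type()
    hasher.update(data)
    return hasher.finish()
```

The constant hasher is valid but useless in a dictionary, since every key lands in the same bucket; that is why it only serves as a demonstration of the trait's shape.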

gabrieldemarmiesse · 2024-04-10T09:58:50Z

proposals/imporved-hash-module.md

+        # How to combine a hash of hashes ???
+```
+As you can see above we, computed hashes for all of the struct fields, but we are uncertain how to combine those values in a way which produces a good (non compromised) hash value.
+


To be fair, I believe we should add an example of what computing the hash would look like if Mojo evolves without this proposal. Following the docs, it would look like this:

    @value
    struct Person(Hashable):
        var name: String
        var age: UInt8
        var friends_names: List[String]

        fn __hash__(self) -> Int:
            var final_tuple = (self.name, self.age) + tuple(self.friends_names)
            return hash(final_tuple)

This has a lot of flaws (lots of copying, possible memory allocation, unpredictable types in the tuple, use of the object struct, a struct having to know how to hash its members, etc.), but including it at least gives the "python's way" of doing the hashing a fair chance in the proposal.

I have now mentioned the "python's way" in the proposal.
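For comparison, this is roughly what the "python's way" looks like in Python itself. The `Person` class is a hypothetical analogue of the struct discussed above; it builds a temporary tuple of all members and delegates to the built-in `hash`, which is the copying and allocation the comment objects to.

```python
class Person:
    """Hypothetical Python analogue of the Person struct from the thread."""

    def __init__(self, name, age, friends_names):
        self.name = name
        self.age = age
        self.friends_names = list(friends_names)

    def __eq__(self, other):
        return (
            isinstance(other, Person)
            and self.name == other.name
            and self.age == other.age
            and self.friends_names == other.friends_names
        )

    def __hash__(self):
        # Allocates a new tuple holding every member, then hashes it.
        final_tuple = (self.name, self.age) + tuple(self.friends_names)
        return hash(final_tuple)


alice = Person("Alice", 30, ["Bob", "Carol"])
```

Note that the struct never touches a hasher directly: it can only combine values that already know how to produce a final hash.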

gabrieldemarmiesse · 2024-04-11T08:30:38Z

proposals/improved-hash-module.md

+        # self.name.hash_with(hasher), when String is Hashable, otherwise:
+        hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name))
+        # self.age.hash_with(hasher), when SIMD is hashable, otherwise
+        hasher.update(self.age)
+        # self.friends_names.hash_with(hasher), when List of Hashable types is Hashable, otherwise:
+        for friend in self.friends_names:
+            hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[]))


I propose an alternative here, as it allows the implementer of "hash_with" to use only one type of call instead of two, thus simplifying the API.
If we modify the Hasher trait to include a default implementation:

    trait Hasher:
        fn update(inout self, value: Hashable):
            # this is a default implementation and should not be changed
            value.hash_with[Self](self)

Then, whether or not the struct attribute has a hash_with method, the same method is called, and users don't have to remember two different ways of calling the hasher:

    fn hash_with[H: Hasher](self, inout hasher: H):
        # hasher.update(self.name), when String is Hashable, otherwise:
        hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name))
        hasher.update(self.age)
        # hasher.update(self.friends_names), when List of Hashable types is Hashable, otherwise:
        for friend in self.friends_names:
            hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[]))

Note that this isn't possible with the current compiler for two reasons:

- The compiler doesn't like cyclical dependencies in traits
- The compiler doesn't allow default implementations in traits

Those are two issues which I hope will be resolved in the future.

I simplified the API and provided a working example in https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa
Let me know if it resolves your concerns.
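The single-entry-point idea from this thread can be sketched in Python as follows. This is an illustration under assumptions, not the code from the gist: `write_bytes`, the 8-byte little-endian integer encoding, and the DJBX33A-style mixing are all choices made for this sketch. The point is only that `hash_with` calls `hasher.update(...)` for every member, whatever its type.

```python
MASK64 = (1 << 64) - 1


class Hasher:
    """Hypothetical hasher with one public way to feed values: update()."""

    def __init__(self):
        self._state = 5381

    def write_bytes(self, data):
        for byte in data:
            self._state = (self._state * 33 + byte) & MASK64

    def update(self, value):
        # Built-in types are encoded here; everything else hashes itself.
        if isinstance(value, str):
            self.write_bytes(value.encode("utf-8"))
        elif isinstance(value, int):
            self.write_bytes(value.to_bytes(8, "little", signed=True))
        elif isinstance(value, (list, tuple)):
            for item in value:
                self.update(item)
        else:
            value.hash_with(self)

    def finish(self):
        return self._state


class Person:
    def __init__(self, name, age, friends_names):
        self.name = name
        self.age = age
        self.friends_names = friends_names

    def hash_with(self, hasher):
        # One kind of call for every member.
        hasher.update(self.name)
        hasher.update(self.age)
        hasher.update(self.friends_names)


def hash_value(value):
    hasher = Hasher()
    hasher.update(value)
    return hasher.finish()
```

Because `update` also accepts any object with a `hash_with` method, nested user types compose without the outer type knowing how the inner one is hashed.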

gabrieldemarmiesse · 2024-04-11T08:33:32Z

proposals/improved-hash-module.md

+    var age: UInt8
+    var friends_names: List[String]
+
+    fn hash_with[H: Hasher](self, inout hasher: H):


Python and Mojo both use dunder __something__ to signify that methods have a special meaning in the stdlib and the language. It avoids naming conflicts with user methods in general. I don't have a strong opinion about this, I'm just pointing this out.
This method in this proposal shouldn't be a second class citizen compared to __hash__.

I agree. I created a small POC where I changed the method names: https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa

If it is ok with everyone I will change the snippets in the proposal and reference the gist at the end.

JoeLoser · 2024-04-18T15:55:49Z

Thanks so much for going through the proposal process! We as a team really appreciate the thought and motivation presented. It makes things easier when discussing the design of the proposal. 🎉 🎉 🎉

Decision

Accepted, pending responses to @gabrieldemarmiesse's comments and feedback on the PR

Some open questions to consider

Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.
We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.
When implementors (and users) are writing user-defined-types, it would be nice for them to not have to always worry about init, update, and finish to show up in their code: i.e. they only worry about hash or hash_with for example rather than the full API set for the trait requirements.
In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

mzaks · 2024-04-26T05:47:22Z

I created a small POC to address all the outstanding questions:
https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa

> Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.

I suggest parametrizing the return type of finish with a default. It might be sensible to put the hash value type parameter directly on the Hasher trait, which would allow Hasher to have different implementations for different DTypes, not just a bitcast at the end.

trait Hasher[hash_value_type: DType]:
    fn __init__(inout self):
        ...
    fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
        ...
    fn finish(owned self) -> Scalar[hash_value_type]:
        ...

I did not do it at this point in time mainly because the compiler does not allow default parameters on traits. (trait Hasher[hash_value_type: DType = DType.uint64] does not work)
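As a loose Python analogue of a hasher parametrized by its result type with a default, the sketch below takes the hash width as a constructor argument defaulting to 64 bits. The parameter name `hash_bits`, the set of allowed widths, and the DJBX33A-style mixing are assumptions of this sketch; Mojo would express the width as a compile-time parameter rather than a runtime argument.

```python
class Hasher:
    """Hypothetical hasher whose result width is a defaulted parameter."""

    def __init__(self, hash_bits=64):
        if hash_bits not in (8, 16, 32, 64):
            raise ValueError("unsupported hash width")
        self.hash_bits = hash_bits
        self._mask = (1 << hash_bits) - 1
        self._state = 5381 & self._mask

    def update(self, data):
        # All arithmetic is carried out at the requested width.
        for byte in data:
            self._state = (self._state * 33 + byte) & self._mask

    def finish(self):
        return self._state


def hash_bytes(data, hash_bits=64):
    hasher = Hasher(hash_bits)
    hasher.update(data)
    return hasher.finish()
```

For this particular mixing function, computing at 32 bits gives the same value as truncating the 64-bit result, so it behaves like the "bitcast at the end" case; an algorithm with width-specific constants or rotations would not have that property, which is the motivation for putting the parameter on the trait.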

> We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.

I prefer it too now. I thought of the factory approach because some Hashers might have complex initialization needs, e.g. they often need a random seed and a secret. But as you can see here, I was able to come up with a strategy where users can influence the hasher behavior with compile-time parameters and environment variables.
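One way to picture a hasher that needs no factory yet can still be seeded from outside is sketched below in Python. The environment variable name `SKETCH_HASH_SEED`, the XOR of the seed into the initial state, and the mixing function are all hypothetical choices for this illustration and are not taken from the POC.

```python
import os

MASK64 = (1 << 64) - 1
SEED_ENV_VAR = "SKETCH_HASH_SEED"  # hypothetical variable name


def seed_from_env(default=0):
    """Read the seed from the environment, falling back to a default."""
    raw = os.environ.get(SEED_ENV_VAR)
    return default if raw is None else int(raw)


class SeededHasher:
    """Hasher with a no-argument constructor that still honors a seed."""

    def __init__(self, seed=None):
        if seed is None:
            seed = seed_from_env()
        self._state = (5381 ^ seed) & MASK64

    def update(self, data):
        for byte in data:
            self._state = (self._state * 33 + byte) & MASK64

    def finish(self):
        return self._state


def hash_bytes(data, seed=None):
    hasher = SeededHasher(seed)
    hasher.update(data)
    return hasher.finish()
```

Callers that write `SeededHasher()` never see the seed, which keeps the `hash_with` style usable while still letting a deployment vary the hash values.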

> When implementors (and users) are writing user-defined-types, it would be nice for them to not have to always worry about init, update, and finish to show up in their code: i.e. they only worry about hash or hash_with for example rather than the full API set for the trait requirements.

Yes, as you can see here, it is trivial to implement the hash method once the standard library implements hash methods for the basic types. As I mentioned here, it would be very easy to synthesize the method as well. The only concern I have right now is what happens with references, and whether we want to provide a solution for circular references. We could say that structs with references do a shallow hash, where they only hash the reference address. Or we could allow a deep hash, where we would need to be able to check whether the referenced value was already visited by the Hasher. This would be possible if we add fn visited(inout self, id: ???) -> Bool to the Hasher. As you can see, I am not certain what the type of the reference id should be, maybe Scalar[DType.address].
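The deep-hash option with a `visited` check can be sketched in Python like this. Using Python's `id()` as the reference identity, the `Node` type, and the DJBX33A-style mixing are assumptions for illustration; the open question in the comment about the right id type in Mojo is not answered here.

```python
MASK64 = (1 << 64) - 1


class Hasher:
    """Hypothetical hasher that remembers which references it has seen."""

    def __init__(self):
        self._state = 5381
        self._visited = set()

    def write_bytes(self, data):
        for byte in data:
            self._state = (self._state * 33 + byte) & MASK64

    def visited(self, ref_id):
        """Return True if ref_id was seen before; otherwise record it."""
        if ref_id in self._visited:
            return True
        self._visited.add(ref_id)
        return False

    def finish(self):
        return self._state


class Node:
    def __init__(self, label):
        self.label = label
        self.next = None

    def hash_with(self, hasher):
        if hasher.visited(id(self)):
            return  # already hashed: stop here to break the cycle
        hasher.write_bytes(self.label.encode("utf-8"))
        if self.next is not None:
            self.next.hash_with(hasher)


def deep_hash(node):
    hasher = Hasher()
    node.hash_with(hasher)
    return hasher.finish()


# Two nodes that point at each other: a -> b -> a
a, b = Node("a"), Node("b")
a.next, b.next = b, a
```

Without the `visited` check, `deep_hash(a)` would recurse until the interpreter's recursion limit is hit; with it, each node contributes its bytes exactly once.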

> In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

To be honest, I did not consider it until now. After examining the method docs, I think it's a bit redundant. The collection implementer can add the length of the collection as an Int to the hasher in order to solve the issue of [1, 2, 3] and [[1], [2, 3]] generating the same hash value. I think a special method does not help much.
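The collision described above is easiest to see by looking at the bytes a hasher would receive. The Python sketch below records that byte stream for nested lists of integers, with and without the collection length written first; the 8-byte little-endian encoding and the function name `byte_stream` are assumptions of this sketch.

```python
def byte_stream(value, length_prefix):
    """Return the bytes that would be fed to a hasher for value."""
    out = bytearray()

    def feed(item):
        if isinstance(item, list):
            if length_prefix:
                # The collection writes its own length as an integer first.
                out.extend(len(item).to_bytes(8, "little"))
            for element in item:
                feed(element)
        else:
            out.extend(item.to_bytes(8, "little", signed=True))

    feed(value)
    return bytes(out)


flat = [1, 2, 3]
nested = [[1], [2, 3]]

# Without lengths both values flatten to the integers 1, 2, 3.
same_without = byte_stream(flat, False) == byte_stream(nested, False)

# With lengths the streams are 3,1,2,3 versus 2,1,1,2,2,3.
same_with = byte_stream(flat, True) == byte_stream(nested, True)
```

Any hasher fed identical byte streams must return identical values, so the unprefixed case collides for every algorithm; writing the length as an ordinary integer is enough to separate the two inputs, which is the argument made in the comment.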

mzaks · 2024-04-28T12:53:46Z

I had a call with @gabrieldemarmiesse yesterday where we discussed the proposal in depth. As a result, I committed some improvements to the proposal today.

gabrieldemarmiesse · 2024-04-28T17:39:03Z

Looks good! We can improve on the API and Python compatibility when the compiler has more flexibility :)

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

JoeLoser · 2024-05-05T16:14:02Z

Thanks for updating this — I'll take a look at the updates early this week. Very exciting to see this moving along!! 🚀

modularbot · 2024-05-10T13:42:45Z

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

[External] [Proposal] Improve the hash module This proposal is based on discussion started in #1744 Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com> Closes #2250 MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941

JoeLoser · 2024-05-11T03:11:39Z

Landed in today's nightly: #2615. Thanks for the well thought out proposal! Happy to see this moving along.

[External] [Proposal] Improve the hash module This proposal is based on discussion started in modularml#1744 Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com> Closes modularml#2250 MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941 Signed-off-by: Lukas Hermann <lukashermann28@gmail.com>

mzaks changed the title ~~Improve the hash module~~ [Proposal] Improve the hash module Apr 9, 2024

gabrieldemarmiesse reviewed Apr 10, 2024

gabrieldemarmiesse reviewed Apr 11, 2024

JoeLoser added the stdlib-proposal Standard Library Proposals label Apr 18, 2024

mzaks mentioned this pull request Apr 20, 2024

[BUG]: Dict._find_index doesn't check all index slots, can infinite loop #1729

Closed

mzaks force-pushed the proposal/improved-hash-module branch from 4c46b8f to 7c3c0a4 Compare April 26, 2024 14:16

mzaks requested a review from gabrieldemarmiesse April 28, 2024 12:52

gabrieldemarmiesse approved these changes May 1, 2024

mzaks force-pushed the proposal/improved-hash-module branch from fa9b222 to 119320b Compare May 3, 2024 02:23

mzaks added 3 commits May 3, 2024 09:58

improve the hash module

721a2b2

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Coorected file name and added some details based on PR feedback

759c66b

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

Update improved-hash-module.md

1447717

Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>

mzaks force-pushed the proposal/improved-hash-module branch from 119320b to 1447717 Compare May 3, 2024 07:58

JoeLoser self-assigned this May 5, 2024

ematejska added the mojo-repo Tag all issues with this label label May 6, 2024

modularbot added the merged-internally Indicates that this pull request has been merged internally label May 10, 2024

JoeLoser added the merged-externally Merged externally in public mojo repo label May 11, 2024

JoeLoser closed this May 11, 2024
