Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Improve the hash module #2250

Closed
wants to merge 3 commits into from

Conversation

mzaks
Copy link
Contributor

@mzaks mzaks commented Apr 9, 2024

This proposal is based on discussion started in #1744

@mzaks mzaks changed the title Improve the hash module [Proposal] Improve the hash module Apr 9, 2024
Comment on lines 75 to 76
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment saying that when Mojo supports it, we could have a default implementation for this?
Something like:

Suggested change
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
...
fn update(inout self, pointer: DTypePointer[DType.uint8], length: Int):
for i in range(length):
self.update(pointer[i])

Then a user can override the method to implement something more efficient if they wish to do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, I think it is a detail not really important for this proposal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a suggestion, fair enough

@theopomies
Copy link

Nice proposal 👍🏻
Please note there appears to be a typo in the filename: imporved-hash-module.md when it should actually be improved-hash-module.md.


fn __init__(inout self):
self.hash = 42
fn update[T: DType](inout self, value: SIMD[T, 1]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I remember, you have an example implementation of this trait in one of your repos right? It would help a lot readers if a link was provided. A POC is always welcome :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation of the Hasher is a detail we should not focus on in this proposal. Given Implementation actually works and returns 42 for anything you will pass to it. If this proposal will be accepted I guess the default hasher will be based on DJBX33A hash algorithm already implemented in standard library. After this proposal is implemented I intend to create another proposal where I will argue about more suitable hash algorithms, which should be part of standard library and maybe replacing DJBX33A as the default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the focus I agree, it's just helpful for readers like me who are not very familiar with hashing algorithms. Thanks for the proposal though. I think it's really good :)

# How to combine a hash of hashes ???
```
As you can see above we, computed hashes for all of the struct fields, but we are uncertain how to combine those values in a way which produces a good (non compromised) hash value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe to be fair, we should add an example of what the computing of the hash would look like if mojo evolves without this proposal. By following the docs it would look like this

@value
struct Person(Hashable):
    var name: String
    var age: UInt8
    var friends_names: List[String]
    fn __hash__(self) -> Int:
        var final_tuple = (self.name, self.age) + tuple(self.friends)
        return hash(final_tuple)

And this has a lot of flaw(lots of copying, possible memory allocation, unpredictable types possible for the tuple, use of the object struct, a struct must know how to hash its members, etc...) but at least the proposal gives a fair chance to the "python's way" of doing the hashing.

Copy link
Contributor Author

@mzaks mzaks Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned the "python's way" in the proposal now.

Comment on lines 103 to 109
# self.name.hash_with(hasher), when String is Hashable, otherwise:
hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name))
# self.age.hash_with(hasher), when SIMD is hashable, otherwise
hasher.update(self.age)
# self.friends_names.hash_with(hasher), when List of Hashable types is Hashable, otherwise:
for friend in self.friends_names:
hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose an alternative here, as it allows the implementer of "hash_with" to use only one type of call instread of two, thus simplifying the API.
If we modify the Hasher trait to include a default implementation:

trait Hasher:
    fn update(inout self, value: Hashable):
        # this is a default implementation and should not be changed
        value.hash_with[Self](self)

then, if the struct attribute has a hash_with method or not, the same method is called and users don't have to remember two different ways of calling the hasher:

    fn hash_with[H: Hasher](self, inout hasher: H):
        # hasher.update(self.name), when String is Hashable, otherwise:
        hasher.update(self.name._as_ptr().bitcast[DType.uint8](), len(self.name))
        hasher.update(self.age)
        # hasher.update(self.friends_names), when List of Hashable types is Hashable, otherwise:
        for friend in self.friends_names:
            hasher.update(friend[]._as_ptr().bitcast[DType.uint8](), len(friend[]))

Note that this isn't possible with the current compiler for two reasons:

  • The compiler doesn't like cyclical dependencies in traits
  • The compiler doesn't allow default implementations in traits

Those are two issues which I hope will be resolved in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the API and provide a working example in https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa
Let me know if it resolves your concerns.

var age: UInt8
var friends_names: List[String]

fn hash_with[H: Hasher](self, inout hasher: H):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python and Mojo both use dunder __something__ to signify that methods have a special meaning in the stdlib and the language. It avoids naming conflicts with user methods in general. I don't have a strong opinion about this, I'm just pointing this out.
This method in this proposal shouldn't be a second class citizen compared to __hash__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I created a small POC, where I changed the method names. https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa

If it is ok with everyone I will change the snippets in the proposal and reference the gist at the end.

proposals/improved-hash-module.md Show resolved Hide resolved
@JoeLoser
Copy link
Collaborator

JoeLoser commented Apr 18, 2024

Thanks so much for going through the proposal process! We as a team really appreciate the thought and motivation presented. It makes things easier when discussing the design of the proposal. 🎉 🎉 🎉

Decision

Some open questions to consider

  • Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.
  • We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.
  • When implementors (and users) are writing user-defined-types, it would be nice for them to not have to always worry about init, update, and finish to show up in their code: i.e. they only worry about hash or hash_with for example rather than the full API set for the trait requirements.
  • In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

@mzaks
Copy link
Contributor Author

mzaks commented Apr 26, 2024

I created a small POC to address all the outstanding questions:
https://gist.github.com/mzaks/aa66c831dc5e177c2322d5088aac76aa

Should we create an alias for the return type of finish(...) (e.g. alias HASH_RETURN_T = UInt64) rather than hard-coding to UInt64? This gives us flexibility to change the return type in the future or have different hashers return different sized things if they care about space savings in certain cases.

I suggest to parametrize the return type of finish with a default, see. It might be sensible to put the hash value type parameter directly on the Hasher trait, which would allow Hasher to have different implementations for different DTypes, not just a bitcast at the end.

trait Hasher[hash_value_type: DType]:
    fn __init__(inout self):
        ...
    fn update(inout self, bytes: DTypePointer[DType.uint8], n: Int):
        ...
    fn finish(owned self) -> Scalar[hash_value_type]:
        ...

I did not do it at this point in time mainly because the compiler does not allow default parameters on traits. (trait Hasher[hash_value_type: DType = DType.uint64] does not work)

We prefer the hash_with approach defaulting to the DefaultHasher rather than the hasher_factory approach. If you share more motivation/context on why the function approach may be useful, we may be better equipped to chime in.

I prefer it too now. I though of factory approach, because some Hashers might have complex initialization needs, e.g. often they need a random seed and a secret. But as you can see here I was able to come up with a strategy where users can influence the hasher behavior with compile time parameters and environment variables.

When implementors (and users) are writing user-defined-types, it would be nice for them to not have to always worry about init, update, and finish to show up in their code: i.e. they only worry about hash or hash_with for example rather than the full API set for the trait requirements.

Yes, as you can see here it is trivial to implement the hash method when standard library implements hash methods for the basic types. As I mentioned here it would be very easy to synthesize the method as well. The only concern I have right now is, what happens with references and do we want to provide a solution for circular references. We could say that structs with references do a shallow hash, where they only hash the reference address. Or we say we allow deep hash, where we would need to be able to check if the referenced value was already visited by the Hasher. This would be possible if we add fn visited(inout self, id: ???) -> Bool to the Hasher. As you can see I am not certain what the type of the reference id should be, maybe Scalar[DType.address].

In the future (keep things simple for now as you have it!), we may want to consider fancy things like write_length_prefix from Rust. Have you given any thought to this sort of thing?

To be honest I did not consider it till now. After examining the method docs I think it's a bit redundant. The collection implementer can add the length of the collection as an Int to the hasher, in order to solve the issue of [1, 2, 3] and [[1], [2, 3]] generating same hash value. I think a special method does not help much.

@mzaks
Copy link
Contributor Author

mzaks commented Apr 28, 2024

I had a call with @gabrieldemarmiesse yesterday where we discussed the proposal in depth. As a result I commit some improvements to the proposal today.

@gabrieldemarmiesse
Copy link
Contributor

Looks good! We can improve on the API and Python compatibility when the compiler has more flexibility :)

@mzaks mzaks force-pushed the proposal/improved-hash-module branch from fa9b222 to 119320b Compare May 3, 2024 02:23
mzaks added 3 commits May 3, 2024 09:58
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
Signed-off-by: Maxim Zaks <maxim.zaks@gmail.com>
@mzaks mzaks force-pushed the proposal/improved-hash-module branch from 119320b to 1447717 Compare May 3, 2024 07:58
@JoeLoser
Copy link
Collaborator

JoeLoser commented May 5, 2024

Thanks for updating this — I'll take a look at the updates early this week. Very exciting to see this moving along!! 🚀

@JoeLoser JoeLoser self-assigned this May 5, 2024
@ematejska ematejska added the mojo-repo Tag all issues with this label label May 6, 2024
@modularbot
Copy link
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions, click here to learn more.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label May 10, 2024
JoeLoser pushed a commit that referenced this pull request May 11, 2024
[External] [Proposal] Improve the hash module

This proposal is based on discussion started in
#1744

Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com>
Closes #2250
MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941
@JoeLoser JoeLoser added the merged-externally Merged externally in public mojo repo label May 11, 2024
@JoeLoser
Copy link
Collaborator

Landed in today's nightly: #2615. Thanks for the well thought out proposal! Happy to see this moving along.

@JoeLoser JoeLoser closed this May 11, 2024
lsh pushed a commit to lsh/mojo that referenced this pull request May 17, 2024
[External] [Proposal] Improve the hash module

This proposal is based on discussion started in
modularml#1744

Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com>
Closes modularml#2250
MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941

Signed-off-by: Lukas Hermann <lukashermann28@gmail.com>
modularbot pushed a commit that referenced this pull request Jun 7, 2024
[External] [Proposal] Improve the hash module

This proposal is based on discussion started in
#1744

Co-authored-by: Maxim Zaks <maxim.zaks@gmail.com>
Closes #2250
MODULAR_ORIG_COMMIT_REV_ID: 692c7d5940b8c88e83ef895b0be26a33a06ad941
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-externally Merged externally in public mojo repo merged-internally Indicates that this pull request has been merged internally mojo-repo Tag all issues with this label stdlib-proposal Standard Library Proposals
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants