Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash: Add MurmurHash3 with streaming support #6059

Closed
wants to merge 6 commits into from

Conversation

weltling
Copy link
Contributor

@weltling weltling commented Sep 1, 2020

The implementation is based on the upstream PMurHash. The modernized implementation with the progressive processing support is thankfully attributed to node-murmurhash-native The following variants are implemented

  • murmur3a, 32-bit hash
  • murmur3c, 128-bit hash for x86
  • murmur3f, 128-bit hash for x64

The custom seed support is not targeted by this implementation. It will need a major change to the API, so then custom arguments can be passed through hash_init. For now, the starting hash is always zero.

Signed-off-by: Anatol Belski ab@php.net

@javiereguiluz
Copy link
Contributor

I didn't know this hashing function, so I read Wikipedia's page about MurmurHash and found this:

Jean-Philippe Aumasson and Daniel J. Bernstein were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called HashDoS attacks.

(but maybe MurmurHash 3x versions are not vulnerable?)

SipHash is recommended as a safe alternative. I can't find that algo in ext/hash/hash.c but it looks like SipHash is used for "hash tables" in many major programming languages (and @nikic tried to use it in PHP too a few years ago: see #2149)

@weltling
Copy link
Contributor Author

weltling commented Sep 2, 2020

Yeah, siphash might be a good candidate to add, too.

Regarding the vulnerability concerns - any non crypto algorithm is prone to collisions.

Thanks.

@weltling
Copy link
Contributor Author

weltling commented Sep 2, 2020

@m6w6 perhaps you could check upon this as well, if you've some time.

Thanks.

Copy link
Contributor

@m6w6 m6w6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@weltling
Copy link
Contributor Author

weltling commented Sep 3, 2020

Thanks for the quick check!

@nikic
Copy link
Member

nikic commented Sep 8, 2020

I'm not sure if anyone cares, but this adds a hard dependency on C++ to PHP. Previously C++ was only used in intl (an optional extension).

@m6w6
Copy link
Contributor

m6w6 commented Sep 9, 2020

Hm, looking through it again quickly, it doesn't seem to use any C++ specifics, except a few references as parameters for inline functions, which could easily be replaced by macros.

@weltling
Copy link
Contributor Author

weltling commented Sep 9, 2020

Thanks for noticing. In this case, the existing proven implementation is reused, porting it to C seems unnecessary. The C++ itself there is very light C like. One impact I could think of is that libstdc++ will need to be present at the runtime, which will increase the storage capacity requirement. In general, porting to C would decouple this from the origin implementation, which would complicate backporting.

Of course we could do the discussion broader whether C++ is suitable for an obligatory extension. In general, after the move to C99, some perks of C++ are provided. On the compat side - if not a particular C++ standard version is enforced, but one that can be compiled by a corny compiler (like fe i see is used on travis), that should be ok. If C++ usage is considered a bloker, probably this needs to be solved first. There might be other useful pieces, like spooky hash canonical implementation, that might be going here, too. In my case, i'd rather endorse the way to profit from things that can be reused, rather than reinventing the wheel.

Thanks.

@weltling
Copy link
Contributor Author

Thinking more on this, another option could be to make ext/hash support dynamic plugins. In that case, it wouldn't matter to the core at all, what the plugin language is. Please advise me here. The options

  1. Take this as-is. If C++ is seen as a big deal, then
  2. Go for a C++ RFC, then. If accepted, goto 1., otherwise
  3. Implement a dynamic plugin mechanism to extend on the fly, or
  4. Port to C

In this priority order :) I'd really prefer 1., as the implementation already works in node, works on both endiannesses, more implementations are to come, etc. Please let me know, if there are strong opinions, i'd otherwise pursue merging in a week or two. @m6w6 @nikic @dstogov @cmb69 @remicollet

Thanks

@m6w6
Copy link
Contributor

m6w6 commented Sep 14, 2020

  • ad 1. As nikic mentioned, ext/hash would set a precedence for requiring C++ for building PHP, because it is a non-optional extension.
  • ad 2. I doubt it would go through, but... good luck? (no pun intended)
  • ad 3. Doesn't php_hash_register_algo() suffice? I'm usually all for extensibility by plugins, but ext/hash already comes with this API?
  • ad 4. It should be as simple as replacing the inline func defs requiring references, e.g. inline void foo(bar &baz) with POMs (plain old macros), e.g. #define foo(baz) ..., because it looks like the parameter names are everywhere the same as the local var names.

@m6w6
Copy link
Contributor

m6w6 commented Sep 14, 2020

@weltling
Copy link
Contributor Author

@m6w6 nice, thanks! I was hesitant on this also because this implementation would miss custom seed. That would probably need to touch every init routine to pass a custom hash table, so should be a separate run. Perhaps the way to get it done is to

  • Convert this one to C, that's easy
  • As a follow up, implement a way to pass custom args, for this one it would mean make passing custom seed possible

Still, for others in C++, see spooky for example - that would be a tedious case to port to C :/ Until someone has more time to implement papers like this from scratch, having a complementary ext on pecl that is allowed to use C++ or anything else it needs might be the approach. Perhaps sohuld first go for pecl with all of this, to have some playground and figure out a better way. I haven't seen C++ as a big deal, so didn't expect a glitch situation like this initially :)

This wouldn't get into 8.0 anyway, so have time to decide the best direction.

Thanks

@weltling
Copy link
Contributor Author

Oh, and with php_hash_register_algo - cool, i've missed that! Thanks.

@weltling weltling force-pushed the weltling/murmur_take2 branch 2 times, most recently from 80d2cf9 to fa1b355 Compare October 25, 2020 12:02
@weltling
Copy link
Contributor Author

Pulled the C port patch and fixed some follow ups. Please revaluate.

Thanks

Copy link
Member

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need some adjustments to be ubsan clean, either using proper alignment marking (grep unaligned_uint32_t) or by suppressing warnings (grep no_sanitize).

/home/nikic/php/php-src-fuzz/ext/hash/murmur/endianness.h:63:12: runtime error: load of misaligned address 0x60800000d4bb for type 'const uint32_t' (aka 'const unsigned int'), which requires 4 byte alignment
0x60800000d4bb: note: pointer points here
 00  3a 01 00 00 00 00 00 00  01 31 30 35 36 35 39 36  32 36 34 3b 69 3a 31 3b  69 3a 2d 38 37 00 00
              ^ 
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /home/nikic/php/php-src-fuzz/ext/hash/murmur/endianness.h:63:12 in 

ext/hash/hash_murmur.c Outdated Show resolved Hide resolved
@m6w6
Copy link
Contributor

m6w6 commented Oct 27, 2020

Big endian support is probably missing, too -- just came to mind recently.

Copy link
Member

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

If you want to test big-endian support before committing, you can temporarily delete the line

if: type = cron
and see if the s390x job succeeds.

@weltling
Copy link
Contributor Author

@m6w6 big endian is in. See endianness.h.

Thanks.

@weltling
Copy link
Contributor Author

Ah, i just saw the comment about s390. I'll do that later today. I previously tested on an old powermac with Gentoo and on QEMU, but it was with the C++ version. This one certainly needs to be retested on big endian, too.

Thanks.

@weltling weltling closed this Oct 28, 2020
@weltling weltling reopened this Oct 28, 2020
@weltling
Copy link
Contributor Author

It doesn't look like Travis wants to run the s390 job, timed out 2nd time already. I will verify another way.

Thansk.

@weltling
Copy link
Contributor Author

It actually came through, putting the link to the run for the reference https://travis-ci.org/github/php/php-src/builds/739657005 , that's sparing time for a manual test :) I'll be finishing this up next days, probably this weekend.

Thanks

weltling and others added 4 commits October 31, 2020 13:35
The implementation is based on the upstream PMurHash. The following
variants are implemented

- murmur3a, 32-bit hash
- murmur3c, 128-bit hash for x86
- murmur3f, 128-bit hash for x64

The custom seed support is not targeted by this implementation. It will
need a major change to the API, so then custom arguments can be passed
through `hash_init`. For now, the starting hash is always zero.

Signed-off-by: Anatol Belski <ab@php.net>
Don't convert to numeric hash as it might overflow

Signed-off-by: Anatol Belski <ab@php.net>
Signed-off-by: Anatol Belski <ab@php.net>
Signed-off-by: Anatol Belski <ab@php.net>
It is just a normal array index usage, no alignment is needed.

Signed-off-by: Anatol Belski <ab@php.net>
@nikic
Copy link
Member

nikic commented Nov 1, 2020

@weltling
Copy link
Contributor Author

weltling commented Nov 1, 2020

Indeed. On my end, i used clang to fix the warnings, but the pipeline seems to use AddressSanitizer from GCC, too. I've pushed a fix.

Thanks.

jrfnl added a commit to PHPCompatibility/PHPCompatibility that referenced this pull request Mar 9, 2022
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Includes unit tests.

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7
jrfnl added a commit to jrfnl/doc-en that referenced this pull request Mar 11, 2022
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7
jrfnl added a commit to jrfnl/doc-en that referenced this pull request Mar 11, 2022
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7
jrfnl added a commit to jrfnl/doc-en that referenced this pull request Mar 11, 2022
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7
cmb69 pushed a commit to php/doc-en that referenced this pull request Mar 11, 2022
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7

Co-authored-by: jrfnl <jrfnl@users.noreply.github.com>

Closes GH-1451.
tiffany-taylor pushed a commit to tiffany-taylor/doc-en that referenced this pull request Jan 16, 2023
> **MurmurHash3**
>
> Added support for MurmurHash3 with streaming support. The following variants are implemented:
> * murmur3a, 32-bit hash
> * murmur3c, 128-bit hash for x86
> * murmur3f, 128-bit hash for x64
>
> **xxHash**
>
> Added support for xxHash. The following variants are implemented:
> * xxh32, 32-bit hash
> * xxh64, 64-bit hash
> * xxh3, 64-bit hash
> * xxh128, 128-bit hash

Refs:
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.murmurhash3
* php/php-src#6059
* php/php-src@72e91e9
* https://www.php.net/manual/en/migration81.new-features.php#migration81.new-features.hash.xxhash
* php/php-src#6524
* php/php-src@23590f7

Co-authored-by: jrfnl <jrfnl@users.noreply.github.com>

Closes phpGH-1451.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants