cmd/compile: use hashing for switch statements #34381

Open
mdempsky opened this issue Sep 18, 2019 · 17 comments

Comments

@mdempsky
Member

@mdempsky mdempsky commented Sep 18, 2019

Currently we implement switch statements with binary search. This works well for values that fit into a register, but for strings it might be better to compute a hash function.

Some thoughts:

  1. Even in the simple case of using a full hash function, exactly two passes over the switched value (one to hash it, one to compare it against the matching case) are probably better than the O(log N) passes that binary search may need.

  2. Since we have the expected keys statically, we can use a simple, cheap universal hash (e.g., FNV? djb2?) and keep trying different seeds until we get one without any collisions.

  3. Instead of hashing all of the input bytes, we can hash just enough of them to avoid collisions.

  4. We can try using a minimal (or near minimal) perfect hash to emit a jump table instead of binary search. (This might even be useful for sparse integer values too.)

If someone can help write code that takes a []string and figures out an appropriate and cheap hash function to use, I can help wire it into swt.go.
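
A minimal sketch of idea 2, assuming 32-bit FNV-1a with the seed XORed into the offset basis. The seeding scheme and the names `fnv1a` and `findSeed` are assumptions made for illustration, not existing compiler code:

```go
package main

import "fmt"

// fnv1a is 32-bit FNV-1a with a seed XORed into the offset basis.
// The seeding scheme is an assumption for this sketch, not a standard variant.
func fnv1a(seed uint32, s string) uint32 {
	h := uint32(2166136261) ^ seed
	for i := 0; i < len(s); i++ {
		h ^= uint32(s[i])
		h *= 16777619
	}
	return h
}

// findSeed returns the first seed below maxSeed for which all keys hash to
// distinct 32-bit values. It reports ok == false if no such seed exists,
// which always happens when keys contains duplicates.
func findSeed(keys []string, maxSeed uint32) (seed uint32, ok bool) {
	for seed = 0; seed < maxSeed; seed++ {
		seen := make(map[uint32]bool, len(keys))
		collision := false
		for _, k := range keys {
			h := fnv1a(seed, k)
			if seen[h] {
				collision = true
				break
			}
			seen[h] = true
		}
		if !collision {
			return seed, true
		}
	}
	return 0, false
}

func main() {
	keys := []string{"break", "case", "chan", "const", "continue", "default"}
	seed, ok := findSeed(keys, 1<<16)
	fmt.Println(seed, ok)
}
```

Because the keys are known statically, the search runs once at compile time; only the chosen seed and the hash loop would end up in the generated code.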

@mdempsky mdempsky added this to the Unplanned milestone Sep 18, 2019
@josharian
Contributor

@josharian josharian commented Sep 19, 2019

Drive-by comment from phone: It wouldn’t surprise me if “use the first word of the string data” turns out in practice to be a decent first-cut hash, and it is certainly cheap (can be generated inline, should be just a few instructions).
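
As a rough illustration of the "first word" idea, assuming a 64-bit word, little-endian packing, and zero padding for short strings (all three are assumptions; `firstWord` is a hypothetical helper, and a compiler would emit a direct load rather than a loop):

```go
package main

import "fmt"

// firstWord packs up to the first 8 bytes of s into a uint64, little-endian,
// padding short strings with zero bytes. Distinct strings can still map to
// the same word, so a full comparison is needed after a match.
func firstWord(s string) uint64 {
	n := len(s)
	if n > 8 {
		n = 8
	}
	var w uint64
	for i := 0; i < n; i++ {
		w |= uint64(s[i]) << (8 * uint(i))
	}
	return w
}

func main() {
	for _, s := range []string{"break", "case", "continue", "continued"} {
		fmt.Printf("%-10s %#016x\n", s, firstWord(s))
	}
}
```

Strings that share their first eight bytes, such as "continue" and "continued", collide under this scheme, so it only works as a first cut when the case strings differ early.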

@pierrec

@pierrec pierrec commented Sep 19, 2019

After reading about minimal perfect hash functions, for which a Go implementation (go-mph) already exists, maybe that can just be used as is?

@lranjbar

@lranjbar lranjbar commented Nov 3, 2019

@mdempsky @josharian @pierrec @toothrot Has anyone started on this issue? It seems interesting.

@mdempsky
Member Author

@mdempsky mdempsky commented Nov 3, 2019

@lranjbar I don't think anyone has started on it yet. Happy to answer any quick questions to get you started. (I probably won't be able to give any more serious guidance/feedback for another week though.)

@lranjbar

@lranjbar lranjbar commented Nov 5, 2019

@mdempsky I'm okay with waiting a little bit for more guidance. (I do actually have a pretty busy week myself.) I can still dig into it a bit more on my own; I was just curious before I started. :)

@jupj
Contributor

@jupj jupj commented Sep 4, 2020

About the jump table, we probably prefer a near minimal perfect hash function. A strict minimal perfect hash function would always lead to a collision for the default case of the switch. One option is to extend the size of the jump table from minimal to the next power of 2, and jump to the default case for indexes that don't correspond to a pre-computed hash. At the same time, that would allow us to use cheaper bitwise operators instead of the more expensive % operator when calculating the hash/index for the jump table.

A question is how to evaluate the performance of binary search vs jump table. Constructing the near minimal hash probably requires some additional cycles, compared to a cheap hash (or the even cheaper "use the first word of the string data", as mentioned by @josharian). I suppose that the performance benefits of the jump table come with large n. For small n, the difference might be less clear?

@jupj
Contributor

@jupj jupj commented Sep 19, 2020

I've implemented a hash function using fnv, based on string length and minimal unique prefix (@lranjbar I hope you don't mind). Also have a working implementation of a near minimal perfect hash function for a jump table.

This finds a unique hash reasonably fast (typically within the first 1-2 seeds) and hash/jumptable benchmarks indicate performance close to maphash on linux/amd64. (I've used all string-typed switch cases from the golang/go repo as testdata.)

@mdempsky how/where should I submit this code?

@mdempsky
Member Author

@mdempsky mdempsky commented Sep 19, 2020

@jupj Great! Thanks for working on this. Do you have the code available somewhere that I can take a look?

What would probably be best is if you can make a standalone Go package that we can include in the standard repo as something like "cmd/compile/internal/phash" (or ".../perfect hash" or whatever; open to suggestions).

With that done, I can help wire it into gc/swt.go. Or give you pointers on doing this yourself if you'd like.

@jupj
Contributor

@jupj jupj commented Sep 20, 2020

@mdempsky please find the code here: https://github.com/jupj/go-issue-34381 (findhash.go contains the hash functions)
I'll modify this to a standalone package that could be included in the standard repo.

I'm new to compiler work, so I appreciate any help and/or pointers to wire it into gc/swt.go!

Edit: added pointer to findhash.go

@lranjbar

@lranjbar lranjbar commented Sep 20, 2020

@jupj No worries. It's been almost a year and clearly I didn't have time outside of work to focus on this. 😅 Good luck with your submission.

@gopherbot

@gopherbot gopherbot commented Sep 23, 2020

Change https://golang.org/cl/256898 mentions this issue: cmd/compile: add internal package phash

@josharian
Contributor

@josharian josharian commented Oct 2, 2020

I think we're on a good path for this, but one thing to double-check as we integrate is that if a giant string arrives for a switch statement only composed of small strings, we don't end up doing O(giant) work to calculate a hash. This protection could have many forms (hash function design, checking max length, checking exact length before hashing, etc.), but we should make sure it's there.

@jupj
Contributor

@jupj jupj commented Oct 3, 2020

> I think we're on a good path for this, but one thing to double-check as we integrate is that if a giant string arrives for a switch statement only composed of small strings, we don't end up doing O(giant) work to calculate a hash. This protection could have many forms (hash function design, checking max length, checking exact length before hashing, etc.), but we should make sure it's there.

Thanks, I updated the CL mentioned above, re-added the check for the minimum unique prefix length. This limits the work to O(prefixLen), with the maximum case string length in the switch statement as the upper bound of prefixLen. Is this sufficient to protect from giant strings?
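
To illustrate the bound, here is a sketch (not the code from the CL) of a hash that mixes in the string length and then reads at most prefixLen bytes, plus a helper for the maximum case length. `boundedHash` and `maxLen` are hypothetical names, and mixing the length into FNV-1a this way is an assumption:

```go
package main

import "fmt"

// boundedHash hashes the length of s and at most its first prefixLen bytes,
// so the work is O(prefixLen) regardless of len(s).
func boundedHash(seed uint32, s string, prefixLen int) uint32 {
	h := uint32(2166136261) ^ seed
	h ^= uint32(len(s)) // the length is available without reading the data
	h *= 16777619
	n := len(s)
	if n > prefixLen {
		n = prefixLen
	}
	for i := 0; i < n; i++ {
		h ^= uint32(s[i])
		h *= 16777619
	}
	return h
}

// maxLen returns the length of the longest case string; any input longer
// than this can go straight to the default case without hashing.
func maxLen(keys []string) int {
	m := 0
	for _, k := range keys {
		if len(k) > m {
			m = len(k)
		}
	}
	return m
}

func main() {
	keys := []string{"go", "for", "func"}
	giant := string(make([]byte, 1<<20))
	if len(giant) > maxLen(keys) {
		fmt.Println("default case, nothing hashed")
	}
	// Same length and same 2-byte prefix: the remaining bytes are never read.
	fmt.Println(boundedHash(0, "func", 2) == boundedHash(0, "fury", 2))
}
```

Since the prefix never needs to be longer than the longest case string, a giant input costs at most that many byte reads before the final comparison rejects it.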


Any tips for how to wire the perfect hashes into swt.go? I think this could be the next step, to validate the perfect hash functionality. I have two options in mind:

  1. For each switch statement, generate an AST that implements the ad-hoc hash as an anonymous function. So switch s { will be rewritten to switch func(s string) uint32 { /* generated AST */ }(s) {, where the AST implements the selected perfect hash function with given seed and prefix length.

  2. Generate functions (Go source code) for all different combinations of cheap hashes and include them in the runtime library as runtime.phash_Len, runtime.phash_First, runtime.phash_LenFirst, etc. The compiler will rewrite the switch statements to use these, e.g., the statement switch s { will be rewritten to switch runtime.phash_LenFirst(s) {. The compiler will include seed and prefix parameters to the hash function where needed.

I don't have any experience generating AST, so I feel the latter option would be more approachable for me. There are 31 different combinations (fnv fallback included), so I think it would make sense to generate the functions into the runtime library, at least for programs with more than 31 switch cases.
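
For illustration, this is roughly what either rewrite amounts to at the source level. The hash here (length combined with first byte) is only a stand-in for something like phash_LenFirst, and `lenFirst` and `keywordKind` are hypothetical names; with option 1 the call to lenFirst would be an inline anonymous function instead:

```go
package main

import "fmt"

// lenFirst is a stand-in cheap hash: string length combined with first byte.
func lenFirst(s string) uint32 {
	if len(s) == 0 {
		return 0
	}
	return uint32(len(s))<<8 | uint32(s[0])
}

// keywordKind is the rewritten form of:
//
//	switch s {
//	case "go":   return "statement"
//	case "for":  return "loop"
//	case "func": return "declaration"
//	default:     return "default"
//	}
//
// Each case constant is the hash of the original case string, computed at
// compile time. The string comparison inside each case is still required,
// because other inputs can hash to the same value.
func keywordKind(s string) string {
	switch lenFirst(s) {
	case 2<<8 | 'g':
		if s == "go" {
			return "statement"
		}
	case 3<<8 | 'f':
		if s == "for" {
			return "loop"
		}
	case 4<<8 | 'f':
		if s == "func" {
			return "declaration"
		}
	}
	return "default"
}

func main() {
	for _, s := range []string{"go", "for", "func", "fun", ""} {
		fmt.Printf("%q -> %s\n", s, keywordKind(s))
	}
}
```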

Comments?

@josharian
Contributor

@josharian josharian commented Oct 3, 2020

We should generate AST.

Or leave it alone in swt.go and generate SSA in gc/ssa.go. I find generating SSA easier than generating AST, but since this would be the first switch-to-SSA direct lowering there's no sample code to work from and you're more likely to hit unexpected stumbling blocks.

@gopherbot

@gopherbot gopherbot commented Oct 7, 2020

Change https://golang.org/cl/260597 mentions this issue: cmd/compile: add -d=swthash flag for string switch hashing

@mdempsky
Member Author

@mdempsky mdempsky commented Oct 7, 2020

As an update here: On Sunday, I played with integrating phash into the Go compiler and got it mostly working. CL is 260597 above. (If you're really curious, I streamed the compiler development process on Twitch at https://www.twitch.tv/videos/761008858; string hashing stuff starts at 37m20s and runs to the end of the stream. Beware Twitch only stores videos for 14 days.)

You can build targets with -gcflags=all=-d=swthash to enable the optimization. I haven't done any performance/size benchmarking yet.

@jupj
Contributor

@jupj jupj commented Oct 10, 2020

Thanks for the stream, very insightful to watch!

7 participants