Aarch64 optimized UTF-8 validator #437
@ethercrow @0xd34df00d @vdukhovni could you possibly take a look at the C code?
I'm afraid I am not familiar with ARM intrinsics and don't have the cycles to come up to speed, so I won't be able to do a thorough review of the C code in question. It would need to be a literate program that explained all the registers and instructions involved, with a hefty prose-to-code ratio, to make it possible for casual newbies like me to come to grips with it. :-(
@kozross while we are trying to source reviews, let's create a small benchmark set. I suggest forming a data set by downloading Wikipedia pages in various languages, both as raw HTML and as content text only. According to https://en.wikipedia.org/wiki/Wikipedia:Copyrights such files should contain a link to the source somewhere.
@Bodigrim Should I add these benchmarks to this PR, or make a separate one? I assume a separate one would be better.
Separate PR, please.
I'm not familiar with Neon either, unfortunately. The C in general looks good to me (modulo a couple of very minor, rather stylistic nitpicks), and I've asked a couple of folks who know Neon to take a look if they have time.
Hello all, I also have no experience with Neon, so I can't meaningfully review from that angle. The chunk size here is 64 bytes, and validating strings shorter than that is handled by the scalar version. Naively I would think that
@ethercrow In my experience, the overhead in the situation you describe is basically zero. Furthermore, for bytestrings that short, even if the overhead were meaningful, I don't think we're that worried. I haven't benchmarked this difference in particular though, since all my previous work (for Also, I'm fairly sure a typical response from the Twitter or GitHub API would be considerably larger than this; tweets exceed 100 characters just in their text fairly regularly nowadays, for example.
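The split discussed above (a wide fast path for full 64-byte chunks, with a scalar validator covering short inputs and the tail) can be sketched in portable C. This is only an illustration under stated assumptions, not the code in this PR: `validate_scalar`, `chunk_is_ascii`, and `validate_chunked` are hypothetical names, a plain byte loop stands in for the NEON chunk check, and the sketch hands everything after the first non-ASCII chunk to the scalar path instead of continuing in SIMD.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 64 /* chunk size mentioned in the discussion */

/* Hypothetical scalar validator: returns 1 if p[0..len) is valid UTF-8. */
static int validate_scalar(const uint8_t *p, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = p[i];
    size_t need;      /* number of continuation bytes expected */
    uint32_t cp, min; /* decoded code point and smallest non-overlong value */
    if (b < 0x80) { i += 1; continue; }
    if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation or invalid lead byte */
    if (len - i <= need) return 0;  /* sequence truncated by end of input */
    for (size_t k = 1; k <= need; k++) {
      uint8_t c = p[i + k];
      if ((c & 0xC0) != 0x80) return 0;
      cp = (cp << 6) | (uint32_t)(c & 0x3F);
    }
    /* Reject overlong encodings, values above U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    i += need + 1;
  }
  return 1;
}

/* Stand-in for the SIMD check: is every byte of a full chunk ASCII? */
static int chunk_is_ascii(const uint8_t *p) {
  uint8_t acc = 0;
  for (size_t k = 0; k < CHUNK; k++) acc |= p[k];
  return (acc & 0x80) == 0;
}

/* Skip leading all-ASCII chunks, then validate the remainder with the
   scalar path. Inputs shorter than CHUNK go straight to the scalar path. */
int validate_chunked(const uint8_t *p, size_t len) {
  size_t i = 0;
  while (len - i >= CHUNK && chunk_is_ascii(p + i)) i += CHUNK;
  return validate_scalar(p + i, len - i);
}
```

Because only whole all-ASCII chunks are skipped, the scalar path always starts at a character boundary, so no state has to be carried across the hand-off.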
Approving on the basis that it passes tests on ARM CI.
bytestring.cabal
@@ -42,6 +42,7 @@ Description:

License: BSD3
License-file: LICENSE
Cabal-Version: 1.18
What's the reason for the increased Cabal-Version constraint?
Originally I raised it to 2.* because I needed elif. I guess I changed it back to too high a number.
In this case, please revert to the original setting.
@kozross could you please revert to Cabal-Version: >= 1.10? I'd like to merge the branch.
Thanks @kozross for your heroic efforts! Would you possibly be able to finish up with the shortcut for ASCII validation? This is important for non-accelerated architectures, e.g., 32-bit machines. We now have big-endian CI in place, so it should be less painful.
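For readers unfamiliar with the idea, an ASCII shortcut on a machine without SIMD can be sketched as a word-at-a-time scan of the high bits. This is a minimal illustration, not the library's implementation: `ascii_prefix_len` is a hypothetical name, and the 8-byte word size is an assumption (a 32-bit target might prefer 4-byte words).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the length of the longest all-ASCII prefix of p[0..len).
   Hypothetical helper for illustration only. */
size_t ascii_prefix_len(const uint8_t *p, size_t len) {
  /* A byte is ASCII exactly when its top bit is clear. */
  const uint64_t high_bits = UINT64_C(0x8080808080808080);
  size_t i = 0;
  while (len - i >= 8) {
    uint64_t word;
    /* memcpy avoids unaligned-access and aliasing problems. */
    memcpy(&word, p + i, sizeof word);
    if (word & high_bits) break; /* some byte in this word is non-ASCII */
    i += 8;
  }
  /* Finish byte by byte: locates the exact position in the last word
     and handles the tail shorter than 8 bytes. */
  while (i < len && p[i] < 0x80) i++;
  return i;
}
```

The mask test only asks whether any byte has its top bit set, so it gives the same answer on little-endian and big-endian machines. A validator could call such a helper first and run the full UTF-8 state machine only on the remainder.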
* Skeleton AArch64 is-valid-utf8.c
* NEON SIMD implementation of UTF-8 validator
* Revert to older Cabal version
* Revert to Cabal 1.10
* Make Cabal check happy
This addresses #417, specifically regarding a NEON SIMD implementation. I haven't benchmarked this (there aren't any benchmarks for UTF-8 encoding and decoding specifically), but this implementation is based on the same principles as the x86 ones, and should be fairly fast.
I also had to bump the Cabal version requirement to 2.2, so that I could use elif without it erroring.
On a related note: should there be some benchmarks for validator speed? If so, which data should we test against?