Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aarch64 optimized UTF-8 validator #437

Merged
merged 5 commits into from Nov 25, 2021
Merged

Aarch64 optimized UTF-8 validator #437

merged 5 commits into from Nov 25, 2021

Conversation

kozross
Copy link
Contributor

@kozross kozross commented Nov 6, 2021

This addresses #417, specifically regarding a NEON SIMD implementation. I haven't benchmarked this (there aren't any for UTF-8 encoding and decoding specifically), but this implementation is based on the same principles as the ones used for the x86 ones, and should be fairly fast.

I also had to bump the Cabal version requirements to 2.2, so that I could use elif without it erroring.

On a related note: should there be some benchmarks for validator speed? If so, which data should we test against?

bytestring.cabal Outdated Show resolved Hide resolved
@Bodigrim
Copy link
Contributor

@ethercrow @0xd34df00d @vdukhovni could you possibly take a look at C code?

@vdukhovni
Copy link
Contributor

I'm afraid i am not familiar with arm intrinsics and don't have the cycles to come up to speed. So won't be able to do a thorough review of the C code in question. It would need to be a literate program that explained all the registers and instructions involved with a hefty prose to code ratio to make it possible for casual newbies like me to come to grips with it. :-(

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Nov 14, 2021
@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 14, 2021

@kozross while we are trying to source reviews, let's create a small benchmark set. I suggest to form a data set by downloading wiki pages in various languages, both raw HTML and content text only. According to https://en.wikipedia.org/wiki/Wikipedia:Copyrights such files should contain a link to the source somewhere.

@kozross
Copy link
Contributor Author

kozross commented Nov 14, 2021

@Bodigrim Should I add these benchmarks to this PR, or make a separate one? I assume a separate one would be better.

@Bodigrim
Copy link
Contributor

Separate PR, please.

@0xd34df00d
Copy link
Contributor

Neither am I familiar with Neon, unfortunately. C in general looks good to me (modulo a couple of very minor, rather stylistic nitpicks), and I've asked a couple of folks that know Neon to take a look if they have time.

@ethercrow
Copy link
Contributor

Hello together,

I also have no experience with Neon, so can't meaningfully review from that angle.

The chunk size here is 64 bytes and validating strings shorter than that is handled by the scalar version is_valid_utf8_fallback.

Naively I would think that bytestring_is_valid_utf8 should delegate to is_valid_utf8_fallback as quickly as possible when len < 64 so avoid the overhead of loading the LUTs and initializing state for the SIMD algorithm. I don't know how much of it matters in the presence of an optimizing compiler and CPU smartness, but let's have a benchmark about short strings. E.g. take a typical JSON response from Twitter or GitHub API or something like that and measure how much it takes to validate all the strings there.

@kozross
Copy link
Contributor Author

kozross commented Nov 17, 2021

@ethercrow In my experience, the overhead in the situation you describe is basically zero. Furthermore, for bytestrings that short, even if the overhead was meaningful, I don't think we're that worried. I haven't benchmarked this difference in particular though, since all my previous work (for text) dealt only with its benches, all of which had much larger inputs than 64 bytes.

Also, I'm fairly sure a typical response from Twitter or GitHub API would be considerably larger than this; tweets exceed 100 characters just in their text fairly regularly nowadays, for example.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving on the basis that it passes tests on ARM CI.

bytestring.cabal Outdated
@@ -42,6 +42,7 @@ Description:

License: BSD3
License-file: LICENSE
Cabal-Version: 1.18
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for the increased Cabal-Version constraint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I raised it to 2.* because I needed elif. I guess I changed it back to too high a number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, please revert to the original setting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kozross could you please revert to Cabal-Version: >= 1.10? I'd like to merge the branch.

@Bodigrim Bodigrim merged commit 7fb66a9 into haskell:master Nov 25, 2021
@Bodigrim
Copy link
Contributor

Thanks @kozross for your heroic efforts!

Would you be possibly able to finish up with shortcut for ASCII validation? This is important for non-accelerated architectures, e. g., 32-bit machines. Now we have big-endian CI in place, so it should be less painful.

Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Dec 4, 2021
* Skeleton AArch64 is-valid-utf8.c

* NEON SIMD implementation of UTF-8 validator

* Revert to older Cabal version

* Revert to Cabal 1.10

* Make Cabal check happy
@Bodigrim Bodigrim linked an issue Dec 4, 2021 that may be closed by this pull request
noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
* Skeleton AArch64 is-valid-utf8.c

* NEON SIMD implementation of UTF-8 validator

* Revert to older Cabal version

* Revert to Cabal 1.10

* Make Cabal check happy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide isValidUtf8 function
6 participants