Aarch64 optimized UTF-8 validator #437

kozross · 2021-11-06T23:28:45Z

This addresses #417, specifically regarding a NEON SIMD implementation. I haven't benchmarked this (there aren't any benchmarks for UTF-8 encoding and decoding specifically), but this implementation is based on the same principles as the x86 ones, and should be fairly fast.

I also had to bump the Cabal version requirements to 2.2, so that I could use elif without it erroring.

On a related note: should there be some benchmarks for validator speed? If so, which data should we test against?


Bodigrim · 2021-11-11T20:27:39Z

@ethercrow @0xd34df00d @vdukhovni could you possibly take a look at C code?

vdukhovni · 2021-11-14T02:37:03Z

I'm afraid I am not familiar with ARM intrinsics and don't have the cycles to come up to speed, so I won't be able to do a thorough review of the C code in question. It would need to be a literate program that explained all the registers and instructions involved, with a hefty prose-to-code ratio, to make it possible for casual newbies like me to come to grips with it. :-(

Bodigrim · 2021-11-14T14:34:22Z

@kozross while we are trying to source reviews, let's create a small benchmark set. I suggest forming a data set by downloading wiki pages in various languages, both raw HTML and content text only. According to https://en.wikipedia.org/wiki/Wikipedia:Copyrights such files should contain a link to the source somewhere.

kozross · 2021-11-14T17:11:41Z

@Bodigrim Should I add these benchmarks to this PR, or make a separate one? I assume a separate one would be better.

Bodigrim · 2021-11-15T02:40:37Z

Separate PR, please.

0xd34df00d · 2021-11-16T19:12:25Z

Neither am I familiar with Neon, unfortunately. C in general looks good to me (modulo a couple of very minor, rather stylistic nitpicks), and I've asked a couple of folks that know Neon to take a look if they have time.

ethercrow · 2021-11-17T11:07:00Z

Hello everyone,

I also have no experience with Neon, so can't meaningfully review from that angle.

The chunk size here is 64 bytes and validating strings shorter than that is handled by the scalar version is_valid_utf8_fallback.

Naively I would think that bytestring_is_valid_utf8 should delegate to is_valid_utf8_fallback as quickly as possible when len < 64, to avoid the overhead of loading the LUTs and initializing state for the SIMD algorithm. I don't know how much it matters in the presence of an optimizing compiler and CPU smartness, but let's have a benchmark for short strings. E.g. take a typical JSON response from the Twitter or GitHub API, or something like that, and measure how long it takes to validate all the strings there.
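To make the suggestion concrete, here is a minimal sketch of such an early dispatch. It is not the code from this PR: every name prefixed with sketch_ is hypothetical, the scalar validator is a plain stand-in for is_valid_utf8_fallback (whose real signature is not shown in this thread), and the wide path is only a placeholder that counts calls and reuses the scalar code where a NEON loop would go.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CHUNK 64 /* chunk size mentioned above */

/* Hypothetical scalar validator, standing in for is_valid_utf8_fallback. */
static int sketch_fallback(const uint8_t *p, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b = p[i];
    if (b < 0x80) { i += 1; continue; } /* ASCII byte */
    size_t need;  /* number of continuation bytes expected */
    uint32_t cp;  /* code point being decoded */
    uint32_t min; /* smallest code point allowed for this length */
    if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
    else return 0; /* stray continuation byte or invalid lead byte */
    if (len - i <= need) return 0; /* sequence is truncated */
    for (size_t k = 1; k <= need; k++) {
      uint8_t c = p[i + k];
      if ((c & 0xC0) != 0x80) return 0; /* not a continuation byte */
      cp = (cp << 6) | (uint32_t)(c & 0x3F);
    }
    if (cp < min) return 0;                     /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
    i += need + 1;
  }
  return 1;
}

/* Counts wide-path calls; present only so the dispatch can be observed. */
static int sketch_wide_calls = 0;

/* Placeholder for the SIMD path: a real version would load its lookup
   tables and process 64-byte chunks here. */
static int sketch_wide(const uint8_t *p, size_t len) {
  sketch_wide_calls++;
  return sketch_fallback(p, len);
}

/* Short inputs go straight to the scalar code, before any SIMD setup. */
static int sketch_is_valid_utf8(const uint8_t *p, size_t len) {
  if (len < SKETCH_CHUNK) return sketch_fallback(p, len);
  return sketch_wide(p, len);
}
```

Whether the early return buys anything measurable is exactly what a short-string benchmark would have to show; the sketch only illustrates where the branch would sit.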

kozross · 2021-11-17T13:25:02Z

@ethercrow In my experience, the overhead in the situation you describe is basically zero. Furthermore, for bytestrings that short, even if the overhead was meaningful, I don't think we're that worried. I haven't benchmarked this difference in particular though, since all my previous work (for text) dealt only with its benches, all of which had much larger inputs than 64 bytes.

Also, I'm fairly sure a typical response from Twitter or GitHub API would be considerably larger than this; tweets exceed 100 characters just in their text fairly regularly nowadays, for example.

Bodigrim · 2021-11-22

Approving on the basis that it passes tests on ARM CI.

sjakobi · 2021-11-22T21:10:36Z

bytestring.cabal

@@ -42,6 +42,7 @@ Description:

 License:             BSD3
 License-file:        LICENSE
+Cabal-Version:       1.18


What's the reason for the increased Cabal-Version constraint?

kozross · 2021-11-22

Originally I raised it to 2.* because I needed elif. I guess I changed it back to too high a number.

sjakobi · 2021-11-22

In this case, please revert to the original setting.

Bodigrim · 2021-11-25

@kozross could you please revert to Cabal-Version: >= 1.10? I'd like to merge the branch.

Bodigrim · 2021-11-25T20:02:13Z

Thanks @kozross for your heroic efforts!

Would you possibly be able to finish up the shortcut for ASCII validation? This is important for non-accelerated architectures, e.g. 32-bit machines. Now we have big-endian CI in place, so it should be less painful.
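For readers unfamiliar with the idea, an ASCII shortcut can be sketched roughly as follows. This is an illustration under assumptions, not the implementation that was later added to bytestring: the function name sketch_ascii_prefix is hypothetical. It tests eight bytes at a time by checking whether any byte has its high bit set. That test gives the same answer regardless of byte order, and the memcpy load avoids unaligned-access problems.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the length of the leading run of ASCII
   bytes (values below 0x80) in p[0..len). */
static size_t sketch_ascii_prefix(const uint8_t *p, size_t len) {
  const uint64_t high_bits = UINT64_C(0x8080808080808080);
  size_t i = 0;
  /* Word-wise scan: skip 8 bytes whenever none has its high bit set. */
  while (len - i >= 8) {
    uint64_t w;
    memcpy(&w, p + i, sizeof w); /* safe for any alignment */
    if (w & high_bits) break;    /* some byte in this word is non-ASCII */
    i += 8;
  }
  /* Byte-wise scan for the tail, or to locate the first non-ASCII byte. */
  while (i < len && p[i] < 0x80) i++;
  return i;
}
```

A validator could call this first, return success if the prefix covers the whole input, and otherwise hand the remainder starting at the first non-ASCII byte to the full validator. That split is safe because every ASCII byte is a complete code point.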

* Skeleton AArch64 is-valid-utf8.c
* NEON SIMD implementation of UTF-8 validator
* Revert to older Cabal version
* Revert to Cabal 1.10
* Make Cabal check happy

kozross added 2 commits November 4, 2021 15:14

Skeleton AArch64 is-valid-utf8.c

708cd69

NEON SIMD implementation of UTF-8 validator

4219cc1

Bodigrim reviewed Nov 10, 2021


Revert to older Cabal version

e84d4f6

Bodigrim added this to the 0.11.2.0 milestone Nov 14, 2021

Bodigrim approved these changes Nov 22, 2021


Bodigrim requested a review from sjakobi November 22, 2021 20:15

sjakobi approved these changes Nov 22, 2021


kozross and others added 2 commits November 25, 2021 15:45

Revert to Cabal 1.10

5cd5b4f

Make Cabal check happy

7ca1ba1

Bodigrim merged commit 7fb66a9 into haskell:master Nov 25, 2021

Bodigrim linked an issue Dec 4, 2021 that may be closed by this pull request

Provide isValidUtf8 function #417

Closed


