Expose SIMD UTF-8 validation functions from internal module #483

andrewthad · 2022-12-27T15:13:33Z

This makes it possible for users to validate ByteArray# when compiling with the SIMDUTF8 flag on. It does not, however, provide any fallbacks for when that flag is off. That is left as future work.

Related to #479

andrewthad · 2022-12-27T16:55:53Z

I've added portable functions that work without SIMDUTF being turned on. This is ready for review.

andrewthad · 2022-12-27T17:33:30Z

Fixed a problem with GHC 8.0. I still need to add some tests to the test suite. The ByteArray# functions have no coverage.

andrewthad · 2022-12-28T15:51:13Z

Just squashed this down to remove a bunch of meaningless commits.

src/Data/Text/Internal/PrimCompat.hs

andrewthad · 2023-01-05T15:26:14Z

Just squashed everything down.

andrewthad · 2023-01-05T16:00:18Z

This is ready for another round of review now that support for GHC 8.0 is gone. The only changes since last time are documentation. I've provided the motivation for the module in the module's haddock, and I've documented the behaviors of the functions.

src/Data/Text/Internal/Validate.hs

text.cabal

src/Data/Text/Internal/Validate.hs

andrewthad · 2023-01-11T15:55:55Z

Failure on GHC 8.6.5 looks spurious.

andrewthad · 2023-01-12T16:49:35Z

Just squashed. This will kick off another round of CI.

andrewthad · 2023-01-12T17:54:29Z

CI is still failing, but I’m pretty sure that it’s spurious. This is ready for another round of review

andrewthad · 2023-01-12T17:57:20Z

Wait, something really is wrong on the ppcle64 architecture. For some reason, the dependency on the byte array compatibility library doesn’t get picked up, and so it ends up missing the module. I don’t see anything wrong in the Cabal file though.

Lysxia · 2023-01-12T18:40:57Z

The emulated.yml workflow uses ghc --make instead of cabal to build, due to memory constraints according to the comment in emulated.yml. So the extra package probably needs to be installed explicitly somehow.

Bodigrim · 2023-01-12T19:33:25Z

@andrewthad necessary changes are in d0b70f7. I think you squashed a bit overzealously.

andrewthad · 2023-01-13T14:27:55Z

I’ve incorporated the diff manually. Thanks. Looks like that fixed it.

andrewthad · 2023-01-19T16:15:07Z

Only remaining CI failure seems unrelated to the change.

andrewthad · 2023-01-24T19:10:51Z

Is there any additional feedback, or is this good to go?

Bodigrim · 2023-01-24T22:10:19Z

It's good to go, but on hold waiting for #474 to land (which in its turn waits until we cut a release to accompany GHC 9.6).

Bodigrim · 2023-02-28T18:52:03Z

@Lysxia given that #448 has landed, could you take a look at what remains of this PR? I assume the new design is flexible enough already, but we might still want to expose low-level calls to simdutf?..

I’m AFK and cannot take a close look myself.

andrewthad · 2023-03-08T20:08:44Z

Is there anything I can do to move this forward? Should I resolve the merge conflict, or are there concerns about this feature no longer fitting into the design of text?

Lysxia · 2023-03-09T00:54:32Z

We can still accept this. The bytearray validation functions are new. And while we now have a fancier bytestring validation function (now in Data.Text.Internal.Encoding), adding the bool version would look consistent with the bytearray ones.

src/Data/Text/Internal/Validate.hs

Lysxia · 2023-03-09T01:09:38Z

src/Data/Text/Internal/Validate.hs

+-- @safe@ FFI call is executing. The byte array argument /must/ be pinned,
+-- and the calling context is responsible for enforcing this. If the
+-- byte array is not pinned, this function's behavior is undefined.
+isValidUtf8ByteArraySafe ::


Instead of Unsafe/Safe why not Unpinned/Pinned?

The Safe/Unsafe naming is confusing since "safe FFI" calls are unsafe and "unsafe FFI" calls are safe. The FFI functions themselves don't need to be renamed.

We could also consider Unsafe/Pinned but this still gives the wrong impression that the pinned one is safer than Unsafe.

I think you're right. Pinned/unpinned is much more clear. I'll change this.

src/Data/Text/Internal/Validate.hs

andrewthad · 2023-03-09T14:32:21Z

Responding to the comment about the 128KiB limit for switching from unsafe to safe ffi:

No reason for that particular number (it's just an arbitrary large power of two). Here is everything that I understand about safe FFI calls. First, I'm provide a list of primary sources, and then I'll offer my interpretation of this. My interpretation could be wrong since I've never actually read the source code for this part of the GHC runtime.

GHC User Manual section 6.17.1.2: "A safe call calling into haskell is run on a bound thread by the RTS. This means any nesting of safe calls will be executed on the same operating system thread. Sequential safe calls however do not enjoy this luxury and may be run on arbitrary OS threads."
GHC Issue 13296: Slyfox writes "ghc's RTS does not limit amount of outstanding safe FFIs". rwbarton measures that the overhead of safe FFI call is 100ns in a particular situation. Simon Marlow implies that the estimate of 100ns is reasonable.
GHC Issue 11075: A discussion of how safe FFI calls can spawn an unlimited number of OS threads, which essentially thwarts the GHC scheduler (this one is a little pathological and seems like it was more of a thought experiment than something observed in a production environment). Simon Marlow suggests a lightweight extension to the FFI. Some code for this actually existed at some point, but it was never merged into GHC.

Here's is my analysis. The phrase "nesting of safe calls" in the GHC User Manual is talking about calling Haskell from C, but I interpret this entire section to mean either, or both, of these:

If you perform a safe FFI call and then call back into Haskell code from C, and then that Haskell code makes another safe FFI call, then you stay in the same thread. (the original intent)
If you perform a safe FFI call inside of a bound thread (created by forkOS), GHC does not have to transfer control to another OS thread. It uses the one you are already in. (something that would also make sense)

Regardless of which one of these is the case, it is certain that if you make a safe FFI call from an unbound thread (a green thread), the runtime has to change the execution context from the green thread's current capability's OS thread to a different OS thread. These steps must happen:

Find an OS thread in the bound thread pool, or create a new one if all the existing ones are busy.
Communicate to this OS thread the task it needs to do (not sure how this works exactly)
Put the green thread to sleep, relinquishing its capability, and add the green thread to a wait queue that will wake it back up when the result from the bound OS thread is done (this is very similar to calling takeMVar but it might be totally different code in the runtime that handles this). This might cause a different green thread (one that had been waiting for a capability to become available) to start running on that capability.
Wake the green thread back up once the result is ready. Once a capability becomes available, the green thread can start running on it, although this will not necessarily be the same capability it was originally running on.

That's a lot of stuff going on. There are two pitfalls to the safe FFI:

It involves transferring control between thread and synchronization between threads, both of which can be expensive, but the context (like if you are already running in a bound thread) might change the cost.
The concerns from GHC Issue 11075, where the safe FFI thwarts the GHC scheduler.

For these reasons, I tend to be cautious about using the safe FFI for compute intensive things. For anything that might take longer than 1ms, I think the safe FFI makes sense because blocking GC sync for a stop-the-world collector is really bad. But for things that are certainly under 500us, I think the unsafe FFI makes sense. The numbers 1ms and 500us are arbitrary. I cannot justify those with anything other than intuition about what makes sense. But for the 128KiB breakpoint that I chose, at a speed of validating 1B/ns (a conservative estimate), it takes 128us to validate. I think that performing an uninterruptible (preventing GHC across all threads) 128us operation ought to be fine. But maybe that threshold should be lower. Maybe 64KiB or 32KiB would be better. I do think that just using "is it pinned?" alone to make the decision is not great. Everything above 3KiB is pinned, and I don't think that the pitfalls of the safe FFI are worth making sure that GC can run concurrently with a 3us operation.

It's a little flimsy on hard evidence, but that's the best argument for it I can provide. I'm fine with changing the switch-to-safe-ffi-at-128kib behavior to something else if you prefer though.

Lysxia · 2023-03-09T15:00:03Z

That's a very clear analysis! That makes sense to me. Confirming this further, the GHC User Manual also says:

This means that if you need to make a foreign call to a function that takes a long time or potentially blocks, then you should mark it safe and use -threaded. Some library functions make such calls internally; their documentation should indicate when this is the case.

On the other hand, a foreign call to a function that is guaranteed to take a short time, and does not call back into Haskell can be marked unsafe.

Bodigrim · 2023-03-18T18:48:42Z

@andrewthad could you please rebase?

This makes it possible for users to validate that a ByteArray is a well formed UTF-8 sequence. This works both with and without the SIMDUTF flag, falling back to a pure-Haskell implementation when SIMDUTF is off. Pick up a dependency on data-array-byte to be able to refer to the lifted ByteArray type on older versions of GHC. Use data-array-byte dep for emulated workflow.

andrewthad · 2023-03-20T12:51:54Z

I've just now squashed everything down and rebased.

src/Data/Text/Internal/Validate.hs

text.cabal

src/Data/Text/Internal/Validate/Simd.hs

andrewthad · 2023-04-06T12:16:15Z

I'm not able to make sense of the NetBSD test failure output.

andrewthad · 2023-04-18T13:27:06Z

Is there anything else that I can improve in this PR, or is this ready to be merged?

andrewthad mentioned this pull request Dec 27, 2022

UTF8 decode on unpinned bytes #479

Closed

andrewthad force-pushed the expose-simd-validation-functions branch from 21bda65 to ea6a3d6 Compare December 28, 2022 15:50

Bodigrim reviewed Dec 31, 2022

View reviewed changes

src/Data/Text/Internal/PrimCompat.hs Outdated Show resolved Hide resolved

andrewthad force-pushed the expose-simd-validation-functions branch 2 times, most recently from 7e72724 to 0780ba6 Compare January 5, 2023 15:25

Bodigrim reviewed Jan 5, 2023

View reviewed changes

Bodigrim reviewed Jan 8, 2023

View reviewed changes

src/Data/Text/Internal/Validate.hs Outdated Show resolved Hide resolved

src/Data/Text/Internal/Validate.hs Show resolved Hide resolved

src/Data/Text/Internal/Validate.hs Outdated Show resolved Hide resolved

andrewthad force-pushed the expose-simd-validation-functions branch from 8dfa76f to b7ffc31 Compare January 12, 2023 16:49

Lysxia added the API Addition (PVP minor) label Feb 9, 2023

Lysxia reviewed Mar 9, 2023

View reviewed changes

andrewthad force-pushed the expose-simd-validation-functions branch from 924ada3 to 5620537 Compare March 20, 2023 12:36

andrewthad force-pushed the expose-simd-validation-functions branch from 5620537 to e04674e Compare March 20, 2023 12:51

Bodigrim reviewed Mar 20, 2023

View reviewed changes

src/Data/Text/Internal/Validate.hs Outdated Show resolved Hide resolved

text.cabal Outdated Show resolved Hide resolved

text.cabal Show resolved Hide resolved

src/Data/Text/Internal/Validate/Simd.hs Outdated Show resolved Hide resolved

Resolve minor issues

dd4c6b3

Bodigrim approved these changes Apr 3, 2023

View reviewed changes

Bodigrim requested a review from Lysxia April 3, 2023 19:35

Lysxia approved these changes Apr 18, 2023

View reviewed changes

Bodigrim merged commit 6c2ad14 into haskell:master Apr 18, 2023
2 of 3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expose SIMD UTF-8 validation functions from internal module #483

Expose SIMD UTF-8 validation functions from internal module #483

andrewthad commented Dec 27, 2022

andrewthad commented Dec 27, 2022

andrewthad commented Dec 27, 2022

andrewthad commented Dec 28, 2022

andrewthad commented Jan 5, 2023

andrewthad commented Jan 5, 2023

andrewthad commented Jan 11, 2023

andrewthad commented Jan 12, 2023

andrewthad commented Jan 12, 2023

andrewthad commented Jan 12, 2023

Lysxia commented Jan 12, 2023 •

edited

Loading

Bodigrim commented Jan 12, 2023

andrewthad commented Jan 13, 2023

andrewthad commented Jan 19, 2023

andrewthad commented Jan 24, 2023

Bodigrim commented Jan 24, 2023

Bodigrim commented Feb 28, 2023

andrewthad commented Mar 8, 2023

Lysxia commented Mar 9, 2023 •

edited

Loading

Lysxia Mar 9, 2023

andrewthad Mar 9, 2023

andrewthad commented Mar 9, 2023

Lysxia commented Mar 9, 2023

Bodigrim commented Mar 18, 2023

andrewthad commented Mar 20, 2023

andrewthad commented Apr 6, 2023

andrewthad commented Apr 18, 2023

Expose SIMD UTF-8 validation functions from internal module #483

Expose SIMD UTF-8 validation functions from internal module #483

Conversation

andrewthad commented Dec 27, 2022

andrewthad commented Dec 27, 2022

andrewthad commented Dec 27, 2022

andrewthad commented Dec 28, 2022

andrewthad commented Jan 5, 2023

andrewthad commented Jan 5, 2023

andrewthad commented Jan 11, 2023

andrewthad commented Jan 12, 2023

andrewthad commented Jan 12, 2023

andrewthad commented Jan 12, 2023

Lysxia commented Jan 12, 2023 • edited Loading

Bodigrim commented Jan 12, 2023

andrewthad commented Jan 13, 2023

andrewthad commented Jan 19, 2023

andrewthad commented Jan 24, 2023

Bodigrim commented Jan 24, 2023

Bodigrim commented Feb 28, 2023

andrewthad commented Mar 8, 2023

Lysxia commented Mar 9, 2023 • edited Loading

Lysxia Mar 9, 2023

Choose a reason for hiding this comment

andrewthad Mar 9, 2023

Choose a reason for hiding this comment

andrewthad commented Mar 9, 2023

Lysxia commented Mar 9, 2023

Bodigrim commented Mar 18, 2023

andrewthad commented Mar 20, 2023

andrewthad commented Apr 6, 2023

andrewthad commented Apr 18, 2023

Lysxia commented Jan 12, 2023 •

edited

Loading

Lysxia commented Mar 9, 2023 •

edited

Loading