Expose SIMD UTF-8 validation functions from internal module #483
I've added portable functions that work without SIMDUTF being turned on. This is ready for review.
Fixed a problem with GHC 8.0. I still need to add some tests to the test suite. The
Force-pushed from 21bda65 to ea6a3d6.
Just squashed this down to remove a bunch of meaningless commits.
Force-pushed from 7e72724 to 0780ba6.
Just squashed everything down.
This is ready for another round of review now that support for GHC 8.0 is gone. The only changes since last time are documentation. I've provided the motivation for the module in the module's haddock, and I've documented the behaviors of the functions.
Failure on GHC 8.6.5 looks spurious.
Force-pushed from 8dfa76f to b7ffc31.
Just squashed. This will kick off another round of CI.
CI is still failing, but I’m pretty sure that it’s spurious. This is ready for another round of review.
Wait, something really is wrong on the ppc64le architecture. For some reason, the dependency on the byte array compatibility library doesn’t get picked up, so the build ends up missing the module. I don’t see anything wrong in the Cabal file though.
The
@andrewthad necessary changes are in d0b70f7. I think you squashed a bit overzealously.
I’ve incorporated the diff manually. Thanks. Looks like that fixed it.
The only remaining CI failure seems unrelated to the change.
Is there any additional feedback, or is this good to go?
It's good to go, but on hold waiting for #474 to land (which in turn waits until we cut a release to accompany GHC 9.6).
Is there anything I can do to move this forward? Should I resolve the merge conflict, or are there concerns about this feature no longer fitting into the design of
We can still accept this. The bytearray validation functions are new. And while we now have a fancier bytestring validation function (now in
src/Data/Text/Internal/Validate.hs
-- @safe@ FFI call is executing. The byte array argument /must/ be pinned,
-- and the calling context is responsible for enforcing this. If the
-- byte array is not pinned, this function's behavior is undefined.
isValidUtf8ByteArraySafe ::
Instead of Unsafe/Safe, why not Unpinned/Pinned?
The Safe/Unsafe naming is confusing since "safe FFI" calls are unsafe and "unsafe FFI" calls are safe. The FFI functions themselves don't need to be renamed.
We could also consider Unsafe/Pinned, but this still gives the wrong impression that the pinned one is safer than Unsafe.
I think you're right. Pinned/unpinned is much more clear. I'll change this.
Responding to the comment about the 128KiB limit for switching from unsafe to safe FFI: there is no reason for that particular number (it's just an arbitrary large power of two). Here is everything that I understand about safe FFI calls. First, I'll provide a list of primary sources, and then I'll offer my interpretation of them. My interpretation could be wrong since I've never actually read the source code for this part of the GHC runtime.
Here is my analysis. The phrase "nesting of safe calls" in the GHC User Manual is talking about calling Haskell from C, but I interpret this entire section to mean either, or both, of these:
Regardless of which one of these is the case, it is certain that if you make a safe FFI call from an unbound thread (a green thread), the runtime has to change the execution context from the green thread's current capability's OS thread to a different OS thread. These steps must happen:
That's a lot of stuff going on. There are two pitfalls to the safe FFI:
For these reasons, I tend to be cautious about using the safe FFI for compute-intensive things. For anything that might take longer than 1ms, I think the safe FFI makes sense because blocking GC sync for a stop-the-world collector is really bad. But for things that are certainly under 500us, I think the unsafe FFI makes sense. The numbers 1ms and 500us are arbitrary. I cannot justify those with anything other than intuition about what makes sense. But for the 128KiB breakpoint that I chose, at a validation speed of 1B/ns (a conservative estimate), it takes roughly 130us to validate. I think that performing an uninterruptible (preventing GC across all threads) operation of roughly 130us ought to be fine. But maybe that threshold should be lower. Maybe 64KiB or 32KiB would be better. I do think that using "is it pinned?" alone to make the decision is not great. Everything above 3KiB is pinned, and I don't think that the pitfalls of the safe FFI are worth it just to make sure that GC can run concurrently with a 3us operation. The argument is a little flimsy on hard evidence, but it's the best one I can provide. I'm fine with changing the switch-to-safe-FFI-at-128KiB behavior to something else if you prefer, though.
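To make the trade-off concrete, here is a minimal sketch of the dispatch rule described in this comment: use the safe FFI only when the array is pinned and at least 128KiB long, and the unsafe FFI otherwise. The names `FfiKind`, `safeFfiThreshold`, `chooseFfi`, and `estimatedMicros` are hypothetical and do not come from the PR. Whether the real comparison is `>=` or `>` is also an assumption.

```haskell
-- Hypothetical sketch, not the PR's code.
data FfiKind = UnsafeFfi | SafeFfi
  deriving (Eq, Show)

-- The 128KiB breakpoint discussed above, in bytes.
safeFfiThreshold :: Int
safeFfiThreshold = 128 * 1024

-- A safe FFI call needs a pinned array, so unpinned input always
-- takes the unsafe path. Pinned input switches to the safe path
-- only once it reaches the threshold (assumed inclusive here).
chooseFfi :: Bool -> Int -> FfiKind
chooseFfi pinned len
  | pinned && len >= safeFfiThreshold = SafeFfi
  | otherwise = UnsafeFfi

-- Estimated validation time in microseconds at the assumed
-- throughput of 1 byte per nanosecond.
estimatedMicros :: Int -> Double
estimatedMicros len = fromIntegral len / 1000
```

Under the 1B/ns assumption, `estimatedMicros safeFfiThreshold` is 131.072 and `estimatedMicros (3 * 1024)` is 3.072, which matches the orders of magnitude quoted in the comment. Lowering the breakpoint to 64KiB or 32KiB would only change `safeFfiThreshold`.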
That's a very clear analysis! That makes sense to me. Confirming this further, the GHC User Manual also says:
@andrewthad could you please rebase?
Force-pushed from 924ada3 to 5620537.
This makes it possible for users to validate that a ByteArray is a well formed UTF-8 sequence. This works both with and without the SIMDUTF flag, falling back to a pure-Haskell implementation when SIMDUTF is off. Pick up a dependency on data-array-byte to be able to refer to the lifted ByteArray type on older versions of GHC. Use data-array-byte dep for emulated workflow.
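For illustration only, a pure-Haskell well-formedness check of the kind a non-SIMD fallback has to perform might look like the sketch below. This is not the PR's implementation: it operates on a list of bytes rather than a `ByteArray`, and the name `isValidUtf8Bytes` is hypothetical. The byte ranges follow the well-formed sequences table in RFC 3629, which rejects overlong encodings, surrogates, and code points above U+10FFFF.

```haskell
import Data.Word (Word8)

-- Hypothetical sketch: True when the bytes form well-formed UTF-8.
isValidUtf8Bytes :: [Word8] -> Bool
isValidUtf8Bytes = go
  where
    go [] = True
    go (b0 : rest)
      | b0 <= 0x7F = go rest                              -- ASCII
      | b0 >= 0xC2 && b0 <= 0xDF = lead 1 0x80 0xBF rest  -- 2-byte
      | b0 == 0xE0 = lead 2 0xA0 0xBF rest                -- no overlong 3-byte
      | b0 == 0xED = lead 2 0x80 0x9F rest                -- no surrogates
      | b0 >= 0xE1 && b0 <= 0xEF = lead 2 0x80 0xBF rest  -- other 3-byte
      | b0 == 0xF0 = lead 3 0x90 0xBF rest                -- no overlong 4-byte
      | b0 >= 0xF1 && b0 <= 0xF3 = lead 3 0x80 0xBF rest
      | b0 == 0xF4 = lead 3 0x80 0x8F rest                -- at most U+10FFFF
      | otherwise = False                                 -- 0x80..0xC1, 0xF5..0xFF

    -- The first continuation byte must lie in [lo, hi]; the remaining
    -- n - 1 continuation bytes must lie in 0x80..0xBF.
    lead :: Int -> Word8 -> Word8 -> [Word8] -> Bool
    lead n lo hi (b1 : rest)
      | b1 >= lo && b1 <= hi = conts (n - 1) rest
    lead _ _ _ _ = False

    conts :: Int -> [Word8] -> Bool
    conts 0 rest = go rest
    conts n (b : rest)
      | b >= 0x80 && b <= 0xBF = conts (n - 1) rest
    conts _ _ = False
```

A list-based check like this is far slower than indexing into a byte array directly; it is only meant to show which byte patterns a validator must accept and reject.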
Force-pushed from 5620537 to e04674e.
I've just now squashed everything down and rebased.
I'm not able to make sense of the NetBSD test failure output.
Is there anything else that I can improve in this PR, or is this ready to be merged?
This makes it possible for users to validate ByteArray# when compiling with the SIMDUTF8 flag on. It does not, however, provide any fallbacks for when that flag is off. That is left as future work.
Related to #479