UTF8 decode on unpinned bytes #479

andrewthad · 2022-12-08T18:56:35Z

I have, in the byteslice library, a type that looks like this:

data Bytes = Bytes
  { array :: {-# UNPACK #-} !ByteArray
  , offset :: {-# UNPACK #-} !Int
  , length :: {-# UNPACK #-} !Int
  }

This is the same thing as ByteString except that it doesn't require pinned memory and it cannot use memory that was allocated in C code. I'm trying to write this function (not in text, in my library):

decodeUtf8Bytes :: Bytes -> Maybe Text

The text library comes with a fast UTF-8 validation routine implemented in C++. However, it does not expose this routine in a way that lets me use it on a ByteArray# with an offset. To expose it, it would be sufficient to add this to text:

/* Add this to cbits/validate_utf8.cpp */
extern "C"
int _hs_text_is_valid_utf8_offset(const char* str, size_t off, size_t len){
  return simdutf::validate_utf8(str + off, len);
}

And a wrapper:

foreign import ccall unsafe "_hs_text_is_valid_utf8_offset" c_is_valid_utf8_offset
    :: ByteArray# -> CSize -> CSize -> IO CInt

With this wrapper, it becomes possible to perform UTF-8 validation of unpinned ByteArray# at arbitrary starting points.

If something like this were added to text, it could be exposed in an internal, unstable module. Let me know if this sounds like a welcome addition (and if it is, with some direction on where this should be exposed), and I can prepare a patch.


Boarders · 2022-12-09T00:35:31Z

I think this would be a welcome addition (especially as it will only be promised internally).

Bodigrim · 2022-12-09T01:35:01Z

Looks reasonable to me.

phadej · 2022-12-09T11:04:16Z

You probably want both ccall safe and ccall unsafe variants of _hs_text_is_valid_utf8_offset.

A big enough ByteArray# will be pinned, so it might be a good idea to check whether the length is big enough, then check that the array is actually pinned, and go the safe route.
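A minimal sketch of that dispatch, for illustration only. A safe foreign call lets the garbage collector run during the call, so it should only be handed memory that cannot move, while an unsafe call holds up GC for its whole duration, which hurts on long inputs. The names `Route`, `chooseRoute`, and `safeCallThreshold` are hypothetical, the 4096-byte cutoff is an arbitrary placeholder rather than a measured value, and the sketch assumes the primitive package's `ByteArray`. It only decides which route to take; the two branches would then call the `ccall safe` or `ccall unsafe` import respectively.

```haskell
import Data.Primitive.ByteArray
  ( ByteArray, byteArrayFromList, isByteArrayPinned
  , newPinnedByteArray, unsafeFreezeByteArray )
import Data.Word (Word8)

-- | Which kind of foreign call to make (hypothetical type, for illustration).
data Route = UnsafeCall | SafeCall
  deriving (Eq, Show)

-- | Hypothetical cutoff in bytes; a real value would need benchmarking.
safeCallThreshold :: Int
safeCallThreshold = 4096

-- | Take the safe route only when the slice is long enough to be worth it
-- AND the array is pinned, so GC cannot move it during the call.
chooseRoute :: ByteArray -> Int -> Route
chooseRoute arr len
  | len >= safeCallThreshold && isByteArrayPinned arr = SafeCall
  | otherwise = UnsafeCall

-- | Small demonstration: a short unpinned array versus a large pinned one.
demo :: IO (Route, Route)
demo = do
  -- Contents are left uninitialized; only size and pinnedness matter here.
  big <- newPinnedByteArray 8192 >>= unsafeFreezeByteArray
  let small = byteArrayFromList (replicate 16 (0 :: Word8))
  pure (chooseRoute small 16, chooseRoute big 8192)
```

Checking the length first keeps the common short-input case on the cheap unsafe path without even inspecting pinnedness.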

andrewthad · 2022-12-27T15:18:24Z

I've opened a PR with this at #483.

One thing I realized as I was doing this is that I need to provide a fallback when the SIMDUTF flag is off. I need to add a variant of the isValidBS fallback that works on ByteArray# instead of ByteString. I don't think this is terribly difficult, but I've not done it yet.
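For illustration, here is one way a pure fallback over a ByteArray slice could look. This is a plain byte-range check following the well-formed UTF-8 table in the Unicode Standard, not necessarily the technique the isValidBS fallback in text uses. The name `isValidUtf8Slice` is hypothetical, and the sketch assumes the primitive package's `ByteArray` rather than a raw `ByteArray#`.

```haskell
import Data.Primitive.ByteArray
  ( ByteArray, byteArrayFromList, indexByteArray, sizeofByteArray )
import Data.Word (Word8)

-- | Hypothetical name: True when the slice [off, off + len) of the array
-- is well-formed UTF-8. Out-of-bounds slices are rejected.
isValidUtf8Slice :: ByteArray -> Int -> Int -> Bool
isValidUtf8Slice arr off len
  | off < 0 || len < 0 || off + len > sizeofByteArray arr = False
  | otherwise = go off
  where
    end = off + len

    byteAt :: Int -> Word8
    byteAt = indexByteArray arr

    -- True when index i is inside the slice and its byte lies in [lo, hi].
    inRange :: Int -> Word8 -> Word8 -> Bool
    inRange i lo hi = i < end && (let b = byteAt i in b >= lo && b <= hi)

    -- Ordinary continuation byte, 0x80..0xBF.
    cont :: Int -> Bool
    cont i = inRange i 0x80 0xBF

    -- Dispatch on the lead byte; the special cases exclude overlong
    -- encodings, surrogates, and code points above U+10FFFF.
    go :: Int -> Bool
    go i
      | i >= end = True
      | b <= 0x7F = go (i + 1)
      | b >= 0xC2 && b <= 0xDF = cont (i + 1) && go (i + 2)
      | b == 0xE0 = inRange (i + 1) 0xA0 0xBF && cont (i + 2) && go (i + 3)
      | b == 0xED = inRange (i + 1) 0x80 0x9F && cont (i + 2) && go (i + 3)
      | b >= 0xE1 && b <= 0xEF = cont (i + 1) && cont (i + 2) && go (i + 3)
      | b == 0xF0 = inRange (i + 1) 0x90 0xBF && cont (i + 2) && cont (i + 3) && go (i + 4)
      | b >= 0xF1 && b <= 0xF3 = cont (i + 1) && cont (i + 2) && cont (i + 3) && go (i + 4)
      | b == 0xF4 = inRange (i + 1) 0x80 0x8F && cont (i + 2) && cont (i + 3) && go (i + 4)
      | otherwise = False
      where
        b = byteAt i

-- | Example input: an invalid byte 0xFF followed by the ASCII text "hi".
example :: ByteArray
example = byteArrayFromList [0xFF, 0x68, 0x69 :: Word8]
```

With `example`, `isValidUtf8Slice example 1 2` is `True` while `isValidUtf8Slice example 0 3` is `False`, which is the reason to take the offset as a separate argument rather than validating the whole array.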

andrewthad · 2022-12-27T16:54:31Z

I've added the important missing stuff.

Bodigrim · 2024-02-21T01:39:01Z

Closed by #483.

andrewthad mentioned this issue Dec 27, 2022

Expose SIMD UTF-8 validation functions from internal module #483

Merged

Lysxia added the feature request label Feb 4, 2023

Bodigrim closed this as completed Feb 21, 2024
