Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF8 decode on unpinned bytes #479

Closed
andrewthad opened this issue Dec 8, 2022 · 6 comments
Closed

UTF8 decode on unpinned bytes #479

andrewthad opened this issue Dec 8, 2022 · 6 comments

Comments

@andrewthad
Copy link
Contributor

I have, in the byteslice library, a type that looks like this:

data Bytes = Bytes
  { array :: {-# UNPACK #-} !ByteArray
  , offset :: {-# UNPACK #-} !Int
  , length :: {-# UNPACK #-} !Int
  }

This is the same thing as ByteString except that it doesn't require pinned memory and it cannot use memory that was allocated in C code. I'm trying to write this function (not in text, in my library):

decodeUtf8Bytes :: Text -> Maybe Bytes

The text library comes with a fast utf8 validation routine implemented in C++. However, it does not expose this in a way that lets me use it. To expose this, it would be sufficient to add this to text:

/* Add this to cbits/validate_utf8.cpp */
extern "C"
int _hs_text_is_valid_utf8_offset(const char* str, size_t off, size_t len){
  return simdutf::validate_utf8(str + off, len);
}

And a wrapper:

foreign import ccall unsafe "_hs_text_is_valid_utf8_offset" c_is_valid_utf8_offset
    :: ByteArray# -> CSize -> CSize -> IO CInt

With this wrapper, it becomes possible to perform UTF-8 validation of unpinned ByteArray# at arbitrary starting points.

If something like this were added to text, it could be exposed in an internal, unstable module. Let me know if this sounds like a welcome addition (and if it is, with some direction on where this should be exposed), and I can prepare a patch.

@Boarders
Copy link

Boarders commented Dec 9, 2022

I think this would be a welcome addition (especially as it will only be promised internally).

@Bodigrim
Copy link
Contributor

Bodigrim commented Dec 9, 2022

Looks reasonable to me.

@phadej
Copy link
Contributor

phadej commented Dec 9, 2022

You probably want both ccall safe and ccall unsafe variants of _hs_text_is_valid_utf8_offset.

For big enough ByteArray# it will be pinned, so it might be good idea to check whether length is big enough, then check that array is actually pinned, and go the safe route.

@andrewthad
Copy link
Contributor Author

I've opened a PR with this at #483.

One thing I realized as I was doing this is that I need to provide a fallback when the SIMDUTF flag is off. I need to add a variant of the isValidBS fallback that works on ByteArray# instead of ByteString. I don't think this is terribly difficult, but I've not done it yet.

@andrewthad
Copy link
Contributor Author

I've added the important missing stuff.

@Bodigrim
Copy link
Contributor

Closed by #483.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants