Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide isValidUtf8 function #417

Closed
Bodigrim opened this issue Sep 4, 2021 · 15 comments · Fixed by #423, #445 or #437
Closed

Provide isValidUtf8 function #417

Bodigrim opened this issue Sep 4, 2021 · 15 comments · Fixed by #423, #445 or #437

Comments

@Bodigrim
Copy link
Contributor

Bodigrim commented Sep 4, 2021

This has been raised in course of haskell/text#365 (comment)
New text UTF8 engine is based on a function isValidUtf8 :: ByteString -> Bool, implemented in C. This is a valuable routine for other packages, but it looks strange to export it from text: the signature does not mention Text at all.

How do we feel about moving this function to bytestring proper? Caveat is that the implementation comes with a chunky C/C++ source.

@kindaro
Copy link
Contributor

kindaro commented Sep 5, 2021

It would not be in good style. The good style is to parse with a hypothetical parseUtf8 ∷ ByteString → Maybe Text and downgrade to Bool as a special case. In what a use case would one prefer to keep their data in a ByteString after the knowledge that it is Text has been expensively obtained? Is isValidUtf8 much cheaper than a hypothetical parseUtf8?

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Sep 5, 2021

There is already Data.Text.Encoding.decodeUtf8', similar to your parseUtf8.

The idea is that there are other packages than text, which can benefit from fast utf8 validation. For example, text-short and utf8-string.

@sjakobi
Copy link
Member

sjakobi commented Sep 11, 2021

isValidUtf8 :: ByteString -> Bool sounds like a reasonable addition to me. 👍

@Bodigrim
Copy link
Contributor Author

CC @kozross

@kozross
Copy link
Contributor

kozross commented Sep 11, 2021

That would make sense to me. In fact, it'd make it easier to port my work I think.

@sjakobi
Copy link
Member

sjakobi commented Sep 13, 2021

It would be nice if this feature could be included in the upcoming 0.11.2.0 release.

@kozross, do I understand you correctly that you could provide an implementation? If so, roughly when could provide a PR?

@kozross
Copy link
Contributor

kozross commented Sep 13, 2021

@sjakobi I've requested a work day to get this wired in. If this gets approved, tomorrow.

@Bodigrim
Copy link
Contributor Author

A bit of background, because this was mostly discussed elsewhere.

The crucial piece of performant UTF8-backed Text is an access to a fast validation routine. Even while Data.Text.Encoding.decodeUtf8 :: ByteString -> Text can just copy a valid UTF8 buffer, it must ensure that the buffer is valid first. This proves to be a non-trivial job. Currently text HEAD relies on simdutf for this. The problem is that simdutf is huge (~500K) and, more importantly, is C++. GHC does not really have a great story of linking to C++ modules, and text is a boot library to be built both in stage 0 and stage 1 compiler, both by make and by hadrian, etc. This also, strictly speaking, adds C++ compiler to a toolchain, required to build GHC. @kozross has been working for several months by now on a pure C replacement for simdutf, and I hope that we'll be able to switch to his library to resolve complications above.

@hasufell
Copy link
Member

This also, strictly speaking, adds C++ compiler to a toolchain, required to build GHC.

You can't build current GHCs without a valid C++ compiler, btw.

@Bodigrim
Copy link
Contributor Author

Bodigrim commented Nov 3, 2021

Reopening, since ideally we should have ARM64 support + happy path for ASCII text restored.

@Bodigrim Bodigrim reopened this Nov 3, 2021
@sjakobi
Copy link
Member

sjakobi commented Nov 3, 2021

Just to be clear:

  1. What's the current status of the ARM64 support in isValidUtf8? Is the function entirely disabled on this architecture?
  2. What does it mean to restore the happy path for ASCII text?

@kozross
Copy link
Contributor

kozross commented Nov 3, 2021

@sjakobi It is supported, but only as the fallback. I can restore it easily enough. As far as the 'happy path' goes, it is blocked only on the fallback. I can restore it there as well, but I would argue it's not as pressing a problem.

@vdukhovni
Copy link
Contributor

Look at the isValidUtf8 API, I wonder whether perhaps rather than returning a Bool result it'd be more useful to return the length of the longest leading valid UTF8 prefix of the string.

When reading ByteStrings from various streaming sources, the last few bytes of the string may be an incomplete UTF8 sequence, and "topping it up" may make it valid again.

Said, there can also be strings that are invalid for reasons other than unexpected truncation, so if this is really the direction, one might return a negative length (say -1) for invalid strings, and a positive length for strings whose tail may yet be fixable.

Of course there could also be more than one interface here. Thoughts???

@kozross
Copy link
Contributor

kozross commented Nov 14, 2021

@vdukhovni The biggest consumer of this is text, which does replacement of invalid sequences anyway. Therefore, an extended API doesn't help much, which is why I didn't provide it.

@Bodigrim
Copy link
Contributor Author

@vdukhovni Assuming that the stream is incomplete, but not malformed a caller can detect the last complete element in O(1), see https://github.com/haskell/text/blob/7a492ecff429748386dbde7da0db45a0bfb8dcda/src/Data/Text/Encoding.hs#L208-L221

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Dec 4, 2021
@Bodigrim Bodigrim linked a pull request Dec 4, 2021 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment