Provide isValidUtf8 function #417

Bodigrim · 2021-09-04T23:52:46Z

This was raised in the course of haskell/text#365 (comment).
The new text UTF8 engine is based on a function isValidUtf8 :: ByteString -> Bool, implemented in C. This is a valuable routine for other packages, but it looks strange to export it from text: the signature does not mention Text at all.

How do we feel about moving this function to bytestring proper? The caveat is that the implementation comes with a chunky C/C++ source.
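For readers who want to see the proposed signature in use, here is a minimal sketch. It assumes a bytestring release that exports `Data.ByteString.isValidUtf8` (this issue was later assigned to the 0.11.2.0 milestone); the sample values and the name `checkSamples` are made up for illustration.

```haskell
import qualified Data.ByteString as B

-- "héllo" encoded as UTF-8: 'é' is the two-byte sequence C3 A9.
validSample :: B.ByteString
validSample = B.pack [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F]

-- A lone continuation byte (A9) can never start a code point.
invalidSample :: B.ByteString
invalidSample = B.pack [0x68, 0xA9, 0x6C]

-- Validate both samples; note that no Text is involved at any point.
checkSamples :: (Bool, Bool)
checkSamples = (B.isValidUtf8 validSample, B.isValidUtf8 invalidSample)
```

The point of the sketch is only that the function consumes and produces nothing Text-related, which is the argument for housing it in bytestring.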


kindaro · 2021-09-05T07:43:02Z

It would not be in good style. Good style is to parse with a hypothetical parseUtf8 ∷ ByteString → Maybe Text and downgrade to Bool as a special case. In what use case would one prefer to keep their data in a ByteString after the knowledge that it is Text has been expensively obtained? Is isValidUtf8 much cheaper than a hypothetical parseUtf8?

Bodigrim · 2021-09-05T10:32:41Z

There is already Data.Text.Encoding.decodeUtf8', similar to your parseUtf8.

The idea is that there are packages other than text that can benefit from fast UTF8 validation, for example text-short and utf8-string.
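The relationship between the two approaches can be sketched as follows. `parseUtf8` is the hypothetical name from the comment above, written here as a thin wrapper over `Data.Text.Encoding.decodeUtf8'`; `isValidViaDecode` is an illustrative name, not an existing API.

```haskell
import qualified Data.ByteString as B
import Data.Maybe (isJust)
import Data.Text (Text)
import Data.Text.Encoding (decodeUtf8')

-- Hypothetical parseUtf8: decodeUtf8' with the error detail discarded.
parseUtf8 :: B.ByteString -> Maybe Text
parseUtf8 = either (const Nothing) Just . decodeUtf8'

-- The "downgrade to Bool" special case. It depends on the text package
-- and builds a Text value only to throw it away.
isValidViaDecode :: B.ByteString -> Bool
isValidViaDecode = isJust . parseUtf8
```

A package such as text-short or utf8-string that only needs the Bool would have to depend on text to use this version, which is the motivation for a standalone validator.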

sjakobi · 2021-09-11T10:05:31Z

isValidUtf8 :: ByteString -> Bool sounds like a reasonable addition to me. 👍

Bodigrim · 2021-09-11T12:12:22Z

CC @kozross

kozross · 2021-09-11T17:04:51Z

That would make sense to me. In fact, I think it'd make it easier to port my work.

sjakobi · 2021-09-13T22:50:06Z

It would be nice if this feature could be included in the upcoming 0.11.2.0 release.

@kozross, do I understand you correctly that you could provide an implementation? If so, roughly when could you provide a PR?

kozross · 2021-09-13T22:58:01Z

@sjakobi I've requested a work day to get this wired in. If this gets approved, tomorrow.

Bodigrim · 2021-09-13T23:49:11Z

A bit of background, because this was mostly discussed elsewhere.

The crucial piece of a performant UTF8-backed Text is access to a fast validation routine. Even though Data.Text.Encoding.decodeUtf8 :: ByteString -> Text can just copy a valid UTF8 buffer, it must first ensure that the buffer is valid. This proves to be a non-trivial job. Currently text HEAD relies on simdutf for this. The problem is that simdutf is huge (~500K) and, more importantly, is C++. GHC does not really have a great story for linking to C++ modules, and text is a boot library that has to be built by both the stage 0 and stage 1 compilers, by both make and hadrian, etc. This also, strictly speaking, adds a C++ compiler to the toolchain required to build GHC. @kozross has been working for several months now on a pure C replacement for simdutf, and I hope that we'll be able to switch to his library to resolve the complications above.
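To give a sense of why validation is more than a bit-pattern check, here is a naive reference validator in plain Haskell. It follows the well-formed byte sequence table from the Unicode standard and is only a sketch for illustration: it is not the simdutf algorithm nor the C replacement mentioned above, and the name `isValidUtf8Ref` is made up.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Byte-at-a-time validator. Besides checking continuation bytes, it must
-- reject overlong forms (C0/C1, E0 80..9F, F0 80..8F), surrogates
-- (ED A0..BF) and code points above U+10FFFF (F4 90.. and F5..FF).
isValidUtf8Ref :: B.ByteString -> Bool
isValidUtf8Ref = go . B.unpack
  where
    go [] = True
    go (b0 : rest)
      | b0 <= 0x7F           = go rest
      | inRange 0xC2 0xDF b0 = cont 0x80 0xBF 0 rest
      | b0 == 0xE0           = cont 0xA0 0xBF 1 rest
      | inRange 0xE1 0xEC b0 = cont 0x80 0xBF 1 rest
      | b0 == 0xED           = cont 0x80 0x9F 1 rest
      | inRange 0xEE 0xEF b0 = cont 0x80 0xBF 1 rest
      | b0 == 0xF0           = cont 0x90 0xBF 2 rest
      | inRange 0xF1 0xF3 b0 = cont 0x80 0xBF 2 rest
      | b0 == 0xF4           = cont 0x80 0x8F 2 rest
      | otherwise            = False

    -- The first continuation byte must lie in [lo, hi]; it is followed
    -- by n more continuation bytes in the generic range [0x80, 0xBF].
    cont :: Word8 -> Word8 -> Int -> [Word8] -> Bool
    cont lo hi n (b1 : rest)
      | inRange lo hi b1 = plain n rest
    cont _ _ _ _ = False

    plain :: Int -> [Word8] -> Bool
    plain 0 bs = go bs
    plain n (b : bs)
      | inRange 0x80 0xBF b = plain (n - 1) bs
    plain _ _ = False

    inRange :: Word8 -> Word8 -> Word8 -> Bool
    inRange lo hi b = lo <= b && b <= hi
```

A validator of this shape inspects every byte with several branches each, which is the cost a vectorised C or C++ routine is meant to avoid.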

hasufell · 2021-10-30T08:51:16Z

> This also, strictly speaking, adds C++ compiler to a toolchain, required to build GHC.

You can't build current GHCs without a valid C++ compiler, btw.

Bodigrim · 2021-11-03T19:37:46Z

Reopening, since ideally we should have ARM64 support + happy path for ASCII text restored.

sjakobi · 2021-11-03T19:52:05Z

Just to be clear:

What's the current status of the ARM64 support in isValidUtf8? Is the function entirely disabled on this architecture?
What does it mean to restore the happy path for ASCII text?

kozross · 2021-11-03T21:15:50Z

@sjakobi It is supported, but only as the fallback. I can restore it easily enough. As far as the 'happy path' goes, it is blocked only on the fallback. I can restore it there as well, but I would argue it's not as pressing a problem.
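The thread does not spell out what the ASCII "happy path" looks like. One common reading is a cheap first pass that skips leading ASCII bytes before the full validator runs, and the sketch below illustrates that idea at the Haskell level only. It assumes a bytestring release that exports `Data.ByteString.isValidUtf8`, the name `isValidUtf8AsciiFirst` is made up, and it says nothing about how the C code in the linked PRs is structured.

```haskell
import qualified Data.ByteString as B

-- Every byte below 0x80 is a complete one-byte code point, so dropping
-- an ASCII prefix leaves a remainder that starts on a code point
-- boundary. The whole input is valid exactly when that remainder is.
isValidUtf8AsciiFirst :: B.ByteString -> Bool
isValidUtf8AsciiFirst bs = B.null rest || B.isValidUtf8 rest
  where
    rest = B.dropWhile (< 0x80) bs
```

For input that is entirely ASCII, the full validator is never invoked, which is why such a path matters for typical text.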

vdukhovni · 2021-11-14T00:15:00Z

Looking at the isValidUtf8 API, I wonder whether, rather than returning a Bool result, it'd be more useful to return the length of the longest leading valid UTF8 prefix of the string.

When reading ByteStrings from various streaming sources, the last few bytes of the string may be an incomplete UTF8 sequence, and "topping it up" may make it valid again.

That said, there can also be strings that are invalid for reasons other than unexpected truncation, so if this is really the direction, one might return a negative length (say -1) for invalid strings, and a positive length for strings whose tail may yet be fixable.

Of course there could also be more than one interface here. Thoughts???

kozross · 2021-11-14T00:35:11Z

@vdukhovni The biggest consumer of this is text, which does replacement of invalid sequences anyway. Therefore, an extended API doesn't help much, which is why I didn't provide it.
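The replacement behaviour referred to here can be seen with `decodeUtf8With` and `lenientDecode` from the text package. This is a small sketch with made-up names (`decodeLenient`, `sample`, `replacementCount`), not a description of how text uses the validator internally.

```haskell
import qualified Data.ByteString as B
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Decode, substituting U+FFFD for input that is not valid UTF-8.
decodeLenient :: B.ByteString -> Text
decodeLenient = decodeUtf8With lenientDecode

-- 0xFF can never occur in UTF-8, so it is replaced rather than rejected.
sample :: Text
sample = decodeLenient (B.pack [0x61, 0xFF, 0x62])

-- How many replacement characters ended up in the decoded sample.
replacementCount :: Int
replacementCount = T.count (T.singleton '\xFFFD') sample
```

A caller that repairs input this way only needs a yes/no answer to decide between the fast copy and the slow repairing path, which is why a prefix length adds little for this consumer.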

Bodigrim · 2021-11-14T00:48:57Z

@vdukhovni Assuming that the stream is incomplete but not malformed, a caller can detect the last complete element in O(1); see https://github.com/haskell/text/blob/7a492ecff429748386dbde7da0db45a0bfb8dcda/src/Data/Text/Encoding.hs#L208-L221
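The O(1) detection can be sketched as follows. This is an independent illustration, not a copy of the linked text code: the names `splitIncompleteTail` and `validChunk` are made up, and `Data.ByteString.isValidUtf8` is assumed to be available. Only the last three bytes are inspected, because a UTF-8 sequence is at most four bytes long.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Split a chunk into (complete part, unfinished trailing sequence).
splitIncompleteTail :: B.ByteString -> (B.ByteString, B.ByteString)
splitIncompleteTail bs = B.splitAt (len - pending 1) bs
  where
    len = B.length bs

    -- Walk back over at most three bytes looking for a lead byte that
    -- announces more bytes than are present. Returns the tail length.
    pending :: Int -> Int
    pending i
      | i > 3 || i > len     = 0
      | isContinuation b     = pending (i + 1)
      | expectedLength b > i = i
      | otherwise            = 0
      where
        b = B.index bs (len - i)

isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Sequence length announced by a non-continuation byte. Invalid lead
-- bytes are not rejected here; that is left to the validator.
expectedLength :: Word8 -> Int
expectedLength b
  | b < 0x80  = 1
  | b < 0xE0  = 2
  | b < 0xF0  = 3
  | otherwise = 4

-- Validate the complete part and hand back the tail to prepend to the
-- next chunk. Nothing means the chunk is malformed, not just truncated.
validChunk :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
validChunk bs
  | B.isValidUtf8 complete = Just (complete, rest)
  | otherwise              = Nothing
  where
    (complete, rest) = splitIncompleteTail bs
```

A held-back tail is not itself validated, so a malformed tail such as E0 80 is only reported once the next chunk arrives and the combined bytes are checked.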

Bodigrim added the discussion/rfc label Sep 4, 2021

Bodigrim mentioned this issue Sep 4, 2021

Switch internal representation to UTF8 haskell/text#365

Merged

sjakobi added the enhancement label Sep 11, 2021

Bodigrim mentioned this issue Sep 13, 2021

Update changelog and since pragmas #422

Merged

kozross mentioned this issue Sep 15, 2021

[WIP] isValidUtf8 function, tests #423

Merged

Bodigrim closed this as completed in #423 Nov 3, 2021

Bodigrim reopened this Nov 3, 2021

kozross mentioned this issue Nov 6, 2021

Aarch64 optimized UTF-8 validator #437

Merged

Bodigrim linked a pull request Dec 4, 2021 that will close this issue

Uncomment happy path for ASCII validation and fix big-endian issues #445

Merged

Bodigrim added this to the 0.11.2.0 milestone Dec 4, 2021

Bodigrim linked a pull request Dec 4, 2021 that will close this issue

Aarch64 optimized UTF-8 validator #437

Merged

Bodigrim closed this as completed in #445 Dec 6, 2021
