Provide isValidUtf8 function #417
Comments
It would not be in good style. The good style is to parse with a hypothetical |
There is already The idea is that there are other packages than |
|
CC @kozross |
That would make sense to me. In fact, it'd make it easier to port my work I think. |
It would be nice if this feature could be included in the upcoming 0.11.2.0 release. @kozross, do I understand you correctly that you could provide an implementation? If so, roughly when could you provide a PR? |
@sjakobi I've requested a work day to get this wired in. If this gets approved, tomorrow. |
A bit of background, because this was mostly discussed elsewhere. The crucial piece of performant UTF8-backed |
You can't build current GHCs without a valid C++ compiler, btw. |
Reopening, since ideally we should have ARM64 support + happy path for ASCII text restored. |
Just to be clear:
|
@sjakobi It is supported, but only as the fallback. I can restore it easily enough. As far as the 'happy path' goes, it is blocked only on the fallback. I can restore it there as well, but I would argue it's not as pressing a problem. |
Look at the When reading ByteStrings from various streaming sources, the last few bytes of the string may be an incomplete UTF8 sequence, and "topping it up" may make it valid again. That said, there can also be strings that are invalid for reasons other than unexpected truncation, so if this is really the direction, one might return a negative length (say -1) for invalid strings, and a positive length for strings whose tail may yet be fixable. Of course there could also be more than one interface here. Thoughts??? |
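To make the distinction between "truncated" and "invalid" concrete, here is a minimal sketch in plain Haskell. It is not the C implementation discussed in this issue, and it uses a sum type instead of the signed length suggested above. The names `Utf8Check`, `checkUtf8` and `leadInfo` are hypothetical and exist only for illustration; the byte ranges follow RFC 3629.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Hypothetical three-way result, not part of bytestring or text.
data Utf8Check
  = Valid          -- the whole input is well-formed UTF-8
  | Truncated Int  -- valid prefix of this length, then an incomplete but so far legal sequence
  | Invalid Int    -- offset of the first sequence that can never become valid
  deriving (Eq, Show)

-- For a lead byte: total sequence length and the allowed range of the
-- second byte (the range is unused for single-byte sequences).
leadInfo :: Word8 -> Maybe (Int, Word8, Word8)
leadInfo w
  | w <= 0x7F = Just (1, 0x80, 0xBF)
  | w <  0xC2 = Nothing               -- stray continuation byte, or overlong C0/C1
  | w <= 0xDF = Just (2, 0x80, 0xBF)
  | w == 0xE0 = Just (3, 0xA0, 0xBF)  -- excludes overlong 3-byte forms
  | w == 0xED = Just (3, 0x80, 0x9F)  -- excludes UTF-16 surrogates
  | w <= 0xEF = Just (3, 0x80, 0xBF)
  | w == 0xF0 = Just (4, 0x90, 0xBF)  -- excludes overlong 4-byte forms
  | w <= 0xF3 = Just (4, 0x80, 0xBF)
  | w == 0xF4 = Just (4, 0x80, 0x8F)  -- nothing above U+10FFFF
  | otherwise = Nothing

checkUtf8 :: B.ByteString -> Utf8Check
checkUtf8 bs = go 0
  where
    n = B.length bs

    -- Examine the sequence starting at offset i.
    go i
      | i >= n = Valid
      | otherwise =
          case leadInfo (B.index bs i) of
            Nothing            -> Invalid i
            Just (len, lo, hi) -> cont i (i + 1) (len - 1) lo hi

    -- Check the remaining bytes of the sequence that began at 'start'.
    -- Only the second byte has a restricted range; later ones are 80..BF.
    cont start j remaining lo hi
      | remaining == 0     = go j
      | j >= n             = Truncated start
      | b >= lo && b <= hi = cont start (j + 1) (remaining - 1) 0x80 0xBF
      | otherwise          = Invalid start
      where
        b = B.index bs j
```

With this shape, a streaming caller could keep the bytes from the `Truncated` offset onward and prepend them to the next chunk, while `Invalid` signals that no further input can repair the string.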
@vdukhovni The biggest consumer of this is |
@vdukhovni Assuming that the stream is incomplete, but not malformed, a caller can detect the last complete element in O(1), see https://github.com/haskell/text/blob/7a492ecff429748386dbde7da0db45a0bfb8dcda/src/Data/Text/Encoding.hs#L208-L221 |
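As a rough illustration of why this is O(1): a UTF-8 sequence is at most 4 bytes long, so an incomplete tail can only start within the last 3 bytes. The sketch below is not the linked `text` code; `seqLen` and `splitIncomplete` are hypothetical names, and the function assumes, as stated above, that the input is truncated at worst and never malformed.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Sequence length announced by a lead byte; 0 marks a continuation byte.
-- No validation happens here: well-formed input is an assumption.
seqLen :: Word8 -> Int
seqLen w
  | w < 0x80  = 1
  | w < 0xC0  = 0
  | w < 0xE0  = 2
  | w < 0xF0  = 3
  | otherwise = 4

-- Split into (complete prefix, incomplete trailing sequence).
-- Inspects at most the last 3 bytes, so the cost does not depend on the length.
splitIncomplete :: B.ByteString -> (B.ByteString, B.ByteString)
splitIncomplete bs = B.splitAt (n - pending) bs
  where
    n = B.length bs
    pending = go 1

    -- k counts bytes back from the end (1 = last byte).
    go :: Int -> Int
    go k
      | k > 3 || k > n = 0           -- no lead byte close enough to the end to be incomplete
      | len == 0       = go (k + 1)  -- continuation byte: keep looking back
      | len > k        = k           -- lead byte announces more bytes than are present
      | otherwise      = 0           -- last sequence is complete
      where
        len = seqLen (B.index bs (n - k))
```

A caller would decode the first component and carry the second over to the next chunk. Detecting malformed input still needs a full validation pass over the prefix.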
This has been raised in the course of haskell/text#365 (comment).

The new `text` UTF8 engine is based on a function `isValidUtf8 :: ByteString -> Bool`, implemented in C. This is a valuable routine for other packages, but it looks strange to export it from `text`: the signature does not mention `Text` at all.

How do we feel about moving this function to `bytestring` proper? The caveat is that the implementation comes with a chunky C/C++ source.