Allow caller more control in stream decoding #448

david-sledge · 2022-06-27T05:45:07Z

I've been wanting this feature for a while, so I decided to prototype it and create a pull request.

Lysxia · 2022-06-27T10:44:57Z

Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

david-sledge · 2022-06-27T13:42:24Z

> Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

State (along with reader) was secondary. My primary motivation was to support multiple monads in a stack (I use IO and continuation frequently). Specializing is not ideal, since there would have to be a different version not just for each monad, but for each combination of monads.

Lysxia · 2022-06-27T17:08:56Z

STT is notably unsafe with continuations.

Bodigrim · 2022-06-27T17:53:37Z

text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,

decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Then clients are free to interpret Int (the position, where decoding failed) as they like, in whatever monad.
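To illustrate the API shape being proposed here, the following is a minimal sketch, not the signature from the comment above: it is restricted to ASCII so that it stays short, takes a single ByteString, and returns only the decoded Text. The names decodeAsciiEither, decodeOrNothing and decodeOrReport are hypothetical. The point is that a Left position lets each caller choose its own way to abort.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for the proposed shape, limited to ASCII:
-- Left carries the offset of the first byte that cannot be decoded.
decodeAsciiEither :: B.ByteString -> Either Int Text
decodeAsciiEither bs =
  case B.findIndex (>= 0x80) bs of
    Just pos -> Left pos
    Nothing  -> Right (T.pack (BC.unpack bs))

-- One caller aborts in Maybe and ignores the position...
decodeOrNothing :: B.ByteString -> Maybe Text
decodeOrNothing = either (const Nothing) Just . decodeAsciiEither

-- ...another reports the position in IO, with no exception involved.
decodeOrReport :: B.ByteString -> IO Text
decodeOrReport bs = case decodeAsciiEither bs of
  Left pos -> do
    putStrLn ("invalid byte at offset " ++ show pos)
    pure T.empty
  Right t -> pure t
```

Because the decoder itself is pure, it never needs to know which monad the caller lives in.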

david-sledge · 2022-07-01T05:12:30Z

> text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

Removed.

> I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,
>
> decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Done (though not in that exact manner). Check out the test case t_decode_withM_error5 in tests/Tests/Properties/Transcoding.hs. There are other examples, too, for Cont and Maybe. What's more, the error handlers of the additional functions (the ones ending with M) take the byte position where the error occurred.

Lysxia · 2022-07-01T06:57:16Z

This still has the same unsafety as STT, since it's just an inlining of the STT logic. You can cause a segfault by specializing to ContT and by using a continuation twice.
Taking a callback is fundamentally flawed, so the current implementation cannot simply be generalized to an arbitrary monad. Changing the API as Bodigrim suggested seems like a better option (you may also remember the part that was decoded so far).
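The hazard can be reproduced in miniature without touching text internals. The sketch below is an analogy under stated assumptions, not the PR's code: an IORef stands in for the decoder's mutable output buffer, and resumeTwice is a hypothetical error handler in ContT that invokes its continuation twice. Both resumptions write into the same buffer, so the second one sees the first one's writes.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Cont (ContT (..), evalContT)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical "error handler" that resumes its continuation twice.
resumeTwice :: ContT () IO Int
resumeTwice = ContT $ \k -> k 1 >> k 2

-- The IORef stands in for a mutable output buffer shared by the decoder.
-- Returns the buffer contents as observed by each resumption, in order.
sharedBufferDemo :: IO [[Int]]
sharedBufferDemo = do
  buf  <- newIORef [0]
  seen <- newIORef []
  evalContT $ do
    x <- resumeTwice
    lift (modifyIORef buf (x :))          -- "decoder" writes after resuming
    snapshot <- lift (readIORef buf)
    lift (modifyIORef seen (snapshot :))
  reverse <$> readIORef seen
-- → [[1,0],[2,1,0]]: the second resumption inherits the first one's write
```

With an IORef the result is merely surprising. With an array that has already been frozen and handed out as immutable, the same pattern breaks referential transparency and can corrupt memory.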

david-sledge · 2022-07-02T01:05:43Z

Ahhh... I see what you mean, and it'd be a problem with the list monad, too. However, I noticed that streamDecodeUtf8With returns a continuation in the Decoding data type. I looked at how the risk of calling that continuation multiple times is mitigated. Essentially, the continuation is a wholly separate invocation of a function. It calls runST in which the text array is created, populated, frozen, embedded in the Text data type, and returned without being modified any further.

So I applied the same concept around the call to the error handler. Before the error handler is called, it runs through the same steps it performs just before exiting: the text array is frozen and no longer modified, and the reference to the last state value is discarded. Then, after the error handler returns, it simulates runST with a call to runRW#, makes a copy of the text array where it left off, and continues its operations on the copy.
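The freeze-then-copy idea can be sketched with the array package instead of text's internal arrays. This is an illustration only: resumeFrom is a hypothetical name, and it uses ordinary runSTUArray rather than runRW#. Since thaw copies the frozen snapshot, resuming from the same snapshot any number of times never mutates it.

```haskell
import Data.Array.ST (runSTUArray, thaw, writeArray)
import Data.Array.Unboxed (UArray, elems, listArray)
import Data.Word (Word8)

-- Resume from a frozen snapshot by writing one more byte at index i.
-- thaw makes a fresh mutable copy, so the snapshot stays untouched.
resumeFrom :: UArray Int Word8 -> Int -> Word8 -> UArray Int Word8
resumeFrom frozen i w = runSTUArray (do
  arr <- thaw frozen
  writeArray arr i w
  pure arr)

-- Example snapshot: two bytes already written, one slot still free.
exampleSnapshot :: UArray Int Word8
exampleSnapshot = listArray (0, 2) [1, 2, 0]
```

Calling resumeFrom exampleSnapshot 2 7 and resumeFrom exampleSnapshot 2 9 yields two independent arrays, at the cost of one copy per resumption.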

Lysxia · 2022-07-02T07:19:34Z

Changing the API to not take a callback at all and instead return an Either that tells you exactly where it stopped would still be much simpler, safer, and faster. I would prefer to consider that first.

Passing the State# token explicitly like this, while in an arbitrary monad, makes this quite difficult to review with confidence.

david-sledge · 2022-07-03T05:08:28Z

While I understand the desire to use Either, that doesn't really get me closer to my goal, which is why I was going for an arbitrary monad, which would meet my goal and still allow for Either. Since the insert-your-favorite-monads-here option has been taken off the table, I found myself at an impasse. So I took a step back, looked at the issue again, and reevaluated the problem.

With the requirements and constraints of

- Allow a means to abort decoding without raising an error.
- No STT.
- No arbitrary monad implementation.
- No manually passing the State# token.
- Indicate the position in the source ByteString where the error occurs.

I think I found a solution that'll fit the bill.

Lysxia · 2022-07-03T08:40:59Z

This is actually close to what we were suggesting, and indeed much more satisfactory.

david-sledge · 2022-07-03T12:41:42Z

Excellent! A couple of questions:

1. The byte position does not currently track across continuation invocations and instead resets back to zero. What's the opinion on whether it should track, or whether the caller should be responsible for tracking the cumulative byte position/total bytes read?
2. What version should the @since annotation be for streamDecodeUtf8With'? 2.0.1, 2.1.0, or do I not need to worry about it yet?

david-sledge · 2022-07-03T19:17:25Z

For the first question, I went ahead and let the library track the absolute position for the caller.
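As a rough sketch of what tracking the absolute position involves (the names checkChunk and checkStream are hypothetical, and the per-chunk check only rejects non-ASCII bytes to keep the example small): the offset reported for a chunk is added to the number of bytes consumed by all earlier chunks.

```haskell
import qualified Data.ByteString as B

-- Hypothetical per-chunk check: Left is an offset within this chunk only.
checkChunk :: B.ByteString -> Either Int ()
checkChunk bs = maybe (Right ()) Left (B.findIndex (>= 0x80) bs)

-- Turn chunk-relative offsets into stream-absolute ones by carrying the
-- number of bytes consumed by the chunks seen so far.
checkStream :: [B.ByteString] -> Either Int ()
checkStream = go 0
  where
    go _ [] = Right ()
    go consumed (c : cs) = case checkChunk c of
      Left pos -> Left (consumed + pos)
      Right () -> go (consumed + B.length c) cs
```

For the chunks [97, 98] and [99, 0xff], the bad byte is at offset 1 of the second chunk and at offset 3 of the stream, so checkStream returns Left 3.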

david-sledge · 2022-07-06T20:23:48Z

Stream decoders added for utf-16 and utf-32, and an Either decoder for ASCII. I'll work on the lazy options next.

Bodigrim · 2022-07-19T19:43:07Z

We are getting closer to the final cut off before GHC 9.4. @Lysxia could you please complete your review on this? I don't have much bandwidth at the moment.

Lysxia · 2022-07-19T21:01:42Z

@Bodigrim How much time is left for the cut off? I'll allocate more cycles to shepherd this PR (sorry for dragging this @david-sledge !) but I feel like this is going to take a couple more rounds.

Bodigrim · 2022-07-19T21:07:20Z

I'm not sure exactly. Given that @bgamari has not chased us yet, it would probably take at least a week or more.

Bodigrim · 2022-07-21T19:50:34Z

Given #453 (comment), I'll cut text-2.0.1 from the current master, and this PR will be delayed until text-2.0.2. I'm happy to release text-2.0.2 fairly soon, in time for GHC 9.4.2. Sorry @david-sledge, I hope it's not too much of a delay for your purposes.

david-sledge · 2022-07-29T02:43:43Z

This branch no longer uses simdutf. Should all of it and references to it be removed, or is it worth modifying simdutf to have a determine_utf8_prefix_length function?

Lysxia · 2022-07-29T06:34:30Z

simdutf is there for performance, so if it is no longer used that needs to be justified by benchmarks. I suspect it's going to be difficult to compete with a vectorized implementation.

Lysxia · 2023-02-06T04:01:56Z

I "accepted" my own change to remove the "requested change" flag in the Github UI. Please review :)

And sorry for the messy git history. I'm assuming this is going to be squashed.

I added some optimizations so the benchmarks run at least as fast as before. Most of the time should be spent looping in C++, so there should be no change (confirmed by getting a screenful of "same as baseline" where "baseline" is master), except for the "tiny" benchmark where the work in entering and exiting the function dominates. Somehow the "LazyText" benchmark for the "tiny" input is consistently 25% faster even on my noisy machine. (The strict "Text" benchmark for "tiny" is "same as baseline"; it used to be 3x slower, and that's what my last round of commits fixes).

  Pure
    tiny
      decode
        LazyText: OK (1.20s)
          142  ns ± 7.0 ns, 28% less than baseline
      decode'
        LazyText: OK (0.39s)
          176  ns ± 5.6 ns, 27% less than baseline

Note that the benchmarks test valid UTF-8. When there are invalid bytes, the old code performs badly (quadratically) anyway as reported in issue #495. This PR fixes #495.

Bodigrim

Just a couple of suggestions, otherwise looks very good.

Bodigrim

Great job!

Lysxia · 2023-02-06T23:57:20Z

Thanks to @david-sledge for getting this started and your great work!

Lysxia · 2023-02-07T11:12:31Z

The test failure is due to a mismatch between Unicode versions (text = Unicode 14, GHC 8.6 = ???). We probably should disable those tests for old GHC (and remain notified when new GHC/Unicode break this test again).

Bodigrim · 2023-02-07T21:55:09Z

Yeah, t_toCaseFold_char fails on '\66937', which is 'VITHKUQI CAPITAL LETTER FE' (U+10579), added in Unicode 14.0. The test should be guarded with #if MIN_VERSION_base(4,16,0).

@Lysxia could you please rebase?

Lysxia · 2023-02-07T22:09:22Z

Uh, I can if that's what we want to do. But I thought it would be cleaner to squash it. In the Github UI "Rebase and merge" is grayed out but we can still just select "Squash and merge". Is that okay?

Bodigrim · 2023-02-07T22:11:25Z

Ah, I see. Yes, squashing is OK, please go ahead.

414owen · 2024-03-19T11:13:59Z

Hi @david-sledge,

I want to decode some utf8 from a socket, and abort early if I get any invalid data.
The haddock states that the second element of the tuple returned by decodeUtf8{Chunk,More} is the

> undecoded remainder of the given chunk, for diagnosing errors and resuming

It looks like the suffix can also contain input that wasn't parsed because the parser needs more bytes to determine whether it has a valid code point. Is that correct?
If so, {can,how do} I distinguish between a suffix that's due to a unicode error, and one that's due to a lack of input?

Lysxia · 2024-03-19T11:18:32Z

@414owen You can look at the third component of the returned triple, which has type Maybe Utf8State.

In the doc of decodeUtf8More:

> Just the new state, or Nothing if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).
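A minimal sketch of that check, assuming text-2.0.2 or later, where decodeUtf8Chunk is exported from Data.Text.Encoding and returns the triple described in the haddock quoted above. The ChunkEnd type and chunkEnd function are hypothetical names introduced for illustration.

```haskell
import qualified Data.ByteString as B
import Data.Text.Encoding (decodeUtf8Chunk)

-- Hypothetical classification of how decoding of a chunk ended.
data ChunkEnd
  = ValidSoFar   -- third component was Just: no invalid byte was seen
  | InvalidByte  -- third component was Nothing: abort or skip bytes
  deriving (Eq, Show)

-- Decide based on the third component only; the undecoded remainder is
-- returned alongside for diagnostics.
chunkEnd :: B.ByteString -> (ChunkEnd, B.ByteString)
chunkEnd bs = case decodeUtf8Chunk bs of
  (_, rest, Nothing) -> (InvalidByte, rest)
  (_, rest, Just _)  -> (ValidSoFar, rest)
```

For a socket reader that must abort early, InvalidByte is the signal to stop. In the ValidSoFar case the returned state would be passed to decodeUtf8More together with the next chunk.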

david-sledge changed the title ~~Monad support for stream decoding~~ Allow caller more control in stream decoding Jul 3, 2022

Lysxia reviewed Jul 7, 2022

Bodigrim reviewed Jul 17, 2022

Bodigrim reviewed Jul 17, 2022

Bodigrim reviewed Jul 17, 2022

Bodigrim reviewed Jul 17, 2022

Lysxia reviewed Jul 19, 2022

Lysxia reviewed Jul 20, 2022

Bodigrim mentioned this pull request Jul 20, 2022

Avoid usage of __builtin_popcountll when -simdutf is unset #453

Merged

Lysxia reviewed Jul 25, 2022

Lysxia reviewed Jul 25, 2022

Lysxia added 6 commits February 6, 2023 00:51

Make StrictBuilder module, explicit exports

7656b01

Revert changes to Data.Text.Internal.Encoding.Utf8

f663a6e

Docs

bc65921

Fix short-circuit

28a7460

Doc

9e8e0f9

Sort imports

1f5a873

Lysxia approved these changes Feb 6, 2023

Lysxia added 2 commits February 6, 2023 03:09

Undo useless optimization

72f91ee

import Semigroup for old base

d347bba

Lysxia force-pushed the streamDecodeUtf8WithM branch from 2862fed to d347bba Compare February 6, 2023 03:51

Lysxia added 4 commits February 6, 2023 04:04

Merge remote-tracking branch 'origin/master' into streamDecodeUtf8WithM

82bcef3

Clean up imports

9a0c8a8

Minimize test diff

64fd029

test: sort imports

e136cad

Lysxia force-pushed the streamDecodeUtf8WithM branch from a2c7f26 to e136cad Compare February 6, 2023 07:26

Bodigrim reviewed Feb 6, 2023

Apply suggestions

74db3eb

Bodigrim approved these changes Feb 6, 2023

Lysxia merged commit 7ef771d into haskell:master Feb 7, 2023

raehik mentioned this pull request Feb 15, 2023

No safe decodeASCII :: ByteString -> Maybe Text #496

Closed

Bodigrim mentioned this pull request Feb 28, 2023

Expose SIMD UTF-8 validation functions from internal module #483

Merged

Bodigrim mentioned this pull request Feb 21, 2024

streamDecodeUtf8 and streamDecodeUtf8With defeat the purpose of the ByteString field in Some Text ByteString (ByteString -> Decoding) #60

Closed
