Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow caller more control in stream decoding #448

Merged
merged 96 commits into from
Feb 7, 2023

Conversation

david-sledge
Copy link
Contributor

I've been wanting this feature for a while, so I decided to prototype it and create a pull request.

@Lysxia
Copy link
Contributor

Lysxia commented Jun 27, 2022

Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

@david-sledge
Copy link
Contributor Author

Rather than STT, safer alternatives would be to specialize this to one specific monad (was state the original motivation?) or to use a different algorithm that does not involve STT.

State (along with reader) was secondary. My primary motivation was to multiple monads in a stack (I use IO and continuation frequently). Specializing is not ideal, since there'd have to be a different one not just for each monad, but each combination of monads.

@Lysxia
Copy link
Contributor

Lysxia commented Jun 27, 2022

STT is notably unsafe with continuations.

@Bodigrim
Copy link
Contributor

text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,

decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Then clients are free to interpret Int (the position, where decoding failed) as they like, in whatever monad.

@david-sledge
Copy link
Contributor Author

david-sledge commented Jul 1, 2022

text is a boot library, it cannot depend on STMonadTrans package, unfortunately.

Removed.

I'm not very happy with the API of Data.Text.Encoding. The fact that the only way to abort decoding is to raise an error (and catch it elsewhere) is a joke. Ideally it should be possible to drive decoding from without, e. g.,

decodeUtf8With2 :: ByteString -> ByteString -> Either Int (Text, ByteString)

Done (though not in that exact manner). Check out the test case t_decode_withM_error5 in tests/Tests/Properties/Transcoding.hs. There other examples, too, for Cont and Maybe. What's more is that the error handlers of the additional functions (the ones ending with M) take the byte position where the error occurred.

@Lysxia
Copy link
Contributor

Lysxia commented Jul 1, 2022

This still has the same unsafety as STT, since it's just an inlining of the STT logic. You can cause a segfault by specializing to ContT and by using a continuation twice.
Taking a callback is fundamentally flawed, so that the current implementation cannot simply be generalized to be in an arbitrary monad. Changing the API as bodigrim suggested seems like a better option (you may also remember the part that was decoded so far).

@david-sledge
Copy link
Contributor Author

Ahhh... I see what you mean, and it'd be a problem with the list monad, too. However, I noticed that streamDecodeUtf8With returns a continuation in the Decoding data type. I looked how the risk of calling that continuation multiple times is mitigated. Essentially, the continuation a wholly separate invocation of a function. It calls runST in which the text array is created, populated, frozen, embedded in the Text data type, and returned without being modified any further.

So I applied the same concept around the call to the error handler. Before the error handler is called, it runs through the processes just before exiting; the text array is frozen, no longer modified, and the reference to last state value is discarded. Then after the error handler returns, it simulates runST with a call to runRW#, makes a copy of the text array where it left off, and continues its operations on the copy.

@Lysxia
Copy link
Contributor

Lysxia commented Jul 2, 2022

Changing the API to not take a callback at all and instead return an Either that tells you exactly where it stopped would still be much simpler, safer, and faster. I would prefer to consider that first.

Passing the State# token explicitly like this, while in an arbitrary monad, makes this quite difficult to review with confidence.

@david-sledge
Copy link
Contributor Author

While I understand the desire to use Either, that doesn't really get me closer to my goal, which is why I was going for arbitrary monad which would meet my goal and allow for Either. Since insert-your-favorite-monads-here option has been removed from the table, I found myself at an impasse. So I took a step back, looked at the issue again, and reevaluated the problem.

With the requirements and constraints of

  • Allow a means to abort decoding without raising an error.
  • No STT.
  • No arbitrary monad implementation.
  • No manually passing the State# token.
  • Indicate position in source ByteStream where the error occurs.

I think I found a solution that'll fit the bill.

@Lysxia
Copy link
Contributor

Lysxia commented Jul 3, 2022

This is actually close to what we were suggesting, and indeed much more satisfactory.

@david-sledge
Copy link
Contributor Author

Excellent! A couple of questions:

  • The byte position does not currently track across continuation invocations and instead resets back to zero. What's the opinion on whether it should track or the caller be responsible for tracking the cumulative byte position/total bytes read?
  • What version should the @since annotation be for streamDecodeUtf8With'? 2.0.1, 2.1.0, or do I not need to worry about it yet?

@david-sledge
Copy link
Contributor Author

For the first question, I went ahead and let the library track the absolute position for the caller.

@david-sledge david-sledge changed the title Monad support for stream decoding Allow caller more control in stream decoding Jul 3, 2022
@david-sledge
Copy link
Contributor Author

Stream decoders added for utf-16 and utf-32, and an Either decoder for ASCII. I'll work on the lazy options next.

src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
changelog.md Outdated Show resolved Hide resolved
text.cabal Outdated Show resolved Hide resolved
@Bodigrim
Copy link
Contributor

We are getting closer to the final cut off before GHC 9.4. @Lysxia could you please complete your review on this? I don't have much bandwidth at the moment.

@Lysxia
Copy link
Contributor

Lysxia commented Jul 19, 2022

@Bodigrim How much time is left for the cut off? I'll allocate more cycles to shepherd this PR (sorry for dragging this @david-sledge !) but I feel like this is going to take a couple more rounds.

@Bodigrim
Copy link
Contributor

Bodigrim commented Jul 19, 2022

I'm not sure exactly. Given that @bgamari has not chased us yet, it would probably take at least a week or more.

src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
@Bodigrim
Copy link
Contributor

Given #453 (comment), I'll cut text-2.0.1 from the current master, and this PR will be delayed until text-2.0.2. I'm happy to release text-2.0.2 fairly soon, in time for GHC 9.4.2. Sorry @david-sledge, I hope it's not too much of delay for your purposes.

src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
@david-sledge
Copy link
Contributor Author

This branch no longer uses simdutf anymore. Should all of it and references to it be removed, or is it worth modifying simdutf to have a determine_utf8_prefix_length function?

@Lysxia
Copy link
Contributor

Lysxia commented Jul 29, 2022

simdutf is there for performance, so if it is no longer used that needs to be justified by benchmarks. I suspect it's going to be difficult to compete with a vectorized implementation.

@Lysxia
Copy link
Contributor

Lysxia commented Feb 6, 2023

I "accepted" my own change to remove the "requested change" flag in the Github UI. Please review :)

And sorry for the messy git history. I'm assuming this is going to be squashed.

I added some optimizations so the benchmarks run at least as fast as before. Most of the time should be spent looping in C++, so there should be no change (confirmed by getting a screenful of "same as baseline" where "baseline" is master), except for the "tiny" benchmark where the work in entering and exiting the function dominates. Somehow the "LazyText" benchmark for the "tiny" input is consistently 25% faster even on my noisy machine. (The strict "Text" benchmark for "tiny" is "same as baseline"; it used to be 3x slower, and that's what my last round of commits fixes).

  Pure
    tiny
      decode
        LazyText: OK (1.20s)
          142  ns ± 7.0 ns, 28% less than baseline
      decode'
        LazyText: OK (0.39s)
          176  ns ± 5.6 ns, 27% less than baseline

Note that the benchmarks test valid UTF-8. When there are invalid bytes, the old code performs badly (quadratically) anyway as reported in issue #495. This PR fixes #495.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just couple of suggestions, otherwise looks very good.

src/Data/Text/Encoding.hs Outdated Show resolved Hide resolved
changelog.md Outdated Show resolved Hide resolved
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

@Lysxia
Copy link
Contributor

Lysxia commented Feb 6, 2023

Thanks to @david-sledge for getting this started and your great work!

@Lysxia
Copy link
Contributor

Lysxia commented Feb 7, 2023

The test is due to a mismatch between Unicode versions (text = Unicode 14, GHC 8.6 = ???). We probably should disable those tests for old GHC (and remain notified when new GHC/Unicode break this test again).

@Bodigrim
Copy link
Contributor

Bodigrim commented Feb 7, 2023

Yeah, t_toCaseFold_char fails on '\66937', which is VITHKUQI CAPITAL LETTER FE' (U+10579) added in Unicode 14.0. The test should be guarded with #if MIN_VERSION_base(4,16,0).

@Lysxia could you please rebase?

@Lysxia
Copy link
Contributor

Lysxia commented Feb 7, 2023

Uh, I can if that's what we want to do. But I thought it would be cleaner to squash it. In the Github UI "Rebase and merge" is grayed out but we can still just select "Squash and merge". Is that okay?

@Bodigrim
Copy link
Contributor

Bodigrim commented Feb 7, 2023

Ah, I see. Yes, squashing is OK, please go ahead.

@414owen
Copy link

414owen commented Mar 19, 2024

Hi @david-sledge,

I want to decode some utf8 from a socket, and abort early if I get any invalid data.
The haddock states that the second element of the tuple returned by decodeUtf8{Chunk,More} is a

undecoded remainder of the given chunk, for diagnosing errors and resuming

It looks like the suffix can also contain input that wasn't parsed because the parser needs more bytes to determine whether it has a valid code point. Is that correct?
If so, {can,how do} I distinguish between a suffix that's due to a unicode error, and one that's due to a lack of input?

@Lysxia
Copy link
Contributor

Lysxia commented Mar 19, 2024

@414owen You can look at the third component of the returned triple Maybe Utf8State.

In the doc of decodeUtf8More:

Just the new state, or Nothing if an invalid byte was encountered (it will be within the first 4 bytes of the undecoded remainder).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants