New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky decode #3115
Fix flaky decode #3115
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find! I've been meaning to clean this up and offer it to fs2, because it's not HTTP, but not gotten around to it. But this is a good step forward.
@@ -24,7 +26,8 @@ package object util { | |||
else Pull.output1(outputString).as(None) | |||
case Some((chunk, stream)) => | |||
if (chunk.nonEmpty) { | |||
val bytes = chunk.toArray | |||
val chunkWithoutBom = skipByteOrderMark(chunk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fast enough as is, but do we only need to check this on the first chunk? And is it incorrect if we drop it in the middle of a stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, checking only the first chunk is enough.
According to wikipedia:
its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.
So in both implementations (here and in fs2) if UTF-8 BOM is not at the very beginning of the stream - it will not be dropped and will remain in the result.
So we could optimize it possibly introducing some var to mark the first chunk... But hopefully that is not necessary.
@@ -55,5 +55,11 @@ class DecodeSpec extends Http4sSpec { | |||
decoded must_== expected | |||
} | |||
} | |||
|
|||
"drop Byte Order Mark" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably exceedingly rare, but what if the BOM spans chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given, that we exclude cases of it appearing in the middle of the stream, I don't think it would ever happen as we use unconsN
. It might be more likely in fs2 implementation, where they use uncons1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having written this it became clear to me that it is actually unnecessary to check chunk.size >= 3
.
I took it from fs2, but because of unconsN
it is redundant in our implementation.
As far as I know, this isn't a bug that has bitten any users in practice, but the change to the code goes beyond the Testing label. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree we should have A rollover for across chunks
Do we want to merge this as is and follow up with another PR for the corner case of the corner case? |
Yup. |
Test dropping UTF-8 BOM
Fixes #2956
Test was failing when the generated string contained UTF-8 BOM character (that's why the minimum example looked like an empty String).
Last year fs2 started ignoring UTF-8 BOM characters:
typelevel/fs2#1484
So flakiness of this test is a successful discovery of the difference between the behavior of fs2 and http4s decoders.
Following the example of fs2, I made changes to the
decode
function dropping BOM characters and added an example test to verify that.