[RFC] Builder: Efficiently handle literal strings #132

bgamari · 2017-07-13T20:34:17Z

Previously, `String`s were handled with `P.primMapListBounded P.charUtf8`. When the `String` was a literal, we would decode UTF-8 from the primitive string and then re-encode each character as we wrote it to the target buffer. Not only was this inefficient to run, it was also inefficient to compile, as we were forced to inline and simplify large swathes of the builder machinery (see GHC #13960).

The obvious solution here is to do what we should have done all along: strcpy directly out of the primitive string into the target buffer. In the case of UTF-8 things are slightly trickier as we must recognize NULL characters, which GHC encodes as 0xc0 0x80.
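As a rough illustration of the copy described above, here is a pure, list-based model. It is only a sketch: the real code works on raw addresses and buffer ranges, and `copyModifiedUtf8` is a hypothetical name that does not exist in bytestring. The input list stands for the bytes found at the literal's address, ending at the first 0x00 terminator.

```haskell
import Data.Word (Word8)

-- | Pure model of copying a NUL-terminated, modified-UTF-8 literal.
-- Hypothetical helper for illustration; not part of bytestring.
copyModifiedUtf8 :: [Word8] -> [Word8]
-- The two-byte sequence 0xC0 0x80 stands for U+0000: emit one zero byte.
copyModifiedUtf8 (0xC0 : 0x80 : rest) = 0x00 : copyModifiedUtf8 rest
-- A real zero byte is the C-string terminator: stop copying.
copyModifiedUtf8 (0x00 : _)           = []
-- Every other byte is already valid output and is copied verbatim.
copyModifiedUtf8 (b : rest)           = b : copyModifiedUtf8 rest
-- Running out of input is treated like hitting the terminator.
copyModifiedUtf8 []                   = []
```

In this model no decoding or re-encoding of characters takes place; the only special case is the two-byte encoding of NUL.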

Fixing this is a win in several respects: the code size of a trivial `main = print $ BSL.length $ B.toLazyByteString $ B.string "hello world"` program is roughly cut in half. Moreover, according to the provided benchmarks, the new approach is about twice as fast as the previous one.

Data/ByteString/Builder.hs

bgamari · 2017-07-14T15:08:32Z

Good catch, @phadej!

Data/ByteString/Builder.hs

hvr · 2017-07-19T23:00:23Z

Btw, how does this code handle surrogate-codepoints in literals? Does it handle them liberally according to WTF-8?

bgamari · 2017-07-20T02:24:04Z

> Btw, how does this code handle surrogate-codepoints in literals? Does it handle them liberally according to WTF-8?

I'm not sure I understand the question. What handling of surrogates do you propose is necessary here? This code doesn't attempt to do any decoding beyond what is necessary to handle the modified UTF-8 encoding of the U+0 codepoint.

hvr · 2017-07-20T10:04:37Z

@bgamari the question I was basically asking is what happens for a string-literal like

"Z\xd800Z"

and whether it gets encoded as

- the valid UTF-8 stream [90,239,191,189,90] (replacement char; this is what e.g. Data.Text.encodeUtf8 does), or as
- the invalid UTF-8 stream [90,237,160,128,90] (which would have Data.Text.decodeUtf8 choke), or
- something else happens
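To make the two candidate byte streams concrete, here is a small sketch of a UTF-8 encoder in two flavours. Both functions are hypothetical helpers written for this illustration; they are not taken from bytestring or text. `encodeLenient` applies the plain UTF-8 bit layout to any code point, surrogates included, while `encodeReplacing` first substitutes U+FFFD for surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Char with the plain UTF-8 bit layout, with no special
-- treatment of surrogates (U+D800..U+DFFF). Hypothetical helper.
encodeLenient :: Char -> [Word8]
encodeLenient c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading-byte payload: the bits above position s.
    hi s   = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits starting at position s.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Variant that replaces surrogates with U+FFFD before encoding.
-- Hypothetical helper.
encodeReplacing :: Char -> [Word8]
encodeReplacing c
  | n >= 0xD800 && n <= 0xDFFF = encodeLenient '\xFFFD'
  | otherwise                  = encodeLenient c
  where
    n = ord c
```

Applied to the literal in question, `concatMap encodeLenient "Z\xD800Z"` gives the second stream, [90,237,160,128,90], and `concatMap encodeReplacing "Z\xD800Z"` gives the first, [90,239,191,189,90].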

phadej · 2017-07-20T10:09:21Z

For the record, currently:

Prelude Data.ByteString> unpack "Z\xd800Z"
[90,0,90]
Prelude Data.ByteString> unpack "Z\x02fcZ"
[90,252,90]

Prelude Data.ByteString Data.Word> fromIntegral (0x02fc :: Int) :: Word8
252

i.e. the thing you would expect from `fromIntegral :: Int -> Word8`
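The truncation seen in that session can be modelled with a one-line function. This is a sketch of the observed behaviour only, not the library's actual implementation, and `truncateChars` is a hypothetical name.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Keep only the low 8 bits of each code point, as
-- fromIntegral :: Int -> Word8 does. Hypothetical helper.
truncateChars :: String -> [Word8]
truncateChars = map (fromIntegral . ord)
```

Here 0xd800 has a zero low byte and 0x02fc has low byte 0xfc, so `truncateChars` maps the two literals above to [90,0,90] and [90,252,90] respectively.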

hvr · 2017-07-20T10:38:55Z

@phadej doesn't this PR affect the code-paths for e.g. `stringUtf8 "Z\xd800Z" :: Builder`?

bgamari · 2017-07-20T13:11:11Z

I see the question now. With this patch we have this,

>>> print $ BSL.unpack $ B.toLazyByteString $ B.stringUtf8 "Z\xd800Z"
[90,237,160,128,90]

which is the invalid UTF-8 sequence that you point out in your question. It is also the same thing that we would produce today. I really don't think we are in a position where we can change this.

In general we are in a bit of a tight spot here since we don't have ByteArray literals, therefore we abuse CString for this. Moreover, we use modified UTF-8 to encode strings containing code-point 0, so it's nearly impossible to distinguish between "strings" and "chunks of bytes". I'm working on resuscitating the length-annotated string patch which should help a bit here since we will no longer need UTF-8 encoding for plain ASCII strings.

hvr · 2017-07-20T13:50:30Z

@bgamari it's not a big deal; it'd just be good to warn about this in the documentation (and maybe at some point GHC could implement a warning about text literals containing suspicious code-points, mostly U+D800 through U+DFFF)

bgamari · 2017-07-24T23:26:03Z

Curiously, Travis seems to fail reliably yet I don't see any of these failures locally. Hmmmm.

bgamari · 2018-01-22T01:26:44Z

@dcoutts, ping.

bgamari · 2018-02-18T16:34:24Z

Pinging @dcoutts.

knupfer · 2018-04-28T20:19:43Z

Ping @dcoutts, this would simplify my library

chessai · 2019-10-13T04:10:40Z

Ping

tests/builder/Data/ByteString/Builder/Prim/Tests.hs

sjakobi · 2020-07-14T16:19:47Z

@bgamari It looks like there's not very much left to do before this can be merged. Do you intend to put the finishing touches on this PR soon?

Bodigrim

LGTM

hsyl20 · 2020-08-25T17:12:53Z

Data/ByteString/Builder/Prim.hs

+          IO $ \s -> case writeWord8OffAddr# op0# 0# 0## s of
+                       s' -> (# s', () #)
+          let br' = BufferRange (op0 `plusPtr` 1) ope
+          step (addr `plusAddr#` 1#) k br'


Shouldn't it be `plusAddr# 2#`? We've read two bytes.

Bodigrim · 2020-08-25

Nice! It is interesting that tests were too weak to catch it.

Bodigrim · 2020-08-26

@hsyl20 I improved tests and changed the increment to 2#. Could you please take another look?

hsyl20 · 2020-08-26

LGTM. Thanks for the fix!
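The effect of the increment can be seen in a small offset-based model of the copy loop. This is a sketch under simplifying assumptions: it works on a list instead of an address, all names are hypothetical, and `nulAdvance` is the number of bytes by which the read offset moves after the two-byte NUL encoding has been recognised (assumed to be at least 1).

```haskell
import Data.Word (Word8)

-- | Offset-based model of the copy loop. Hypothetical helper;
-- `nulAdvance` plays the role of the argument to plusAddr#.
copyWith :: Int -> [Word8] -> [Word8]
copyWith nulAdvance bytes = go 0
  where
    len  = length bytes
    -- Reads past the end of the list see the 0x00 terminator.
    at i = if i < len then bytes !! i else 0x00
    go i
      | at i == 0x00                       = []                         -- terminator: stop
      | at i == 0xC0 && at (i + 1) == 0x80 = 0x00 : go (i + nulAdvance) -- encoded NUL
      | otherwise                          = at i : go (i + 1)          -- copy verbatim
```

With the input [0x61,0xC0,0x80,0x62], `copyWith 2` yields [0x61,0x00,0x62], whereas `copyWith 1` re-reads the second byte of the NUL encoding and yields [0x61,0x00,0x80,0x62]. Inputs that contain no encoded NUL give the same result for both increments, so in this model only a test string with an embedded NUL tells the two apart.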

phadej reviewed Jul 13, 2017

Data/ByteString/Builder.hs (outdated, resolved)

bgamari force-pushed the build-cstring branch from 5c592d5 to c321bb5 on July 14, 2017 15:08

bgamari force-pushed the build-cstring branch from c321bb5 to c93ed1d on July 14, 2017 16:51

thoughtpolice suggested changes Jul 19, 2017

Data/ByteString/Builder.hs (resolved)

bgamari force-pushed the build-cstring branch from c93ed1d to 460c2c5 on July 20, 2017 02:15

thoughtpolice approved these changes Jul 20, 2017

bgamari force-pushed the build-cstring branch 4 times, most recently from 50a4705 to 0bcb435 on July 24, 2017 23:08

bgamari force-pushed the build-cstring branch from 0bcb435 to 0f0c94a on August 1, 2017 17:55

bgamari added 4 commits August 1, 2017 14:12

a75a67c Test naive String Builder
aff9ea2 Test and benchmark cstring
fadbdce Efficiently copy CStrings
7c891d6 Benchmark UTF-8 strings

bgamari force-pushed the build-cstring branch from 0f0c94a to 7c891d6 on August 1, 2017 18:13

vdukhovni mentioned this pull request May 8, 2020: Bump version to 0.10.10.1 #215 (merged)

Bodigrim reviewed May 10, 2020

tests/builder/Data/ByteString/Builder/Prim/Tests.hs (outdated, resolved)

sjakobi added the performance label Jul 2, 2020

sjakobi added this to the 0.10.12.0 milestone Jul 3, 2020

sjakobi changed the milestone from 0.10.12.0 to Soon on Aug 19, 2020

Bodigrim added 2 commits August 21, 2020 21:14

46d859a Merge branch 'master' into build-cstring
083b2db Test cstringUtf8 and encoding of NULL

Bodigrim changed the milestone from Soon to 0.11.0.0 on Aug 21, 2020

Bodigrim approved these changes Aug 21, 2020

Bodigrim requested review from hsyl20 and sjakobi August 23, 2020 19:13

sjakobi approved these changes Aug 24, 2020

hsyl20 reviewed Aug 25, 2020

Bodigrim added 3 commits August 26, 2020 00:27

270bb48 Really test encoding of NULL
5c83a80 Fix compatibility with older GHCs
6e29895 Fix encoding of NULL

Bodigrim merged commit 155bf8a into haskell:master Aug 26, 2020

Ericson2314 mentioned this pull request Jan 24, 2021: Backport #326 to 0.11 #353 (closed)

Bodigrim mentioned this pull request May 24, 2021: Add test for #393 #394 (merged)
