
Switch internal representation to UTF8 #365

Merged · 38 commits · Sep 8, 2021
Conversation

Bodigrim (Contributor)

Results

Benchmark results for GHC 8.10 can be found at https://gist.github.com/Bodigrim/365e388e080b17de45e80ab50a55fb4f. I'll publish a detailed analysis later, but here are the most notable improvements:

  • decodeUtf8 is up to 10x faster for non-ASCII texts.
  • encodeUtf8 is ~1.5-4x faster for strict and up to 10x faster for lazy Text.
  • take / drop / length are up to 20x faster.
  • toUpper / toLower are 10-30% faster.
  • Eq instance of strict Text is 10x faster.
  • Ord instance is typically 30%+ faster.
  • isInfixOf and search routines are up to 10x faster.
  • replicate of Char is up to 20x faster.

The geometric mean of benchmark times (utf8 relative to master) is 0.33.

How to review

I'd like to encourage as many reviewers as possible, and the branch is structured to facilitate this. Each commit builds and passes tests, so they can be reviewed individually. The actual switch from UTF-16 to UTF-8 happens in 1369cd3, and the remaining TODOs are resolved in 99f3c48. Everything else is performance improvements. There are two commits of intimidating size: 2cb3b30, which dumps autogenerated case mappings, and c3ccdb3, which bundles amalgamated simdutf8. I tried to keep the other commits to a reasonable size and scope; hopefully that makes them palatable.

I'm happy to answer questions and add comments to any code that is not clear enough. Do not hesitate to ask.

Known issues

This branch uses the simdutf8 library, which is massive and written in C++. Since this is a bit inconvenient for a boot package, @kozross is currently finalizing a pure C replacement, which will be submitted for review in a separate PR.

@Fuuzetsu (Member) commented Aug 22, 2021

I guess this also addresses #272 in one swoop

@chrisdone (Member)

What’s slower?

@Fuuzetsu (Member)

What’s slower?

I guess it'd be good to see what encodeUtf16 and friends give now. I suspect these are barely used compared to encodeUtf8 anyway.

@jberryman commented Aug 23, 2021

EDIT: Oops, I totally botched processing the CSV as Bodigrim points out below

I'll see if I can build https://github.com/hasura/graphql-engine with this branch and run it through our benchmarking infra this week.

EDIT: going back to the UTF-8 proposal, it looks like, for better or worse, the only concrete acceptance criteria vis-à-vis performance are:

  • decodeUtf8 and encodeUtf8 become at least 2x faster.
  • Geometric mean of existing benchmarks (which favor UTF-16) decreases.
  • Fusion (as per our test suite) does not regress beyond at most several cases.

so the results above are more than acceptable according to that spec.

@jkachmar commented Aug 23, 2021

EDIT: Comment is no longer relevant to the discussion; see Bodigrim's response below.

@chrisdone (Member)

Thanks for the numbers @jberryman

@Bodigrim (Contributor, Author)

@jberryman I'm afraid your table does not make sense:

Name master branch utf8 branch 0 Log Ratio
All.Programs.Throughput.LazyTextByteString 27378356325 4471250933 Ratio -0.787
All.Programs.Throughput.TextByteString 23929606012 4610553675 4.361 -0.715

The first two rows are corrupted, so let's start from the third one. There the Ratio does not match the time measurements: 4610553675 / 23929606012 = 0.193, which means that the UTF-8 branch is 5x faster, not 4.361x slower. Everything else misrepresents the original data in a similar way.

Instead of sorting rows by Ratio, you sorted the values in the Ratio column on their own and reversed the table, apparently obtaining nonsensical results.

@jberryman commented Aug 23, 2021

Doh! Sorry

Click for regressions...
Name master branch utf8 branch Ratio Log Ratio
All.Pure.japanese.filter.filter.Text 4286460 18693165 4.361 0.640
All.Pure.russian.filter.filter.Text 6391499 26611977 4.164 0.619
All.Pure.russian.length.filter.Text 5990205 24360620 4.067 0.609
All.Pure.ascii.mapAccumL.Text 908974332800 3638772627300 4.003 0.602
All.Pure.japanese.length.filter.Text 3907379 14647059 3.749 0.574
All.Pure.japanese.length.words.Text 38836980 143547741 3.696 0.568
All.Pure.japanese.words.Text 39674886 140793437 3.549 0.550
All.Pure.russian.length.filter.filter.Text 6046403 21338005 3.529 0.548
All.Pure.russian.map.map.Text 25095436 85788868 3.419 0.534
All.Pure.english.mapAccumL.Text 60778409175 207472001800 3.414 0.533
All.Pure.ascii-small.intersperse.Text 477122164 1584540007 3.321 0.521
All.Pure.japanese.intersperse.Text 35325822 112014173 3.171 0.501
All.Pure.japanese.length.filter.filter.Text 3852873 11845036 3.074 0.488
All.Pure.ascii-small.mapAccumL.Text 524822893 1584027082 3.018 0.480
All.Pure.russian.intersperse.Text 51438771 154943289 3.012 0.479
All.Pure.japanese.map.map.Text 17463211 52084321 2.983 0.475
All.Pure.ascii.map.map.Text 234876768275 689893193000 2.937 0.468
All.Pure.english.map.map.Text 16816703650 48141679000 2.863 0.457
All.Pure.ascii-small.map.map.Text 268999974 769964654 2.862 0.457
All.Pure.english.mapAccumL.LazyText 47612369112 135492460800 2.846 0.454
All.Pure.ascii-small.mapAccumL.LazyText 492221926 1368635479 2.781 0.444
All.Pure.ascii.mapAccumR.Text 1360928562550 3743383120800 2.751 0.439
All.Pure.english.mapAccumR.Text 79957684125 209392756075 2.619 0.418
All.Pure.ascii-small.length.filter.Text 60110941 155777043 2.591 0.414
All.Pure.english.length.filter.Text 3518092281 9072480412 2.579 0.411
All.Pure.ascii.length.filter.Text 53715988100 134837148600 2.510 0.400
All.Pure.ascii-small.mapAccumR.Text 689645017 1708705487 2.478 0.394
All.Pure.ascii.filter.filter.Text 55365971900 135428844350 2.446 0.388
All.Pure.ascii-small.filter.filter.Text 64062812 156431460 2.442 0.388
All.Pure.ascii.concat.LazyText 25956179262 61828057150 2.382 0.377
All.Pure.ascii.mapAccumL.LazyText 892289471600 2104454784000 2.358 0.373
All.Pure.russian.mapAccumL.LazyText 40192719 94138204 2.342 0.370
All.Pure.english.filter.filter.Text 3861193850 8971767900 2.324 0.366
All.Pure.english.mapAccumR.LazyText 64822552500 140001104800 2.160 0.334
All.Pure.tiny.length.cons.Text 14363 30306 2.110 0.324
All.Pure.tiny.length.take.LazyText 29270 61375 2.097 0.322
All.Pure.ascii-small.mapAccumR.LazyText 677831689 1391476798 2.053 0.312
All.Pure.tiny.length.take.Text 17334 34548 1.993 0.300
All.Pure.tiny.length.drop.LazyText 31327 61538 1.964 0.293
All.FileIndices.Text 1722222759 3368145693 1.956 0.291
All.Pure.russian.mapAccumL.Text 44009081 85401880 1.941 0.288
All.Pure.russian.length.words.Text 59840591 113316215 1.894 0.277
All.FileRead.LazyText 18601989650 34579194700 1.859 0.269
All.Pure.ascii-small.length.filter.filter.Text 60108618 110940976 1.846 0.266
All.Pure.tiny.length.tail.Text 15947 29348 1.840 0.265
All.Pure.tiny.map.map.Text 40689 73561 1.808 0.257
All.Pure.ascii.length.filter.filter.Text 53382547300 96310759700 1.804 0.256
All.Pure.english.length.filter.filter.Text 3554604815 6396166743 1.799 0.255
All.Pure.tiny.length.replicate string.Text 17935 32192 1.795 0.254
All.Pure.tiny.length.drop.Text 22087 39054 1.768 0.248
All.Pure.tiny.take.LazyText 28916 50071 1.732 0.238
All.Pure.english.intersperse.Text 64852656850 110296260500 1.701 0.231
All.Pure.tiny.length.map.Text 17986 30326 1.686 0.227
All.Pure.russian.words.Text 63318701 106785640 1.686 0.227
All.Pure.english.words.Text 94305583550 156552526850 1.660 0.220
All.Pure.ascii.intersperse.Text 1097287378800 1807749568100 1.647 0.217
All.Pure.tiny.length.init.Text 16455 27004 1.641 0.215
All.Pure.ascii-small.words.Text 435580662 706163431 1.621 0.210
All.Programs.Throughput.LazyText 33153355300 53489191400 1.613 0.208
All.Pure.tiny.words.Text 35595 56988 1.601 0.204
All.Pure.tiny.length.words.Text 37474 59682 1.593 0.202
All.Pure.ascii.mapAccumR.LazyText 1337590373100 2128827338800 1.592 0.202
All.Programs.Throughput.Text 36894744925 58716190900 1.591 0.202
All.Pure.tiny.length.replicate char.Text 20409 31951 1.566 0.195
All.Pure.english.length.words.Text 24862166800 38848806662 1.563 0.194
All.Pure.japanese.mapAccumL.LazyText 31111157 48004810 1.543 0.188
All.Pure.russian.mapAccumR.Text 61982220 95396325 1.539 0.187
All.Pure.ascii.length.words.Text 380019569800 581904752000 1.531 0.185
All.Pure.russian.mapAccumR.LazyText 60163025 90158680 1.499 0.176
All.Pure.ascii-small.length.words.Text 431157632 641999207 1.489 0.173
All.FileRead.Text 22193181037 32356484450 1.458 0.164
All.Pure.tiny.length.init.LazyText 25821 37288 1.444 0.160
All.Pure.tiny.drop.LazyText 34048 48234 1.417 0.151
All.Programs.Fold 130703628900 185012830000 1.416 0.151
All.Pure.tiny.take.Text 18041 25449 1.411 0.149
All.Pure.tiny.length.tail.LazyText 22157 30972 1.398 0.145
All.ReadLines.Text 26481958375 36025864550 1.360 0.134
All.Pure.tiny.intersperse.Text 121332 163914 1.351 0.131
All.Pure.japanese.isInfixOf.LazyText 9485642 12734068 1.342 0.128
All.Pure.tiny.drop.Text 18699 24777 1.325 0.122
All.Pure.japanese.append.Text 3813036 5048982 1.324 0.122
All.FileIndices.LazyText 6163252356 8156591087 1.323 0.122
All.Pure.ascii-small.Builder.mappend char 40402773 53383941 1.321 0.121
All.Pure.japanese.Builder.mappend char 40523358 53078920 1.310 0.117
All.Pure.tiny.length.map.map.Text 23670 30874 1.304 0.115
All.Pure.tiny.length.filter.Text 17223 22367 1.299 0.113
All.Pure.russian.Builder.mappend char 40321784 52296500 1.297 0.113
All.Replace.LazyText 7591184037 9838982087 1.296 0.113
All.Pure.english.Builder.mappend char 40925067 52787836 1.290 0.111
All.Pure.russian.map.Text 70205178 90176283 1.284 0.109
All.Pure.tiny.Builder.mappend char 41598222 52693260 1.267 0.103
All.Pure.japanese.map.Text 48719205 61551012 1.263 0.102
All.Pure.ascii.Builder.mappend char 42090766 52697784 1.252 0.098
All.Pure.russian.foldl'.Text 46374687 57498525 1.240 0.093
All.Pure.english.tail.LazyText 313540 387425 1.236 0.092
All.Pure.ascii.words.LazyText 3200002919100 3951757023600 1.235 0.092
All.Pure.russian.reverse.Text 13685795 16806661 1.228 0.089
All.DecodeUtf8.ascii.strict decodeASCII 16597790097 20361800856 1.227 0.089
All.Pure.ascii.words.Text 1923375551450 2335346355100 1.214 0.084
All.DecodeUtf8.ascii.strict decodeLatin1 16669017806 20201149628 1.212 0.083
All.Pure.tiny.length.filter.filter.Text 17128 20717 1.210 0.083
All.Pure.russian.length.filter.filter.LazyText 111583214 134526054 1.206 0.081
All.DecodeUtf8.ascii.Strict 16353951638 19685246662 1.204 0.081
All.Stream.stream.Text 33265855475 39864031600 1.198 0.079
All.Pure.russian.isInfixOf.LazyText 8267631 9853384 1.192 0.076
All.Pure.russian.length.filter.LazyText 115133204 136725833 1.188 0.075
All.DecodeUtf8.ascii.strict decodeUtf8 16567841664 19555564587 1.180 0.072
All.Programs.Cut.Text 38600987350 45420349225 1.177 0.071
All.Pure.japanese.mapAccumL.Text 33773446 39456762 1.168 0.068
All.Pure.russian.zipWith.Text 155804762 180284474 1.157 0.063
All.Pure.japanese.mapAccumR.Text 44827642 51784380 1.155 0.063
All.Pure.japanese.foldl'.Text 32705090 37488457 1.146 0.059
All.Pure.russian.intersperse.LazyText 283186349 323150051 1.141 0.057
All.Pure.japanese.zipWith.Text 108932297 124234861 1.140 0.057
All.Pure.ascii-small.map.Text 756522332 859308795 1.136 0.055
All.Pure.tiny.map.LazyText 203443 229399 1.128 0.052
All.ReadNumbers.DecimalText 453009064 509126414 1.124 0.051
All.Pure.english.Builder.mappend 8 char 70987 79812 1.124 0.051
All.Pure.japanese.Builder.mappend 8 char 74208 83193 1.121 0.050
All.Pure.russian.length.words.LazyText 87385175 97790055 1.119 0.049
All.Programs.BigTable 175518625700 196159030400 1.118 0.048
All.Pure.russian.filter.LazyText 128635242 143446543 1.115 0.047
All.Pure.tiny.Builder.mappend 8 char 71272 79092 1.110 0.045
All.Pure.ascii.decode.Text 17054489931 18899973825 1.108 0.045
All.Builder.Int.Decimal.Show.12 185275 204834 1.106 0.044
All.Pure.ascii.length.intercalate.LazyText 70177439375 77541390000 1.105 0.043
All.Pure.japanese.concat.Text 9245860 10207330 1.104 0.043
All.Pure.tiny.length.cons.LazyText 28795 31707 1.101 0.042
All.Stream.stream.LazyText 47489877000 52205395000 1.099 0.041
All.Pure.ascii.decode'.Text 17238519398 18947115375 1.099 0.041
All.Pure.japanese.words.LazyText 54124018 59078675 1.092 0.038
All.Pure.russian.words.LazyText 102432943 110916035 1.083 0.035
All.Pure.russian.foldl'.LazyText 127752375 138417774 1.083 0.035
All.Pure.russian.Builder.mappend 8 char 73192 79158 1.082 0.034
All.Pure.tiny.uncons.LazyText 46043 49599 1.077 0.032
All.Pure.tiny.filter.LazyText 105763 113667 1.075 0.031
All.Pure.ascii-small.map.map.LazyText 1907625318 2048294912 1.074 0.031
All.Pure.japanese.length.words.LazyText 52596298 56397154 1.072 0.030
All.ReadNumbers.DoubleText 2877579064 3073777312 1.068 0.029
All.Pure.tiny.foldl'.Text 45520 48612 1.068 0.029
All.Pure.ascii-small.zipWith.Text 1690593403 1804527026 1.067 0.028
All.Pure.japanese.Builder.mappend text 417791315 445266578 1.066 0.028
All.Pure.russian.filter.filter.LazyText 116662979 124236636 1.065 0.027
All.Pure.japanese.map.LazyText 132459812 140812427 1.063 0.027
All.Pure.japanese.filter.LazyText 76294521 81085629 1.063 0.026
All.Pure.ascii-small.Builder.mappend 8 char 73099 77613 1.062 0.026
All.Pure.russian.map.LazyText 196094014 208047672 1.061 0.026
All.Pure.russian.map.map.LazyText 181665864 192183338 1.058 0.024
All.Pure.japanese.intersperse.LazyText 200228871 211215202 1.055 0.023
All.Pure.japanese.length.filter.LazyText 71222038 75014667 1.053 0.023

Removing the "japanese" and "russian" benchmarks doesn't change the picture significantly. Like @chrisdone, I'd be interested in an assessment of the regressions from @Bodigrim or someone who knows the benchmarks well.

@Bodigrim (Contributor, Author)

As I wrote in the opening post, I'm working on a detailed analysis of performance.
@jberryman could you please wrap the table into a spoiler?

jberryman added a commit to jberryman/deferred-folds that referenced this pull request Aug 23, 2021
jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 23, 2021
@Bodigrim (Contributor, Author)

Performance report

This report compares the performance of the text package with a UTF-8-encoded internal representation (the utf8 branch) against the UTF-16 representation (the master branch). Readers are encouraged to read the original proposal for the UTF-8 transition first, especially its "Performance impact" section, which provides the necessary background.

Tom Harper's original thesis, "Fusion on Haskell Unicode Strings", discusses performance differences between the UTF-8, UTF-16 and UTF-32 encodings. Basically, UTF-32 offers the best performance for string operations in synthetic benchmarks, simply because it is a fixed-length encoding and parsing a UTF-32-encoded buffer is a no-op. UTF-16 is somewhat worse, because characters can take 16 or 32 bits. However, parsing/printing code points is still very simple, there are only two branches, and since the vast majority of code points are 16 bits long, CPU branch prediction works wonderfully.

UTF-8, however, poses certain performance challenges. Code points can be represented by 1, 2, 3 or 4 bytes, their parsing and printing involves multiple bitwise operations, and CPU branch prediction becomes ineffective, because non-ASCII texts constantly switch between branches. Memory savings from UTF-8 hardly affect synthetic benchmarks, because they usually fit into the CPU cache. Synthetic benchmarks also do not account for encoding/decoding data from external sources (usually in UTF-8); they measure pure processing time only. That's why text originally went for the UTF-16 encoding.
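To make the width variability concrete, here is a small illustrative sketch (mine, not code from the text package) computing how many bytes a single code point occupies in each encoding; note UTF-8 needs four branches where UTF-16 needs only two:

```haskell
import Data.Char (ord)

-- Bytes needed for one code point in UTF-8: four ranges, four branches.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Bytes needed in UTF-16: only two branches, and for most real-world text
-- the first branch almost always wins, so branch prediction is cheap.
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

main :: IO ()
main = mapM_ (\c -> print (c, utf8Bytes c, utf16Bytes c)) "a\x0439\x8A9E\x1F600"
```

Mixed-script text (say, Russian plus punctuation) bounces between the 1-byte and 2-byte branches of utf8Bytes on nearly every character, which is exactly the branch-prediction failure mode described above.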

The key to better UTF-8 performance is to avoid parsing/printing of code points as much as possible. For example, the most common operations on text are cutting and concatenation, and neither of these requires actual parsing of individual characters; they can be executed over opaque memory buffers. Given that external sources are most likely UTF-8 encoded, an application can spend its entire lifetime without ever interpreting text character by character, thus achieving very decent performance.
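The idea that cutting never needs to decode a code point can be sketched as follows (an illustration of the principle, not the library's actual implementation): a UTF-8 continuation byte always has the bit shape 10xxxxxx, so character boundaries are visible without any decoding.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import Data.Word (Word8)

-- A byte starts a character iff it is NOT of the shape 10xxxxxx.
isLeader :: Word8 -> Bool
isLeader b = b .&. 0xC0 /= 0x80

-- Drop n characters from a UTF-8 buffer: skip to the next leader n times.
-- No code point is ever reconstructed, and concatenation is likewise a
-- plain buffer append.
dropChars :: Int -> B.ByteString -> B.ByteString
dropChars 0 bs = bs
dropChars n bs = case B.uncons bs of
  Nothing        -> B.empty
  Just (_, rest) -> dropChars (n - 1) (B.dropWhile (not . isLeader) rest)

main :: IO ()
main = print (dropChars 1 (B.pack [0xD0, 0xBF, 0xD1, 0x80, 0xD0, 0xB8]))
```

Here [0xD0,0xBF,0xD1,0x80,0xD0,0xB8] is the UTF-8 encoding of a three-character Russian string; dropping one character is pure byte classification.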

Operations that necessitate character-by-character interpretation are likely to regress with UTF-8. E.g., map succ, measured in isolation, could very well be faster on a UTF-16 buffer. We argue that this is an acceptable tradeoff for two reasons. Firstly, if we measure a full pipeline, including decoding an input file and encoding an output, savings from these no-ops are likely to outweigh map succ, unless there is a very long chain of maps. Secondly, Unicode is so complex and multifaceted that in practical applications parsing/printing of characters is trumped by the application of f in map f. For instance, toUpper / toLower are actually faster in our utf8 branch than they were in master.


The original proposal stated three performance goals:

  • Fusion (as per our test suite) does not regress beyond at most several cases.
  • decodeUtf8 and encodeUtf8 become at least 2x faster.
  • Geometric mean of existing benchmarks (which favor UTF-16) decreases.

While the work on the UTF-8 transition was underway, the text package decided to abandon the implicit fusion framework, as it was demonstrated to harm asymptotic (!) performance (#348). FWIW, early drafts of the utf8 branch, prior to the 21st of June, showed no issues with fusion, e.g., https://github.com/Bodigrim/text/commits/utf8-210609.

Next, here are results for encodeUtf8:

git checkout master
cabal run text-benchmarks -- -t100 -p encode --csv text-master-encode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p encode --baseline text-master-encode.csv
tiny
  encode
    Text:
      18.3 ns ± 202 ps, 58% faster than baseline
    LazyText:
      33.1 ns ± 364 ps, 61% faster than baseline
ascii-small
  encode
    Text:
      4.80 μs ±  58 ns, 55% faster than baseline
    LazyText:
      7.57 μs ±  71 ns, 86% faster than baseline
ascii
  encode
    Text:
      7.17 ms ±  40 μs, 60% faster than baseline
    LazyText:
      78.1 ms ± 672 μs, 32% faster than baseline
english
  encode
    Text:
      347  μs ± 5.5 μs, 44% faster than baseline
    LazyText:
      510  μs ± 4.4 μs, 81% faster than baseline
russian
  encode
    Text:
      1.36 μs ±  24 ns, 88% faster than baseline
    LazyText:
      1.37 μs ±  13 ns, 92% faster than baseline
japanese
  encode
    Text:
      1.26 μs ±  19 ns, 88% faster than baseline
    LazyText:
      1.28 μs ±  23 ns, 91% faster than baseline

As expected, we get an astonishing speed-up for non-ASCII data. Results for English texts are a bit less impressive, but bear in mind that master recently gained (#302) a twice-as-fast SIMD-based encoder, which is very good for pure, uninterrupted ASCII. And for the ascii benchmark, where the input is 50M long, memory bandwidth becomes a bottleneck, throttling further speed-up opportunities.

Moving to decodeUtf8, it's worth noting that this is not a no-op. While for a valid UTF-8 ByteString it suffices to copy the bytes into a Text, one must first check that the input is valid. Naively, validating a UTF-8 encoding is no faster than parsing characters one by one with appropriate error reporting. However, we employ the simdutf library, which validates Unicode using a vectorised state machine.
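The separation of validation from decoding can be illustrated with a scalar sketch (my own, far simpler and slower than simdutf's vectorised state machine, and deliberately checking only structural shape, not overlong forms, surrogates, or the U+10FFFF bound):

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- How many continuation bytes a leader demands; Nothing for a byte that
-- cannot start a sequence.
leaderArity :: Word8 -> Maybe Int
leaderArity b
  | b .&. 0x80 == 0x00 = Just 0  -- 0xxxxxxx: ASCII
  | b .&. 0xE0 == 0xC0 = Just 1  -- 110xxxxx
  | b .&. 0xF0 == 0xE0 = Just 2  -- 1110xxxx
  | b .&. 0xF8 == 0xF0 = Just 3  -- 11110xxx
  | otherwise          = Nothing -- stray continuation or invalid leader

isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Structurally well-formed UTF-8: every leader is followed by exactly the
-- continuation bytes it demands. No code point is ever materialised.
wellFormed :: [Word8] -> Bool
wellFormed [] = True
wellFormed (b : bs) = case leaderArity b of
  Nothing -> False
  Just n  ->
    let (conts, rest) = splitAt n bs
    in length conts == n && all isContinuation conts && wellFormed rest

main :: IO ()
main = print (wellFormed [0xD0, 0xBF], wellFormed [0x80], wellFormed [0xD0])
```

Once wellFormed succeeds, "decoding" a strict ByteString really is just a memcpy into the Text buffer.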

git checkout master
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --csv text-master-decode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --baseline text-master-decode.csv
tiny
  decode
    Text:
      34.0 ns ± 286 ps, 36% faster than baseline
    LazyText:
      150  ns ± 106 ps, 42% faster than baseline
ascii-small
  decode
    Text:
      5.94 μs ±  92 ns, 58% faster than baseline
    LazyText:
      8.68 μs ± 148 ns, 52% faster than baseline
ascii
  decode
    Text:
      10.3 ms ± 163 μs, 54% faster than baseline
    LazyText:
      74.5 ms ± 406 μs, 59% faster than baseline
english
  decode
    Text:
      398  μs ± 3.6 μs, 63% faster than baseline
    LazyText:
      575  μs ± 7.3 μs, 46% faster than baseline
russian
  decode
    Text:
      1.88 μs ±  27 ns, 95% faster than baseline
    LazyText: OK (5.02s)
      2.41 μs ±  30 ns, 93% faster than baseline
japanese
  decode
    Text:
      2.15 μs ± 5.2 ns, 93% faster than baseline
    LazyText: OK (22.11s)
      2.63 μs ±  13 ns, 91% faster than baseline

Again, results for non-ASCII inputs are pretty much fantastic, while English texts are less impressive (but still 2x faster), for similar reasons as above. The tiny benchmark is really tiny, just five letters, so there is not enough runway for simdutf's vectorised state machine to accelerate, and we get a comparatively modest speed-up.

With regards to the third stated goal, as mentioned earlier, the geometric mean over all benchmarks is 0.33. Obviously, this number does not quite characterise them in full and indeed it's a mixed bag. I'll cover most notable regressions (and most notable improvements!) later this week.

@chrisdone (Member)

Great report, thanks @Bodigrim. I am highly enthusiastic about this change. 👏

@chrisdone (Member) left a comment

Did a cursory look; limited time right now.

Review threads (outdated, resolved): src/Data/Text/Internal.hs (two), src/Data/Text/Array.hs
jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 26, 2021
copyI semantics changed
renaming for 16 -> 8
use iter* functions from Text directly
@tomjaguarpaw (Member)

I'm really pleased to see this! I have had a look at each of the commits in order to get an overall impression of the PR. I don't feel that I have the expertise in this area to give a review though.

@Bodigrim (Contributor, Author)

Performance report: regressions

(Bear in mind that I recently fixed more performance issues, so earlier reports are no longer fully relevant)

I was composing the report below piecewise, so the numbers do not all belong to the same commit (this is a moving target, and benchmarks take hours) or the same machine. However, I believe it reliably characterises the key classes of performance issues. I'm happy to delve deeper and answer questions about specific instances or use cases, if desirable.

Data.Text.readFile vs. T.decodeUtf8 . ByteString.readFile

All.FileRead.Text,21440148375,32011948000,1.49308
All.FileRead.LazyText,19729004900,30573543125,1.54967
All.FileRead.TextByteString,23057155743,3958878015,0.171698
All.FileRead.LazyTextByteString,20230595575,4249293104,0.210043

These results correspond to

[ bench "Text" $ whnfIO $ T.length <$> T.readFile p
, bench "LazyText" $ whnfIO $ LT.length <$> LT.readFile p
, bench "TextByteString" $ whnfIO $
    (T.length . T.decodeUtf8) <$> SB.readFile p
, bench "LazyTextByteString" $ whnfIO $
    (LT.length . LT.decodeUtf8) <$> LB.readFile p
]

The first two benchmarks measure locale-dependent file reading of a Russian text. The nature of locale-dependent reading is that GHC.IO.Buffer first decodes the input file from the system locale into a UTF-32-encoded buffer, and then Data.Text.IO converts the UTF-32 buffer to Text. As discussed above, decoding UTF-32 to UTF-8 is expected to be a slower (up to 55%) process than decoding to UTF-16 (which mostly boils down to Word32 truncation). There is nothing to win here, as long as we are limited by GHC.IO.Buffer.

However, it was argued long ago that users should beware of TL.readFile and use T.decodeUtf8 . ByteString.readFile instead. And indeed, as we can see from the latter two benchmarks, T.decodeUtf8 . ByteString.readFile is now 5x faster, which in my opinion completely redeems the slowdown in T.readFile.
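For reference, the two patterns being compared look roughly like this (a sketch; the names readViaLocale / readViaBytes are mine):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- Locale-dependent: goes through GHC.IO.Buffer, i.e. a UTF-32 detour,
-- regardless of the file's actual encoding.
readViaLocale :: FilePath -> IO T.Text
readViaLocale = TIO.readFile

-- Byte-oriented: read raw bytes, then validate-and-copy with decodeUtf8.
-- For a UTF-8 file this skips the UTF-32 round trip entirely.
readViaBytes :: FilePath -> IO T.Text
readViaBytes path = TE.decodeUtf8 <$> B.readFile path

main :: IO ()
main = print (TE.decodeUtf8 (TE.encodeUtf8 (T.pack "\x043F\x0440\x0438\x0432\x0435\x0442")))
```

The main action just demonstrates the validate-and-copy round trip on an in-memory Russian string, without touching the filesystem.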

All.ReadLines.Text,26991334300,38123337400,1.41243
All.Programs.Fold,154487472000,197604059800,1.27909

Same here: these are locale-dependent file reads of a Russian text.

All.Programs.Throughput.Text,36687180825,60851039100,1.65865
All.Programs.Throughput.LazyText,32469217750,55323163500,1.70386
All.Programs.Throughput.TextByteString,29604138418,4192649650,0.141624
All.Programs.Throughput.LazyTextByteString,27621090150,4613553668,0.16703

And here yet again: while locale-dependent reading regresses, decoding known UTF-8 inputs is 6-7x faster.

Tiny benchmarks

All.Pure.tiny.drop.Text,15132,25045,1.6551
All.Pure.tiny.take.Text,16078,23947,1.48943
All.Pure.tiny.length.cons.Text,12695,23493,1.85057
All.Pure.tiny.length.drop.Text,19860,36794,1.85267
All.Pure.tiny.length.drop.LazyText,28536,40957,1.43527
All.Pure.tiny.length.init.Text,16658,24041,1.44321
All.Pure.tiny.length.init.LazyText,23645,34648,1.46534
All.Pure.tiny.length.map.Text,17709,24007,1.35564
All.Pure.tiny.length.replicate char.Text,18784,29627,1.57725
All.Pure.tiny.length.replicate string.Text,18014,31220,1.7331
All.Pure.tiny.length.take.Text,16378,31957,1.95122
All.Pure.tiny.length.take.LazyText,27934,40618,1.45407
All.Pure.tiny.length.tail.Text,15305,25656,1.67631
All.Pure.tiny.length.tail.LazyText,18594,28876,1.55297

The tiny benchmarks are really tiny: the text is only 5 characters long, and all operations are in the nanosecond range, so they are unlikely to be a bottleneck, as long as nothing is seriously slower. The explanation is that drop / take / length now use 512-bit vectorised implementations, which need a certain runway to accelerate.
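A scalar model of the new length (my illustration, not the library's SIMD code) makes the "runway" point clear: the character count of a UTF-8 buffer is just the number of non-continuation bytes, and a vectorised version classifies a whole SIMD lane of bytes per instruction, which only pays off on longer inputs.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B

-- Character count of a UTF-8 buffer: count bytes NOT of shape 10xxxxxx.
-- One branch per byte here; a 512-bit vectorised version does the same
-- classification for 64 bytes at once, so a 5-character input never
-- reaches the fast path.
utf8Length :: B.ByteString -> Int
utf8Length = B.foldl' step 0
  where step n b = if b .&. 0xC0 /= 0x80 then n + 1 else n

main :: IO ()
main = print (utf8Length (B.pack [0xD0, 0xBF, 0xD1, 0x80]))
```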

Plenty of other tiny benchmarks are faster, so the geometric mean of this group is 0.82, still well below one.

Filtering

All.Pure.ascii-small.filter.filter.Text,71800920,160835488,2.24002
All.Pure.ascii-small.length.filter.Text,68873323,138127519,2.00553
All.Pure.ascii-small.length.filter.filter.Text,68676682,135933025,1.97932
All.Pure.ascii.filter.filter.Text,62801740200,139964924300,2.22868
All.Pure.ascii.length.filter.Text,60781141300,121388757200,1.99715
All.Pure.ascii.length.filter.filter.Text,60407876225,120215499200,1.99006
All.Pure.english.filter.filter.Text,4339939587,9306667587,2.14442
All.Pure.english.length.filter.Text,4016696125,7925535487,1.97315
All.Pure.english.length.filter.filter.Text,3977912193,8051621793,2.02408
All.Pure.russian.filter.filter.Text,7496362,28343670,3.78099
All.Pure.russian.length.filter.Text,7225880,27338591,3.78343
All.Pure.russian.length.filter.filter.Text,7363396,25771804,3.49999
All.Pure.japanese.filter.filter.Text,5054263,19458210,3.84986
All.Pure.japanese.length.filter.Text,4508120,15550037,3.44934
All.Pure.japanese.length.filter.filter.Text,4561359,14601019,3.20102

The thing is that benchmarks for filter are quite unrepresentative:

            , bgroup "filter"
                [ benchT   $ nf (T.length . T.filter p0) ta
                , benchTL  $ nf (TL.length . TL.filter p0) tla
                ]
    c  = 'й'
    p0 = (== c)

As one might expect, T.filter (== 'й') returns an empty Text for anything that is not Russian, and even for Russian the output is a tiny fraction of the input (one should rather use T.replicate and T.count). Essentially, these benchmarks measure parsing a buffer into a stream of Char (and discarding the stream outright). As discussed previously, such parsing is expected to be faster for UTF-16.

I would also like to mention that filtering Unicode produces meaningful results only in a handful of scenarios. E.g., T.filter isAscii produces different results depending on the Unicode normalization of observably indistinguishable inputs. It's unlikely that in a well-designed system T.filter becomes a bottleneck.
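To illustrate the normalization caveat: the two texts below render identically, yet T.filter isAscii gives different answers (example mine):

```haskell
import Data.Char (isAscii)
import qualified Data.Text as T

-- NFC form: a single precomposed code point U+00E9 ('é').
nfc :: T.Text
nfc = T.pack "\x00E9"

-- NFD form: 'e' followed by U+0301 (combining acute accent).
-- Visually indistinguishable from nfc.
nfd :: T.Text
nfd = T.pack "e\x0301"

main :: IO ()
main = do
  print (T.filter isAscii nfc) -- keeps nothing: U+00E9 is not ASCII
  print (T.filter isAscii nfd) -- keeps the base letter "e"
```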

Mapping

All.Pure.ascii-small.mapAccumL.Text,608466155,1565305392,2.57254
All.Pure.ascii-small.mapAccumL.LazyText,544866703,1232343353,2.26173
All.Pure.ascii-small.mapAccumR.Text,728927382,1694930725,2.32524
All.Pure.ascii-small.mapAccumR.LazyText,716806537,1355231652,1.89065
All.Pure.ascii.mapAccumL.Text,986479975600,2729346036400,2.76675
All.Pure.ascii.mapAccumL.LazyText,1043058781400,2070791592400,1.98531
All.Pure.ascii.mapAccumR.Text,1422074596600,2720855812100,1.9133
All.Pure.ascii.mapAccumR.LazyText,1490645122500,2148077009200,1.44104
All.Pure.english.mapAccumL.Text,60874261256,176335668000,2.89672
All.Pure.english.mapAccumL.LazyText,52894601675,137879584200,2.60669
All.Pure.english.mapAccumR.Text,71870856700,187798907350,2.613
All.Pure.english.mapAccumR.LazyText,80292101500,144580070800,1.80068
All.Pure.russian.mapAccumL.Text,59171156,93213606,1.57532
All.Pure.russian.mapAccumL.LazyText,39046606,91835323,2.35194
All.Pure.russian.mapAccumR.Text,67980959,94374382,1.38825
All.Pure.russian.mapAccumR.LazyText,66807119,94789778,1.41886
All.Pure.japanese.mapAccumL.LazyText,30840265,46513529,1.50821
All.Pure.japanese.mapAccumR.Text,49972835,55210934,1.10482

mapAccumL / mapAccumR certainly regress badly. Unfortunately, the nature of these functions requires us to parse and print character by character, which is slower for UTF-8 than for UTF-16. Originally I did not anticipate that worse CPU branch prediction could lead to such a drastic performance difference.

I believe this is acceptable nevertheless, simply because I struggle to find any real-world use cases for mapAccum{L,R}. Again, it's incredibly difficult to imagine a usage of mapAccum{L,R} that carefully handles Unicode intricacies and subtleties, yet is still bound by the performance of UTF-8 parsing.
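For context, mapAccumL threads an accumulator through the text while rebuilding it, so every single character must be decoded from UTF-8 and re-encoded back; a toy use (mine) numbering characters while uppercasing:

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- T.mapAccumL :: (a -> Char -> (a, Char)) -> a -> Text -> (a, Text)
-- Each step consumes one Char parsed from the UTF-8 buffer and produces
-- one Char to be printed back, so the per-character codec cost is
-- unavoidable here, unlike in take / drop / length.
countAndUpper :: T.Text -> (Int, T.Text)
countAndUpper = T.mapAccumL (\n c -> (n + 1, toUpper c)) 0

main :: IO ()
main = print (countAndUpper (T.pack "abc"))
```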

All.Pure.tiny.map.Text,69715,46306,0.664219
All.Pure.tiny.map.LazyText,172889,57529,0.332751
All.Pure.tiny.map.map.Text,35199,51573,1.46518
All.Pure.tiny.map.map.LazyText,147299,61317,0.416276
All.Pure.ascii-small.map.Text,570986035,313665625,0.54934
All.Pure.ascii-small.map.LazyText,1495289648,307545056,0.205676
All.Pure.ascii-small.map.map.Text,138758526,393272460,2.83422
All.Pure.ascii-small.map.map.LazyText,1399017382,355304809,0.253967
All.Pure.ascii.map.Text,581282600000,265156600000,0.456158
All.Pure.ascii.map.LazyText,1431011100000,306147300000,0.213938
All.Pure.ascii.map.map.Text,117102675000,338362775000,2.88945
All.Pure.ascii.map.map.LazyText,1347195900000,347654612500,0.258058
All.Pure.english.map.Text,40569093750,17671465625,0.435589
All.Pure.english.map.LazyText,92442137500,17780534375,0.192342
All.Pure.english.map.map.Text,8030321875,22338706250,2.78179
All.Pure.english.map.map.LazyText,84480615625,20408321875,0.241574
All.Pure.russian.map.Text,54396923,46203198,0.849372
All.Pure.russian.map.LazyText,139894458,45981140,0.328685
All.Pure.russian.map.map.Text,12951262,54321185,4.19428
All.Pure.russian.map.map.LazyText,125892663,54785674,0.435178
All.Pure.japanese.map.Text,37302227,32993960,0.884504
All.Pure.japanese.map.LazyText,95479821,32010337,0.335258
All.Pure.japanese.map.map.Text,9422100,34822103,3.69579
All.Pure.japanese.map.map.LazyText,90500512,35118249,0.388045

Similar to the above, it's not surprising for map succ to regress. It's actually more surprising that the results are not uniform across the board: 3/4 of the benchmarks became faster! The geometric mean over this group is 0.62. Again, I do not expect a real-life application of map f that faithfully processes Unicode to be bottlenecked on UTF-8 parsing, because such an f must be quite involved.

If you are worried about the performance of case conversions, worry no longer. They unanimously vote in favor of UTF-8 (geometric mean 0.42):

All.Pure.tiny.toLower.Text,414602,186216,0.449144
All.Pure.tiny.toLower.LazyText,437520,232168,0.530645
All.Pure.tiny.toUpper.Text,424902,197850,0.465637
All.Pure.tiny.toUpper.LazyText,446135,251065,0.562756
All.Pure.ascii-small.toLower.Text,5427945312,2014412500,0.371119
All.Pure.ascii-small.toLower.LazyText,5603268750,2170191210,0.387308
All.Pure.ascii-small.toUpper.Text,5399546875,2515530078,0.465878
All.Pure.ascii-small.toUpper.LazyText,5807147265,2683082812,0.462031
All.Pure.ascii.toLower.Text,4645536200000,1836974600000,0.395428
All.Pure.ascii.toLower.LazyText,4860871800000,1819362600000,0.374287
All.Pure.ascii.toUpper.Text,4778418000000,2250802600000,0.471035
All.Pure.ascii.toUpper.LazyText,4997500800000,2288388800000,0.457907
All.Pure.english.toLower.Text,316019437500,126184218750,0.399293
All.Pure.english.toLower.LazyText,324979400000,126374043750,0.388868
All.Pure.english.toUpper.Text,317852300000,150322700000,0.472933
All.Pure.english.toUpper.LazyText,333748737500,156130225000,0.467808
All.Pure.russian.toLower.Text,497689624,195327001,0.392467
All.Pure.russian.toLower.LazyText,525514331,209794140,0.399217
All.Pure.russian.toUpper.Text,497672460,232213159,0.466598
All.Pure.russian.toUpper.LazyText,532040087,249128051,0.468251
All.Pure.japanese.toLower.Text,347989550,120762292,0.347029
All.Pure.japanese.toLower.LazyText,368167578,133719238,0.363202
All.Pure.japanese.toUpper.Text,353876123,121747778,0.344041
All.Pure.japanese.toUpper.LazyText,375892382,136650164,0.363535
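As a reminder of what these benchmarks exercise: toUpper / toLower perform full Unicode case mapping rather than a per-Char map, so a single character may expand into several. A minimal sketch:

```haskell
import qualified Data.Text as T

main :: IO ()
main = do
  -- Full case mapping: U+00DF LATIN SMALL LETTER SHARP S expands to "SS",
  -- so the result can be longer than the input.
  print (T.toUpper (T.pack "straße"))  -- "STRASSE"
  print (T.toLower (T.pack "ТЕКСТ"))  -- "текст"
```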

Appending builders

All.Pure.tiny.Builder.mappend char,22403727,44948850,2.00631
All.Pure.tiny.Builder.mappend 8 char,79327,105339,1.32791
All.Pure.tiny.Builder.mappend text,501421785,478581864,0.95445
All.Pure.ascii-small.Builder.mappend char,23652920,45294095,1.91495
All.Pure.ascii-small.Builder.mappend 8 char,90781,110436,1.21651
All.Pure.ascii-small.Builder.mappend text,570745780,498310443,0.873087
All.Pure.ascii.Builder.mappend char,24550711,46687537,1.90168
All.Pure.ascii.Builder.mappend 8 char,102611,113728,1.10834
All.Pure.ascii.Builder.mappend text,593820546,538251198,0.906421
All.Pure.english.Builder.mappend char,24330778,46123828,1.8957
All.Pure.english.Builder.mappend 8 char,106896,107841,1.00884
All.Pure.english.Builder.mappend text,585586463,521747294,0.890983
All.Pure.russian.Builder.mappend char,23933431,45348496,1.89478
All.Pure.russian.Builder.mappend 8 char,108024,104745,0.969646
All.Pure.russian.Builder.mappend text,601874503,511843154,0.850415
All.Pure.japanese.Builder.mappend char,24568698,45829181,1.86535
All.Pure.japanese.Builder.mappend 8 char,103933,111322,1.07109
All.Pure.japanese.Builder.mappend text,580084975,522780439,0.901214

mappend char is a benchmark which glues together 10000 results of T.singleton. Since UTF-8 encoding of a single Char is more involved, this takes more time than for UTF-16. However, the results for mappend text (which glues together 10000 short texts) demonstrate that this slowdown is limited to rather artificial scenarios.
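The two benchmark shapes can be sketched as follows (assumed structure inferred from the description above; 10000 matches the stated repetition count). B.singleton forces a per-Char encode into the builder's buffer, while B.fromText copies an already-encoded chunk wholesale:

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

main :: IO ()
main = do
  -- "mappend char": each Char is encoded into the buffer on its own.
  let perChar = B.toLazyText (mconcat (replicate 10000 (B.singleton 'x')))
  -- "mappend text": short, already-encoded chunks are copied wholesale.
  let perText = B.toLazyText (mconcat (replicate 10000 (B.fromText (T.pack "x"))))
  -- Both routes produce the same lazy Text.
  print (TL.length perChar, perChar == perText)  -- (10000,True)
```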

Consing and unconsing of lazy Text

All.Pure.ascii.cons.LazyText,15004236,16599998,1.10635
All.Pure.ascii.tail.LazyText,14748675,16838993,1.14173
All.Pure.english.tail.LazyText,354389,461140,1.30123
All.Pure.russian.tail.LazyText,21311,19200,0.900943
All.Pure.japanese.tail.LazyText,21753,20404,0.937986

All.Pure.ascii.uncons.LazyText,14884936,16790311,1.12801
All.Pure.ascii-small.uncons.LazyText,31627,33086,1.04613
All.Pure.english.uncons.LazyText,389700,389421,0.999284
All.Pure.russian.uncons.LazyText,30911,28948,0.936495
All.Pure.japanese.uncons.LazyText,33544,31087,0.926753

Benchmarks for lazy cons / uncons / tail are pretty meaningless: while the operations themselves complete in nanoseconds, most of the time is spent forcing the chain of chunks and on memory churn. Their strict counterparts do not show any regressions.

Japanese

All.Pure.japanese.append.Text,1557057,2219109,1.42519
All.Pure.japanese.words.LazyText,83176171,94262719,1.13329
All.Pure.japanese.zipWith.Text,82934658,113370141,1.36698

Japanese benchmarks are especially challenging for UTF-8, because this script takes 50% more space in UTF-8 than in UTF-16 (three bytes per character instead of two) and its decoding is considerably more involved. So it is not surprising to see regressions; it is surprising to see only a few, all below 50%. In fact, the geometric mean of the Japanese group is 0.26.
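The size difference is easy to verify (a sketch; encodeUtf16LE stands in here for the old internal representation):

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

main :: IO ()
main = do
  let t = T.pack "日本語"
  print (BS.length (encodeUtf8 t))    -- 9: three bytes per character
  print (BS.length (encodeUtf16LE t)) -- 6: two bytes per character
```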

Summary

It was known from the inception of this project that certain benchmarks were likely to regress. The regressions are mostly limited to artificial benchmarks or scenarios unlikely to arise in a real-world application, and they are outweighed by gains in other areas.

@emilypi
Member

emilypi commented Aug 27, 2021

Just ran the benchmarks to confirm on my machine; with GHC 8.10.4 I get the following:

On master:

All
  Pure
    tiny
      encode
        Text:     OK (0.16s)
          33.1 ns ± 2.9 ns
        LazyText: OK (0.14s)
          60.3 ns ± 5.9 ns
    ascii-small
      encode
        Text:     OK (0.24s)
          13.0 μs ± 847 ns
        LazyText: OK (0.66s)
          76.4 μs ± 5.0 μs
    ascii
      encode
        Text:     OK (1.43s)
          14.1 ms ± 1.3 ms
        LazyText: OK (1.48s)
          87.5 ms ± 2.7 ms
    english
      encode
        Text:     OK (0.48s)
          761  μs ±  67 μs
        LazyText: OK (0.56s)
          3.92 ms ± 350 μs
    russian
      encode
        Text:     OK (0.56s)
          8.15 μs ± 442 ns
        LazyText: OK (0.50s)
          15.0 μs ± 370 ns
    japanese
      encode
        Text:     OK (0.39s)
          11.6 μs ± 1.1 μs
        LazyText: OK (0.39s)
          10.9 μs ± 832 ns

On utf8:

All
  Pure
    tiny
      encode
        Text:     OK (0.23s)
          13.0 ns ± 686 ps, 60% faster than baseline
        LazyText: OK (0.43s)
          24.5 ns ± 1.9 ns, 59% faster than baseline
    ascii-small
      encode
        Text:     OK (0.69s)
          4.86 μs ± 187 ns, 62% faster than baseline
        LazyText: OK (0.77s)
          5.27 μs ± 177 ns, 93% faster than baseline
    ascii
      encode
        Text:     OK (1.82s)
          4.12 ms ±  97 μs, 70% faster than baseline
        LazyText: OK (0.72s)
          36.3 ms ± 1.1 ms, 58% faster than baseline
    english
      encode
        Text:     OK (0.43s)
          306  μs ±  31 μs, 59% faster than baseline
        LazyText: OK (0.80s)
          342  μs ±  24 μs, 91% faster than baseline
    russian
      encode
        Text:     OK (0.35s)
          584  ns ±  57 ns, 92% faster than baseline
        LazyText: OK (0.37s)
          612  ns ±  35 ns, 95% faster than baseline
    japanese
      encode
        Text:     OK (0.74s)
          634  ns ±  48 ns, 94% faster than baseline
        LazyText: OK (0.75s)
          650  ns ±  33 ns, 94% faster than baseline

Some of the benchmarks are in roughly the same ballpark, but others are drastically faster. Notably, some of the lazy Text benchmarks on my machine are much faster. This is great! I'll have a substantive review shortly.

(Resolved review threads on src/Data/Text/Encoding.hs and src/Data/Text/Internal/Encoding/Utf8.hs)
@L-as

L-as commented Aug 27, 2021

Perhaps I've missed this, but how is the performance difference on non-x86-64 platforms?

@Bodigrim
Contributor Author

@L-as the underlying simdutf library offers vectorised UTF-8 validation for the arm64 and ppc64 architectures as well.

@Bodigrim
Contributor Author

Speaking of not-so-synthetic benchmarks, here is prettyprinter, measured against text-1.2.5.0:

All
  80 characters, 50% ribbon
    prettyprinter
      layoutPretty:  OK (2.79s)
        180  ms ±  14 ms, 22% faster than baseline
      layoutSmart:   OK (2.76s)
        180  ms ± 9.9 ms, 22% faster than baseline
      layoutCompact: OK (2.50s)
        162  ms ± 7.6 ms
  Infinite/large page width
    prettyprinter
      layoutPretty:  OK (5.82s)
        184  ms ±  11 ms, 20% faster than baseline
      layoutSmart:   OK (1.34s)
        181  ms ±  14 ms, 24% faster than baseline
      layoutCompact: OK (1.22s)
        166  ms ±  16 ms
All
  Many small words
    Unoptimized:             OK (2.85s)
      2.72 μs ± 141 ns, 50% faster than baseline
    Shallowly fused:         OK (2.23s)
      533  ns ±  42 ns, 29% faster than baseline
    Deeply fused:            OK (2.23s)
      532  ns ±  26 ns, 29% faster than baseline
  vs. other libs
    renderPretty
      this, unoptimized:     OK (3.01s)
        5.68 μs ± 273 ns, 32% faster than baseline
      this, shallowly fused: OK (1.31s)
        5.02 μs ± 410 ns, 34% faster than baseline
      this, deeply fused:    OK (2.64s)
        5.05 μs ± 238 ns, 35% faster than baseline
    renderSmart
      this, unoptimized:     OK (1.71s)
        6.52 μs ± 592 ns, 31% faster than baseline
      this, shallowly fused: OK (5.70s)
        5.45 μs ± 215 ns, 35% faster than baseline
      this, deeply fused:    OK (2.90s)
        5.63 μs ± 333 ns, 32% faster than baseline
    renderCompact
      this, unoptimized:     OK (1.21s)
        4.65 μs ± 448 ns, 42% faster than baseline
      this, shallowly fused: OK (2.18s)
        4.22 μs ± 341 ns, 43% faster than baseline
      this, deeply fused:    OK (9.00s)
        4.29 μs ± 190 ns, 43% faster than baseline

@Bodigrim
Contributor Author

Bodigrim commented Sep 6, 2021

Is there some documentation somewhere that outlines the plan for getting this merged and transitioning the community?

text is a boot package, bundled with GHC. In the best case scenario text-2.0 is to be shipped with GHC 9.4, around summer 2022. So there is plenty of time for transition. The outline is as follows:

  • Merge utf8 branch into text HEAD.
  • Relax upper bounds for text in parsec and Cabal HEAD.
  • Bump text submodule (and parsec / Cabal as well) in GHC source tree.
  • Work through head.hackage to provide migration patches.
  • Do a proper text-2.0 release on Hackage.

@Bodigrim
Contributor Author

Bodigrim commented Sep 6, 2021

I addressed all the feedback above and pushed changes. Unless there are critical bugs, I'd appreciate if we wrap up and merge the branch by the end of the week, so that further work is unblocked.

This is a final call for reviews and approvals.


@ghost ghost left a comment


Just some questions about the doc updates.

(Resolved review threads on changelog.md and src/Data/Text/Array.hs)
@ketzacoatl

text is a boot package, bundled with GHC. In the best case scenario text-2.0 is to be shipped with GHC 9.4, around summer 2022. So there is plenty of time for transition.

@Bodigrim, is there a way to use the new text package in projects for testing, ahead of when it's included in GHC? or does this question not make sense?

@tomjaguarpaw
Member

@ketzacoatl Are you looking for something like this:

https://cabal.readthedocs.io/en/3.4/cabal-project.html#specifying-packages-from-remote-version-control-locations
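Concretely, one can point a project at a development version of text with a source-repository-package stanza along these lines (a sketch only — the tag here is illustrative; pin whichever branch or commit you want to test):

```
-- cabal.project
packages: .

source-repository-package
  type: git
  location: https://github.com/haskell/text.git
  tag: master
```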


@Boarders Boarders left a comment


Great work @Bodigrim - I read through what I could and it looks good to me.

@ketzacoatl

Are you looking for something like this:

@Bodigrim, more like your confirmation that you'd expect that type of reference to work. And I'll take your answer as a yes. Thanks!

Member

@emilypi emilypi left a comment


Approved. Amazing job @Bodigrim 🎉

@Bodigrim Bodigrim merged commit 3488190 into haskell:master Sep 8, 2021
@Bodigrim
Contributor Author

Bodigrim commented Sep 8, 2021

I think after 150 comments and 200 likes we are in a good position to merge. Thanks everyone for active participation, feel free to provide more feedback here or in separate issues.

@Bodigrim Bodigrim deleted the utf8 branch September 8, 2021 21:03
@ghost

ghost commented Sep 9, 2021

Congratulations @Bodigrim, fantastic work, well done for getting it across the line.

@Bodigrim
Contributor Author

This is old news, but the PR has been released as a part of text-2.0. My ZuriHac talk, covering 10-years-long story of UTF-8 transition, is available at https://www.youtube.com/watch?v=1qlGe2qnGZQ.
