Switch internal representation to UTF8 #365

Bodigrim · 2021-08-22T02:02:20Z

Results

Benchmark results for GHC 8.10 can be found at https://gist.github.com/Bodigrim/365e388e080b17de45e80ab50a55fb4f. I'll publish a detailed analysis later, but here are the most notable improvements:

decodeUtf8 is up to 10x faster for non-ASCII texts.
encodeUtf8 is ~1.5-4x faster for strict and up to 10x faster for lazy Text.
take / drop / length are up to 20x faster.
toUpper / toLower are 10-30% faster.
Eq instance of strict Text is 10x faster.
Ord instance is typically 30%+ faster.
isInfixOf and search routines are up to 10x faster.
replicate of Char is up to 20x faster.

Geometric mean of benchmark times is 0.33.

How to review

I'd like to encourage as many reviewers as possible, and the branch is structured to facilitate it. Each commit builds and passes tests, so they can be reviewed individually. The actual switch from UTF-16 to UTF-8 happens in 1369cd3, and the remaining TODOs are resolved in 99f3c48. Everything else is performance improvements. There are two commits of intimidating size: 2cb3b30, which dumps autogenerated case mappings, and c3ccdb3, which bundles amalgamated simdutf. I tried to keep the other commits to a reasonable size and scope; hopefully they are palatable.

I'm happy to answer questions and add comments to any code that is not clear enough. Do not hesitate to ask.

Known issues

This branch uses the simdutf library, which is massive and written in C++. Since this is a bit inconvenient for a boot package, @kozross is currently finalizing a pure C replacement, which will be submitted for review in a separate PR.

Fuuzetsu · 2021-08-22T11:40:32Z

I guess this also addresses #272 in one swoop

chrisdone · 2021-08-22T17:06:17Z

What’s slower?

Fuuzetsu · 2021-08-22T23:50:05Z

What’s slower?

I guess it'd be good to see what encodeUtf16 and friends give now. I suspect these are barely used compared to encodeUtf8 anyway.

jberryman · 2021-08-23T14:50:45Z

EDIT: Oops, I totally botched processing the CSV as Bodigrim points out below

I'll see if I can build https://github.com/hasura/graphql-engine with this branch and run it through our benchmarking infra this week.

EDIT: going back to the UTF-8 proposal it looks like for better or worse the only concrete acceptance criteria vis a vis performance are:

decodeUtf8 and encodeUtf8 become at least 2x faster.

Geometric mean of existing benchmarks (which favor UTF-16) decreases.

Fusion (as per our test suite) does not regress beyond at most several cases.

so above is more than acceptable according to that spec

jkachmar · 2021-08-23T15:57:45Z

EDIT: Comment is no longer relevant to the discussion; see Bodigrim's response below.

chrisdone · 2021-08-23T16:46:16Z

Thanks for the numbers @jberryman

Bodigrim · 2021-08-23T17:21:12Z

@jberryman I'm afraid your table does not make sense:

Name	master branch	utf8 branch	0	Log Ratio
All.Programs.Throughput.LazyTextByteString	27378356325	4471250933	Ratio	-0.787
All.Programs.Throughput.TextByteString	23929606012	4610553675	4.361	-0.715

The first two rows are corrupted, so let's start from the third one. Here Ratio does not match the time measurements: 4610553675 / 23929606012 = 0.193, which means that the UTF-8 branch is 5x faster, not 4.361x slower. Everything else misrepresents the original data in a similar way.

Instead of sorting rows by Ratio, you sorted values in Ratio column on their own and reversed the table, apparently obtaining nonsensical results.
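For reference, a minimal sketch of the intended processing: compute the ratio per row and sort whole rows by it, so that names, timings and ratios stay together. The `Row` type and function names are hypothetical, and the two rows reuse the timings quoted above.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- One benchmark row: name plus mean times on each branch (same unit).
-- The record and its field names are made up for this illustration.
data Row = Row
  { rowName    :: String
  , masterTime :: Double
  , utf8Time   :: Double
  } deriving (Show, Eq)

-- Ratio < 1 means the utf8 branch is faster; > 1 means it is slower.
ratio :: Row -> Double
ratio r = utf8Time r / masterTime r

-- Sort whole rows by ratio, worst regression first.
sortBySlowdown :: [Row] -> [Row]
sortBySlowdown = sortOn (Down . ratio)

rows :: [Row]
rows =
  [ Row "All.Programs.Throughput.LazyTextByteString" 27378356325 4471250933
  , Row "All.Programs.Throughput.TextByteString" 23929606012 4610553675
  ]

main :: IO ()
main = mapM_ printRow (sortBySlowdown rows)
  where
    printRow r = putStrLn (rowName r ++ "\t" ++ show (ratio r))
```

Both ratios come out well below 1, i.e. both benchmarks are improvements rather than regressions.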

jberryman · 2021-08-23T18:05:09Z

Doh! Sorry

Regressions:

Name	master branch	utf8 branch	Ratio	Log Ratio
All.Pure.japanese.filter.filter.Text	4286460	18693165	4.361	0.640
All.Pure.russian.filter.filter.Text	6391499	26611977	4.164	0.619
All.Pure.russian.length.filter.Text	5990205	24360620	4.067	0.609
All.Pure.ascii.mapAccumL.Text	908974332800	3638772627300	4.003	0.602
All.Pure.japanese.length.filter.Text	3907379	14647059	3.749	0.574
All.Pure.japanese.length.words.Text	38836980	143547741	3.696	0.568
All.Pure.japanese.words.Text	39674886	140793437	3.549	0.550
All.Pure.russian.length.filter.filter.Text	6046403	21338005	3.529	0.548
All.Pure.russian.map.map.Text	25095436	85788868	3.419	0.534
All.Pure.english.mapAccumL.Text	60778409175	207472001800	3.414	0.533
All.Pure.ascii-small.intersperse.Text	477122164	1584540007	3.321	0.521
All.Pure.japanese.intersperse.Text	35325822	112014173	3.171	0.501
All.Pure.japanese.length.filter.filter.Text	3852873	11845036	3.074	0.488
All.Pure.ascii-small.mapAccumL.Text	524822893	1584027082	3.018	0.480
All.Pure.russian.intersperse.Text	51438771	154943289	3.012	0.479
All.Pure.japanese.map.map.Text	17463211	52084321	2.983	0.475
All.Pure.ascii.map.map.Text	234876768275	689893193000	2.937	0.468
All.Pure.english.map.map.Text	16816703650	48141679000	2.863	0.457
All.Pure.ascii-small.map.map.Text	268999974	769964654	2.862	0.457
All.Pure.english.mapAccumL.LazyText	47612369112	135492460800	2.846	0.454
All.Pure.ascii-small.mapAccumL.LazyText	492221926	1368635479	2.781	0.444
All.Pure.ascii.mapAccumR.Text	1360928562550	3743383120800	2.751	0.439
All.Pure.english.mapAccumR.Text	79957684125	209392756075	2.619	0.418
All.Pure.ascii-small.length.filter.Text	60110941	155777043	2.591	0.414
All.Pure.english.length.filter.Text	3518092281	9072480412	2.579	0.411
All.Pure.ascii.length.filter.Text	53715988100	134837148600	2.510	0.400
All.Pure.ascii-small.mapAccumR.Text	689645017	1708705487	2.478	0.394
All.Pure.ascii.filter.filter.Text	55365971900	135428844350	2.446	0.388
All.Pure.ascii-small.filter.filter.Text	64062812	156431460	2.442	0.388
All.Pure.ascii.concat.LazyText	25956179262	61828057150	2.382	0.377
All.Pure.ascii.mapAccumL.LazyText	892289471600	2104454784000	2.358	0.373
All.Pure.russian.mapAccumL.LazyText	40192719	94138204	2.342	0.370
All.Pure.english.filter.filter.Text	3861193850	8971767900	2.324	0.366
All.Pure.english.mapAccumR.LazyText	64822552500	140001104800	2.160	0.334
All.Pure.tiny.length.cons.Text	14363	30306	2.110	0.324
All.Pure.tiny.length.take.LazyText	29270	61375	2.097	0.322
All.Pure.ascii-small.mapAccumR.LazyText	677831689	1391476798	2.053	0.312
All.Pure.tiny.length.take.Text	17334	34548	1.993	0.300
All.Pure.tiny.length.drop.LazyText	31327	61538	1.964	0.293
All.FileIndices.Text	1722222759	3368145693	1.956	0.291
All.Pure.russian.mapAccumL.Text	44009081	85401880	1.941	0.288
All.Pure.russian.length.words.Text	59840591	113316215	1.894	0.277
All.FileRead.LazyText	18601989650	34579194700	1.859	0.269
All.Pure.ascii-small.length.filter.filter.Text	60108618	110940976	1.846	0.266
All.Pure.tiny.length.tail.Text	15947	29348	1.840	0.265
All.Pure.tiny.map.map.Text	40689	73561	1.808	0.257
All.Pure.ascii.length.filter.filter.Text	53382547300	96310759700	1.804	0.256
All.Pure.english.length.filter.filter.Text	3554604815	6396166743	1.799	0.255
All.Pure.tiny.length.replicate string.Text	17935	32192	1.795	0.254
All.Pure.tiny.length.drop.Text	22087	39054	1.768	0.248
All.Pure.tiny.take.LazyText	28916	50071	1.732	0.238
All.Pure.english.intersperse.Text	64852656850	110296260500	1.701	0.231
All.Pure.tiny.length.map.Text	17986	30326	1.686	0.227
All.Pure.russian.words.Text	63318701	106785640	1.686	0.227
All.Pure.english.words.Text	94305583550	156552526850	1.660	0.220
All.Pure.ascii.intersperse.Text	1097287378800	1807749568100	1.647	0.217
All.Pure.tiny.length.init.Text	16455	27004	1.641	0.215
All.Pure.ascii-small.words.Text	435580662	706163431	1.621	0.210
All.Programs.Throughput.LazyText	33153355300	53489191400	1.613	0.208
All.Pure.tiny.words.Text	35595	56988	1.601	0.204
All.Pure.tiny.length.words.Text	37474	59682	1.593	0.202
All.Pure.ascii.mapAccumR.LazyText	1337590373100	2128827338800	1.592	0.202
All.Programs.Throughput.Text	36894744925	58716190900	1.591	0.202
All.Pure.tiny.length.replicate char.Text	20409	31951	1.566	0.195
All.Pure.english.length.words.Text	24862166800	38848806662	1.563	0.194
All.Pure.japanese.mapAccumL.LazyText	31111157	48004810	1.543	0.188
All.Pure.russian.mapAccumR.Text	61982220	95396325	1.539	0.187
All.Pure.ascii.length.words.Text	380019569800	581904752000	1.531	0.185
All.Pure.russian.mapAccumR.LazyText	60163025	90158680	1.499	0.176
All.Pure.ascii-small.length.words.Text	431157632	641999207	1.489	0.173
All.FileRead.Text	22193181037	32356484450	1.458	0.164
All.Pure.tiny.length.init.LazyText	25821	37288	1.444	0.160
All.Pure.tiny.drop.LazyText	34048	48234	1.417	0.151
All.Programs.Fold	130703628900	185012830000	1.416	0.151
All.Pure.tiny.take.Text	18041	25449	1.411	0.149
All.Pure.tiny.length.tail.LazyText	22157	30972	1.398	0.145
All.ReadLines.Text	26481958375	36025864550	1.360	0.134
All.Pure.tiny.intersperse.Text	121332	163914	1.351	0.131
All.Pure.japanese.isInfixOf.LazyText	9485642	12734068	1.342	0.128
All.Pure.tiny.drop.Text	18699	24777	1.325	0.122
All.Pure.japanese.append.Text	3813036	5048982	1.324	0.122
All.FileIndices.LazyText	6163252356	8156591087	1.323	0.122
All.Pure.ascii-small.Builder.mappend char	40402773	53383941	1.321	0.121
All.Pure.japanese.Builder.mappend char	40523358	53078920	1.310	0.117
All.Pure.tiny.length.map.map.Text	23670	30874	1.304	0.115
All.Pure.tiny.length.filter.Text	17223	22367	1.299	0.113
All.Pure.russian.Builder.mappend char	40321784	52296500	1.297	0.113
All.Replace.LazyText	7591184037	9838982087	1.296	0.113
All.Pure.english.Builder.mappend char	40925067	52787836	1.290	0.111
All.Pure.russian.map.Text	70205178	90176283	1.284	0.109
All.Pure.tiny.Builder.mappend char	41598222	52693260	1.267	0.103
All.Pure.japanese.map.Text	48719205	61551012	1.263	0.102
All.Pure.ascii.Builder.mappend char	42090766	52697784	1.252	0.098
All.Pure.russian.foldl'.Text	46374687	57498525	1.240	0.093
All.Pure.english.tail.LazyText	313540	387425	1.236	0.092
All.Pure.ascii.words.LazyText	3200002919100	3951757023600	1.235	0.092
All.Pure.russian.reverse.Text	13685795	16806661	1.228	0.089
All.DecodeUtf8.ascii.strict decodeASCII	16597790097	20361800856	1.227	0.089
All.Pure.ascii.words.Text	1923375551450	2335346355100	1.214	0.084
All.DecodeUtf8.ascii.strict decodeLatin1	16669017806	20201149628	1.212	0.083
All.Pure.tiny.length.filter.filter.Text	17128	20717	1.210	0.083
All.Pure.russian.length.filter.filter.LazyText	111583214	134526054	1.206	0.081
All.DecodeUtf8.ascii.Strict	16353951638	19685246662	1.204	0.081
All.Stream.stream.Text	33265855475	39864031600	1.198	0.079
All.Pure.russian.isInfixOf.LazyText	8267631	9853384	1.192	0.076
All.Pure.russian.length.filter.LazyText	115133204	136725833	1.188	0.075
All.DecodeUtf8.ascii.strict decodeUtf8	16567841664	19555564587	1.180	0.072
All.Programs.Cut.Text	38600987350	45420349225	1.177	0.071
All.Pure.japanese.mapAccumL.Text	33773446	39456762	1.168	0.068
All.Pure.russian.zipWith.Text	155804762	180284474	1.157	0.063
All.Pure.japanese.mapAccumR.Text	44827642	51784380	1.155	0.063
All.Pure.japanese.foldl'.Text	32705090	37488457	1.146	0.059
All.Pure.russian.intersperse.LazyText	283186349	323150051	1.141	0.057
All.Pure.japanese.zipWith.Text	108932297	124234861	1.140	0.057
All.Pure.ascii-small.map.Text	756522332	859308795	1.136	0.055
All.Pure.tiny.map.LazyText	203443	229399	1.128	0.052
All.ReadNumbers.DecimalText	453009064	509126414	1.124	0.051
All.Pure.english.Builder.mappend 8 char	70987	79812	1.124	0.051
All.Pure.japanese.Builder.mappend 8 char	74208	83193	1.121	0.050
All.Pure.russian.length.words.LazyText	87385175	97790055	1.119	0.049
All.Programs.BigTable	175518625700	196159030400	1.118	0.048
All.Pure.russian.filter.LazyText	128635242	143446543	1.115	0.047
All.Pure.tiny.Builder.mappend 8 char	71272	79092	1.110	0.045
All.Pure.ascii.decode.Text	17054489931	18899973825	1.108	0.045
All.Builder.Int.Decimal.Show.12	185275	204834	1.106	0.044
All.Pure.ascii.length.intercalate.LazyText	70177439375	77541390000	1.105	0.043
All.Pure.japanese.concat.Text	9245860	10207330	1.104	0.043
All.Pure.tiny.length.cons.LazyText	28795	31707	1.101	0.042
All.Stream.stream.LazyText	47489877000	52205395000	1.099	0.041
All.Pure.ascii.decode'.Text	17238519398	18947115375	1.099	0.041
All.Pure.japanese.words.LazyText	54124018	59078675	1.092	0.038
All.Pure.russian.words.LazyText	102432943	110916035	1.083	0.035
All.Pure.russian.foldl'.LazyText	127752375	138417774	1.083	0.035
All.Pure.russian.Builder.mappend 8 char	73192	79158	1.082	0.034
All.Pure.tiny.uncons.LazyText	46043	49599	1.077	0.032
All.Pure.tiny.filter.LazyText	105763	113667	1.075	0.031
All.Pure.ascii-small.map.map.LazyText	1907625318	2048294912	1.074	0.031
All.Pure.japanese.length.words.LazyText	52596298	56397154	1.072	0.030
All.ReadNumbers.DoubleText	2877579064	3073777312	1.068	0.029
All.Pure.tiny.foldl'.Text	45520	48612	1.068	0.029
All.Pure.ascii-small.zipWith.Text	1690593403	1804527026	1.067	0.028
All.Pure.japanese.Builder.mappend text	417791315	445266578	1.066	0.028
All.Pure.russian.filter.filter.LazyText	116662979	124236636	1.065	0.027
All.Pure.japanese.map.LazyText	132459812	140812427	1.063	0.027
All.Pure.japanese.filter.LazyText	76294521	81085629	1.063	0.026
All.Pure.ascii-small.Builder.mappend 8 char	73099	77613	1.062	0.026
All.Pure.russian.map.LazyText	196094014	208047672	1.061	0.026
All.Pure.russian.map.map.LazyText	181665864	192183338	1.058	0.024
All.Pure.japanese.intersperse.LazyText	200228871	211215202	1.055	0.023
All.Pure.japanese.length.filter.LazyText	71222038	75014667	1.053	0.023

Removing the "japanese" and "russian" benchmarks doesn't change the picture significantly. Like @chrisdone I'd be interested in an assessment of the regressions from @Bodigrim or someone who knows the benchmarks well.

Bodigrim · 2021-08-23T18:18:27Z

As I have written in the starting post, I'm working on a detailed analysis of performance.
@jberryman could you please wrap the table into a spoiler?

Bodigrim · 2021-08-23T23:38:29Z

Performance report

This report compares performance of text package with UTF-8-encoded internal representation (utf8 branch) to UTF-16 representation (master branch). Readers are encouraged to read the original proposal for UTF-8 transition first, especially "Performance impact" section, which provides necessary background.

Tom Harper's original thesis, "Fusion on Haskell Unicode Strings", discusses performance differences between UTF-8, UTF-16 and UTF-32 encodings. Basically, UTF-32 offers the best performance for string operations in synthetic benchmarks, simply because it is a fixed-length encoding and parsing a UTF-32-encoded buffer is a no-op. UTF-16 is worse, because characters can take 16 or 32 bits. However, parsing/printing codepoints is still very simple: there are only two branches, and since the vast majority of codepoints are 16 bits long, CPU branch prediction works wonderfully.
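To make the "only two branches" point concrete, here is a minimal sketch of decoding one code point from UTF-16. It works on a plain list of code units for clarity and is not how text reads its buffers; `decodeUtf16Step` is a hypothetical name.

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Char (chr)
import Data.Word (Word16)

-- | Decode one code point from the front of a list of UTF-16 code units.
-- Returns the character and the remaining units, or Nothing on malformed input.
decodeUtf16Step :: [Word16] -> Maybe (Char, [Word16])
decodeUtf16Step [] = Nothing
decodeUtf16Step (hi : rest)
  -- Branch 1 (the common one): a single unit outside the surrogate range.
  | hi < 0xD800 || hi > 0xDFFF = Just (chr (fromIntegral hi), rest)
  -- Branch 2: a high surrogate followed by a low surrogate.
  | hi <= 0xDBFF, (lo : rest') <- rest, lo >= 0xDC00, lo <= 0xDFFF =
      let h = fromIntegral hi .&. 0x3FF :: Int
          l = fromIntegral lo .&. 0x3FF :: Int
      in Just (chr (0x10000 + (h `shiftL` 10) + l), rest')
  -- Anything else is an unpaired surrogate.
  | otherwise = Nothing

main :: IO ()
main = do
  print (decodeUtf16Step [0x0439])          -- a single BMP code unit
  print (decodeUtf16Step [0xD83D, 0xDE00])  -- a surrogate pair
  print (decodeUtf16Step [0xDC00])          -- a lone low surrogate
```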

However, UTF-8 performance poses certain challenges. Code points can be represented as 1, 2, 3 or 4 bytes, their parsing and printing involves multiple bitwise operations, and CPU branch prediction becomes ineffective, because non-ASCII texts constantly switch between branches. Memory savings from UTF-8 hardly affect synthetic benchmarks, because they usually fit into CPU cache. Synthetic benchmarks also do not account for encoding/decoding data from external sources (usually in UTF-8) and measure pure processing time. That's why text originally went for UTF-16 encoding.
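As an illustration of the four cases and the bit twiddling involved, here is a minimal sketch of encoding a single code point into UTF-8 bytes. It is written over lists for readability and is not the encoder used in this branch; `encodeUtf8Char` is a hypothetical name.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one code point as 1 to 4 UTF-8 bytes.
-- Note: this sketch does not reject surrogate code points (U+D800..U+DFFF),
-- which a real encoder has to handle separately.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12)
                  , cont (n `shiftR` 6)
                  , cont n
                  ]
  where
    n = ord c
    -- All values passed here already fit into 8 bits.
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = byte (0x80 .|. (x .&. 0x3F))

main :: IO ()
main = mapM_ (print . encodeUtf8Char) "A\x0439\x20AC\x1F600"
```

Which of the four guards fires depends on the data, which is why mixed-script input is hard on the branch predictor.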

The key to a better UTF-8 performance is to avoid parsing/printing of codepoints as much as possible. For example, the most common operations on text are cutting and concatenation - and neither of these requires actual parsing of individual characters, they can be executed over opaque memory buffers. Given that external sources are most likely UTF-8 encoded, an application can spend its entire lifetime without ever interpreting a text character-by-character - thus achieving a very decent performance.
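A minimal sketch of such a pipeline, assuming the input is a UTF-8 ByteString: it decodes, cuts at the first newline, concatenates, and encodes again, without the application ever looking at an individual Char. `bracketFirstLine` is a hypothetical helper, and `decodeUtf8` throws on invalid input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Keep the first line of a UTF-8 payload and wrap it in brackets.
-- Only cutting (breakOn) and concatenation (concat) are used.
bracketFirstLine :: B.ByteString -> B.ByteString
bracketFirstLine input =
  let txt            = TE.decodeUtf8 input
      (firstLine, _) = T.breakOn "\n" txt
  in TE.encodeUtf8 (T.concat ["[", firstLine, "]"])

main :: IO ()
main = do
  let input    = TE.encodeUtf8 "привет\nмир"
      expected = TE.encodeUtf8 "[привет]"
  print (bracketFirstLine input == expected)
```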

Operations that necessitate a character-by-character interpretation are likely to regress with UTF-8. E. g., map succ, measured in isolation, could very well be faster on a UTF-16 buffer. We argue that this is an acceptable tradeoff for two reasons. Firstly, if we measure a full pipeline including decoding an input file and encoding an output, savings from these no-ops are likely to outweigh map succ, unless there is a very long chain of map. Secondly, Unicode is so complex and multifaceted that for practical applications parsing/printing of characters is trumped by the application of f in map f. For instance, toUpper / toLower are actually faster in our utf8 branch than they were in master.

The original proposal stated three performance goals:

Fusion (as per our test suite) does not regress beyond at most several cases.

decodeUtf8 and encodeUtf8 become at least 2x faster.

Geometric mean of existing benchmarks (which favor UTF-16) decreases.

While the work on UTF-8 transition was underway, text package decided to abandon implicit fusion framework, as it was demonstrated to harm asymptotic (!) performance (#348). FWIW early drafts of utf8 branch, prior to the 21st of June, showed no issues with fusion, e. g., https://github.com/Bodigrim/text/commits/utf8-210609.

Next, here are results for encodeUtf8:

git checkout master
cabal run text-benchmarks -- -t100 -p encode --csv text-master-encode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p encode --baseline text-master-encode.csv

tiny
  encode
    Text:
      18.3 ns ± 202 ps, 58% faster than baseline
    LazyText:
      33.1 ns ± 364 ps, 61% faster than baseline
ascii-small
  encode
    Text:
      4.80 μs ±  58 ns, 55% faster than baseline
    LazyText:
      7.57 μs ±  71 ns, 86% faster than baseline
ascii
  encode
    Text:
      7.17 ms ±  40 μs, 60% faster than baseline
    LazyText:
      78.1 ms ± 672 μs, 32% faster than baseline
english
  encode
    Text:
      347  μs ± 5.5 μs, 44% faster than baseline
    LazyText:
      510  μs ± 4.4 μs, 81% faster than baseline
russian
  encode
    Text:
      1.36 μs ±  24 ns, 88% faster than baseline
    LazyText:
      1.37 μs ±  13 ns, 92% faster than baseline
japanese
  encode
    Text:
      1.26 μs ±  19 ns, 88% faster than baseline
    LazyText:
      1.28 μs ±  23 ns, 91% faster than baseline

As expected, we get an astonishing speedup for non-ASCII data. Results for English texts are a bit less impressive, but bear in mind that master recently gained (#302) a twice-faster SIMD-based encoder, which is very good for pure, uninterrupted ASCII. And for the ascii benchmark, where the input is 50M long, memory bandwidth becomes a bottleneck, limiting the achievable speedup.

Moving to decodeUtf8, it's worth noticing that this is not a no-op. While for a valid UTF-8 ByteString it suffices just to copy it to Text, one must check that the input is valid first. Naively, validating UTF-8 encoding is no faster than parsing characters one-by-one with appropriate error reporting. However, we employ simdutf library, which validates Unicode using a vectorised state machine.
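For comparison, this is roughly what the naive, byte-at-a-time validation looks like. It is a sketch over a list of bytes following the well-formed byte ranges from the Unicode standard, not the simdutf algorithm and not the code in this branch; `isValidUtf8` is a hypothetical name.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- | Naive scalar check that a byte sequence is well-formed UTF-8.
-- Every lead byte selects a branch, which is what makes this slow.
isValidUtf8 :: [Word8] -> Bool
isValidUtf8 [] = True
isValidUtf8 (b0 : rest)
  | b0 < 0x80                = isValidUtf8 rest       -- ASCII
  | b0 >= 0xC2 && b0 <= 0xDF = followedBy 1 0x80 0xBF -- 2 bytes
  | b0 == 0xE0               = followedBy 2 0xA0 0xBF -- 3 bytes, no overlong forms
  | b0 == 0xED               = followedBy 2 0x80 0x9F -- 3 bytes, no surrogates
  | b0 >= 0xE1 && b0 <= 0xEF = followedBy 2 0x80 0xBF -- other 3-byte forms
  | b0 == 0xF0               = followedBy 3 0x90 0xBF -- 4 bytes, no overlong forms
  | b0 >= 0xF1 && b0 <= 0xF3 = followedBy 3 0x80 0xBF -- other 4-byte forms
  | b0 == 0xF4               = followedBy 3 0x80 0x8F -- 4 bytes, up to U+10FFFF
  | otherwise                = False
  where
    -- Expect n continuation bytes; the first one must lie in [lo, hi].
    followedBy n lo hi = case splitAt n rest of
      (c1 : cs, rest')
        | length (c1 : cs) == n, c1 >= lo, c1 <= hi, all isCont cs ->
            isValidUtf8 rest'
      _ -> False
    isCont c = c .&. 0xC0 == 0x80

main :: IO ()
main = do
  print (isValidUtf8 [0xD0, 0xB9])        -- a two-byte sequence
  print (isValidUtf8 [0xC0, 0x80])        -- an overlong encoding
  print (isValidUtf8 [0xED, 0xA0, 0x80])  -- an encoded surrogate
```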

git checkout master
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --csv text-master-decode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --baseline text-master-decode.csv

tiny
  decode
    Text:
      34.0 ns ± 286 ps, 36% faster than baseline
    LazyText:
      150  ns ± 106 ps, 42% faster than baseline
ascii-small
  decode
    Text:
      5.94 μs ±  92 ns, 58% faster than baseline
    LazyText:
      8.68 μs ± 148 ns, 52% faster than baseline
ascii
  decode
    Text:
      10.3 ms ± 163 μs, 54% faster than baseline
    LazyText:
      74.5 ms ± 406 μs, 59% faster than baseline
english
  decode
    Text:
      398  μs ± 3.6 μs, 63% faster than baseline
    LazyText:
      575  μs ± 7.3 μs, 46% faster than baseline
russian
  decode
    Text:
      1.88 μs ±  27 ns, 95% faster than baseline
    LazyText: OK (5.02s)
      2.41 μs ±  30 ns, 93% faster than baseline
japanese
  decode
    Text:
      2.15 μs ± 5.2 ns, 93% faster than baseline
    LazyText: OK (22.11s)
      2.63 μs ±  13 ns, 91% faster than baseline

Again, results for non-ASCII inputs are pretty much fantastic, while English texts are less impressive (but still 2x faster), for similar reasons as above. The tiny benchmark is really tiny: just five letters, so there is not enough runway for the simdutf vectorised state machine to accelerate, and we get a comparably modest speedup.

With regards to the third stated goal, as mentioned earlier, the geometric mean over all benchmarks is 0.33. Obviously, this number does not quite characterise them in full and indeed it's a mixed bag. I'll cover most notable regressions (and most notable improvements!) later this week.
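For readers unfamiliar with the metric, here is a minimal sketch of how a geometric mean over per-benchmark ratios (utf8 time / master time) can be computed. The function name is hypothetical and the sample ratios are illustrative only, not taken from the results.

```haskell
-- | Geometric mean of positive ratios; Nothing for an empty list.
-- Computed via logarithms to avoid overflow on long lists.
geometricMean :: [Double] -> Maybe Double
geometricMean [] = Nothing
geometricMean xs = Just (exp (sum (map log xs) / fromIntegral (length xs)))

main :: IO ()
main = do
  -- Illustrative ratios only: one benchmark twice as fast, one twice as slow.
  print (geometricMean [0.5, 2.0])
  -- Illustrative ratios only: two improvements and one regression.
  print (geometricMean [0.1, 0.33, 1.5])
```

Unlike an arithmetic mean, a 2x speedup and a 2x slowdown cancel out here, so a value below 1 means the benchmarks got faster on balance.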

chrisdone · 2021-08-24T06:08:40Z

Great report, thanks @Bodigrim. I am highly enthusiastic about this change. 👏

chrisdone

Did a cursory look; limited time right now.

src/Data/Text/Internal.hs

src/Data/Text/Array.hs

copyI semantics changed

renaming for 16 -> 8

use iter* functions from Text directly

tomjaguarpaw · 2021-08-27T13:00:37Z

I'm really pleased to see this! I have had a look at each of the commits in order to get an overall impression of the PR. I don't feel that I have the expertise in this area to give a review though.

Bodigrim · 2021-08-27T18:11:52Z

Performance report: regressions

(Bear in mind that I recently fixed more performance issues, so earlier reports are no longer fully relevant)

I was composing the report below piecewise, so numbers do not belong to the same commit (because this is a moving target and benchmarks take hours) or the same machine. However, I believe it reliably characterises key classes of performance issues. I'm happy to delve deeper and answer questions about specific instances or use cases, if desirable.

`Data.Text.readFile` vs. `T.decodeUtf8 . ByteString.readFile`

All.FileRead.Text,21440148375,32011948000,1.49308
All.FileRead.LazyText,19729004900,30573543125,1.54967
All.FileRead.TextByteString,23057155743,3958878015,0.171698
All.FileRead.LazyTextByteString,20230595575,4249293104,0.210043

These results correspond to

[ bench "Text" $ whnfIO $ T.length <$> T.readFile p
, bench "LazyText" $ whnfIO $ LT.length <$> LT.readFile p
, bench "TextByteString" $ whnfIO $
    (T.length . T.decodeUtf8) <$> SB.readFile p
, bench "LazyTextByteString" $ whnfIO $
    (LT.length . LT.decodeUtf8) <$> LB.readFile p
]

The first two benchmarks measure locale-dependent file reading of a Russian text. The nature of locale-dependent reading is that GHC.IO.Buffer first decodes an input file from the system locale to a UTF-32-encoded buffer, and then Data.Text.IO converts the UTF-32 buffer to Text. As discussed above, it's expected that converting UTF-32 to UTF-8 is slower (by up to 55%) than converting to UTF-16 (which mostly boils down to Word32 truncation). There is nothing to win here, as long as we are limited by GHC.IO.Buffer.

However, it was argued long ago that users should beware of TL.readFile and use T.decodeUtf8 . ByteString.readFile instead. And indeed, the latter two benchmarks show that T.decodeUtf8 . ByteString.readFile is 5x faster now, which in my opinion completely redeems the slowdown in T.readFile.
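A minimal sketch of the recommended pattern, using decodeUtf8' so that invalid input is reported as a value instead of an exception. `readFileUtf8` and the file name are hypothetical; the example writes the file itself so that it is self-contained.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- | Read a file as raw bytes and decode it as UTF-8, ignoring the locale.
readFileUtf8 :: FilePath -> IO (Either UnicodeException T.Text)
readFileUtf8 path = TE.decodeUtf8' <$> B.readFile path

main :: IO ()
main = do
  -- Hypothetical file name, created here only for the demonstration.
  let path = "example-utf8.txt"
      body = T.pack "привет, мир\n"
  B.writeFile path (TE.encodeUtf8 body)
  result <- readFileUtf8 path
  case result of
    Left err  -> putStrLn ("invalid UTF-8: " ++ show err)
    Right txt -> print (T.length txt, txt == body)
```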

All.ReadLines.Text,26991334300,38123337400,1.41243
All.Programs.Fold,154487472000,197604059800,1.27909

Same here: this is a locale-dependent file reading of a Russian text.

All.Programs.Throughput.Text,36687180825,60851039100,1.65865
All.Programs.Throughput.LazyText,32469217750,55323163500,1.70386
All.Programs.Throughput.TextByteString,29604138418,4192649650,0.141624
All.Programs.Throughput.LazyTextByteString,27621090150,4613553668,0.16703

And here yet again: while locale-dependent reading regresses, decoding known UTF-8 inputs is 6-7x faster.

Tiny benchmarks

All.Pure.tiny.drop.Text,15132,25045,1.6551
All.Pure.tiny.take.Text,16078,23947,1.48943
All.Pure.tiny.length.cons.Text,12695,23493,1.85057
All.Pure.tiny.length.drop.Text,19860,36794,1.85267
All.Pure.tiny.length.drop.LazyText,28536,40957,1.43527
All.Pure.tiny.length.init.Text,16658,24041,1.44321
All.Pure.tiny.length.init.LazyText,23645,34648,1.46534
All.Pure.tiny.length.map.Text,17709,24007,1.35564
All.Pure.tiny.length.replicate char.Text,18784,29627,1.57725
All.Pure.tiny.length.replicate string.Text,18014,31220,1.7331
All.Pure.tiny.length.take.Text,16378,31957,1.95122
All.Pure.tiny.length.take.LazyText,27934,40618,1.45407
All.Pure.tiny.length.tail.Text,15305,25656,1.67631
All.Pure.tiny.length.tail.LazyText,18594,28876,1.55297

tiny benchmarks are really tiny: the text is only 5 characters long, and all operations are in nano-range, so unlikely to be a bottleneck, as long as nothing is seriously slower. The explanation is that drop / take / length now use 512-bit vectorised implementations, which need certain runway to accelerate.

Plenty of other tiny benchmarks are faster, so the geometric mean of this group is 0.82, still well below one.

Filtering

All.Pure.ascii-small.filter.filter.Text,71800920,160835488,2.24002
All.Pure.ascii-small.length.filter.Text,68873323,138127519,2.00553
All.Pure.ascii-small.length.filter.filter.Text,68676682,135933025,1.97932
All.Pure.ascii.filter.filter.Text,62801740200,139964924300,2.22868
All.Pure.ascii.length.filter.Text,60781141300,121388757200,1.99715
All.Pure.ascii.length.filter.filter.Text,60407876225,120215499200,1.99006
All.Pure.english.filter.filter.Text,4339939587,9306667587,2.14442
All.Pure.english.length.filter.Text,4016696125,7925535487,1.97315
All.Pure.english.length.filter.filter.Text,3977912193,8051621793,2.02408
All.Pure.russian.filter.filter.Text,7496362,28343670,3.78099
All.Pure.russian.length.filter.Text,7225880,27338591,3.78343
All.Pure.russian.length.filter.filter.Text,7363396,25771804,3.49999
All.Pure.japanese.filter.filter.Text,5054263,19458210,3.84986
All.Pure.japanese.length.filter.Text,4508120,15550037,3.44934
All.Pure.japanese.length.filter.filter.Text,4561359,14601019,3.20102

The thing is that benchmarks for filter are quite unrepresentative:

            , bgroup "filter"
                [ benchT   $ nf (T.length . T.filter p0) ta
                , benchTL  $ nf (TL.length . TL.filter p0) tla
                ]

    c  = 'й'
    p0 = (== c)

As one might expect, T.filter (== 'й') returns an empty Text for anything which is not Russian, and even for Russian the output is a tiny fraction of an input (one should rather use T.replicate and T.count). Essentially, these benchmarks measure parsing a buffer into a stream of Char (and discarding the stream outright). As discussed previously, parsing is expected to be faster for UTF-16.
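To illustrate what the benchmark computes, here is a small sketch comparing the filter-based count with T.count. The helper names and the sample strings are made up for this example.

```haskell
import qualified Data.Text as T

-- | What the benchmark measures: decode every character, keep the matches,
-- then count what is left.
countViaFilter :: Char -> T.Text -> Int
countViaFilter c = T.length . T.filter (== c)

-- | The same number obtained by searching for the character as a substring.
countViaCount :: Char -> T.Text -> Int
countViaCount c = T.count (T.singleton c)

main :: IO ()
main = do
  let english = T.pack "hello, world"
      russian = T.pack "мой край"
  -- Non-Russian input: the filtered Text is empty.
  print (countViaFilter 'й' english, countViaCount 'й' english)
  -- Russian input: only a small fraction of characters survive.
  print (countViaFilter 'й' russian, countViaCount 'й' russian)
```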

I would also like to mention that filtering Unicode can produce meaningful results only in a handful of scenarios. E. g., T.filter isAscii produces different results depending on Unicode normalization of observably indistinguishable inputs. It's unlikely that in a well-designed system T.filter becomes a bottleneck.
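A small sketch of the normalization issue: the same visible word, once with a precomposed é (U+00E9) and once with e followed by a combining acute accent (U+0301), gives different results under T.filter isAscii. The binding names are hypothetical.

```haskell
import Data.Char (isAscii)
import qualified Data.Text as T

-- "café" with a precomposed U+00E9 (NFC form).
precomposed :: T.Text
precomposed = T.pack "caf\x00E9"

-- "café" as 'e' followed by combining acute accent U+0301 (NFD form).
decomposed :: T.Text
decomposed = T.pack "cafe\x0301"

main :: IO ()
main = do
  -- The precomposed letter is dropped entirely.
  print (T.filter isAscii precomposed)
  -- Only the combining mark is dropped, the base letter stays.
  print (T.filter isAscii decomposed)
```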

Mapping

All.Pure.ascii-small.mapAccumL.Text,608466155,1565305392,2.57254
All.Pure.ascii-small.mapAccumL.LazyText,544866703,1232343353,2.26173
All.Pure.ascii-small.mapAccumR.Text,728927382,1694930725,2.32524
All.Pure.ascii-small.mapAccumR.LazyText,716806537,1355231652,1.89065
All.Pure.ascii.mapAccumL.Text,986479975600,2729346036400,2.76675
All.Pure.ascii.mapAccumL.LazyText,1043058781400,2070791592400,1.98531
All.Pure.ascii.mapAccumR.Text,1422074596600,2720855812100,1.9133
All.Pure.ascii.mapAccumR.LazyText,1490645122500,2148077009200,1.44104
All.Pure.english.mapAccumL.Text,60874261256,176335668000,2.89672
All.Pure.english.mapAccumL.LazyText,52894601675,137879584200,2.60669
All.Pure.english.mapAccumR.Text,71870856700,187798907350,2.613
All.Pure.english.mapAccumR.LazyText,80292101500,144580070800,1.80068
All.Pure.russian.mapAccumL.Text,59171156,93213606,1.57532
All.Pure.russian.mapAccumL.LazyText,39046606,91835323,2.35194
All.Pure.russian.mapAccumR.Text,67980959,94374382,1.38825
All.Pure.russian.mapAccumR.LazyText,66807119,94789778,1.41886
All.Pure.japanese.mapAccumL.LazyText,30840265,46513529,1.50821
All.Pure.japanese.mapAccumR.Text,49972835,55210934,1.10482

mapAccumL / mapAccumR certainly regress badly. Unfortunately, the nature of these functions requires us to parse and print character-by-character, which is slower for UTF-8 than for UTF-16. Originally I did not anticipate that worse CPU branch prediction could lead to such a drastic performance difference.

I believe that this is acceptable nevertheless, simply because I struggle to find any real-world use cases for mapAccum{L,R}. Again, it's incredibly difficult to imagine a usage of mapAccum{L,R} which carefully handles Unicode intricacies and subtleties, but is still bound by the performance of UTF-8 parsing.

All.Pure.tiny.map.Text,69715,46306,0.664219
All.Pure.tiny.map.LazyText,172889,57529,0.332751
All.Pure.tiny.map.map.Text,35199,51573,1.46518
All.Pure.tiny.map.map.LazyText,147299,61317,0.416276
All.Pure.ascii-small.map.Text,570986035,313665625,0.54934
All.Pure.ascii-small.map.LazyText,1495289648,307545056,0.205676
All.Pure.ascii-small.map.map.Text,138758526,393272460,2.83422
All.Pure.ascii-small.map.map.LazyText,1399017382,355304809,0.253967
All.Pure.ascii.map.Text,581282600000,265156600000,0.456158
All.Pure.ascii.map.LazyText,1431011100000,306147300000,0.213938
All.Pure.ascii.map.map.Text,117102675000,338362775000,2.88945
All.Pure.ascii.map.map.LazyText,1347195900000,347654612500,0.258058
All.Pure.english.map.Text,40569093750,17671465625,0.435589
All.Pure.english.map.LazyText,92442137500,17780534375,0.192342
All.Pure.english.map.map.Text,8030321875,22338706250,2.78179
All.Pure.english.map.map.LazyText,84480615625,20408321875,0.241574
All.Pure.russian.map.Text,54396923,46203198,0.849372
All.Pure.russian.map.LazyText,139894458,45981140,0.328685
All.Pure.russian.map.map.Text,12951262,54321185,4.19428
All.Pure.russian.map.map.LazyText,125892663,54785674,0.435178
All.Pure.japanese.map.Text,37302227,32993960,0.884504
All.Pure.japanese.map.LazyText,95479821,32010337,0.335258
All.Pure.japanese.map.map.Text,9422100,34822103,3.69579
All.Pure.japanese.map.map.LazyText,90500512,35118249,0.388045

Similar to above, it's not surprising for map succ to regress. It's actually more surprising that results are not the same across the board and 3/4 of benchmarks became faster! Geometric mean is 0.62 over this group. Again, I do not expect a real-life application of map f, which faithfully processes Unicode, to be bottlenecked on UTF-8 parsing, because f must be quite involved.

If you are worried about performance of case conversions, worry no longer. They unanimously vote in favor of UTF-8 (with geometric mean 0.42):

All.Pure.tiny.toLower.Text,414602,186216,0.449144
All.Pure.tiny.toLower.LazyText,437520,232168,0.530645
All.Pure.tiny.toUpper.Text,424902,197850,0.465637
All.Pure.tiny.toUpper.LazyText,446135,251065,0.562756
All.Pure.ascii-small.toLower.Text,5427945312,2014412500,0.371119
All.Pure.ascii-small.toLower.LazyText,5603268750,2170191210,0.387308
All.Pure.ascii-small.toUpper.Text,5399546875,2515530078,0.465878
All.Pure.ascii-small.toUpper.LazyText,5807147265,2683082812,0.462031
All.Pure.ascii.toLower.Text,4645536200000,1836974600000,0.395428
All.Pure.ascii.toLower.LazyText,4860871800000,1819362600000,0.374287
All.Pure.ascii.toUpper.Text,4778418000000,2250802600000,0.471035
All.Pure.ascii.toUpper.LazyText,4997500800000,2288388800000,0.457907
All.Pure.english.toLower.Text,316019437500,126184218750,0.399293
All.Pure.english.toLower.LazyText,324979400000,126374043750,0.388868
All.Pure.english.toUpper.Text,317852300000,150322700000,0.472933
All.Pure.english.toUpper.LazyText,333748737500,156130225000,0.467808
All.Pure.russian.toLower.Text,497689624,195327001,0.392467
All.Pure.russian.toLower.LazyText,525514331,209794140,0.399217
All.Pure.russian.toUpper.Text,497672460,232213159,0.466598
All.Pure.russian.toUpper.LazyText,532040087,249128051,0.468251
All.Pure.japanese.toLower.Text,347989550,120762292,0.347029
All.Pure.japanese.toLower.LazyText,368167578,133719238,0.363202
All.Pure.japanese.toUpper.Text,353876123,121747778,0.344041
All.Pure.japanese.toUpper.LazyText,375892382,136650164,0.363535

Appending builders

All.Pure.tiny.Builder.mappend char,22403727,44948850,2.00631
All.Pure.tiny.Builder.mappend 8 char,79327,105339,1.32791
All.Pure.tiny.Builder.mappend text,501421785,478581864,0.95445
All.Pure.ascii-small.Builder.mappend char,23652920,45294095,1.91495
All.Pure.ascii-small.Builder.mappend 8 char,90781,110436,1.21651
All.Pure.ascii-small.Builder.mappend text,570745780,498310443,0.873087
All.Pure.ascii.Builder.mappend char,24550711,46687537,1.90168
All.Pure.ascii.Builder.mappend 8 char,102611,113728,1.10834
All.Pure.ascii.Builder.mappend text,593820546,538251198,0.906421
All.Pure.english.Builder.mappend char,24330778,46123828,1.8957
All.Pure.english.Builder.mappend 8 char,106896,107841,1.00884
All.Pure.english.Builder.mappend text,585586463,521747294,0.890983
All.Pure.russian.Builder.mappend char,23933431,45348496,1.89478
All.Pure.russian.Builder.mappend 8 char,108024,104745,0.969646
All.Pure.russian.Builder.mappend text,601874503,511843154,0.850415
All.Pure.japanese.Builder.mappend char,24568698,45829181,1.86535
All.Pure.japanese.Builder.mappend 8 char,103933,111322,1.07109
All.Pure.japanese.Builder.mappend text,580084975,522780439,0.901214

mappend char is a benchmark that glues together 10000 T.singleton. Since UTF-8 encoding is more involved, this takes more time than for UTF-16. However, results for mappend text (which glues together 10000 short texts) demonstrate that this slowdown is limited to rather artificial scenarios.
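
To make the two shapes concrete, here is a rough sketch of what such workloads could look like. It is an assumption about the general pattern, not the actual benchmark source: `appendChars` and `appendTexts` are hypothetical names, and the real benchmarks may build and force the result differently.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Glue together n single-character builders. Every step has to
-- encode one Char into the output buffer.
appendChars :: Int -> Char -> TL.Text
appendChars n c = B.toLazyText (mconcat (replicate n (B.singleton c)))

-- Glue together n copies of an already encoded strict Text.
-- Every step copies existing bytes and never re-encodes a Char.
appendTexts :: Int -> T.Text -> TL.Text
appendTexts n t = B.toLazyText (mconcat (replicate n (B.fromText t)))
```

With n = 10000, the first function pays the per-character encoding cost 10000 times, while the second only copies bytes, which is consistent with the slowdown showing up in mappend char but not in mappend text.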

Consing and unconsing of lazy Text

All.Pure.ascii.cons.LazyText,15004236,16599998,1.10635
All.Pure.ascii.tail.LazyText,14748675,16838993,1.14173
All.Pure.english.tail.LazyText,354389,461140,1.30123
All.Pure.russian.tail.LazyText,21311,19200,0.900943
All.Pure.japanese.tail.LazyText,21753,20404,0.937986

All.Pure.ascii.uncons.LazyText,14884936,16790311,1.12801
All.Pure.ascii-small.uncons.LazyText,31627,33086,1.04613
All.Pure.english.uncons.LazyText,389700,389421,0.999284
All.Pure.russian.uncons.LazyText,30911,28948,0.936495
All.Pure.japanese.uncons.LazyText,33544,31087,0.926753

Benchmarks for lazy cons / uncons / tail are pretty meaningless: while the operations themselves complete in nanoseconds, most of the time is spent forcing a chain of chunks and on memory churn. Their strict counterparts do not show any regressions.

Japanese

All.Pure.japanese.append.Text,1557057,2219109,1.42519
All.Pure.japanese.words.LazyText,83176171,94262719,1.13329
All.Pure.japanese.zipWith.Text,82934658,113370141,1.36698

Japanese benchmarks are especially challenging for UTF-8, because Japanese script takes 50% more space in UTF-8 than in UTF-16 (3 bytes per character instead of 2) and its parsing is thrice as difficult. So it's not surprising to have regressions, but it is surprising to have only a few, all below 50%. In fact, the geometric mean over the Japanese group is 0.26.
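
The 50% figure is easy to check by encoding a short Japanese string both ways and comparing byte counts. This is a small sketch using the public encoding functions of text and bytestring; `utf8Bytes`, `utf16Bytes` and `sample` are names chosen for the illustration, and the sample string is arbitrary rather than taken from the benchmark corpus.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Size in bytes of a text when serialised as UTF-8 and as UTF-16.
-- encodeUtf16LE does not prepend a byte order mark.
utf8Bytes, utf16Bytes :: T.Text -> Int
utf8Bytes  = BS.length . TE.encodeUtf8
utf16Bytes = BS.length . TE.encodeUtf16LE

-- Arbitrary three-character sample (hypothetical, for illustration only).
sample :: T.Text
sample = T.pack "日本語"
```

For this sample `utf8Bytes sample` is 9 and `utf16Bytes sample` is 6, which is the 3:2 ratio mentioned above. ASCII text goes the other way: one byte per character in UTF-8 versus two in UTF-16.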

Summary

It was known from the outset that certain benchmarks were likely to regress. The regressions are mostly limited to malformed benchmarks or to scenarios unlikely to arise in a real-world application, and they are outweighed by gains in other areas.

emilypi · 2021-08-27T19:46:20Z

Just running to confirm the benchmarks on my machine, I get the following with GHC 8.10.4:

On master:

All
  Pure
    tiny
      encode
        Text:     OK (0.16s)
          33.1 ns ± 2.9 ns
        LazyText: OK (0.14s)
          60.3 ns ± 5.9 ns
    ascii-small
      encode
        Text:     OK (0.24s)
          13.0 μs ± 847 ns
        LazyText: OK (0.66s)
          76.4 μs ± 5.0 μs
    ascii
      encode
        Text:     OK (1.43s)
          14.1 ms ± 1.3 ms
        LazyText: OK (1.48s)
          87.5 ms ± 2.7 ms
    english
      encode
        Text:     OK (0.48s)
          761  μs ±  67 μs
        LazyText: OK (0.56s)
          3.92 ms ± 350 μs
    russian
      encode
        Text:     OK (0.56s)
          8.15 μs ± 442 ns
        LazyText: OK (0.50s)
          15.0 μs ± 370 ns
    japanese
      encode
        Text:     OK (0.39s)
          11.6 μs ± 1.1 μs
        LazyText: OK (0.39s)
          10.9 μs ± 832 ns

On utf8:

All
  Pure
    tiny
      encode
        Text:     OK (0.23s)
          13.0 ns ± 686 ps, 60% faster than baseline
        LazyText: OK (0.43s)
          24.5 ns ± 1.9 ns, 59% faster than baseline
    ascii-small
      encode
        Text:     OK (0.69s)
          4.86 μs ± 187 ns, 62% faster than baseline
        LazyText: OK (0.77s)
          5.27 μs ± 177 ns, 93% faster than baseline
    ascii
      encode
        Text:     OK (1.82s)
          4.12 ms ±  97 μs, 70% faster than baseline
        LazyText: OK (0.72s)
          36.3 ms ± 1.1 ms, 58% faster than baseline
    english
      encode
        Text:     OK (0.43s)
          306  μs ±  31 μs, 59% faster than baseline
        LazyText: OK (0.80s)
          342  μs ±  24 μs, 91% faster than baseline
    russian
      encode
        Text:     OK (0.35s)
          584  ns ±  57 ns, 92% faster than baseline
        LazyText: OK (0.37s)
          612  ns ±  35 ns, 95% faster than baseline
    japanese
      encode
        Text:     OK (0.74s)
          634  ns ±  48 ns, 94% faster than baseline
        LazyText: OK (0.75s)
          650  ns ±  33 ns, 94% faster than baseline

Some of the benchmarks are in roughly the same ballpark, but others are drastically faster. Notably, some of the lazy Text benchmarks are much faster on my machine. This is great! I'll have a substantive review shortly.

src/Data/Text/Encoding.hs

src/Data/Text/Internal/Encoding/Utf8.hs

L-as · 2021-08-27T20:37:08Z

Perhaps I've missed this, but what is the performance difference on non-x86-64 platforms?

Bodigrim · 2021-08-27T20:54:36Z

@L-as the underlying simdutf library offers vectorised UTF-8 validation for arm64 and ppc64 architectures as well.

Bodigrim · 2021-08-27T20:57:25Z

Speaking of not-so-synthetic benchmarks, here is prettyprinter, measured against text-1.2.5.0:

All
  80 characters, 50% ribbon
    prettyprinter
      layoutPretty:  OK (2.79s)
        180  ms ±  14 ms, 22% faster than baseline
      layoutSmart:   OK (2.76s)
        180  ms ± 9.9 ms, 22% faster than baseline
      layoutCompact: OK (2.50s)
        162  ms ± 7.6 ms
  Infinite/large page width
    prettyprinter
      layoutPretty:  OK (5.82s)
        184  ms ±  11 ms, 20% faster than baseline
      layoutSmart:   OK (1.34s)
        181  ms ±  14 ms, 24% faster than baseline
      layoutCompact: OK (1.22s)
        166  ms ±  16 ms

All
  Many small words
    Unoptimized:             OK (2.85s)
      2.72 μs ± 141 ns, 50% faster than baseline
    Shallowly fused:         OK (2.23s)
      533  ns ±  42 ns, 29% faster than baseline
    Deeply fused:            OK (2.23s)
      532  ns ±  26 ns, 29% faster than baseline
  vs. other libs
    renderPretty
      this, unoptimized:     OK (3.01s)
        5.68 μs ± 273 ns, 32% faster than baseline
      this, shallowly fused: OK (1.31s)
        5.02 μs ± 410 ns, 34% faster than baseline
      this, deeply fused:    OK (2.64s)
        5.05 μs ± 238 ns, 35% faster than baseline
    renderSmart
      this, unoptimized:     OK (1.71s)
        6.52 μs ± 592 ns, 31% faster than baseline
      this, shallowly fused: OK (5.70s)
        5.45 μs ± 215 ns, 35% faster than baseline
      this, deeply fused:    OK (2.90s)
        5.63 μs ± 333 ns, 32% faster than baseline
    renderCompact
      this, unoptimized:     OK (1.21s)
        4.65 μs ± 448 ns, 42% faster than baseline
      this, shallowly fused: OK (2.18s)
        4.22 μs ± 341 ns, 43% faster than baseline
      this, deeply fused:    OK (9.00s)
        4.29 μs ± 190 ns, 43% faster than baseline

src/Data/Text/Internal/Fusion/CaseMapping.hs

src/Data/Text.hs

Bodigrim · 2021-09-06T22:08:23Z

Is there some documentation somewhere that outlines the plan for getting this merged and transitioning the community?

text is a boot package, bundled with GHC. In the best case scenario text-2.0 is to be shipped with GHC 9.4, around summer 2022. So there is plenty of time for transition. The outline is as follows:

Merge utf8 branch into text HEAD.
Relax upper bounds for text in parsec and Cabal HEAD.
Bump text submodule (and parsec / Cabal as well) in GHC source tree.
Work through head.hackage to provide migration patches.
Do a proper text-2.0 release on Hackage.

Bodigrim · 2021-09-06T22:19:42Z

I addressed all the feedback above and pushed changes. Unless there are critical bugs, I'd appreciate it if we could wrap up and merge the branch by the end of the week, so that further work is unblocked.

This is a final call for reviews and approvals.

ghost

Just some questions about the doc updates.

changelog.md

src/Data/Text/Array.hs

src/Data/Text/Internal/Encoding/Utf8.hs

ketzacoatl · 2021-09-07T11:43:45Z

text is a boot package, bundled with GHC. In the best case scenario text-2.0 is to be shipped with GHC 9.4, around summer 2022. So there is plenty of time for transition.

@Bodigrim, is there a way to use the new text package in projects for testing, ahead of when it's included in GHC? or does this question not make sense?

tomjaguarpaw · 2021-09-07T11:51:53Z

@ketzacoatl Are you looking for something like this:

https://cabal.readthedocs.io/en/3.4/cabal-project.html#specifying-packages-from-remote-version-control-locations
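
For illustration, the mechanism described at that link is a `source-repository-package` stanza in `cabal.project`. The fragment below is a sketch only: the repository location is the upstream text repository, and the tag is a placeholder that has to be replaced with the full hash of whichever commit you want to test.

```cabal
-- cabal.project (sketch)
packages: .

source-repository-package
    type: git
    location: https://github.com/haskell/text
    -- Placeholder: substitute the full commit hash you want to build against.
    tag: <commit-hash>
```

Dependencies with upper bounds on text may additionally need their bounds relaxed before the build plan resolves.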

Boarders

Great work @Bodigrim - I read through what I could and it looks good to me.

ketzacoatl · 2021-09-08T03:11:29Z

Are you looking for something like this:

@Bodigrim, more like your confirmation that you'd expect that type of reference to work. And I'll take your answer as a yes. Thanks!

emilypi

Approved. Amazing job @Bodigrim 🎉

Bodigrim · 2021-09-08T18:29:34Z

I think after 150 comments and 200 likes we are in a good position to merge. Thanks everyone for active participation, feel free to provide more feedback here or in separate issues.

ghost · 2021-09-09T02:00:25Z

Congratulations @Bodigrim, fantastic work, well done for getting it across the line.

Bodigrim · 2022-06-17T18:11:16Z

This is old news, but the PR has been released as part of text-2.0. My ZuriHac talk, covering the ten-year-long story of the UTF-8 transition, is available at https://www.youtube.com/watch?v=1qlGe2qnGZQ.

thomasjm mentioned this pull request Aug 22, 2021

Surprising behavior of ByteString literals via IsString haskell/bytestring#140

Open

jberryman added a commit to jberryman/deferred-folds that referenced this pull request Aug 23, 2021

Compatibility with haskell/text#365

62b7bf2

jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 23, 2021

Compatibility with haskell/text#365 (needs a close review)

806bb14

chrisdone reviewed Aug 24, 2021

src/Data/Text/Internal.hs (outdated)

src/Data/Text/Internal.hs (outdated)

jberryman reviewed Aug 24, 2021

src/Data/Text/Internal.hs

This was referenced Aug 24, 2021

Support text-1.3 (with UTF8 internal encoding) haskell-unordered-containers/hashable#214

Closed

N/A haskell-unordered-containers/hashable#215

Closed

jberryman added a commit to jberryman/hashable that referenced this pull request Aug 24, 2021

Compatibility with haskell/text#365

dcc38c7

jberryman reviewed Aug 26, 2021

src/Data/Text/Array.hs (outdated)

jberryman added a commit to jberryman/attoparsec that referenced this pull request Aug 26, 2021

Compatibility with haskell/text#365 (needs a close review)

eeba289

copyI semantics changed; renaming for 16 -> 8; use iter* functions from Text directly

emilypi reviewed Aug 27, 2021

src/Data/Text/Encoding.hs (outdated)

src/Data/Text/Encoding.hs (outdated)

src/Data/Text/Encoding.hs (outdated)

src/Data/Text/Internal/Encoding/Utf8.hs (outdated)

chessai reviewed Aug 27, 2021

src/Data/Text/Internal/Fusion/CaseMapping.hs

chessai reviewed Aug 27, 2021

src/Data/Text.hs

Bodigrim added 8 commits September 6, 2021 20:38

Redesign concat

32c76d1

Use simdutf for UTF8 validation

012612a

Use GHC 8.10.5 for Windows build, because of issues with TH and simdutf

bcc4dc6

Avoid reconstructing chars in commonPrefixes

20b901d

Switch internal representation to UTF-8

87755a0

Rename constructors in Data.Array to highlight compatibility issues in downstream packages

d3beb94

Make utf8Length branchless

fd8cf06

Bump version and update changelog

fd49707

Bodigrim dismissed emilypi’s stale review via fd49707 September 6, 2021 22:11

Bodigrim force-pushed the utf8 branch from 628c2b5 to fd49707 Compare September 6, 2021 22:11

ghost reviewed Sep 7, 2021

changelog.md

src/Data/Text/Array.hs (outdated)

ghost reviewed Sep 7, 2021

src/Data/Text/Internal/Encoding/Utf8.hs

Tweak documentation

4e066ac

Boarders approved these changes Sep 7, 2021

emilypi approved these changes Sep 8, 2021

Bodigrim merged commit 3488190 into haskell:master Sep 8, 2021

Bodigrim deleted the utf8 branch September 8, 2021 21:03

Xitian9 mentioned this pull request Apr 26, 2022

Performance issue w/ new Text instances haskell-hvr/regex-tdfa#9

Open

sullyj3 mentioned this pull request May 5, 2022

Outdated reference to utf16 Data.Text in docs aesiniath/unbeliever#119

Closed

Lysxia mentioned this pull request May 16, 2022

Use vector operations in text decoding #272

Closed

Bodigrim mentioned this pull request Jun 17, 2022

UTF-8 Encoded Text haskellfoundation/tech-proposals#1

Merged

jberryman mentioned this pull request May 1, 2023

Broken on GHC 9.4 / text 2.0 fpco/odbc#55

Closed
