You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It appears that when encountering invalid an UTF-8 sequence in decodeUtf8With, it calls decodeUtf8With2. In the inner loop there, it tries to decode the very first character, and then recursively calls itself on the rest.
This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.
I wrote the following reproducer (using GHC 9.4.4 with text-2.0.1 and bytestring-0.11.3.1).
moduleMainwhereimportControl.ExceptionimportData.Foldable (for_)
importSystem.CPUTime (getCPUTime)
importqualifiedData.Text.EncodingasEncimportqualifiedData.ByteStringasBSmain::IO()
main =doputStrLn"Running"
for_ [1..75] $\factor ->dolet len = factor *10000-- The string must not end with the invalid sequence, otherwise it'll get cut off eagerlylet invalid =BS.replicate len 32<>BS.replicate1216<>BS.singleton 32
evaluate invalid
before <- getCPUTime
let numIterations =10
for_ [1..numIterations] $\_ ->do
evaluate $Enc.decodeUtf8Lenient invalid
after <- getCPUTime
let nanos = (after - before) `div`1000`div` numIterations
putStrLn$show (BS.length invalid) <>"\t"<>show nanos
Plotting the length vs run time records it prints clearly shows a quadratic growth:
Raw data of my run
Length (Bytes)
Nanoseconds
10002
363155
20002
440389
30002
758905
40002
1278086
50002
2079332
60002
2975872
70002
4092472
80002
5322233
90002
6755837
100002
8353817
110002
10090751
120002
12221735
130002
14223879
140002
16317896
150002
19183989
160002
21774090
170002
24199864
180002
26894248
190002
30935749
200002
34761685
210002
37555264
220002
40420152
230002
45140731
240002
50637179
250002
54801430
260002
58834538
270002
63470996
280002
71390759
290002
73341154
300002
78781234
310002
87749863
320002
93538205
330002
96206465
340002
104967901
350002
112452576
360002
118074408
370002
124928813
380002
133200718
390002
139740339
400002
147652567
410002
157100436
420002
163282491
430002
171497972
440002
182591830
450002
191219840
460002
198834990
470002
211232776
480002
220616658
490002
228667562
500002
241215551
510002
249927112
520002
260673296
530002
272971252
540002
282285464
550002
292999742
560002
307085056
570002
319529945
580002
329887608
590002
342599272
600002
351595686
610002
367887033
620002
380861250
630002
393925018
640002
409024668
650002
423207592
660002
434496271
670002
450354392
680002
464690475
690002
476479403
700002
493291074
710002
506146007
720002
519209847
730002
538656736
740002
549421069
750002
568419934
This issue seems to be mainly caused by the fact that isValidBS (regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.
If instead it would return an Int indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used in isValidBS should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.
The text was updated successfully, but these errors were encountered:
It appears that when encountering invalid an UTF-8 sequence in
decodeUtf8With
, it callsdecodeUtf8With2
. In theinner
loop there, it tries to decode the very first character, and then recursively calls itself on the rest.This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.
I wrote the following reproducer (using GHC 9.4.4 with
text-2.0.1
andbytestring-0.11.3.1
).Plotting the length vs run time records it prints clearly shows a quadratic growth:
Raw data of my run
This issue seems to be mainly caused by the fact that
isValidBS
(regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.If instead it would return an
Int
indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used inisValidBS
should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.The text was updated successfully, but these errors were encountered: