Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

fatho · 2023-01-25T13:24:30Z

It appears that when encountering invalid an UTF-8 sequence in decodeUtf8With, it calls decodeUtf8With2. In the inner loop there, it tries to decode the very first character, and then recursively calls itself on the rest.

This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.

I wrote the following reproducer (using GHC 9.4.4 with text-2.0.1 and bytestring-0.11.3.1).

module Main where

import Control.Exception
import Data.Foldable (for_)
import System.CPUTime (getCPUTime)
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS

main :: IO ()
main = do
  putStrLn "Running"
  for_ [1..75] $ \factor -> do
    let len = factor * 10000
    -- The string must not end with the invalid sequence, otherwise it'll get cut off eagerly
    let invalid = BS.replicate len 32 <> BS.replicate 1 216 <> BS.singleton 32
    evaluate invalid
    before <- getCPUTime
    let numIterations = 10
    for_ [1..numIterations] $ \_ -> do
      evaluate $ Enc.decodeUtf8Lenient invalid
    after <- getCPUTime
    let nanos = (after - before) `div` 1000 `div` numIterations
    putStrLn $ show (BS.length invalid) <> "\t" <> show nanos

Plotting the length vs run time records it prints clearly shows a quadratic growth:

Raw data of my run

Length (Bytes)	Nanoseconds
10002	363155
20002	440389
30002	758905
40002	1278086
50002	2079332
60002	2975872
70002	4092472
80002	5322233
90002	6755837
100002	8353817
110002	10090751
120002	12221735
130002	14223879
140002	16317896
150002	19183989
160002	21774090
170002	24199864
180002	26894248
190002	30935749
200002	34761685
210002	37555264
220002	40420152
230002	45140731
240002	50637179
250002	54801430
260002	58834538
270002	63470996
280002	71390759
290002	73341154
300002	78781234
310002	87749863
320002	93538205
330002	96206465
340002	104967901
350002	112452576
360002	118074408
370002	124928813
380002	133200718
390002	139740339
400002	147652567
410002	157100436
420002	163282491
430002	171497972
440002	182591830
450002	191219840
460002	198834990
470002	211232776
480002	220616658
490002	228667562
500002	241215551
510002	249927112
520002	260673296
530002	272971252
540002	282285464
550002	292999742
560002	307085056
570002	319529945
580002	329887608
590002	342599272
600002	351595686
610002	367887033
620002	380861250
630002	393925018
640002	409024668
650002	423207592
660002	434496271
670002	450354392
680002	464690475
690002	476479403
700002	493291074
710002	506146007
720002	519209847
730002	538656736
740002	549421069
750002	568419934

This issue seems to be mainly caused by the fact that isValidBS (regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.

If instead it would return an Int indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used in isValidBS should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.

The text was updated successfully, but these errors were encountered:

Bodigrim · 2023-01-25T23:23:55Z

Excellent catch! For starters, so that we have a reproducible example, could you possibly contribute a benchmark with a long enough string?

Lysxia · 2023-01-26T14:07:20Z

That should be fixed soon in #448 which overhauls the decoding logic.

fatho mentioned this issue Jan 25, 2023

Invalid UTF-8 byte sequence is accepted as valid UTF-8 by text-2.0.1 haskell/bytestring#575

Closed

Lysxia added the bug label Feb 4, 2023

Lysxia linked a pull request Feb 4, 2023 that will close this issue

Allow caller more control in stream decoding #448

Merged

Lysxia mentioned this issue Feb 6, 2023

Allow caller more control in stream decoding #448

Merged

Lysxia closed this as completed in #448 Feb 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

fatho commented Jan 25, 2023 •

edited

Loading

Bodigrim commented Jan 25, 2023

Lysxia commented Jan 26, 2023 •

edited

Loading

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Comments

fatho commented Jan 25, 2023 • edited Loading

Bodigrim commented Jan 25, 2023

Lysxia commented Jan 26, 2023 • edited Loading

fatho commented Jan 25, 2023 •

edited

Loading

Lysxia commented Jan 26, 2023 •

edited

Loading