Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

Closed
fatho opened this issue Jan 25, 2023 · 2 comments · Fixed by #448
Closed

Decoding of bytestrings with invalid UTF-8 has quadratic run time #495

fatho opened this issue Jan 25, 2023 · 2 comments · Fixed by #448
Labels

Comments

@fatho
Copy link

fatho commented Jan 25, 2023

It appears that when encountering invalid an UTF-8 sequence in decodeUtf8With, it calls decodeUtf8With2. In the inner loop there, it tries to decode the very first character, and then recursively calls itself on the rest.

This is an issue when the invalid code sequence appears at the (almost very end) of the string: For each character popped off the beginning, we try to decode the entire remainder of the string.

I wrote the following reproducer (using GHC 9.4.4 with text-2.0.1 and bytestring-0.11.3.1).

module Main where

import Control.Exception
import Data.Foldable (for_)
import System.CPUTime (getCPUTime)
import qualified Data.Text.Encoding as Enc
import qualified Data.ByteString as BS

main :: IO ()
main = do
  putStrLn "Running"
  for_ [1..75] $ \factor -> do
    let len = factor * 10000
    -- The string must not end with the invalid sequence, otherwise it'll get cut off eagerly
    let invalid = BS.replicate len 32 <> BS.replicate 1 216 <> BS.singleton 32
    evaluate invalid
    before <- getCPUTime
    let numIterations = 10
    for_ [1..numIterations] $ \_ -> do
      evaluate $ Enc.decodeUtf8Lenient invalid
    after <- getCPUTime
    let nanos = (after - before) `div` 1000 `div` numIterations
    putStrLn $ show (BS.length invalid) <> "\t" <> show nanos

Plotting the length vs run time records it prints clearly shows a quadratic growth:

image

Raw data of my run
Length (Bytes) Nanoseconds
10002 363155
20002 440389
30002 758905
40002 1278086
50002 2079332
60002 2975872
70002 4092472
80002 5322233
90002 6755837
100002 8353817
110002 10090751
120002 12221735
130002 14223879
140002 16317896
150002 19183989
160002 21774090
170002 24199864
180002 26894248
190002 30935749
200002 34761685
210002 37555264
220002 40420152
230002 45140731
240002 50637179
250002 54801430
260002 58834538
270002 63470996
280002 71390759
290002 73341154
300002 78781234
310002 87749863
320002 93538205
330002 96206465
340002 104967901
350002 112452576
360002 118074408
370002 124928813
380002 133200718
390002 139740339
400002 147652567
410002 157100436
420002 163282491
430002 171497972
440002 182591830
450002 191219840
460002 198834990
470002 211232776
480002 220616658
490002 228667562
500002 241215551
510002 249927112
520002 260673296
530002 272971252
540002 282285464
550002 292999742
560002 307085056
570002 319529945
580002 329887608
590002 342599272
600002 351595686
610002 367887033
620002 380861250
630002 393925018
640002 409024668
650002 423207592
660002 434496271
670002 450354392
680002 464690475
690002 476479403
700002 493291074
710002 506146007
720002 519209847
730002 538656736
740002 549421069
750002 568419934

This issue seems to be mainly caused by the fact that isValidBS (regardless of its actual implementation as chosen by the preprocessor defines) only returns a boolean. That way, in the failure case, the error recovery code around it has no way of knowing where the faulty sequence starts.

If instead it would return an Int indicating the number of valid bytes found in the beginning, the error-recovery code would immediately know how many bytes can be used as-is for the decoded text, and where to start error recovery (like inserting replacement characters). It seems to me that the decoder used in isValidBS should already have this information (after all, it needs to know what part of the string it's currently processing), so maybe just exposing this would be an option.

@Bodigrim
Copy link
Contributor

Excellent catch! For starters, so that we have a reproducible example, could you possibly contribute a benchmark with a long enough string?

@Lysxia
Copy link
Contributor

Lysxia commented Jan 26, 2023

That should be fixed soon in #448 which overhauls the decoding logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants