AddByte allocAmt overflows for large input files #761

destroyhimmyrobots · 2018-10-10T04:29:14Z

Given a 2.8 gigabyte input file, the function AddByte (https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L949) will enter an infinite loop when called from prvTidyAddCharToLexer on both modern Linux & Darwin systems.

This is likely a result of the allocation strategy (multiplying by 2) and because the uint type used to define the allocAmt variable is an unsigned 32-bit integer on these systems. For example, the sys/types.h header on one system defines that type as unsigned int: https://github.com/apple/darwin-xnu/blob/master/bsd/sys/types.h#L92

The initial lexer state when the problem surfaces looks like this:

lexer->lexsize = 2147483646
lexer->lexlength = 2147483648
allocAmt = 0

Eventually, in my debugger it shows the value of allocAmt wrapping to 0 after reaching

allocAmt = 2147483648

at https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L955 when trying to increase the buffer by one more factor of two. The result overflows uint32 by one.

One solution may be to make the allocAmt a 64-bit integer type.

I searched for some alternative APIs in http://api.html-tidy.org/tidy/tidylib_api_5.6.0/group__IO.html, but it is not clear if these would solve this overflow issue.

The text was updated successfully, but these errors were encountered:

geoffmcl · 2018-10-16T19:37:36Z

@destroyhimmyrobots wow, playing with HTML files in the 2.8 gigabyte plus range, certainly points out a design limit of libtidy... thanks for this...

At present all text kept by libtidy is in an allocated for each document, lexer byte buffer, and thus nodes holding this text are maintained just by keeping the start and end offset into this buffer of the text... a uint, all presently 32-bit in most systems...

The present method of just doubling the last allocation value, only allows for some 20 reallocation, up to a maximum of 2,147,483,648 byte, 0x80000000, and as you point out the next double gives us ZERO, 0, leading to a loop forever lock bug...

Yes, one thought would be to use 64-bit offsets, but that would involve changing quite a lot of code... each time and place where the start and end offsets are stored and used... it is certainly possible... but first to explore other alternatives as well...

Changing that to an additive method would theoretically get us closer to the maximum of a 32-bit uint, 4,294,967,295, 0xFFFFFFFF... or at least 4,294,959,104, 0xFFFFE000, if an additive value of 8,192, 0x2000, was used, which nearly doubles the potential HTML file handled... but still has an ultimate limit...

Of course the loop forever bug must be detected, and if seen, reached, the only thing libtidy can do is call panic... which, unless taken over by the libtidy user app, will abort with a message... just like the default TidyRealloc would do if it gets a NULL...

A suggested, tested, patch -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..1962c4d 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -940,19 +940,31 @@ void TY_(FreeLexer)( TidyDocImpl* doc )
 /* Lexer uses bigger memory chunks than pprint as
 ** it must hold the entire input document. not just
 ** the last line or three.
+**
+** The buffer starts with an allocated 8192 bytes,
+** and is increased, as needed, by the same 8192 bytes,
+** each time, and calls 'TidyPanic' if the 32-bit uint
+** size overflows at about 4GB...
+**
+** 'TidyPanic' must/will never return.
 */
+static ctmbstr overflow = "\nPanic: 4GB lexer text buffer overrun! Aborting...\n";
 static void AddByte( Lexer *lexer, tmbchar ch )
 {
     if ( lexer->lexsize + 2 >= lexer->lexlength )
     {
         tmbstr buf = NULL;
-        uint allocAmt = lexer->lexlength;
+        uint prev, allocAmt = lexer->lexlength;
         while ( lexer->lexsize + 2 >= allocAmt )
         {
-            if ( allocAmt == 0 )
-                allocAmt = 8192;
-            else
-                allocAmt *= 2;
+            /* Issue #761 - change to additive method, replacing the 
+               doubling, and deal with buffer overflow at abt 4GB of text */
+            prev = allocAmt;
+            allocAmt += 8192;
+            if (allocAmt < prev) {
+                /* YEEK - size wrapped - need bigger buffer! */
+                TidyPanic(lexer->allocator, overflow);
+            }
         }
         buf = (tmbstr) TidyRealloc( lexer->allocator, lexer->lexbuf, allocAmt );
         if ( buf )

Effectively almost doubling the text input capability of libtidy!

Is this enough for now?

… limit.

geoffmcl · 2018-12-01T14:07:43Z

Although no comment from @destroyhimmyrobots, or others, I think this fix is a good thing ;=))

It does not remove the use of a uint, usually 32-bits in most systems, but it raises the maximum lexer stored text to just under 4GB, double what it has been, forever...

And it removes the loop lock, and dies gracefully, with message... well like any other out of memory situation...

Remember users of libtidy can provide their own TidyPanic callback, and stop libtidy without damage to the using app...

Anyway, have pushed an issue-761 branch for testing... appreciate building, testing, comments,...

Created PR #784, if all agreed...

Look forward to any feedback... thanks...

geoffmcl · 2019-05-26T19:13:11Z

Have closed PR #784, since there appears to be a problem, in testing with large files!!!

To just address the uint wrap, using doubling, then suggest -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..ce78a69 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -947,12 +947,15 @@ static void AddByte( Lexer *lexer, tmbchar ch )
     {
         tmbstr buf = NULL;
         uint allocAmt = lexer->lexlength;
+        uint prev = allocAmt; /* Is. #761 */
         while ( lexer->lexsize + 2 >= allocAmt )
         {
             if ( allocAmt == 0 )
                 allocAmt = 8192;
             else
                 allocAmt *= 2;
+            if (allocAmt < prev) /* Is. #761 - watch for wrap - and */
+                TidyPanic(lexer->allocator, "\nPanic: out of internal memory!\nDocument input too big!\n");
         }
         buf = (tmbstr) TidyRealloc( lexer->allocator, lexer->lexbuf, allocAmt );
         if ( buf )

Running this on my 2,348,438,784 in_761-3.html get the predictable tidy abort output -

using: 2019-05-20  21:03           769,024 tidy.exe
verions: HTML Tidy for Windows version 5.7.24.I761-1
cmd: tidy -q -o temp2.html in_761-3.html
Fatal error:
Panic: out of internal memory!
Document input too big!

Difference 114.88 seconds, or 00:01:54.88 ...```

Will get around to pushing another branch issue-761-1, testing the above... and PR... unless someone beats me to it... thanks...

geoffmcl · 2019-05-26T19:17:36Z

Have at least created the branch issue-761-1, so it can be pulled and tested... thanks...

geoffmcl · 2020-09-28T18:11:28Z

Have created PR #830, and tested it, and it appears ok...

Any last minute feedback, comments, etc, before I merge this... thanks...

Is. #761 - just deal with the 'uint' wrap

geoffmcl added the Bug label Oct 27, 2018

geoffmcl added a commit that referenced this issue Dec 1, 2018

Is. #761 - Double capacity, and remove loop lock - still a 4GB 'uint'…

c07c43e

… limit.

geoffmcl mentioned this issue Dec 1, 2018

Is. #761 - Double capacity, and remove loop lock - still a 4GB 'uint' #784

Closed

geoffmcl added a commit that referenced this issue May 26, 2019

Is. #761 - just deal with the 'uint' wrap

7bc23f0

geoffmcl added a commit that referenced this issue Oct 2, 2020

Merge pull request #830 from htacg/issue-761-1

3dafcf1

Is. #761 - just deal with the 'uint' wrap

geoffmcl closed this as completed in f7e9de6 Oct 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AddByte allocAmt overflows for large input files #761

AddByte allocAmt overflows for large input files #761

destroyhimmyrobots commented Oct 10, 2018 •

edited

geoffmcl commented Oct 16, 2018

geoffmcl commented Dec 1, 2018

geoffmcl commented May 26, 2019

geoffmcl commented May 26, 2019

geoffmcl commented Sep 28, 2020

AddByte allocAmt overflows for large input files #761

AddByte allocAmt overflows for large input files #761

Comments

destroyhimmyrobots commented Oct 10, 2018 • edited

geoffmcl commented Oct 16, 2018

geoffmcl commented Dec 1, 2018

geoffmcl commented May 26, 2019

geoffmcl commented May 26, 2019

geoffmcl commented Sep 28, 2020

destroyhimmyrobots commented Oct 10, 2018 •

edited