Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AddByte allocAmt overflows for large input files #761

Closed
destroyhimmyrobots opened this issue Oct 10, 2018 · 5 comments
Closed

AddByte allocAmt overflows for large input files #761

destroyhimmyrobots opened this issue Oct 10, 2018 · 5 comments
Labels

Comments

@destroyhimmyrobots
Copy link

destroyhimmyrobots commented Oct 10, 2018

Given a 2.8 gigabyte input file, the function AddByte (https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L949) will enter an infinite loop when called from prvTidyAddCharToLexer on both modern Linux & Darwin systems.

This is likely a result of the allocation strategy (multiplying by 2) and because the uint type used to define the allocAmt variable is an unsigned 32-bit integer on these systems. For example, the sys/types.h header on one system defines that type as unsigned int: https://github.com/apple/darwin-xnu/blob/master/bsd/sys/types.h#L92

The initial lexer state when the problem surfaces looks like this:

lexer->lexsize = 2147483646
lexer->lexlength = 2147483648
allocAmt = 0

Eventually, in my debugger it shows the value of allocAmt wrapping to 0 after reaching

allocAmt = 2147483648

at https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L955 when trying to increase the buffer by one more factor of two. The result overflows uint32 by one.

One solution may be to make the allocAmt a 64-bit integer type.

I searched for some alternative APIs in http://api.html-tidy.org/tidy/tidylib_api_5.6.0/group__IO.html, but it is not clear if these would solve this overflow issue.

@geoffmcl
Copy link
Contributor

@destroyhimmyrobots wow, playing with HTML files in the 2.8 gigabyte plus range, certainly points out a design limit of libtidy... thanks for this...

At present all text kept by libtidy is in an allocated for each document, lexer byte buffer, and thus nodes holding this text are maintained just by keeping the start and end offset into this buffer of the text... a uint, all presently 32-bit in most systems...

The present method of just doubling the last allocation value, only allows for some 20 reallocation, up to a maximum of 2,147,483,648 byte, 0x80000000, and as you point out the next double gives us ZERO, 0, leading to a loop forever lock bug...

Yes, one thought would be to use 64-bit offsets, but that would involve changing quite a lot of code... each time and place where the start and end offsets are stored and used... it is certainly possible... but first to explore other alternatives as well...

Changing that to an additive method would theoretically get us closer to the maximum of a 32-bit uint, 4,294,967,295, 0xFFFFFFFF... or at least 4,294,959,104, 0xFFFFE000, if an additive value of 8,192, 0x2000, was used, which nearly doubles the potential HTML file handled... but still has an ultimate limit...

Of course the loop forever bug must be detected, and if seen, reached, the only thing libtidy can do is call panic... which, unless taken over by the libtidy user app, will abort with a message... just like the default TidyRealloc would do if it gets a NULL...

A suggested, tested, patch -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..1962c4d 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -940,19 +940,31 @@ void TY_(FreeLexer)( TidyDocImpl* doc )
 /* Lexer uses bigger memory chunks than pprint as
 ** it must hold the entire input document. not just
 ** the last line or three.
+**
+** The buffer starts with an allocated 8192 bytes,
+** and is increased, as needed, by the same 8192 bytes,
+** each time, and calls 'TidyPanic' if the 32-bit uint
+** size overflows at about 4GB...
+**
+** 'TidyPanic' must/will never return.
 */
+static ctmbstr overflow = "\nPanic: 4GB lexer text buffer overrun! Aborting...\n";
 static void AddByte( Lexer *lexer, tmbchar ch )
 {
     if ( lexer->lexsize + 2 >= lexer->lexlength )
     {
         tmbstr buf = NULL;
-        uint allocAmt = lexer->lexlength;
+        uint prev, allocAmt = lexer->lexlength;
         while ( lexer->lexsize + 2 >= allocAmt )
         {
-            if ( allocAmt == 0 )
-                allocAmt = 8192;
-            else
-                allocAmt *= 2;
+            /* Issue #761 - change to additive method, replacing the 
+               doubling, and deal with buffer overflow at abt 4GB of text */
+            prev = allocAmt;
+            allocAmt += 8192;
+            if (allocAmt < prev) {
+                /* YEEK - size wrapped - need bigger buffer! */
+                TidyPanic(lexer->allocator, overflow);
+            }
         }
         buf = (tmbstr) TidyRealloc( lexer->allocator, lexer->lexbuf, allocAmt );
         if ( buf )

Effectively almost doubling the text input capability of libtidy!

Is this enough for now?

@geoffmcl
Copy link
Contributor

geoffmcl commented Dec 1, 2018

Although no comment from @destroyhimmyrobots, or others, I think this fix is a good thing ;=))

It does not remove the use of a uint, usually 32-bits in most systems, but it raises the maximum lexer stored text to just under 4GB, double what it has been, forever...

And it removes the loop lock, and dies gracefully, with message... well like any other out of memory situation...

Remember users of libtidy can provide their own TidyPanic callback, and stop libtidy without damage to the using app...

Anyway, have pushed an issue-761 branch for testing... appreciate building, testing, comments,...

Created PR #784, if all agreed...

Look forward to any feedback... thanks...

@geoffmcl
Copy link
Contributor

Have closed PR #784, since there appears to be a problem, in testing with large files!!!

To just address the uint wrap, using doubling, then suggest -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..ce78a69 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -947,12 +947,15 @@ static void AddByte( Lexer *lexer, tmbchar ch )
     {
         tmbstr buf = NULL;
         uint allocAmt = lexer->lexlength;
+        uint prev = allocAmt; /* Is. #761 */
         while ( lexer->lexsize + 2 >= allocAmt )
         {
             if ( allocAmt == 0 )
                 allocAmt = 8192;
             else
                 allocAmt *= 2;
+            if (allocAmt < prev) /* Is. #761 - watch for wrap - and */
+                TidyPanic(lexer->allocator, "\nPanic: out of internal memory!\nDocument input too big!\n");
         }
         buf = (tmbstr) TidyRealloc( lexer->allocator, lexer->lexbuf, allocAmt );
         if ( buf )

Running this on my 2,348,438,784 in_761-3.html get the predictable tidy abort output -

using: 2019-05-20  21:03           769,024 tidy.exe
verions: HTML Tidy for Windows version 5.7.24.I761-1
cmd: tidy -q -o temp2.html in_761-3.html
Fatal error:
Panic: out of internal memory!
Document input too big!

Difference 114.88 seconds, or 00:01:54.88 ...```

Will get around to pushing another branch issue-761-1, testing the above... and PR... unless someone beats me to it... thanks...

geoffmcl added a commit that referenced this issue May 26, 2019
@geoffmcl
Copy link
Contributor

Have at least created the branch issue-761-1, so it can be pulled and tested... thanks...

@geoffmcl
Copy link
Contributor

Have created PR #830, and tested it, and it appears ok...

Any last minute feedback, comments, etc, before I merge this... thanks...

geoffmcl added a commit that referenced this issue Oct 2, 2020
Is. #761 - just deal with the 'uint' wrap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants