AddByte allocAmt overflows for large input files #761
Comments
@destroyhimmyrobots wow, playing with HTML files in the 2.8 gigabyte plus range certainly points out a

At present all

The present method of just doubling the last allocation value only allows for some 20 reallocations, up to a maximum of 2,147,483,648 bytes, 0x80000000, and, as you point out, the next doubling gives us ZERO, 0, leading to a loop-forever lock bug...

Yes, one thought would be to use 64-bit offsets, but that would involve changing quite a lot of code... each time and place where the

Changing that to an

Of course the loop-forever bug must be detected, and if seen, reached, the only thing libtidy can do is call

A suggested, tested, patch -

diff --git a/src/lexer.c b/src/lexer.c
index ca66aee..1962c4d 100644
--- a/src/lexer.c
+++ b/src/lexer.c
@@ -940,19 +940,31 @@ void TY_(FreeLexer)( TidyDocImpl* doc )
/* Lexer uses bigger memory chunks than pprint as
** it must hold the entire input document. not just
** the last line or three.
+**
+** The buffer starts with an allocated 8192 bytes,
+** and is increased, as needed, by the same 8192 bytes,
+** each time, and calls 'TidyPanic' if the 32-bit uint
+** size overflows at about 4GB...
+**
+** 'TidyPanic' must/will never return.
*/
+static ctmbstr overflow = "\nPanic: 4GB lexer text buffer overrun! Aborting...\n";
static void AddByte( Lexer *lexer, tmbchar ch )
{
if ( lexer->lexsize + 2 >= lexer->lexlength )
{
tmbstr buf = NULL;
- uint allocAmt = lexer->lexlength;
+ uint prev, allocAmt = lexer->lexlength;
while ( lexer->lexsize + 2 >= allocAmt )
{
- if ( allocAmt == 0 )
- allocAmt = 8192;
- else
- allocAmt *= 2;
+ /* Issue #761 - change to additive method, replacing the
+ doubling, and deal with buffer overflow at abt 4GB of text */
+ prev = allocAmt;
+ allocAmt += 8192;
+ if (allocAmt < prev) {
+ /* YEEK - size wrapped - need bigger buffer! */
+ TidyPanic(lexer->allocator, overflow);
+ }
}
buf = (tmbstr) TidyRealloc( lexer->allocator, lexer->lexbuf, allocAmt );
if ( buf )

Effectively almost doubling the text input capability of

Is this enough for now?
Although no comment from @destroyhimmyrobots, or others, I think this

It does not remove the use of a uint, usually 32 bits on most systems, but it raises the maximum lexer stored text to just under 4GB, double what it has been, forever... And it removes the loop lock, and dies gracefully, with a message... well, like any other out-of-memory situation...

Remember users of

Anyway, have pushed an

Created PR #784, if all agreed... Look forward to any feedback... thanks...
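For readers who want the wrap check from the patch in isolation, here is a minimal standalone sketch. It is not libtidy code: grow_additive and CHUNK are hypothetical names, and instead of calling TidyPanic it simply reports failure to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNK 8192u /* same step size as the patch; the macro name is made up here */

/* Hypothetical helper, not part of libtidy.
 * Grows 'current' in CHUNK-byte steps until it is strictly greater than
 * 'needed' (mirroring the patch's 'lexsize + 2 >= allocAmt' loop test).
 * Returns false if the 32-bit size wraps, true otherwise. */
static bool grow_additive(uint32_t current, uint32_t needed, uint32_t *out)
{
    uint32_t amt = current;
    while (needed >= amt) {
        uint32_t prev = amt;
        amt += CHUNK;       /* unsigned addition wraps modulo 2^32 */
        if (amt < prev)     /* a smaller result means the size wrapped */
            return false;
    }
    *out = amt;
    return true;
}
```

The `amt < prev` test works because unsigned addition wraps rather than trapping, so a wrapped sum is always smaller than the value it started from. Note that reaching 4GB in fixed 8192-byte steps takes 2^32 / 2^13 = 524,288 steps, compared with 19 doublings.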
Have closed PR #784, since there appears to be a problem in testing with large files!!! To just address the

Running this on my 2,348,438,784

Will get around to pushing another branch

Have at least created the branch

Have created PR #830, and tested it, and it appears ok... Any last minute feedback, comments, etc, before I merge this... thanks...
Is. #761 - just deal with the 'uint' wrap
Given a 2.8 gigabyte input file, the function AddByte (https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L949) will enter an infinite loop when called from prvTidyAddCharToLexer on both modern Linux & Darwin systems.

This is likely a result of the allocation strategy (multiplying by 2) and because the uint type used to define the allocAmt variable is an unsigned 32-bit integer on these systems. For example, the sys/types.h header on one system defines that type as unsigned int: https://github.com/apple/darwin-xnu/blob/master/bsd/sys/types.h#L92

The initial lexer state when the problem surfaces looks like this:
Eventually, in my debugger it shows the value of allocAmt wrapping to 0 after reaching
at https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L955 when trying to increase the buffer by one more factor of two. The result overflows uint32 by one.
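To make the arithmetic concrete, here is a small standalone C sketch that doubles a 32-bit size the way the old loop did and counts the steps until it wraps. It is not libtidy code: doublings_until_wrap is a hypothetical helper, and the 8192 starting value is taken from the diff in this thread.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libtidy.
 * Doubles a 32-bit size until it becomes zero and returns the number of
 * doublings that took. A start value of zero returns zero immediately. */
static unsigned doublings_until_wrap(uint32_t start)
{
    uint32_t amt = start;
    unsigned n = 0;
    while (amt != 0) {
        amt *= 2;   /* unsigned arithmetic wraps modulo 2^32 */
        n++;
    }
    return n;
}
```

Starting from 8192 (2^13), the 18th doubling gives 0x80000000 and the 19th gives 0. As I read the lines removed by the diff, the old loop then resets allocAmt to 8192 and starts doubling again, so it can never exit once lexsize + 2 is at or above 0x80000000.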
One solution may be to make the allocAmt a 64-bit integer type.

I searched for some alternative APIs in http://api.html-tidy.org/tidy/tidylib_api_5.6.0/group__IO.html, but it is not clear if these would solve this overflow issue.
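As a rough illustration of the 64-bit idea, the sketch below keeps the doubling strategy but computes the size in a uint64_t and refuses to double past the type's range. It is only an assumption of how such a change might look: grow_doubling is a hypothetical name, and it covers the size calculation alone, not the other places in libtidy that would still use 32-bit offsets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of libtidy.
 * Doubles 'current' (starting at 8192 when it is zero) until it is
 * strictly greater than 'needed'. Returns false instead of wrapping. */
static bool grow_doubling(uint64_t current, uint64_t needed, uint64_t *out)
{
    uint64_t amt = current ? current : 8192;
    while (needed >= amt) {
        if (amt > UINT64_MAX / 2)   /* doubling would overflow 64 bits */
            return false;
        amt *= 2;
    }
    *out = amt;
    return true;
}
```

With a 64-bit size, a request for roughly 3,000,000,000 bytes lands on 4,294,967,296 (2^32) rather than wrapping to 0. Whether the allocator and the rest of the lexer could actually use a buffer of that size is a separate question.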