Description
Given a 2.8 gigabyte input file, the function AddByte (https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L949) enters an infinite loop when called from prvTidyAddCharToLexer, on both modern Linux and Darwin systems.
This is likely a result of the allocation strategy (multiplying by 2) combined with the fact that the uint type used to declare the allocAmt variable is an unsigned 32-bit integer on these systems. For example, the sys/types.h header on Darwin defines that type as unsigned int: https://github.com/apple/darwin-xnu/blob/master/bsd/sys/types.h#L92
The initial lexer state when the problem surfaces looks like this:
lexer->lexsize = 2147483646
lexer->lexlength = 2147483648
allocAmt = 0
Stepping through in my debugger, allocAmt eventually reaches
allocAmt = 2147483648
at https://github.com/htacg/tidy-html5/blob/next/src/lexer.c#L955 and then wraps to 0 when the code tries to double the buffer size once more. The doubled value, 4294967296, is exactly one more than the largest value an unsigned 32-bit integer can hold (4294967295).
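The wraparound can be reproduced in isolation. The following is a minimal sketch, not the tidy source: double_u32 and grow_steps are hypothetical names, and the loop condition (keep doubling while the needed size does not fit) is my assumption about the shape of the growth loop.

```c
#include <stdint.h>

/* Hypothetical helper: double a 32-bit allocation size.
   Unsigned arithmetic wraps modulo 2^32, so 2147483648 * 2 becomes 0. */
static uint32_t double_u32(uint32_t amt)
{
    return amt * 2u;
}

/* Assumed growth loop: double `alloc` until it exceeds `needed`.
   Returns the number of doublings, or -1 if the size still does not
   fit after `max_steps` doublings (the real loop has no such guard). */
static int grow_steps(uint32_t alloc, uint32_t needed, int max_steps)
{
    int steps = 0;
    while (needed >= alloc) {
        if (steps == max_steps)
            return -1;            /* would spin forever without the guard */
        alloc = double_u32(alloc);
        steps++;
    }
    return steps;
}
```

With alloc = 2147483648 and needed = 2147483648, the first doubling wraps to 0, and doubling 0 never grows, so grow_steps only returns because of the max_steps guard.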
One solution may be to make allocAmt a 64-bit integer type.
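A rough sketch of what a widened, overflow-checked growth computation could look like follows. grow_checked is a hypothetical function, not part of tidy, and starting from 1 when the current size is 0 is an arbitrary choice made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical replacement for the growth computation.
   Doubles a 64-bit size until it exceeds `needed`; reports failure
   instead of wrapping if another doubling would overflow 64 bits. */
static bool grow_checked(uint64_t current, uint64_t needed, uint64_t *out)
{
    uint64_t amt = current ? current : 1;   /* arbitrary starting size */

    while (amt <= needed) {
        if (amt > UINT64_MAX / 2)
            return false;                   /* doubling would overflow */
        amt *= 2;
    }
    *out = amt;
    return true;
}
```

For the state above (current = 2147483648, needed = 2147483648) this yields 4294967296 rather than 0. Widening allocAmt alone would only help if the lexer's size fields and the allocator call also accept values above 32 bits, which I have not verified.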
I looked for alternative APIs in http://api.html-tidy.org/tidy/tidylib_api_5.6.0/group__IO.html, but it is not clear whether any of them would avoid this overflow.