ocamllex raises Invalid_argument String.sub/Bytes.sub and sometimes segfaults #12901
Here is a smaller testcase (not minimal, but it has no external dependencies and can be run with this single shell-script):
This raises an exception when running the native code, and segfaults when running bytecode. Running under valgrind even the native code version segfaults:
The line at which the segfault occurs looks like this:
Just to confirm, can you run the repro using
That doesn't raise any exceptions, and doesn't crash (although it takes ages to compile the generated .ml file which is quite big):
Smaller repro:
(replacing
Thanks for the minimization, and spotting the difference between
That and the location of the crash made me suspicious, so I added a bounds check (based off trunk commit b3b892e). If I did it right, then it first crashes on this out-of-bounds write, and there are no such crashes when using
That array is supposed to be a lot smaller (FWIW the size of the array is the same when run with
This is the assertion that I added:
Then I've seen
and in lexing.c:
So this does look like a 16-bit overflow:
I don't know what the solution would be though (switch to 32-bit entries in tables?), but if this is indeed the problem it'd be nice if
Currently this is the only limit that I can see, shouldn't all tables have a limit?
Maybe something like this as a last resort to catch any overflows?
diff --git a/lex/output.ml b/lex/output.ml
index d5ce76eb7f..75ae9bda7f 100644
--- a/lex/output.ml
+++ b/lex/output.ml
@@ -28,11 +28,13 @@ let output_byte oc b =
output_char oc (Char.chr(48 + (b / 10) mod 10));
output_char oc (Char.chr(48 + b mod 10))
+exception Table_overflow
let output_array oc v =
output_string oc " \"";
for i = 0 to Array.length v - 1 do
output_byte oc (v.(i) land 0xFF);
output_byte oc ((v.(i) asr 8) land 0xFF);
+ if v.(i) >= 0x8000 then raise Table_overflow;
if i land 7 = 7 then output_string oc "\\\n "
done;
output_string oc "\""
@@ -117,7 +119,6 @@ let output_entry some_mem_code ic oc has_refill oci e =
(* Main output function *)
-exception Table_overflow
let output_lexdef ic oc oci header rh tables entry_points trailer =
if not !Common.quiet_mode then
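The overflow discussed above can be reproduced in isolation. This is only a minimal sketch, assuming the generated tables are read back as signed 16-bit entries (which is what the lexing.c observation in this thread suggests). `encode16` mirrors the two `output_byte` calls in `output_array`; `decode16_signed` and `roundtrip` are hypothetical names introduced here for illustration and are not part of ocamllex or the runtime.

```ocaml
(* Split an int into (low byte, high byte), mirroring the two
   output_byte calls in output_array. *)
let encode16 v = (v land 0xFF, (v asr 8) land 0xFF)

(* Reassemble the two bytes and interpret them as a signed 16-bit
   value. ASSUMPTION: this models how the runtime reads a table entry. *)
let decode16_signed (lo, hi) =
  let u = lo lor (hi lsl 8) in
  if u >= 0x8000 then u - 0x10000 else u

(* What a table entry looks like after being written and read back. *)
let roundtrip v = decode16_signed (encode16 v)

let () =
  List.iter
    (fun v -> Printf.printf "%d -> %d\n" v (roundtrip v))
    [ 0; 32767; 32768; 40000; -1 ]
```

Under that assumption, entries up to 32767 and small negative entries such as -1 survive the round trip, while 32768 comes back as -32768 and 40000 as -25536. A negative value used as an index or offset would be consistent with the out-of-bounds write seen above.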
I've debugged on my side and came to the same conclusion (a few minutes later...).
Technically we could have a code table slightly bigger than 32768 (or 65536 with unsigned shorts), as we only need the start of the code sequences to be encoded properly, but it's probably better to be conservative here.
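For illustration, the conservative option could be expressed as a range check applied to every table before it is written out. This is a sketch under assumptions, not the fix that was actually merged: `first_out_of_range` and `check_tables` are hypothetical helpers, the table names passed to them are placeholders, and the bounds assume signed 16-bit entries.

```ocaml
(* Hypothetical helper: index of the first entry that does not fit in a
   signed 16-bit slot, or None if the whole table fits. *)
let first_out_of_range (v : int array) : int option =
  let rec go i =
    if i >= Array.length v then None
    else if v.(i) < -0x8000 || v.(i) > 0x7FFF then Some i
    else go (i + 1)
  in
  go 0

(* Hypothetical helper: check a list of named tables and report the
   first offending (table name, index) pair, if any. *)
let check_tables (tables : (string * int array) list) : (string * int) option =
  List.find_map
    (fun (name, v) -> Option.map (fun i -> (name, i)) (first_out_of_range v))
    tables
```

Unlike the `>= 0x8000` test in the diff above, this sketch also rejects values below -0x8000. Whether such values can occur in practice is not established in this thread.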
(Now that I know why it was crashing, I rewrote the regex to reduce the state size and avoid this error in realworldocaml/mdx#445)
I was trying to modify ocaml-mdx's lexer when suddenly I noticed some SEGV in its testsuite.
Unfortunately I don't have the SEGV reproducer anymore, but I was able to 100% reproduce the following error:
Which was introduced by this modification to the lexer:
If I remove the '=' from the first character set then it no longer raises that exception.
In other places where the testsuite runs it segfaults, which should definitely not happen.
E.g. using 'coredumpctl gdb' shows:
Happens on 5.1.1 too:
I've pushed the test code here: realworldocaml/mdx@main...edwintorok:mdx:segv. I haven't yet tried to create a minimal testcase; I'll update this bug report when I do.