Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ocamllex raises Invalid_argument String.sub/Bytes.sub and sometimes segfaults #12901

Closed
edwintorok opened this issue Jan 16, 2024 · 9 comments · Fixed by #12908
Closed

ocamllex raises Invalid_argument String.sub/Bytes.sub and sometimes segfaults #12901

edwintorok opened this issue Jan 16, 2024 · 9 comments · Fixed by #12908

Comments

@edwintorok
Copy link
Contributor

I was trying to modify ocaml-mdx's lexer when suddenly I noticed some SEGV in its testsuite.
Unfortunately I don't have the SEGV reproducer anymore, but I was able to 100% reproduce the following error:

Raised at Stdlib.invalid_arg in file "stdlib.ml", line 30, characters 20-45
Called from Stdlib__Bytes.sub_string in file "bytes.ml" (inlined), line 73, characters 44-59
Called from Stdlib__Lexing.sub_lexeme in file "lexing.ml" (inlined), line 196, characters 2-43
Called from Mdx__Lexer_mdx.__ocaml_lex_text_rec in file "lib/lexer_mdx.ml", line 11976, characters 2-78
Called from Mdx__Lexer_mdx.__ocaml_lex_text_rec in file "lib/lexer_mdx.mll", line 59, characters 21-40
Called from Mdx__Lexer_mdx.__ocaml_lex_text_rec in file "lib/lexer_mdx.mll", line 59, characters 21-40
Called from Mdx__Lexer_mdx.__ocaml_lex_text_rec in file "lib/lexer_mdx.mll", line 59, characters 21-40
Called from Mdx__Lexer_mdx.markdown_token in file "lib/lexer_mdx.mll", line 126, characters 11-29

Which was introduced by this modification to the lexer:

   | ( "<!--" ws* "$MDX" ws* ([^' ' '\n']* as label_cmt) ws* "-->" ws* eol? )?
-      "```" ([^' ' '\n']* as header) ws* ([^'\n']* as legacy_labels) eol
+      "```"  ([^' ' '=' '\n']* as header)? ws* ([^ '{' '\n']* as legacy_labels)
+      ('{' (('#'[^' ' '\n']+)? as _identifier) ws* ('.'([^' ' '}' '\n']+) as _header2)? ws* ([^'}' '\n']* as _attributes) '}')?
+      eol

If I remove the '=' from the first character set then it no longer raises that exception.

In other places where the testsuite runs it segfaults, which should definitely not happen.

E.g. using 'coredumpctl gdb' shows:

Core was generated by `/var/home/edwin/.opam/4.14.1/bin/ocamlrun ../../../../../../install/default/bin'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  caml_string_equal (s1=275, s2=139980178532800) at str.c:277
Downloading source file /var/home/edwin/.opam/4.14.1/.opam-switch/build/ocaml-base-compiler.4.14.1/runtime/str.c
                                                                                                                                                 
warning: 277    str.c: No such file or directory
(gdb) bt
#0  caml_string_equal (s1=275, s2=139980178532800) at str.c:277
#1  0x000000000042c8a8 in caml_interprete (prog=0x7f4face00010, prog_size=<optimized out>) at interp.c:936
#2  0x000000000042f55c in caml_main (argv=0x7fff1d4aba48) at startup_byt.c:588
#3  0x000000000040f78c in main (argc=<optimized out>, argv=<optimized out>) at main.c:37

Happens on 5.1.1 too:

rogram terminated with signal SIGSEGV, Segmentation fault.
#0  caml_string_equal (s1=-1, s2=139720215583240) at runtime/str.c:277
                                                                                                                                                 
warning: 277    runtime/str.c: No such file or directory
(gdb) bt
#0  caml_string_equal (s1=-1, s2=139720215583240) at runtime/str.c:277
#1  0x000000000043a28b in caml_interprete (prog=<optimized out>, prog_size=<optimized out>) at runtime/interp.c:1044
#2  0x000000000043bd39 in caml_main (argv=0x7ffd4c9db8c8) at runtime/startup_byt.c:577
#3  0x0000000000411a0c in main (argc=<optimized out>, argv=<optimized out>) at runtime/main.c:37

I've pushed the test code here: realworldocaml/mdx@main...edwintorok:mdx:segv, I haven't yet tried to create a minimal testcase, I'll update this bugreport when I do.

@edwintorok
Copy link
Contributor Author

edwintorok commented Jan 16, 2024

Here is a smaller testcase (not minimal, but it has no external dependencies and can be run with this single shell-script):

#!/bin/sh

cat >lexer_mdx.mll <<'EOF'
let eol = '\n' | eof
let ws = ' ' | '\t'

rule text section = parse
  | eof { [] }
  | ( "<!--" ws* "$MDX" ws* ([^' ' '\n']* as _label_cmt) ws* "-->" ws* eol? )?
      "```"  ([^' ' '=' '\n']* as _header)? ws* ([^ '{' '\n']* as _legacy_labels)
      ('{' (('#'[^' ' '\n']+)? as _identifier) ws* ('.'([^' ' '}' '\n']+) as _header2)? ws* ([^'}' '\n']* as _attributes) '}')?
      eol
      {
        let contents = block lexbuf in
        `Block contents :: text section lexbuf }
  | ([^'\n']* as str) eol
      {
        `Text str :: text section lexbuf }

and block = parse
  | eof | ws* as end_pad "```" ws* eol
    {
      [end_pad] }
  | ([^'\n']* as str) eol
    {
      str :: block lexbuf }
EOF

cat >testlexer.ml <<'EOF'
let input = {|
<!-- $MDX set-SOME_VAR="a" -->
```sh set-SOME_OTHER_VAR="b"
echo $SOME_VAR
a
echo $SOME_OTHER_VAR
b
```
|}

let () =
  let lexbuf = Lexing.from_string input in
  while
  match Lexer_mdx.text None lexbuf with
  | [] -> false
  | _ -> true
  do () done

EOF

ocamllex lexer_mdx.mll
ocamlopt --version
ocamlopt -runtime-variant d lexer_mdx.ml testlexer.ml -o testlexer.native -g
ocamlc -runtime-variant d lexer_mdx.ml testlexer.ml -o testlexer.bc -g
./testlexer.native
ocamlrun ./testlexer.bc
valgrind ./testlexer.native
ocamlrun ./testlexer.bc

This raises an exception when running the native code, and segfaults when running bytecode. Running under valgrind even the native code version segfaults:

1528 states, 13500 transitions, table size 63168 bytes
116884 additional bytes used for bindings
4.14.1
### OCaml runtime: debug mode ###
Initial minor heap size: 256k words
Initial major heap size: 992k bytes
Initial space overhead: 120%
Initial max overhead: 500%
Initial heap increment: 15%
Initial allocation policy: 2
Initial smoothing window: 1
./testme.sh: line 54: 3276556 Segmentation fault      (core dumped) ./testlexer.native
Fatal error: exception Invalid_argument("String.sub / Bytes.sub")
==3276559== Memcheck, a memory error detector
==3276559== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3276559== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==3276559== Command: ./testlexer.native
==3276559==
### OCaml runtime: debug mode ###
Initial minor heap size: 256k words
Initial major heap size: 992k bytes
Initial space overhead: 120%
Initial max overhead: 500%
Initial heap increment: 15%
Initial allocation policy: 2
Initial smoothing window: 1
==3276559== Invalid read of size 8
==3276559==    at 0x41B859: camlLexer_mdx____ocaml_lex_block_rec_295 (lexer_mdx.ml:11362)
==3276559==    by 0x41B61A: camlLexer_mdx____ocaml_lex_text_rec_293 (lexer_mdx.mll:11)
==3276559==    by 0x41B6BA: camlLexer_mdx____ocaml_lex_text_rec_293 (lexer_mdx.mll:15)
==3276559==    by 0x41B3C0: camlTestlexer__entry (testlexer.ml:14)
==3276559==    by 0x41A8B8: caml_program (in /home/edvint/git/mdx/t1/testlexer.native)
==3276559==    by 0x44C468: caml_start_program (in /home/edvint/git/mdx/t1/testlexer.native)
==3276559==    by 0x44CBD3: caml_startup_common (startup_nat.c:160)
==3276559==    by 0x44CC5A: caml_startup_exn (startup_nat.c:167)
==3276559==    by 0x44CC5A: caml_startup (startup_nat.c:172)
==3276559==    by 0x44CC5A: caml_main (startup_nat.c:179)
==3276559==    by 0x41A6BB: main (main.c:37)
==3276559==  Address 0x454d4f532d74656b is not stack'd, malloc'd or (recently) free'd
==3276559==
==3276559==
==3276559== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==3276559==  General Protection Fault
==3276559==    at 0x41B859: camlLexer_mdx____ocaml_lex_block_rec_295 (lexer_mdx.ml:11362)
==3276559==    by 0x41B61A: camlLexer_mdx____ocaml_lex_text_rec_293 (lexer_mdx.mll:11)
==3276559==    by 0x41B6BA: camlLexer_mdx____ocaml_lex_text_rec_293 (lexer_mdx.mll:15)
==3276559==    by 0x41B3C0: camlTestlexer__entry (testlexer.ml:14)
==3276559==    by 0x41A8B8: caml_program (in /home/edvint/git/mdx/t1/testlexer.native)
==3276559==    by 0x44C468: caml_start_program (in /home/edvint/git/mdx/t1/testlexer.native)
==3276559==    by 0x44CBD3: caml_startup_common (startup_nat.c:160)
==3276559==    by 0x44CC5A: caml_startup_exn (startup_nat.c:167)
==3276559==    by 0x44CC5A: caml_startup (startup_nat.c:172)
==3276559==    by 0x44CC5A: caml_main (startup_nat.c:179)
==3276559==    by 0x41A6BB: main (main.c:37)
==3276559==
==3276559== HEAP SUMMARY:
==3276559==     in use at exit: 4,194,657 bytes in 51 blocks
==3276559==   total heap usage: 51 allocs, 0 frees, 4,194,657 bytes allocated
==3276559==
==3276559== LEAK SUMMARY:
==3276559==    definitely lost: 0 bytes in 0 blocks
==3276559==    indirectly lost: 0 bytes in 0 blocks
==3276559==      possibly lost: 1,028,152 bytes in 2 blocks
==3276559==    still reachable: 3,166,505 bytes in 49 blocks
==3276559==         suppressed: 0 bytes in 0 blocks
==3276559== Rerun with --leak-check=full to see details of leaked memory
==3276559==
==3276559== For lists of detected and suppressed errors, rerun with: -s
==3276559== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)
./testme.sh: line 56: 3276559 Segmentation fault      (core dumped) valgrind ./testlexer.native
Fatal error: exception Invalid_argument("String.sub / Bytes.sub")

The line at which the segfault occurs looks like this:

 11362 "lexer_mdx.ml"
= Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_mem.(0) in

@nojb
Copy link
Contributor

nojb commented Jan 16, 2024

Just to confirm, can you run the repro using ocamllex -ml (to use the ML backend instead of the C automata interpreter)?

@edwintorok
Copy link
Contributor Author

Just to confirm, can you run the repro using ocamllex -ml (to use the ML backend instead of the C automata interpreter)?

That doesn't raise any exceptions, and doesn't crash (although it takes ages to compile the generated .ml file which is quite big):

wc -c lexer_mdx.ml
6804795 lexer_mdx.ml

@nojb
Copy link
Contributor

nojb commented Jan 16, 2024

The line at which the segfault occurs looks like this:

 11362 "lexer_mdx.ml"
= Lexing.sub_lexeme lexbuf lexbuf.Lexing.lex_start_pos lexbuf.Lexing.lex_mem.(0) in

lexbuf.Lexing.lex_mem.(0) is -1 here, which looks wrong.

@nojb
Copy link
Contributor

nojb commented Jan 16, 2024

Smaller repro:

lexer_mdx.mll

rule text = parse
  | ( ([^' ']* as _a) '}' )?
      ([^' ']* as _b)? ' '* ('{'* as _x)
      ('{' (('#'[^' ']+)? as _d) ' '* ('.'[^' ']+)? '}' '}')?
      { () }
  | [^'\n']* '\n'
      { () }

(replacing _x by _c makes the segfault go away).

testlexer.ml

let input = {|
```sh set-SOME_OTHER_VAR
```
|}

let () =
  let lexbuf = Lexing.from_string input in
  while true do
    Lexer_mdx.text lexbuf
  done

@edwintorok
Copy link
Contributor Author

Thanks for the minimization, and spotting the difference between _x and _c. If you look at the output of ocamllex then the size of the tables is different (is it expected that the name of a value changes the size?)

1112 states, 7499 transitions, table size 36668 bytes
93527 additional bytes used for bindings

1112 states, 7499 transitions, table size 36668 bytes
93371 additional bytes used for bindings

That and the location of the crash made me suspicious and added a bounds check (based off trunk commit b3b892e) . If I did it right then it first crashes on this out of bounds write, and there are no such crashes when using _c as the name:

$ ocamllex lexer_mdx.mll && ../ocamlc lexer_mdx.ml testlexer.ml -o t && ../runtime/ocamlrun ./t
...
[9] <- 7
[8] <- 7
[19] <- [13]
[253] <- [15]
ocamlrun: runtime/lexing.c:159: run_tag: Assertion `dst < caml_array_length(mem)' failed.
[1]    29627 IOT instruction (core dumped)  ../runtime/ocamlrun ./t

That array is supposed to be a lot smaller (FWIW the size of the array is the same when run with _c and _x but it never ends up trying to write 253 with _c):

lexbuf.Lexing.lex_mem <- Array.make 38 (-1);(* L=5 [12] <- p ; [11] <- p ; [10] <- p ; [9] <- p ; [8] <- p ;  *)

This is the assertion that I added:

@@ -146,12 +149,14 @@ static void run_tag(char *pc, value mem) {
     dst = *pc++ ;
     if (dst == 0xff)
       return ;
-    src = *pc++ ;
+    src = *pc++ ; 
     if (src == 0xff) {
-      /*      fprintf(stderr,"[%hhu] <- -1\n",dst) ; */
+            fprintf(stderr,"[%hhu] <- -1\n",dst) ; 
+      assert(dst < caml_array_length(mem));
       Field(mem,dst) = Val_int(-1) ;
     } else {
-      /*      fprintf(stderr,"[%hhu] <- [%hhu]\n",dst,src) ; */
+              fprintf(stderr,"[%hhu] <- [%hhu]\n",dst,src) ; 
+      assert(dst < caml_array_length(mem));
       Field(mem,dst) = Field(mem,src) ;
     }
   }

Then I've seen Short() and the not so short numbers in the ocamllex output, so I've added some more debug code:

--- a/lex/output.ml
+++ b/lex/output.ml
@@ -33,6 +33,8 @@ let output_array oc v =
   for i = 0 to Array.length v - 1 do
     output_byte oc (v.(i) land 0xFF);
     output_byte oc ((v.(i) asr 8) land 0xFF);
+    if v.(i) >= 32768 then
+      Printf.eprintf "Beware of v.[%d] = %d\n" i v.(i);
     if i land 7 = 7 then output_string oc "\\\n    "
   done;

and in lexing.c:

@@ -185,6 +190,7 @@ CAMLprim value caml_new_lex_engine(value vtbl, value start_state,
     backtrk = Short(tbl->lex_backtrk, state);
     if (backtrk >= 0) {
       int pc_off =  Short(tbl->lex_backtrk_code, state);
+      fprintf(stderr,"pc_off = %d (lex_backtrk_code[%d])\n", pc_off, state);
       run_tag(Bp_val(tbl->lex_code) + pc_off, lexbuf->lex_mem);
       lexbuf->lex_last_pos = lexbuf->lex_curr_pos;
       lexbuf->lex_last_action = Val_int(backtrk);
...
Beware of v.[792] = 32772
...
pc_off = -32764 (lex_backtrk_code[792])
[253] <- [15]
ocamlrun: runtime/lexing.c:159: run_tag: Assertion `dst < caml_array_length(mem)' failed.

So this does look like a 16-bit overflow:

(-32764 land 0xFF) = (32772 land 0xFF);;
- : bool = true
((-32764 asr 8) land 0xFF) = ((32772 asr 8) land 0xFF);;
- : bool = true

I don't know what the solution would be though (switch to 32-bit entries in tables?), but if this is indeed the problem it'd be nice if ocamllex could detect when its tables have become too big and fail (or switch to the ML backend) than produce something that segfaults.

@edwintorok
Copy link
Contributor Author

edwintorok commented Jan 16, 2024

Currently this is the only limit that I can see, shouldn't all tables have a limit?

output.ml:  if Array.length tables.tbl_trans > 0x8000 then raise Table_overflow;

Maybe something like this as a last resort to catch any overflows?

diff --git a/lex/output.ml b/lex/output.ml
index d5ce76eb7f..75ae9bda7f 100644
--- a/lex/output.ml
+++ b/lex/output.ml
@@ -28,11 +28,13 @@ let output_byte oc b =
   output_char oc (Char.chr(48 + (b / 10) mod 10));
   output_char oc (Char.chr(48 + b mod 10))
 
+exception Table_overflow
 let output_array oc v =
   output_string oc "   \"";
   for i = 0 to Array.length v - 1 do
     output_byte oc (v.(i) land 0xFF);
     output_byte oc ((v.(i) asr 8) land 0xFF);
+    if v.(i) >= 0x8000 then raise Table_overflow;
     if i land 7 = 7 then output_string oc "\\\n    "
   done;
   output_string oc "\""
@@ -117,7 +119,6 @@ let output_entry some_mem_code ic oc has_refill oci e =
 
 (* Main output function *)
 
-exception Table_overflow
 
 let output_lexdef ic oc oci header rh tables entry_points trailer =
   if not !Common.quiet_mode then

@lthls
Copy link
Contributor

lthls commented Jan 16, 2024

I've debugged on my side a came to the same conclusion (a few minutes later...).
I see two things we should do:

  • Properly raise on overflows, as suggested by @edwintorok above;
  • Switch to unsigned shorts for the pc_off reads, to recover an extra wasted bit

Technically we could have a code table slightly bigger than 32768 (or 65536 with unsigned shorts), as we only need the start of the code sequences to be encoded properly, but it's probably better to be conservative here.

@edwintorok
Copy link
Contributor Author

(Now that I know why it was crashing, I rewrote the regex to reduce the state size and avoid this error in realworldocaml/mdx#445)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants