Error message shows wrong source text with preprocessor #12238

Spivoxity · 2023-05-12T01:24:36Z

I'm using OCaml 4.11.1 on Debian/amd64.

If a (text-based) preprocessor heads its output with a #line directive, then compiler error messages show the correct line and char numbers, but the echo of the source text shown next to them is wrong. For example, try compiling the attached file foo.ml using the trivial preprocessor trivpp, a shell script that just prepends a #line directive.

ocamlc -c -pp trivpp foo.ml

The result is the following error message:

File "foo.ml", line 100, characters 8-11:
 99 | ..
100 | ..... = .
Error: This expression has type int
       This is not a function; it cannot be applied.

The line number is correct, but the echo of the source is garbled.

Interestingly, deleting all the input text after the line with the error results in a correct display of the relevant source code.

(I am not using this trivial preprocessor, but another one that implements certain constructs by textual expansion. For files that don't contain instances of the constructs, its effect is the same as the trivial preprocessor, however.)

The text was updated successfully, but these errors were encountered:

Spivoxity · 2023-05-12T01:27:42Z

Attached files
spivey.zip

Spivoxity · 2023-05-12T01:28:02Z

Reproduced in 5.0.0

xavierleroy · 2023-05-16T09:04:19Z

This is your trivpp script:

#!/bin/sh
echo "# 1 \"$1\""
cat $1

The # lineno filename directive gives the line number for the next line, not for the line containing the directive. So, it should be echo "# 2 \"$1\"". I know it's confusing, but that's how the C preprocessor works too.

Spivoxity · 2023-05-16T10:36:00Z

Thanks for your response.

Yes, it is confusing, but très respectueusement I suggest that ce n'est pas moi qui est confus. Line 2 of the trivpp output comes from line 1 of the input file, so it's right to label it as line 1. The error is truly on line 100, and that is the line number given in the error header.

File "foo.ml", line 100, characters 8-11:

What's wrong is the subsequent display of the source text. Making the change you suggest doesn't fix things, but just results in the header being wrong as well!

And cpp foo.ml results in the output

# 1 "foo.ml"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "foo.ml"
let x = 1
let x = 1
let x = 1
let x = 1
...

which starts with a lot of nonsense, but does indeed label the first true line of its output as coming from line 1 of the input file.

As further evidence that the trouble is elsewhere, keeping trivpp as I wrote it, but deleting the tail of the input file beyond line 100 or so results on my machine in a correct display of source text, suggesting that the process that is selecting fragments of source to display is losing track of where it is.

lthls · 2023-05-16T12:22:21Z

I can confirm that this behaviour exists on trunk. The fun part is that it depends on the path used to refer to foo.ml:

$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp foo.ml
File "foo.ml", line 100, characters 8-11:
 99 | ..
100 | ..... = .
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ./foo.ml
File "./foo.ml", line 100, characters 8-11:
 99 | 
100 | ....... 1
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ../ocaml/foo.ml
File "../ocaml/foo.ml", line 100, characters 8-11:
100 | let x = 100 100
              ^^^
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ${PWD}/foo.ml
File "/absolute/path/to/ocaml/foo.ml", line 100, characters 8-11:
 99 | .....
100 | ..t x....
Error: This expression has type int
       This is not a function; it cannot be applied.

A bit of quick experimentation seems to suggest that in location.ml, in lines_around_from_current_input, if we have to call from_file then we get the wrong result (probably because we look in the original file, but at positions computed from the internal lexing state which works on the pre-processed file). However, if we happen to still have the relevant phrases in the lexing buffer, then these will get printed correctly.

xavierleroy · 2023-05-18T17:14:26Z

Yes, it is confusing, but très respectueusement I suggest that ce n'est pas moi qui est confus. Line 2 of the trivpp output comes from line 1 of the input file, so it's right to label it as line 1.

The C preprocessors label it as line 2, so regardless of what you think and of how much French you use, OCaml will stick to this behavior. You can adapt your script as I suggested or keep complaining about it, that's your choice.

Spivoxity · 2023-05-18T18:03:08Z

I apologise for the offence, which was not intended. The facts, however, are as I stated them. It seems fruitless to say more.

Spivoxity · 2023-06-01T09:57:47Z

Just to be clear, this bug has nothing to do with the numbering of lines. Ocamlc implements the correct convention, consistent with the behaviour of cpp, even if Xavier describes it wrongly above. Using either convention in trivpp still causes the error to appear, but with the lines numbered wrongly in one case and correctly in the other.

I see that ocamlc suppresses the context lines after an error message when compiling a source containing #line directives without a preprocessor (e.g., the output of ocamlyacc). Perhaps the simplest fix would be to adopt the same behaviour when a preprocessor is used. To help my students (who are making the transition from Scala), I will provide a wrapper that invokes ocamlc with OCAML_ERROR_STYLE set to short, then reads and interprets the error messages to provide context. I have a suitable script (written in TCL, sorry) ready to go.

nojb · 2023-06-01T14:17:37Z

Thanks for the report.

Just paraphrasing what @lthls has already said, the problem is simple to explain: when using line directives, the specified file name is used to look up the "context" material printed in error messages; this is nonesense because the lexing offsets only make sense on the actual input file, not the one mentioned in the line directive. There's also an optimization that avoids looking inside the file if the material is still present in the lexing buffer; in this case the bug is not triggered (as the lexing offsets are correct for the contents of the lexing buffer).

shindere · 2023-07-19T13:44:27Z

My understanding is that nothing special will happen here so I am closing this issue, anybody should feel free to re-open it if necessary.

lthls · 2023-07-19T19:03:29Z

I don't see the point of closing this issue. There is something wrong, the cause is known, we just don't have any people working on the fix at the moment. It actually has a "bug" label, so I think we should make sure that the bug has actually disappeared before closing such issues.

Spivoxity · 2023-07-19T19:08:18Z

I agree. Though it doesn't affect my teaching any more (I'm setting OCAML_ERROR_STYLE to short and dealing with error context myself), it seems to me that a bug exists and a fix is needed, and a simple fix would be to disable the error context whenever a # line directive is seen, whether in a plain input file, like those produced by ocamlyacc and ocamllex, or in the output of a preprocessor. I'd prepare a patch myself, but I am knee deep in patches for other projects at the moment.

gasche · 2023-07-19T19:36:47Z

So what should we do to fix this issue? I can think of the following approaches:

Disable the clever printing style whenever line directives have been used.
Better: disable the fallback of reading the source file (when the lexing buffer does not have the code) whenever line directives have been used.
Increase the default size of the lexing buffer to make the problem go away in practice for human-written source files.
Instead of throwing past data when we refill the buffer, we could buffer the whole input in (as the Buffer module would do), keeping all the input source instead of only a fixed window of it. (This behaviour could be conditional on the error printing style, to provide an easy way to reduce the memory usage when compiling humongous source files.)

Currently the default buffer size appears to be 1024 bytes. The largest .ml file in the compiler distribution is the menhir-generated parser.ml, at 2,443,470 bytes. The largest human-written source file is typing/typecore.ml, at 246,316 bytes. If we setup the lexing buffer to start at 256KiB, strategy 3 would suffice for all human-written files, and strategy 4 would use at most 4Mio (assuming an exponential doubling strategy) to store source text. Does that sound reasonable on today's machines?

(Note: parser.cmo is 1.7Mio on my machine, and parser.cmt is 3.2Mio; the 4Mio of memory usage caused by the lexing buffer in more than those but not outlandish.)

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer when available (it has not been discarded to make room for more source code), orelse by trying to re-open the source file. Re-opening the source file is not so reliable, in particular it fails in presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file). We propose to workaround this issue by simply using large lexing buffers by default, so that the vast majority of program keep the whole source input in the lexing buffer, and the unreliable fallback is rarely used.

shindere · 2023-07-21T08:01:26Z

My bad, I took a too quick decision. My apologies to everybody and thanks to the other developers and participants for their awareness.

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer when available (it has not been discarded to make room for more source code), orelse by trying to re-open the source file. Re-opening the source file is not so reliable, in particular it fails in presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file), see ocaml#12238. This commit fixes the issue by reading the whole source file and then using Lexing.from_string, which preserves all the input in the buffer.

This logic is not robust in presence of lexer directives and will silently do the wrong file. We don't need it in the compiler anymore now that we read the entire file at once -- we do not need a fallback strategy after lines_around_from_lexbuf anymore. In theory this might make a difference for compiler-libs users that would set Location.input_name but not Location.input_lexbuf, and rely on the read-from-file fallback logic. Those users can fix their code (in a backward-compatible way) by setting Location.input_lexbuf themselves.

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer when available (it has not been discarded to make room for more source code), orelse by trying to re-open the source file. Re-opening the source file is not so reliable, in particular it fails in presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file), see ocaml#12238. This commit fixes the issue by reading the whole source file and then using Lexing.from_string, which preserves all the input in the buffer.

This logic is not robust in presence of lexer directives and will silently do the wrong file. We don't need it in the compiler anymore now that we read the entire file at once -- we do not need a fallback strategy after lines_around_from_lexbuf anymore. In theory this might make a difference for compiler-libs users that would set Location.input_name but not Location.input_lexbuf, and rely on the read-from-file fallback logic. Those users can fix their code (in a backward-compatible way) by setting Location.input_lexbuf themselves.

gasche · 2023-09-04T12:22:44Z

I just merged #12403 which has the compiler read files in full before parsing them. This should fix the issue. The fix will be included in OCaml 5.2 (not the imminent 5.1 release). Thanks for the report!

Some of our error printing styles print the source code at the location of the error. This source is found either in the lexing buffer when available (it has not been discarded to make room for more source code), orelse by trying to re-open the source file. Re-opening the source file is not so reliable, in particular it fails in presence of preprocessor directives (the user-facing locations we have do not necessarily refer to real locations in the input file), see ocaml#12238. This commit fixes the issue by reading the whole source file and then using Lexing.from_string, which preserves all the input in the buffer.

This logic is not robust in presence of lexer directives and will silently do the wrong file. We don't need it in the compiler anymore now that we read the entire file at once -- we do not need a fallback strategy after lines_around_from_lexbuf anymore. In theory this might make a difference for compiler-libs users that would set Location.input_name but not Location.input_lexbuf, and rely on the read-from-file fallback logic. Those users can fix their code (in a backward-compatible way) by setting Location.input_lexbuf themselves.

nojb added the bug label Jun 1, 2023

shindere closed this as completed Jul 19, 2023

nojb reopened this Jul 20, 2023

gasche mentioned this issue Jul 20, 2023

Workaround #12238 by using large lexing buffers #12396

Closed

gasche mentioned this issue Jul 21, 2023

Read source files in one go before lexing #12403

Merged

gasche closed this as completed Sep 4, 2023

gasche mentioned this issue Feb 25, 2024

Fix a 5.2 source-quotation-in-error-messages regression for dune+ppxlib #12991

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error message shows wrong source text with preprocessor #12238

Error message shows wrong source text with preprocessor #12238

Spivoxity commented May 12, 2023

Spivoxity commented May 12, 2023

Spivoxity commented May 12, 2023

xavierleroy commented May 16, 2023

Spivoxity commented May 16, 2023 •

edited

lthls commented May 16, 2023

xavierleroy commented May 18, 2023

Spivoxity commented May 18, 2023

Spivoxity commented Jun 1, 2023

nojb commented Jun 1, 2023 •

edited

shindere commented Jul 19, 2023 via email

lthls commented Jul 19, 2023

Spivoxity commented Jul 19, 2023

gasche commented Jul 19, 2023

shindere commented Jul 21, 2023 via email

gasche commented Sep 4, 2023

Error message shows wrong source text with preprocessor #12238

Error message shows wrong source text with preprocessor #12238

Comments

Spivoxity commented May 12, 2023

Spivoxity commented May 12, 2023

Spivoxity commented May 12, 2023

xavierleroy commented May 16, 2023

Spivoxity commented May 16, 2023 • edited

lthls commented May 16, 2023

xavierleroy commented May 18, 2023

Spivoxity commented May 18, 2023

Spivoxity commented Jun 1, 2023

nojb commented Jun 1, 2023 • edited

shindere commented Jul 19, 2023 via email

lthls commented Jul 19, 2023

Spivoxity commented Jul 19, 2023

gasche commented Jul 19, 2023

shindere commented Jul 21, 2023 via email

gasche commented Sep 4, 2023

Spivoxity commented May 16, 2023 •

edited

nojb commented Jun 1, 2023 •

edited