Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error message shows wrong source text with preprocessor #12238

Closed
Spivoxity opened this issue May 12, 2023 · 15 comments
Closed

Error message shows wrong source text with preprocessor #12238

Spivoxity opened this issue May 12, 2023 · 15 comments
Labels

Comments

@Spivoxity
Copy link

I'm using OCaml 4.11.1 on Debian/amd64.

If a (text-based) preprocessor heads its output with a #line directive, then compiler error messages show the correct line and char numbers, but the echo of the source text shown next to them is wrong. For example, try compiling the attached file foo.ml using the trivial preprocessor trivpp, a shell script that just prepends a #line directive.

ocamlc -c -pp trivpp foo.ml

The result is the following error message:

File "foo.ml", line 100, characters 8-11:
 99 | ..
100 | ..... = .
Error: This expression has type int
       This is not a function; it cannot be applied.

The line number is correct, but the echo of the source is garbled.

Interestingly, deleting all the input text after the line with the error results in a correct display of the relevant source code.

(I am not using this trivial preprocessor, but another one that implements certain constructs by textual expansion. For files that don't contain instances of the constructs, its effect is the same as the trivial preprocessor, however.)

@Spivoxity
Copy link
Author

Attached files
spivey.zip

@Spivoxity
Copy link
Author

Reproduced in 5.0.0

@xavierleroy
Copy link
Contributor

This is your trivpp script:

#!/bin/sh
echo "# 1 \"$1\""
cat $1

The # lineno filename directive gives the line number for the next line, not for the line containing the directive. So, it should be echo "# 2 \"$1\"". I know it's confusing, but that's how the C preprocessor works too.

@Spivoxity
Copy link
Author

Spivoxity commented May 16, 2023

Thanks for your response.

Yes, it is confusing, but très respectueusement I suggest that ce n'est pas moi qui est confus. Line 2 of the trivpp output comes from line 1 of the input file, so it's right to label it as line 1. The error is truly on line 100, and that is the line number given in the error header.

File "foo.ml", line 100, characters 8-11:

What's wrong is the subsequent display of the source text. Making the change you suggest doesn't fix things, but just results in the header being wrong as well!

And cpp foo.ml results in the output

# 1 "foo.ml"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "foo.ml"
let x = 1
let x = 1
let x = 1
let x = 1
...

which starts with a lot of nonsense, but does indeed label the first true line of its output as coming from line 1 of the input file.

As further evidence that the trouble is elsewhere, keeping trivpp as I wrote it, but deleting the tail of the input file beyond line 100 or so results on my machine in a correct display of source text, suggesting that the process that is selecting fragments of source to display is losing track of where it is.

@lthls
Copy link
Contributor

lthls commented May 16, 2023

I can confirm that this behaviour exists on trunk. The fun part is that it depends on the path used to refer to foo.ml:

$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp foo.ml
File "foo.ml", line 100, characters 8-11:
 99 | ..
100 | ..... = .
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ./foo.ml
File "./foo.ml", line 100, characters 8-11:
 99 | 
100 | ....... 1
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ../ocaml/foo.ml
File "../ocaml/foo.ml", line 100, characters 8-11:
100 | let x = 100 100
              ^^^
Error: This expression has type int
       This is not a function; it cannot be applied.
$ ./ocamlc.opt -nostdlib -I stdlib/ -c -pp trivpp ${PWD}/foo.ml
File "/absolute/path/to/ocaml/foo.ml", line 100, characters 8-11:
 99 | .....
100 | ..t x....
Error: This expression has type int
       This is not a function; it cannot be applied.

A bit of quick experimentation seems to suggest that in location.ml, in lines_around_from_current_input, if we have to call from_file then we get the wrong result (probably because we look in the original file, but at positions computed from the internal lexing state which works on the pre-processed file). However, if we happen to still have the relevant phrases in the lexing buffer, then these will get printed correctly.

@xavierleroy
Copy link
Contributor

Yes, it is confusing, but très respectueusement I suggest that ce n'est pas moi qui est confus. Line 2 of the trivpp output comes from line 1 of the input file, so it's right to label it as line 1.

The C preprocessors label it as line 2, so regardless of what you think and of how much French you use, OCaml will stick to this behavior. You can adapt your script as I suggested or keep complaining about it, that's your choice.

@Spivoxity
Copy link
Author

I apologise for the offence, which was not intended. The facts, however, are as I stated them. It seems fruitless to say more.

@Spivoxity
Copy link
Author

Just to be clear, this bug has nothing to do with the numbering of lines. Ocamlc implements the correct convention, consistent with the behaviour of cpp, even if Xavier describes it wrongly above. Using either convention in trivpp still causes the error to appear, but with the lines numbered wrongly in one case and correctly in the other.

I see that ocamlc suppresses the context lines after an error message when compiling a source containing #line directives without a preprocessor (e.g., the output of ocamlyacc). Perhaps the simplest fix would be to adopt the same behaviour when a preprocessor is used. To help my students (who are making the transition from Scala), I will provide a wrapper that invokes ocamlc with OCAML_ERROR_STYLE set to short, then reads and interprets the error messages to provide context. I have a suitable script (written in TCL, sorry) ready to go.

@nojb
Copy link
Contributor

nojb commented Jun 1, 2023

Thanks for the report.

Just paraphrasing what @lthls has already said, the problem is simple to explain: when using line directives, the specified file name is used to look up the "context" material printed in error messages; this is nonesense because the lexing offsets only make sense on the actual input file, not the one mentioned in the line directive. There's also an optimization that avoids looking inside the file if the material is still present in the lexing buffer; in this case the bug is not triggered (as the lexing offsets are correct for the contents of the lexing buffer).

@nojb nojb added the bug label Jun 1, 2023
@shindere
Copy link
Contributor

shindere commented Jul 19, 2023 via email

@lthls
Copy link
Contributor

lthls commented Jul 19, 2023

I don't see the point of closing this issue. There is something wrong, the cause is known, we just don't have any people working on the fix at the moment. It actually has a "bug" label, so I think we should make sure that the bug has actually disappeared before closing such issues.

@Spivoxity
Copy link
Author

I agree. Though it doesn't affect my teaching any more (I'm setting OCAML_ERROR_STYLE to short and dealing with error context myself), it seems to me that a bug exists and a fix is needed, and a simple fix would be to disable the error context whenever a # line directive is seen, whether in a plain input file, like those produced by ocamlyacc and ocamllex, or in the output of a preprocessor. I'd prepare a patch myself, but I am knee deep in patches for other projects at the moment.

@gasche
Copy link
Member

gasche commented Jul 19, 2023

So what should we do to fix this issue? I can think of the following approaches:

  1. Disable the clever printing style whenever line directives have been used.
  2. Better: disable the fallback of reading the source file (when the lexing buffer does not have the code) whenever line directives have been used.
  3. Increase the default size of the lexing buffer to make the problem go away in practice for human-written source files.
  4. Instead of throwing past data when we refill the buffer, we could buffer the whole input in (as the Buffer module would do), keeping all the input source instead of only a fixed window of it. (This behaviour could be conditional on the error printing style, to provide an easy way to reduce the memory usage when compiling humongous source files.)

Currently the default buffer size appears to be 1024 bytes. The largest .ml file in the compiler distribution is the menhir-generated parser.ml, at 2,443,470 bytes. The largest human-written source file is typing/typecore.ml, at 246,316 bytes. If we setup the lexing buffer to start at 256KiB, strategy 3 would suffice for all human-written files, and strategy 4 would use at most 4Mio (assuming an exponential doubling strategy) to store source text. Does that sound reasonable on today's machines?

(Note: parser.cmo is 1.7Mio on my machine, and parser.cmt is 3.2Mio; the 4Mio of memory usage caused by the lexing buffer in more than those but not outlandish.)

@nojb nojb reopened this Jul 20, 2023
gasche added a commit to gasche/ocaml that referenced this issue Jul 20, 2023
Some of our error printing styles print the source code at the
location of the error. This source is found either in the lexing
buffer when available (it has not been discarded to make room for more
source code), orelse by trying to re-open the source file. Re-opening
the source file is not so reliable, in particular it fails in presence
of preprocessor directives (the user-facing locations we have do not
necessarily refer to real locations in the input file).

We propose to workaround this issue by simply using large lexing
buffers by default, so that the vast majority of program keep the
whole source input in the lexing buffer, and the unreliable fallback
is rarely used.
@shindere
Copy link
Contributor

shindere commented Jul 21, 2023 via email

gasche added a commit to gasche/ocaml that referenced this issue Jul 21, 2023
Some of our error printing styles print the source code at the
location of the error. This source is found either in the lexing
buffer when available (it has not been discarded to make room for more
source code), orelse by trying to re-open the source file. Re-opening
the source file is not so reliable, in particular it fails in presence
of preprocessor directives (the user-facing locations we have do not
necessarily refer to real locations in the input file), see ocaml#12238.

This commit fixes the issue by reading the whole source file and then
using Lexing.from_string, which preserves all the input in the buffer.
gasche added a commit to gasche/ocaml that referenced this issue Jul 21, 2023
This logic is not robust in presence of lexer directives and will
silently do the wrong file. We don't need it in the compiler anymore
now that we read the entire file at once -- we do not need a fallback
strategy after lines_around_from_lexbuf anymore.

In theory this might make a difference for compiler-libs users that
would set Location.input_name but not Location.input_lexbuf, and rely
on the read-from-file fallback logic. Those users can fix their
code (in a backward-compatible way) by setting Location.input_lexbuf
themselves.
gasche added a commit to gasche/ocaml that referenced this issue Sep 3, 2023
Some of our error printing styles print the source code at the
location of the error. This source is found either in the lexing
buffer when available (it has not been discarded to make room for more
source code), orelse by trying to re-open the source file. Re-opening
the source file is not so reliable, in particular it fails in presence
of preprocessor directives (the user-facing locations we have do not
necessarily refer to real locations in the input file), see ocaml#12238.

This commit fixes the issue by reading the whole source file and then
using Lexing.from_string, which preserves all the input in the buffer.
gasche added a commit to gasche/ocaml that referenced this issue Sep 3, 2023
This logic is not robust in presence of lexer directives and will
silently do the wrong file. We don't need it in the compiler anymore
now that we read the entire file at once -- we do not need a fallback
strategy after lines_around_from_lexbuf anymore.

In theory this might make a difference for compiler-libs users that
would set Location.input_name but not Location.input_lexbuf, and rely
on the read-from-file fallback logic. Those users can fix their
code (in a backward-compatible way) by setting Location.input_lexbuf
themselves.
@gasche
Copy link
Member

gasche commented Sep 4, 2023

I just merged #12403 which has the compiler read files in full before parsing them. This should fix the issue. The fix will be included in OCaml 5.2 (not the imminent 5.1 release). Thanks for the report!

@gasche gasche closed this as completed Sep 4, 2023
eutro pushed a commit to eutro/ocaml that referenced this issue Sep 17, 2023
Some of our error printing styles print the source code at the
location of the error. This source is found either in the lexing
buffer when available (it has not been discarded to make room for more
source code), orelse by trying to re-open the source file. Re-opening
the source file is not so reliable, in particular it fails in presence
of preprocessor directives (the user-facing locations we have do not
necessarily refer to real locations in the input file), see ocaml#12238.

This commit fixes the issue by reading the whole source file and then
using Lexing.from_string, which preserves all the input in the buffer.
eutro pushed a commit to eutro/ocaml that referenced this issue Sep 17, 2023
This logic is not robust in presence of lexer directives and will
silently do the wrong file. We don't need it in the compiler anymore
now that we read the entire file at once -- we do not need a fallback
strategy after lines_around_from_lexbuf anymore.

In theory this might make a difference for compiler-libs users that
would set Location.input_name but not Location.input_lexbuf, and rely
on the read-from-file fallback logic. Those users can fix their
code (in a backward-compatible way) by setting Location.input_lexbuf
themselves.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants