-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
normalize \r*\n newlines to \n during lexing #12514
Conversation
@@ -684,7 +703,7 @@ and comment = parse | |||
} | |||
| newline | |||
{ update_loc lexbuf None 1 false 0; | |||
store_lexeme lexbuf; | |||
store_normalized_newline (); | |||
comment lexbuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: newlines are also normalized in comments, because (some) comments are now stored as documentation attributes in the AST, so the same portability issue as string literals could arise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.
99e7354
to
fb9294a
Compare
Note: the documentation in this PR (added to the "lexical conventions" section of the reference manual) makes explicit the fact that we consider any \r*\n sequence as a newline, not just \n and \r\n -- so for example \r\r\r\n gets normalized into \n. I am not completely sure why we use \r*\n and not \r?\n, and whether it is wise to add it explicitly to the specification. (I expect @damiendoligez to know all about this.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amusingly IIUC this was actually already done for character literals here. Also if I read this correctly OCaml never supported lone '\r' newlines.
Not enough into the lexer to formally approve but it looks good to me.
@@ -684,7 +703,7 @@ and comment = parse | |||
} | |||
| newline | |||
{ update_loc lexbuf None 1 false 0; | |||
store_lexeme lexbuf; | |||
store_normalized_newline (); | |||
comment lexbuf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.
Thanks @gasche. I would perhaps consider also changing the definition of |
I believe this was intended to "do the right thing" even for double-text-encoded files under Windows, where \r\r\n end-of-lines can occur if e.g. you run "unix2dos" twice. (An instance of Postel's robustness principle, "be tolerant in what you accept from others".) I don't think it can hurt.
I think OCaml 1.x was running on classic MacOS, where end of line is marked by a single CR character. It (and everything else in classic MacOS) was a pain to handle. |
(exactly as @xavierleroy says for the newline regexp - see #6165) |
Personally I share @rossberg's concern here. Introducing hacks to accommodate other tools or tool operation failures is rarely a good idea. I don't know any file format that normalizes \r*\n to a single newline and that's certainly not a normalization you will find in the Unicode standard recommendations. It's likely safer not to play weirdo here. This makes OCaml's vision of lines incompatible with any tool that processes lines as the Unicode standard recommends. |
I believe that this PR suffers from the problem mentioned in #12502 by @jberdine and @ceastlund, and discussed in more details in ocaml-ppx/ocamlformat#1748. Jane Street uses quoted string literals to store the expected content of files in except tests, and some of these tests (including on Linux machines) are the output of Windows tools that use CRLF in their output:
An example provided by @bcc32 was as follows, where let%expect_test _ =
do_test (fun () ->
let%bind () = system "cat foo.csv" in
[%expect
{|
e,d,f^M
"x
x",x,xxx^M
e,d,f
"x
x",x,xxx^M
yy,y,"y
yy"
|}];
return ())
;; In this case, the use of \r\n-style line endings in the source code is intentional as it mirrors the expected content of an external file, and the use of quoted string literals is designed to be able to enable automated tooling that modifies the OCaml code to "promote" the test output -- in particular, this promotion step does not want to deal with the OCaml escape rules. The PR as proposed will silently break that use-case. I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh. Ask expect-test authors to support non-quoted string literals with explicit escape codes, specifically for the case of outputs that must preserve carriage return characters?) |
FWIW, An easy way to get CR CR LF is with an OCaml program whose CR-LF encoded source code has multiline string literals printed to a text file :-) |
One could argue that the test case you show would be better off being explicit about that since if I gather correctly it has a mixture \n (strings between double quote) and \r\n (standard CSV terminator). That's not going to be a silent failure since it will break the test suite :-) but it definitively breaks something. (Isn't it fascinating how badly designed stuff always ends up being used in creative ways) |
In the proposed new world... is there a way to express a |
It becomes unexpressible. Every newline convention gets normalized to Of course you can easily map these So rather what becomes inexpressible is a string where you need to preserve a mixture of newline conventions like in the test case shown above. |
I think if certain strings become inexpressible in either of our string formats, as in we can no longer write |
If a workaround for this use case is desired, it might be more appealing to anoit a specific cookie to use for string literals that should be exempted from newline normalization. Then such expect test strings could be written something like: {raw_newlines|
...
|raw_newlines} I am not saying that I think this is necessarily a good idea since the uniformity across the language argument for normalization takes a hit, but this seems better than a command-line flag for a very niche use case. |
Maybe you could introduce an additional primitive that escapes. Something like: [%expect_escaped
"\
e,d,f\r\n\
\"x\n\
x\",x,xxx\r\n\
e,d,f\n\
\"x\n\
x\",x,xxx\r\n\
yy,y,\"y\n\
yy"
|}]; which admitely makes that particular example less readable because of the escapes for quotes.
You can currently express it on Windows but it is brittle because it seems the default of "Git for Windows" is to normalize line endings on checkout to CRLF (see this git parameter). So you should override the parameter in the (Personally I'm not sure it's a good idea to expect the content of arbitrary file formats in OCaml sources like is done above. In my expectation tests I write results in separate files and diff that instead (vs surgically update sources). You do lose the nice closeness of the operation and the result though.) |
Looking back at #6165 again, the issue was of course outside strings, comments, etc. - i.e. the source file |
But that won't happen if you accept '\r' (again apparently) as a single newline. OCaml will see more newlines but that's blanks. What are the problems that this poses ?
Did you mean foo\n\n ? Because that's what would make sense from a standard line normalization point of view. |
I don't believe in this PR anymore with the proposed use-case of @ceastlund for the current worklow: it does not feel particularly wrong to me. Closing. In #12502 I propose an alternative approach for the problem "life is hard for Windows users that may not realize the consequence of \r\n newlines in their string literals", introducing a new warning rather than normalization. A nice with with a new warning is that it can easily be disabled locally in odd cases. I could possibly be convinced to go back to this normalization approach with command-line support for parametrization. But note that lexing-time logic is annoying to customize/configure, as inside-the-source mechanisms (attributes), which are much more pleasant than build-system-level tweaks, cannot be used. |
I quite like this PR, to the point that I'm going to reopen it. (Sorry, @gasche!) To me it's a very reasonable and pragmatic solution to "the CRLF Windows problem". Let's keep it alive. |
Ah, the joys of my eyesight - I had not seen/registered the return of |
The point of having arbitrary "cookies" is to not force a fixed delimiter. If you fix the end delimiter as |
@xavierleroy I don't mind reopening if there is a (weak) global consensus in the direction of this PR. I can make further changes if necessary. Can I guilt-trip you in making a decision on |
@xavierleroy, |
Sure! As a fallback case you can use |
Quoted, escaped strings can be significantly less readable for some contents, just as unescaped strings can be significantly less readable for others. Also having your whole string constant suddenly change from one format to another just because of a single byte change in its contents, is jarring and noisy for code reviewers/maintainers. It's been very handy to be able to pick one kind of delimiter and preserve it in repeated calls to code generators, including but not limited to |
No guilt, my pleasure :-) Looking at https://en.wikipedia.org/wiki/Newline , I see that the platform that use On the other hand, as I mentioned earlier, it's quite easy to get lines terminated by So, I still think That other languages make different choices doesn't move me. Perhaps they want to support MacOS 9 or another of the dead systems mentioned in the Wikipedia page. They are welcome. |
64fda9a
to
8c66bdd
Compare
8c66bdd
to
4cc315c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good modulo a few suggestions.
febdf21
to
bffe5dc
Compare
I applied @damiendoligez' suggestions. |
bffe5dc
to
5a69ac5
Compare
Merged. This was not an easy discussion (on a frustrating topic), but I think this is a good compromise: nobody likes it but there is a good consensus that it improves over the previous state. Thanks everyone! |
shouldn't this be marked as breaking change ? I had to patch jsoo after this PR. jsoo uses the |
Sure. Can you point to the fix that you applied on the jsoo side? |
ocsigen/js_of_ocaml@ad2b300#diff-0ac51debf1030734450ff62eb2f237f3aadf07bcb0abc6a7b9437a882793c62cR88 The patch is changing the generated source to use the double quote syntax with a ~name:%S
~content:{frag|%s|frag}
- ~fragments:(Some {frag|%s|frag})
+ ~fragments:(Some %S) |
+1. I have code in released projects which builds source files with data embedded in them as quoted strings. Will this data suddenly become corrupted under this change? Is it at least backward-compatible? i.e if I fix it to work in 5.2, will the data be unchanged in 5.1 and earlier? Or do I have to ship two different versions of my source code? |
@johnwhitington my recommendation, if you want to ship binary data rather than text, is to use non-quoted string literals with escaped characters. (It is useful if people point out here that they are negatively affected by the change. Maybe if we were to find out that it breaks too much stuff people would want to revert it.) |
I also want to point out a positive consequence of this PR. |
If I understand correctly, this PR implements the consensus on "an ideal behavior of the compiler would be ..." that emerged from #12502.
This PR changes the language semantics of multiline string literals for all OCaml files that use \r\n (or in fact \r+\n) line endings. Our expectation is that this only concerns Windows users, and that the behavior change in fact corresponds to a bug fix for them, because the files they are seeing were meant to use \n line endings but were converted on the fly by some tools (for example, Git for Windows on checkout).
Because the semantics of newlines in string literals is now "portable" across os-specific newline conventions, this PR removes warning 29 (which was disabled by default I think?) on newlines in
"..."
string literals. (There was no warning on newlines in quoted string literals.) The warning definition, warning number, etc. are all left for compatibility reasons, the warning is just not raised anymore.