normalize \r*\n newlines to \n during lexing #12514

gasche · 2023-08-29T15:22:56Z

If I understand correctly, this PR implements the consensus on "an ideal behavior of the compiler would be ..." that emerged from #12502.

This PR changes the language semantics of multiline string literals for all OCaml files that use \r\n (or in fact \r+\n) line endings. Our expectation is that this only concerns Windows users, and that the behavior change in fact corresponds to a bug fix for them, because the files they are seeing were meant to use \n line endings but were converted on the fly by some tools (for example, Git for Windows on checkout).

Because the semantics of newlines in string literals is now "portable" across os-specific newline conventions, this PR removes warning 29 (which was disabled by default I think?) on newlines in "..." string literals. (There was no warning on newlines in quoted string literals.) The warning definition, warning number, etc. are all left for compatibility reasons, the warning is just not raised anymore.

gasche · 2023-08-29T15:48:47Z

parsing/lexer.mll

@@ -684,7 +703,7 @@ and comment = parse
      }
  | newline
      { update_loc lexbuf None 1 false 0;
-        store_lexeme lexbuf;
+        store_normalized_newline ();
        comment lexbuf


Note: newlines are also normalized in comments, because (some) comments are now stored as documentation attributes in the AST, so the same portability issue as string literals could arise.

Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.

gasche · 2023-08-29T15:56:48Z

Note: the documentation in this PR (added to the "lexical conventions" section of the reference manual) makes explicit the fact that we consider any \r*\n sequence as a newline, not just \n and \r\n -- so for example \r\r\r\n gets normalized into \n. I am not completely sure why we use \r*\n and not \r?\n, and whether it is wise to add it explicitly to the specification. (I expect @damiendoligez to know all about this.)

dbuenzli

Amusingly IIUC this was actually already done for character literals here. Also if I read this correctly OCaml never supported lone '\r' newlines.

Not enough into the lexer to formally approve but it looks good to me.

parsing/lexer.mll

manual/src/refman/lex.etex

dbuenzli · 2023-08-29T16:00:27Z

parsing/lexer.mll

@@ -684,7 +703,7 @@ and comment = parse
      }
  | newline
      { update_loc lexbuf None 1 false 0;
-        store_lexeme lexbuf;
+        store_normalized_newline ();
        comment lexbuf


Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.

Changes

rossberg · 2023-08-29T16:20:51Z

Thanks @gasche. I would perhaps consider also changing the definition of newline to ('\n' | '\r' | "\r\n" ), since the current definition (lumping multiple CRs into a single newline) is highly non-standard. In particular, most other languages/tools normalise singleton CRs to newline as well, and would hence normalise CR CR LF to two newlines.

xavierleroy · 2023-08-29T16:22:40Z

I am not completely sure why we use \r*\n and not \r?\n, and whether it is wise to add it explicitly to the specification.

I believe this was intended to "do the right thing" even for double-text-encoded files under Windows, where \r\r\n end-of-lines can occur if e.g. you run "unix2dos" twice. (An instance of Postel's robustness principle, "be tolerant in what you accept from others".) I don't think it can hurt.

OCaml never supported lone '\r' newlines.

I think OCaml 1.x was running on classic MacOS, where end of line is marked by a single CR character. It (and everything else in classic MacOS) was a pain to handle.

dra27 · 2023-08-29T19:45:22Z

(exactly as @xavierleroy says for the newline regexp - see #6165)

dbuenzli · 2023-08-29T20:12:43Z

Personally I share @rossberg's concern here.

Introducing hacks to accommodate other tools or tool operation failures is rarely a good idea.

I don't know any file format that normalizes \r*\n to a single newline and that's certainly not a normalization you will find in the Unicode standard recommendations.

It's likely safer not to play weirdo here. This makes OCaml's vision of lines incompatible with any tool that processes lines as the Unicode standard recommends.

gasche · 2023-08-29T20:30:08Z

I believe that this PR suffers from the problem mentioned in #12502 by @jberdine and @ceastlund, and discussed in more details in ocaml-ppx/ocamlformat#1748. Jane Street uses quoted string literals to store the expected content of files in except tests, and some of these tests (including on Linux machines) are the output of Windows tools that use CRLF in their output:

Specifically, we have expect tests based on output from windows programs, and if the tests don't include the CRLFs they fail.

An example provided by @bcc32 was as follows, where ^M is used in place of a carriage return:

let%expect_test _ =
  do_test (fun () ->
    let%bind () = system "cat foo.csv" in
    [%expect
      {|
    e,d,f^M
    "x
    x",x,xxx^M
    e,d,f
    "x
    x",x,xxx^M
    yy,y,"y
    yy"
   |}];
    return ())
;;

In this case, the use of \r\n-style line endings in the source code is intentional as it mirrors the expected content of an external file, and the use of quoted string literals is designed to be able to enable automated tooling that modifies the OCaml code to "promote" the test output -- in particular, this promotion step does not want to deal with the OCaml escape rules.

The PR as proposed will silently break that use-case. I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh. Ask expect-test authors to support non-quoted string literals with explicit escape codes, specifically for the case of outputs that must preserve carriage return characters?)

alainfrisch · 2023-08-29T20:31:12Z

where \r\r\n end-of-lines can occur if e.g. you run "unix2dos" twice

FWIW, unix2dos seems to be idempotent. LF is converted to CR LF, but CR LF is left untouched.

An easy way to get CR CR LF is with an OCaml program whose CR-LF encoded source code has multiline string literals printed to a text file :-)

dbuenzli · 2023-08-29T20:51:47Z

Ask expect-test authors to support non-quoted string literals with explicit escape codes,

One could argue that the test case you show would be better off being explicit about that since if I gather correctly it has a mixture \n (strings between double quote) and \r\n (standard CSV terminator).

That's not going to be a silent failure since it will break the test suite :-) but it definitively breaks something.

(Isn't it fascinating how badly designed stuff always ends up being used in creative ways)

ceastlund · 2023-08-29T21:16:13Z

In the proposed new world... is there a way to express a {| ... |} string containing explicitly \r\n? Or does that become unexpressible?

dbuenzli · 2023-08-29T21:30:13Z

Or does that become unexpressible?

It becomes unexpressible. Every newline convention gets normalized to \n.

Of course you can easily map these \n to what you like afterwards including \r\n.

So rather what becomes inexpressible is a string where you need to preserve a mixture of newline conventions like in the test case shown above.

ceastlund · 2023-08-29T21:43:10Z

I think if certain strings become inexpressible in either of our string formats, as in we can no longer write "a\nb\r\nc" as a {| ... |} string, that's problematic. And if we can currently not express that on Windows machines, that's also problematic.

jberdine · 2023-08-29T22:23:58Z

I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh.

If a workaround for this use case is desired, it might be more appealing to anoit a specific cookie to use for string literals that should be exempted from newline normalization. Then such expect test strings could be written something like:

{raw_newlines|
...
|raw_newlines}

I am not saying that I think this is necessarily a good idea since the uniformity across the language argument for normalization takes a hit, but this seems better than a command-line flag for a very niche use case.

dbuenzli · 2023-08-29T22:41:56Z

I think if certain strings become inexpressible in either of our string formats, as in we can no longer write "a\nb\r\nc" as a {| ... |} string, that's problematic.

Maybe you could introduce an additional primitive that escapes. Something like:

  [%expect_escaped
      "\
    e,d,f\r\n\
    \"x\n\
    x\",x,xxx\r\n\
    e,d,f\n\
    \"x\n\
    x\",x,xxx\r\n\
    yy,y,\"y\n\
    yy"
   |}];

which admitely makes that particular example less readable because of the escapes for quotes.

And if we can currently not express that on Windows machines, that's also problematic.

You can currently express it on Windows but it is brittle because it seems the default of "Git for Windows" is to normalize line endings on checkout to CRLF (see this git parameter). So you should override the parameter in the .gitattribute of your repos otherwise people will checkout your test cases with newlines you don't expect on Windows.

(Personally I'm not sure it's a good idea to expect the content of arbitrary file formats in OCaml sources like is done above. In my expectation tests I write results in separate files and diff that instead (vs surgically update sources). You do lose the nice closeness of the operation and the result though.)

dra27 · 2023-08-30T08:41:02Z

Looking back at #6165 again, the issue was of course outside strings, comments, etc. - i.e. the source file let x = 42\r\r\n started resulting in Error: Illegal character (\r) with betas of 4.01. For strings / comments / attributes, etc. covered here it certainly makes sense to have let x = {|foo\r\r\n|} result in x having the value foo\r\n. Can that be done while leaving the EOL action alone, rather than adding \013 to blank?

dbuenzli · 2023-08-30T08:58:19Z

Looking back at #6165 again, the issue was of course outside strings, comments, etc. - i.e. the source file let x = 42\r\r\n started resulting in Error: Illegal character (\r) with betas of 4.01.

But that won't happen if you accept '\r' (again apparently) as a single newline. OCaml will see more newlines but that's blanks.

What are the problems that this poses ?

it certainly makes sense to have let x = {|foo\r\r\n|} result in x having the value foo\r\n

Did you mean foo\n\n ? Because that's what would make sense from a standard line normalization point of view.

gasche · 2023-08-30T13:27:10Z

I don't believe in this PR anymore with the proposed use-case of @ceastlund for the current worklow: it does not feel particularly wrong to me. Closing.

In #12502 I propose an alternative approach for the problem "life is hard for Windows users that may not realize the consequence of \r\n newlines in their string literals", introducing a new warning rather than normalization. A nice with with a new warning is that it can easily be disabled locally in odd cases.

I could possibly be convinced to go back to this normalization approach with command-line support for parametrization. But note that lexing-time logic is annoying to customize/configure, as inside-the-source mechanisms (attributes), which are much more pleasant than build-system-level tweaks, cannot be used.

xavierleroy · 2023-08-30T16:15:22Z

I quite like this PR, to the point that I'm going to reopen it. (Sorry, @gasche!) To me it's a very reasonable and pragmatic solution to "the CRLF Windows problem". Let's keep it alive.

dra27 · 2023-08-30T17:04:54Z

Ah, the joys of my eyesight - I had not seen/registered the return of \r in @rossberg's regexp. I'm somewhat meh about the idea of \r\r\n being normalised to two lines - are there actually editors which respect that? Mine, for example, would either display one or two CR glyphs depending on whether it had decided the entire file was meant to be using LF or CRLF line endings, but would it would only ever display one line? From our ecosystem perspective, we know that files with \r\r\n are accidentally generated but, for example, they report what I'd regard as sensible line numbers. I'm possibly happy to be the "weirdo" on not having \r\r\n ever imply two lines 🤷

ceastlund · 2023-08-30T17:10:52Z

I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh.

If a workaround for this use case is desired, it might be more appealing to anoit a specific cookie to use for string literals that should be exempted from newline normalization. Then such expect test strings could be written something like:
{raw_newlines|
...
|raw_newlines}
I am not saying that I think this is necessarily a good idea since the uniformity across the language argument for normalization takes a hit, but this seems better than a command-line flag for a very niche use case.

The point of having arbitrary "cookies" is to not force a fixed delimiter. If you fix the end delimiter as |raw_newlines}, then you can never express \r|raw_newlines} inside such a string.

gasche · 2023-08-30T17:12:29Z

@xavierleroy I don't mind reopening if there is a (weak) global consensus in the direction of this PR. I can make further changes if necessary. Can I guilt-trip you in making a decision on \r*\n vs. \r?\n ? I have no strong opinion.

ceastlund · 2023-08-30T17:15:11Z

I quite like this PR, to the point that I'm going to reopen it. (Sorry, @gasche!) To me it's a very reasonable and pragmatic solution to "the CRLF Windows problem". Let's keep it alive.

@xavierleroy, ppx_expect expects to be able to generate a string of either type -- " ... " or {| ... |} -- for arbitrary contents. It's not the only kind of code editor/generator where we might want to generate strings, just the first one I've thought of with this property. Are we content to break this property for all strings containing a \r character by making them inexpressible without changing delimiter and escaping formats? I would prefer we not do that. It's very handy to be able to use {| ... |} to express arbitrary strings, without internal escaping, and without having to sanity check the contents for certain characters.

xavierleroy · 2023-08-30T17:22:04Z

ppx_expect expects to be able to generate a string of either type -- " ... " or {| ... |} -- for arbitrary contents

Sure! As a fallback case you can use "\x01\x02\x03" string literals, with a \x hex escape for each byte. It's not pretty but it should do the job. Now, if you know in advance that the string doesn't contain \r or other weird characters, you can use {|...|} or even "...". Is there a way for users to say that it advance? " I know the output is printable text so please use "..." " or "I know the output fits in one line so please use {|...|}" ?

ceastlund · 2023-08-30T17:37:41Z

Quoted, escaped strings can be significantly less readable for some contents, just as unescaped strings can be significantly less readable for others. Also having your whole string constant suddenly change from one format to another just because of a single byte change in its contents, is jarring and noisy for code reviewers/maintainers. It's been very handy to be able to pick one kind of delimiter and preserve it in repeated calls to code generators, including but not limited to ppx_expect.

xavierleroy · 2023-08-30T17:41:48Z

Can I guilt-trip you in making a decision on \r*\n vs. \r?\n ? I have no strong opinion.

No guilt, my pleasure :-)

Looking at https://en.wikipedia.org/wiki/Newline , I see that the platform that use \r (CR) as a line ending are all dead: it's mainly 1970's and 1980's microcomputers that remind me of my teenage years (my beloved Apple ][ is there!), but are not (or no longer in the case of MacOS 9) and will not be supported by OCaml.

On the other hand, as I mentioned earlier, it's quite easy to get lines terminated by \r\r\n under Windows. (Just reading a text file in binary mode "because I know better", then print_string the data read will cause this, as standard output is in text mode by default.) It feels risky to read this back as two newlines. For example, my Emacs under Windows displays this as ^M and a newline. So, line numbers will disagree between OCaml and Emacs if \r\r\n in source files is treated as two newlines by OCaml.

So, I still think \r*\n is the most reasonable choice to recognize end of lines. It makes the lives of our Windows users slightly easier (they appreciate that), and doesn't break anything, since we've been using this regexp for many years.

That other languages make different choices doesn't move me. Perhaps they want to support MacOS 9 or another of the dead systems mentioned in the Wikipedia page. They are welcome.

utils/warnings.mli

Changes

damiendoligez

Looks good modulo a few suggestions.

Changes

parsing/lexer.mll

testsuite/tests/lexing/newlines.ml.gen

utils/warnings.mli

testsuite/tests/lexing/newlines.reference

gasche · 2023-09-20T15:01:17Z

I applied @damiendoligez' suggestions.

gasche · 2023-09-20T19:26:46Z

Merged. This was not an easy discussion (on a frustrating topic), but I think this is a good compromise: nobody likes it but there is a good consensus that it improves over the previous state. Thanks everyone!

hhugo · 2023-09-25T10:15:01Z

shouldn't this be marked as breaking change ? I had to patch jsoo after this PR. jsoo uses the {||} syntax to embed marshaled value in generated source files.

gasche · 2023-09-25T10:17:23Z

Sure. Can you point to the fix that you applied on the jsoo side?

hhugo · 2023-09-25T10:23:55Z

ocsigen/js_of_ocaml@ad2b300#diff-0ac51debf1030734450ff62eb2f237f3aadf07bcb0abc6a7b9437a882793c62cR88

The patch is changing the generated source to use the double quote syntax with a String.escaped content.

  ~name:%S
  ~content:{frag|%s|frag}
-  ~fragments:(Some {frag|%s|frag})
+  ~fragments:(Some %S)

johnwhitington · 2023-10-02T13:38:25Z

shouldn't this be marked as breaking change ? I had to patch jsoo after this PR. jsoo uses the {||} syntax to embed marshaled value in generated source files.

+1. I have code in released projects which builds source files with data embedded in them as quoted strings. Will this data suddenly become corrupted under this change?

Is it at least backward-compatible? i.e if I fix it to work in 5.2, will the data be unchanged in 5.1 and earlier? Or do I have to ship two different versions of my source code?

gasche · 2023-10-02T14:42:35Z

@johnwhitington my recommendation, if you want to ship binary data rather than text, is to use non-quoted string literals with escaped characters.

(It is useful if people point out here that they are negatively affected by the change. Maybe if we were to find out that it breaks too much stuff people would want to revert it.)

hhugo · 2023-10-02T14:52:17Z

I also want to point out a positive consequence of this PR.
As mentioned in janestreet/ppx_expect#40, with this PR, ppx_expect works out-of-the-box on windows. (one no longer need to configure git to disable nl translation)

gasche mentioned this pull request Aug 29, 2023

Newlines in quoted string literals mishandled on Windows/Cygwin #12502

Closed

gasche commented Aug 29, 2023

View reviewed changes

gasche force-pushed the crlf-normalization branch from 99e7354 to fb9294a Compare August 29, 2023 15:49

dbuenzli reviewed Aug 29, 2023

View reviewed changes

gasche closed this Aug 30, 2023

xavierleroy reopened this Aug 30, 2023

gasche force-pushed the crlf-normalization branch from 64fda9a to 8c66bdd Compare September 20, 2023 07:37

rossberg reviewed Sep 20, 2023

View reviewed changes

utils/warnings.mli Outdated Show resolved Hide resolved

Changes Outdated Show resolved Hide resolved

gasche force-pushed the crlf-normalization branch from 8c66bdd to 4cc315c Compare September 20, 2023 09:55

damiendoligez approved these changes Sep 20, 2023

View reviewed changes

gasche force-pushed the crlf-normalization branch 2 times, most recently from febdf21 to bffe5dc Compare September 20, 2023 15:00

gasche added the merge-me label Sep 20, 2023

gasche added 4 commits September 20, 2023 18:20

ocaml#12502: normalize \r*\n newlines to \n during lexing

1cbb92f

mention newline normalization in the manual

c78d3aa

newline normalization: only erase the last \r

9d65ae5

a testsuite for newline normalization

5a69ac5

gasche force-pushed the crlf-normalization branch from bffe5dc to 5a69ac5 Compare September 20, 2023 16:20

gasche merged commit 8e30385 into ocaml:trunk Sep 20, 2023
9 checks passed

hhugo mentioned this pull request Sep 25, 2023

Support OCaml trunk ocsigen/js_of_ocaml#1512

Closed

This was referenced Sep 27, 2023

Support for OCaml trunk ocsigen/js_of_ocaml#1487

Draft

[windows] Assertion failure janestreet/ppx_expect#40

Open

dra27 mentioned this pull request Jan 15, 2024

Explicitly compile the primitives object before linking bytecode executables #12896

Merged

This was referenced Feb 8, 2024

Tests failure with OCaml 5.2 dinosaure/unstrctrd#16

Closed

Fix the tests with OCaml 5.2 mirage/pecu#14

Merged

314eter mentioned this pull request Mar 17, 2024

Allow windows newlines in strings and characters tree-sitter/tree-sitter-ocaml#88

Merged

nojb mentioned this pull request Apr 13, 2024

Do not normalize newlines inside quoted strings ocaml-ppx/ocamlformat#1754

Merged

MisterDA mentioned this pull request May 13, 2024

Use quoted strings for chunks of code (multiline strings) #13162

Merged

normalize \r*\n newlines to \n during lexing #12514

normalize \r*\n newlines to \n during lexing #12514

Conversation

gasche commented Aug 29, 2023

gasche Aug 29, 2023

Choose a reason for hiding this comment

dbuenzli Aug 29, 2023

Choose a reason for hiding this comment

gasche commented Aug 29, 2023

dbuenzli left a comment

Choose a reason for hiding this comment

dbuenzli Aug 29, 2023

Choose a reason for hiding this comment

rossberg commented Aug 29, 2023

xavierleroy commented Aug 29, 2023

dra27 commented Aug 29, 2023

dbuenzli commented Aug 29, 2023 • edited

gasche commented Aug 29, 2023

alainfrisch commented Aug 29, 2023 • edited

dbuenzli commented Aug 29, 2023 • edited

ceastlund commented Aug 29, 2023

dbuenzli commented Aug 29, 2023

ceastlund commented Aug 29, 2023

jberdine commented Aug 29, 2023

dbuenzli commented Aug 29, 2023 • edited

dra27 commented Aug 30, 2023

dbuenzli commented Aug 30, 2023

gasche commented Aug 30, 2023

xavierleroy commented Aug 30, 2023

dra27 commented Aug 30, 2023

ceastlund commented Aug 30, 2023

gasche commented Aug 30, 2023

ceastlund commented Aug 30, 2023

xavierleroy commented Aug 30, 2023 • edited

ceastlund commented Aug 30, 2023

xavierleroy commented Aug 30, 2023

damiendoligez left a comment

Choose a reason for hiding this comment

gasche commented Sep 20, 2023

gasche commented Sep 20, 2023

hhugo commented Sep 25, 2023

gasche commented Sep 25, 2023

hhugo commented Sep 25, 2023

johnwhitington commented Oct 2, 2023 • edited

gasche commented Oct 2, 2023

hhugo commented Oct 2, 2023

dbuenzli commented Aug 29, 2023 •

edited

alainfrisch commented Aug 29, 2023 •

edited

dbuenzli commented Aug 29, 2023 •

edited

dbuenzli commented Aug 29, 2023 •

edited

xavierleroy commented Aug 30, 2023 •

edited

johnwhitington commented Oct 2, 2023 •

edited