Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize \r*\n newlines to \n during lexing #12514

Merged
merged 4 commits into from
Sep 20, 2023

Conversation

gasche
Copy link
Member

@gasche gasche commented Aug 29, 2023

If I understand correctly, this PR implements the consensus on "an ideal behavior of the compiler would be ..." that emerged from #12502.

This PR changes the language semantics of multiline string literals for all OCaml files that use \r\n (or in fact \r+\n) line endings. Our expectation is that this only concerns Windows users, and that the behavior change in fact corresponds to a bug fix for them, because the files they are seeing were meant to use \n line endings but were converted on the fly by some tools (for example, Git for Windows on checkout).

Because the semantics of newlines in string literals is now "portable" across os-specific newline conventions, this PR removes warning 29 (which was disabled by default I think?) on newlines in "..." string literals. (There was no warning on newlines in quoted string literals.) The warning definition, warning number, etc. are all left for compatibility reasons, the warning is just not raised anymore.

@@ -684,7 +703,7 @@ and comment = parse
}
| newline
{ update_loc lexbuf None 1 false 0;
store_lexeme lexbuf;
store_normalized_newline ();
comment lexbuf
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: newlines are also normalized in comments, because (some) comments are now stored as documentation attributes in the AST, so the same portability issue as string literals could arise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.

@gasche
Copy link
Member Author

gasche commented Aug 29, 2023

Note: the documentation in this PR (added to the "lexical conventions" section of the reference manual) makes explicit the fact that we consider any \r*\n sequence as a newline, not just \n and \r\n -- so for example \r\r\r\n gets normalized into \n. I am not completely sure why we use \r*\n and not \r?\n, and whether it is wise to add it explicitly to the specification. (I expect @damiendoligez to know all about this.)

Copy link
Contributor

@dbuenzli dbuenzli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amusingly IIUC this was actually already done for character literals here. Also if I read this correctly OCaml never supported lone '\r' newlines.

Not enough into the lexer to formally approve but it looks good to me.

parsing/lexer.mll Outdated Show resolved Hide resolved
parsing/lexer.mll Outdated Show resolved Hide resolved
manual/src/refman/lex.etex Outdated Show resolved Hide resolved
manual/src/refman/lex.etex Outdated Show resolved Hide resolved
@@ -684,7 +703,7 @@ and comment = parse
}
| newline
{ update_loc lexbuf None 1 false 0;
store_lexeme lexbuf;
store_normalized_newline ();
comment lexbuf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting point. That of course affects ocamldoc's verbatim and raw backends sections, but is again likely more an improvement than a problem.

Changes Outdated Show resolved Hide resolved
@rossberg
Copy link

Thanks @gasche. I would perhaps consider also changing the definition of newline to ('\n' | '\r' | "\r\n" ), since the current definition (lumping multiple CRs into a single newline) is highly non-standard. In particular, most other languages/tools normalise singleton CRs to newline as well, and would hence normalise CR CR LF to two newlines.

@xavierleroy
Copy link
Contributor

I am not completely sure why we use \r*\n and not \r?\n, and whether it is wise to add it explicitly to the specification.

I believe this was intended to "do the right thing" even for double-text-encoded files under Windows, where \r\r\n end-of-lines can occur if e.g. you run "unix2dos" twice. (An instance of Postel's robustness principle, "be tolerant in what you accept from others".) I don't think it can hurt.

OCaml never supported lone '\r' newlines.

I think OCaml 1.x was running on classic MacOS, where end of line is marked by a single CR character. It (and everything else in classic MacOS) was a pain to handle.

@dra27
Copy link
Member

dra27 commented Aug 29, 2023

(exactly as @xavierleroy says for the newline regexp - see #6165)

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 29, 2023

Personally I share @rossberg's concern here.

Introducing hacks to accommodate other tools or tool operation failures is rarely a good idea.

I don't know any file format that normalizes \r*\n to a single newline and that's certainly not a normalization you will find in the Unicode standard recommendations.

It's likely safer not to play weirdo here. This makes OCaml's vision of lines incompatible with any tool that processes lines as the Unicode standard recommends.

@gasche
Copy link
Member Author

gasche commented Aug 29, 2023

I believe that this PR suffers from the problem mentioned in #12502 by @jberdine and @ceastlund, and discussed in more details in ocaml-ppx/ocamlformat#1748. Jane Street uses quoted string literals to store the expected content of files in except tests, and some of these tests (including on Linux machines) are the output of Windows tools that use CRLF in their output:

Specifically, we have expect tests based on output from windows programs, and if the tests don't include the CRLFs they fail.

An example provided by @bcc32 was as follows, where ^M is used in place of a carriage return:

let%expect_test _ =
  do_test (fun () ->
    let%bind () = system "cat foo.csv" in
    [%expect
      {|
    e,d,f^M
    "x
    x",x,xxx^M
    e,d,f
    "x
    x",x,xxx^M
    yy,y,"y
    yy"
   |}];
    return ())
;;

In this case, the use of \r\n-style line endings in the source code is intentional as it mirrors the expected content of an external file, and the use of quoted string literals is designed to be able to enable automated tooling that modifies the OCaml code to "promote" the test output -- in particular, this promotion step does not want to deal with the OCaml escape rules.

The PR as proposed will silently break that use-case. I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh. Ask expect-test authors to support non-quoted string literals with explicit escape codes, specifically for the case of outputs that must preserve carriage return characters?)

@alainfrisch
Copy link
Contributor

alainfrisch commented Aug 29, 2023

where \r\r\n end-of-lines can occur if e.g. you run "unix2dos" twice

FWIW, unix2dos seems to be idempotent. LF is converted to CR LF, but CR LF is left untouched.

An easy way to get CR CR LF is with an OCaml program whose CR-LF encoded source code has multiline string literals printed to a text file :-)

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 29, 2023

Ask expect-test authors to support non-quoted string literals with explicit escape codes,

One could argue that the test case you show would be better off being explicit about that since if I gather correctly it has a mixture \n (strings between double quote) and \r\n (standard CSV terminator).

That's not going to be a silent failure since it will break the test suite :-) but it definitively breaks something.

(Isn't it fascinating how badly designed stuff always ends up being used in creative ways)

@ceastlund
Copy link

In the proposed new world... is there a way to express a {| ... |} string containing explicitly \r\n? Or does that become unexpressible?

@dbuenzli
Copy link
Contributor

Or does that become unexpressible?

It becomes unexpressible. Every newline convention gets normalized to \n.

Of course you can easily map these \n to what you like afterwards including \r\n.

So rather what becomes inexpressible is a string where you need to preserve a mixture of newline conventions like in the test case shown above.

@ceastlund
Copy link

I think if certain strings become inexpressible in either of our string formats, as in we can no longer write "a\nb\r\nc" as a {| ... |} string, that's problematic. And if we can currently not express that on Windows machines, that's also problematic.

@jberdine
Copy link
Contributor

I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh.

If a workaround for this use case is desired, it might be more appealing to anoit a specific cookie to use for string literals that should be exempted from newline normalization. Then such expect test strings could be written something like:

{raw_newlines|
...
|raw_newlines}

I am not saying that I think this is necessarily a good idea since the uniformity across the language argument for normalization takes a hit, but this seems better than a command-line flag for a very niche use case.

@dbuenzli
Copy link
Contributor

dbuenzli commented Aug 29, 2023

I think if certain strings become inexpressible in either of our string formats, as in we can no longer write "a\nb\r\nc" as a {| ... |} string, that's problematic.

Maybe you could introduce an additional primitive that escapes. Something like:

  [%expect_escaped
      "\
    e,d,f\r\n\
    \"x\n\
    x\",x,xxx\r\n\
    e,d,f\n\
    \"x\n\
    x\",x,xxx\r\n\
    yy,y,\"y\n\
    yy"
   |}];

which admitely makes that particular example less readable because of the escapes for quotes.

And if we can currently not express that on Windows machines, that's also problematic.

You can currently express it on Windows but it is brittle because it seems the default of "Git for Windows" is to normalize line endings on checkout to CRLF (see this git parameter). So you should override the parameter in the .gitattribute of your repos otherwise people will checkout your test cases with newlines you don't expect on Windows.

(Personally I'm not sure it's a good idea to expect the content of arbitrary file formats in OCaml sources like is done above. In my expectation tests I write results in separate files and diff that instead (vs surgically update sources). You do lose the nice closeness of the operation and the result though.)

@dra27
Copy link
Member

dra27 commented Aug 30, 2023

Looking back at #6165 again, the issue was of course outside strings, comments, etc. - i.e. the source file let x = 42\r\r\n started resulting in Error: Illegal character (\r) with betas of 4.01. For strings / comments / attributes, etc. covered here it certainly makes sense to have let x = {|foo\r\r\n|} result in x having the value foo\r\n. Can that be done while leaving the EOL action alone, rather than adding \013 to blank?

@dbuenzli
Copy link
Contributor

Looking back at #6165 again, the issue was of course outside strings, comments, etc. - i.e. the source file let x = 42\r\r\n started resulting in Error: Illegal character (\r) with betas of 4.01.

But that won't happen if you accept '\r' (again apparently) as a single newline. OCaml will see more newlines but that's blanks.

What are the problems that this poses ?

it certainly makes sense to have let x = {|foo\r\r\n|} result in x having the value foo\r\n

Did you mean foo\n\n ? Because that's what would make sense from a standard line normalization point of view.

@gasche
Copy link
Member Author

gasche commented Aug 30, 2023

I don't believe in this PR anymore with the proposed use-case of @ceastlund for the current worklow: it does not feel particularly wrong to me. Closing.

In #12502 I propose an alternative approach for the problem "life is hard for Windows users that may not realize the consequence of \r\n newlines in their string literals", introducing a new warning rather than normalization. A nice with with a new warning is that it can easily be disabled locally in odd cases.

I could possibly be convinced to go back to this normalization approach with command-line support for parametrization. But note that lexing-time logic is annoying to customize/configure, as inside-the-source mechanisms (attributes), which are much more pleasant than build-system-level tweaks, cannot be used.

@gasche gasche closed this Aug 30, 2023
@xavierleroy
Copy link
Contributor

I quite like this PR, to the point that I'm going to reopen it. (Sorry, @gasche!) To me it's a very reasonable and pragmatic solution to "the CRLF Windows problem". Let's keep it alive.

@xavierleroy xavierleroy reopened this Aug 30, 2023
@dra27
Copy link
Member

dra27 commented Aug 30, 2023

Ah, the joys of my eyesight - I had not seen/registered the return of \r in @rossberg's regexp. I'm somewhat meh about the idea of \r\r\n being normalised to two lines - are there actually editors which respect that? Mine, for example, would either display one or two CR glyphs depending on whether it had decided the entire file was meant to be using LF or CRLF line endings, but would it would only ever display one line? From our ecosystem perspective, we know that files with \r\r\n are accidentally generated but, for example, they report what I'd regard as sensible line numbers. I'm possibly happy to be the "weirdo" on not having \r\r\n ever imply two lines 🤷

@ceastlund
Copy link

I am not sure whether this is a show-stopper or whether a workaround should be proposed (Make it configurable with a command-line flag? ergh.

If a workaround for this use case is desired, it might be more appealing to anoit a specific cookie to use for string literals that should be exempted from newline normalization. Then such expect test strings could be written something like:

{raw_newlines|
...
|raw_newlines}

I am not saying that I think this is necessarily a good idea since the uniformity across the language argument for normalization takes a hit, but this seems better than a command-line flag for a very niche use case.

The point of having arbitrary "cookies" is to not force a fixed delimiter. If you fix the end delimiter as |raw_newlines}, then you can never express \r|raw_newlines} inside such a string.

@gasche
Copy link
Member Author

gasche commented Aug 30, 2023

@xavierleroy I don't mind reopening if there is a (weak) global consensus in the direction of this PR. I can make further changes if necessary. Can I guilt-trip you in making a decision on \r*\n vs. \r?\n ? I have no strong opinion.

@ceastlund
Copy link

I quite like this PR, to the point that I'm going to reopen it. (Sorry, @gasche!) To me it's a very reasonable and pragmatic solution to "the CRLF Windows problem". Let's keep it alive.

@xavierleroy, ppx_expect expects to be able to generate a string of either type -- " ... " or {| ... |} -- for arbitrary contents. It's not the only kind of code editor/generator where we might want to generate strings, just the first one I've thought of with this property. Are we content to break this property for all strings containing a \r character by making them inexpressible without changing delimiter and escaping formats? I would prefer we not do that. It's very handy to be able to use {| ... |} to express arbitrary strings, without internal escaping, and without having to sanity check the contents for certain characters.

@xavierleroy
Copy link
Contributor

xavierleroy commented Aug 30, 2023

ppx_expect expects to be able to generate a string of either type -- " ... " or {| ... |} -- for arbitrary contents

Sure! As a fallback case you can use "\x01\x02\x03" string literals, with a \x hex escape for each byte. It's not pretty but it should do the job. Now, if you know in advance that the string doesn't contain \r or other weird characters, you can use {|...|} or even "...". Is there a way for users to say that it advance? " I know the output is printable text so please use "..." " or "I know the output fits in one line so please use {|...|}" ?

@ceastlund
Copy link

Quoted, escaped strings can be significantly less readable for some contents, just as unescaped strings can be significantly less readable for others. Also having your whole string constant suddenly change from one format to another just because of a single byte change in its contents, is jarring and noisy for code reviewers/maintainers. It's been very handy to be able to pick one kind of delimiter and preserve it in repeated calls to code generators, including but not limited to ppx_expect.

@xavierleroy
Copy link
Contributor

Can I guilt-trip you in making a decision on \r*\n vs. \r?\n ? I have no strong opinion.

No guilt, my pleasure :-)

Looking at https://en.wikipedia.org/wiki/Newline , I see that the platform that use \r (CR) as a line ending are all dead: it's mainly 1970's and 1980's microcomputers that remind me of my teenage years (my beloved Apple ][ is there!), but are not (or no longer in the case of MacOS 9) and will not be supported by OCaml.

On the other hand, as I mentioned earlier, it's quite easy to get lines terminated by \r\r\n under Windows. (Just reading a text file in binary mode "because I know better", then print_string the data read will cause this, as standard output is in text mode by default.) It feels risky to read this back as two newlines. For example, my Emacs under Windows displays this as ^M and a newline. So, line numbers will disagree between OCaml and Emacs if \r\r\n in source files is treated as two newlines by OCaml.

So, I still think \r*\n is the most reasonable choice to recognize end of lines. It makes the lives of our Windows users slightly easier (they appreciate that), and doesn't break anything, since we've been using this regexp for many years.

That other languages make different choices doesn't move me. Perhaps they want to support MacOS 9 or another of the dead systems mentioned in the Wikipedia page. They are welcome.

utils/warnings.mli Outdated Show resolved Hide resolved
Changes Outdated Show resolved Hide resolved
Copy link
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good modulo a few suggestions.

Changes Outdated Show resolved Hide resolved
parsing/lexer.mll Outdated Show resolved Hide resolved
parsing/lexer.mll Outdated Show resolved Hide resolved
testsuite/tests/lexing/newlines.ml.gen Outdated Show resolved Hide resolved
utils/warnings.mli Outdated Show resolved Hide resolved
testsuite/tests/lexing/newlines.reference Outdated Show resolved Hide resolved
@gasche gasche force-pushed the crlf-normalization branch 2 times, most recently from febdf21 to bffe5dc Compare September 20, 2023 15:00
@gasche
Copy link
Member Author

gasche commented Sep 20, 2023

I applied @damiendoligez' suggestions.

@gasche gasche merged commit 8e30385 into ocaml:trunk Sep 20, 2023
9 checks passed
@gasche
Copy link
Member Author

gasche commented Sep 20, 2023

Merged. This was not an easy discussion (on a frustrating topic), but I think this is a good compromise: nobody likes it but there is a good consensus that it improves over the previous state. Thanks everyone!

@hhugo
Copy link
Contributor

hhugo commented Sep 25, 2023

shouldn't this be marked as breaking change ? I had to patch jsoo after this PR. jsoo uses the {||} syntax to embed marshaled value in generated source files.

@gasche
Copy link
Member Author

gasche commented Sep 25, 2023

Sure. Can you point to the fix that you applied on the jsoo side?

@hhugo
Copy link
Contributor

hhugo commented Sep 25, 2023

ocsigen/js_of_ocaml@ad2b300#diff-0ac51debf1030734450ff62eb2f237f3aadf07bcb0abc6a7b9437a882793c62cR88

The patch is changing the generated source to use the double quote syntax with a String.escaped content.

  ~name:%S
  ~content:{frag|%s|frag}
-  ~fragments:(Some {frag|%s|frag})
+  ~fragments:(Some %S)

@johnwhitington
Copy link
Contributor

johnwhitington commented Oct 2, 2023

shouldn't this be marked as breaking change ? I had to patch jsoo after this PR. jsoo uses the {||} syntax to embed marshaled value in generated source files.

+1. I have code in released projects which builds source files with data embedded in them as quoted strings. Will this data suddenly become corrupted under this change?

Is it at least backward-compatible? i.e if I fix it to work in 5.2, will the data be unchanged in 5.1 and earlier? Or do I have to ship two different versions of my source code?

@gasche
Copy link
Member Author

gasche commented Oct 2, 2023

@johnwhitington my recommendation, if you want to ship binary data rather than text, is to use non-quoted string literals with escaped characters.

(It is useful if people point out here that they are negatively affected by the change. Maybe if we were to find out that it breaks too much stuff people would want to revert it.)

@hhugo
Copy link
Contributor

hhugo commented Oct 2, 2023

I also want to point out a positive consequence of this PR.
As mentioned in janestreet/ppx_expect#40, with this PR, ppx_expect works out-of-the-box on windows. (one no longer need to configure git to disable nl translation)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet