Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synopsis & description characters are escaped #9728

Closed
toastal opened this issue Jan 14, 2024 · 10 comments · Fixed by ocaml/opam-repository#25615
Closed

Synopsis & description characters are escaped #9728

toastal opened this issue Jan 14, 2024 · 10 comments · Fixed by ocaml/opam-repository#25615
Assignees
Labels
bug requires-team-discussion This topic requires a team discussion

Comments

@toastal
Copy link

toastal commented Jan 14, 2024

Expected Behavior

Dune format does nothing:

(package
 (name cafe_menu_mgr)
 (synopsis "Menu manager for a café named, กาแฟ")
 (description "Add & remove items from กาแฟ’s menu, manage prices, etc."))

Actual Behavior

Dune formats does:

(package
 (name cafe_menu_mgr)
 (synopsis
  "Menu manager for a caf\195\169 named, \224\184\129\224\184\178\224\185\129\224\184\159")
 (description
  "Add & remove items from \224\184\129\224\184\178\224\185\129\224\184\159\226\128\153s menu, manage prices, etc."))

Reproduction

  • PR with a reproducing test:
  1. Init a project
  2. Add non-ASCII characters to synopsis / description
  3. Dune Format

Specifications

  • Version of dune (output of dune --version): 3.12.1
  • Version of ocaml (output of ocamlc --version): 4.14.1
  • Operating system (distribution and version): Fedora Linux 39

Additional information

Discovered here: https://discuss.ocaml.org/t/why-can-t-i-create-a-project-with-non-ascii-characters/13865/12

@yawaramin says this shouldn’t happen for certain fields so they can still be human-readable.

@yawaramin
Copy link
Contributor

I'd certainly expect dune to not touch any values inside string fields. opam doesn't.

@emillon
Copy link
Collaborator

emillon commented Jan 15, 2024

I had a look at this a while ago, so let me comment on this (actually, even some of the Dune authors have non-ascii characters in their names!).

In short, this is a CST vs AST issue: assuming a UTF-8 encoded dune file, Dune will parse "é" and "\195\169" the same. It will pass the same string down the program, and in particular it should produce the same opam file (probably one with escapes).

To support that properly in dune fmt, the formatter needs to be able to tell the difference between these concrete syntaxes. For example, its String constructor would contain the unescaped contents of strings.

We already have 2 parsers (for AST and CST), but instead of doing AST -> CST (which would be a lossy operation), we instead materialize CSTs from ASTs (where we "invent" which concrete syntax was used) for performance reasons.

The situation is described here:

https://github.com/ocaml/dune/blob/3.12.2/src/dune_sexp/parser.ml#L5-L15

It describes several options: do the conversion the other way around (probably too slow), write another parser, GADT tricks.

@yawaramin
Copy link
Contributor

Can we make dune not parse anything inside double-quotes ie treat it as opaque binary data?

@emillon
Copy link
Collaborator

emillon commented Jan 15, 2024

Just to clarify one thing: Dune is not mangling user data here. If you put "é" or "\195\169" in (synopsis) in dune-project, it will generate é in the synopsis: field of the generated opam file. In turn, opam should be able to display that, ocaml.org should be able to render it, etc.

The thing you're seeing is an issue with dune format-dune-file. We can't treat the inside of quoted strings as binary data because we have to find the closing quote, look for escaped quotes, etc. We could have a version that does not interpret the escapes. That's the CST thing I described in my previous message together with some solutions.

To make an analogy with ocaml code, print_endline "a b" or print_endline "a\x20b" are different programs but will do the same thing at runtime. The ocaml compiler does not care about the difference between these programs, ocamlformat does. That's why it has a dedicated parser.

@rgrinberg rgrinberg added the bug label Jan 15, 2024
@emillon
Copy link
Collaborator

emillon commented Jan 29, 2024

Here's a high-level overview of how I think it could be solved:

First, we need tests for this. For example, add the following to format-dune-file.t (or any of the strings in this issue)

  $ dune format-dune-file << EOF
  > ("Étienne")
  > EOF

Template payloads should be covered too, %{bin:é}, but probably not atoms.

Then, to have a working roundtrip through format-dune-file, we need 2 things:

  1. a CST parser that actually parses concrete syntax
  2. a CST formatter that actually formats concrete syntax

The first point is false at the moment: in dune_sexp/lexer.mll, when \t is found (for example), only the single character \t is added to the buffer. To make Encoded.t a CST, we need to store \ and t (two characters). Then, this requires running some conversion in Encoded.to_ast, which is currently the identity function.

(for performance, it is probably critical that to_ast remains the identity function. in a second implementation step, this can be done by passing a boolean argument to the lexer to encode if we're requiring escaped data or not)

This conversion (string -> string) can be implemented as a standalone function (it will require some tests) in Escape. For example, it would map "hello" to "hello" and {|x\ny|} (4 bytes) to "x\ny" (3 bytes). Some error handling needs to be done in this function, for example detecting short escapes like \x2. Actually, if we're doing it that way, it means that while running in CST mode, we'll now accept more programs. We'll need to determine if that's a problem or not (this means that ("\x2") could be formatted.

Some interesting tests we can write at that stage is something to ensure that (= "É" "\195\137") continues to hold. We also need to test the error paths in the lexer, such as "unterminated hexadecimal escape sequence" and other cases in the lexer.

So, at that stage we have step 1 implemented: a CST parser that hold unescaped strings in Template and Quoted_string, as well as a converter to an AST with these details removed. This is not completely enough because the formatter assumes that the contents of strings in Dune_sexp.T are quoted. In particular, Dune_sexp.T.pp calls Escape.quoted.
One thing that's a bit difficult with this new code path is that it's not clear if Dune_sexp.T.t values are meant to be quoted or not. The only code path where it's going to be unescaped is this in Dune_lang.Format:

let pp_simple t = Cst.abstract t |> Option.value_exn |> Ast.remove_locs |> Dune_sexp.pp

And it doesn't feel right to go back to Ast as we're trying to print a CST. So maybe this path could be replaced by a new pp function that operates on Cst directly, like the rest of the module does.

@moyodiallo moyodiallo self-assigned this Jan 29, 2024
@emillon
Copy link
Collaborator

emillon commented Jan 30, 2024

There's also an "easy" alternative to that, which is to only change Escape.quoted to recognize utf8 and emit utf8 codepoints as is (without backslash escapes). This is attractive but it doesn't preserve the input file, and also it mandates an encoding on the output file, so that's probably not a good solution.

What do you think @rgrinberg ?

@nojb
Copy link
Collaborator

nojb commented Jan 30, 2024

it mandates an encoding on the output file, so that's probably not a good solution.

Personally, I don't see a problem with this; what would be the downside with enforcing UTF-8 encoding of Dune files?

@emillon
Copy link
Collaborator

emillon commented Jan 30, 2024

I was thinking of Windows and especially #9396. But reading that conversation, it seems that we're ready to commit to just utf8, which is a great step forward.
Leaving the BOM issue (which is about relaxing the compatibility constraints we've made in the past), I think this means that we can just tweak the formatter to allow it to emit utf8 encoded strings when the contents are utf8-encoded, which is a lot simpler than the CST solution proposed. This probably requires some team discussion though so I'll add that to the next meeting.

@emillon emillon added the requires-team-discussion This topic requires a team discussion label Jan 30, 2024
moyodiallo added a commit to moyodiallo/dune that referenced this issue Jan 31, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
@rgrinberg
Copy link
Member

Sticking to utf8 for encoding seems fine to me. We can always retroactively add an explicit toggle if it doesn't work for someone.

emillon pushed a commit to moyodiallo/dune that referenced this issue Feb 5, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
emillon added a commit that referenced this issue Feb 5, 2024
* test: for non-ASCII characters (#9728)

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>

* Update test/blackbox-tests/test-cases/formatting/non-ascii-characters.t

Co-authored-by: Etienne Millon <etienne.millon@gmail.com>
Signed-off-by: Alpha Issiaga DIALLO <alpha@tarides.com>

* test: evaluation through comparison of non-ascii

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>

* fix output

Signed-off-by: Etienne Millon <me@emillon.org>

---------

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
Signed-off-by: Alpha Issiaga DIALLO <alpha@tarides.com>
Signed-off-by: Etienne Millon <me@emillon.org>
Co-authored-by: Etienne Millon <etienne.millon@gmail.com>
Co-authored-by: Etienne Millon <me@emillon.org>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Feb 22, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Feb 22, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Feb 24, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Feb 24, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 14, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 14, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 14, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 15, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 15, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
moyodiallo added a commit to moyodiallo/dune that referenced this issue Mar 19, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
emillon pushed a commit to moyodiallo/dune that referenced this issue Mar 19, 2024
Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
Signed-off-by: Etienne Millon <me@emillon.org>
emillon added a commit that referenced this issue Mar 20, 2024
* fix: handle utf8 characters in the dune files(#9728)

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
Signed-off-by: Etienne Millon <me@emillon.org>

* Add the changes file and revert the modification of the lexer

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
Signed-off-by: Etienne Millon <me@emillon.org>

* Use uutf

Also leaving a note about using the stdlib later.

Signed-off-by: Etienne Millon <me@emillon.org>

* changelog

Signed-off-by: Etienne Millon <me@emillon.org>

---------

Signed-off-by: Alpha DIALLO <moyodiallo@gmail.com>
Signed-off-by: Etienne Millon <me@emillon.org>
Co-authored-by: Etienne Millon <me@emillon.org>
Leonidas-from-XIV added a commit to Leonidas-from-XIV/opam-repository that referenced this issue Mar 26, 2024
CHANGES:

### Added

- Add link flags to to `ocamlmklib` for ctypes stubs (ocaml/dune#8784, @frejsoya)

- Remove some unnecessary limitations in the expansions of percent forms in
  install stanza. For example, the `%{env:..}` form can be used to select files
  to be installed. (ocaml/dune#10160, @rgrinberg)

- Allow artifact expansion percent forms (`%{cma:..}`, `%{cmo:..}`, etc.) in
  more contexts. Previously, they would be randomly forbidden in some fields.
  (ocaml/dune#10169, @rgrinberg)

- Allow `%{inline_tests}` in more contexts (ocaml/dune#10191, @rgrinberg)

- Remove limitations on percent forms in the `(enabled_if ..)` field of
  libraries (ocaml/dune#10250, @rgrinberg)

- Support dialects in `dune describe pp` (ocaml/dune#10283, @emillon)

- Allow defining executables or melange emit stanzas with the same name in the
  same folder under different contexts. (ocaml/dune#10220, @rgrinberg, @jchavarri)

### Fixed

- coq: Delay Coq rule setup checks so OCaml-only packages can build in hybrid
  Coq/OCaml projects when `coqc` is not present. Thanks to @vzaliva for the
  test case and report (ocaml/dune#9845, fixes ocaml/dune#9818, @rgrinberg, @ejgallego)

- Fix conditional source selection with `select` on `bigarray` in OCaml 5
  (ocaml/dune#10011, @moyodiallo)

- melange: fix inconsistency in virtual library implementation. Concrete
  modules within a virtual library can now refer to its virtual modules too
  (ocaml/dune#10051, fixes ocaml/dune#7104, @anmonteiro)

- melange: fix a bug that would cause stale `import` paths to be emitted when
  moving source files within `(include_subdirs ..)` (ocaml/dune#10286, fixes ocaml/dune#9190,
  @anmonteiro)

- Dune file formatting: output utf8 if input is correctly encoded (ocaml/dune#10113,
  fixes ocaml/dune#9728, @moyodiallo)

- Fix expanding dependencies and locks specified in the cram stanza.
  Previously, they would be installed in the context of the cram test, rather
  than the cram stanza itself (ocaml/dune#10165, @rgrinberg)

- Fix bug with `dune exec --watch` where the working directory would always be
  set to the project root rather than the directory where the command was run
  (ocaml/dune#10262, @gridbugs)

- Regression fix: sign executables that are promoted into the source tree
  (ocaml/dune#10263, fixes ocaml/dune#9272, @emillon)

- Fix crash when decoding dune-package for libraries with `(include_subdirs
  qualified)` (ocaml/dune#10269, fixes ocaml/dune#10264, @emillon)

### Changed

- Remove the `--react-to-insignificant-changes` option. (ocaml/dune#10083, @rgrinberg)
@anmonteiro
Copy link
Collaborator

I believe #10113 fixed this.

Leonidas-from-XIV added a commit to Leonidas-from-XIV/opam-repository that referenced this issue Apr 3, 2024
CHANGES:

### Added

- Add link flags to to `ocamlmklib` for ctypes stubs (ocaml/dune#8784, @frejsoya)

- Remove some unnecessary limitations in the expansions of percent forms in
  install stanza. For example, the `%{env:..}` form can be used to select files
  to be installed. (ocaml/dune#10160, @rgrinberg)

- Allow artifact expansion percent forms (`%{cma:..}`, `%{cmo:..}`, etc.) in
  more contexts. Previously, they would be randomly forbidden in some fields.
  (ocaml/dune#10169, @rgrinberg)

- Allow `%{inline_tests}` in more contexts (ocaml/dune#10191, @rgrinberg)

- Remove limitations on percent forms in the `(enabled_if ..)` field of
  libraries (ocaml/dune#10250, @rgrinberg)

- Support dialects in `dune describe pp` (ocaml/dune#10283, @emillon)

- Allow defining executables or melange emit stanzas with the same name in the
  same folder under different contexts. (ocaml/dune#10220, @rgrinberg, @jchavarri)

### Fixed

- coq: Delay Coq rule setup checks so OCaml-only packages can build in hybrid
  Coq/OCaml projects when `coqc` is not present. Thanks to @vzaliva for the
  test case and report (ocaml/dune#9845, fixes ocaml/dune#9818, @rgrinberg, @ejgallego)

- Fix conditional source selection with `select` on `bigarray` in OCaml 5
  (ocaml/dune#10011, @moyodiallo)

- melange: fix inconsistency in virtual library implementation. Concrete
  modules within a virtual library can now refer to its virtual modules too
  (ocaml/dune#10051, fixes ocaml/dune#7104, @anmonteiro)

- melange: fix a bug that would cause stale `import` paths to be emitted when
  moving source files within `(include_subdirs ..)` (ocaml/dune#10286, fixes ocaml/dune#9190,
  @anmonteiro)

- Dune file formatting: output utf8 if input is correctly encoded (ocaml/dune#10113,
  fixes ocaml/dune#9728, @moyodiallo)

- Fix expanding dependencies and locks specified in the cram stanza.
  Previously, they would be installed in the context of the cram test, rather
  than the cram stanza itself (ocaml/dune#10165, @rgrinberg)

- Fix bug with `dune exec --watch` where the working directory would always be
  set to the project root rather than the directory where the command was run
  (ocaml/dune#10262, @gridbugs)

- Regression fix: sign executables that are promoted into the source tree
  (ocaml/dune#10263, fixes ocaml/dune#9272, @emillon)

- Fix crash when decoding dune-package for libraries with `(include_subdirs
  qualified)` (ocaml/dune#10269, fixes ocaml/dune#10264, @emillon)

### Changed

- Remove the `--react-to-insignificant-changes` option. (ocaml/dune#10083, @rgrinberg)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug requires-team-discussion This topic requires a team discussion
Projects
None yet
7 participants