Switch from camomile to uu* #74

Closed
nojb wants to merge 3 commits into ocaml-community:master from nojb:master

@nojb
Contributor

@nojb nojb commented Jun 3, 2019

This is a companion PR to ocaml-community/zed#16, swapping out camomile for uu*. The main issue that needs working out is that uutf only supports conversion into

type encoding = [ `UTF_16 | `UTF_16BE | `UTF_16LE | `UTF_8 ]

This means that the "outgoing encoding" of lambda-term must be one of these. This is probably not a big deal on Linux where UTF-8 usage is pervasive, but on the Windows console Latin-1 is widespread and this would not be supported anymore.
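
To make the constraint concrete, here is a minimal sketch of what "outgoing encoding must be one of these" amounts to. It uses only the OCaml standard library (`Buffer.add_utf_8_uchar` and friends, OCaml 4.06 or later) rather than uutf itself; the `encode` helper is hypothetical, and treating `` `UTF_16 `` as big-endian is an assumption of this sketch.

```ocaml
(* Same variant as quoted above. The encoder below is a stdlib-only
   illustration, not uutf's implementation. *)
type encoding = [ `UTF_16 | `UTF_16BE | `UTF_16LE | `UTF_8 ]

(* Hypothetical helper: encode a list of Unicode scalar values in one of
   the supported schemes. Assumption: `UTF_16 is treated as big-endian. *)
let encode (enc : encoding) (us : Uchar.t list) : string =
  let b = Buffer.create 16 in
  let add =
    match enc with
    | `UTF_8 -> Buffer.add_utf_8_uchar b
    | `UTF_16 | `UTF_16BE -> Buffer.add_utf_16be_uchar b
    | `UTF_16LE -> Buffer.add_utf_16le_uchar b
  in
  List.iter add us;
  Buffer.contents b
```

There is no case here for Latin-1 or any other legacy code page, which is the gap discussed in this PR.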

/cc @Drup @diml

@nojb nojb requested a review from a user June 3, 2019 11:15
@pmetzger
Member

pmetzger commented Jun 8, 2019

I wonder if @dbuenzli has any ideas here.

@dbuenzli
Contributor

dbuenzli commented Jun 9, 2019

Not exactly sure about the context. But converting an Uchar.t to Latin1 is trivial: that's basically try Char.chr (Uchar.to_int u) with Invalid_argument _ -> ... so I don't think it should be too hard to support an additional path for that in this setting.

@pmetzger
Member

pmetzger commented Jun 9, 2019

@dbuenzli Is there a usual best practice for what you substitute for a character that can't be represented? @nojb will need to pick something. (A ? seems like an obvious choice perhaps?)

@dbuenzli
Contributor

dbuenzli commented Jun 9, 2019

Not that I'm aware of.

@pmetzger
Member

pmetzger commented Jun 9, 2019

So @nojb, as @dbuenzli suggests, you can just handle the latin1 case by trying to convert the character into a latin1 char, and if it's out of range, replace it with a ?.
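
A minimal sketch of that Latin-1 path, using only the standard library. The helper names `latin1_of_uchar` and `latin1_of_uchars` are hypothetical, and `'?'` as the substitute is just the choice floated above, not an established convention.

```ocaml
(* Hypothetical helper following the suggestion above: code points
   U+0000..U+00FF map directly to Latin-1 bytes; Char.chr raises
   Invalid_argument for anything larger, which we replace with '?'. *)
let latin1_of_uchar (u : Uchar.t) : char =
  try Char.chr (Uchar.to_int u) with Invalid_argument _ -> '?'

(* Hypothetical helper: encode a list of scalar values as a Latin-1 string. *)
let latin1_of_uchars (us : Uchar.t list) : string =
  let b = Buffer.create 16 in
  List.iter (fun u -> Buffer.add_char b (latin1_of_uchar u)) us;
  Buffer.contents b
```

Note that the substitution is lossy: once a character has been replaced, the original text cannot be recovered from the output.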

@nojb
Contributor Author

nojb commented Jun 9, 2019

Just to be clear, even if we support Latin-1, we will no longer support any other Windows codepage, so this requires a decision.

@kandu
Collaborator

kandu commented Jun 9, 2019

As far as I know, Latin-1 is not the only widespread encoding; many encodings are in use around the world, depending on the system locale. For example, several hundred million Windows computers have a console encoding of cp932 (JP) or cp936 (CN).

@kandu
Collaborator

kandu commented Jun 9, 2019

Replacing camomile in zed, an abstract editing engine, is a good idea. But doing the same in lambda-term should, in my opinion, be considered carefully.

@Drup
Member

Drup commented Jun 9, 2019

Removing some Windows support is really problematic; one of lambda-term's strengths is that everything is multi-platform out of the box.

@nojb
Contributor Author

nojb commented Jun 9, 2019

Removing some Windows support is really problematic; one of lambda-term's strengths is that everything is multi-platform out of the box.

I agree; if we still want to switch zed to uu*, this will require adding a uu* <-> camomile conversion layer at the lambda-term <-> zed boundary.

@dbuenzli
Contributor

dbuenzli commented Jun 9, 2019

If the Windows mappings are the only problem, note that the actual data can be found here (I had planned to expose these mappings in the uucp package or in another package, but somehow never got round to it).

@pmetzger
Member

pmetzger commented Jun 9, 2019

Just to note: Windows users can choose to use Unicode in any console window if they want to. Not wanting to support all the world's encodings doesn't mean a Windows user is out of luck.

@nojb
Contributor Author

nojb commented Jun 9, 2019

Just to note: Windows users can choose to use Unicode in any console window if they want to. Not wanting to support all the world's encodings doesn't mean a Windows user is out of luck.

Actually it is more complicated than that. cmd.exe does not support UTF-8 very well. The details are complicated, but you can find some pointers in https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line.

@nojb
Contributor Author

nojb commented Jun 9, 2019

I made some modifications to the patch so that camomile is still used for I/O but uu* is used for the rest (including interacting with zed). Getting rid of camomile completely would be left for a future PR.

@pmetzger
Member

BTW, the CI build is failing.

@nojb
Contributor Author

nojb commented Jun 10, 2019

BTW, the CI build is failing.

Yes, this requires ocaml-community/zed#16

@kandu
Collaborator

kandu commented Jun 10, 2019

If the Windows mappings are the only problem, note that the actual data can be found here

With these mappings, getting rid of camomile is fairly straightforward.

But doing so would introduce another sizeable burden...
Or, to ask another question: what is the disadvantage of depending on camomile that makes us want to get rid of it?

@pmetzger
Member

Minor, unimportant: camomile's license is a bit more restrictive
More important: camomile isn't very well maintained. It's rarely up to date with the latest Unicode.

@dbuenzli
Contributor

Unless that has changed, camomile relies on Unicode 3.2, which is getting pretty old: tens of thousands of characters and dozens of scripts have been added since then. However, this may not be a problem for what lambda-term is doing (I don't know).

Actually it is more complicated than that. cmd.exe does not support UTF-8 very well. The details are complicated, but you can find some pointers in https://stackoverflow.com/questions/388490/how-to-use-unicode-characters-in-windows-command-line.

I certainly don't have a grasp of all these details but note that trying to make things work out-of-the-box will not necessarily lead to a better end-user experience.

If we take the toplevel as an example, PRs like ocaml/ocaml#1231 firmly make the toplevel expect a UTF-8 environment (you have to set OCAMLTOP_UTF_8=false to prevent this), and most OCaml libraries that return strings as text will likely give them back to you in UTF-8.

The conjunction of these two facts may lead you to see more '?' than you would like (which usually leads to issue reports from users new to character sets) or worse, unless utop fully re-encodes the toplevel's output, it may even break your terminal session.
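
To illustrate what "fully re-encoding" UTF-8 output for a Latin-1 console would involve, and where the extra '?' come from, here is a rough stdlib-only sketch. The `latin1_of_utf8` helper is hypothetical and relies on `String.get_utf_8_uchar`, so it assumes OCaml 4.14 or later; it is not what utop or lambda-term actually do.

```ocaml
(* Hypothetical helper: re-encode UTF-8 text as Latin-1. Malformed input
   and characters above U+00FF are written as '?'. Assumes OCaml >= 4.14. *)
let latin1_of_utf8 (s : string) : string =
  let b = Buffer.create (String.length s) in
  let rec loop i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      let u = Uchar.utf_decode_uchar d in
      let c =
        if Uchar.utf_decode_is_valid d && Uchar.is_char u
        then Uchar.to_char u
        else '?'
      in
      Buffer.add_char b c;
      (* Advance by the number of bytes the decoder consumed. *)
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents b
```

Without such a pass, raw UTF-8 bytes reach a terminal that interprets them in another encoding, which is the "break your terminal session" scenario.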

@pmetzger
Member

BTW, it occurs to me that the real expert on Windows console support is @dra27 and it might be a good idea to ask his opinion.

@dra27

dra27 commented Jun 11, 2019

Only glancing at this, but the output mode of the console is not something you have to live with on Windows, it's something you can define. So if lambda-term wishes to output UTF-8, then all it has to do is use SetConsoleOutputCP to enable the UTF-8 codepage and you should be done. IIRC it gets restored automatically on process exit. I'm supposed to be (finally) finishing off ocaml/ocaml#1408 in time for 4.10 in the autumn, after which OCaml on Windows will enable UTF-8 output for the Windows Console by default anyway.

There are obscure gotchas with enabling UTF-8 on older Windows, especially for input, although it starts to become quite academic from next year when Windows 7 exits support.

@pmetzger
Member

@dra27 Is there a way to make that call from within OCaml easily?
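
One stub-free possibility, sketched here under assumptions rather than as a recommendation: shell out to the `chcp` command, which changes the code page of the attached console (65001 is the UTF-8 code page). The helper name `try_enable_utf8_console` is hypothetical, and the sketch assumes `chcp` can be run through `Sys.command`. It does not restore the previous code page, and a small C stub calling the Win32 function `SetConsoleOutputCP` would be the more direct route.

```ocaml
(* Windows code page identifier for UTF-8. *)
let utf8_codepage = 65001

(* Hypothetical helper: try to switch the attached Windows console to the
   UTF-8 code page by running `chcp`. Returns true on success. On other
   platforms it does nothing and returns true (assumption: the terminal
   is already UTF-8 there). *)
let try_enable_utf8_console () : bool =
  if Sys.win32 then
    Sys.command (Printf.sprintf "chcp %d > NUL" utf8_codepage) = 0
  else
    true
```

Spawning a process for this is heavier than a direct API call, so it is mainly useful for quick experiments.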

@nojb
Contributor Author

nojb commented May 15, 2020

This needs updating, but I don't have the time at the moment.


6 participants