Do not treat paths as encoded in ISO-8859-1 #6695

Closed
vicuna opened this Issue Dec 7, 2014 · 5 comments

@vicuna

commented Dec 7, 2014

Original bug ID: 6695
Reporter: @whitequark
Assigned to: @whitequark
Status: closed (set by @xavierleroy on 2016-12-07T10:37:18Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Related to: #3771 #6692 #6694 #6697
Monitored by: @gasche @hcarty

Bug description

Currently, ocamlc uses String.capitalize and String.uncapitalize extensively when deriving filenames from module names and vice versa. These functions treat the strings as ISO-8859-1, and attempt to case-fold letters such as \248 (ø).

Today, none of the operating systems supported by OCaml reliably encodes paths as ISO-8859-1. Rather, UTF-8 is used on sane platforms, and a locale-specific encoding on Windows. Thus, this case conversion is practically always broken, and the derived name will contain garbage if the first letter is outside US-ASCII.
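
As a rough illustration of the failure mode, the sketch below applies a Latin-1 style lowercase rule to the first byte of a UTF-8 encoded name and compares the result with `String.uncapitalize_ascii`. The helpers `latin1_lowercase` and `latin1_uncapitalize` are hypothetical stand-ins written for this example; the byte ranges are an assumption about how the Latin-1 aware functions behaved, not a copy of the compiler's code.

```ocaml
(* Hypothetical stand-in for a Latin-1 aware lowercase: besides A-Z it also
   shifts the Latin-1 uppercase ranges 192-214 and 216-222 by 32.
   The exact ranges are an assumption made for this sketch. *)
let latin1_lowercase c =
  if (c >= 'A' && c <= 'Z')
  || (c >= '\192' && c <= '\214')
  || (c >= '\216' && c <= '\222')
  then Char.chr (Char.code c + 32)
  else c

(* Latin-1 flavoured uncapitalize: only the first byte is converted. *)
let latin1_uncapitalize s =
  String.mapi (fun i c -> if i = 0 then latin1_lowercase c else c) s

(* "Øre" encoded as UTF-8: Ø is U+00D8, i.e. the two bytes C3 98. *)
let utf8_name = "\xC3\x98re"

(* The lead byte C3 falls in the 192-214 range and becomes E3, which no
   longer forms a valid UTF-8 sequence with the following byte. *)
let broken = latin1_uncapitalize utf8_name

(* The ASCII-only variant leaves every byte above 127 untouched. *)
let safe = String.uncapitalize_ascii utf8_name

let () =
  Printf.printf "latin1: %S\nascii:  %S\n" broken safe
```

With the Latin-1 rule, the first byte of the multi-byte character is rewritten on its own, so the derived name is no longer valid UTF-8. The ASCII-only function returns the name unchanged.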

This is a separate issue from #6694. Not only is the impact in this case very clear and the scope limited to the compiler, but the current behavior is also more clearly broken.

@vicuna

commented Dec 8, 2014

Comment author: @alainfrisch

This is related to #3771 as well, which would have the effect of representing filenames under Windows as UTF-8 strings.

@vicuna

commented Dec 8, 2014

Comment author: @alainfrisch

Note that currently, source code is interpreted as a Latin-1 stream, and Latin-1 letters are allowed in module identifiers (although this is deprecated and raises Warning 3). It's probably a good time to turn this into a proper error; otherwise we need to specify how these names are mapped to filenames, and I don't think we want to go into that.

@vicuna

commented Dec 8, 2014

Comment author: @whitequark

Agreed. I will open another issue to track that.

@vicuna

commented Dec 12, 2014

Comment author: @whitequark

#124

@vicuna

commented Dec 21, 2014

Comment author: @gasche

whitequark's patch, which uses the *_ascii functions everywhere inside the compiler, has been merged.
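
For readers unfamiliar with the `*_ascii` functions, here is a minimal sketch of deriving a module name from a filename and back using only ASCII case conversion. The helpers `module_name_of_file` and `file_of_module_name` are hypothetical names chosen for this example and are not taken from the merged patch; `Filename.remove_extension` assumes OCaml 4.04 or later.

```ocaml
(* Hypothetical helper: derive a module name from a source path by
   capitalizing the first byte only if it is an ASCII letter. *)
let module_name_of_file path =
  path
  |> Filename.basename
  |> Filename.remove_extension
  |> String.capitalize_ascii

(* Hypothetical helper: derive a filename from a module name, again
   touching only an ASCII first letter. *)
let file_of_module_name ~ext name =
  String.uncapitalize_ascii name ^ ext

let () =
  print_endline (module_name_of_file "src/foo_bar.ml");
  print_endline (file_of_module_name ~ext:".ml" "Foo_bar")
```

Because both conversions ignore bytes above 127, a name whose first character is not in US-ASCII passes through byte-for-byte, whatever encoding the path uses.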
