Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR6695: Do not treat paths as encoded in ISO-8859-1 #124

Closed
wants to merge 6 commits into from

Conversation

Projects
None yet
6 participants
@whitequark
Copy link
Member

commented Dec 12, 2014

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

I think the merge of "Make sure the compiler only uses ASCII" would be a no-brainer if it didn't imply changing the stdlib, which is always more controversial (as it should be). I'd merge a commit with the ascii_foo functions in utils/misc (and let people discuss a second commit that moves them to the stdlib directly). Unfortunately, that would require duplicating them in ocamlbuild/my_std as well, as we don't want ocamlbuild to start depending on internal compiler libraries.

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

I forgot to say: I do support adding the ascii_* to the stdlib (I would personally have just changed the semantics of the existing String....) and have reviewed the patch, I think it is fine.

(We may want to expose latin1_* functions and encourage people to use them if they explicitly rely on the accented-characters behavior.)

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@gasche Yes, that was the reason I did not put them in Misc.

Should I move/duplicate them or wait for more input?

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2014

It might be useful to check that no OPAM package uses compilation units whose name starts with a non-ASCII letter. I assume and hope the answer is no.

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

Should I move/duplicate them or wait for more input?

Well it's as you prefer. If you value the "at least the compiler is fixed in trunk now" aspect enough to put micro-work in splitting the commits, you want to do that. Otherwise, just wait -- but be patient, these things can stay silent for months. Both are fine options.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@alainfrisch The answer is actually no. ocamldep does not print dependencies whose names start with non-'A'..'Z' -- this will be the topic of a later PR of mine.

Additionally, since Ubuntu, Github et al use UTF-8 and the compiler uses Latin-1, even if it did not, that would still fail the compilation.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@gasche I will put them in Misc. This will make it easier to decide on how/if change the semantics of String.

@damiendoligez

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

I'm in favor of adding them to stdlib, but we need to discuss the interface:

  1. I don't like the names, I think capitalize_ascii would look better than ascii_capitalize
  2. We need to discuss the idea of adding an optional argument to the existing functions instead of adding new functions (it looks cleaner but makes it harder to deprecate the old behavior).
@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@gasche I tried to put them in Misc. It is not easy. There's a never-ending hole of dependencies all around the compiler.

@planar

  1. I will rename them.
  2. How about adding ?(latin1=true) ?
@gasche

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

I think adding optional arguments is a bad way to evolve APIs. In this particular case, the default value would be the bad value for this parameter.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2014

@gasche If that's the bad value for the optional parameter it means that you ready to break compatibility ?

If it's ok to break backward compatibility then why not move the old functions with the same name to a deprecated String.Latin1 module and keep the rest as is but now only acting on the ASCII set. Code relying on the old behaviour will only have to add Latin1 in front of a bunch of identifiers.

Another alternative (which I'm not very fond of) would be to have a full copy of the old String module in the deprecated String.Latin1 module and tell people to open String.Latin1 if they need the deprecated behaviour. The advantage and disadvantage of this solution is that backward compatibility can be maintained at the build system level with -open String.Latin1. But I'm not very fond of that command line option...

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2014

Oh btw. adding optional parameters to an API is an interface breaking change anyways.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

I think the required change is by far not invasive enough to warrant a full copy. In any case it is never worse than adding ascii affix and is simple enough that you could even apply it with sed. (Not that you should.)

Personally I prefer changing the behavior of existing functions, given that it seems that breaking backwards compatibility is inevitable.

@lpw25

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2014

If we're going to break backwards compatibility, why not break it in the same way it was broken for Bytes. This would meaning adding a Latin1 module to stdlib with the old implementations, and a latin1 type to the predefined types. Then a compiler option would decide whether string was equal to latin1 and whether latin1 characters were accepted in string literals.

We could also add a "foo"l literal form to create latin1 strings (which could be a first step towards allowing Unicode literals as expressions one day).

The benefit of this approach is that we use types to properly separate latin1 encoded strings from ascii encoded strings -- which should hopefully make bugs around incorrect use of the encodings less likely.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@lpw25 I think this is the best solution. We also know migration pitfalls reasonably well.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2014

@whitequark this as in this PR or this as in @lpw25 ? I personally think that @lpw25's solution is overkill for four functions acting on a legacy charset. If backward compatibility should not be broken on that I prefer deprecation of the current functions and the _ascii suffix way.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 12, 2014

@dbuenzli I was originally talking about @lpw25 but after consideration I realized I do not understand the ramifications of either breaking or not breaking compatibility well enough to have an opinion, so I'll let someone else figure out how it should be done.

@damiendoligez

This comment has been minimized.

Copy link
Member

commented Dec 12, 2014

@dbuenzli Indeed, @lpw25's solution would be fine if Latin1 was here to stay, but we're trying to get rid of it so I don't see much point in adding that much complexity to the stdlib. Like you, I prefer the "deprecation and _ascii suffix" solution.

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 13, 2014

Thirded. I think the current patch is optimal (I would maybe add latin1_* functions as well, but it's a detail).

PR6695: Add ASCII counterparts to case-mapping functions.
This updates Char, String, Bytes in the stdlib.

For now, they are hidden from documentation and are only for
internal compiler use.
@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 14, 2014

@gasche I've updated the patch to include @planar's suggestion on naming. Otherwise it is unchanged. Anything else needs to be done?

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 14, 2014

If you also marked the legacy functions as deprecated this patch would solve Mantis PR#6695 as well.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 14, 2014

@gasche Did you mean PR6694?

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 14, 2014

Yes indeed.

whitequark added some commits Dec 12, 2014

PR6695: Make sure the compiler only uses ASCII string functions.
This should cover all places involving filenames in the compiler.
There are a few more paths still using Latin-1 in other ways,
e.g. in ocamldoc.
PR6694: Deprecate Latin-1 string manipulation functions.
Also, add documentation for the US-ASCII variants.
PR6695: Make Filename use only US-ASCII functions.
The only place that includes changes is the code for checking
the suffix. It is highly unlikely that the change has any
impact at all.
@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 14, 2014

Updated. The deprecation warnings actually uncovered a bug (a stray lowercase in Filename), and added some fallout, mainly in Str.

@gasche

This comment has been minimized.

Copy link
Member

commented Dec 21, 2014

Merged in trunk. Thanks!

@gasche gasche closed this Dec 21, 2014

@whitequark whitequark deleted the whitequark:utf8-paths branch Dec 21, 2014

@whitequark whitequark restored the whitequark:utf8-paths branch Dec 21, 2014

@whitequark whitequark deleted the whitequark:utf8-paths branch Dec 21, 2014

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Dec 21, 2014

@gasche given the recent #131 you may want to add @SInCE directives to {Char,String}.{uppercase,lowercase}_ascii now.

@whitequark

This comment has been minimized.

Copy link
Member Author

commented Dec 21, 2014

@dbuenzli Already done in trunk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.