The accented characters in strings are automatically uppercased #5732

Closed
vicuna opened this Issue Aug 17, 2012 · 6 comments

vicuna commented Aug 17, 2012

Original bug ID: 5732
Reporter: Ted
Assigned to: @protz
Status: closed (set by @xavierleroy on 2016-12-07T10:37:03Z)
Resolution: not a bug
Priority: normal
Severity: minor
Platform: Laptop
OS: Debian Unstable
OS Version: 3.2.0-3-amd64
Version: 3.12.1
Category: ~DO NOT USE (was: OCaml general)
Child of: #6694

Bug description

(I have also reproduced this bug with OCaml 3.10.)

A little example is worth a long speech:

$ ocaml
Objective Caml version 3.12.1

# "Ô, mon brûlant zéphyr doré";;
- : string = "\195\148, mon br\195\187lant z\195\169phyr dor\195\169"

# String.lowercase "Ô, mon brûlant zéphyr doré";;
- : string = "\227\148, mon br\227\187lant z\227\169phyr dor\227\169"

# String.uppercase "Ô, mon brûlant zéphyr doré";;
- : string = "\195\148, MON BR\195\187LANT Z\195\169PHYR DOR\195\169"

I don't know whether the encoding problem is expected, but I am pretty sure that this behaviour is not: String.uppercase leaves the accented letters unchanged, which suggests that the system has already transformed the letter "é" into "É", etc. This bug is present for many accented letters:

# String.uppercase "éèàâôû?ãõëäöÿçùò?" = "éèàâôû?ãõëäöÿçùò?";;
- : bool = true

but, quite surprisingly, not for every one of them:

# String.uppercase "?" = "?";;
- : bool = false

# String.uppercase "?" = "?";;
- : bool = false

This problem happens even when I do not use my usual alias (ocaml="rlwrap ocaml") or my usual shell (zsh), and it also occurs when compiling OCaml code with ocamlc or ocamlopt.

vicuna commented Aug 17, 2012

Comment author: Ted

The two characters I have found for which the problem does not appear are these:

http://fr.wikipedia.org/wiki/%E1%BA%80 (no article on the English Wikipedia)
http://en.wikipedia.org/wiki/%E1%BA%BC

vicuna commented Aug 17, 2012

Comment author: @protz

From what OCaml prints, your Ô character uses two bytes, so I guess you're inputting UTF-8. OCaml still lives in the previous millennium and is not UTF-8-aware, so I assume these uppercase and lowercase routines only work properly on Latin-1-encoded strings, unfortunately :).

I suggest you take a look at the Batteries project. It has a BatUTF8 module that provides some utf8 handling routines. If you need more advanced routines, Camomile is the Unicode library for OCaml.
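
To make the byte-level effect concrete, here is a minimal sketch of a byte-wise Latin-1 case mapping, which is how the String routines of that era appear to behave. The helper names (`latin1_lowercase_char`, `latin1_uppercase`, and so on) are hypothetical and written only for illustration. The byte ranges are an assumption based on the Latin-1 table, not a copy of the standard library source. The string literal uses decimal escapes so that the result does not depend on the encoding of the source file.

```ocaml
(* Sketch only: a byte-wise Latin-1 case mapping with hypothetical names.
   Each byte is treated as one Latin-1 character, whatever the real
   encoding of the string is. *)
let latin1_lowercase_char c =
  match c with
  | 'A' .. 'Z' | '\192' .. '\214' | '\216' .. '\222' ->
      Char.chr (Char.code c + 32)
  | _ -> c

let latin1_uppercase_char c =
  match c with
  | 'a' .. 'z' | '\224' .. '\246' | '\248' .. '\254' ->
      Char.chr (Char.code c - 32)
  | _ -> c

let latin1_lowercase s = String.map latin1_lowercase_char s
let latin1_uppercase s = String.map latin1_uppercase_char s

(* "Ô" as its two UTF-8 bytes: \195 is "Ã" in Latin-1, \148 is a control
   character that has no case. *)
let o_circ = "\195\148"

let () =
  (* \195 is an uppercase Latin-1 letter, so lowercasing changes it. *)
  Printf.printf "%S\n" (latin1_lowercase o_circ);
  (* Uppercasing leaves both bytes alone. *)
  Printf.printf "%S\n" (latin1_uppercase o_circ)
```

Under this reading, any UTF-8 character whose first byte is \195 survives uppercasing untouched, while a character whose first byte is a lowercase Latin-1 letter (for example \225, "á") does not.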

vicuna commented Aug 17, 2012

Comment author: @protz

OCaml version 4.00.0

# String.length "Ô";;
- : int = 2

(If you get the same results on your machine, then you're inputting utf8).
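
The same check can be pushed one step further by comparing the byte length with the number of decoded code points. This is only a sketch: it assumes OCaml 4.14 or later, where `String.get_utf_8_uchar` and `Uchar.utf_decode_length` are available in the standard library, and `utf8_length` is a hypothetical helper name.

```ocaml
(* Sketch only; needs OCaml >= 4.14 for String.get_utf_8_uchar. *)

(* Count decoded code points. An invalid byte sequence is counted as one
   decode step, following what the decoder reports. *)
let utf8_length s =
  let rec loop i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (n + 1)
  in
  loop 0 0

let () =
  (* "Ô" written as its UTF-8 bytes, independent of the file encoding. *)
  let s = "\195\148" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (utf8_length s)
```

A byte count larger than the code point count is a quick sign that the input contains multi-byte UTF-8 sequences.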

vicuna commented Aug 17, 2012

Comment author: Ted

It looks like I am inputting UTF-8, then. It does not surprise me that there are such encoding problems, but I really do not get why I got things like:

# String.lowercase "é";;
- : string = "\227\169"

# "é";;
- : string = "\195\169"

Couldn't String.lowercase just ignore accented letters when it does not recognize them? As I do not need to actually print anything, the strange output does not bother me much, but the strange behaviour of String.lowercase does.
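
For readers on a newer compiler, the behaviour asked for here corresponds to the ASCII-only functions `String.lowercase_ascii` and `String.uppercase_ascii`, which exist in OCaml 4.03 and later (not in the 3.12.1 release used in this report). The sketch below assumes such a compiler and again spells the UTF-8 bytes with escapes.

```ocaml
(* Sketch only; needs OCaml >= 4.03 for the *_ascii functions. *)
let () =
  (* "Ô, mon brûlant zéphyr doré" as UTF-8 bytes. *)
  let s = "\195\148, mon br\195\187lant z\195\169phyr dor\195\169" in
  (* Only the bytes 'A'..'Z' and 'a'..'z' are mapped; every byte >= 128
     is left as it is, so the UTF-8 sequences stay intact. *)
  Printf.printf "%S\n" (String.lowercase_ascii s);
  Printf.printf "%S\n" (String.uppercase_ascii s)
```

Note that this leaves accented letters in their original case; it is not a Unicode case mapping.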

vicuna commented Aug 17, 2012

Comment author: @dbuenzli

But it does recognize them: the String module interprets strings as Latin-1 encoded.

The behaviour is correct: in Latin-1, \195\169 is the sequence Ã©, which String.lowercase correctly maps to \227\169, the sequence ã©.

Consult the table on this page http://en.wikipedia.org/wiki/ISO_8859-1

vicuna commented Aug 17, 2012

Comment author: Ted

Aah, I get it. Well, sorry for the "wrong" bug report, then.
