Specification for character encoding in the Modelica Standard Library #136
Comments
Modified by dietmarw on 10 Jan 2009 16:39 UTC |
Comment by otter on 11 Jan 2009 11:22 UTC
The character set of the Modelica language is not yet completely specified. However, in practice the currently available Modelica tools work well for code written in the 8-bit Latin-1 character set, which corresponds to the first 256 characters of the Unicode character set. Most of the first 128 characters of Latin-1 are equivalent to the 7-bit ASCII character set. On Wikipedia there is some discussion about the pros and cons of using UTF-8, in particular:
Advantages:
Disadvantages:
Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte-order mark. This causes interoperability problems with software that does not expect the BOM. In particular:
Some Windows software (including Notepad) will sometimes misidentify UTF-8 (and thus plain ASCII) documents as UTF-16LE if this BOM is missing, a bug commonly known as "Bush hid the facts" after a particular phrase that can trigger it.
I am a bit worried that there are Windows programs (most notably Notepad) that have difficulties with UTF-8. I also tried my preferred text editor (UltraEdit). By default it uses the Latin-1 character set, but it can read UTF-8 and it seems possible to also write UTF-8.
It would be useful to fix the character set issue in Modelica 3.1. From the above discussion it is clear that UTF-8 should be the goal. The open question is what drawbacks we get. Please try to figure this out in your environments, especially with your preferred text editors. |
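As a small illustration of the BOM point in the comment above: the bytes 0xEF,0xBB,0xBF are simply U+FEFF encoded as UTF-8, and a tool that does not expect them can drop them before parsing. This is only a sketch using Python's standard codecs module; the helper name strip_utf8_bom and the sample content are assumptions for illustration, not taken from any Modelica tool.

```python
import codecs

# The UTF-8 byte-order mark is the UTF-8 encoding of the code point U+FEFF.
bom = "\ufeff".encode("utf-8")
print(list(bom))                 # [239, 187, 191], i.e. 0xEF, 0xBB, 0xBF
print(bom == codecs.BOM_UTF8)    # True


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content as Notepad might save it.
sample = codecs.BOM_UTF8 + b"model M end M;"
print(strip_utf8_bom(sample))    # b'model M end M;'
```

Python's built-in "utf-8-sig" codec does the same thing when decoding, so data.decode("utf-8-sig") would be an alternative where the result is wanted as text.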
Comment by otter on 11 Jan 2009 12:27 UTC
To summarize, UTF-8 is used by email, the web, Java, and XML. In Modelica, all language keywords and all user-defined names are ASCII characters. Additionally, HTML documentation is also ASCII. Therefore, in Modelica the encoding is only relevant for strings and for comments. In order to make it easier for non-Western countries to use Modelica, UTF-8 is certainly a good idea. The character encoding is very closely related to the tools, and therefore all Modelica tool vendors should give their opinions before we make a decision. |
Comment by dietmarw on 11 Jan 2009 17:02 UTC
|
Comment by anonymous on 12 Jan 2009 12:46 UTC
From reading this FAQ, it would seem that the only time such conversion would be necessary would be on output and that, in general, the encoded strings could be stored and to a large extent processed in encoded form. So it isn't clear to me (although I could certainly be wrong) that anything would necessarily double.
Well, MATLAB, Perl and Python all support UTF-8 already. It looks like Excel has supported it since at least Office 2002 as well.
I think the issue is not what problem UTF-8 solves but what problem ASCII creates. It seems to me that the issue boils down to this. Either Modelica only supports ASCII or it supports something beyond ASCII. If it isn't limited to just ASCII then it seems to me we have to specify the encoding. Unless somebody has compelling arguments for another encoding then UTF-8 seems to be the best choice. So that boils it down to strictly ASCII or UTF-8 as being the choices. Does anybody really want strict ASCII? |
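On the question of whether anything would "double": a quick way to check is to compare character counts with UTF-8 byte counts for a few strings. The sketch below uses made-up sample strings (they are assumptions, not text from the Modelica Standard Library). ASCII characters stay at one byte each in UTF-8, the non-ASCII Latin-1 characters take two bytes, and the CJK characters used here take three.

```python
def utf8_size(text: str) -> int:
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical sample strings, written with escapes to keep this file ASCII.
samples = {
    "ASCII only": "Resistor",
    "Latin-1 umlauts": "W\u00e4rme\u00fcbergang",
    "CJK": "\u96fb\u6c17\u62b5\u6297",
}

for label, text in samples.items():
    print(f"{label}: {len(text)} characters, {utf8_size(text)} UTF-8 bytes")
```

For text that is mostly ASCII with occasional umlauts, the growth relative to Latin-1 is therefore small.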
Comment by petar on 12 Jan 2009 13:36 UTC
One of the problems UTF-8 solves for MathModelica is that it allows our SBML translator tool to work without any problems, as SBML uses UTF-8. SBML is a modelling language for systems biology (see sbml.org). I would also like to comment that different tool vendors might use different storage formats. The storage format for Modelica is outside the scope of the specification, but arguing for efficient formats may still be valid to some extent; however, I think it shouldn't get higher priority than ease of use of the language. |
Comment by HansOlsson on 19 Jan 2009 09:36 UTC
The common "modified" variant is CESU-8 (i.e. applying UTF-8 encoding to UTF-16), which the standards make clear is only for internal use. Still, it is used by several tools (e.g. Java's external interface is almost CESU-8, and wasn't properly documented as such the first time I looked at it).
We can also either standardize on a leading byte-order mark (BOM) or not; in Windows a BOM is almost necessary to ensure that most programs correctly identify the file, and it also allows a smooth upgrade (i.e. files without a BOM can be treated as ISO-8859-1). I understand that the BOM can cause issues on Unix, but I don't see any relevant ones: if an editor can open a UTF-8 file, it should also open it with a BOM. Note: the shebang "#!" is used to denote shell scripts and is thus not relevant here; the other common issue is concatenating files, which seems odd for .mo files.
It also implies that all internal string-processing routines in tools (and Modelica.Utilities) have to be at least verified for correctness with UTF-8 characters (and in several cases the routines have to be rewritten).
We should also be consistent: if UTF-8 encoding for strings taking double memory is an issue, we might think twice about HTML-encoding the documentation (which more than doubles the length). |
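The "smooth upgrade" idea from the comment above (a BOM means UTF-8, no BOM means the file is treated as ISO-8859-1) could look roughly like the following. This is a sketch of that one suggestion only, not a description of how any Modelica tool actually reads files; the function name decode_modelica_source is hypothetical.

```python
import codecs


def decode_modelica_source(data: bytes) -> str:
    """Decode raw file content: UTF-8 if it starts with a BOM, else ISO-8859-1."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Every byte sequence is valid ISO-8859-1, so this branch never fails.
    return data.decode("iso-8859-1")


# The same hypothetical comment line stored the legacy way and the new way.
text = "// W\u00e4rme"
legacy = text.encode("iso-8859-1")
modern = codecs.BOM_UTF8 + text.encode("utf-8")

print(decode_modelica_source(legacy) == decode_modelica_source(modern))  # True
```

The weakness of this scheme is a UTF-8 file saved without a BOM: it would be silently decoded as ISO-8859-1 and every non-ASCII character would come out garbled.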
Comment by dietmarw on 5 Jun 2009 08:32 UTC
|
Comment by dietmarw on 24 Nov 2009 23:21 UTC |
Comment by dietmarw on 20 Jan 2010 07:34 UTC Developers who are unsure if their system supports UTF8 should:
|
Comment by otter on 15 Mar 2010 08:34 UTC
|
Comment by anonymous on 18 Mar 2010 14:55 UTC |
Comment by otter on 19 Mar 2010 09:32 UTC
All Modelica files are UTF-8. Non-ASCII UTF-8 characters are only allowed in:
In particular, also quoted identifiers like '#456 abc' must be 7-bit ASCII. The reason for this restriction is:
All this means that identical visual appearances of UTF-8 character sequences might have different byte sequences behind them. The question is now whether two identifiers are identical if they have identical visual appearance or if they have the same byte sequence. A human would prefer the first solution, a computer program the second. The second solution can result in difficult-to-detect errors in a Modelica model.
Another issue is, if Unicode were allowed in identifiers, how a tool would order identifiers. E.g. in the package browser, it is possible to order everything "alphabetically". However, there is no unique "alphabetic" ordering in Unicode, since this depends on the culture (see below).
Since we do not feel confident that we have completely understood the issue and its solution, the proposal is to be conservative and only allow 7-bit ASCII for identifiers in Modelica 3.2. Does someone know whether there are programming languages that allow UTF-8 characters in identifiers? If yes, how is this handled in these languages? E.g. Python supports Unicode in strings, but identifiers are even more restricted than in Modelica (only letters and digits).
The support of UTF-8 in strings also poses problems. Here is a list with possible solution strategies:
For operands of type String, str1 op str2 is for each relational operator, op, defined in terms of the C-function strcmp as strcmp(str1,str2) op 0.
Obviously, this definition is no longer correct for UTF-8 (since strcmp only works on ASCII) and must be changed. A simple solution is to make the comparison according to the numeric equivalent of a Unicode character (this seems to be also how Java compares Unicode strings, see: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#compareTo%28java.lang.String%29). In Python the same rules are used. Additionally, a function "unicodedata.normalize(form,unistr)" (http://docs.python.org/library/unicodedata.html#unicodedata.normalize) is provided to transform a Unicode string into a normalized form according to different normalization methods (NFC, NFKC, NFD, NFKD) and then make the comparison on this normalized string.
|
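To make the two points of the comment above concrete (identical appearance with different underlying codes, and comparison by numeric code point), here is a short sketch using the unicodedata.normalize function mentioned there. The helper compare_by_code_point is a hypothetical name; it relies on the fact that Python compares str values code point by code point, and it returns a strcmp-style -1, 0 or 1.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain "e" followed by a combining acute accent

# Both render the same, but they are different code point sequences.
print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True


def compare_by_code_point(a: str, b: str, form: str = "NFC") -> int:
    """strcmp-like comparison (-1, 0, 1) of two strings after normalization."""
    a = unicodedata.normalize(form, a)
    b = unicodedata.normalize(form, b)
    return (a > b) - (a < b)


print(compare_by_code_point(composed, decomposed))  # 0
print(compare_by_code_point("z", composed))         # -1, since U+007A < U+00E9
```

Note that ordering by code point puts "z" before the accented e, which is not what any alphabetic ordering would do; this is the culture-dependence issue raised in the comment.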
Comment by otter on 19 Mar 2010 11:57 UTC
If someone would also like to have UTF-8 support in strings, he/she needs to help to precisely define this in the specification and help to adapt the C-functions under Modelica/C-sources/ModelicaStrings.c.
I also realized that the whole thing is even worse: in Modelica/C-sources/ModelicaInternal.c there are external functions that operate on file names. In Windows, file names can have Unicode characters. Some blogs discuss this and report that it is a mess even under Windows, because different sets of file names can be generated depending on the API used. E.g., a file generated with one API may be visible in Windows Explorer but can no longer be renamed there. |
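Why byte-oriented string routines need adapting can be seen with a very small experiment: with UTF-8 the byte length and the character count differ, and cutting at an arbitrary byte index can split a character. The sketch below is in Python rather than C and is not based on the actual contents of ModelicaStrings.c; the helper safe_prefix is a hypothetical name.

```python
name = "W\u00e4rme"              # 5 characters, one of them non-ASCII
encoded = name.encode("utf-8")

print(len(name), len(encoded))   # 5 6

# Cutting after 2 bytes splits the two-byte sequence for U+00E4.
try:
    encoded[:2].decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8 after byte-wise cut:", err.reason)


def safe_prefix(data: bytes, max_bytes: int) -> bytes:
    """Longest prefix of valid UTF-8 data with at most max_bytes bytes,
    cut only on a character boundary."""
    if max_bytes >= len(data):
        return data
    cut = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx; step back over them.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


print(safe_prefix(encoded, 2))   # b'W'
```

The same boundary check carries over directly to C, since it only inspects the top two bits of each byte.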
Comment by otter on 19 Mar 2010 17:29 UTC
|
Modified by dietmarw on 30 Apr 2010 07:12 UTC |
Comment by dietmarw on 21 May 2010 11:09 UTC |
Comment by hansolsson on 30 Mar 2016 14:13 UTC Unfortunately I cannot close it with Milestone MLS 3.2. |
Changelog removed by hansolsson on 30 Mar 2016 14:13 UTC |
Modified by dietmarw on 10 Jan 2009 16:39 UTC
Currently it is not specified what encoding Modelica files should be using. For historical reasons (and because of bug No 1 ;-) ) Modelica files are using Western ISO-8859-1. This can lead to problems when opening on a "non-Western" environment (e.g., Asian fonts).
I'm opening this issue as a follow-up from the Modelica Design list discussion:
I also looked into the Modelica Design archive and also found these threads:
Just to summarise from the most recent discussion:
Michael Tiller::
At the Waterloo meeting a Kevin Ellis strongly suggested UTF-8 encoding for zip files so I would suggest we follow that advice for other forms as well just for consistency.
Francesco Casella::
So getting somebody to comment who actually has some experience with character encoding would be great.
As for the HTML part of Modelica files I agree with Francesco that only ASCII text should be allowed and HTML entities for special characters.
Reported by dietmarw on 10 Jan 2009 16:31 UTC
Migrated-From: https://trac.modelica.org/Modelica/ticket/136