Specification for character encoding in the Modelica Standard Library #136

Closed
modelica-trac-importer opened this issue Nov 3, 2018 · 19 comments

Comments

@modelica-trac-importer

modelica-trac-importer commented Nov 3, 2018

Modified by dietmarw on 10 Jan 2009 16:39 UTC
Currently it is not specified which encoding Modelica files should use. For historical reasons (and because of bug No 1 ;-) ) Modelica files use Western ISO-8859-1. This can lead to problems when opening them in a "non-Western" environment (e.g., with Asian fonts).

I'm opening this issue as follow up from the Modelica Design list discussion:

I also looked into the Modelica Design archive and also found these threads:


Just to summarise from the most recent discussion:
Michael Tiller::
At the Waterloo meeting a Kevin Ellis strongly suggested UTF-8 encoding for zip files so I would suggest we follow that advice for other forms as well just for consistency.
Francesco Casella::

  • Possibility to include the information about the encoding in an annotation? The annotation would be written in ASCII, so it would be read no matter what, and then a tool might interpret the encoding correctly, and possibly convert it to a different one, if needed.
  • Might not be feasible as the tool would need the character encoding information before reading the file. (comment from Michael Tiller)
  • Something in the same spirit as the MIME type used to specify character encoding in e-mail.
  • I guess support of UTF-8 will become more and more important in the future (e.g. as the number of Chinese or Arabic end users of Modelica increases), but forcing the use of UTF-8 might pose some problems for "traditional" western users. For documentation, I would anyway recommend sticking to HTML entities, at least for the MSL.

So getting somebody to comment who actually has some experience with character encoding would be great.

As for the HTML part of Modelica files, I agree with Francesco that only ASCII text should be allowed, with HTML entities for special characters.


Reported by dietmarw on 10 Jan 2009 16:31 UTC

Migrated-From: https://trac.modelica.org/Modelica/ticket/136

@modelica-trac-importer modelica-trac-importer added the discussion label Nov 3, 2018
@modelica-trac-importer

Comment by otter on 11 Jan 2009 11:22 UTC
In the Modelica Specification 3.0 it is stated in chapter 2.1:

The character set of the Modelica language is not yet completely specified. However, in practice the currently available Modelica tools work well for code written in the 8-bit Latin-1 character set, which corresponds to the first 256 characters of the Unicode character set. Most of the first 128 characters of Latin-1 are equivalent to the 7-bit ASCII character set.
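To make the quoted statement concrete, here is a minimal Python sketch (Python is used only for illustration and is not part of the specification). It checks that Latin-1 byte values map one-to-one onto the first 256 Unicode code points, and shows why an existing Latin-1 file is generally not valid UTF-8 once it contains a non-ASCII character. The sample word and the helper name `is_valid_utf8` are assumptions made for this example.

```python
# Latin-1 byte value n corresponds to Unicode code point n (0..255).
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1") == "".join(chr(n) for n in range(256))

# The lower half (0..127) is plain ASCII.
assert all_bytes[:128].decode("ascii") == all_bytes[:128].decode("latin-1")


def is_valid_utf8(raw):
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Sample text chosen for illustration only.
latin1_bytes = "Größe".encode("latin-1")
utf8_bytes = "Größe".encode("utf-8")

print(is_valid_utf8(latin1_bytes))  # False: 0xF6 cannot start a UTF-8 sequence
print(is_valid_utf8(utf8_bytes))    # True
```

Pure ASCII files pass the check in either encoding, which is why only files with non-ASCII strings or comments are affected by a switch.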

On Wikipedia there is some discussion about the pros and cons of using UTF-8. In particular:

Advantages:

* UTF-8 is a superset of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional code-page-specific character sets can generally be used with UTF-8 with few or no changes.

* UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.

* The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.

 * In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However it uses modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. Tcl also uses the same modified UTF-8 as Java for internal representation of Unicode data.

Disadvantages:

* UTF-8 encoded text is larger than the appropriate single-byte encoding except for plain ASCII characters. In the case of languages which commonly used 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), UTF-8 text will be almost double the size of the same text in a single-byte encoding.

* Single byte per character encodings make string cutting easy even with simple-minded APIs.
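The size trade-off described in these lists can be checked with a short Python sketch. The sample strings are assumptions chosen for illustration; ISO-8859-5 stands in for "a Cyrillic code page".

```python
ascii_text = "model Test end Test;"   # sample ASCII-only source line
cyrillic_text = "Привет"              # sample Cyrillic word, 6 letters

# Advantage: ASCII text is byte-for-byte identical in UTF-8.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Disadvantage: each Cyrillic letter takes one byte in a single-byte
# code page but two bytes in UTF-8.
single_byte = cyrillic_text.encode("iso-8859-5")
utf8 = cyrillic_text.encode("utf-8")
print(len(single_byte), len(utf8))  # 6 12
```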

Many Windows programs (including Windows Notepad) add the bytes 0xEF,0xBB,0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte-order mark. This causes interoperability problems with software that does not expect the BOM. In particular:

* It removes the desirable feature that UTF-8 is identical to ASCII for ASCII-only text. For instance a text editor that does not recognize UTF-8 will display "ï»¿" at the start of the document, even if the UTF-8 contains only ASCII and would otherwise display correctly. It is for example possible to introduce Unicode into an existing computer language and compiler by adding some API for input/output, and use an external UTF-8 text editor when editing text strings. However such a compiler would not accept the BOM, which must be removed manually.

* Programs that identify file types by special leading characters will fail to identify the UTF-8 files, even for file types that can otherwise contain UTF-8. A notable example is the Unix shebang syntax.

Some Windows software (including Notepad) will sometimes misidentify UTF-8 (and thus plain ASCII) documents as UTF-16LE if this BOM is missing, a bug commonly known as "Bush hid the facts" after a particular phrase that can trigger it.
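The BOM behaviour described above can be reproduced with Python's standard codecs. This is only an illustrative sketch; the one-line model text is an assumption.

```python
import codecs

source = "model M end M;"  # sample ASCII-only content

# The UTF-8 BOM is the three bytes 0xEF, 0xBB, 0xBF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# "utf-8-sig" writes the BOM, plain "utf-8" does not.
with_bom = source.encode("utf-8-sig")
without_bom = source.encode("utf-8")
assert with_bom == codecs.BOM_UTF8 + without_bom

# A reader that assumes Latin-1 shows the BOM as three visible characters.
print(with_bom.decode("latin-1")[:3])  # ï»¿

# A plain UTF-8 reader keeps the BOM as U+FEFF; "utf-8-sig" strips it.
assert with_bom.decode("utf-8") == "\ufeff" + source
assert with_bom.decode("utf-8-sig") == source
```

A tool that must accept files both with and without a BOM can read with `utf-8-sig`, which handles either case.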

I am a bit worried that there are Windows programs (most notably Notepad) that have difficulties with UTF-8. I also tried my preferred text editor (UltraEdit). By default it uses the Latin-1 character set, but it can read UTF-8 and it seems possible to write UTF-8 as well.

It would be useful to fix the character set issue in Modelica 3.1. From the above discussion it is clear that UTF8 should be the goal. The open question is what drawbacks we get. Please, try to figure this out in your environments, especially with your preferred text editors.

@modelica-trac-importer

Comment by otter on 11 Jan 2009 12:27 UTC
Forgot one advantage:

  • UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly, either externally or through a text declaration.

To summarize, UTF-8 is used by email, the web, Java, and XML.

In Modelica, all language keywords and all user-defined names consist of ASCII characters. Additionally, HTML documentation is also ASCII. Therefore, in Modelica the encoding is only relevant for strings and for comments. In order to make it easier for non-Western countries to use Modelica, UTF-8 is certainly a good idea.

The character encoding is very closely related to the tools and therefore all Modelica tool vendors should give their opinions before we make a decision.

@modelica-trac-importer

Comment by dietmarw on 11 Jan 2009 17:02 UTC
Martin Otter wrote:

It would be useful to fix the character set issue in Modelica 3.1.

@modelica-trac-importer modelica-trac-importer added this to the ModelicaSpec3.1 milestone Nov 3, 2018
@modelica-trac-importer

Comment by anonymous on 12 Jan 2009 12:46 UTC
On the mailing list, Dag raised these issues:

I think we have to consider the impact on the entire tool chain. For example, Dymola will copy the variable comments into the generated C code. Unfortunately, UTF-x is not a native character encoding in C, so that would require conversion to 16-bit wide characters. That will not be fully compatible with all UTF-characters. Furthermore, we're talking about a lot of strings, which will double in size; an issue for large models or small computers (e.g. HILS).

From reading this FAQ, it would seem that the only time such conversion would be necessary would be on output and that, in general, the encoded strings could be stored and to a large extent processed in encoded form. So it isn't clear to me (although I could certainly be wrong) that anything would necessarily double.
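The argument that strings can stay in encoded form can be illustrated with a small Python sketch. It compares the byte lengths of one sample variable comment (an assumption, not taken from any tool) in UTF-8 and UTF-16, and checks that the UTF-8 form contains no zero bytes, so it would fit a NUL-terminated C `char` array without widening.

```python
# Sample variable comment, chosen for illustration only.
comment = "Temperature difference ΔT in °C"

utf8 = comment.encode("utf-8")
utf16 = comment.encode("utf-16-le")

# UTF-8 never produces a zero byte for a non-NUL character, so the bytes
# can be stored unchanged in a NUL-terminated C string.
assert 0 not in utf8

# UTF-16 pads every ASCII character with a zero byte.
assert 0 in utf16

# Characters, UTF-8 bytes, UTF-16 bytes.
print(len(comment), len(utf8), len(utf16))
```

For mostly-ASCII comments like this one, the UTF-8 form is only slightly longer than the character count, while the 16-bit form is twice as long.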

Secondly, the simulation produces a result file, which also contains the strings for documentation purposes. Maybe the result file will not grow so much, but what about the tools users have built to read and process result files? All of these tools would have to be re-written to cope with the new format -- and also the old format if you ever want to compare your data with an old simulation run. So, what are the consequences for those tools written in Basic, C, Excel, Matlab, Perl or Python that manipulate result file data?

Well, MATLAB, Perl and Python all support UTF-8 already. It looks like Excel has supported it since at least Office 2002 as well.

Thirdly, what is the extent of the problem? Do we have a concrete list of problems that would be solved by UTF-x?

I think the issue is not what problem UTF-8 solves but what problem ASCII creates.

It seems to me that the issue boils down to this. Either Modelica only supports ASCII or it supports something beyond ASCII. If it isn't limited to just ASCII then it seems to me we have to specify the encoding. Unless somebody has compelling arguments for another encoding then UTF-8 seems to be the best choice. So that boils it down to strictly ASCII or UTF-8 as being the choices. Does anybody really want strict ASCII?

@modelica-trac-importer

Comment by petar on 12 Jan 2009 13:36 UTC
I am in favor of using UTF-8.

One of the problems UTF-8 solves for MathModelica is that it allows our SBML translator tool to work without any problems as SBML uses UTF-8. SBML is a modelling language for systems biology (see sbml.org).

I would also like to comment that different tool vendors might use different storage formats. The storage format for Modelica is outside the scope of the specification, but arguing for efficient formats may be still valid to some extent, however I think it shouldn't get higher priority than ease of use of the language.

@modelica-trac-importer

Comment by HansOlsson on 19 Jan 2009 09:36 UTC
If we standardize on UTF-8 we must be very clear that it is UTF-8, and not some variant thereof; I hope that is clear and we do not get any "modified UTF-8" allowed by the standard.

The common "modified" variant is CESU-8 (i.e. applying UTF-8 encoding to UTF-16) which the standards make clear is only for internal use - and still it is used by several tools (e.g. Java's external interface is almost CESU-8 - and wasn't properly documented as such the first time I looked at it).
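The difference between UTF-8 and CESU-8 only shows up for characters outside the Basic Multilingual Plane. The following Python sketch illustrates it; `cesu8_encode` is a hypothetical helper written for this example (Python has no built-in CESU-8 codec), and the sample character U+1D11E is an arbitrary choice.

```python
def cesu8_encode(text):
    """Hypothetical helper: encode text as CESU-8.

    Characters above U+FFFF are first split into a UTF-16 surrogate pair,
    and each surrogate is then encoded as a 3-byte UTF-8-style sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit bytes for a lone surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(clef.encode("utf-8").hex())  # f09d849e (4 bytes)
print(cesu8_encode(clef).hex())    # eda0b4edb49e (6 bytes)

# A strict UTF-8 decoder rejects the CESU-8 form.
try:
    cesu8_encode(clef).decode("utf-8")
except UnicodeDecodeError:
    print("CESU-8 bytes are not valid UTF-8")
```

For text limited to the Basic Multilingual Plane the two encodings produce identical bytes, which is why the mix-up can go unnoticed for a long time.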

We can also either standardize on leading Byte-order-mark (BOM) or not; in Windows a BOM is almost necessary to ensure that most programs correctly identify the file, and it also allows a smooth upgrade (i.e. files without BOM can be treated as iso-8859-1).

I understand that the BOM can cause issues on Unix, but I don't see any relevant ones: if a program can open a UTF-8 file in an editor, the editor should also open it with a BOM. Note: the shebang "#!" is used to denote shell scripts and is thus not relevant here; the other common issue is concatenating files, which seems odd for .mo files.
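The "smooth upgrade" idea (files with a BOM are UTF-8, files without one are treated as ISO-8859-1) could look roughly like the following Python sketch. The function name `decode_mo_source` is hypothetical, and this is one possible reading of the proposal, not what any Modelica tool is known to implement.

```python
import codecs


def decode_mo_source(raw):
    """Hypothetical helper: decode the bytes of a .mo file.

    A leading UTF-8 BOM marks the file as UTF-8; anything else is
    treated as legacy ISO-8859-1.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode("iso-8859-1")


new_file = codecs.BOM_UTF8 + 'Real x "Länge";'.encode("utf-8")
old_file = 'Real x "Länge";'.encode("iso-8859-1")

# Both variants decode to the same text.
assert decode_mo_source(new_file) == decode_mo_source(old_file)
```

The drawback of this rule is that a UTF-8 file saved without a BOM would be misread as ISO-8859-1, so it only works if the BOM is mandatory.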

It also implies that all internal string-processing routines in tools (and Modelica.Utilities) have to be at least verified for correctness with UTF-8 characters (and in several cases the routines have to be rewritten).
--
We also have to decide where these non-ASCII characters can be used:
documentation - well, that is HTML and should preferably use HTML encodings
names of variables - preferably only in ASCII
descriptions - in MSL they should be in English. Some use HTML encoding for these as well to allow additional features
comments - in MSL they should be in English. Allow HTML here?
strings - ?

We should also be consistent - if UTF-8 encoding for strings taking double memory is an issue, we might think twice about HTML-encoding the documentation (which more than doubles the length).

@modelica-trac-importer

Comment by dietmarw on 5 Jun 2009 08:32 UTC
From the 62nd Modelica Design Meeting (milestone Design62):

  • Problem: It is not defined in which encoding a Modelica library is stored. It is proposed to use UTF-8, but there is no consensus, since the problem is complex.

@modelica-trac-importer

Comment by dietmarw on 24 Nov 2009 23:21 UTC
At the 64th design meeting (milestone Design64) it was agreed to make the necessary changes for usage of UTF-8 for ModelicaSpec3.2. This also means that for MSL3.2 the Modelica.Utilities.Strings package needs to be adapted (see #246).

@modelica-trac-importer modelica-trac-importer added task and removed discussion labels Nov 3, 2018
@modelica-trac-importer

Comment by dietmarw on 20 Jan 2010 07:34 UTC
As of r3287 (trunk) and r3290 (maintenance/3.1) all files containing text are converted to UTF8.
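For readers who need to do a similar conversion on their own libraries, here is a minimal Python sketch. It is not the procedure that was used for the MSL commits above; the function name `to_utf8` and the "already valid UTF-8" heuristic are assumptions made for this example.

```python
def to_utf8(raw):
    """Hypothetical helper: return file content re-encoded as UTF-8.

    Heuristic (an assumption): bytes that already decode as UTF-8 are left
    untouched; everything else is interpreted as ISO-8859-1.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1").encode("utf-8")


legacy = 'Real x "Größe";'.encode("iso-8859-1")
converted = to_utf8(legacy)

print(converted.decode("utf-8"))      # Real x "Größe";
assert to_utf8(converted) == converted  # running it twice is harmless
```

The heuristic can be fooled by ISO-8859-1 text that happens to form valid UTF-8 byte sequences, so converted files should still be reviewed.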

Developers who are unsure if their system supports UTF8 should:

* simply avoid non-ASCII characters in their changes
* make use of HTML entities in the documentation layer (should be done anyway)
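As an illustration of the second point, non-ASCII characters can be turned into numeric HTML character references with Python's standard `xmlcharrefreplace` error handler. The sample sentence and the helper name are assumptions made for this example.

```python
def to_ascii_html(text):
    """Hypothetical helper: replace non-ASCII characters by numeric HTML
    character references, leaving ASCII untouched."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


doc = "<p>Grüße from the documentation layer</p>"  # sample text
print(to_ascii_html(doc))
# <p>Gr&#252;&#223;e from the documentation layer</p>
```

The result is pure ASCII, so it reads identically whether a tool treats the file as ISO-8859-1 or as UTF-8.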

@modelica-trac-importer

Comment by otter on 15 Mar 2010 08:34 UTC
Included in Modelica 3.2 draft specification as decided at the 64th and 65th design meeting:

  • Strings and comments use the UTF8 character set.

  • All other parts of the language, including quoted identifiers, use ASCII.

@modelica-trac-importer

Comment by anonymous on 18 Mar 2010 14:55 UTC
I want to make sure I understand the last statement. Is this saying that the file encoding is UTF-8 throughout the file but that the tools will flag the presence of a non-ASCII UTF-8 character as an error unless it is within double quotes?

@modelica-trac-importer

Comment by otter on 19 Mar 2010 09:32 UTC
The proposal for Modelica 3.2 is:

All Modelica files are UTF-8. Non-ASCII UTF-8 characters are only allowed in:

  • Strings, i.e., "...."
  • Line comments, i.e., // ....
  • General comments, i.e., /* ... */
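A tool could enforce this rule roughly as in the following Python sketch. It is a simplified illustration, not a Modelica lexer: the regular expression only recognizes double-quoted strings, the two comment forms, and single-quoted identifiers, and the function name is hypothetical.

```python
import re

# Simplified token patterns (an assumption; real Modelica lexing has more cases).
_TOKEN = re.compile(
    r'''(?P<string>"(?:\\.|[^"\\])*")'''      # "..." with backslash escapes
    r'''|(?P<line>//[^\n]*)'''                # // line comment
    r'''|(?P<block>/\*.*?\*/)'''              # /* general comment */
    r"""|(?P<qident>'(?:\\.|[^'\\])*')""",    # 'quoted identifier'
    re.DOTALL,
)


def find_disallowed_non_ascii(source):
    """Return (position, character) pairs for non-ASCII characters that
    appear outside strings and comments."""

    def blank(match):
        # Quoted identifiers are matched only so that quotes or slashes
        # inside them are not mistaken for strings or comments; their
        # content must still be ASCII, so it is kept for the check.
        if match.lastgroup == "qident":
            return match.group()
        return " " * len(match.group())

    visible = _TOKEN.sub(blank, source)
    return [(pos, ch) for pos, ch in enumerate(visible) if ord(ch) > 127]


sample = 'model M "Größe"\n  Real x; // Länge\n  Real \'ü\';\nend M;'
print([ch for _, ch in find_disallowed_non_ascii(sample)])  # ['ü']
```

Only the quoted identifier is reported; the description string and the line comment are allowed to contain non-ASCII characters.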

In particular, also quoted identifiers like '#456 abc' must be 7-bit ASCII. The reason for this restriction is:

  • There are several UTF-8 characters that have an identical visual representation, such as "i" (U+0069 or U+2170)

  • There are characters that are constructed by combining several UTF-8 characters, e.g. letters with accents or Katakana characters. This process is not unique and two different sequences of characters might have the same visual appearance. There is a normalization proposal to remove this ambiguity, which means that sequences of characters are transformed in a "normal" form (for details see: http://www.unicode.org/reports/tr15/tr15-32.html).

All this means that UTF-8 character sequences with identical visual appearance might have different underlying byte sequences. The question is now whether two identifiers are identical if they have identical visual appearance or if they have the same byte sequence. A human would prefer the first solution, a computer program the second. The second solution can result in difficult-to-detect errors in a Modelica model.
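Both ambiguities can be demonstrated with Python's standard `unicodedata` module. This is an illustrative sketch only; it makes no claim about how a Modelica tool should compare identifiers.

```python
import unicodedata

# Case 1: two different characters that look alike.
latin_i = "\u0069"    # LATIN SMALL LETTER I
roman_one = "\u2170"  # SMALL ROMAN NUMERAL ONE
assert latin_i != roman_one
# Compatibility normalization (NFKC) maps the Roman numeral to the letter.
assert unicodedata.normalize("NFKC", roman_one) == latin_i

# Case 2: one visible character, two different code point sequences.
precomposed = "\u00e9"  # é as a single code point
combined = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
assert precomposed != combined
print(precomposed.encode("utf-8").hex(), combined.encode("utf-8").hex())
# c3a9 65cc81

# Canonical normalization makes the two sequences equal.
assert unicodedata.normalize("NFC", combined) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combined
```

A byte-wise comparison treats each pair as different names, which is the hard-to-detect error described above.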

Another issue is, if Unicode would be allowed in identifiers, how the ordering of identifiers is organized by a tool. E.g. in the package browser, it is possible to order everything "alphabetically". However, there is no unique "alphabetic" ordering in Unicode, since this depends on the culture (see below).

Since we do not feel confident that we have completely understood the issue and its solution, the proposal is to be conservative and only allow 7-bit ASCII for identifiers in Modelica 3.2. Does someone know whether there are programming languages that allow UTF-8 characters in identifiers? If yes, how is this handled in these languages? E.g. Python supports Unicode in strings, but identifiers are even more restricted than in Modelica (only letters and digits).

The support of UTF-8 in strings also poses problems. Here is a list with possible solution strategies:

  • The relational operators: <, <=, >, >=, <> are defined in the following way for Strings in Modelica 3.1:

For operands of type String, str1 op str2 is for each relational operator, op, defined in terms of the C-function strcmp as strcmp(str1,str2) op 0.

Obviously, this definition is no longer correct for UTF-8 (since strcmp only works on ASCII) and must be changed. A simple solution is to make the comparison according to the numeric equivalent of a Unicode character (this seems to be also how Java compares Unicode strings, see: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#compareTo%28java.lang.String%29). In Python the same rules are used. Additionally, a function "unicodedata.normalize(form,unistr)" http://docs.python.org/library/unicodedata.html#unicodedata.normalize is provided to transform a Unicode string into a normalized form according to different normalization methods (NFC, NFKC, NFD, NFKD) and then make the comparison on this normalized string.
The Unicode consortium defines a collating sequence and different ways to get an ordering depending on culture: http://unicode.org/reports/tr10/ and http://www.unicode.org/Public/UCA/latest/allkeys.txt. So, the only meaningful way seems to be that ordering of strings in Modelica is according to the equivalent numerical value, and that an additional function is provided to transform a string into different normal forms according to culture. The details are not clear to me.
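The "compare by numeric value" option can be sketched in Python, whose string comparison works on code points. The function name `compare_code_points` and the word list are assumptions for this example. The sketch also checks a property of UTF-8 that is relevant here: sorting the encoded bytes (as unsigned values) gives the same order as sorting by code point.

```python
def compare_code_points(str1, str2):
    """Hypothetical helper: return -1, 0 or 1, in the style of strcmp,
    comparing the strings code point by code point."""
    return (str1 > str2) - (str1 < str2)


words = ["apple", "Zebra", "Äpfel"]  # sample data

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

print(by_code_point)  # ['Zebra', 'apple', 'Äpfel']
assert by_code_point == by_utf8_bytes
```

The resulting order is well defined but not "alphabetical" in any culture: all uppercase ASCII letters sort before lowercase ones, and "Äpfel" sorts after both. A culture-aware order would need a collation algorithm such as the one in the Unicode report linked above.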

  • In Modelica.Utilities.Strings several string functions are present. Several of them have an input argument "caseSensitive". If "false", the operation is carried out not taking into account whether the string is "upper" or "lower" case, e.g., "compare, isEqual, count, find, ...". It is not clear to me how to modify these functions for UTF-8 encoding.
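To show why the case-insensitive variants are harder than they look, here is a Python sketch of one possible approach: Unicode case folding via `str.casefold()`. The function `is_equal` is hypothetical and only loosely mirrors the `isEqual` function mentioned above; it is not the MSL implementation.

```python
def is_equal(string1, string2, case_sensitive=True):
    """Hypothetical helper: compare two strings, optionally ignoring case
    by applying Unicode case folding to both sides."""
    if case_sensitive:
        return string1 == string2
    return string1.casefold() == string2.casefold()


# Simple lowercasing is not enough: the German sharp s has no single
# lowercase/uppercase counterpart.
print("Straße".lower() == "STRASSE".lower())               # False
print(is_equal("Straße", "STRASSE", case_sensitive=False))  # True
```

Case folding can change the length of a string ("ß" becomes "ss"), so functions like `find` or `count` that return positions would need additional care.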

@modelica-trac-importer

Comment by otter on 19 Mar 2010 11:57 UTC
A simple solution to the mentioned problems with UTF-8 strings could be to further restrict the usage of UTF-8 in Modelica 3.2 by not allowing it in strings, but only:

  • Description strings ("string_comment" in the grammar)
  • Line comments, i.e., // ....
  • General comments, i.e., /* ... */

If someone also would like to have UTF-8 support in strings, he/she needs to help to precisely define this in the specification and help to adapt the C-functions under Modelica/C-sources/ModelicaStrings.c

I also realized that the whole thing is even worse: in Modelica/C-sources/ModelicaInternal.c there are external functions that operate on file names. In Windows, file names can contain Unicode characters. Some blogs discuss that this is a mess even under Windows, because different sets of file names can be generated depending on the API used. E.g., a file generated with one API can be viewed in Windows Explorer but can no longer be renamed there.

@modelica-trac-importer

Comment by otter on 19 Mar 2010 17:29 UTC
After a discussion with Hans, the conclusion is to include UTF-8 in Modelica 3.2, but only for "viewing", not for any operations (especially not for variables declared as String and not for identifiers). The main reason is "missing time". In future Modelica versions we might improve this, if all issues are clarified, a specification is available and the specification is tested by adapting the Modelica.Utilities C-functions to UTF-8. For Modelica 3.2 this means that UTF-8 is allowed in:

  • Description strings (grammar: "string_comment")
  • Literal strings in annotations (grammar: "STRING" in "annotation").
    This means that documentation can be in UTF-8.
  • Line comments, i.e., // ....
  • General comments, i.e., /* ... */

@modelica-trac-importer

Modified by dietmarw on 30 Apr 2010 07:12 UTC

@modelica-trac-importer

Comment by dietmarw on 21 May 2010 11:09 UTC
Milestone ModelicaSpec3.3 deleted

@modelica-trac-importer modelica-trac-importer removed this from the ModelicaSpec3.3 milestone Nov 3, 2018
@modelica-trac-importer

Comment by hansolsson on 30 Mar 2016 14:13 UTC
Unicode (for strings and comments) and UTF-8 for storing on file systems are already included in Modelica 3.2, based on this ticket. If there are additional suggestions, a new ticket would be better.

Unfortunately I cannot close it with Milestone MLS 3.2.

@modelica-trac-importer

Changelog removed by hansolsson on 30 Mar 2016 14:13 UTC
