Izaak Beekman edited this page Mar 16, 2015 · 2 revisions

Unicode how-to

This page describes how to use json-fortran with Unicode support.

Table of contents

  1. Enabling Unicode
  2. System introspection and build configuration
  3. Why bother with system introspection?
  4. Ensuring your code works reliably and portably when linking and compiling against json-fortran
  5. Character named constants and Gfortran bug 65141
  6. Proper specification of character kinds (types)
  7. Dealing with Unicode characters that need to be embedded in source code

Enabling Unicode

At the time of this writing, the only compiler that has been tested with json-fortran and also supports ISO 10646/UCS4 characters is Gfortran >= 4.9.2. In fact, the Fortran 2008 standard does not require compilers to support UCS4/Unicode, and merely specifies how this support is to be implemented, should a compiler vendor choose to do so.

System introspection and build configuration

To attempt a build of the library with Unicode/UCS4 support enabled using the build script in the project’s root directory, invoke it like this:

$ ./build.sh --enable-unicode [<additional build options>]

If building with CMake, set the ENABLE_UNICODE cache variable to TRUE using cmake-gui, ccmake or from the command line:

$ cmake <path-to-project-root> -DENABLE_UNICODE:BOOL=TRUE

At this point, the build performs some system introspection by trying to compile and run the test_iso_10646_support.f90 program. This program prints some informational messages to stderr. If ISO 10646 characters are NOT supported, it exits with a return value of 2; if the compiler DOES support UCS4/ISO 10646, it prints ‘UCS4_SUPPORTED’ to stdout and exits with a zero status. Through this mechanism, both the build.sh and CMake builds will build the library with Unicode support if it is available and without it otherwise.
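The probe described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual test_iso_10646_support.f90; the program name and the message wording are assumptions, and only the exit value of 2 and the ‘UCS4_SUPPORTED’ marker are taken from the description above.

```fortran
! Sketch of a UCS4 support probe; NOT the project's actual test program.
program ucs4_probe_sketch
  use, intrinsic :: iso_fortran_env, only: error_unit, output_unit
  implicit none

  ! selected_char_kind returns a negative value when the named
  ! character set is not supported by the compiler.
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

  if (ucs4 < 0) then
    write(error_unit, '(a)') 'ISO 10646 character kind is not supported.'
    ! How a stop code maps to the process exit status is processor dependent.
    stop 2
  end if

  write(error_unit, '(a,i0)') 'ISO 10646 character kind = ', ucs4
  write(output_unit, '(a)') 'UCS4_SUPPORTED'
end program ucs4_probe_sketch
```

A build script can then check the exit status, or look for the marker on stdout, to decide whether to add the Unicode preprocessor definitions.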

Why bother with system introspection?

By testing the compiler’s ISO 10646 support, we relieve ourselves of the burden of tracking which systems and compilers support certain features and let the build determine this information for us. So, should a future version of Intel’s Fortran compiler start supporting ISO 10646, then in theory no modifications will be required to start building the library with Unicode support using ifort. Similarly, if the user builds the library with a custom compiler passed with the --compiler <custom_compiler> option and requests Unicode support, then the proper preprocessor definitions will be added to the compilation flags to allow building with Unicode, if supported.

Ensuring your code works reliably and portably when linking and compiling against json-fortran

The following considerations must be made to ensure that code using the json-fortran library is robust, reliable and portable.

Character named constants and Gfortran bug 65141

This bug causes problems whenever a substring of a character named constant (parameter) of kind=selected_char_kind('ISO_10646') is referenced. Because of this, you should ALWAYS declare character named constants as 'DEFAULT' character kind by either not specifying the kind at all or specifying the kind as CDK. Since named constants may not be passed as actual arguments corresponding to intent(out) or intent(inout) dummy arguments, it is completely safe to use them with both Unicode enabled and disabled builds of the json-fortran library. Any intent(in) named character constants passed to the json-fortran library will be internally converted to UCS4, if necessary, by the library.
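As a minimal sketch of this recommendation, the program below declares named constants of default character kind in the two ways mentioned above. The constant default_kind is a hypothetical local stand-in for the library's CDK, used so that the sketch compiles without json-fortran; the constant names are assumptions for illustration only.

```fortran
program named_constant_sketch
  implicit none

  ! Hypothetical local stand-in for the library's CDK constant.
  integer, parameter :: default_kind = selected_char_kind('DEFAULT')

  ! Kind omitted: the constant is of default character kind.
  character(len=*), parameter :: path_a = 'hello world.Hebrew'

  ! Kind given explicitly as the default kind: equivalent to the above.
  character(kind=default_kind, len=*), parameter :: path_b = 'hello world.Hebrew'

  ! Substring references on default-kind named constants, which is the
  ! case recommended in the text.
  print '(a)', path_a(1:5)
  print '(a)', path_b(7:11)
end program named_constant_sketch
```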

Proper specification of character kinds (types)

To ensure that your code will work with both the Unicode enabled and disabled versions of json-fortran, all character variables passed as actual arguments to json-fortran API calls MUST specify the character kind as CK, EXCEPT the file names of json files to be opened, created or modified. CK is a named integer constant exported from the json-fortran library’s json_module.F90. If a name clash occurs, entity renaming in the use statement may be employed to circumvent the problem, for example by making CK available under the local name JSON_FORTRAN_CHARACTER_KIND: use json_module, JSON_FORTRAN_CHARACTER_KIND => CK
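The renaming can be sketched as below. To keep the example compilable on its own, a hypothetical module stand_in_json_module takes the place of json_module and simply defines CK as the default character kind; with the real library, the use statement would name json_module instead. Note that Fortran rename syntax places the local name on the left of `=>` and the module's name on the right.

```fortran
! Hypothetical stand-in for json_module so the sketch compiles on its own.
module stand_in_json_module
  implicit none
  integer, parameter :: CK = selected_char_kind('DEFAULT')
end module stand_in_json_module

program rename_sketch
  ! Rename syntax: local-name => name-in-module.
  use stand_in_json_module, only: JSON_FORTRAN_CHARACTER_KIND => CK
  implicit none

  ! The user's own, otherwise clashing, CK.
  integer, parameter :: CK = 42
  character(kind=JSON_FORTRAN_CHARACTER_KIND, len=:), allocatable :: cval

  ! The renamed constant can also be used as a literal kind prefix.
  cval = JSON_FORTRAN_CHARACTER_KIND_'value'
  print '(a)', cval
  print '(a,i0)', 'local CK = ', CK
end program rename_sketch
```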

In the event that the library is not compiled with ISO 10646/UCS4 support, CK corresponds to selected_char_kind('DEFAULT') and kind('a'); if Unicode IS supported, then CK will be different. For Gfortran, CK = 4 when Unicode support is available and requested, and CK = 1 otherwise.
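To see which kind values a particular compiler uses, a small standalone program along these lines can be compiled and run. It does not use json-fortran; the program and constant names are assumptions for illustration.

```fortran
program char_kind_report
  implicit none
  integer, parameter :: default_kind = selected_char_kind('DEFAULT')
  integer, parameter :: ucs4_kind    = selected_char_kind('ISO_10646')

  print '(a,i0)', "selected_char_kind('DEFAULT')   = ", default_kind
  print '(a,i0)', "kind('a')                       = ", kind('a')
  print '(a,i0)', "selected_char_kind('ISO_10646') = ", ucs4_kind

  ! A negative value means the character set is not supported.
  if (ucs4_kind < 0) print '(a)', 'No ISO 10646 character kind is available.'
end program char_kind_report
```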

Should you neglect to specify the character kind of your character variables passed to the json-fortran library, you will not encounter errors when compiling against the Unicode DISABLED version of the library, but will encounter compile time errors when trying to compile against the Unicode ENABLED version of the library. For this reason, it is STRONGLY recommended that you always specify the character kind of character variables passed to json-fortran.

Since default characters may always be safely converted to Unicode characters (but not vice-versa), all public APIs with intent(in) character arguments will accept both 'DEFAULT' and CK character arguments. This is especially convenient when specifying the ‘path’ of a json object within a json structure, or a string to be written as the value of a json object. Often one wants to use a character literal constant for these tasks, and the Fortran syntax for specifying the kind of these character literal constants can be both awkward and verbose. Consider, for example, fetching a value from a json structure, locating it using its path:

use json_module, only: json_file, CK

type(json_file) :: json
character(kind=CK,len=:),allocatable :: cval

! Open a json file with the json_file object, 'json', and do some other stuff

call json%get('hello world.Hebrew', cval)

JSON names and paths are allowed to contain Unicode characters, but in this example this is not the case. If the interfaces had not been overloaded to accept both default and Unicode intent(in) character kinds, then the get method would have to be called as:

call json%get(CK_'hello world.Hebrew', cval)

Note, also, that the kind of cval was specified as CK.

Dealing with Unicode characters that need to be embedded in source code

Currently, Gfortran does not support UTF-8 encoded source files with non-ASCII characters. To embed non-ASCII Unicode characters in character literal constants within the source code, the C/Java-style \uXXXX escape sequence may be used in combination with the -fbackslash gfortran flag. When using this flag, be careful that backslashes elsewhere in the code are properly escaped if they could otherwise be interpreted as Unicode or control characters.

For example, assume that the ‘hello world’ translation table being read in the example above used an astronomical symbol for earth, ‘♁’, rather than the word ‘world’. In that case, the most direct way of addressing the Hebrew translation would be to use the -fbackslash flag in combination with the C/Java-like Unicode character encoding:

! variable declaration, file read in, etc.
call json%get(CK_'hello \u2641.Hebrew', cval)

Upon compilation the Unicode character for the astronomical symbol of earth, ♁, will be substituted for \u2641. Also, please note that, in this case, the CK_ preceding the character literal constant (the json address) is MANDATORY since this character string contains a character which is NOT representable in the ASCII character set.
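The substitution can be checked without json-fortran using a small standalone program such as the following sketch. It assumes a Gfortran that supports the ISO 10646 character kind and that the file is compiled with -fbackslash; the program name, file name and the local constant ucs4 (standing in for CK in a Unicode enabled build) are assumptions for illustration.

```fortran
! Compile with, for example: gfortran -fbackslash backslash_sketch.f90
program backslash_sketch
  use, intrinsic :: iso_fortran_env, only: output_unit
  implicit none

  ! Local stand-in for CK in a Unicode enabled build of the library.
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
  character(kind=ucs4, len=:), allocatable :: path

  ! With -fbackslash, \u2641 becomes the single character U+2641.
  path = ucs4_'hello \u2641.Hebrew'

  ! Request UTF-8 encoding on standard output before writing UCS4 text.
  open(output_unit, encoding='UTF-8')
  write(output_unit, '(a)') path
  write(output_unit, '(a,i0)') 'length in characters: ', len(path)
end program backslash_sketch
```

Without -fbackslash the escape sequence is kept as literal text, so comparing the reported length between the two builds is a quick way to confirm that the flag took effect.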