Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode support #35

Closed
jacobwilliams opened this issue Jan 8, 2015 · 28 comments
Closed

Unicode support #35

jacobwilliams opened this issue Jan 8, 2015 · 28 comments
Labels

Comments

@jacobwilliams
Copy link
Owner

The library does not currently support non-escaped Unicode characters (as required by the JSON standard). Presumably, this could be fixed by defining the character kind as:

integer,parameter :: CK = SELECTED_CHAR_KIND('ISO_10646')

However, this is not currently supported on the Intel compiler. So for now, I will leave this as a known bug. See also the comments at the end of 0066f7d

@zbeekman
Copy link
Contributor

I wonder if a blocked by vendor bug—or something similar—label/wafle.io column might appropriate for issues like this one, so that we can make it clear that there isn’t anything to be done, for the time being…

@jacobwilliams
Copy link
Owner Author

Issue #40 was something similar (except in that case, there was no bug, it was just something that would have been nice to clean up the code a bit). Maybe a column like tabled for now or something?

@zbeekman
Copy link
Contributor

I think the name ideally would include some context or explanation of why the issue needs to wait, not just that it is waiting. So vendor issue dependency or blocked by vendor issue or something a little more specific and descriptive.

@zbeekman
Copy link
Contributor

It seems actually, we are likely to have quite a few of these given the nascent state of compiler support… See also #35… although I think I have a solution for this, am working up a PR.

@zbeekman
Copy link
Contributor

I trying to implement this Unicode support in such a way (Using merge) that it will work if the compiler allows it, and fail over to system default character if ISO_10646 isn’t available. It’s a bit trickier than I anticipated, especially because gfortran and/or the standard doesn’t allow ISO_10646 in case statements… so I am trying to figure out how to convert bits to ascii/default.

Sadly it’s not quite as easy as:

    integer,parameter :: CK = merge(tsource=selected_char_kind('ISO_10646'), &
                                    fsource=selected_char_kind('DEFAULT'),   &
                                    mask=selected_char_kind('ISO_10646') /= -1)
    integer,parameter :: CDK = selected_char_kind('DEFAULT') ! Default character kind needed
                                                             ! for format stmnts etc.

@zbeekman
Copy link
Contributor

update

I just wanted to leave this update, and request comments (@jacobwilliams) on the following:

gfortran has some decent unicode support, although one challenge is dealing with the exception messages that print offending character(s) when working with unicode. The trick in my previous post means that in order to gracefully fall back on default encoding CK may end up being DEFAULT or ISO_10646. This means we can’t just create a set of overloaded interfaces, because in the event the compiler doesn’t support ISO_10646 the interfaces will be the same and will raise a compile time error. You can, however, convert safely from DEFAULT to ISO_10646 but this doesn’t really help that much. The conversion in the other direction will map the ISO_10646 character onto some (potentially invalid) default character. If the character exists in ascii/default and ISO_10646 then all is well. If not, it will be somewhat random what character in the DEFAULT set the ISO_10646 character is mapped to, potentially even invalid ascii characters. For the sake of the throw_error() routine, IF we have unicode support, I think it’s best, for the time being, to just do the unsafe conversion when the offending character(s) are to be printed in the error message. At a latter date we can try to solve the interface dependencies created by moving throw_error() to use Unicode inputs, unless absent on the system.

@zbeekman
Copy link
Contributor

I’m making good progress here, although I doubt it will work due to a number of compiler errors with gfortran. This can safely be moved into the Blocked by Vendor Bug category. Here are some related bugs:

But the big showstopper is:

I will get this to a point where I think it will work once the compiler vendors fix their bugs, but for the meantime please move this to blocked by vendor bug, @jacobwilliams.

@zbeekman
Copy link
Contributor

Also, just checked, latest intel compiler still doesn’t support this.

@zbeekman
Copy link
Contributor

I DID IT!!!! 💥

EDIT (5:05 PM EST): I need to take a look at the tests… they actually don’t all appear to be working as intended, I suspect I forgot to do something in terms of character kind in the json_example.f90 file…. will update when I know more…

The above issue is due to gfortran 4.9.2-0 being run on the travis-ci server, whereas you need 4.9.2_1 to get correct behavior.

well… kinda, except for a bunch of gfortran bugs.

@jacobwilliams take a look at the comparison of my feature branch and your master: zbeekman/json-fortran@jacobwilliams:master...feat-UTF-8-support-issue-35

Currently, this:

  • Builds using ifort 15.x AND gfortran 4.9 HEAD
    • Intel doesn’t support ISO 10646 character kinds yet, so the Intel build ends up being effectively the same as your master build
    • gfortran 4.9 nominally supports ISO 106464 character kinds, but I trust it about as far as I can throw it. One major problem is that substring index has a nasty habit of converting the kind (actual data and sometimes meta-data) of ISO 10646 back to ASCII/DEFAULT. This WILL be problematic should anyone actually try to use this for UCS characters supported by ISO 10646 but NOT included in the ASCII character set
  • PASSES ALL TESTS, despite the problems mentioned above for BOTH Intel and GFortrant, building using CMake AND the build.sh script. (However build.sh won’t build the library/example using ifort… I don’t know if this is a FoBiS.py bug, a bug in build.sh or what)

If you want, you could try to make a feature branch and then pull in these changes so you can clean it up and hack at it… although I’m certainly happy to help with this, just let me know what problems or concerns you have.

😎

@zbeekman
Copy link
Contributor

OK, it seems that the version of gfortran you use will highly influence whether or not the code behaves as expected. 4.9.2_1 does WORK and PASS THE TESTS. Sorry for the confusion but it seems that the version running on the travis-ci server doesn’t work, so running the example program produces spurious errors. I’m going to call this done for now. Take a look and tell me what you think. zbeekman/json-fortran@jacobwilliams:master...feat-UTF-8-support-issue-35

@zbeekman
Copy link
Contributor

I looked into upgrading the version of gfortran on the travis-ci server to 4.9.2-1 (it’s currently 4.9.2-0) but can’t seem to make this happen. This feature, however, should be completely ready to go, just waiting on compiler bug fixes.

In particular it does safely fall back on default character encoding when ISO_10646 support is unavailable and passes all the tests. I’ll try to put together some tests which use characters that are in ISO 10646 and not in ASCII to further explore the gfortran substring conversion issue.

@zbeekman
Copy link
Contributor

As you can see here: https://travis-ci.org/zbeekman/json-fortran/builds/52560669 the latest build of this branch is passing the tests on travis-ci. It is still possible that if you actually use unicode characters not in the ASCII set that you will have problems, however, this is due to compiler bugs.

I propose that we add some preprocessing #ifdef TRY_UNICODE_SUPPORT conditional compilation directives so that we can merge my feat-UTF-8-support-issue-35 branch with your master branch, but have UTF support turned off by default. This will allow adventurous users (and maybe gfortran developers) to test the UTF-8 support, while falling back on the default ASCII implementation for those who don’t want to test the new feature. This will also allow users to start modifying their client code so that once the compilers support ISO_10646 characters they can seamlessly just add -DTRY_UNICODE_SUPPORT to their compilation flags to start using ISO_10646 characters in their json files.

#ifdef is supported by both ifort and gfortran and worst case scenario, users can use the C preprocessor to perform the code transformation, if for some reason they are using a compiler that won’t do it with the -D flag… What do you think, @jacobwilliams?

@jacobwilliams
Copy link
Owner Author

I'm OK with proceeding along those lines... I haven't had a chance to look at this yet. I just wanted to make sure it didn't break any of my stuff. If the default behavior is exactly the same as it was before, it should be OK.
I generally try to avoid # compiler directives if I can help it...but maybe in this case there is no other way.

@zbeekman
Copy link
Contributor

Well, short of having users manually edit the library code (change something like logical, parameter :: USE_ISO_106464_IF_PRESENT = .true.) or only implementing it for the CMake build, I can't think of any other good way (than #ifdefs) to conditionally use ISO_10646 characters when the compiler DOES support them.

The other issue is that right now client code will need to set the kind in character/string literal constants, which is a giant pain...all library calls with string literal constants need to prepend a JCK_ to the string, i.e., 'file[1].entry' --> JCK_'file[1].entry'. I don't like this because:

  1. it's a pain in the neck to write code like that
  2. it will break current client code because they'll need to go in and make this modification everywhere.

It is possible to avoid this, and I don't think we should merge the UTF-8 support feature branch until this has been implemented. It isn't hard to write wrapper procedures for any procedure that takes character arguments, and then create an overloaded interface, however doing this, while supporting compilers that haven't implemented ISO_10646 character kinds (ahem, Intel) in such a way that the library will fall back on just using the 'DEFAULT' character kind presents a problem. When the compiler DOES support ISO_10646 JCK--"json-fortran character kind"--the constant holding the kind for ISO_10646 will be different the JCDK, "json-fortran character default kind". So when an overloaded interface is provided, the signatures of the two procedures that the generic resolves to will be different; one accepts JCK characters and one accepts JCDK characters. However, when the compiler does NOT support ISO_10646, both JCK and JCDK now are the same, the 'DEFAULT' character kind(=1). If a generic/overloaded interface is provided, the two specific procedures are no longer distinguishable, and a compile time error is generated.

The only way I can think of to avoid this, and be able to provide the overloaded interface when necessary is to have the build scripts first do some system introspection to determine if the compiler supports ISO_10646 and then based on the result, choose to conditionally include or ignore the overloaded interfaces with the wrapper procedures.

@zbeekman
Copy link
Contributor

zbeekman commented Mar 3, 2015

OK, I just did some more digging, and two of the three gfortran bugs I reported and listed above weren’t actually bugs! (PEBCAK) The only remaining bug is the one concerning substrings of parameters spurious changing kind from UCS4/ISO_10646 to DEFAULT, which has been confirmed. Fortunately there is a simple work around for this bug, which I have already implemented. All that’s left is to make the changes discussed above regarding preprocessing and overloaded interfaces.

@zbeekman
Copy link
Contributor

zbeekman commented Mar 5, 2015

I just did a count, and ~73 procedures have dummy arguments which are character strings that may be ISO 10646 if supported or DEFAULT if not. Fortunately, many of these procedures are private, which means that it is possible that far few interfaces need to be created to allow client code to pass DEFAULT character actual arguments or ISO 10646 actual arguments. Now to investigate, which of the ~73 are public, and whether or not there is an easy way to create the required interfaces.

I just rebased the feat-UTF-8-support-issue-35 branch onto the current master branch and the comparison may be seen here: https://github.com/jacobwilliams/json-fortran/compare/master...zbeekman:feat-UTF-8-support-issue-35?expand=1

Look at the test files to see how the client code currently needs to deal with strings—quite ugly, hence the need for some overloaded interfaces.

FYI I didn’t bother updating the example1 and example2 programs in the README.md so the travis-ci build currently fails for this branch, but all local tests with gfortran and intel indicate that there are no problems and the library runs and passes all tests.

@jacobwilliams
Copy link
Owner Author

Will try to look at it this weekend if I can.

@zbeekman
Copy link
Contributor

zbeekman commented Mar 7, 2015

OK, I think I have a somewhat sane approach to wrap the public procedures that take character arguments with some help from preprocessor macro expansion. Once that is done, a lot of changes should be able to be rolled back…e.g. all the tests should be able to be reverted and run without leading JFC_’…’ prepended to all the strings. I’m hoping to have a decent looking PR later today

@jacobwilliams
Copy link
Owner Author

OK. I hope it's not too complicated for a poor Fortran man to understand!! Ha ha!

Also: I was wondering (as long as we're going to have to use the preprocessor anyway), if we might want to consider other preprocessor flags for some of the workarounds that are in there for various compiler bugs. Not for minor things, but for things that can impact the runtime efficiency (for example, there is one workaround for a bug in gfortran allocatable character strings that means additional allocations are taking place, which can slow things down).

The idea being, if your compiler fully supports the standard, you will be able to compile the fastest version. But, if you can't compile it otherwise, you can enable some of the workarounds with one or more preprocessor flags. What do you think?

@zbeekman
Copy link
Contributor

zbeekman commented Mar 7, 2015

I think that if profiling indicates that there is indeed a discrenable performance difference then it could be a good idea, but it does increase the complexity and means that some duplicate code needs to be maintained.

@jacobwilliams
Copy link
Owner Author

Agree. It would only be for something that was really slowing it down.

@zbeekman
Copy link
Contributor

zbeekman commented Mar 7, 2015

if they are gfortran bugs that are slowing it down, we can just do something like this:

# ifdef __GFORTRAN__
     ! gfortran bug workaround code here
# else
     ! Intel compiler (and NAG!) code here
# fi

and rename the file extension to .F90 instead of .f90

I think Intel has been mucking around with their predefined preprocessor symbols in recent releases and I’m not confident in our ability to detect Intel in this manner, so hopefully there are no slow intel bug work arounds…

@zbeekman
Copy link
Contributor

zbeekman commented Mar 7, 2015

btw I don’t think my approach is too complicated, even for “poor fortran man.” The issue that the preprocessing solves is allowing overloaded interfaces when ISO 10646 is supported and requested so normal ‘default’ strings (especially literal constants) ISO_10646 strings may be passed to the public API. If it’s not compiled with ISO_10646 support then we can keep the wrapper routines in the code, but the generic/overloaded interfaces must no longer include the wrappers, since now they are indistinguishable from the originals and this will cause a compile time error.

@jacobwilliams
Copy link
Owner Author

I guess the ideal solution should account for all these cases:

  1. Compiler doesn't support Unicode so I can't use it anyway.
  2. Compiler does support Unicode, but I don't want to use it.
  3. Compiler does support Unicode and I do want to use it.

Do you envision it working like this? So, for Case 3, there are Unicode and non-unicode interfaces? For Case 2, I guess they can still be there, but you're not forced to use them? And for case 1, it reduces to what we have now?

@zbeekman
Copy link
Contributor

zbeekman commented Mar 8, 2015

I completely agree and have structured it so that it conforms to the three cases you’ve outlined. This the functionality that I am working towards.

Regarding unicode vs non-unicode interfaces:
I want the interfaces to remain as easy to use as possible to the end user. Right now it appears to the end user that the interfaces are the same for all three cases and unchanged from the current interfaces. (Caveat concerning intent(out) character arguments discussed further on…)

This is what I am currently doing:

  • overloading all public interfaces with intent(in) string arguments so that the user can pass unicode or non-unicode string arguments
  • Internally all strings (i.e. the object name and the value of string objects) are stored in unicode, if available and requested since all default characters are representable as unicode
  • publicizing the json-fortran character kind to client code. This is the kind for ISO 10646 if present and requested, or the default kind otherwise

The only difficulty I foresee is how to treat the case where there are intent(out) deferred length strings used in the get methods when the variable is a string. Here, there are two possibilities:

  • expose the json-fortran character string kind to client code, but also convert to ‘default’ character strings when those are passed as intent(out) arguments, and hope for the best. If all the characters in the string are representable in the ‘default’ character set gfortran seems to have no trouble converting them.
    • pros: client code requires NO modification to link against unicode enabled json-fortran
    • cons: If client code does start to work with characters that are in the unicode character set but doesn’t update the kind of their variables that they pass as intent(out) actual string variables, then the characters that aren’t representable will get garbled. A warning message could be added to the library, to detect when the text is not faithfully converted back to ASCII
  • Force the user to always pass in actual string arguments of the correct kind. This would mean exposing the json-fortran character kind to client code, which is a good idea anyway.
    • pros: no (possibly silent) runtime conversion errors
    • cons: all client code SHOULD be updated so that character variables passed to the json-fortran library are the json-fortran character kind advertised in the library, but only code linking against the library with unicode support turned on MUST update the kind of their character variables passed as intent(out) arguments.
    • This is my preferred method, and whether or not the compiler supports unicode or unicode is requested, client code written like this will always avoid conversion errors and compile successfully. If client code isn’t updated to use the publicized character kind, and they try to link against unicode enabled json-fortran they’ll get a compile time error.

@zbeekman
Copy link
Contributor

zbeekman commented Mar 9, 2015

Looking at the tests here: master...zbeekman:feat-UTF-8-support-issue-35 you can see that very few changes are required to start using unicode support as I currently have it configured.

All thats left is to:

  1. Add cleaner/more user friendly option to select unicode or non-unicode in the build systems
  2. Finish updating the documentation. TBH, I have been updating some as I go, but it’s a little bit lacking.

@zbeekman
Copy link
Contributor

Now that PR #84 has been merged should we close this? Or at least remove the bug tag and the in progress tag? (and potentially leave open with a blocked by vendor issue?)

@jacobwilliams
Copy link
Owner Author

Closing this. Great job! Follow-on unicode issues can be separate tickets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants