Unicode support #35
Comments
I wonder if a |
Issue #40 was something similar (except in that case, there was no bug, it was just something that would have been nice to clean up the code a bit). Maybe a column like |
I think the name ideally would include some context or explanation of why the issue needs to wait, not just that it is waiting. So |
It seems actually, we are likely to have quite a few of these given the nascent state of compiler support… See also #35… although I think I have a solution for this, am working up a PR. |
I am trying to implement this Unicode support in such a way (Using Sadly it’s not quite as easy as:
integer, parameter :: CK = merge(tsource=selected_char_kind('ISO_10646'), &
                                 fsource=selected_char_kind('DEFAULT'),   &
                                 mask=selected_char_kind('ISO_10646') /= -1)
integer, parameter :: CDK = selected_char_kind('DEFAULT') ! Default character kind, needed
                                                          ! for format statements etc.
|
Update: I just wanted to leave this update and request comments (@jacobwilliams) on the following: gfortran has some decent Unicode support, although one challenge is dealing with the exception messages that print the offending character(s) when working with Unicode. The trick in my previous post means that, in order to gracefully fall back on the default encoding, CK may end up being either DEFAULT or ISO_10646. This means we can’t just create a set of overloaded interfaces, because if the compiler doesn’t support ISO_10646 the specific procedures will have identical interfaces and the compiler will raise a compile time error. You can, however, convert safely from DEFAULT to ISO_10646, but this doesn’t really help that much. The conversion in the other direction will map each ISO_10646 character onto some (potentially invalid) default character. If the character exists in both ASCII/default and ISO_10646 then all is well. If not, it is somewhat random which character in the DEFAULT set the ISO_10646 character is mapped to, potentially even an invalid ASCII character. For the sake of the |
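To make the conversion behaviour described above concrete, here is a minimal sketch. It assumes only standard intrinsic assignment between character kinds; the program and variable names are made up for illustration and are not taken from the library. It round-trips a pure ASCII string through the CK kind, which is the case where both directions are expected to be safe.

```fortran
program kind_conversion_sketch
    implicit none

    ! Kind selection with fallback: ISO_10646 if available, else DEFAULT.
    integer, parameter :: CK = merge(tsource=selected_char_kind('ISO_10646'), &
                                     fsource=selected_char_kind('DEFAULT'),   &
                                     mask=selected_char_kind('ISO_10646') /= -1)
    integer, parameter :: CDK = selected_char_kind('DEFAULT')

    character(kind=CDK, len=5) :: original, round_trip
    character(kind=CK,  len=5) :: wide

    original   = 'hello'
    wide       = original   ! DEFAULT -> CK: safe widening (a plain copy if CK == CDK)
    round_trip = wide       ! CK -> DEFAULT: only reliable for characters both kinds share

    if (round_trip /= original) error stop 'ASCII round trip changed the string'

    write(*,'(a,i0,a,i0)') 'CK = ', CK, ', CDK = ', CDK
    write(*,'(a)') 'ASCII round trip preserved the string'
end program kind_conversion_sketch
```

For characters outside the default set, the result of the narrowing assignment is processor dependent, so this sketch deliberately does not test that case.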
I’m making good progress here, although I doubt it will work due to a number of compiler errors with gfortran. This can safely be moved into the
But the big showstopper is:
I will get this to a point where I think it will work once the compiler vendors fix their bugs, but for the meantime please move this to blocked by vendor bug, @jacobwilliams. |
Also, I just checked: the latest Intel compiler still doesn’t support this. |
I DID IT!!!! 💥
The above issue is due to gfortran 4.9.2-0 being run on the travis-ci server, whereas you need 4.9.2_1 to get correct behavior. well… kinda, except for a bunch of gfortran bugs. @jacobwilliams take a look at the comparison of my feature branch and your master: zbeekman/json-fortran@jacobwilliams:master...feat-UTF-8-support-issue-35 Currently, this:
If you want, you could try to make a feature branch and then pull in these changes so you can clean it up and hack at it… although I’m certainly happy to help with this, just let me know what problems or concerns you have. 😎 |
OK, it seems that the version of gfortran you use will highly influence whether or not the code behaves as expected. 4.9.2_1 does WORK and PASS THE TESTS. Sorry for the confusion but it seems that the version running on the travis-ci server doesn’t work, so running the example program produces spurious errors. I’m going to call this done for now. Take a look and tell me what you think. zbeekman/json-fortran@jacobwilliams:master...feat-UTF-8-support-issue-35 |
I looked into upgrading the version of gfortran on the travis-ci server to 4.9.2-1 (it’s currently 4.9.2-0) but can’t seem to make this happen. This feature, however, should be completely ready to go, just waiting on compiler bug fixes. In particular it does safely fall back on default character encoding when ISO_10646 support is unavailable and passes all the tests. I’ll try to put together some tests which use characters that are in ISO 10646 and not in ASCII to further explore the gfortran substring conversion issue. |
As you can see here: https://travis-ci.org/zbeekman/json-fortran/builds/52560669 the latest build of this branch is passing the tests on travis-ci. It is still possible that if you actually use unicode characters not in the ASCII set that you will have problems, however, this is due to compiler bugs. I propose that we add some preprocessing
|
I'm OK with proceeding along those lines... I haven't had a chance to look at this yet. I just wanted to make sure it didn't break any of my stuff. If the default behavior is exactly the same as it was before, it should be OK. |
Well, short of having users manually edit the library code (change something like The other issue is that right now client code will need to set the kind in character/string literal constants, which is a giant pain...all library calls with string literal constants need to prepend a
It is possible to avoid this, and I don't think we should merge the UTF-8 support feature branch until this has been implemented. It isn't hard to write wrapper procedures for any procedure that takes character arguments, and then create an overloaded interface, however doing this, while supporting compilers that haven't implemented ISO_10646 character kinds (ahem, Intel) in such a way that the library will fall back on just using the 'DEFAULT' character kind presents a problem. When the compiler DOES support ISO_10646 The only way I can think of to avoid this, and be able to provide the overloaded interface when necessary is to have the build scripts first do some system introspection to determine if the compiler supports ISO_10646 and then based on the result, choose to conditionally include or ignore the overloaded interfaces with the wrapper procedures. |
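The build-time introspection mentioned above could be as small as a probe program that the build script compiles and runs before configuring the library. The following is only a sketch of that idea: the program name and the printed messages are hypothetical, and it assumes the compiler maps `stop 1` to a non-zero exit status (common, but processor dependent).

```fortran
program probe_iso_10646
    implicit none

    ! selected_char_kind returns -1 when the named character set is unsupported.
    if (selected_char_kind('ISO_10646') == -1) then
        write(*,'(a)') 'ISO_10646: unsupported'
        stop 1   ! non-zero exit status on most compilers (processor dependent)
    end if

    write(*,'(a)') 'ISO_10646: supported'
end program probe_iso_10646
```

A build script could check either the exit status or the printed line, and only then define whatever preprocessor symbol enables the overloaded interfaces.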
OK, I just did some more digging, and two of the three gfortran bugs I reported and listed above weren’t actually bugs! (PEBCAK) The only remaining bug is the one concerning substrings of parameters spuriously changing kind from UCS4/ISO_10646 to DEFAULT, which has been confirmed. Fortunately there is a simple workaround for this bug, which I have already implemented. All that’s left is to make the changes discussed above regarding preprocessing and overloaded interfaces. |
I just did a count, and ~73 procedures have dummy arguments which are character strings that may be ISO 10646 if supported or DEFAULT if not. Fortunately, many of these procedures are private, which means that far fewer interfaces may need to be created to allow client code to pass either DEFAULT or ISO 10646 character actual arguments. Now to investigate which of the ~73 are public, and whether or not there is an easy way to create the required interfaces. I just rebased the Look at the test files to see how the client code currently needs to deal with strings; it is quite ugly, hence the need for some overloaded interfaces. FYI I didn’t bother updating the example1 and example2 programs in the README.md so the travis-ci build currently fails for this branch, but all local tests with gfortran and Intel indicate that there are no problems and the library runs and passes all tests. |
Will try to look at it this weekend if I can. |
OK, I think I have a somewhat sane approach to wrap the public procedures that take character arguments with some help from preprocessor macro expansion. Once that is done, a lot of changes should be able to be rolled back…e.g. all the tests should be able to be reverted and run without leading JFC_’…’ prepended to all the strings. I’m hoping to have a decent looking PR later today |
OK. I hope it's not too complicated for a poor Fortran man to understand!! Ha ha! Also: I was wondering (as long as we're going to have to use the preprocessor anyway), if we might want to consider other preprocessor flags for some of the workarounds that are in there for various compiler bugs. Not for minor things, but for things that can impact the runtime efficiency (for example, there is one workaround for a bug in gfortran allocatable character strings that means additional allocations are taking place, which can slow things down). The idea being, if your compiler fully supports the standard, you will be able to compile the fastest version. But, if you can't compile it otherwise, you can enable some of the workarounds with one or more preprocessor flags. What do you think? |
I think that if profiling indicates that there is indeed a discernible performance difference then it could be a good idea, but it does increase the complexity and means that some duplicate code needs to be maintained. |
Agree. It would only be for something that was really slowing it down. |
If they are gfortran bugs that are slowing it down, we can just do something like this:
#ifdef __GFORTRAN__
! gfortran bug workaround code here
#else
! Intel compiler (and NAG!) code here
#endif
and rename the file extension to
I think Intel has been mucking around with their predefined preprocessor symbols in recent releases and I’m not confident in our ability to detect Intel in this manner, so hopefully there are no slow Intel bug workarounds… |
btw I don’t think my approach is too complicated, even for “poor fortran man.” The issue that the preprocessing solves is allowing overloaded interfaces when ISO 10646 is supported and requested, so that both normal ‘default’ strings (especially literal constants) and ISO_10646 strings may be passed to the public API. If it’s not compiled with ISO_10646 support then we can keep the wrapper routines in the code, but the generic/overloaded interfaces must no longer include the wrappers, since now they are indistinguishable from the originals and this will cause a compile time error. |
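A rough sketch of the wrapper idea described above, not the actual library code: the module, procedure, and macro names (`string_api_sketch`, `trimmed_length`, `USE_UCS4`) are invented for illustration. It assumes the file is run through the preprocessor (for example with gfortran's `-cpp` option) and that `USE_UCS4` is defined only when the compiler supports ISO_10646.

```fortran
module string_api_sketch
    implicit none
    private
    public :: CK, trimmed_length

#ifdef USE_UCS4
    integer, parameter :: CK = selected_char_kind('ISO_10646')
#else
    integer, parameter :: CK = selected_char_kind('DEFAULT')
#endif
    integer, parameter :: CDK = selected_char_kind('DEFAULT')

    interface trimmed_length
        module procedure trimmed_length_ck
#ifdef USE_UCS4
        ! Only listed when CK /= CDK; otherwise the two specifics are ambiguous.
        module procedure trimmed_length_default
#endif
    end interface trimmed_length

contains

    ! The real routine: works on the library character kind CK.
    function trimmed_length_ck(str) result(n)
        character(kind=CK, len=*), intent(in) :: str
        integer :: n
        n = len_trim(str)
    end function trimmed_length_ck

    ! Wrapper: always compiled, but part of the generic only under USE_UCS4.
    function trimmed_length_default(str) result(n)
        character(kind=CDK, len=*), intent(in) :: str
        integer :: n
        character(kind=CK, len=len(str)) :: tmp
        tmp = str                      ! DEFAULT -> CK conversion by assignment
        n = trimmed_length_ck(tmp)
    end function trimmed_length_default

end module string_api_sketch

program wrapper_demo
    use string_api_sketch, only: CK, trimmed_length
    implicit none
    ! A plain literal and a CK-prefixed literal both resolve in either build.
    if (trimmed_length('abc  ') /= 3)    error stop 'default literal failed'
    if (trimmed_length(CK_'abc  ') /= 3) error stop 'CK literal failed'
    write(*,'(a)') 'ok'
end program wrapper_demo
```

The point of the design is that client code can keep passing ordinary string literals; only the generic interface changes between the two builds, while the wrapper itself stays in the source.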
I guess the ideal solution should account for all these cases:
Do you envision it working like this? So, for Case 3, there are Unicode and non-unicode interfaces? For Case 2, I guess they can still be there, but you're not forced to use them? And for case 1, it reduces to what we have now? |
I completely agree and have structured it so that it conforms to the three cases you’ve outlined. This is the functionality that I am working towards. Regarding unicode vs non-unicode interfaces: This is what I am currently doing:
The only difficulty I foresee is how to treat the case where there are
|
Looking at the tests here: master...zbeekman:feat-UTF-8-support-issue-35 you can see that very few changes are required to start using unicode support as I currently have it configured. All that’s left is to:
|
Now that PR #84 has been merged should we close this? Or at least remove the bug tag and the in progress tag? (and potentially leave open with a blocked by vendor issue?) |
Closing this. Great job! Follow-on unicode issues can be separate tickets. |
The library does not currently support non-escaped Unicode characters (as required by the JSON standard). Presumably, this could be fixed by defining the character kind as:
However, this is not currently supported on the Intel compiler. So for now, I will leave this as a known bug. See also the comments at the end of 0066f7d