Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Addition of UPPER, LOWER, STRLEN, SUBSTR, ... #10

Open
gmantele opened this issue Jul 5, 2018 · 2 comments
Open

Addition of UPPER, LOWER, STRLEN, SUBSTR, ... #10

gmantele opened this issue Jul 5, 2018 · 2 comments

Comments

@gmantele
Copy link
Member

gmantele commented Jul 5, 2018

I've noticed that ADQL-2.1 offers a function lower(VARCHAR)->VARCHAR.
Why not also having the function upper(VARCHAR)->VARCHAR?

Both functions are supported in all DBMS that I have checked (Postgres, Oracle, MySQL, MS-SQLServer, SQLite, H2, ...).

If, yes, why not also supporting other very common string functions like length (or strlen), trim, replace, substr, instr, ...?

I am asking because now that my ADQL Library (v1.4, but still supporting only ADQL-2.0) really supports reserved keywords, a lot of users and implementers are complaining about the fact that it is not any more possible to have UDF named lower or upper (which are reserved SQL keywords).
In some way, it forces implementers to use a different name which will then make all TAP implementations have a different name for the same function. I do not think it is a good idea to encourage such thing.

Anyway, I am now speaking about these common string functions but I am quite sure that other so-called reserved keywords of ADQL-2.x forbid TAP implementers to have some common useful functions. Maybe it would worth checking these names and see if functions associated with these names exist in most/all DBMS and if yes, integrate them in ADQL-2.1 (or 2.2 if already too late for 2.1).

@gmantele
Copy link
Member Author

gmantele commented Jul 5, 2018

@Zarquan answered:

Because lower() works (mostly) with Unicode, but upper() does not work so well for Unicode ?
http://unicode.org/faq/casemap_charprop.html#10

Both are implemented in all DBMS that I have checked. After a quick research with our friend Google, it seems there are issues with unicode characters in most of the DBMS when using upper(...) and even sometimes also with lower(...). From what I have seen, each time it can be fixed by setting the appropriate collate and/or character set in the used database. I think that as long as databases are created in UTF-8, there should not be any important issue.

Besides, it seems it is not a MUST in the ADQL-2.1 draft to be able to support nicely unicode characters (see 4.4.1. LOWER). Already for lower(...), there is a recommendation - and not a requirement - for the algorithm to use when unicode characters are encountered (probably because all DBMS surely have a different - usually unknown - way to deal with such characters):

Since case folding is a nontrivial operation in a multi-encoding world, ADQL requires standard behaviour for the ASCII characters,
and recommends following algorithm R2 described in Section 3.13, "Default Case Algorithms" of The Unicode Consortium (2012)
for characters outside the ASCII set.

So, I am not sure it is the role of ADQL to define how to deal with such special characters. Although I agree, having these functions returning exactly the same string for the same input would be what we want. Then, it would be interested to try with the same string containing several unicode and non unicode characters on all DBMS we want to be supported by ADQL.

There is a science case for having one of the transforms, either upper() or lower(), to transform two strings into the same case before comparing them. There isn't such a strong case for having both upper() and lower().

I agree, but it is still weird to have one but not the other while both are always supported by DBMS. Users will ask at some point why ADQL defines only one of them....and I am not sure they will be very comprehensive if we say it is because some DBMS may not support correctly Unicode characters.

Anyway, if ADQL-2.1 definitely defines only lower(...), I would suggest to add in the IVOA document a comment explaining why only lower(...) and not upper(...)

Maybe, other IVOA members have an opinion on this. I don't know what you planned to speak about at the interop, but maybe you can squeeze this small discussion topic (maybe along with the other possible common functions).

However, as long as someone does the work to prove they are available in some form on all the target platforms then I have no objection to them being included.

I'm sure there will be a few similar name related issues, but in general, yep. For example, I'd prefer to use strlen() rather than length(), to avoid future confusion between string length, array length and byte length.

I agree. strlen seems much more appropriate for language evolutions.

They can be proposed for 2.1, but if they have issues I don't want to hold up 2.1 to include them, so they might get bounced to 2.2 or 2.3.

I completely agree on that. I just ask all of that now because of the delay we generally have before a new version of an IVOA standard ^^ I just wanted to raise this point before 2.1 is released just in case it is a minor thing to add (especially for adding a comment explaining why not upper(...) as well).

@msdemlei
Copy link
Contributor

msdemlei commented Jul 5, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants