Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CQL2 Escaping #717

Closed
jerstlouis opened this issue Jun 28, 2022 · 18 comments · Fixed by #802
Closed

CQL2 Escaping #717

jerstlouis opened this issue Jun 28, 2022 · 18 comments · Fixed by #802

Comments

@jerstlouis
Copy link
Member

jerstlouis commented Jun 28, 2022

Regarding escaping characters, currently CQL2 says:

  • Single quotes ' are repeated to escape them within string literals
  • Backslash \ is used to escape _ and % in LIKE patterns

Comments / Questions:

@pvretano
Copy link
Contributor

@cportele
Copy link
Member

cportele commented Jul 18, 2022

Meeting 2022-07-22: What does SQL do? Also to search for a newline in a string? @pvretano will check implementations.

@cportele
Copy link
Member

cportele commented Aug 1, 2022

Meeting 2022-08-01:

  • @pvretano investigated and it turns out that a double quote is the only safe mechanism that works across database implementations.
  • Use of \n and \t are typically not used in filter expressions, they will be relevant for a full expression language. So we would keep this open for the future revision for this.

We want more feedback on this (@aaime - any experience from your use of CQL for feature queries?). We plan to make a decision in the next meeting on August 15.

@cportele
Copy link
Member

Meeting 2022-08-15: There was no feedback, but we really need feedback to make such a decision at this stage. We will leave this issue open for a few more weeks (likely until after the Code Sprint September 14-16) and then make a decision.

@cportele cportele moved this from In progress to To be drafted in Features Part 3: Filtering / Common Query Language (CQL2) Nov 21, 2022
@cportele
Copy link
Member

Meeting 2022-11-21: @pvretano will create a PR to close this issue. If anyone has an opinion, please add a comment now.

@cportele
Copy link
Member

Meeting 2023-01-30: We will leave the escaping as it is (single quote within a string, \% or \_ in a LIKE string). @pvretano will check, if it creates any issues to explicitly allow \n and \t as escaped line feed / tab characters in strings.

@pvretano
Copy link
Contributor

I checked and the inclusion on the sequence \n and \t is a string will not cause any adverse issues in MariaDB/MySQL, Oracle or Postgres. However, how that sequence is interpreted by these systems is different. Oracle will simply ignore the sequence and leave it embeded in the string without interpretation. If you want to embed an actualy new line in an Oracle string you need to use the CHR(10) function (e.g. 'This is a'||CHR(10)||'new line'). MariaDB/MySQL will simply interpret the sequence and embed an actual new line in the string. Postgres will do the same as long as the string is an extended string.

So, if we allow \n and \t it will mean that some implementations will need to implement string scanning in order to handle these special sequences. In the case of Oracle as the back end, for example, they will need to replace all \n and \t with a concatenated CHR(10) or CHR(9) in order that the string be interpreted correctly. The complete set of ASCII sequences is: CHR(07) '\aBELL, CHR(08)\bBACKSPACE, CHR(09)\tHORZ TAB, CHR(10)\nNEW LINE, CHR(11)\vVERT TAB, CHR(12)\fFORM FEED, CHR(13)\r` CARRIAGE RETURN.

Do we really want to do this? If yes, do we want to handle all the control sequences of just new line and tab?

Do people really compare strings with new lines and tabs in them? In my experience that capability is reserved for TEXT SEARCH capabilities not string comparisons.

Anyone interested please chime in ASAP so that I can create and merge a PR for this before the TC meeting.

@pvretano pvretano moved this from To be drafted to Waiting for input in Features Part 3: Filtering / Common Query Language (CQL2) Feb 19, 2023
@cportele
Copy link
Member

Do people really compare strings with new lines and tabs in them?

That is a valid point. In addition, there are all the different rules depending on the OS (e.g., "\n" vs. "\r" vs. "\r\n"). Adding special rules about control characters is probably unnecessary.

If we would add them, we should support all of them.

@pvretano
Copy link
Contributor

@cportele given that you can encode the control characters directly into a UNICODE character sequence I don't think it is necessary to mess with handling "special" character sequences that represent control characters. I think my vote is to just leave it out.

@cportele
Copy link
Member

@pvretano - Yes, I agree.

@pvretano
Copy link
Contributor

@cportele OK, lets leave this for now and we can discuss closing either a the TC or the next SWG meeting. OK?

@cportele
Copy link
Member

Yes, sounds good to me. If anyone has a good argument for special rules for control characters, please add a comment.

@jerstlouis
Copy link
Member Author

jerstlouis commented Feb 19, 2023

The use case I suggested really is about the ability to express new lines / tabs in CQL2 expressions.
I don't think control characters are very relevant outside of TTYs.

Do people really compare strings with new lines and tabs in them? In my experience that capability is reserved for TEXT SEARCH capabilities not string comparisons.

given that you can encode the control characters directly into a UNICODE character sequence I don't think it is necessary to mess with handling "special" character sequences that represent control characters

@pvretano
I just saw that Tab, New-Line, Line Feed and spaces were added to the BNF alpha production rule.

While technically this could be done for cql2-text, some form of escaping would need to be defined for cql2-json, otherwise you can encode a new line that will break JSON syntax, which has no built-in way to encode multi-line strings.
Also realizing that the double quote needs an escaping mechanism to be defined in JSON as well.
Are escaping mechanisms already defined for double-quotes and/or new lines in cql2-json?
This is probably the main thing that is missing.

A main use case of cql2-text being the ability to include in URL queries, possibly one could use % escape codes, but if an escaping is defined in cql2-json it might make sense to support the same escaping mechanism as well (e.g., \n).
(NOTE: We are also using extended cql2-text for Styles & Symbology CSS-like encodings in [selectors] and parameter values, and not sure whether that will allow cql2-text breaking unto the next line either).

I think it would also make sense to have a separate rule whitespace that includes those white spaces which is then included in the character rule, instead of including it in alpha.

@pvretano
Copy link
Contributor

@jerstlouis new line and tab are control characters according to the UNCODE code blocks ... they are "C0 control codes".

JSON has its own escaping mechanism and I believe it is to use the backslash to escape double quotes. We use the same mechanism when necessary in cql2-text. Specifically in one special case, when a character string is used in a LIKE expression an escape mechanism is required to escape the wildcards. The requirement for LIKE specifically allocates the backslash for that purpose ... but only in this one case.

I guess the question here is whether we need special escape sequences for new line, tab, etc. for general strings?

My feeling is this ...

  1. There is no compelling reason to have an escape sequence for new line or tab (or the other control characters) for general strings since you can simply embed those codes directly into a sequence of UNICODE characters.
  2. Downstream handling of these characters is inconsistent and depends on the backend being used (even within a specific class of backends like RDBMS) so it might be best to simply leave the string as raw as possible and let the CQL2 processor adjust it as necessary for the target back end.

If you need to compare URLs then you do myProperty='http://www.some%20server.com/some/path/images' OR in a LIKE expression myProperty LIKE 'http://www.some\%20server.com/some/path/images%'.

The only place the alpha production is used right now is for the character but I can break out the space and the other control characters into a whitespace production just for clarity. I'll do that in the PR I created for issue #787.

@jerstlouis
Copy link
Member Author

https://www.json.org/json-en.html does say:

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes. A character is represented as a single character string. A string is very much like a C or Java string.

So in CQL2-JSON, newline and double quotes would be escaped with a \.

In CQL2-TEXT, I guess we can assume it to be the raw values, and the transport mechanism can potentially define another mechanism if it does not support the raw values (e.g., the % encodes in a URL).

@pvretano
Copy link
Contributor

@jerstlouis, yes, in JSON you escape with the \.

If we had an XML encoding of a CQL filter, then newlines would be embeded in strings using &#xA (since the string \n has no meaning in XML) and that is my point. Anything we do would be different depending on a number of factors so perhaps the best thing to do is to just use the raw string in the cql-text encoding and then let the CQL processor handle the string as required for downstream processing.

Anyway, lets let others chime in too. I'm not that religious about my current position so if the will of the group is to use the JSON-mechanism in the text encoding then I'm OK with that.

In the mean time I will separate out the whitespaces into a whitespace production just to make the grammar a little more readable.

@cportele cportele moved this from Waiting for input to To be drafted in Features Part 3: Filtering / Common Query Language (CQL2) Feb 27, 2023
@cportele
Copy link
Member

Meeting 2023-02-27:

  • In principle, we will leave the escaping as it is, as decided in the last meeting.
  • We will add a paragraph where we explain that any encoding will use the encoding conventions of the language/platform. See the XML and JSON examples in the discussion. Depending on the backend, some parsing will have to be done.
  • For CQL2 Text we will add a convention to be explicit about encoding/escaping the control characters.
    • CHR(07) \a BELL
    • CHR(08) \b BACKSPACE
    • CHR(09) \t HORZ TAB
    • CHR(10) \n NEW LINE
    • CHR(11) \v VERT TAB
    • CHR(12) \f FORM FEED
    • CHR(13) \r CARRIAGE RETURN.
  • The whitespace rule will be separated.

@pvretano will create a PR or add this to an existing one.

@pvretano
Copy link
Contributor

I will create a new issue to implement the "whitespace" rule. There are a lot of outstanding PRs right now against CQL2. Lets merge them all then then I can break out the whitespace rule ... that is mostly an editorial change just for the sake of clarity. Otherwise we may plunge into merge conflict purgatory! ;)

Features Part 3: Filtering / Common Query Language (CQL2) automation moved this from In review to Done Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

3 participants