Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugs in integration of oniguruma #2663

Open
pkoppstein opened this issue Jul 6, 2023 · 19 comments
Open

bugs in integration of oniguruma #2663

pkoppstein opened this issue Jul 6, 2023 · 19 comments

Comments

@pkoppstein
Copy link
Contributor

pkoppstein commented Jul 6, 2023

[EDIT: erroneous remarks about the "m" FLAG have been corrected.]

There are several outstanding bug reports regarding jq's support for regular expression functionality (e.g. #2562), but
it's beginning to dawn on me that there might be some significant issues with the integration of the Oniguruma library, so I thought it could be helpful to have a central "hub" for identifying these issues so that hopefully they can be more efficiently resolved.

(1) Both the manual and builtin.c envision FLAGS (as in test(_; FLAGS)) as being any of the characters in [gilmnpsx], but in practice, "m", "p", and "s" do not work as advertised in the jq manual:

According to the manual:

      * `m` - Multi line mode (`.` will match newlines)
      * `p` - Both s and m modes are enabled
      * `s` - Single line mode (`^` -> `\A`, `$` -> `\Z`)

But in practice:

      * `m` - . matches \n; ^ matches at start or after \n; $ matches at end or before \n
      * `p` -  only causes `.` to match `\n`
      * `s` - does not seem to have any effect at all

Here are some examples:

$ ./jq -nM '"a\nb" | test("^b"; "p")'
false

$ ./jq -nM '"a\nb" | test("^b"; "s")'
false

(2) There seems to be a general problem in the support for "Extended Groups" (EGs) of the form (?-OFF) or (?-OFF:regex).

Consider for example:

$ jaq -n '"TEst" | test("(?-i:te)st"; "i") '
false
$ gojq -n '"TEst" | test("(?-i:te)st"; "i") '
false
$ jq -n '"TEst" | test("(?-i:te)st"; "i") '
jq: error (at <unknown>): Regex failure: invalid group name <>

(I have not uncovered problems with EGs of the form (?ON) and (?ON-OFF),
where ON is understood to be a string of letters chosen from [imsxWDSPy].)

It's hard to imagine that this is an Oniguruma issue (I have checked at https://github.com/kkos/oniguruma/issues?q=is%3Aissue+extended).

pkoppstein added a commit to pkoppstein/jq that referenced this issue Jul 6, 2023
Several of the FLAGS (as in `test(_; FLAGS)`) simply do not work as advertised, whereas the functionality they promise is available via "Extended Groups".  To make the documentation more useful while allowing for future improvements, it has been revised to focus on what is and ought to be the case.

(Specifically, references to "m" and "s" as possible FLAGS have been dropped, and a concession to reality regarding "p" has been made.)

See Issues jqlang#2562 and jqlang#2663
@annettejanewilson
Copy link

As noted in #2562, the "m" flag does have an effect. The tests above seem to use \b for backspace - I would expect \n for newline. Here's the description I drew up over there, and following that the tests I used to verify it:

Default behaviour
^ matches only at start of string, . does not match newlines
s flag
No effect
m flag
Causes . to match newlines
(?s) option
Causes . to match newlines
(?m) option
Causes ^ to match after newlines
def test($options; $flags):
    def matches($regex):
        (if $options == "" then "" else "(?\($options))" end) as $optstring|
        [match("\($optstring)\($regex)"; $flags)]
    ;
    "Options: \($options), Flags: \($flags)",
    "Does . match newlines? \("\n"|matches(".")|length==1)",
    "Does ^ match at line start? \("\na\nb"|matches("^b")|length==1)",
    ""
;

test(""; "g"),
test("s"; "g"),
test("m"; "g"),
test("sm"; "g"),
test(""; "sg"),
test("s"; "sg"),
test("m"; "sg"),
test("sm"; "sg"),
test(""; "mg"),
test("s"; "mg"),
test("m"; "mg"),
test("sm"; "mg"),
test(""; "smg"),
test("s"; "smg"),
test("m"; "smg"),
test("sm"; "smg")

As I understand it, the reason "s" and "p" do not change the behaviour of ^ and $ is that the Perl mode already turns on this behaviour, that Oniguruma calls "ONIG_OPTION_SINGLELINE" and I shall call "anchors-don't-match-at-newlines". If you read the description in the jq documentation for "s" you will see that it is accurate but surprising. It causes ^ to be interpreted the same as \A, which is already true.

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 6, 2023

@annettejanewilson thanks for that.

jq -nr 'def test($options; $flags):                                                                                                                [28/1842]
    def matches($regex):
        (if $options == "" then "" else "(?\($options))" end) as $optstring|
        [match("\($optstring)\($regex)"; $flags)]
    ;
    try
    if ("\n"|matches(".")|length==1) and ("\na\nb"|matches("^b")|length==1)
    then "Options \"\($options)\" and flags \"\($flags)\" allow . to match newline and ^ to match line start (as opposed to string start)"
    else "FAIL: options \"\($options)\" and flags \"\($flags)\""
    end
    catch "ONIGURUMA EXCEPTION: \(.)";

test(""; ""),
test(""; "g"),
test("s"; "g"),
test("m"; "g"),
test("p"; "g"),
test("sm"; "g"),
test(""; "sg"),
test("s"; "sg"),
test("m"; "sg"),
test("p"; "sg"),
test("sm"; "sg"),
test(""; "mg"),
test("s"; "mg"),
test("m"; "mg"),
test("p"; "mg"),
test("sm"; "mg"),
test(""; "smg"),
test("s"; "smg"),
test("m"; "smg"),
test("p"; "smg"),
test("sm"; "smg"),
test("s"; ""),
test("m"; ""),
test("p"; ""),
test("sm"; "")

produces

FAIL: options "" and flags ""
FAIL: options "" and flags "g"
FAIL: options "s" and flags "g"
FAIL: options "m" and flags "g"
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "g" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "sg"
FAIL: options "s" and flags "sg"
FAIL: options "m" and flags "sg"
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "sg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "mg"
FAIL: options "s" and flags "mg"
Options "m" and flags "mg" allow . to match newline and ^ to match line start (as opposed to string start)
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "mg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "" and flags "smg"
FAIL: options "s" and flags "smg"
Options "m" and flags "smg" allow . to match newline and ^ to match line start (as opposed to string start)
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "smg" allow . to match newline and ^ to match line start (as opposed to string start)
FAIL: options "s" and flags ""
FAIL: options "m" and flags ""
ONIGURUMA EXCEPTION: Regex failure: undefined group option
Options "sm" and flags "" allow . to match newline and ^ to match line start (as opposed to string start)

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 6, 2023

The option "p" doesn't work even though "sm" does, and that's because the test programs above are prepending the options to the regex while Oniguruma doesn't know "p". The "p" is supposed to be in the second argument to match, but there it definitely doesn't work either, nor does "sm".

So the mode argument to match is a bit busted. And Oniguruma does give us a workaround.

@nicowilliams
Copy link
Contributor

Also, fmatch() has a number of leaks.

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 6, 2023

Near as I can tell onig_new() doesn't honor ONIG_OPTION_MULTILINE and ONIG_OPTION_SINGLELINE being set in its option argument.

For Oniguruma to understand the "p" option we need to define USE_POSIXLINE_OPTION:

8262 #ifdef USE_POSIXLINE_OPTION
8263           case 'p':
8264             OPTION_NEGATE(option, ONIG_OPTION_MULTILINE|ONIG_OPTION_SINGLELINE, neg);
8265             break;
8266 #endif

(in src/regparse.c).

@nicowilliams
Copy link
Contributor

So I think in fact we're looking at an Oniguruma bug. I've single stepped through it a bit and I can't quite find the bug yet.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

Oniguruma doesn't know "p".
For Oniguruma to understand the "p" option we need to define USE_POSIXLINE_OPTION.

As I mentioned, using e.g. jq 1.6, the "p" FLAG is recognized, but not as advertised:

"p" does cause . to match \n

$ jq -n '"a\nb" | test("a.b")'
false
$ jq -n '"a\nb" | test("a.b"; "p")'
true

"p" does not cause ^ to behave differently

$ jq -n '"a\nb" | test("^b")'
false
$ jq -n '"a\nb" | test("^b"; "p")'
false

For contrast:

$ jaq -n '"a\nb" | test("^b")'
false
$ jaq -n '"a\nb" | test("^b"; "p")'
true

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 6, 2023

@pkoppstein well, I was trying jq from master :), and the thing about "p" and Oniguruma is about regexes that start with ?(p). The flag "p" in your example is handled by jq adding the relevant values to the option bitmask, and Oniguruma seems to be ignoring that.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jul 6, 2023

@nicowilliams - Yes, we have to be careful about distinguishing between FLAGS and the options allowed in "extended groups". Regarding the latter, lower-case "p" is not one of the options.

The Oniguruma documentation gives these syntax summaries for extended groups:

(?imxWDSPy-imxWDSP:subexp) 
(?imxWDSPy-imxWDSP) 

@nicowilliams
Copy link
Contributor

Regarding the latter, lower-case "p" is not one of the options.

But you can enable it to be one of them by adding -DUSE_POSIXLINE_OPTION to the build's C flags.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

you can enable it to be one of them by adding -DUSE_POSIXLINE_OPTION to the build's C flags.

I would have expected to see something about that on the Oniguruma configuration page: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md

Can you see it mentioned there?

@nicowilliams
Copy link
Contributor

I would have expected to see something about that on the Oniguruma configuration page: https://github.com/kkos/oniguruma/blob/master/doc/SYNTAX.md

Can you see it mentioned there?

No, I found it by code inspection. Clearly it's not meant to be supported. Or maybe it is but they failed to document it. IMO the should just add it unconditionally, but anyways.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jul 7, 2023

(1) Re FLAGS (second argument to test/1):

It may be helpful to remember that:

(a) The jq documentation is muddled in that the descriptions of the "m" and "s" have been incorrectly swapped (that is, "m" should (in accordance with universal agreement) refer to ^ and $; and "s" to .)

(b) jq's "m" FLAG behaves in accordance with the (correct) description of the "p" FLAG (i.e. "s" + "m")

(c) jq's "p" FLAG behaves in accordance with the description: . matches \n

Thus the missing functionality is multiline-only behavior.

(2) Re the "p" Extended Group option (as in (?p:subexp))

Since the Oniguruma documentation is consistent (to the point of being emphatic) that p is NOT an allowed Extended Group letter, I think it would be unwise to make it so.

Could the "p" FLAG be implemented in terms of (?ms) ?

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jul 9, 2023

@nicowilliams @itchyny @annettejanewilson - What are your thoughts about dealing with this for jq 1.7, assuming no fixes will be available for this release? It seems to me that the main options involve some combination of these elements:

  1. Add info about Extended Groups ((?options:regex), etc);
  2. Describe the actual behavior of FLAGS (second arg of test), with brief indications of those FLAGS whose behavior will (or are expected to) change;
  3. Include details about both the "intended" (i.e. ideal or standard) behaviors of FLAGS and their actual behaviors.

@nicowilliams
Copy link
Contributor

@pkoppstein and maybe

  1. fix this code so that the intended behaviors of flags actually get implemented.

Does (1) wed us to Oniguruma? Probably not. We could parse those from any regex if we had to. So I'm certainly happy with (1), but I'd like to do the others too.

@pkoppstein
Copy link
Contributor Author

@nicowilliams asked:

Does (1) wed us to Oniguruma?

No, at least in the sense that the following engines (*) all understand(?i:a):

  • PCRE2
  • PRCRE
  • Go
  • Java 8
  • .NET (C#)
  • Rust

Please note that your point (4) is essentially my point (0). That is, there seems to be a high likelihood that not all the major problems will be resolved in the near term (for whatever reason), so it makes sense to prepare for that eventuality.

How would it be if I reworked what I have to make it clear which sentences are purely descriptive of "buggy" behavior that will hopefully change?

Another thought is that it should be possible to rejigger the FLAGS and use the Extended Group functionality to
overcome most of the major issues. This would definitely be hokey - e.g., mapping the user's "p" FLAG to "m", and the user's "m" flag to (?ms:_) along the lines that @annettejanewilson illustrated. Possible, but maybe not so easy.
What do you think?


(*) Footnote: https://regex101.com/

pkoppstein added a commit to pkoppstein/jq that referenced this issue Jul 11, 2023
Several of the FLAGS (as in test(_; FLAGS)) simply do not work as advertised, whereas the functionality they promise is available via "Extended Groups". To make the documentation more useful while allowing for future improvements, it has been revised to clarify what is and what ought to be the case.

See Issues jqlang#2562 and jqlang#2663
@annettejanewilson
Copy link

annettejanewilson commented Jul 11, 2023

pkoppstein:

(a) The jq documentation is muddled in that the descriptions of the "m" and "s" have been incorrectly swapped (that is, "m" should (in accordance with universal agreement) refer to ^ and $; and "s" to .)

I'm uncertain this is accurate. My current understanding is this:

  1. Most regex libraries/tools (e.g. regex101, Python, Perl, Rust) use "m"/"multi-line" mode to mean that ^ and $ match at every newline (as opposed to only at the start and end of the string), and use "s"/"single-line" mode to mean that . matches \n (when normally it wouldn't).
  2. Some don't. For example, Ruby uses "m" to mean that . matches \n.
  3. In particular, Oniguruma uses ONIG_OPTION_MULTILINE to mean that . matches \n and ONIG_OPTION_SINGLELINE to mean that ^ and $ DO NOT match at newlines. Note that this is not a simple inversion. ONIG_OPTION_SINGLELINE is the negation of Perl's multi-line mode.
  4. The jq documentation accurately states that the m flag causes . to match newlines.
  5. The jq documentation accurately, if tersely and misleadingly, states that the s flag causes ^ and $ to behave the same as \A and \Z and only match at the beginning and end of the string. This is misleading because they already have that behaviour - the flag has no effect.
  6. The jq documentation accurately states that the p flag has the effects of both the m and s flags. As the s flag has no observable effect, this means it has the same effect as the m flag.
  7. The jq documentation notes that "certain" flags may be specified within the regex, but does not elaborate. This obscures the surprising fact that (?m) and (?s) inside the regex mean different things. (?s) means what the m flag does, while (?m) means the opposite of what the s flag does.

If my understanding here is correct, I don't think it's accurate to say the descriptions of "m" and "s" have been incorrectly swapped in the jq documentation. The documentation is misleading but it's not incorrect with respect to jq's behaviour.

nicowilliams, earlier in the conversation:

Near as I can tell onig_new() doesn't honor ONIG_OPTION_MULTILINE and ONIG_OPTION_SINGLELINE being set in its option argument.

This surprises me, and I wonder whether this is borne of confusion over the meaning and polarity of Oniguruma's options. When I first looked into this I got confused myself. At present I don't believe I've seen evidence that Oniguruma has a bug, but I am definitely feeling like I need to be more thorough in my testing, because it's clear there are areas I've come to different conclusions, and I want to be more certain I'm not working on faulty assumptions.

pkoppstein:

What are your thoughts about dealing with this for jq 1.7, assuming no fixes will be available for this release?

I don't have recommendations right now, I'm too uncertain until I've had time to research further. I'm a bit wary of inadvertently making the docs worse by compounding misunderstandings.

@pkoppstein
Copy link
Contributor Author

pkoppstein commented Jul 12, 2023

@annettejanewilson - Yes, "universal" was a bit of an overstatement,
but in my defense it agrees not only with the explicit terminology used at
https://regex101.com, but also with the tacit agreement of all the regex
engines listed there (PCRE2, PCRE, ECMAScript, Python, Golang, Java8,
.NET, and Rust), in that when given the FLAGS "m" and "s", they behave
in accordance with the description.

At about the same time as you were writing your most recent post, I was
revising the text in manual.yml and would ask that you review it again, with a view to:

(1) retaining the distinction between:

a) FLAGS as they appear in the second arg of jq's test, or as they
appear on the RHS of the delimiters of regexs (e.g. /regex/m); and

b) single-letter options as used in "Extended Groups" (e.g. (?i:regex)).

(2) for the terms "multi-line" and "single line", and the FLAGS "m" and
"s", using the conventions of https://regex101.com, which agree
with your point (1):

  • the "multi-line" FLAG is "m" - "^ and $ match start/end of line"
    e.g. "a\nb"~/^b/m

  • the "single-line" FLAG is "s" - "dot matches newline"
    e.g. "a\nb"~/a.b/s

(3) avoiding any reference to the words used in Oniguruma's defined symbols.

In other words, I would like to focus on what jqMaster DOES and what we anticipate jq should do in the future if there is a discrepancy.

Thanks.

p.s. p.s. I thought it would be useful if we could agree on a table showing the
various FLAG values, and the behavior that actually goes with them:

  • In the header, the "m" and "s" refer to the FLAG as defined above;

  • In the body, "yes" means the behavior of the engine is in accordance with the descriptions above.

                      FLAG          FLAG          Ext Gp option    Ext Gp option      Ext Gp option   
                       m              s               i                m                  s
                   -----------   -------------   --------------    -------------      -------------   
example           "a\nb"~/^b/m   "a\nb"~/a.b/s     "A"~(?i:a)       "a\nb"~(?m:^b)     "a\nb"~(?s:a.b) 
                                                                                                      
PCRE2                 yes            yes             yes             yes                yes           
PCRE                  yes            yes             yes             yes                yes           
ECMAScript            yes            yes             error           error              error         
Python                yes            yes             error           error              error         
Golang                yes            yes             yes             yes                yes
Java8                 yes            yes             yes             yes                yes
.NET                  yes            yes             yes             yes                yes           
Rust                  yes            yes             yes             yes                yes           

jaq                   true           true            true            true               true           
                                                                                                      
jq                    false!         false!          true            true               true          

@BarnacleBob
Copy link

I just wanted to explicitly add there is currently no way to change the behavior of ^/$ even if the documentation is also confusing and/or wrong, and that i would think is definitely a bug somewhere.

1504:0:karl@Amalthea:~$ ~/bin/jq --version
jq-1.7
1505:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "")'
false
1506:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "m")'
false
1507:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "s")'
false
1508:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^b"; "p")'
false
1509:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "")'
true
1510:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "m")'
true
1511:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "s")'
true
1512:0:karl@Amalthea:~$ echo -e "a\nb\nc" | ~/bin/jq -sR 'test("^a"; "p")'
true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants