Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

malformed boolean query for valid regex search that includes () #1500

Closed
2 of 7 tasks
claremacrae opened this issue Jan 8, 2023 · 25 comments · Fixed by #2759
Closed
2 of 7 tasks

malformed boolean query for valid regex search that includes () #1500

claremacrae opened this issue Jan 8, 2023 · 25 comments · Fixed by #2759
Assignees
Labels
scope: query logic Boolean combinations of filters - and, or, not scope: regex filters Anything to do with regular expression text searches type: bug Something isn't working

Comments

@claremacrae
Copy link
Collaborator

claremacrae commented Jan 8, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports didn't find any similar reports.

Expected Behavior

That the following search should work:

( description regex matches /(buy|order|voucher|lakeland|purchase|\spresent)/i ) OR ( path includes Home/Shopping )

Similar issue

This is similar to #1068, but the workaround there was to ignore tasks blocks in template files, whereas in this case the search is meant to be valid.

Current behaviour

It gives:

Tasks query: malformed boolean query -- Invalid token (check the documentation for guidelines)

Steps to reproduce

Paste the following

( description regex matches /(buy|order)/ ) OR ( path includes Shopping )

And preview the results. The error will be seen.

Note: This will work, which is why I believe that the problem is in the boolean-parsing code:

description regex matches /(buy|order)/

Which Operating Systems are you using?

  • Android
  • iPhone/iPad
  • Linux
  • macOS
  • Windows

Obsidian Version

1.1.9

Tasks Plugin Version

1.22.0

Checks

  • I have tried it with all other plugins disabled and the error still occurs

Possible solution

It seems that the regex-parsing code is finding brackets inside search strings.

A workaround is to break down the regular expression to:

( description includes buy ) OR ( description includes buy )  OR ( path includes Shopping )

But with my example above, with more query strings, that gets a bit onerous.

@claremacrae claremacrae added type: bug Something isn't working scope: query logic Boolean combinations of filters - and, or, not labels Jan 8, 2023
@Cito
Copy link
Contributor

Cito commented Feb 1, 2023

I just noticed a similar quirk when trying to filter for a filename that included parens: The filter filename includes (something) gives that error, while the same filter with path instead of filename works. I would expect them to behave the same.

@claremacrae
Copy link
Collaborator Author

The relevant source files are:

Halo @esm7, I imagine any time you might have for tasks would go on CSS stuff right now, but if you had any insights on this in the future, they would be appreciated.

@claremacrae
Copy link
Collaborator Author

claremacrae commented Feb 1, 2023

Interesting... This suggests that the basicBooleanRegexp regex is fine and not picking up the paren in the description search:

https://regex101.com/r/29YxYr/1

@claremacrae
Copy link
Collaborator Author

I'm on mobile currently so can't debug this. If anyone were to work on it, I would suggest adding more info to the error messages along the way.

@claremacrae
Copy link
Collaborator Author

Note to self:
Another candidate line to look at is:

return line.replace(/\(([^()]+)\)/g, '("$1")');

Which adds quotes around the component expressions. And may be applied to the entire line???

I wonder whether the quotes should only be added around the individual filters that are obtained by the results from applying basicBooleanRegexp?

I also wonder what will happen if there are " inside any search strings?

@esm7
Copy link
Contributor

esm7 commented Feb 2, 2023

Interesting issue. I can work on it if you prefer (probably next week), or just point out that what I'd do next to understand the issue is to add some debug prints to see how the line looks like after the preprocessing (e.g. the line.replace in line 115 of BooleanField you quoted above) and then print out the components of the postfix expression that is generated from the query.

@claremacrae
Copy link
Collaborator Author

Thank you for the reply @esm7 - all helpful.

If you can work on it, that would be great - but absolutely no pressure. 😄

@claremacrae claremacrae self-assigned this Feb 5, 2023
@claremacrae
Copy link
Collaborator Author

I'm having an initial look at this.

@claremacrae
Copy link
Collaborator Author

I added this test:

        it('should work with filter containing parenthesis', () => {
            // This tests the fix for
            //  https://github.com/obsidian-tasks-group/obsidian-tasks/issues/1500
            //  malformed boolean query for valid regex search that includes ()
            const filter = createValidFilter(
                '( description regex matches /(buy|order)/i ) OR ( path includes Home/Shopping )',
            );

            testWithDescription(filter, 'buy stuff', true);
        });

It failed because filter was undefined, as expected from this bug.

I added this debug output:

Index: src/Query/Filter/BooleanField.ts
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/Query/Filter/BooleanField.ts b/src/Query/Filter/BooleanField.ts
--- a/src/Query/Filter/BooleanField.ts	(revision 8f3612552c5f8605e36fc0e6971e9b54b430c44b)
+++ b/src/Query/Filter/BooleanField.ts	(date 1675617479202)
@@ -112,7 +112,11 @@
         // Prepare the query to be processed by boon-js.
         // Boon doesn't process expression with spaces unless they are surrounded by quotes, so replace
         // (due today) by ("due today").
-        return line.replace(/\(([^()]+)\)/g, '("$1")');
+
+        const result_old = line.replace(/\(([^()]+)\)/g, '("$1")');
+        console.log(`preprocessExpression: in  ${line}
+                    : ->  ${result_old}`);
+        return result_old;
     }
 
     /*

It produced this output for the query in the new test:

  console.log
    preprocessExpression: in  ( description regex matches /(buy|order)/i ) OR ( path includes Home/Shopping )
                        : ->  ( description regex matches /("buy|order")/i ) OR (" path includes Home/Shopping ")

      at BooleanField.preprocessExpression (src/Query/Filter/BooleanField.ts:117:17)

So the double-quotes have been put inside the regular expression, instead of around the whole description...

Other examples of the output look like this, where you can see that the " are placed nicely inside the `().:

  console.log
    preprocessExpression: in   (description includes #context/location1) OR (description includes #context/location2 ) OR (  description includes #context/location3 ) OR   (  description includes #context/location4 )
                        : ->   ("description includes #context/location1") OR ("description includes #context/location2 ") OR ("  description includes #context/location3 ") OR   ("  description includes #context/location4 ")

      at BooleanField.preprocessExpression (src/Query/Filter/BooleanField.ts:117:17)

I had thought that this would be fixable by applying the basicBooleanRegexp regex and adding " around the expressions it extracted - but it turns out that it only finds the last of any AND/OR etc on the line... rather than all of them.

That's all I've got for now. I can't see a general fix for this, given that " and ( and ) can appear in any text query, any number of times - as can AND etc.

@claremacrae claremacrae removed their assignment Feb 5, 2023
@esm7
Copy link
Contributor

esm7 commented Feb 7, 2023

Indeed the preprocessing code is too simplistic to handle these internal regular expressions.
I believe that a direction for a fix would be to match pairs of (...) only if followed by the end of the line or a valid operator, e.g. (...)$, (...) AND, (...) OR, (...) XOR (with extra whitespaces allowed).

@claremacrae
Copy link
Collaborator Author

Yes I do agree with that. I couldn’t get my head around how to implement it when there are nested filters - the matching brackets could be a long way apart.

@TomWol
Copy link

TomWol commented Apr 9, 2023

I just noticed a similar quirk when trying to filter for a filename that included parens: The filter filename includes (something) gives that error, while the same filter with path instead of filename works. I would expect them to behave the same.

I am seeing the same issue with filenames containing parentheses:
On a separate line, filename does not include 2023-01-06 (Friday).md throws an error, but path does not include 2023-01-06 (Friday).md does not.

But using path does not include 2023-01-06 (Friday).md inside a filter, such as (path does not include 2023-01-06 (Friday).md) AND (path does not include test.md) will also throw an error.

I was really confused as to why a working Tasks query from months ago (granted, I didn't test it since February) suddenly didn't work anymore...

@claremacrae
Copy link
Collaborator Author

I am seeing the same issue with filenames containing parentheses: On a separate line, filename does not include 2023-01-06 (Friday).md throws an error, but path does not include 2023-01-06 (Friday).md does not.

Moved to #1852

@claremacrae
Copy link
Collaborator Author

Indeed the preprocessing code is too simplistic to handle these internal regular expressions.
I believe that a direction for a fix would be to match pairs of (...) only if followed by the end of the line or a valid operator, e.g. (...)$, (...) AND, (...) OR, (...) XOR (with extra whitespaces allowed).

Just leaving this here, in case it is useful to my future self....

I recently learned about Lazy quantifiers, from:
https://www.youtube.com/watch?v=B9H0TyApBtU

More succinct reference:
https://stackoverflow.com/a/1919995

More detailed reference:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/quantifiers-in-regular-expressions?redirectedfrom=MSDN

JavaScript article:
https://javascript.info/regexp-greedy-and-lazy

I am wildly speculating that esm7's suggestion (quoted above) in combination with lazy quantifiers to match the fewest number of times, may help with this.

Caveat: I have not checked the code to see if it is already using lazy quantifiers...

@claremacrae
Copy link
Collaborator Author

@claremacrae
Copy link
Collaborator Author

I've been thinking about this a lot, and have an alternative idea that I think will be easier to reason about and to explain to users - and easier on developers too.

(Also, I remain of the feeling that complex regular expressions usually mask bugs... such as this recently discovered example: 1be22e3 - so would prefer not to create any more 🤣 )

New proposal

Change the pre-processing step so that instead of adding quotes it does the following:

1 Split the line at the Boolean boundaries

Split it the operators and include all adjacent parens.

So for example, divide this:

NOT ( (tags include #context/loc1) OR (tags include #context/loc2) OR (tags include #context/loc3) )

Into this:

NOT ( (
tags include #context/loc1
) OR (
tags include #context/loc2
) OR (
tags include #context/loc3
) )

2 Give some kind of name or symbol to each of the non-operator lines

So the example becomes something like:

NOT ( (
f1
) OR (
f2
) OR (
f3
) )

and save a lookup table to map f1, f2 and f3 to the corresponding filter

3 Reassemble the line and give it to boon-js

NOT ( ( f1 ) OR ( f2 ) OR ( f3 ) )

And this is what gets parsed to boon-js - so there are not quotes and no () to deal with.

4 Adjust the search code

This kind of adds another level of indirection, so the Boolean search code will need to lookup which filter to apply.

5 Help users with lines that still fail to parse

There will still be cases that go wrong, for example:

(
description includes ) AND (
) OR (
description includes ) NOT (
)

will be split as something like

(
description includes
) AND (
) OR (
description includes
) NOT (
)

6 If there are still parsing errors after this fix...

Add a help message telling the user:

  • You may have mismatched brackets
  • Or you may have a text filter in this Boolean line with text such as ) AND ( that looks like part of the boolean logic.
    • To fix this, reword the query.
    • If it's using includes or does not include, write AND, OR, NOT and XOR in lower case
    • If it's a regular expression, write it differently, for example [A]ND
  • Or you may have a search string that ends in ) and that is confusing Tasks
    • So surround your search text with ' ...' or " ...."
    • Implementation notes to remember......
      • This won't affect regex searches because of their trailing slash or flag
      • First allow users to surround includes and does not include searches with ' or " (and ignore them)
        • Some users do this anyway, and don't understand why their searches don't find anything
  • Or you may have a custom filter or grouping line that ends in )
    • Erm, I don't have an answer for that yet

@claremacrae claremacrae added the scope: regex filters Anything to do with regular expression text searches label Jul 24, 2023
@esm7
Copy link
Contributor

esm7 commented Jul 25, 2023

If I understand correctly, your suggestion is equivalent to mine here, in its basic notion of gluing the parenthesis to their adjacent operators.
I think it steps up the way I phrased it, in the sense that it gives a more structured form to the same basic idea.
However, it also suffers from the same weakness, which is in step 1: since we do these splits and searches using very simplistic forms, and not actual syntax trees, it is difficult to differentiate between parenthesis and operator-like text that appears inside expressions.
It can be done, and I think you took it to a more practical level. However, I think that any idea that is based on simple textual preprocessing and not actual syntax trees/grammers, will inevitably be a compromise, and as such, may not be a worthwhile investment.
In other words, if we say the current parsing code is at 95%, and there's a complicated path to bring it to 97% and a completely different complicated path to bring it to 100%, we rather aim for the 100% and not take the code deeper into cryptic-regex-land :)

@claremacrae
Copy link
Collaborator Author

Yes, my proposal goes:

  1. from every Boolean with any filter containing ( or ) is broken
    • and there is no workaround
  2. To every Boolean with any filter containing ) AND (, ) OR ( etc is broken
    • and the workaround is to rephrase your filter prevent matching those phrase, such as lower-casing them or splitting them up in some way
    • I don’t think the regex would be too complicated

If it’s not hard to implement, I think it more goes from 80% or 90% to > 99%… As in, I doubt many real world search strings will contain those Boolean operators.

There is another option I am considering. I really want to provide syntax highlighting in Tasks code blocks, and so I have been looking at how that works. This has involved looking at code that others have written to parse programming languages for syntax highlighting.

There is a small chance I may eventually understand enough about the CodeMirror parsing mechanisms to come up with a better parsing solution for this issue too.

@claremacrae
Copy link
Collaborator Author

PS You’re correct, I hadn’t spotted that I was restating your earlier suggestion. Thanks.

@claremacrae
Copy link
Collaborator Author

Hi @aubreyz,

Moving #1852 (comment) to here...

@claremacrae
Copy link
Collaborator Author

Hi @aubreyz,

Re the following causing error messages because of this issue, when the file name contains ( or )...

```tasks
(description includes [[{{query.file.filenameWithoutExtension}}]]) OR (description includes [[{{query.file.filenameWithoutExtension}}|)
```

Is there any way to recode this to avoid this problem (or is this a different problem given the report that it is fixed)?

Sure.

```tasks
filter by function \
    task.description.includes('[[{{query.file.filenameWithoutExtension}}]]') || \
    task.description.includes('[[{{query.file.filenameWithoutExtension}}|')
```

@aubreyz
Copy link

aubreyz commented Oct 29, 2023

Hi @aubreyz,

Sure.

```tasks
filter by function \
    task.description.includes('[[{{query.file.filenameWithoutExtension}}]]') || \
    task.description.includes('[[{{query.file.filenameWithoutExtension}}|')
```

Wow, thank you so much Clare (didn't work with the slashes though - but great if all on one line, at least with the current non-beta plugin version). \ not either

@claremacrae
Copy link
Collaborator Author

claremacrae commented Oct 29, 2023

Hi @aubreyz,

Wow, thank you so much Clare (didn't work with the slashes though - but great if all on one line, at least with the current non-beta plugin version). \ not either

(There aren't supposed to be any beta versions... if you can see one, could you please send a link?)

That's weird. I just:

  • copied it multiple ways ('select + copy and paste' - click on the 'Copy' icon on the code block in my comment above....
  • pasted it both ways (Cmd+V and Shift+Cmd+V)

and it worked fine each time.

If you're inclined to investigate:

  • please check you're definitely on Tasks 5.0.0
  • and that there are no space characters after the \ - the \ needs to be precisely that last character on the lines..

If it still doesn't work, and you have time, a new bug report, including the error message, would be much appreciated.

@aubreyz
Copy link

aubreyz commented Oct 29, 2023

Hi @aubreyz,

Oops my bad. I just realised that particular vault had not updated to 5.0.0...
No beta I know of :) - just covering all bases in case there is a secret one...
Sorry about the noise

@claremacrae
Copy link
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
scope: query logic Boolean combinations of filters - and, or, not scope: regex filters Anything to do with regular expression text searches type: bug Something isn't working
Projects
Status: 🎉 Released
Development

Successfully merging a pull request may close this issue.

5 participants