Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ctrl+Shift+F with regex show wrong results #131507

Closed
huangqinjin opened this issue Aug 24, 2021 · 13 comments · Fixed by #160665
Closed

Ctrl+Shift+F with regex show wrong results #131507

huangqinjin opened this issue Aug 24, 2021 · 13 comments · Fixed by #160665
Assignees
Labels
bug Issue identified by VS Code Team member as probable bug insiders-released Patch has been released in VS Code Insiders search Search widget and operation issues upstream-issue-linked This is an upstream issue that has been reported upstream verified Verification succeeded

Comments

@huangqinjin
Copy link

huangqinjin commented Aug 24, 2021

Issue Type: Bug

  1. Open folder which contains only one file main.cpp:
#include <iostream>
#include <string>
  1. Close all files and search all files using regex #include\s*<\w+>, show the following result which is wrong:
#include <iostream>
#include <iostream>
  1. Open main.cpp and search again, show correct result:
#include <iostream>
#include <string>

VS Code version: Code 1.59.1 (3866c35, 2021-08-19T11:56:46.957Z)
OS version: Windows_NT x64 10.0.19042
Restricted Mode: No

System Info
Item Value
CPUs Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 x 1992)
GPU Status 2d_canvas: enabled
gpu_compositing: enabled
multiple_raster_threads: enabled_on
oop_rasterization: enabled
opengl: enabled_on
rasterization: enabled
skia_renderer: enabled_on
video_decode: enabled
vulkan: disabled_off
webgl: enabled
webgl2: enabled
Load (avg) undefined
Memory (System) 15.89GB (7.27GB free)
Process Argv --crash-reporter-id 9f97c5f0-6cf3-4b21-a725-c5b519b77ecd --crash-reporter-id 9f97c5f0-6cf3-4b21-a725-c5b519b77ecd
Screen Reader no
VM 0%
test.mp4
@roblourens roblourens added bug Issue identified by VS Code Team member as probable bug search Search widget and operation issues labels Aug 24, 2021
@ashidaharo
Copy link

ashidaharo commented Sep 8, 2021

BurntSushi/ripgrep#1983 (comment)
...So, as an outsider, I don't know if this should be the behavior of ripgrep, but at least I don't think there are that many people who use VSCode who want to match \W, \D, \s, etc. "in multiline mode" to me. I checked the behavior of the search in the editor (i.e. probably the behavior of the regular expression engine on the Javascript engine side) and it seems that only \W matches \n in this case. (On an unrelated note, this seems very strange to me, why doesn't \D or \s contain \n?)
So when VSCode calls ripgrep, if the regular expression does not explicitly include '\n' (or even if it explicitly includes \W to match the editor's behavior), it seems more appropriate to use ripgrep with '--no-multiline'. However, this behavior is absurd for those who intentionally want to include \n, so the VSCode UI needs to have an option to explicitly select whether to search in multiline mode.
I don't know whether the output of the ripgrep side is inappropriate for the case of explicit "multiline mode" that explicitly includes \n, etc. or whether the VSCode side should handle it.

@ashidaharo
Copy link

Well... I've been thinking about this again while looking at #65281 and so on, and it's a problem that not only clear \n but also all negative matches like [^a] are hit, so it's not good to try to automatically decide whether to search in multiline mode on the VSCode side...
Isn't the proper behavior to add a function that allows the user to choose whether to operate in multiline mode, and to normally search with --no-multiline? I think it would be more user-friendly for VSCode to warn the user to use multiline mode when \n is included in the regular expression.
Again, I don't know how it interacts with ripgrep when in multiline mode.

At the very least, it seems to me that this issue would be resolved if VSCode did not try to search in multiline mode on its own, regardless of the user's intentions.
Well, #65281 is an issue about "correct multiline mode output"...

@roblourens
Copy link
Member

You are right that our current handling will not be totally correct and an explicit button would be better. We try really hard not to add buttons for every single feature because people also value a minimal UI. So this is a tough one.

@ashidaharo
Copy link

We try really hard not to add buttons for every single feature because people also value a minimal UI.

Based on that policy, I think the following is a more accurate (at least better than the current VSCode) way to determine if the regular expression should be searched in multiline mode.

  1. I think we need to revert this fix, which is that '\w' was added here. That wasn't the cause of the problem.
    if (nextChCode === CharCode.n || nextChCode === CharCode.r || nextChCode === CharCode.W || nextChCode === CharCode.w) {
    return true;
    }

    note: But I think there is another additional Issue here: ' \D' and '\s' are also supposed to contain newline codes by definition, but if we include the similar '\W' here, why aren't they included here? (Of course, including '\s' in particular would be equivalent to making most regular expression searches always work in multiline mode.)
  2. The expression we truly had to detect as the cause of Multiline regex returns incorrect results #65281 is '[^_characters_]'.
    The regex can include '\n' as a result of negation, so the regex should be searched as multiline mode. (We shouldn't have to worry about backslashes in this case, because '\[^' is a meaningless regular expression.)
    Warning: I'm sure this addition will cause a lot of pain for people who use negation without intending to search as multiline mode, but if you don't want to add a multiline mode switch to the UI, I think this is a necessary behavior, especially for cases like Multiline regex returns incorrect results #65281.
    So... based on the current code, it would look like this?
    if (chCode === CharCode.OpenSquareBracket) {
        // move to next char
        i++;
        if (i >= len) {
            // string ends with a [
            break;
        }
        const nextChCode = searchString.charCodeAt(i);
        if (nextChCode === CharCode.Caret) {
            return true;
        }
    }
    note: If the expression is something like '[^\n]', this fix will unnecessarily determine the multiline mode, but this is still considered multiline mode anyway because it contains '\n', so the behavior is the same as it is now.

@ashidaharo
Copy link

ashidaharo commented Sep 10, 2021

Uh... sorry, there's a bug in the code.
The 'i--' is needed at the end of the if block, because it allows for the correct detection of [\n], etc.
It's like this.

if (chCode === CharCode.OpenSquareBracket) {
    // move to next char
    i++;
    if (i >= len) {
        // string ends with a [
        break;
    }
    const nextChCode = searchString.charCodeAt(i);
    if (nextChCode === CharCode.Caret) {
        return true;
    }
    // roll back the forward reading
    i--;
}

@ashidaharo
Copy link

ashidaharo commented Sep 10, 2021

Ah... I forgot to mention another method that is quite the opposite of the previous addition. (it doesn't bring pain to the user who intends the negation to work in single-line mode).
The way to do this is to declare to the user that VSCode basically treats negation as not including newline codes, and in cases like #65281, tell the user that if they want the search to include newline codes, they need to include them in the explicit as 'this/\w*(? :[^}]|\n)*' to tell VSCode to search in multiline mode.
To me this seems like a more horrible way than adding a multiline mode switch to the UI.

@Daijobou
Copy link

Is there any news on this? With a search it is a pity that it does not search correctly here, but who replaces text at the same time, gets serious problems here. For me hundreds of files are destroyed. Fortunately I had made a backup before replacing multiple files.

If you test the code beforehand in the search CTRL+F and Regex101 and it works, you do not assume that VSCode fails when replacing multiple files with the same pattern.

@maddes-b
Copy link

maddes-b commented Aug 25, 2022

Sorry for the noise, the error was sitting in front of the keyboard.

Solution:
For [^\S\n]*[^\n{]{1} I assumed that the first part will consume all 4 spaces of line 22 in the example, but it does not.
It will only consume 3, so the remaining space can fulfill [^\n{]{1}.
I could verify this on https://regex101.com/ by making a group of the last part ([^\n{]{1}), see https://regex101.com/r/fU9lmb/1
Correct regex: if[^\S\n]+\(+[^\n]*[\n]+[^\S\n]*[^\s{]{1}, where all white space including \n is excluded in the last part

Original "Issue":
There are still wrong matches in multi-line regex which were already reported in #65281 of December 2018.

Not working example in VS Code 1.70.2:
RegEx: if[^\S\n]+\(+[^\n]*[\n]+[^\S\n]*[^\n{]{1}
Wanted result: find if statements that do no have curly brackets.

  • "if", some white space excluding \n, some brackets, then anything excluding \n (EoL) = if[^\S\n]+\(+[^\n]*
  • Then at least one \n (EoL) = [\n]+
  • "second" line with optional white space excluding \n, then one occurrence of anything excluding \n (EoL) or a curly bracket = [^\S\n]*[^\n{]{1}

Correct results: lines 5+6, 8+9, 15+16
Wrong: results: 21+22

Current workaround: if[^\S\n]+\(+[^\n]*[\n]+[^\S\n]*[a-zA-Z0-9]{1}

  • "second" line with optional white space excluding \n, then a char or digit = [^\S\n]*[a-zA-Z0-9]{1}

image

void() t_movetarget =
{
    local entity temp;

    if (other.movetarget != self)
        return;

    if (other.enemy)
        return; // fighting, not following a path

    temp = self;
    self = other;
    other = temp;

    if (self.classname == "monster_ogre")
        sound(self, CHAN_VOICE, "ogre/ogdrag.wav", 1, ATTN_IDLE); // play chainsaw drag sound

    //dprint("t_movetarget\n");
    self.goalentity = self.movetarget = find(world, targetname, other.target);
    self.ideal_yaw = vectoyaw(self.goalentity.origin - self.origin);
    if (!self.movetarget)
    {
        self.pausetime = time + 999999;
        self.th_stand();
        return;
    }
};

Second my opinions about multi-line regex and \n:

  • \n in any form can stay as a trigger to do multi-line regex (e.g. \n, [\n], [^\n])
  • as EoL \n is special for multi-line regex it should not be used as white space in multi-line context
    • users can still define [\s\n} or [^\s\n] to handle it as white space
    • if it is treated as white space, then users have to define [^\S\n} or [\S\n] to handle it as non white space
  • EoL \n should still be a boundary in multi-line regex
  • I may not recognized other issues with other use cases

[1] https://www.pcre.org/current/doc/html/pcre2pattern.html

@bvschaik
Copy link
Contributor

bvschaik commented Sep 9, 2022

Is there any news on this? With a search it is a pity that it does not search correctly here, but who replaces text at the same time, gets serious problems here. For me hundreds of files are destroyed. Fortunately I had made a backup before replacing multiple files.

Same thing here, I'm cleaning up hundreds of HTML files with thousands of replaces per regex: I lost a couple of days of work because I have no idea how many files I corrupted with multiple search/replace actions involving \n, so I had to start over. I can't believe this serious bug is still open after a year.

I'd love to help fix this but I have no clue where to start. Can anyone give me some pointers? And why would the bug only show up for unopened buffers?

@roblourens
Copy link
Member

Sorry for the trouble. This issue describes the upstream change in ripgrep: BurntSushi/ripgrep#1983. Here is where we process results from ripgrep:

I'm not sure where in the process it falls apart. Something is not handling multiple submatches correctly. You can trace the model through from that code to the search tree and and see where it goes wrong.

@maddes-b
Copy link

#1
Just a note to the discussion in September 2021 by @ashidaharo and @roblourens:

Ah... I forgot to mention another method that is quite the opposite of the previous addition. (it doesn't bring pain to the user who intends the negation to work in single-line mode). The way to do this is to declare to the user that VSCode basically treats negation as not including newline codes, and in cases like #65281, tell the user that if they want the search to include newline codes, they need to include them in the explicit as 'this/\w*(? :[^}]|\n)*' to tell VSCode to search in multiline mode. To me this seems like a more horrible way than adding a multiline mode switch to the UI.

To avoid having \n as a flag for multiline searches and not adding UI flags, then maybe switch to typical regex syntax: /<regex>/<flags>. So in regex mode a leading slash allows you to define any regex flag, greedy/non-greedy, multiline, case insensitive, etc.
Making a delimiter mandatory is also possible, so that like in other tools as sed and grep the user defines the delimiter with the first character, e.g. #<regex>#<flags> instead of /<regex>/<flags>. Which is often convenient when you change file paths which include slashes, as you do not have to escape them all.

#2
Side note: I couldn't reproduce the original issue by huangqinjin in Version: 1.71.0 (user setup), Commit: 784b017, OS: Windows_NT x64 10.0.19044

@bvschaik
Copy link
Contributor

I found the cause and fixed the bug: see PR #160665. The RipGrep parser was not converting the absolute position submatches to correct (line number, column number) pairs, which lead to wrong display in the search widget and wrong replacements. Once I fixed it, examples from both #160005 and my issue #159733 replaced properly.

@roblourens I looked into writing tests for this function, but since it's a private function and there are no tests for the class itself, what's the best approach here?

@roblourens
Copy link
Member

For some reason, some of the repro steps in this issue and linked issues are not reproducing the issue for me. Here's a simple one that illustrates the problem, for verification later.

Open a file with

foo1
foo2
foo3

Search with foo(1|2\n) and you get these incorrect results.

image

@roblourens roblourens modified the milestones: Backlog, September 2022 Sep 21, 2022
roblourens pushed a commit that referenced this issue Sep 21, 2022
* Fix wrong matches in multiline file search

When multiple consecutive lines match a regular expression, RipGrep will return them  as one match with multiple submatches. The parser that converts the absolute positions of the submatches to (line number, column number) pairs was not taking into account that `inBetweenChars` may contain newlines, leading to wrong matches, and search/replace actions that gave unexpected results.

Fixes #131507

* Add tests for multiple submatches from ripgrep
@vscodenpa vscodenpa added unreleased Patch has not yet been released in VS Code Insiders insiders-released Patch has been released in VS Code Insiders and removed unreleased Patch has not yet been released in VS Code Insiders labels Sep 21, 2022
@lramos15 lramos15 added the verified Verification succeeded label Sep 29, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Nov 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Issue identified by VS Code Team member as probable bug insiders-released Patch has been released in VS Code Insiders search Search widget and operation issues upstream-issue-linked This is an upstream issue that has been reported upstream verified Verification succeeded
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants