Regex lookahead causes entire file to be selected #4761

Open
MadSpaniard opened this Issue Aug 12, 2018 · 8 comments

MadSpaniard commented Aug 12, 2018

Description of the Issue

Notepad++ erroneously selects the entire file when searching using a regex that includes a lookahead that scans over 1000+ lines to determine a match.

For example, the following regex will select the entire file when the amount of text traversed to find the next successful match exceeds approximately 1000 lines:

<record>((?!</record>).)*?text-string-to-find.*?</record>

Basically, there seems to be a limit on how far ahead Notepad++ can traverse when searching with a regex that includes a lookahead. When the search exceeds that limit, Notepad++ selects the entire file.

Steps to Reproduce the Issue

  1. To illustrate, consider a long XML file with hundreds or thousands of multi-line "records" (e.g. 10-50 lines per record) defined between opening/closing <record>/</record> tags. Some of these records contain a particular "text-string-to-find", others do not.
  2. Search the records file using the regex above, and hit Find Next repeatedly to traverse the matches.
  3. The regex above is trying to match records that contain "text-string-to-find". The negative lookahead ensures that we don't return matches that span multiple records when doing a Find Next from the current cursor position in the file.

Expected Behavior

The regex should match the entire text between (and including) the opening/closing record tags for only those records that contain "text-string-to-find". It should match from the opening "<record>" through the closing "</record>" tag for each such record. Each "find next" should find and select the entire text of the next record that contains the string.
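To make the expected behavior concrete, here is a minimal sketch that runs the same pattern through Python's `re` module on a tiny invented sample. The sample records are hypothetical, and `re.DOTALL` is an assumption standing in for the ". matches newline" search option; Python's engine is not the one Notepad++ uses, so this only illustrates what the pattern is meant to select.

```python
import re

# Hypothetical sample data; the record contents are invented for illustration.
SAMPLE = """<records>
<record>
  <id>1</id>
  <note>nothing here</note>
</record>
<record>
  <id>2</id>
  <note>text-string-to-find</note>
</record>
<record>
  <id>3</id>
  <note>also nothing</note>
</record>
</records>
"""

# Same pattern as in the report. re.DOTALL lets "." cross line breaks
# (an assumption standing in for the ". matches newline" option).
PATTERN = re.compile(
    r"<record>((?!</record>).)*?text-string-to-find.*?</record>",
    re.DOTALL,
)

# Each match should be exactly one whole record that contains the string.
matches = [m.group(0) for m in PATTERN.finditer(SAMPLE)]

for text in matches:
    print(text)
    print("---")
```

On this sample, only the record with id 2 is selected, from its opening tag through its closing tag. The negative lookahead is what stops a match from starting in one record and ending in a later one.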

Actual Behavior

  1. If the # of lines between one match and the next match is large, i.e. greater than approx. 1000 lines, then upon "find next" notepad++ erroneously selects the entire file, rather than properly finding and selecting the very next match.
  2. However, if the # of lines between matches is small, i.e. approx. 50-200 lines, then notepad++ properly finds and matches only the records that contain "text-string-to-find". The match is limited to a single record (and does not span across multiple records).
  3. If there are no matches at all in the entire file (i.e. no records contain "text-string-to-find"), notepad++ selects the entire file when it should select nothing.
  4. In summary: "find next" works if the number of lines to the next match is small, but selects the entire file if the number of lines to the next match is large (more than approx. 1000).

Related Issue & Other Info

This may be related to Issue #683.

Contributor

ClaudiaFrank commented Aug 13, 2018

Not sure if you can blame npp for regexes that may cause catastrophic backtracking.
It would be nice if npp threw an error instead of selecting everything, but the regex itself
should be changed imho, although I'm not a regex expert at all.

I guess this one does what you want.

(?is)<record>(?>(?!</record>).)*?text-string-to-find.*?</record>

Author

MadSpaniard commented Aug 14, 2018

Hello Claudia -

Thank you for your reply. Yes, I had previously tried adding the atomic group structure around the lookahead portion as you suggest, i.e. adding the atomic grouping "(?>" in the construct: "(?>(?!</record>).)*?". This definitely helps in many, perhaps even most circumstances.

But even then, I am still finding cases where notepad++ ends up selecting the entire file.

For example, if the element "record" is nested inside other elements in the XML file and there are no matches at all for "text-string-to-find" in the file (and the file is long, in my case 500,000+ lines), notepad++ still sometimes ends up selecting the entire file and claiming there is 1 match (when in fact there are none).

So it seems that there may still be cases where notepad++ may be hitting an internal issue/limit/constraint somewhere causing it to end up selecting the entire file when it doesn't find a match. Perhaps file length has something to do with it, perhaps number of lines between search matches, perhaps if it finds no matches when it reaches the end of the file causing it to select the entire file in certain situations, etc.

The other seemingly related issue #683 may indicate some such scenarios where this potential issue may be occurring.

Thanks again for your helpful comment and help.

Contributor

ClaudiaFrank commented Aug 14, 2018

Looks like I haven't made myself clear - I absolutely agree that in case of an error, like a stack overflow,
npp should return an error instead of matching the whole text - that is definitely a misbehavior of npp.

It was about "your expected behavior" that the regex should match, even in such cases,
which led me to propose another regex solution, but now I see that your concern is the same as mine.
Npp needs to take care that errors like stack overflows can happen and should not result in returning
the whole document as the only match.

Sorry for the confusion.

Contributor

ClaudiaFrank commented Aug 17, 2018

What I found out so far: an exception is thrown, but the searchInTarget function returns 0, which is, of course,
a valid position, i.e. the first position in a document.
One possible solution might be to call SCI_GETSTATUS because this is set to 1 in this case.
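The ambiguity described here can be sketched with a toy model. Everything in the block below is hypothetical: `ToyEditor`, `search_in_target`, `naive_find`, `checked_find`, and `overflowing_engine` are invented names, not Notepad++ or Scintilla code. It also assumes that the search target is left as the whole document when the engine fails, which is one plausible reason the entire file ends up selected. The only detail taken from the comment is that the search returns 0 on failure while a status value of 1 is set.

```python
STATUS_OK = 0
STATUS_FAILURE = 1  # the value the comment above reports seeing


class ToyEditor:
    """Hypothetical stand-in for the editor component; not the real API."""

    def __init__(self, text):
        self.text = text
        self.status = STATUS_OK
        # Assumption: the search target starts out as the whole document.
        self.target = (0, len(text))

    def search_in_target(self, engine):
        """Return the match start, -1 for no match, or 0 if the engine raises."""
        self.status = STATUS_OK
        try:
            span = engine(self.text)
        except RuntimeError:
            self.status = STATUS_FAILURE
            return 0  # indistinguishable from a real match at offset 0
        if span is None:
            return -1
        self.target = span
        return span[0]


def naive_find(editor, engine):
    """Trust the return value alone and report whatever the target holds."""
    pos = editor.search_in_target(engine)
    return None if pos < 0 else editor.target


def checked_find(editor, engine):
    """Consult the status flag before trusting the returned position."""
    pos = editor.search_in_target(engine)
    if editor.status != STATUS_OK:
        raise RuntimeError("regex search failed")
    return None if pos < 0 else editor.target


def overflowing_engine(text):
    """Hypothetical engine that always fails, simulating a stack overflow."""
    raise RuntimeError("simulated regex stack overflow")
```

With the failing engine, `naive_find` reports the whole document as a single match, while `checked_find` surfaces the failure as an error. That is the difference a status check along the lines of SCI_GETSTATUS would make.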

thatgdfox commented Sep 18, 2018

Hi. I also encountered this issue, so I thought I'd pass along what I found in case it's useful. I'm using a negative lookahead:
^(.(?!zyx))*
I'm attaching a file that I created to test. I found this expression works fine on lines up to 19,988 characters long (including CR and LF), but when npp attempts to apply this expression to the line with 19,989 characters, it returns the entire file as a match. Note that it also affects the results you get when you click the Count button; it returns a count of 6, but it should return a count of 7.
Attachment: Line with 19989 chars causes neg lookahead to fail.txt
Screenshot: notepad _2018-09-18_12-22-42
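For anyone who wants to reproduce this without the attachment, here is a rough sketch that builds a comparable seven-line test text and counts the per-line matches with Python's `re` module as a baseline. The contents of the attached file are not known, so the line contents and every length other than 19,988 and 19,989 are assumptions, and `build_text` and `count_line_matches` are hypothetical helper names. Python's engine is different from the one in Notepad++, so this only shows what a correct count looks like.

```python
import re

# The pattern from the comment above; MULTILINE makes ^ match at each line start.
PATTERN = re.compile(r"^(.(?!zyx))*", re.MULTILINE)

# Only 19,988 and 19,989 come from the report; the other lengths are made up.
LINE_LENGTHS = [100, 1000, 5000, 10000, 15000, 19988, 19989]


def build_text(lengths):
    """Build CRLF-terminated lines whose lengths include the CR and LF."""
    return "".join("a" * (n - 2) + "\r\n" for n in lengths)


def count_line_matches(text):
    """Count non-empty matches, skipping the empty one after the last newline."""
    return sum(1 for m in PATTERN.finditer(text) if m.group(0))


text = build_text(LINE_LENGTHS)
count = count_line_matches(text)
print(count)
```

To get a file for Notepad++, write `text` out with `open(path, "w", newline="")` so the CRLF line endings are kept as-is. The baseline count here is one match per line, which is the 7 the Count button would be expected to report.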

neffmallon commented Dec 11, 2018

This happens without lookarounds as well. I have a CSV that is 134 MB but only 1600 lines and consists entirely of single-digit numbers (mostly 0 and 1) and commas. Searching for ,[23456789], selects the entire document. Likewise, doing a Replace All on (0,)+ with an empty replacement replaces one instance, and clicking the Replace All button again causes the program to say there were 0 matches. However, Replace still replaces the next matching string.

sasumner commented Dec 11, 2018

@neffmallon What is happening is that the regular expression engine is being overwhelmed and coughs up an error, but apparently Notepad++ is ignoring that error in an interesting way. See @ClaudiaFrank's comments above.

sasumner referenced this issue Feb 10, 2019

Add "Remove Duplicate Lines" feature
Remove duplicate consecutive lines from whole document.

sasumner referenced this issue Mar 29, 2019

Fix a bug in command "Remove Consecutive Duplicate Lines"
...while the last line's prefix is the content of its previous line.

Fix #5462