Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More performance optimization for the Remove Empty Lines command #12544

Conversation

ArkadiuszMichalski
Copy link
Contributor

@ArkadiuszMichalski ArkadiuszMichalski commented Nov 21, 2022

I found regular expressions even more efficient than the last ones (and simpler).

^[\r\n]+

^(?>[\t ]*[\r\n]+)+

This can be checked with https://regex101.com/ (PCRE) and file for test regex.txt.

I also play with this // remove the last line if it's an empty line. with some big file. There is no point to checking the entire file when we want to remove the last blank line. We need only check the last two (or one) lines, which can be done directly with Scintilla commands. From now on, this last empty line removal will not slow as the multi-line document grows.

This PR also cover two more cases:

  • also works with one empty line #12512 (comment)
  • add some condition for selection to avoid some unexpected situations (like this)

This is just a performance improvement. In particular, it doesn't change the behavior of how to treat the last EOL with or without selection. It works as it has been until now. I suggest discussing changing this behavior elsewhere (#12545, #12535).

@donho
Copy link
Member

donho commented Nov 24, 2022

@ArkadiuszMichalski
There is another PR #12548 # #12535 which addresses the same issues - on which one should I focus?

@ArkadiuszMichalski
Copy link
Contributor Author

@donho
This one basically improves the performance. So it could be approved without any additional problems.

On #12548 I'm trying to change the behavior of this command (ie how to treat the last empty line when doing a selection, how to set selections after doing operation, etc.). So it could be accepted separately (if this new approach is better than the current one), so then in case of any problems in the future it would be easier rejected.

But the performance improvement from this PR will be preserved.

@donho
Copy link
Member

donho commented Nov 24, 2022

Could you provide any test (file + instructions) to prove this PR has optimized the performance?

@donho
Copy link
Member

donho commented Nov 24, 2022

And please close #12548 #12535

@ArkadiuszMichalski
Copy link
Contributor Author

ArkadiuszMichalski commented Nov 24, 2022

@donho And please close #12548

But #12548 is related to other command Remove Consecutive Duplicate Lines. I plan solve it after close this PR and also after next PR #12535 (because I need to know earlier what behawior is more excepted, what will be accepted or not, this command is very similar).

This PR does not have a dedicated issue. There was one #12462, but you accepted PR #12512 too quickly (while the discussion/work was still ongoing).

In the case of tests:

For the first improve you can play with https://regex101.com/ (PCRE or PCRE2) and file for test regex.txt. You can see that this new regex are shorter and faster (takes less time and do less steps).
Just compare two scenerio:

For empty line:

^[\r\n]+

vs 

^$(\r?\n|\r)(\1)*

image
image

For empty line with white space:

^(?>[\t ]*[\r\n]+)+

vs

^([\t ]*$(\r?\n|\r))(\1)*

image
image
Here we also grab more data in one shot that why we see 33 vs 37 matches (it's not a bug).

For second improving // remove the last line if it's an empty line you need more detail? As from first post:

There is no point to checking the entire file when we want to remove the last blank line. We need only check the last two (or one) lines, which can be done directly with Scintilla commands. From now on, this last empty line removal will not slow as the multi-line document grows.

This can be observed on larger files. We don't once again scan all data (all lines) to remove potentially last. We check directly only last, so basically it's instant.

@donho
Copy link
Member

donho commented Nov 24, 2022

But #12548 is related to other command Remove Consecutive Duplicate Lines. I plan solve it after close this PR and also after next PR #12535 (because I need to know earlier what behawior is more excepted, what will be accepted or not, this command is very similar).

Sorry, I meant "close #12535"

For the first improve you can play with https://regex101.com/ (PCRE or PCRE2) and file for test regex.txt. You can see that this new regex are shorter and faster (takes less time and do less steps).

We cannot rely on https://regex101.com/ because we don't know about their PCRE implementation. Even they use the same version of boost PCRE in their engine, it's not the same binary code as in Notepad++.

So as far as I don't have the tangible proof that this PR improves the performance, the PR cannot be accepted.
All we need is a file which can show us the difference of time (BEFORE and AFTER applying the PR) to execute the command in question.

@ArkadiuszMichalski
Copy link
Contributor Author

ArkadiuszMichalski commented Nov 24, 2022

@donho

We cannot rely on https://regex101.com/ because we don't know about their PCRE implementation. Even they use the same version of boost PCRE in their engine, it's not the same binary code as in Notepad++.

I know this is not the same implementation as in Notepad++. But it makes the presentation easier why these new Regex are more efficient. We don't use capturing group (which is always more efficient), we grab more data in one matching.

Hypothetical simple test case:

space_space_space CRLF
space_space CRLF
CR
LF

This new regex ^(?>[\t ]*[\r\n]+)+ make only one match (grab all this lines) and make one replace. This previous version ^([\t ]*$(\r?\n|\r))(\1)* makes 4 matches and make 4 replace. So this newer version will definitely be faster (in any implementation), if so why not use it?

In the worst case, when consecutive empty lines differ (they have different white space, for example as above), the last accepted PR doesn't improve anything, but this one does.

So as far as I don't have the tangible proof that this PR improves the performance, the PR cannot be accepted.
All we need is a file which can show us the difference of time (BEFORE and AFTER applying the PR) to execute the command in question.

Well, but it's necessary to make a file with 1 million lines to prove it (and add time logging)? Please consider this practice case (use only Notepad++):

                 
             
		

This new one (only 1 match)
image

This old one (4 matches)

image

OK, I have a bigger files for testing:
test20k.txt - for both commands
test20k_white.txt - for Remove Empty Lines (Containing Blank characters) command

I can also make clearer example for improving // remove the last line if it's an empty line, but if the above does not convince you ...

@Yaron10
Copy link

Yaron10 commented Nov 24, 2022

@donho,

Sorry, I meant "close #12535"

May I ask why?
@ArkadiuszMichalski has been doing a great job there.
Obviously, we would appreciate your opinion too. :)

@donho
Copy link
Member

donho commented Nov 24, 2022

@Yaron10

@ArkadiuszMichalski has been doing a great job there.

I didn't say he does not do a good job.
There are 2 performance optimization PR for the Remove Empty Lines command: #12544 (this one) and #12535
I just don't see why we need 2 PR for the same optimization.

@Yaron10
Copy link

Yaron10 commented Nov 24, 2022

@donho,

#12535 started as an optimization PR but turned into fixing some issues in "Line Operations" commands.
The discussion and work there are about to conclude.
I suppose @ArkadiuszMichalski will ping you when it's done.

Thank you.

@donho
Copy link
Member

donho commented Nov 24, 2022

Thank you for your explanation @Yaron10 .
@ArkadiuszMichalski In this case, could you update the title Change for Remove Empty Lines command how operate on selection and last empty line to somethings else in order to make the description of PR #12535 more accurate?

@ArkadiuszMichalski
Copy link
Contributor Author

ArkadiuszMichalski commented Nov 25, 2022

@donho
But this title just hints at what that PR is about. In short:

#12544 (this PR) - is only for performance optimization for these two commands (better than the recently approved PR).

#12535 - that PR change behavior for selection and last empty line for this two commands. I don't know if this change will be accepted (is expected), so I did it separately. It has optimizations from #12544, because I assume it will be accepted after all. But if not then I will revert to the previous approach, keeping the changes for selection and last empty line made following the discussion. What's wrong with the title Change for Remove Empty Lines command how operate on selection and last empty line?
Please don't focus on this PR because it's not finished, I'm still changing something there, when I finish I will let you know.

It all depends on what will be done with this PR. I added a file to test #12544 (comment), so if you need anything else (additional files) let me know.

@donho donho self-assigned this Nov 25, 2022
@donho donho added the accepted label Nov 25, 2022
@donho donho closed this in cffdf79 Nov 25, 2022
@ArkadiuszMichalski ArkadiuszMichalski deleted the optimize_12462 branch December 3, 2022 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants