
Rewrite 'Remove consecutive duplicate lines' to not use regex #13558

Conversation

ArkadiuszMichalski (Contributor)

Fixes #5538 and #12548.

Files for test and results:

  1. uniq_list.zip from #5538 (comment) - around 2.7 million lines

    On my machine the new implementation took about 20 s vs 70 s for the old implementation.

  2. A Unix (LF) file with 44073 lines from #5538 (comment)

    In this example the regex approach has a problem processing many lines at once (the lines with EXECUTE as content). The new implementation does not have this problem and is also faster.

I tried to recreate the current behavior of this method. We now have more control over the whole process, so if something does not work properly we can correct it. It might be even faster to delete all the duplicate lines at once, but that would affect the state of the lines that are supposed to remain, so I do not do that (same as in the previous commands). The basic idea is sketched below.
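For illustration only, here is a minimal, self-contained sketch of the line-by-line comparison used instead of a regex; the function name and the container-based interface are made up for this example and are not the actual Notepad++/Scintilla code, which works directly on the editor buffer:

```cpp
#include <string>
#include <vector>

// Remove consecutive duplicate lines without any regex:
// keep a line only if it differs from the last line that was kept.
std::vector<std::string> removeConsecutiveDuplicates(const std::vector<std::string>& lines)
{
    std::vector<std::string> result;
    for (const std::string& line : lines)
    {
        if (result.empty() || result.back() != line)
            result.push_back(line);
    }
    return result;
}
```

A single linear pass like this avoids the pathological regex behavior on inputs with many identical lines, which is where the speedup on the test files comes from.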

Other commands that also use regex with a limit can be modified in a similar way, but so far we have no reported issues for them, so that is a task for the future.

@chcg added the enhancement and performance issue labels on Apr 22, 2023
@donho self-assigned this on Apr 23, 2023
@donho added the accepted label on May 1, 2023
@donho closed this in ecb1071 on May 1, 2023