Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the front anchor optimization in multiline mode #26

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

4X3L
Copy link

@4X3L 4X3L commented Jun 27, 2021

This pull request fixes issue #12.

replicate 1000000 'f' =~ "\\`g" :: Bool -- returns quickly
replicate 1000000 'f' =~ "^g" :: Bool -- returns slowly as it needs to scan for newlines
('g':replicate 1000000 'f') =~ "^g" :: Bool -- returns quickly, even though it could match multiple things, we only ask for a bool
('g':replicate 1000000 'f') =~ "^g" :: [[String]] -- returns slowly, as it needs to gather all possible matches

All of the standard test cases pass, and I did some ad-hoc testing at the repl. multiline/singleline modes, new/old syntax, and `/^ all seem to be working correctly together. Let me know if you want me to add them to the test cases formally before you accept the pull request.

It seemed like what happened was that originally tdfa had only both single-line and multi-line modes, with only the "^" anchor. Since in multi-line mode the entire string needed to be scanned for all newlines, the front anchored optimization was disabled in multi-line mode. But when the "`" anchor was added, the logic to set the front-anchor optimization wasn't updated.

Hopefully the code change itself is self-explanatory. Instead of "&&", I pass (multiline co) into isDFAFrontAnchored, and in there if it is in multi-line mode it only treats "`" as an anchor, otherwise, it treats both "`" and "^" as anchors.

@andreasabel
Copy link
Member

Thanks for the PR!

Let me know if you want me to add them to the test cases formally

Yes, please add test cases that are not already subsumed by the existing test cases.
Can you also add a test that demonstrates the performance improvement?

It seemed like what happened was that originally tdfa had only both single-line and multi-line modes, with only the "^" anchor. Since in multi-line mode the entire string needed to be scanned for all newlines, the front anchored optimization was disabled in multi-line mode. But when the "`" anchor was added, the logic to set the front-anchor optimization wasn't updated.

Hopefully the code change itself is self-explanatory. Instead of "&&", I pass (multiline co) into isDFAFrontAnchored, and in there if it is in multi-line mode it only treats "" as an anchor, otherwise, it treats both "" and "^" as anchors.

Thanks for the explanation! My take is to add such explanations also as comments to the source code, so that intentions are documented and can be matched against the actual implementation.

@andreasabel andreasabel added this to the 1.3.1.2 milestone Jul 6, 2021
@4X3L
Copy link
Author

4X3L commented Jul 15, 2021

I made a commit to the branch to show I am still working on this, but it is very much a work in progress. It seems that a lot of the newer features don't have corresponding tests. (Testing multi-line versus single line mode, new syntax versus old syntax, as well as all of the anchor and boundary characters. ) So I am also adding all of those tests, as well as the ones that strictly cover my change. (Right now the tests I added just sit there and use up memory.)

The reason I added this comment is because I am about to go on vacation for a week with limited internet, and wanted to give a quick update before I left. Hopefully I'll have a final version of the tests ready in two weeks.

@andreasabel andreasabel modified the milestones: 1.3.1.2, 1.4.0.0 Jul 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants