Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex Automata #185

Closed
kirawi opened this issue Jun 8, 2021 · 8 comments · Fixed by #9422
Closed

Regex Automata #185

kirawi opened this issue Jun 8, 2021 · 8 comments · Fixed by #9422
Assignees
Labels
A-core Area: Helix core improvements C-enhancement Category: Improvements E-hard Call for participation: Experience needed to fix: Hard / a lot

Comments

@kirawi
Copy link
Member

kirawi commented Jun 8, 2021

In the future, we should switch to something like Alacritty's regex functionality so we don't waste memory in our current implementation.

@archseer
Copy link
Member

archseer commented Jun 9, 2021

This regex thread is also relevant.

In particular:

Before getting too excited, you'll want to check out the list of differences between regex and regex-automata to see whether it's possible to use it. It's sadly probably not good enough for supporting a user facing text editor, for example, likely because of poor regex compile times.

and

As a lower level crate, this library does not do literal optimizations. In exchange, you get predictable performance regardless of input. The philosophy here is that literal optimizations should be applied at a higher level, although there is no easy support for this in the ecosystem yet.

This last one is especially limiting, since often you're just searching for a literal.

I think we need to give it a shot, but benchmark the performance hit. An optimization: we need to compile 4 DFAs, two for each search direction. But / always searches forward, so we can simply compile the forward DFAs while the user is inputting the regex in the prompt, then delay compiling the reverse DFAs until Enter is pressed.

@archseer archseer self-assigned this Jun 9, 2021
@sudormrfbin sudormrfbin linked a pull request Aug 6, 2021 that will close this issue
@kirawi kirawi added A-core Area: Helix core improvements C-enhancement Category: Improvements labels Aug 19, 2021
@pascalkuthe
Copy link
Member

I don't think using regex-automata is the path forward here, DFA compilation fundamentally has exponential time/space complexity.
That is why the regex engines basically always compiles them lazily and intelligently fall back to other engines.

What do you guys think about using a fork of the regex crate to finally resolve this issue?
I think I could fork regex that work with ropey chunk iterators instead of slices.
An initial patch could be pretty easy by just transparently replacing byte slices with ropey chunks iterator by writing a smart indexing replacement. Basically I would check if we moved past the chunk boundary on every index and intelligently advance to the correct chunk until we hit the correct chunk (advancing either forward or backwards).
This approach introduces some overhead but since chunk boundaries are pretty rare I think it will be pretty small thanks to branch prediction (a couple cycels per index). I am pretty confident this performance tradeoff ispreferable (both performance/memory consumption wise) to converting the entire Rope to a string.
I am not 100% sure this will work but I would certainly like to give it a try.

In the future I might be able to optimize this further by actually handling chunk boundaries in the engines itself but that would be much much more effort.
I don't think the first approach would be accepted upstream (the latter one maybe but even that I don' see happening in the near future) so it would require a fork.

This would require some effort (although I think it would be pretty interesting project) so I wouldn't want to work on something like that if we don't feel comfortable using a fork of the regex crate (I would publish to crates.io of-course).

Would you be ok with such a fork and if I assign this issue to myself @archseer?
I have quite a lot of other stuff I want to work on first so it would be a while before I work on this but it doesn't seem like there is a lot of time pressure here.

@pascalkuthe pascalkuthe added the E-hard Call for participation: Experience needed to fix: Hard / a lot label Dec 10, 2022
@dead10ck
Copy link
Member

Sorry, I'm a little lost, is there a problem with regex? Does it use a lot of memory?

@pascalkuthe
Copy link
Member

Sorry, I'm a little lost, is there a problem with regex? Does it use a lot of memory?

The problem is that regex can only operate on contiguous memory strings while ropey is segmented into chunks. Currently anytime we use regex we have to convert the entire rope to a contiguous String which sort of defeats the point of ropey and can cause quite a bit of overhead when working with large files.

@dead10ck
Copy link
Member

Ohh, interesting, thanks for the explanation

@TornaxO7
Copy link
Contributor

TornaxO7 commented Dec 2, 2023

Is this issue still up to date since helix uses nucleo?

@pascalkuthe
Copy link
Member

Nucleo is only a fuzzy matcher and has nothing g to fo either regex search. I am going to assign this to myself since I have been working on a streaming regex engine for a while

@pascalkuthe pascalkuthe assigned pascalkuthe and unassigned archseer Dec 2, 2023
@x10an14
Copy link

x10an14 commented Dec 27, 2023

A link explaining the ropey reference of #185 (comment), in case anyone else wonders (like I did) =)

https://github.com/cessen/ropey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-core Area: Helix core improvements C-enhancement Category: Improvements E-hard Call for participation: Experience needed to fix: Hard / a lot
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants