Regex Automata #185

kirawi · 2021-06-08T16:44:41Z

In the future, we should switch to something like Alacritty's regex functionality so we don't waste memory in our current implementation.

archseer · 2021-06-09T01:53:04Z

This regex thread is also relevant.

In particular:

Before getting too excited, you'll want to check out the list of differences between regex and regex-automata to see whether it's possible to use it. It's sadly probably not good enough for supporting a user facing text editor, for example, likely because of poor regex compile times.

and

As a lower level crate, this library does not do literal optimizations. In exchange, you get predictable performance regardless of input. The philosophy here is that literal optimizations should be applied at a higher level, although there is no easy support for this in the ecosystem yet.

This last one is especially limiting, since often you're just searching for a literal.

I think we need to give it a shot, but benchmark the performance hit. An optimization: we need to compile 4 DFAs, two for each search direction. But / always searches forward, so we can simply compile the forward DFAs while the user is inputting the regex in the prompt, then delay compiling the reverse DFAs until Enter is pressed.

pascalkuthe · 2022-12-10T15:15:34Z

I don't think using regex-automata is the path forward here, DFA compilation fundamentally has exponential time/space complexity.
That is why the regex engines basically always compiles them lazily and intelligently fall back to other engines.

What do you guys think about using a fork of the regex crate to finally resolve this issue?
I think I could fork regex that work with ropey chunk iterators instead of slices.
An initial patch could be pretty easy by just transparently replacing byte slices with ropey chunks iterator by writing a smart indexing replacement. Basically I would check if we moved past the chunk boundary on every index and intelligently advance to the correct chunk until we hit the correct chunk (advancing either forward or backwards).
This approach introduces some overhead but since chunk boundaries are pretty rare I think it will be pretty small thanks to branch prediction (a couple cycels per index). I am pretty confident this performance tradeoff ispreferable (both performance/memory consumption wise) to converting the entire Rope to a string.
I am not 100% sure this will work but I would certainly like to give it a try.

In the future I might be able to optimize this further by actually handling chunk boundaries in the engines itself but that would be much much more effort.
I don't think the first approach would be accepted upstream (the latter one maybe but even that I don' see happening in the near future) so it would require a fork.

This would require some effort (although I think it would be pretty interesting project) so I wouldn't want to work on something like that if we don't feel comfortable using a fork of the regex crate (I would publish to crates.io of-course).

Would you be ok with such a fork and if I assign this issue to myself @archseer?
I have quite a lot of other stuff I want to work on first so it would be a while before I work on this but it doesn't seem like there is a lot of time pressure here.

dead10ck · 2022-12-10T20:28:11Z

Sorry, I'm a little lost, is there a problem with regex? Does it use a lot of memory?

pascalkuthe · 2022-12-10T20:30:55Z

Sorry, I'm a little lost, is there a problem with regex? Does it use a lot of memory?

The problem is that regex can only operate on contiguous memory strings while ropey is segmented into chunks. Currently anytime we use regex we have to convert the entire rope to a contiguous String which sort of defeats the point of ropey and can cause quite a bit of overhead when working with large files.

dead10ck · 2022-12-10T20:33:24Z

Ohh, interesting, thanks for the explanation

TornaxO7 · 2023-12-02T09:09:09Z

Is this issue still up to date since helix uses nucleo?

pascalkuthe · 2023-12-02T14:57:56Z

Nucleo is only a fuzzy matcher and has nothing g to fo either regex search. I am going to assign this to myself since I have been working on a streaming regex engine for a while

x10an14 · 2023-12-27T07:54:51Z

A link explaining the ropey reference of #185 (comment), in case anyone else wonders (like I did) =)

https://github.com/cessen/ropey

archseer self-assigned this Jun 9, 2021

sudormrfbin linked a pull request Aug 6, 2021 that will close this issue

Search via regex-automata #211

Closed

kirawi added A-core Area: Helix core improvements C-enhancement Category: Improvements labels Aug 19, 2021

pascalkuthe added the E-hard Call for participation: Experience needed to fix: Hard / a lot label Dec 10, 2022

pascalkuthe mentioned this issue Jan 30, 2023

Highlight all search matches in document #5702

Open

pascalkuthe mentioned this issue Sep 15, 2023

OOM crash when performing search on large input file #8288

Closed

pascalkuthe assigned pascalkuthe and unassigned archseer Dec 2, 2023

pascalkuthe mentioned this issue Jan 25, 2024

Switch to regex-cursor #9422

Merged

archseer closed this as completed in #9422 Feb 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex Automata #185

Regex Automata #185

kirawi commented Jun 8, 2021 •

edited

archseer commented Jun 9, 2021

pascalkuthe commented Dec 10, 2022

dead10ck commented Dec 10, 2022

pascalkuthe commented Dec 10, 2022

dead10ck commented Dec 10, 2022

TornaxO7 commented Dec 2, 2023

pascalkuthe commented Dec 2, 2023

x10an14 commented Dec 27, 2023

Regex Automata #185

Regex Automata #185

Comments

kirawi commented Jun 8, 2021 • edited

archseer commented Jun 9, 2021

pascalkuthe commented Dec 10, 2022

dead10ck commented Dec 10, 2022

pascalkuthe commented Dec 10, 2022

dead10ck commented Dec 10, 2022

TornaxO7 commented Dec 2, 2023

pascalkuthe commented Dec 2, 2023

x10an14 commented Dec 27, 2023

kirawi commented Jun 8, 2021 •

edited