Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do you think it is possible to support streams without a maximum length? #251

Closed
ghost opened this issue Mar 20, 2020 · 7 comments
Closed

Comments

@ghost
Copy link

ghost commented Mar 20, 2020

I mean the usual solution for streams is setting a maximum size for the match and using a circular buffer with twice the size and running the pattern matching on the buffer. While this works, I am curious if there is a better solution.

I think it is possible to solve pattern matching without storing the whole string or even the matching parts in memory. All we need is the candidate matches we haven't closed yet. At least this works on a simple example. We have a buffer size of 3 characters, a pattern like a\d+b a string like "aa23436bx" -> if we go through this:

begin stream
buffer = "aa234"
[0,"a"] -> candid[0] = {begin: 0, next:"\d"}
[1,"a"] -> candid[0] failed, candid[1] = {begin: 1, next:"\d"}
[2,"2"] -> candid[1] = {begin: 1, next:"\d|b"}
[3,"3"] -> candid[1] = {begin: 1, next:"\d|b"}
[4,"4"] -> candid[1] = {begin: 1, next:"\d|b"}
buffer = "36bx"
[5,"3"] -> candid[1] = {begin: 1, next:"\d|b"}
[6,"6"] -> candid[1] = {begin: 1, next:"\d|b"}
[7,"b"] -> candid[1] = {begin: 1, end: 7}, candid[1] matched
[8,"x"] -> nothing
buffer = ""
end of stream

Here I start the candid[1] in the first chunk and finish it in the second chunk without keeping the first chunk. All I kept is the beginning position. Ofc. a real engine would need more info than that, especially in the case of complicated patterns, capturing groups, etc. but it would be still better than keeping the whole string in memory imho. And if we allow a second round on the stream we could extract the matching substrings too. Maybe I am wrong the re2 already supports this, but I found no mention of it... Any opinions?

@junyer
Copy link
Contributor

junyer commented Mar 20, 2020

Sorry, RE2 does not support streaming. There was some discussion about this at the end of 2016 on #126 and #127.

Depending on your use case, you might want to look into using Hyperscan or lightgrep. You could also use RE2 for parsing and compiling only and then execute the RE2 bytecode however you like.

@junyer junyer closed this as completed Mar 20, 2020
@BurntSushi
Copy link

See also #126, #127 and #213.

I've given a lot of thought to this and it is very hard. I've written more about it here: rust-lang/regex#425 For example, a key thing you're missing is accounting for how the DFA works. It has to run backwards to find the starting location.

@ghost
Copy link
Author

ghost commented Mar 20, 2020

@BurntSushi Thanks!

@junyer
Copy link
Contributor

junyer commented Mar 20, 2020

+1 to what @BurntSushi wrote in rust-lang/regex#425 (comment).

A further note about Hyperscan and tracking where matches begin:

* SoM: "Start of Match": we do accurate, streamable start of match
tracking through our system. This is an option and occasionally an expensive one, and some
patterns we support in non-SoM mode we don't support in SoM mode. Anyone who thinks
SoM is cheap or easy is either smarter than we are, not doing things in streaming mode,
trimming down the definition of SoM (possibly defensibly), or misunderstands the problem.
There is a certain amount of weird structure allowing SoM information to flow through our
system; this is an area that needs work.

(source)

@ghost
Copy link
Author

ghost commented Mar 20, 2020

@junyer Yes, I don't expect it to be easy to implement, just that it is possible at a cetain level with certain patterns. The upper dummy example is an evidence of that.

@ghost
Copy link
Author

ghost commented Mar 20, 2020

See also #126, #127 and #213.

I've given a lot of thought to this and it is very hard. I've written more about it here: rust-lang/regex#425 For example, a key thing you're missing is accounting for how the DFA works. It has to run backwards to find the starting location.

Well I don't think that is possible without match size restrictions. I am sure I miss a lot of things, I don't know much about regex engines, I know a few sequential pattern mining algorithms, but that's all. I am mostly an user instead of a developer from this perspective. I'll read what you linked.

@ghost
Copy link
Author

ghost commented Mar 20, 2020

The rust-lang/regex#425 is the exact same thing I thought of, but this is just a naive approach. Usually one needs to think a lot more before coming up a real solution or giving up. I'll continue there, thanks for the links!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants