Do you think it is possible to support streams without a maximum length? #251
Comments
Sorry, RE2 does not support streaming. There was some discussion about this at the end of 2016 on #126 and #127. Depending on your use case, you might want to look into using Hyperscan or lightgrep. You could also use RE2 for parsing and compiling only and then execute the RE2 bytecode however you like.
I've given this a lot of thought, and it is very hard. I've written more about it here: rust-lang/regex#425. For example, a key thing you're missing is accounting for how the DFA works: it has to run backwards to find the starting location.
@BurntSushi Thanks!
+1 to what @BurntSushi wrote in rust-lang/regex#425 (comment). A further note about Hyperscan and tracking where matches begin:
(source)
@junyer Yes, I don't expect it to be easy to implement, just that it is possible at a certain level with certain patterns. The dummy example in the issue description is evidence of that.
Well, I don't think that is possible without match size restrictions. I am sure I am missing a lot of things; I don't know much about regex engines. I know a few sequential pattern mining algorithms, but that's all. From this perspective I am mostly a user rather than a developer. I'll read what you linked.
rust-lang/regex#425 describes exactly what I had in mind, but mine is just a naive approach. Usually one needs to think a lot more before coming up with a real solution or giving up. I'll continue there, thanks for the links!
I mean, the usual solution for streams is to set a maximum size for the match, keep a circular buffer of twice that size, and run the pattern matching on the buffer. While this works, I am curious whether there is a better solution.
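A minimal sketch of that bounded-window approach, written in Python with the standard `re` module rather than RE2. The function name `windowed_matches` and the parameter `max_len` are hypothetical, and a growing string stands in for a real circular buffer. It assumes that no match is longer than `max_len` characters and that the pattern has no anchors or lookaround, since the window discards earlier context.

```python
import re
from typing import Iterable, Iterator, Tuple


def windowed_matches(
    chunks: Iterable[str], pattern: str, max_len: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) stream offsets, assuming matches span <= max_len chars."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    regex = re.compile(pattern)
    buf = ""   # sliding window over the stream
    base = 0   # stream offset of buf[0]
    for chunk in chunks:
        buf += chunk
        # Once the window holds 2 * max_len characters, every match that
        # starts in its first half is guaranteed to be complete.
        while len(buf) >= 2 * max_len:
            cut = max_len
            for m in regex.finditer(buf):
                if m.start() >= max_len:
                    break  # may still grow; decide in a later window
                yield (base + m.start(), base + m.end())
                cut = max(cut, m.end())
            # Drop the part of the window that has been fully decided.
            buf = buf[cut:]
            base += cut
    # End of stream: whatever is left can be searched as-is.
    for m in regex.finditer(buf):
        yield (base + m.start(), base + m.end())


if __name__ == "__main__":
    text = "xa1b" * 10 + "a23436b" + "zz"
    chunks = [text[i:i + 3] for i in range(0, len(text), 3)]
    print(list(windowed_matches(chunks, r"a\d+b", max_len=8)))
```

Memory use stays bounded by roughly twice `max_len` plus one chunk, which is the trade-off described above: the bound is what makes the approach work, and any match longer than `max_len` may be missed or truncated.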
I think it is possible to solve pattern matching without storing the whole string, or even the matching parts, in memory. All we need are the candidate matches we haven't closed yet. At least this works on a simple example: we have a buffer size of 3 characters, a pattern like a\d+b, and a string like "aa23436bx". If we go through this, I start candid[1] in the first chunk and finish it in the second chunk without keeping the first chunk. All I kept is the beginning position.
Of course, a real engine would need more information than that, especially for complicated patterns, capturing groups, etc., but in my opinion it would still be better than keeping the whole string in memory. And if we allow a second pass over the stream, we could extract the matching substrings too. Maybe I am wrong and RE2 already supports this, but I found no mention of it. Any opinions?
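The idea of keeping only open candidates can be sketched for this one pattern. The Python below is a hand-written scanner for a\d+b only, not a general engine and not how RE2 works; the name `stream_matches` is hypothetical. For this pattern at most one candidate can be open at a time, so the state is just a start offset and a flag. Digits are assumed to be ASCII 0-9.

```python
from typing import Iterable, Iterator, Optional, Tuple


def stream_matches(chunks: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) offsets of a\\d+b matches across chunk boundaries.

    Only the start offset of the open candidate is kept between chunks;
    the chunk text itself is never stored.
    """
    start: Optional[int] = None  # offset of the 'a' of the open candidate
    seen_digit = False           # True once the candidate has consumed a digit
    pos = 0                      # offset of the current character in the stream
    for chunk in chunks:
        for ch in chunk:
            if start is not None and "0" <= ch <= "9":
                seen_digit = True            # candidate keeps growing
            elif start is not None and seen_digit and ch == "b":
                yield (start, pos + 1)       # candidate closed: report it
                start, seen_digit = None, False
            elif ch == "a":
                start, seen_digit = pos, False  # (re)open a candidate here
            else:
                start, seen_digit = None, False  # candidate failed
            pos += 1


if __name__ == "__main__":
    # The example from the post, split into 3-character chunks.
    print(list(stream_matches(["aa2", "343", "6bx"])))  # → [(1, 8)]
```

The reported offsets (1, 8) correspond to the substring a23436b. Extracting the text itself would need either a second pass over the stream, as suggested above, or buffering from the candidate's start, which brings back the memory cost this approach tries to avoid.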