Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update regex-automata to 0.3 #4

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Threated
Copy link

@Threated Threated commented Sep 17, 2023

This fixes #3 and once the dependency is updated in tracing-subscriber it will no longer require regex-automata 0.1 and only 0.3 as this crate is used when the env-filter feature is enabled requiring both versions as of now.

Regarding this PR it is not quite ready yet as I am not sure on how to handle the start state stuff. Obviously the expects can't stay. Sadly we can't construct a dense::Error ourselves and I am not really 100% convinced that storing the starting state in the Pattern as well is a good idea maybe we pass a bool to the Matcher constructor?

@Threated
Copy link
Author

cc @hawkw

@matthiasbeyer
Copy link

ping @hawkw we want this.

Copy link
Owner

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure how I feel about the use of universal_start_state here, given that the regex-automata documentation states that

It is always correct for implementations to return None, and indeed, this is what the default implementation does. When this returns None, callers must use either start_state_forward or start_state_reverse to get the starting state.

It's a bit unfortunate that an automaton can no longer be constructed which is guaranteed to have a universal start state. I wonder if it's possible to use the Input API with the first chunk of input to the matcher to generate the start state when we start writing input to the Matcher for the first time, and then continue feeding individual bytes. I'm not sure if the regex-automata Input API works correctly if we do this, so I don't know if this is actually a good idea. We'd have to test that we can still match expressions correctly in that case.

Alternatively, we could look into whether it's possible to configure the DFA Builder to reject any patterns which would lack a universal start state.

Honestly, I'm not sure whether the regex-automata 0.3 API allows us to do the things we want to do correctly or not. It seems like the expectation is that the DFA is always provided with the complete haystack in the form of an Input. Whatever approach we end up pursuing, we probably need to add additional tests for more complex patterns to ensure our somewhat unorthodox use of the regex-automata 0.3 API works correctly.

src/lib.rs Outdated
let automaton = dense::DFA::new(pattern)?;
let start = automaton
.universal_start_state(Anchored::No)
.expect("I hope this works");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably return our own error type, which would be a #[non-exhaustive] enum of either a dense::BuildError or our own error variant indicating that the pattern is invalid because the automaton would have no universal start state. That way, we can remove the expect here.

@Threated
Copy link
Author

It's a bit unfortunate that an automaton can no longer be constructed which is guaranteed to have a universal start state. I wonder if it's possible to use the Input API with the first chunk of input to the matcher to generate the start state when we start writing input to the Matcher for the first time, and then continue feeding individual bytes. I'm not sure if the regex-automata Input API works correctly if we do this, so I don't know if this is actually a good idea. We'd have to test that we can still match expressions correctly in that case.

I played around with that idea a litte bit and made a rather ugly but kind of working POC you can take a look at. I changed the ToMatcher trait a bit there but we could maybe get around that.

Alternatively, we could look into whether it's possible to configure the DFA Builder to reject any patterns which would lack a universal start state.

Have not looked into that yet. I really wonder how the old version of regex-automata guarantied a start state.

Honestly, I'm not sure whether the regex-automata 0.3 API allows us to do the things we want to do correctly or not. It seems like the expectation is that the DFA is always provided with the complete haystack in the form of an Input. Whatever approach we end up pursuing, we probably need to add additional tests for more complex patterns to ensure our somewhat unorthodox use of the regex-automata 0.3 API works correctly.

Yep

The simplest way would of course be to reject every pattern that does not have a universal start state but that would maybe break existing usages of this library. So I think its worth exploring some options.

@hawkw
Copy link
Owner

hawkw commented Oct 1, 2023

The simplest way would of course be to reject every pattern that does not have a universal start state but that would maybe break existing usages of this library. So I think its worth exploring some options.

Well, I think the regex-automata upgrade is going to be a breaking change regardless, since we currently re-export its error types. It would be nice to not break any existing patterns, but this change will require a semver bump no matter what...

@Threated
Copy link
Author

Threated commented Oct 1, 2023

Yeah I was talking more about breaking existing patterns but of course this is a breaking change no matter what. The main motivation of this PR was to get rid of the duplicate regex-automata version in tracing-subscriber so I am worried about compatibility with existing patterns but I don't really know how this crate is used in tracing-subscriber so I am not sure if this will be a problem or not.

@matthiasbeyer
Copy link

I had a quick look at tracing-subscriber two days ago and from what I saw I can at least say that this crate does not appear in the public interface of tracing-subscriber. I even tried to actually get rid of the matchers dependency in tracing-subscriber - that was before this PR got traction 😆 - so that tracing-subscriber got the duplicated dependency on regex_automata removed.
More I cannot say though.

@plugwash
Copy link

Note that regex-automata 0.4 has now been released and is now used by the latest version of the regex crate. So if you want to eliminate duplicate dependencies that is the version you would want to move to.

@Threated
Copy link
Author

So I added the error enum and bumped the regex-automata version and it didn't break anything (have not really looked what actually changed). I also ran the tracing-subscriber tests using this version of the crate and they seemed to pass.
I am still worried about this breaking existing patterns that used to work but I feel like I don't know enough about what used to work and what might not anymore to add tests for it.

@juliangilbey
Copy link

Hi there! Is there any progress on this? I ask as a non-Rust expert Debian developer for whom a matchers compatible with regex-automata 0.4 would be a great help! (It turns out that maturin depends on matchers but Debian unstable/testing no longer supports regex-automata 0.1 😢 ) Many thanks!

@SorenHolstHansen
Copy link

Any progress on this or the other PR updating regex-automata to 0.4?

@Threated
Copy link
Author

Well I kind of don't know how to proceed with this. It seems like this is the closest I could get to matching the old behavior with upgrading the version and indeed at least the tests for tracing subscriber were passing with this version last time I checked. Though I'm not 100% sure it will work for every use case because of the start state problem. @hawkw what do you think how could this PR proceed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update regex-automata dependency
6 participants