Implement thinning consistently #2

Open
thatch opened this issue May 4, 2014 · 1 comment · May be fixed by #27

Comments

@thatch
Collaborator

thatch commented May 4, 2014

When the user passes in a charset, it's currently only used for dot. I'm expanding this to use the intersection of the passed-in charset with categories like \w, \s, and \d as well, but I don't intend to apply it to literals.
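A minimal sketch of the intersection idea, using only the standard `re` module. The helper name `thin_category` is hypothetical and is not this project's API; it only illustrates keeping the charset members that a category matches.

```python
import re
import string


def thin_category(pattern, charset):
    """Return the characters of `charset` matched by a single-character
    pattern such as r"\\d", preserving the order of `charset`."""
    matcher = re.compile(pattern)
    return [c for c in charset if matcher.fullmatch(c)]


# Hypothetical charset a caller might pass in.
charset = string.ascii_lowercase + string.digits + " "

digits = thin_category(r"\d", charset)  # only the digits from charset
words = thin_category(r"\w", charset)   # letters and digits from charset
spaces = thin_category(r"\s", charset)  # only the space from charset
```

Literals are left alone under this scheme: a literal `x` in the pattern would still yield `x` even if the charset omits it.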

Should it apply to character classes? I'm not sure.

For some, like [^\w], it's pretty clear it should (once Unicode support lands), but others, like [a-z_], are already fairly limited.
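To make the contrast concrete, here is a rough sketch that expands both classes against a full ASCII universe and against a small charset. The `expand_class` helper and the sample charset are assumptions for illustration only, and ASCII stands in for the much larger Unicode range.

```python
import re


def expand_class(char_class, universe):
    """List the members of `universe` matched by a character class."""
    matcher = re.compile(char_class)
    return [c for c in universe if matcher.fullmatch(c)]


ascii_universe = [chr(cp) for cp in range(128)]
charset = "abc123_-. "  # hypothetical caller-supplied charset

# The negated class is large even in ASCII and grows further with Unicode.
negated_full = expand_class(r"[^\w]", ascii_universe)
negated_thin = expand_class(r"[^\w]", charset)

# The explicit class is already small; thinning it silently drops d-z.
explicit_full = expand_class(r"[a-z_]", ascii_universe)
explicit_thin = expand_class(r"[a-z_]", charset)
```

Thinning `[^\w]` bounds an otherwise open-ended set, while thinning `[a-z_]` removes characters the pattern author listed on purpose.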

@jayvdb jayvdb linked a pull request Sep 22, 2020 that will close this issue
@jayvdb
Contributor

jayvdb commented Sep 22, 2020

#27 takes a stab at this by allowing the caller to provide a different subset for each category, which, if used sanely, allows control over the expansion of \w, \d, and \s.
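As a rough illustration of what "used sanely" could mean, the sketch below checks that each caller-supplied subset only contains characters its category actually matches. The mapping keys and the `validate_category_subsets` function are hypothetical and do not reflect the interface in #27.

```python
import re

# Hypothetical category names mapped to the regex escapes they stand for.
CATEGORY_PATTERNS = {"word": r"\w", "digit": r"\d", "space": r"\s"}


def validate_category_subsets(subsets):
    """Return each subset as a list, rejecting any character that its
    category would not match."""
    validated = {}
    for name, chars in subsets.items():
        matcher = re.compile(CATEGORY_PATTERNS[name])
        stray = [c for c in chars if not matcher.fullmatch(c)]
        if stray:
            raise ValueError(f"{name!r} subset has non-members: {stray!r}")
        validated[name] = list(chars)
    return validated


subsets = validate_category_subsets(
    {"word": "abc_", "digit": "01", "space": " "}
)
```

A subset such as `{"digit": "12a"}` would raise here, since `a` is not a digit and would make `\d` expand to something the regex cannot match.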

This helps a little towards re.UNICODE, as it brings charset= into the internals in a way that can be described and reasoned about when the categories have multiple possible values.
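A small example of a category having multiple possible values: in Python 3, `\d` on a `str` pattern matches Unicode decimal digits by default, while `re.ASCII` restricts it to `0-9`. The variable names are illustrative only.

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

# Default (Unicode) matching for str patterns accepts this digit.
unicode_match = re.fullmatch(r"\d", arabic_indic_three) is not None

# With re.ASCII, \d is limited to the ten ASCII digits.
ascii_match = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None
```

So the set that `\d` expands to depends on the flags in effect, which is why an explicit per-category subset is easier to reason about.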

It doesn't address explicit subsets like [a-z_]. I think there is more pain than joy in using charset= to reduce the result space in cases like this; it could only be used if the input regex is quite predictable to the developer.

@jayvdb jayvdb mentioned this issue Sep 23, 2020