Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regex limitations #81

Closed
wants to merge 1 commit into from
Closed

Conversation

Risto-Stevcev
Copy link

Resolves #68

@anko
Copy link
Contributor

anko commented Jul 2, 2016

Could you include a unit test?

The current regex tests are here.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 2, 2016

This does seem to solve the issue, but I'm a little unsure about the overall change provided here. I don't have a good sense of how people are using the P.regex parser, so I think this would have to be considered a breaking change, which makes me feel a bit uncomfortable about it.

Granted, this is still a 0.x library, so I can do that, but I want to feel really sure that it's not something too wild.

@michaelficarra
Copy link
Contributor

michaelficarra commented Jul 2, 2016

This seems like overkill. You should be able to do

- var anchored = RegExp('^(?:'+re.source+')', (''+re).slice((''+re).lastIndexOf('/')+1));
+ var anchored = RegExp('^(?:'+re.source+')', (''+re).slice((''+re).lastIndexOf('/')+1).replace('g', ''));

instead.

@Risto-Stevcev
Copy link
Author

@michaelficarra That doesn't solve the issue this is trying to solve. What this does is allow the regex to be greedy if the global flag is set, so the use cases I mentioned in #68 work

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 2, 2016

I think @michaelficarra makes a good point... RegExps with the g flag are actually stateful, so they really shouldn't even be supported in Parsimmon.

> twoChars = P.regex(/../g)
Parser { _: [Function] }

> twoChars.parse("aa")
{ status: true, value: 'aa' }

> twoChars.parse("aa")
{ status: false,
  index: { offset: 0, line: 1, column: 1 },
  expected: [ '/../g' ] }

> twoChars.parse("aa")
{ status: true, value: 'aa' }

> twoChars.parse("aa")
{ status: false,
  index: { offset: 0, line: 1, column: 1 },
  expected: [ '/../g' ] }

@michaelficarra
Copy link
Contributor

Actually, we need to be careful with the y flag, too. So

var anchored = RegExp('^(?:'+re.source+')', (''+re).slice((''+re).lastIndexOf('/')+1).replace(/[gy]/g, ''));

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 2, 2016

Yeah, based on the MDN page, I think we should only be passing i through. Everything else seems unsafe.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

@michaelficarra
Copy link
Contributor

No, it is important to preserve m and u as well.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 2, 2016

Oh, yeah I suppose those are useful too. So just y and g should be ignored/treated as an error/whatever.

@Risto-Stevcev
Copy link
Author

Risto-Stevcev commented Jul 2, 2016

Ah yeah, I forgot that g is stateful (I've definitely been tripped up by it in the past). My solution doesn't need the g flag in there though.

Maybe instead there can be a parameter to the regex function that you can pass true for in order to make it greedy?

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 2, 2016

For the parser you mentioned before, you can just manually wrap it in (...)+ and it should work.

> P.regex(/([+\-]?([0-9]+|[0-9]*\.[0-9]+)(e[+-]?[0-9]+)?)+/).parse('+1.4')
{ status: true, value: '+1.4' }

@Risto-Stevcev
Copy link
Author

Yeah that's true. I think it would still be good to have some way of specifying whether you would want the regex to be matched greedily to make the library easier to use. It wasn't clear that this was possible (you even said so) considering how long it stayed open, until I came up with this solution. But if you aren't interested in integrating a solution into the library itself then I can just close this pull request.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 3, 2016

Yeah... regexps are hard, so this took a while to think about. I'm gonna say that since this is possible within the library already, without any large effort, I'm gonna keep it as-is so that we can do the least amount of processing possible on regexps. I'm gonna open an issue to make warnings/errors for regexps with unsupported flags after this.

I'm going to decline this PR, but thanks for all the discussion on this -- it was a good learning experience!

@wavebeem wavebeem closed this Jul 3, 2016
@wavebeem wavebeem mentioned this pull request Jul 3, 2016
@Risto-Stevcev
Copy link
Author

Risto-Stevcev commented Jul 3, 2016

Yeah, regexp can be hard to reason about sometimes. The anchoring part in the regex code is also possible for the user to write too but it makes sense to have it built-in, and so I still think it would be good to include an optional param to turn on greedy evaluation.

This issue wasn't so much with regex as it was dealing with the quirks of the library, since the initial example I gave matched the regexp fully when I used match, but for some reason the library couldn't match it all. It's not obvious to the user that they would need to do something like what I did here to make it greedy, unless they open up the library and mess around with the internals to see how it works like I did. So I think at the very least if the optional param isn't included, then there should at least be a sentence in the docs for regex on how to make your regex a greedy one.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 3, 2016

Yeah, #83 should cover that, I think.

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

I would also be happy with, e.g. Parsimmon.rawRegex(...) which performs no wrapping, and you are required to anchor the thing yourself.

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

(which is how Parsimmon.regex used to work)

@michaelficarra
Copy link
Contributor

@jneen What would it even mean for the regex to not be anchored at the beginning?

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

It would mean nonsense, of course. But there are some regexen that have different behavior when wrapped in (?:...), so you could use rawRegex to get around that. Maybe unsafeRegexMakeSureYouAnchorTheThing?

@michaelficarra
Copy link
Contributor

michaelficarra commented Jul 5, 2016

there are some regexen that have different behavior when wrapped in (?:...)

Can you give an example? I thought it was simply due to the use of stateful flags (g and y).

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

See @anko's comment on #68.

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

Ah, I see, that was in fact because of /g. Well in that case I'm happy to just throw a panic if somebody passes us a /g regex.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 5, 2016

@jneen that's my plan for #83, along with /y

@jneen
Copy link
Owner

jneen commented Jul 5, 2016

Yay! Okay well it looks like you've got it all under control then :]

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 7, 2016

If anyone here wants to take a look at the PR I did to address this, feel free: #85

@Risto-Stevcev
Copy link
Author

@wavebeem It would be helpful to add a sentence on how to construct greedy regular expressions. A lot of grammars assume the lexer is greedy.

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 7, 2016

@Risto-Stevcev What's the shortest example you can give me for that? I'd be happy to add it.

@Risto-Stevcev
Copy link
Author

@wavebeem How about this one:

You can make your regex greedy by wrapping it in (regexp)+ (ie. /(a+|a*b)+/).

@wavebeem
Copy link
Collaborator

wavebeem commented Jul 7, 2016

Looks good to me. Thanks.

On Jul 7, 2016, at 12:14, Risto Stevcev notifications@github.com wrote:

@wavebeem How about this one:

You can make your regex greedy by wrapping it in (regexp)+ (ie. /(a+|a*b)+/).


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants