Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escaped whitespace and slashes in Heregexes #3214

Merged
merged 1 commit into from
Oct 22, 2013

Conversation

marchaefner
Copy link
Collaborator

Escaped whitespace and slashes in Heregexes

Escaped whitespaces and slashes in Heregexes

It's a bit ugly but it implements #3059 and fixes #2238.
(Google and MSDN say this should work with all relevant JS engines.)

This PR also fixes the HEREGEX regex, so this is a thing now:

threeSlashes = /// \/// ///

Notes

  • Linebreaks are excluded from the escapable characters to avoid any confusion. (A backslash at the end of a line usually denotes a continuation on the next line but would mean the exact opposite here.)

  • Whitespace escaping introduces a potential pitfall:

    /// this\ #is not a comment ///
    

    It would be possible to treat this as a comment but that would be somewhat inconsistent and break other constructs (e.g. /// ^[\ #].* ///).

* Resolves jashkenas#3059: Don't remove escaped whitespace.
* Fixes jashkenas#2238: Prevent escaping slashes that are already escaped.
* Fix detection of end of heregex with escaped slashes.
@jashkenas
Copy link
Owner

Fabulous. Keep up the good work!

jashkenas added a commit that referenced this pull request Oct 22, 2013
Escaped whitespace and slashes in Heregexes
@jashkenas jashkenas merged commit 54840c0 into jashkenas:master Oct 22, 2013

HEREGEX_OMIT = /\s+(?:#.*)?/g
HEREGEX_OMIT = ///
((?:\\\\)+) # consume (and preserve) an even number of backslashes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not sufficient. In the case of three backslashes, this will just consume the latter two. Use ^([^\\]|\\.)* instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RE consumes the string left to right, evaluating the first alternative first. In the case of three backslashes the first two are consumed by the first alternative, leaving the third backslash for another sub-rule. (There are also three backslashes in the test.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marchaefner: I see, nothing follows the match. I didn't realise there were two alternations in the regex. If something had followed, the regex would backtrack in order to make the match succeed. Okay, LGTM.

@michaelficarra
Copy link
Collaborator

@jashkenas: It'd be nice if we left a little time for more than one reviewer to look at PRs, especially ones that can be as tricky as this.

@satyr
Copy link
Collaborator

satyr commented Oct 23, 2013

Linebreaks are excluded from the escapable characters to avoid any confusion

Isn't this counter-intuitive? I'd expect

/// a \
    b ///

to mean /a\nb/ rather than /a\b/.

@jashkenas
Copy link
Owner

Isn't this counter-intuitive? I'd expect [...] to mean /a\nb/ rather than /a\b/.

Agreed.

@marchaefner
Copy link
Collaborator Author

Isn't this counter-intuitive?

It may seem somewhat illogical, but my reasoning was that an operator (\ at end of line) shouldn't have two opposite meanings depending on context. (That would be counter-intuitive IMO.)

One could also expect the above example to mean /a\r\nb/ if the source has Windows line breaks.

Besides it's always possible (and much better style) to just write a literal \n.

Nevertheless, it needs fixing. But I think it should compile to /ab/.

@davidchambers
Copy link
Contributor

Another option is for an end-of-line \ in a heregex to produce a compile-time error. I'm sure we all agree that a literal \n is clearer. Disallowing \ at end of line in this context would remove the possibility for confusion.

@satyr
Copy link
Collaborator

satyr commented Oct 23, 2013

two opposite meanings depending on context

In what context the other meaning comes into play?

Seems to me allowing trailing \ as \n is:

  1. Much simpler to explain: "Heregexes ignore non-escaped whitespaces."

  2. Consistent with the heredoc behavior:

    $ bin/coffee -bcs
      '''
      a \
      b
      '''
    
    // Generated by CoffeeScript 1.6.3
    'a \nb';
    

@connec
Copy link
Collaborator

connec commented Oct 23, 2013

It seems that if, in the context of regular expressions, \<space> means <space> in the resulting regex, then \<linebreak> should mean <linebreak> in the resulting regex.

Different line endings break that, though, so you either have to go for what's in the file (which would be horrible), a 'standard' (probably \n), ignore it, or disallow it. The latter, whilst perhaps not the least surprising, is probably the easiest to debug if your assumptions are wrong (e.g. "Syntax error on line ...: can't use \ at end of line in heregex").

Edit: given there's precedent in heredocs, using \n would probably be the least surprising.

@jashkenas
Copy link
Owner

Yes -- in terms of CoffeeScript, \n is a linebreak. No windows nonsense ;)

I think @satyr's reasoning here is convincing. Let's allow it to escape whitespace and newlines, as in heredocs.

@michaelficarra
Copy link
Collaborator

Plus, that's how CSR behaves. On purpose.

@marchaefner
Copy link
Collaborator Author

In what context the other meaning comes into play?

x = some.long thing \
      or something.else
  1. Much simpler to explain: "Heregexes ignore non-escaped whitespaces."

But it clashes with the other simple explanation: "Long lines can be split with \"

  1. Consistent with the heredoc behavior

IMHO heredocs do it wrong and should be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Escape literal whitespace within Heregex (not a dup!) Multiline Regex backslash behavior
6 participants