Ruby -> JavaScript regexp support #6

Closed
joecorcoran opened this issue Apr 19, 2012 · 12 comments

@joecorcoran
Collaborator

There are big differences between the regular expression capabilities of Ruby and JavaScript. It's not within the scope of this project to bridge the gap, but it does cause problems for certain typical uses of the format validator. See issue #5 for an example.

At the moment, I'm thinking that the best way to solve this is to write a plugin for XRegExp which will add some of the more commonly used Oniguruma features (character types and anchors, POSIX bracket syntax for character classes). We can then use XRegExp in place of the native RegExp in Judge's client-side format validator method.

@ghost ghost assigned joecorcoran Apr 19, 2012
@slevithan

I'm interested to see any work you do towards this. But you should know up front that it's not possible to translate all Ruby/Oniguruma features to JS using XRegExp.

Shorthand classes like Oniguruma's \h and \H are easy-peasy (I'd just warn that Oniguruma's definitions of these two classes are different than Perl and PCRE, which use \h to match horizontal whitespace). POSIX character classes, again, are totally doable. Code for that is actually already written. Click the "Show more XRegExp.addToken examples" link on the XRegExp API page to see it. Ruby's \z anchor is also easy; just output '$(?!\\s)'.
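
As a quick sanity check of the '$(?!\\s)' output suggested for \z, here is a plain-JavaScript sketch (no XRegExp involved; the variable names are made up for illustration). It suggests that, with JS's /m flag on, the lookahead rejects every line end except the very end of the string:

```javascript
// With /m, $ matches before any line terminator as well as at the end of input.
// Every line terminator is also matched by \s, so (?!\s) leaves only the
// true end of the string.
const lineEnd = /foo$/m;
const stringEnd = /foo$(?!\s)/m;

const zAnchorResults = {
  lineEndMidString: lineEnd.test('foo\nbar'),          // $ alone accepts a line end
  stringEndMidString: stringEnd.test('foo\nbar'),      // rejected: "\n" follows
  stringEndTrailingNewline: stringEnd.test('foo\n'),   // rejected: "\n" follows
  stringEndExact: stringEnd.test('foo')                // accepted: nothing follows
};

console.log(zAnchorResults);
```

The trailing-newline case is the one that separates \z from \Z in Ruby, which is why a bare $ would not be enough here.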

The only thing you mentioned that isn't possible is Ruby's \A anchor. Since JS doesn't support lookbehind, it's not possible to translate \A in a way that will still work when JS's /m flag (which is NOT the same as Ruby's /m) is applied.

You could make it work whenever JS's /m is not active via something like this:

XRegExp.addToken(
  /\\A/,
  function () {
    if (this.hasFlag('m')) {
      throw new SyntaxError('cannot use \\A with flag /m');
    }
    return '^';
  }
);

Since that doesn't specify a token scope, it will work outside of character classes only, which is what you want here.

The problem with the above approach is that, if you want to emulate Ruby, you should enable /m for every regex, since Ruby's ^ and $ always match at newlines (and there's no mode to disable that).

Note that Ruby's /m works like XRegExp's /s.
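
To make the flag mapping concrete, here is a minimal sketch in plain JavaScript. The helper name rubyFlagsToJs is hypothetical, and the sketch assumes an engine with the native s (dotAll) flag standing in for XRegExp's /s. Ruby's x flag has no native equivalent, so only i and m are meaningful inputs here:

```javascript
// Hypothetical helper (not part of XRegExp or Judge): translate Ruby-style
// flags into flags that give roughly the same behavior in JavaScript.
function rubyFlagsToJs(rubyFlags) {
  // Ruby's /m means "dot matches newline", which is the s flag here.
  const translated = (rubyFlags || '').replace(/m/g, 's');
  // Ruby's ^ and $ always match at line breaks, so JS's m is always added.
  return translated + 'm';
}

// With Ruby's /m, the dot may cross the newline between "b" and "c".
const withRubyM = new RegExp('^b.c$', rubyFlagsToJs('m'));
// Without it, ^ and $ still match at line breaks, but the dot stops at "\n".
const withoutRubyM = new RegExp('^b.c$', rubyFlagsToJs(''));

console.log(rubyFlagsToJs('im'));            // "ism"
console.log(withRubyM.test('a\nb\nc'));      // true
console.log(withoutRubyM.test('a\nb\nc'));   // false
```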

Also note that the XRegExp.exec and XRegExp.test functions support pos and sticky arguments that allow you to accomplish the same thing as \A. E.g., XRegExp.test('str', /regex/, 0, 'sticky') will return true only if /regex/ matches at the zeroth character in the string.
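
For comparison, the same "must match at position zero" idea can be sketched with the native sticky (y) flag from ES2015. This is only an analogue of the XRegExp arguments described above, and it is an assumption that the two behave the same in every edge case. The function name matchesAtStart is made up for this example:

```javascript
// Hypothetical helper: test whether the pattern matches starting exactly at
// index 0, regardless of whether the m flag is set.
function matchesAtStart(source, flags, str) {
  // Add the sticky flag (removing any existing y first to avoid a duplicate).
  const re = new RegExp(source, flags.replace(/y/g, '') + 'y');
  re.lastIndex = 0; // a sticky regex only attempts a match at lastIndex
  return re.test(str);
}

console.log(matchesAtStart('foo', 'm', 'foo\nbar')); // true
console.log(matchesAtStart('bar', 'm', 'foo\nbar')); // false
console.log(/^bar/m.test('foo\nbar'));               // true: ^ with /m is not \A
```

The last line shows the problem being worked around: under /m, ^ accepts the start of the second line, whereas the sticky test does not.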

@joecorcoran
Collaborator Author

Thanks for this comment Steven, it's really very helpful.

I'm only planning on translating as many of the Oniguruma features as I can; hopefully I can cover some of the more common cases. People will still have to take care when writing expressions for both environments but at least it will be a little less painful.

I started work on this last week and funnily enough, the first thing I tried to tackle was the \A anchor. A disheartening start :)

I'll let you know when I've got the project up on GitHub. I'm out of the country for the next few weeks, so it might be late June by the time anything of note is finished. Thanks again for your help!

@slevithan

Coolness. Yours might be the first third-party standalone XRegExp addon, so I'm looking forward to it.

BTW, \h and \H are actually trickier than I suggested. I wasn't thinking about character classes when I wrote the previous comment. It's still very doable, though, with something like this:

XRegExp.addToken(
  /\\([hH])/,
  function (match, scope) {
    var inv = (match[1] === 'H'); // Uppercase for inverted
    if (scope === 'class') {
      return inv ? '\\0-/:-@G-`g-\\uffff' : '0-9A-Fa-f';
    }
    return '[' + (inv ? '^' : '') + '0-9A-Fa-f]';
  },
  {scope: 'all'}
);
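
The explicit ranges used for \H inside a character class are easy to get subtly wrong, so here is a small plain-JavaScript check (no XRegExp needed; the names are invented for this sketch). It tests whether those ranges are the exact complement of 0-9A-Fa-f across all UTF-16 code units:

```javascript
// The hex-digit class, and the hand-written ranges that stand in for its
// negation when \H appears inside an existing character class.
const hexDigit = /^[0-9A-Fa-f]$/;
const nonHexRanges = /^[\0-\/:-@G-`g-\uffff]$/;

// Returns true if every code unit is matched by exactly one of the two classes.
function rangesAreComplementary() {
  for (let code = 0; code <= 0xffff; code++) {
    const ch = String.fromCharCode(code);
    if (hexDigit.test(ch) === nonHexRanges.test(ch)) {
      return false; // matched by both or by neither
    }
  }
  return true;
}

console.log(rangesAreComplementary());
```

The ranges are needed because a negated class cannot be nested inside another class in JavaScript regex syntax, so [^0-9A-Fa-f] is only usable outside of one.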

Also, it might be best to add your Ruby emulation only via a special constructor or function, so that standard XRegExp syntax remains as is when using the XRegExp constructor. You could do that using something like this:

(function (XRegExp) {
  var rubyMode = false;

  function RubyRegex(pattern, flags) {
    // Follow Ruby's flag /m -> /s quirk, and always apply the JavaScript /m
    // flag so that ^ and $ work like Ruby
    flags = (flags || '').replace(/m/g, 's') + 'm';
    // Enable all the Ruby syntax extension tokens
    rubyMode = true;
    try {
      return XRegExp(pattern, flags);
    } catch (err) {
      throw err;
    } finally {
      // Need to turn off rubyMode even if bad syntax caused an error
      rubyMode = false;
    }
  }

  XRegExp.install('extensibility');

  // These tokens are activated only when building a regex using RubyRegex
  // (not XRegExp), due to their trigger functions

  XRegExp.addToken(
    /\\A/, // Extend this alternation with other unsupported tokens as necessary
    function (match) {
      throw new SyntaxError(match[0] + ' is not supported');
    },
    {
      // Might need another token for unsupported Ruby syntax in scope
      // 'class' or 'all'
      scope: 'default',
      trigger: function () {return !!rubyMode;}
    }
  );

  XRegExp.addToken(
    /\\z/,
    function () {
      return '$(?!\\s)';
    },
    {
      trigger: function () {return !!rubyMode;}
    }
  );

  // Add more tokens with the same trigger function...

}(XRegExp));

All of the code I've posted in these comments is untested, so beware of bugs. Hopefully this is helpful, though. :) Note that RubyRegex, as posted, doesn't replace flag m with s if the flag is provided via a leading mode modifier like (?xm). That shouldn't be difficult to add support for, though. You can use, e.g., /^\(\?[\w$]+\)/ to match and manipulate a leading mode modifier in provided pattern strings, before passing to XRegExp.
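
A rough sketch of that last step, using the regex mentioned above. The function name translateLeadingModifier is hypothetical, and the sketch only rewrites m to s inside a mode modifier at the very start of the pattern string:

```javascript
// Hypothetical pre-processing step: rewrite Ruby's m to s inside a leading
// mode modifier such as (?xm), before the pattern is handed to XRegExp.
function translateLeadingModifier(pattern) {
  return pattern.replace(/^\(\?[\w$]+\)/, function (modifier) {
    return modifier.replace(/m/g, 's');
  });
}

console.log(translateLeadingModifier('(?xm)a.b')); // "(?xs)a.b"
console.log(translateLeadingModifier('(?i)a.b'));  // "(?i)a.b" (nothing to change)
console.log(translateLeadingModifier('a(?m)b'));   // "a(?m)b" (not leading, left alone)
```

RubyRegex already appends the JavaScript m flag through the flags argument, so the modifier itself only needs the m-to-s substitution.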

@joecorcoran
Collaborator Author

I'm going ahead with the special constructor for now, but I was just thinking: it might be cool to have a way of switching on a plugin like this through the XRegExp interface.

Something like:

XRegExp.use('ruby', true);

which would allow us to add new tokens like this:

XRegExp.addToken(
  /\\z/,
  function () {
    return '$(?!\\s)';
  },
  {
    trigger: function () { return XRegExp.using('ruby'); }
  }
);

without a new constructor. You'd need to store the keys and values passed by the use method, but that's pretty simple.
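
For illustration, the storage could be as small as this. It is a standalone sketch with free functions named use and using standing in for the proposed XRegExp.use and XRegExp.using, neither of which exists in XRegExp:

```javascript
// Hypothetical stand-ins for the proposed XRegExp.use / XRegExp.using pair.
// A prototype-less object avoids accidental hits on inherited keys.
const enabledTokenSets = Object.create(null);

// Switch a named token set on or off.
function use(name, enabled) {
  enabledTokenSets[name] = Boolean(enabled);
}

// Report whether a named token set is currently switched on.
function using(name) {
  return enabledTokenSets[name] === true;
}

use('ruby', true);
console.log(using('ruby'));  // true
use('ruby', false);
console.log(using('ruby'));  // false
console.log(using('steve')); // false (never registered)
```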

Have you considered something like this before? I'm happy to fork XRegExp to demonstrate further if you want to explore it.

@slevithan

That's a nice design. I like it. But I'd rather not rush into accepting new features that are exclusively intended for running XRegExp with different token sets via addons. (To use less XRegExp-exclusive terminology, I'm talking about features for swapping regex flavors on the fly.)

Another way to do something similar would be to add a method that returns a new XRegExp object that uses a fresh and discrete list of tokens. Here's some hypothetical code:

var RubyRegex = XRegExp.gimmeAFreshXRegExpYo();

RubyRegex.addToken(
  // No trigger function needed here
);

That way, you wouldn't have to worry about causing conflicts with code that calls the XRegExp constructor without realizing it will be using the Ruby token set. Your addon could then safely be dropped into a page that already uses XRegExp. That would also be more efficient, since the Ruby tokens wouldn't have to be evaluated (i.e., their trigger functions don't have to run) when calling XRegExp, and vice versa for your Ruby constructor.

I'm happy to look over any changes in an XRegExp fork. But if your goal is to get them accepted upstream in the near term, it might be best to leave out use and using. Instead, you might want to make changes that make it possible to create the XRegExp.use and XRegExp.using functions yourself, within an addon.

Also, I'm not sure what the second (boolean) argument in your use function is for. Why not just run XRegExp.use('ruby') after first running something like XRegExp.addTokenSet('ruby')? Is the idea that you could make multiple token sets active at the same time? If so, that's interesting, and not something I've previously considered. Are you envisioning code like the following?

XRegExp.use('ruby', true);
XRegExp.use('steve', false); // turn off just this one
XRegExp.use('joe', true);
// Now using a mashup of ruby and joe token sets, in addition to the default tokens

The biggest challenge I envision with that is that tokens can overlap or otherwise conflict. Right now, there's a simple rule: the token added latest wins. In the above scenario, I'm not sure what the semantics would be.
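
To illustrate why the semantics get murky, here is a toy model in plain JavaScript. None of these names (addSetToken, setActive, resolveToken) are XRegExp APIs, and real token matching is far more involved. The sketch only shows that "latest added wins" depends on registration order rather than on the order in which sets are switched on:

```javascript
// Toy registry: each token remembers which set it belongs to.
const registeredTokens = [];
const activeSetNames = new Set();

function addSetToken(setName, source, output) {
  registeredTokens.push({ setName: setName, source: source, output: output });
}

function setActive(setName, enabled) {
  if (enabled) {
    activeSetNames.add(setName);
  } else {
    activeSetNames.delete(setName);
  }
}

// "Latest added wins": scan from the newest token backwards and return the
// first one that matches and belongs to an active set.
function resolveToken(source) {
  for (let i = registeredTokens.length - 1; i >= 0; i--) {
    const token = registeredTokens[i];
    if (token.source === source && activeSetNames.has(token.setName)) {
      return token.output;
    }
  }
  return null;
}

addSetToken('ruby', '\\z', '$(?!\\s)'); // registered first
addSetToken('joe', '\\z', '$');         // registered last
setActive('joe', true);
setActive('ruby', true);                // enabled last, but that has no effect

console.log(resolveToken('\\z')); // "$" (joe's token, because it was added last)
```

A user who enables ruby last might reasonably expect the Ruby definition to win, which is the kind of ambiguity that would need a defined rule.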

FYI, I don't expect that I will allow any related features (such as the design you proposed or the one I described) to permit starting with a blank slate of no tokens (i.e., reverting to native JavaScript syntax). Some of the built-in tokens are critical to the bug-free functioning of XRegExp and its official addons. In particular, I'm thinking of the built-in tokens for making empty character classes work consistently cross-browser (necessary if you want to parse regex syntax, since otherwise you can't know where character classes end), and for disallowing octals (which is relied upon by XRegExp.union and XRegExp.build, when rewriting backreferences).

BTW, if there's a good solution to the problem of token precedence with XRegExp.use, then I could see using the function internally. That would help to justify its inclusion in the base library. Alternatively, the functionality could be bundled into XRegExp.install and XRegExp.uninstall. Here's a mockup of how I see XRegExp's built-in tokens being activated internally, if this were implemented:

// Critical tokens like those for disabling octals are always included and cannot
// be installed/uninstalled
XRegExp.install({
  namedCapture: true,
  builtinFlags: true,
  strictErrors: true,
  miscSyntax: true
});

That way, XRegExp's syntax would stay the same, out of the box, but logical token groups (rather than individual tokens) could be disabled upon request.

Not sure how quickly any of this will come to fruition, but it's a good discussion. I do want XRegExp to have robust support for addons. However, I'd also like to keep the addon API simple and prevent it from adding significant file size to xregexp.js.

@joecorcoran
Collaborator Author

Lots to think about here! I actually like your idea for returning a new XRegExp object; it would be great to have control adding functionality without fear of conflicts.

The boolean argument for use was intended as an on/off switch for using multiple token sets, yeah. I was intending to keep the last-token-is-the-winner situation too. You're totally right that it could get hairy without careful management. Needs more thought. It might be most sensible, as you say, to bundle this kind of behaviour up into install and uninstall.

Anyway, I'll carry on with the wrapped constructor from above. I'm keen to get my teeth into writing the new tokens. I'll get back to you when I'm back at home in a few weeks with some more things to ponder! Thanks for the chat.

@slevithan

> I actually like your idea for returning a new XRegExp object; it would be great to have control adding functionality without fear of conflicts.

Honestly, that approach would probably be safer and more manageable than use/install based functionality. Not only because the use/install route potentially adds complex semantics and issues to worry about in the future, but also because XRegExp is already almost too customizable for its own good. In particular, being able to remove XRegExp's built-in syntax might not actually be in the best interest of users.

> Anyway, I'll carry on with the wrapped constructor from above. I'm keen to get my teeth into writing the new tokens. I'll get back to you when I'm back at home in a few weeks with some more things to ponder! Thanks for the chat.

If you have any questions about whether particular Oniguruma features can currently be reproduced in XRegExp, I'd be happy to answer.

@dlee

dlee commented Aug 6, 2012

Is the reverse of this available (i.e., JavaScript regex -> Ruby regex)?

I'm assuming JavaScript regex is less powerful than Ruby regex, and thus the translation from JavaScript -> Ruby might be more thorough.

@slevithan

@joecorcoran FYI, due to various changes in recent builds of XRegExp 3.0.0-pre, the wrapped constructor approach will no longer work. Rather, I now recommend simply using a shared flag (such as R for Ruby) that applies to all of your custom syntax tokens. Perhaps a constructor like RubyRegex that implicitly sets the R flag would still be worthwhile.

Also, the way that syntax tokens are linked to flags has been simplified. E.g., instead of this:

XRegExp.addToken(
    /\\z/,
    function() {
        return '$(?!\\s)';
    },
    {trigger: function() {
        return this.hasFlag('R');
    }}
);

...In XRegExp 3 you will need to use this:

XRegExp.addToken(
    /\\z/,
    function() {
        return '$(?!\\s)';
    },
    {flag: 'R'}
);

@waymondo

@joecorcoran Just discovered this gem and I'm way into it so far, great work.

I also ran into this issue and with some googling I found this method in the new rails routing inspector:

https://github.com/rails/rails/blob/b67043393b5ed6079989513299fe303ec3bc133b/actionpack/lib/action_dispatch/routing/inspector.rb#L42

I patched it in here and it seems to be working so far:

waymondo@e329c74

I could create a pull request for it but I figured I'd point it out to you in case you thought there was a better place to patch it in first.

@joecorcoran
Collaborator Author

Hey @waymondo, thanks for pointing that out, I've never noticed it before. It might actually turn out to be a cheap way of achieving what this ticket was originally discussing!

I definitely wouldn't want to just throw it in there though – it would be much better as a convert_regexp option that users can turn on in the config. If you want to give that a go and add some tests I'd be happy to merge.

@joecorcoran
Collaborator Author

Closing this as part of bug triage. Feel free to continue discussion.
