Regexp Destructuring #1378

Closed
gampleman opened this Issue May 19, 2011 · 22 comments

@gampleman

Example

/Hello (?<subject>\w+), what's (?<direction>\w+)/ = "Hello world, what's up?"

console.log subject # "world"
console.log direction # "up"

# Also works with nesting, etc.

{painter: /(?<first_name>\w+) (?<last_name>\w+)/, num_paintings} = {painter: "Pablo Picasso", num_paintings: 34}

Description

Regexps could be used to create variables from named capture groups applied to strings.

  1. Is this a bad idea? Why?
  2. If not, would something like this be accepted into CoffeeScript?
@michaelficarra
Collaborator

Can you also include a suggested compilation or two?

@gampleman

Compiled js would likely look something like this for the first example:

var _matches, subject, direction;
_matches = "Hello world, what's up?".match(/Hello (\w+), what's (\w+)/), subject = _matches[1], 
direction = _matches[2];

The second would look like this:

var first_name, last_name, num_paintings, _ref, _matches;
_ref = {
  painter: "Pablo Picasso",
  num_paintings: 34
}, _matches = _ref.painter.match(/(\w+) (\w+)/), first_name = _matches[1], last_name = _matches[2], 
num_paintings = _ref.num_paintings;

Edit: a function would look like this:

parse_assignment = (/^(?<variable>[\w\_]+) = (?<value>\d+|".+"); *$/) ->
  @variables[variable] = value

var parse_assignment;
parse_assignment = function(_arg) {
  var _match, variable, value;
  _match = _arg.match(/^([\w\_]+) = (\d+|".+"); *$/), variable = _match[1], value = _match[2];
  return this.variables[variable] = value;
}
@erisdev

I like the idea, but I don't think the proposed syntax is very obvious as to what it does. It also poses a bit of a problem in that it would only work with literals and introduces special-case behavior of emulating named captures.

Destructuring assignment with regexp matches is already possible, of course, but it's ugly as hell. You also have to provide a variable to stick the entire matched substring in, which is kind of lame but workable.

[matched, subject, direction] = "Hello World, what's up?".match /Hello (\w+), what's (\w+)/

Is there some other way your suggestion could be written or otherwise reworked? I can see something like this being useful, but I doubt it will make it in as-is.

@gampleman

I just went with the named captures syntax mainly because it's already standard in some other languages (e.g. Ruby 1.9 does pretty much exactly this, though it uses the ~= operator). Doesn't all destructuring work only with literals? I don't think that's a problem.

As to non-obvious syntax: normal regexps seem hard to extend with readable/intuitive syntax. Maybe heregexps could work:

# weird idea
///
https?:\/\/
?domain = [\w\-\_\.]+
/
?path = [\w\_\-\/]+
\.
?ext = \w{2,}
/// = "http://example.com/path/to/file.txt"

But honestly I think that captures really capture the idea.

@michaelficarra
Collaborator

I like the named capturing groups syntax.

@kitcambridge

What if we implement Ruby's =~ operator in CoffeeScript? It's less ambiguous, and avoids special-casing = for RegExp literals.

@erisdev

@gampleman Ok, you're right that all destructuring works only with literals. So, disregard that part of my argument—it doesn't really make sense. On the other hand, while I want to be in favor of this suggestion, I still think that supporting named captures in this case but nowhere else is kind of weird.

@kitgoncharov I doubt the devs would go for a new ~= operator. I can't find it, but I feel like I've seen similar ideas for new operators shot down in the past. Plus, using ~= in this case (for destructuring assignment) would be even more special.

@michaelficarra
Collaborator

@kitgoncharov: There have already been (two, I believe) other suggestions for the =~ operator from ruby, but I tried really hard to find them and I can't. If I remember correctly, @jashkenas was the one who shot down the idea both times.

@erisdiscord: Yes, we would have to pre-process all regexp literals, removing (and remembering the position of) the named capturing groups. The big gotcha, though, is the broken compatibility with new RegExp string. But you already mentioned that.
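
The pre-processing step described above (removing the named groups and remembering their positions) could be sketched roughly as follows. This is only an illustration, not anything from the CoffeeScript compiler: `stripNamedGroups` is a hypothetical helper, and it assumes the regex source is available as a plain string, which is exactly why it cannot help with `new RegExp string` at runtime.

```javascript
// Hypothetical helper, not part of the CoffeeScript compiler.
// Takes regex source containing (?<name>...) groups and returns the same
// source with plain capturing groups, plus the name of each group by position.
function stripNamedGroups(source) {
  const names = [];     // names[i] is the name of capture group i + 1, or null
  let out = '';
  let inClass = false;  // inside [...], where "(" is a literal character
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') {
      // Copy an escaped pair verbatim so "\(" is never mistaken for a group.
      out += ch + (source[i + 1] || '');
      i++;
      continue;
    }
    if (inClass) {
      if (ch === ']') inClass = false;
      out += ch;
      continue;
    }
    if (ch === '[') {
      inClass = true;
      out += ch;
      continue;
    }
    if (ch === '(') {
      const rest = source.slice(i);
      const named = /^\(\?<([A-Za-z_$][\w$]*)>/.exec(rest);
      if (named) {
        names.push(named[1]);
        out += '(';
        i += named[0].length - 1; // skip past "?<name>"
        continue;
      }
      // "(?:", "(?=", "(?!" and friends do not capture; a bare "(" does.
      if (!rest.startsWith('(?')) names.push(null);
    }
    out += ch;
  }
  return { source: out, names };
}

const stripped = stripNamedGroups("Hello (?<subject>\\w+), what's (?<direction>\\w+)");
// stripped.source is "Hello (\\w+), what's (\\w+)"
// stripped.names is ['subject', 'direction']
```

A compiler could then emit `subject = _matches[1]` and `direction = _matches[2]` by walking `names`, which matches the shape of the compiled output proposed earlier in the thread.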

@satyr
Collaborator

> Maybe heregexps could work

How would you support interpolations like:

///(?<#{name}>\w+)/// = target

> ~=

You mean =~?

irb(main):001:0> name = 'foo'
=> "foo"
irb(main):002:0> ' bar ' =~ /(?<#{name}>\w+)/
=> 1
irb(main):003:0> $~[name]
=> "bar"
@gampleman

@satyr Yeah I always confuse the order. Regarding support for interpolations: I wouldn't. I personally wouldn't support heregexps in this case (that comment was a response to erisdiscord's comment about unintuitive syntax). I think that this is best suited for relatively simple patterns not requiring the heavy calibre of heregexps.

@michaelficarra I think that anywhere other than in destructuring it's not possible, given how CoffeeScript works.

@debrouwere

Slightly in favor, although I think erisdiscord's suggestion to just use

[matched, subject, direction] = some_text.match /Hello (\w+), what's (\w+)/

makes sense too, and I don't find that particularly ugly.

@ELLIOTTCABLE

What about a native, compiled RegExp library for Node that understands named captures, and then having an optional CoffeeScript compile-time flag that brings in that library, and utilizes the named-capture functionality to destructure into variables?

That way, destructured regex literals operate the same as regex everywhere else (because all of them would be compiled into native-library calls, instead of into v8 RegExp literals.) Again, I'm suggesting this be an optional flag, because that could break some obscure existing code that depends on vagaries of the v8 RegExp engine.

This would be nice for entirely separate reasons as well, such as support for zero-width negative lookbehinds, and other regex power-user features.

If there's interest, and the CoffeeScripters like the idea, I might be willing to write PCRE bindings for Node.js that mirror the extant RegExp API, allowing CoffeeScript to compile regex literals into something like new PCRE("the regex", 'flags'). That would be a clean and sane approach, it seems to me.

Disclaimer: I'm not actually a CoffeeScript user, personally; I have fundamental issues with the idea; destructuring assignment of arrays is one of the only things that has ever remotely attracted me to it. This single feature alone would be enough to ‘bring me over to the dark side,’ if we could implement it sanely.

@jashkenas
Owner

I'm afraid that this sort of thing is out of the domain of CoffeeScript. Regexes are values, and they need to work the same way in JavaScript as they do in CoffeeScript.

To put it another way, you need to be able to pass a CoffeeScript regex into JavaScript code, and vice-versa, and have things work properly. If you'd like to destructure regex results, pattern matching is a great way to go, as @erisdiscord suggests (and no, destructuring doesn't only work with literals, it works with any expression):

string = "Hello World, what's up?"
regex = /Hello (\w+), what's (\w+)/
[match, subject, direction] = string.match regex
@jashkenas jashkenas closed this Dec 20, 2011
@erisdev

@elliottcable Mind, although this isn't really the domain of CoffeeScript, there's nothing stopping you from writing those PCRE bindings for Node and using them along with destructuring. It might not be quite as pretty, but consider a slightly modified version of Jeremy's example using your hypothetical PCRE class with named captures:

string = "Hello World, what's up?"
regex = PCRE.compile "Hello (?<who>\\w+), what's (?<where>\\w+)"
{who, where} = string.match regex

Still pretty good, yeah?
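
For comparison, the same shape can be sketched in plain JavaScript on engines that implement ES2018 named capture groups, a feature that did not exist when this thread was written. Here the match result's `groups` property stands in for the hypothetical PCRE class; the variable names follow the example above, and the null fallback is an added assumption so that a failed match does not throw.

```javascript
// Assumes an ES2018+ engine; older engines reject the (?<name>...) syntax.
const string = "Hello World, what's up?";
const regex = /Hello (?<who>\w+), what's (?<where>\w+)/;

const found = string.match(regex);
// Fall back to an empty object so a failed match yields undefined
// bindings instead of a TypeError.
const { who, where } = found ? found.groups : {};

console.log(who);   // "World"
console.log(where); // "up"
```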

@ELLIOTTCABLE

@erisdiscord that sort of defeats the point. ;D

Anyway, it was worth a shot. Enjoy your CoffeeScript'in!

@erisdev

@elliottcable I'm not so sure it does! I mean, the main point was to use object destructuring with named captures, right?

@ELLIOTTCABLE

Yes, in a beautiful way. It's not like I can't already use match(). The point wasn't the restructuring of regexes, but instead the use of regexes as a beautiful way to destructure data quickly.

response = Twitter.get(path)
/(?<username>\w+)\/status\/(?<ID>\d+)/ = path
database[ID] = username
# … etc, etc, etc

Very similar to production code I've written in another language (though a little more ugly, because we're resorting to the regex-named-capture syntax, whereas I'm more used to something of the form “{username}/status/{id}” ← request path). Useful stuff, but not very appropriate for CoffeeScript, I suppose.
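
The “{username}/status/{id}” form mentioned above can be approximated today with an ordinary function rather than new syntax. The following is a minimal sketch under stated assumptions: `matchTemplate` is a hypothetical name, each placeholder is assumed to span exactly one path segment, and the whole path must match (unlike the unanchored regex in the example above).

```javascript
// Hypothetical helper, named here for illustration only.
function matchTemplate(template, path) {
  const names = [];
  // split() with a capturing group alternates literal text (even indexes)
  // with placeholder names (odd indexes).
  const source = template
    .split(/\{(\w+)\}/)
    .map((part, i) => {
      if (i % 2 === 1) {
        names.push(part);
        return '([^/]+)'; // assumption: a placeholder covers one path segment
      }
      // Escape regex metacharacters in the literal parts of the template.
      return part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    })
    .join('');
  const found = new RegExp('^' + source + '$').exec(path);
  if (!found) return null;
  const result = {};
  names.forEach((name, i) => { result[name] = found[i + 1]; });
  return result;
}

const hit = matchTemplate('{username}/status/{id}', 'jashkenas/status/12345');
// hit is { username: 'jashkenas', id: '12345' }
const miss = matchTemplate('{username}/status/{id}', 'jashkenas/followers');
// miss is null
```

Returning an object keeps this compatible with ordinary object destructuring, at the cost of the extra null check that the proposal was trying to avoid.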

@erisdev

@elliottcable Ok, yeah, I see what you mean there. If JavaScript regexps supported named captures I would probably be at least +0.5 on that proposal, because I can definitely see it being useful.

@ELLIOTTCABLE

@jashkenas much later, after-the-fact, I'd like to point out (having come across this thread again, much later) … the LHS is always a literal.

For example, the new RegExp string approach mentioned, doesn't apply; consider this:

new Array(foo, bar, baz) = func()

Rather senseless, isn't it? You have to destructure into what looks like a literal, but actually isn't, correct? i.e. [foo, bar, baz] = … or similar. That, I would posit, is no different than requiring you to assign to a regexp literal, as opposed to a constructed regexp.

As to the other argument, regarding shim'ing the syntax for named captures … well, again, in a way, that's exactly what you're doing with array or object syntax, for destructuring assignment. In an Object literal, we're looking at a format of {key: <value>, key: <value>} … CoffeeScript basically copies that syntax, for the completely different semantics of {variable: furtherDestructuring}. So, to state this proposal a different way … it seems reasonable to suggest that CoffeeScript also copy (but extend!) the RegExp literal syntax for a string-matching destructure.

Make sense?

Edit: Further note; the “just destructure the return-value of match()” suggestion … has the rather major failing of causing a TypeError anytime the match fails, i.e., doesn't match at all:

coffee> [_, a, b, c] = "foo bar".match /([aA])([bB])([cC])/
TypeError: Cannot read property '0' of null
@jashkenas
Owner

Yep -- I get the literal bit. If JS regex literals had named captures, this ticket would make a whole lotta sense. But sadly it doesn't, so it doesn't.

You can, of course, add a guard if the match may fail:

if regex.test string
  [a, b, c, d, e] = string.match regex
@epidemian

> You can, of course, add a guard if the match may fail

Or fall back to an empty array:

[a, b, c, d, e] = (string.match regex) or []
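
The failure mode and the two guards suggested above can be put side by side in JavaScript. This is a sketch with hypothetical function names (`unguarded`, `withTest`, `withFallback`); the pattern is the one from the failing REPL example earlier in the thread.

```javascript
const pattern = /([aA])([bB])([cC])/;

// Unguarded: match() returns null on failure, and destructuring null throws.
function unguarded(text) {
  const [, a, b, c] = text.match(pattern);
  return { a, b, c };
}

// Guard 1: test first, destructure only on success.
function withTest(text) {
  if (!pattern.test(text)) return null;
  const [, a, b, c] = text.match(pattern);
  return { a, b, c };
}

// Guard 2: fall back to an empty array so every binding is undefined.
function withFallback(text) {
  const [, a, b, c] = text.match(pattern) || [];
  return { a, b, c };
}
```

The first guard runs the regex twice; the second runs it once but leaves the caller to check for undefined bindings.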
@ELLIOTTCABLE

Both of which, of course, are enough extra syntax / generated-code to make sense out of this ticket …

But. Nonetheless. The answer exists; I'd just hoped a little more clarification might sway you, a year later. Ah, well. (=
