Named captures #14

Open
4 of 8 tasks
jcgoble3 opened this issue Feb 13, 2016 · 2 comments

Comments

@jcgoble3
Owner

EDIT: Here's a proper task list:

  • add member to MatchState
  • modify start_capture() to parse the name
  • modify match() to parse and handle named and named-style numbered backreferences
  • modify build_result_table to check for and handle named captures
  • modify add_s() to handle named and named-style numbered backreferences
  • modify matchobj_expand() to handle named and named-style numbered backreferences
  • add documentation
  • for the love of all things programming, ADD TESTS!!!

Original:

Named captures would be awesome. Here's my idea: add a third member to the captures struct in MatchState, which would be a fixed-size array of char. start_capture would then read the upcoming characters and, if a named capture is detected, obtain the name and copy it into the array, followed by '\0'. (We can't just store a pointer to a local array in the MatchState, since that array would expire when start_capture returns, or worse, when the inner if block ends. I'd rather not get into malloc stuff here.) Then it would proceed with the match as normal. In order to allow position captures to be named, the check for a position capture would be moved out of match and into start_capture.
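A minimal sketch of that first idea, assuming a simplified stand-in for the capture struct rather than the real MatchState. The helper name read_capture_name, the Capture typedef, and the MAX_NAME_LEN constant are hypothetical and only illustrate how the name could be copied into a fixed array without malloc:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 15  /* hypothetical cap; the proposal only says "maybe 15" */

/* Simplified stand-in for one entry of the capture array in MatchState. */
typedef struct Capture {
  const char *init;             /* start of the captured text */
  ptrdiff_t len;                /* capture length (or a special marker) */
  char name[MAX_NAME_LEN + 1];  /* proposed third member; "" when unnamed */
} Capture;

/*
 * Hypothetical helper. p points just past the opening '(' of a capture.
 * If the pattern continues with "?<name>", the name is copied into
 * cap->name (NUL-terminated) and a pointer just past '>' is returned.
 * If there is no "?<" prefix, cap->name is cleared and p is returned as-is.
 * Returns NULL when the name is empty, too long, or missing its '>'.
 */
static const char *read_capture_name(const char *p, const char *p_end,
                                     Capture *cap) {
  const char *start, *q;
  size_t n;
  assert(p <= p_end);
  cap->name[0] = '\0';
  if (p_end - p < 2 || p[0] != '?' || p[1] != '<')
    return p;                      /* ordinary unnamed capture */
  start = p + 2;
  for (q = start; q < p_end && *q != '>'; q++)
    ;                              /* scan to the closing '>' */
  if (q == p_end)
    return NULL;                   /* unterminated name */
  n = (size_t)(q - start);
  if (n == 0 || n > MAX_NAME_LEN)
    return NULL;                   /* empty or over the length cap */
  memcpy(cap->name, start, n);     /* copy, never keep a pointer to a local */
  cap->name[n] = '\0';
  return q + 1;
}
```

Because the name lives inside the struct itself, it stays valid for as long as the MatchState does, at the cost of MAX_NAME_LEN + 1 bytes per capture slot.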

Alternatively, a separate array can be added to MatchState, mapping group names to group numbers. This would simplify backreferences, but complicate result table building, and would also require either a NULL sentinel or an extra member tracking the number of named groups. I think this is better, though.
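As a rough sketch of this alternative, assuming a count member rather than a NULL sentinel: the NameMap and NamedGroup types, the function names, and both size constants below are hypothetical (32 matches the default LUA_MAXCAPTURES in stock Lua, but this project may use a different limit).

```c
#include <assert.h>
#include <string.h>

#define MAX_CAPTURES 32  /* assumption: same as stock Lua's LUA_MAXCAPTURES */
#define MAX_NAME_LEN 15  /* hypothetical cap on name length */

typedef struct NamedGroup {
  char name[MAX_NAME_LEN + 1];
  int number;            /* 1-based capture number this name refers to */
} NamedGroup;

/* Would live alongside the capture array in MatchState. */
typedef struct NameMap {
  NamedGroup entry[MAX_CAPTURES];
  int count;             /* number of named groups; replaces a sentinel */
} NameMap;

/* Linear search, O(n) in the number of named groups.
   Returns the group number, or 0 when the name is unknown. */
static int namemap_find(const NameMap *m, const char *name) {
  int i;
  for (i = 0; i < m->count; i++)
    if (strcmp(m->entry[i].name, name) == 0)
      return m->entry[i].number;
  return 0;
}

/* Returns 1 on success; 0 for a duplicate, an empty or over-long name,
   or a full map. A real implementation would raise a Lua error instead. */
static int namemap_add(NameMap *m, const char *name, int number) {
  size_t n = strlen(name);
  assert(number >= 1);
  if (n == 0 || n > MAX_NAME_LEN) return 0;
  if (m->count >= MAX_CAPTURES) return 0;
  if (namemap_find(m, name) != 0) return 0;   /* duplicate name */
  memcpy(m->entry[m->count].name, name, n + 1);
  m->entry[m->count].number = number;
  m->count++;
  return 1;
}
```

With this layout a backreference only needs namemap_find to turn a name into a number, while result table building has to walk the map to attach names to captures.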

Table matching functions would include named captures in the table (obviously), but it's a bit tricky, because Lua (unlike Python) does not distinguish between indexing and attribute access, and some special fields are already defined. Thus, a new subtable, groups, will be added, which will contain all captures, both numbered and named, instead of the main table. Next, to keep the expand method from being shadowed by a capture named "expand", it will be placed directly in the result table. Finally, the table's __index metamethod will point to the groups subtable, thereby making all groups accessible by direct indexing, except for those that share the name of a special field. (Note that this will require a separate metatable for each result table, instead of a single metatable stored in the registry that all result tables share. expand will also have to be modified to pull from the groups subtable.)

startpos and endpos would also carry the named fields, but fortunately they do not suffer from this problem. The documentation will officially recommend accessing named captures through groups rather than directly, especially when the name is unknown (e.g. user input). Using the groups subtable will also make pairs iteration over captures easier.

The syntax would be the same as in most regex flavors: (?<name>...). Omitting the subpattern, as in (?<name>), would turn it into a position capture. Backreferences within the pattern would use the PCRE syntax (but with Lua's escape character) %k<name>, while group references in a gsub replacement pattern would use simply %<name>. As a bonus, these backreference syntaxes can be overloaded to allow referencing numbered captures 10 or higher.
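To illustrate the overloading, here is a hedged sketch of parsing the <...> part of %k<name> or %<name>, treating an all-digit body as a numbered reference such as %<12>. The GroupRef type, parse_group_ref, and the three-digit limit on numbers are assumptions made for this example, not part of the proposal:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAME_LEN 15  /* hypothetical cap on name length */

typedef enum RefKind { REF_INVALID, REF_NUMBER, REF_NAME } RefKind;

typedef struct GroupRef {
  RefKind kind;
  int number;                   /* meaningful when kind == REF_NUMBER */
  char name[MAX_NAME_LEN + 1];  /* meaningful when kind == REF_NAME */
} GroupRef;

/*
 * p points at the '<' that follows "%k" (pattern) or "%" (replacement).
 * On success, fills *out and returns a pointer just past '>'.
 * Returns NULL for a missing '>', an empty body, or an over-long body.
 */
static const char *parse_group_ref(const char *p, const char *p_end,
                                   GroupRef *out) {
  const char *start, *q;
  size_t n, i;
  int all_digits = 1;
  assert(p < p_end && *p == '<');
  out->kind = REF_INVALID;
  out->number = 0;
  out->name[0] = '\0';
  start = p + 1;
  for (q = start; q < p_end && *q != '>'; q++)
    if (!isdigit((unsigned char)*q))
      all_digits = 0;
  if (q == p_end) return NULL;                 /* no closing '>' */
  n = (size_t)(q - start);
  if (n == 0 || n > MAX_NAME_LEN) return NULL;
  if (all_digits) {
    if (n > 3) return NULL;                    /* assumed limit, avoids overflow */
    for (i = 0; i < n; i++)
      out->number = out->number * 10 + (start[i] - '0');
    out->kind = REF_NUMBER;                    /* e.g. %<12> means capture 12 */
  } else {
    memcpy(out->name, start, n);
    out->name[n] = '\0';
    out->kind = REF_NAME;                      /* still needs validating and resolving */
  }
  return q + 1;
}
```

The caller would still have to check that a REF_NAME is a legal, existing group name and that a REF_NUMBER is within the number of captures, raising an error otherwise.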

Valid group names would be anything that is a valid variable name (identifier) in Lua, but with a cap on length (maybe 15). Duplicate names would throw an error when encountered. Backreferences to non-existent groups would also throw an error.
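The name rule could be checked with something like the following sketch. The function name and the MAX_NAME_LEN value are hypothetical, and this version deliberately does not reject Lua keywords such as "end", which a full identifier check would also need to consider:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_NAME_LEN 15  /* hypothetical cap; the proposal says "maybe 15" */

/*
 * Returns 1 if s[0..len-1] looks like a Lua identifier (a letter or
 * underscore followed by letters, digits, or underscores) and fits within
 * the length cap; returns 0 otherwise. Relies on the C locale's notion of
 * letters and digits. Keywords are NOT rejected in this sketch.
 */
static int is_valid_group_name(const char *s, size_t len) {
  size_t i;
  assert(s != NULL);
  if (len == 0 || len > MAX_NAME_LEN)
    return 0;
  if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
    return 0;                       /* must not start with a digit */
  for (i = 1; i < len; i++)
    if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
      return 0;
  return 1;
}
```

Taking an explicit length instead of relying on a NUL terminator lets the check run directly on the pattern text between '<' and '>' before anything is copied.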

Backwards-compatibility: I do not consider this proposed syntax to be backward incompatible with PUC-Rio Lua. In PUC-Lua, (? will open a capture and then match ? literally, but this is undefined. The manual defines ? as a special character which always has special meaning; the fact that it matches literally when following another special sequence (e.g. try it after another quantifier) is thus undefined behavior, and I do not consider a change in behavior that was previously undefined to be backwards-incompatible. Likewise, %< in a PUC-Lua replacement string would throw an error for an invalid escape; thus, all replacement strings that previously worked will continue to work. Finally, %k is currently not a defined character class, and thus the fact that it matches itself is undefined. Therefore, named captures can be included in the basic functions, even though their only use will be backreferences (since the basic functions don't have a means to return names).

(I think this explanation is more long-winded than the implementation will be. :P)

@jcgoble3
Owner Author

Work is underway in the feature/named-captures branch.

@jcgoble3
Owner Author

No idea why I thought of this tonight when I haven't touched this project for months, but why can't we just use a Lua table for relating a capture name to its group number? Names as keys and numbers as values. Then, given a name, we can just use the C API to fetch the group number and then fetch the group. This suffers the slowdown of going through Lua rather than pure C, but turns named capture lookup from an O(n) operation into an O(1) operation. (Although I'm guessing the Lua slowdown is bigger than the gain from going to O(1) here.)
