[doc] Draft of regex API doc
Based on a part of the eggex doc.  Still needs attention.
Andy C committed Dec 17, 2023
1 parent 3473f7c commit 9a20d71
Showing 7 changed files with 116 additions and 50 deletions.
1 change: 1 addition & 0 deletions build/doc.sh
@@ -83,6 +83,7 @@ readonly MARKDOWN_DOCS=(
warts

eggex
ysh-regex-api
upgrade-breakage
ysh-tour

2 changes: 1 addition & 1 deletion builtin/method_str.py
@@ -107,4 +107,4 @@ def Call(self, rd):
if indices is None:
return value.Null

return value.Match(string, indices)
return value.Match(string, indices, eggex_val.name_types)
2 changes: 1 addition & 1 deletion core/value.asdl
@@ -86,7 +86,7 @@ module value
# TODO: might need a pointer to Eggex for the names
# Problem: ERE has no non-capturing groups, so grouping will occupy a
# number
| Match(str s, List[int] indices)
| Match(str s, List[int] indices, List[NameType?]? name_types)

# ^[42 + a[i]]
| Expr(expr e)
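
The comment in this hunk notes that ERE has no non-capturing groups, so a group used only for grouping still occupies a group number. A minimal Python sketch of why a per-group list with optional entries helps is shown here. The pattern, the `NAMES` list, and `named_groups()` are hypothetical illustrations, not the Oils implementation.

```python
import re

# Hypothetical ERE-style translation of a pattern with two named captures
# and one plain group in between. In ERE the plain group still gets number 2.
ERE = r'([0-9]+)(-|/)([0-9]+)'

# Assumption: one entry per '(' in order of appearance; None marks a group
# that exists only for grouping and has no name.
NAMES = ['month', None, 'day']


def named_groups(s):
    """Return a dict of named captures, skipping unnamed groups, or None."""
    m = re.search(ERE, s)
    if m is None:
        return None
    return {
        name: m.group(i)
        for i, name in enumerate(NAMES, start=1)
        if name is not None
    }


result = named_groups('on 04/01')
print(result)
```

Keeping the list aligned with group numbers means a lookup by name never has to re-parse the pattern.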
57 changes: 10 additions & 47 deletions doc/eggex.md
@@ -323,62 +323,25 @@ You can spread regexes over multiple lines and add comments:

### The YSH API

Testing and extracting matches:
See [YSH regex API](ysh-regex-api.html) for details.

var s = 'days 04-01 and 10-31'
In summary, YSH has Perl-like conveniences with an `~` operator:

var s = 'on 04-01, 10-31'
var pat = /<capture d+ as month> '-' <capture d+ as day>/

if (s ~ pat) {
echo $[_group(1)]
if (s ~ pat) { # search for the pattern
echo $[_group(1)] # => 04
}

More explicit API with search():
It also has an explicit and powerful Python-like API with the `search()` and
`leftMatch()` methods on strings.

var m = 's' => search(pat)
var m = s => search(pat, pos=8) # start searching at a position
if (m) {
echo $[m => group(1)]
}

Iterative matching with leftMatch():

var s = 'hi 123'
var lexer = / <capture [a-z]+> | <capture d+> | <capture s+> /
var pos = 0
while (true) {
var m = s => leftMatch(lexer, pos=pos)
if (not m) {
break
}
if (m => group(1) !== null) {
echo 'letter'
} elif (m => group(2) !== null) {
echo 'digit'
} elif (m => group(3) !== null) {
echo 'space'
}

setvar pos = m => end(0)
echo $[m => group(1)] # => 10
}

(Still to be implemented.)

Substitution:

var new = s => replace(/<capture d+ as month>/, ^"month is $month")
# (could be stdlib function)

Slurping all like Python:

var matches = findAll(s, / (d+) '.' (d+) /)
# (could be stdlib function)

# s => findAll(pat) => reversed()

Splitting:

var parts = s => split(/space+/) # contrast with shSplit()
# (could be stdlib function)

### Language Reference

- See bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
101 changes: 101 additions & 0 deletions doc/ysh-regex-api.md
@@ -0,0 +1,101 @@
---
default_highlighter: oils-sh
in_progress: true
---

YSH Regex API - A Mix of Python and Perl/Awk
============================================

TODO:

- Make these work:
- [`_group()`]($help-topic:_group)
- [`Match => group()`]($help-topic:group)
- [`Str => search()`]($help-topic:search)
- [`Str => leftMatch()`]($help-topic:leftMatch)

Mechanisms

- Awk and Perl-like with ~ operator
- `_group() _start() _end()` - like Python
- Python-like except on `Str`
- `Str => search()`
- `Str => leftMatch()` for lexers
- note that it's consistent with `find()`
- TODO: should accept raw ERE string too
- TODO: replace() and unevaluated string literal
- is `replace()` polymorphic with strings?
- or maybe `sub()`

- Others can be implemented with search() and leftMatch()
- `split()` by regex
- although is our `Str => split()` also polymorphic?
- `findAll()` or `allMatches()` - in Python this has a weird signature

Related: [Egg Expressions](eggex.html)

## Basic Tests with ~

var s = 'days 04-01 and 10-31'
var pat = /<capture d+ as month> '-' <capture d+ as day>/

if (s ~ pat) {
echo $[_group(1)]
}

## More explicit API

### search()

var m = s => search(pat)
if (m) {
echo $[m => group(1)]
}
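
For readers coming from Python, a rough analogue of `search()` with a start position is sketched below. It uses Python's `re` module rather than YSH. The sample string and `pos=8` are taken from the eggex doc summary in this commit, and `search_from()` is a hypothetical helper name.

```python
import re

# Python counterpart of /<capture d+ as month> '-' <capture d+ as day>/
PAT = re.compile(r'(?P<month>\d+)-(?P<day>\d+)')


def search_from(s, pos=0):
    """Return (month, day) for the first match at or after pos, or None."""
    m = PAT.search(s, pos)
    if m is None:
        return None
    return m.group(1), m.group(2)


s = 'on 04-01, 10-31'
first = search_from(s)          # search from the start of the string
second = search_from(s, pos=8)  # start searching at a position
print(first, second)
```

As in the YSH example, the result has to be tested before calling `group()`, because a failed search returns a null value instead of a match object.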

### Iterative matching with leftMatch()

var s = 'hi 123'
var lexer = / <capture [a-z]+> | <capture d+> | <capture s+> /
var pos = 0
while (true) {
var m = s => leftMatch(lexer, pos=pos)
if (not m) {
break
}
if (m => group(1) !== null) {
echo 'letter'
} elif (m => group(2) !== null) {
echo 'digit'
} elif (m => group(3) !== null) {
echo 'space'
}

setvar pos = m => end(0)
}
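
The same lexer loop can be sketched in Python, where `Pattern.match(s, pos)` plays the role assumed here for `leftMatch()`: it only succeeds if the pattern matches starting exactly at `pos`. This is an analogy in Python's `re` module, not YSH code, and `tokenize()` and `KINDS` are hypothetical names.

```python
import re

# One capture group per token kind, in the same order as the YSH lexer above.
LEXER = re.compile(r'([a-z]+)|(\d+)|(\s+)')
KINDS = ('letter', 'digit', 'space')


def tokenize(s):
    """Split s into (kind, text) pairs by matching repeatedly at pos."""
    tokens = []
    pos = 0
    while True:
        m = LEXER.match(s, pos)  # anchored at pos, unlike search()
        if m is None:
            break                # end of input, or a character no rule accepts
        # Exactly one alternative matched; lastindex is its group number.
        kind = KINDS[m.lastindex - 1]
        tokens.append((kind, m.group(0)))
        pos = m.end(0)           # continue right after this token
    return tokens


tokens = tokenize('hi 123')
print(tokens)
```

Anchoring at `pos` is what makes this a lexer: an unanchored search would silently skip over characters that no rule accepts.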

## Named Captures and Types - like `scanf()`

var date_pattern = / <capture d+ as month> '-' <capture d+ as day> /
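
One way to picture the `scanf()`-like idea is a list of names and converters kept alongside the pattern, applied to each captured string after a match. The Python sketch below is only an illustration under that assumption. `NAME_TYPES` and `typed_groups()` are hypothetical, and the typed capture syntax is still a TODO in this commit.

```python
import re

PAT = re.compile(r'(\d+)-(\d+)')

# Assumption: one (name, converter) pair per capture group, in group order.
NAME_TYPES = [('month', int), ('day', int)]


def typed_groups(s):
    """Return a dict of converted captures for the first match, or None."""
    m = PAT.search(s)
    if m is None:
        return None
    result = {}
    for i, (name, convert) in enumerate(NAME_TYPES, start=1):
        result[name] = convert(m.group(i))
    return result


fields = typed_groups('days 04-01 and 10-31')
print(fields)
```

The converted values are integers, so `'04'` becomes `4`, much as `scanf("%d-%d", ...)` would yield numbers instead of strings.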

## Summary

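YSH mixes Perl-like conveniences (the `~` operator with `_group()`) and a
Python-like explicit API (`search()` and `leftMatch()` on `Str`).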

## Appendix: Still to be implemented

### Substitution

var new = s => replace(/<capture d+ as month>/, ^"month is $month")
# (could be stdlib function)

### Slurping all matches, like Python

var matches = findAll(s, / (d+) '.' (d+) /)
# (could be stdlib function)

# s => findAll(pat) => reversed()
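
For comparison, the Python behavior that the notes above call a weird signature can be seen in this short sketch. It uses only Python's `re` module, and the sample string is made up for illustration.

```python
import re

s = 'versions 1.2 and 10.31'

# With two groups, findall() returns a list of tuples of the groups.
pairs = re.findall(r'(\d+)\.(\d+)', s)

# With no groups, the same call returns a list of whole-match strings.
whole = re.findall(r'\d+\.\d+', s)

# finditer() yields match objects, so the result shape does not depend
# on how many groups the pattern has.
via_iter = [m.group(0) for m in re.finditer(r'(\d+)\.(\d+)', s)]

# Rough counterpart of s => findAll(pat) => reversed()
newest_first = list(reversed(whole))

print(pairs, whole, via_iter, newest_first)
```

The return type changing with the number of groups is the main design question for a YSH `findAll()` or `allMatches()`.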

### Splitting

var parts = s => split(/space+/) # contrast with shSplit()
# (could be stdlib function)
2 changes: 1 addition & 1 deletion frontend/syntax.asdl
@@ -616,7 +616,7 @@ module syntax
| Alt(List[re] children)

| Group(re child)
# TODO: <capture d+ as month: Int> needs Token? type field
# TODO: NameType isn't quite right because List[Int] doesn't make sense
| Capture(re child, NameType? name_type)
| Backtracking(bool negated, Token name, re child)

1 change: 1 addition & 0 deletions ysh/regex_translate.py
@@ -264,6 +264,7 @@ def _AsPosixEre(node, parts, name_types):
node = cast(re.Capture, UP_node)

# Collect in order of ( appearance
# TODO: get the name string, and type string
name_types.append(node.name_type)

parts.append('(')
