# Day 19 - regular expressions

* https://adventofcode.com/2020/day/19

The problem description amounts to a [regular expression](https://www.regular-expressions.info/); by traversing the graph of rules you can combine the string literals into a single regex that the Python [`re` module](https://docs.python.org/3/library/re.html) can compile. Using the [`Pattern.fullmatch()` method](https://docs.python.org/3/library/re.html#re.Pattern.fullmatch) you can then check each message for validity.
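
To make the idea concrete, here is a minimal sketch that combines the rules of the puzzle's small example by hand, without any parsing. The variable names (`rule_0`, `rule_1`, and so on) are only illustrative and are not part of the solution further down.

```python
import re

# Example rules from the puzzle description:
#   0: 4 1 5
#   1: 2 3 | 3 2
#   2: 4 4 | 5 5
#   3: 4 5 | 5 4
#   4: "a"
#   5: "b"
a, b = "a", "b"                                      # rules 4 and 5
rule_2 = f"(?:{a}{a}|{b}{b})"                        # 4 4 | 5 5
rule_3 = f"(?:{a}{b}|{b}{a})"                        # 4 5 | 5 4
rule_1 = f"(?:{rule_2}{rule_3}|{rule_3}{rule_2})"    # 2 3 | 3 2
rule_0 = re.compile(f"{a}{rule_1}{b}")               # 4 1 5

messages = ["ababbb", "bababa", "abbbab", "aaabbb", "aaaabbb"]
# fullmatch() only succeeds when the whole message matches rule 0.
valid = [msg for msg in messages if rule_0.fullmatch(msg)]
print(valid)
```

Each alternation is wrapped in a non-capturing group, `(?:...)`, so that the `|` only applies within its own rule once the rule is embedded in a larger pattern.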

Having just used the `tokenize` module the [day before](./Day%2018.ipynb), I found it just as helpful for parsing the rules here.
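
As a quick illustration of why the tokenizer fits, this sketch shows how a rule line is split into tokens. The helper `token_summary` is a hypothetical name used only for this demonstration.

```python
from io import StringIO
from tokenize import NUMBER, OP, STRING, generate_tokens, tok_name
from typing import List, Tuple

def token_summary(rule: str) -> List[Tuple[str, str]]:
    """Return (token type name, token text) pairs for a single rule line."""
    tokens = generate_tokens(StringIO(rule).readline)
    # Keep only the tokens that carry rule content; the tokenizer also
    # emits NEWLINE and ENDMARKER tokens, which are not interesting here.
    return [
        (tok_name[t.type], t.string)
        for t in tokens
        if t.type in (NUMBER, STRING, OP)
    ]

print(token_summary("1: 2 3 | 3 2"))
print(token_summary('4: "a"'))
```

Rule references come out as `NUMBER` tokens and literals as `STRING` tokens (quotes included), while `:` and `|` are `OP` tokens, so the `|` can be passed straight through into the regex.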

In [1]:
import re
from collections import deque
from collections.abc import Iterable, Mapping, MutableMapping
from io import StringIO
from itertools import islice
from tokenize import generate_tokens, NUMBER, STRING, TokenInfo
from typing import Callable, Tuple

def parse_rules(lines: Iterable[str], make_regex: Callable[[str], re.Pattern[str]]) -> re.Pattern[str]:
    def rule_to_tokens(rule: str) -> Tuple[str, Iterable[TokenInfo]]:
        tokens = generate_tokens(StringIO(rule).readline)
        # tokens are NUMBER, COLON, (...)+, we skip the COLON.
        return next(tokens).string, list(islice(tokens, 1, None))

    unprocessed = dict(map(rule_to_tokens, lines))
    rules: MutableMapping[str, str] = {}
    dispatch: Mapping[int, Callable[[str], str]] = {NUMBER: rules.__getitem__, STRING: lambda s: s[1:-1]}
    stack = deque(['0'])
    while stack:
        tokens = unprocessed[stack[-1]]
        if missing := {t.string for t in tokens if t.type == NUMBER and t.string not in rules}:
            stack += missing
            continue
        rule = "".join([dispatch.get(t.type, str)(t.string) for t in tokens])
        rules[stack.pop()] = f"(?:{rule})"
    return make_regex(rules["0"])

def validate_messages(data: str, make_regex: Callable[[str], re.Pattern[str]] = re.compile) -> int:
    rule_data, messages = data.split("\n\n")
    rule_regex = parse_rules(rule_data.splitlines(), make_regex)
    return sum(bool(rule_regex.fullmatch(msg)) for msg in messages.splitlines())

assert validate_messages("""\
0: 4 1 5
1: 2 3 | 3 2
2: 4 4 | 5 5
3: 4 5 | 5 4
4: "a"
5: "b"

ababbb
bababa
abbbab
aaabbb
aaaabbb
""") == 2

In [2]:
import aocd
data = aocd.get_data(day=19, year=2020)

In [3]:
print("Part 1:", validate_messages(data))

Part 1: 104


## Part 2 - recursive regex

Part two introduces _recursion_; patterns `8` and `11` add self-references.

For rule 8, that simply means that the contained rule `42` matches 1 or more times (`"42 | 42 8"` will match `"42"`, `"42 42"`, `"42 42 42"`, etc.), so it can be simplified using the [`+` repetition operator](https://www.regular-expressions.info/repeat.html) to `"8: 42 +"`, which my tokenizer-based parser will happily assemble.
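
A small sketch of that simplification, using a made-up stand-in for rule `42` (the real rule 42 depends on the puzzle input):

```python
import re

# Hypothetical stand-in for rule 42; in the puzzle it is built from other rules.
rule_42 = "(?:ab|ba)"
# "8: 42 | 42 8" is one or more repetitions of rule 42, i.e. "42 +".
rule_8 = re.compile(f"(?:{rule_42}+)")

candidates = ["ab", "abba", "baabab", "", "aba"]
results = [bool(rule_8.fullmatch(s)) for s in candidates]
print(dict(zip(candidates, results)))
```

The empty string is rejected because `+` requires at least one repetition, matching the non-recursive branch `42` of the rule.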

But the change for rule 11, `"42 31 | 42 11 31"`, is not so easily simplified. The rule matches `42` repeated one or more times, followed by `31` repeated **exactly the same number of times**. To check for such patterns using regular expressions, you need a regex engine that supports either [balancing groups](https://www.regular-expressions.info/balancing.html) or [recursion](https://www.regular-expressions.info/recurse.html). .NET's regex engine would let you use balancing groups (the pattern, with spaces around the rule IDs, would be `(?'g42' 42 )+ (?'-g42' 31 )+ (?(g42)(?!))`), and Perl, Ruby and any regex engine based on PCRE would let you use recursion.

Lucky for me, the [`regex` package](https://pypi.org/project/regex/) _does_ support recursion. The package may one day be ready to replace the standard-library `re` module, but that day has not yet arrived. In the meantime, if you have advanced regex needs, do keep the existence of that package in mind! As for the recursion syntax: given a named group `(?P<groupname>...)`, the expression `(?&groupname)` re-applies that group's pattern at the point where it appears, and `(?&groupname)?` makes that recursive call optional (0 or 1 times). So, we can replace `"42 31 | 42 11 31"` with `"(?P<rule_11> 42 (?&rule_11)? 31 )"`: each level of recursion wraps one more `42 ... 31` pair around the match, which gives the desired regex validation pattern.
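
For comparison only, here is a sketch of a different technique that stays within the standard-library `re` module: expanding rule 11 into a bounded set of alternatives instead of recursing. This is not what this notebook does. The helper name `bounded_rule_11` and the `max_depth` parameter are assumptions made for illustration, and the result is only correct when no message needs more than `max_depth` nested pairs.

```python
import re

def bounded_rule_11(rule_42: str, rule_31: str, max_depth: int = 5) -> str:
    """Approximate "42 31 | 42 11 31" by listing n = 1..max_depth explicitly."""
    # For each n, require exactly n copies of rule 42 followed by n copies of rule 31.
    alternatives = (
        f"(?:{rule_42}){{{n}}}(?:{rule_31}){{{n}}}"
        for n in range(1, max_depth + 1)
    )
    return "(?:" + "|".join(alternatives) + ")"

# Hypothetical stand-ins: rule 42 is "a", rule 31 is "b".
pattern = re.compile(bounded_rule_11("a", "b", max_depth=3))
candidates = ["ab", "aabb", "aaabbb", "aab", "aaaabbbb"]
results = [bool(pattern.fullmatch(s)) for s in candidates]
print(dict(zip(candidates, results)))
```

The last candidate is balanced but needs four pairs, so it is rejected at `max_depth=3`. That is the trade-off of the bounded expansion: the recursive pattern has no such limit.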

In [4]:
import regex

def validate_corrected_rules(data: str) -> int:
    return validate_messages(
        data
        # 42 | 42 8, repeating 42 one or more times.
        .replace("8: 42\n", "8: 42 +\n")
        # 42 31 | 42 11 31, recursive self-reference
        .replace("11: 42 31\n", "11: (?P<rule_11> 42 (?&rule_11)? 31 )\n"),
        regex.compile
    )

assert validate_corrected_rules("""\
42: 9 14 | 10 1
9: 14 27 | 1 26
10: 23 14 | 28 1
1: "a"
11: 42 31
5: 1 14 | 15 1
19: 14 1 | 14 14
12: 24 14 | 19 1
16: 15 1 | 14 14
31: 14 17 | 1 13
6: 14 14 | 1 14
2: 1 24 | 14 4
0: 8 11
13: 14 3 | 1 12
15: 1 | 14
17: 14 2 | 1 7
23: 25 1 | 22 14
28: 16 1
4: 1 1
20: 14 14 | 1 15
3: 5 14 | 16 1
27: 1 6 | 14 18
14: "b"
21: 14 1 | 1 14
25: 1 1 | 1 14
22: 14 14
8: 42
26: 14 22 | 1 20
18: 15 15
7: 14 5 | 1 21
24: 14 1

abbbbbabbbaaaababbaabbbbabababbbabbbbbbabaaaa
bbabbbbaabaabba
babbbbaabbbbbabbbbbbaabaaabaaa
aaabbbbbbaaaabaababaabababbabaaabbababababaaa
bbbbbbbaaaabbbbaaabbabaaa
bbbababbbbaaaaaaaabbababaaababaabab
ababaaaaaabaaab
ababaaaaabbbaba
baabbaaaabbaaaababbaababb
abbbbabbbbaaaababbbbbbaaaababb
aaaaabbaabaaaaababaa
aaaabbaaaabbaaa
aaaabbaabbaaaaaaabbbabbbaaabbaabaaa
babaaabbbaaabaababbaabababaaab
aabbbbbaabbbaaaaaabbbbbababaaaaabbaaabba
""") == 12

In [5]:
print("Part 2:", validate_corrected_rules(data))

Part 2: 314
