# Day 5: Pattern matching

- [Day 5](https://adventofcode.com/2018/day/5)

Part 1, at least, is a typical pattern matching exercise: repeatedly remove adjacent lowercase/UPPERCASE pairs of the same letter (in either order) until the length of the string no longer changes.

For string pattern matching the obvious tool is the [`re` module](https://docs.python.org/3/library/re.html) (see the [regex HOWTO](https://docs.python.org/3/howto/regex.html) as well), but regex doesn't have dedicated syntax to spell 'uppercase version of a matched letter' or 'lowercase version of a matched letter'. Not in the Python `re` module syntax at any rate, and not in the much more advanced Python [`regex` project](https://pypi.org/project/regex/) either.

But that doesn't stop us from just generating all possible combinations from [`string.ascii_uppercase`](https://docs.python.org/3/library/string.html#string.ascii_uppercase) and [`string.ascii_lowercase`](https://docs.python.org/3/library/string.html#string.ascii_lowercase).

Not that someone didn't try to do this with pure regex: capture a letter, use a negative look-ahead to reject the exact same character (case-sensitively) as the next character, then match the captured letter again inside a local 'case-insensitive' modifier group. That leaves only the opposite case as a possible match:

    r'([a-zA-Z])(?!\1)(?i:\1)'

but in my view a generated regex is just simpler to manage and execute.
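
For reference, the pure-regex variant can be tried out with a few lines. This is only a minimal sketch, assuming Python 3.6 or newer (needed for the scoped `(?i:...)` group); the `PAIR` and `react_pure_regex` names are mine and not part of the solution below.

```python
import re

# capture a letter, refuse an identical next character, then accept the
# captured letter in either case: only the opposite case is left over
PAIR = re.compile(r"([a-zA-Z])(?!\1)(?i:\1)")


def react_pure_regex(s):
    # hypothetical helper: keep substituting until a pass removes nothing
    while True:
        s, count = PAIR.subn("", s)
        if not count:
            return s


print(react_pure_regex("dabAcCaCBAcCcaDA"))
```

For the example polymer from the puzzle text this prints `dabCBAcaDA`, the same result the generated pattern produces.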


In [1]:
import re
import string
from functools import partial

patterns = "|".join(
    [
        f"{lc}{uc}|{uc}{lc}"
        for lc, uc in zip(string.ascii_lowercase, string.ascii_uppercase)
    ]
)
replace = partial(re.compile(patterns).sub, "")


def polymer_reactions(s):
    length = len(s)
    while True:
        s = replace(s)
        if len(s) == length:
            break
        length = len(s)
    return s

In [2]:
tests = {
    "aA": "",
    "abBA": "",
    "abAB": "abAB",
    "aabAAB": "aabAAB",
    "dabAcCaCBAcCcaDA": "dabCBAcaDA",
}

for t, expected in tests.items():
    assert polymer_reactions(t) == expected

## Improvement on the speed

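This can be done better, however, if you see the polymer reactions as a stack process: push one unit at a time onto the stack, or pop the top of the stack if the incoming unit reacts with it. A single pass is enough to handle chain reactions, because each pop exposes the previous unit for the next comparison.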

We can also use ASCII properties here; it's bit 6 (counting from 1 at the least significant bit) that differs between lowercase and uppercase, so we can test if just that bit differs:


In [3]:
from IPython.display import Markdown

output = []
for upper in b"ABCZ":
    for c in (upper, upper ^ 0x20):
        b = format(c, "08b")
        output.append(
            "|".join(
                [
                    f" {chr(c)} ",
                    f" `{format(c, '02X')}` ",
                    f" <tt>{b[:2]}<strong>{b[2]}</strong>{b[3:]}</tt> ",
                    f" `{b[2]}` ",
                    f" `{b[:2]}…{b[3:]}` ",
                ]
            )
        )
    if upper == ord("C"):
        output.append("⋮ | ⋮ | ⋮ | ⋮ | ⋮")

NL = "\n"
Markdown(
    f"""\
char | hex | binary | bit 6 | rest
:--: | :-: | :----: | :---: | :---:
{NL.join(output)}
"""
)

char | hex | binary | bit 6 | rest
:--: | :-: | :----: | :---: | :---:
 A | `41` | <tt>01<strong>0</strong>00001</tt> | `0` | `01…00001` 
 a | `61` | <tt>01<strong>1</strong>00001</tt> | `1` | `01…00001` 
 B | `42` | <tt>01<strong>0</strong>00010</tt> | `0` | `01…00010` 
 b | `62` | <tt>01<strong>1</strong>00010</tt> | `1` | `01…00010` 
 C | `43` | <tt>01<strong>0</strong>00011</tt> | `0` | `01…00011` 
 c | `63` | <tt>01<strong>1</strong>00011</tt> | `1` | `01…00011` 
⋮ | ⋮ | ⋮ | ⋮ | ⋮
 Z | `5A` | <tt>01<strong>0</strong>11010</tt> | `0` | `01…11010` 
 z | `7A` | <tt>01<strong>1</strong>11010</tt> | `1` | `01…11010` 


An 8-bit value with only bit 6 set is `0x20` in hexadecimal, and you can use an `XOR` bit-wise operation to flip that one bit, using the `^` operator. For any given `(first, second)` pair of ASCII letters, as integers in the range $[0, 256)$, if `first ^ 0x20 == second` then we have a lowercase / uppercase or uppercase / lowercase pair.
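
As a quick illustration of that test, here is a small sketch. The `is_reacting_pair` helper is a hypothetical name used only for this example; the stack code below inlines the same comparison.

```python
def is_reacting_pair(first, second):
    # hypothetical helper: True when the two byte values differ in bit 0x20 only
    return first ^ 0x20 == second


# iterating over a bytes object yields integers
lower_a, upper_a, lower_b = b"aAb"

print(is_reacting_pair(lower_a, upper_a))  # a / A: same letter, opposite case
print(is_reacting_pair(upper_a, lower_a))  # A / a: the order does not matter
print(is_reacting_pair(lower_a, lower_a))  # a / a: same case, no reaction
print(is_reacting_pair(lower_a, lower_b))  # a / b: different letters
```

This prints `True`, `True`, `False`, `False`. The test is only safe because the puzzle input consists purely of letters; other ASCII characters, such as the `@` sign and the backtick, also differ in just this bit.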

For the stack, I use a [`collections.deque` object](https://docs.python.org/3/library/collections.html#collections.deque), which is [faster than the builtin `list` type at adding and removing individual entries when used as a stack](https://stackoverflow.com/a/23487658/100297).
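
If you want to check the speed difference on your own machine, a rough sketch along the following lines times both containers. Everything here is an assumption made for illustration: `react_with` is a hypothetical name, and the random 10k-letter `sample` is only a stand-in for the real puzzle input, so the numbers will differ from the timings shown further down.

```python
import random
import string
import timeit
from collections import deque


def react_with(container, units):
    # the same stack algorithm, parametrised on the stack type (list or deque)
    stack = container()
    for unit in units:
        if stack and stack[-1] ^ 0x20 == unit:
            stack.pop()
        else:
            stack.append(unit)
    return stack


# synthetic stand-in for the puzzle input: 10k random ASCII letters
rng = random.Random(5)
sample = "".join(rng.choices(string.ascii_letters, k=10_000)).encode("ascii")

for container in (list, deque):
    seconds = timeit.timeit(lambda: react_with(container, sample), number=20)
    print(f"{container.__name__:>5}: {seconds / 20 * 1000:.2f} ms per run")
```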


In [4]:
from collections import deque


def polymer_reactions(s):
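    # the stack holds the unreacted units seen so far, as ASCII byte values;
    # starting with an empty stack also handles an empty polymer
    stack = deque()
    for unit in s.encode("ascii"):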
        if stack and stack[-1] ^ 0x20 == unit:
            stack.pop()
        else:
            stack.append(unit)
    return stack


for t, expected in tests.items():
    assert bytes(polymer_reactions(t)).decode() == expected

In [5]:
import aocd

data = aocd.get_data(day=5, year=2018)
print("Base length:", len(data))

Base length: 50000


In [6]:
print("Part 1:", len(polymer_reactions(data)))

Part 1: 10132


## Part 2

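This is just a loop over the $[0x41, 0x5A]$ (`A`-`Z`) range, telling the stack algorithm to ignore one unit type at a time, regardless of case. Rather than testing `(unit | 0x20) == ignored` for every byte, I delete both the uppercase byte and its lowercase counterpart (`ignore | 0x20`) up front with `bytes.translate()`.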

We can do this all in a generator expression passed to `min()` to produce the answer, if we chain everything together.

My stack version takes only a few milliseconds, so running it another 26 times should be trivial (albeit with a filter). We can start off the process with the result of part 1; it doesn't matter to the full set of reactions if we remove the ignored character before or after, but repeating the process 26 times with a 10k-ish polymer is faster than starting it with a 50k polymer each time.
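
The claim that the removal order does not matter can be spot-checked on the example polymer from the puzzle text. This is a sketch rather than a proof; `react` and `strip_unit` are hypothetical helper names used only here.

```python
from collections import deque


def react(units):
    # stack-based reaction over ASCII byte values
    stack = deque()
    for unit in units:
        if stack and stack[-1] ^ 0x20 == unit:
            stack.pop()
        else:
            stack.append(unit)
    return bytes(stack)


def strip_unit(units, upper):
    # remove both cases of one unit type; `upper` is the uppercase byte value
    return units.translate(None, delete=bytes([upper, upper | 0x20]))


raw = b"dabAcCaCBAcCcaDA"
reduced = react(raw)

lengths = {}
for upper in b"ABCD":
    from_raw = len(react(strip_unit(raw, upper)))
    from_reduced = len(react(strip_unit(reduced, upper)))
    # stripping before or after the initial reactions gives the same length
    assert from_raw == from_reduced
    lengths[chr(upper)] = from_reduced

print(lengths)
```

This prints `{'A': 6, 'B': 8, 'C': 4, 'D': 6}`, matching the lengths given in the puzzle description, with 4 as the minimum.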


In [7]:
%timeit polymer_reactions(data)

2.29 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
def shortest_fixed_polymer(s):
    # clean out initial reactions
    minimised = bytes(polymer_reactions(s))

    ident = bytes.maketrans(b"", b"")

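    def react(s, ignore):
        # drop both the uppercase and the lowercase form of the ignored unit;
        # an empty stack to start with also copes with an empty remainder
        stack = deque()
        for unit in s.translate(ident, delete=bytes([ignore, ignore | 0x20])):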
            if stack and stack[-1] ^ 0x20 == unit:
                stack.pop()
            else:
                stack.append(unit)
        return len(stack)

    # loop over A - Z (0x41, 0x5A) (so 0x5B range end)
    return min(react(minimised, i) for i in range(0x41, 0x5B))

In [9]:
assert shortest_fixed_polymer("dabAcCaCBAcCcaDA") == 4

In [10]:
print("Part 2:", shortest_fixed_polymer(data))

Part 2: 4572


In [11]:
%timeit shortest_fixed_polymer(data)

13.5 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
