# --- Day 9: Explosives in Cyberspace ---

Wandering around a secure area, you come across a datalink port to a new part of the network. After briefly scanning it for interesting files, you find one file in particular that catches your attention. It's compressed with an experimental format, but fortunately, the documentation for the format is nearby.

The format compresses a sequence of characters. Whitespace is ignored. To indicate that some sequence should be repeated, a marker is added to the file, like (10x2). To decompress this marker, take the subsequent 10 characters and repeat them 2 times. Then, continue reading the file after the repeated data. The marker itself is not included in the decompressed output.

If parentheses or other characters appear within the data referenced by a marker, that's okay - treat it like normal data, not a marker, and then resume looking for markers after the decompressed section.

For example:

- ADVENT contains no markers and decompresses to itself with no changes, resulting in a decompressed length of 6.
- A(1x5)BC repeats only the B a total of 5 times, becoming ABBBBBC for a decompressed length of 7.
- (3x3)XYZ becomes XYZXYZXYZ for a decompressed length of 9.
- A(2x2)BCD(2x2)EFG doubles the BC and EF, becoming ABCBCDEFEFG for a decompressed length of 11.
- (6x1)(1x3)A simply becomes (1x3)A - the (1x3) looks like a marker, but because it's within a data section of another marker, it is not treated any differently from the A that comes after it. It has a decompressed length of 6.
- X(8x2)(3x3)ABCY becomes X(3x3)ABC(3x3)ABCY (for a decompressed length of 18), because the decompressed data from the (8x2) marker (the (3x3)ABC) is skipped and not processed further.

**What is the decompressed length of the file (your puzzle input)? Don't count whitespace.**

---

In [2]:
with open('inputs/9.txt') as f:
    data = f.read().strip()
data[:240]

'(106x9)(9x11)XRTHYQJRI(16x7)PQFHWDDUNODSQZFA(3x14)UTS(46x5)(11x2)ZPIAOZZMWEI(4x15)SDLK(12x10)BUQRPYWOFRHL(3x2)IUD(376x15)(56x2)(2x8)HN(8x6)EMTIYSST(29x14)UMUBTFMGRIFIJMVOFTRZJBYZKRZTR(72x7)(15x14)AIEJQAVGCXESYMW(33x11)BOGWCYAIJENVPIZOHXMHVS'

## trying regex

In [3]:
import re
import regex

pattern = r"\([\w]+\)|[a-zA-Z0-9]+"
regex.findall(pattern, "X(8x2)(3x3)ABCY")

['X', '(8x2)', '(3x3)', 'ABCY']

- `r` marks a raw string, so backslashes reach the regex engine unchanged
- `\(` escapes the `(` so the regex matches a literal opening parenthesis
- `[\w]+` is a character class: `\w` matches any letter, digit, or underscore (so the `x` in `8x2` is covered), and `+` means one or more of them
- `\)` matches the literal closing parenthesis
- `|` is the alternation operator, which lets the pattern match either a marker or a plain run of characters
- `[a-zA-Z0-9]+` matches runs of letters and digits outside the parentheses; `\w+` would also have worked here
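A stricter marker pattern is also possible. The sketch below is not what the notebook uses later: it relies only on the standard-library `re` module, and the names `MARKER` and `parse_marker` are made up for illustration. It uses capture groups to pull the two numbers out directly, and the `pos` argument of `match` to test a specific index without slicing the string.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Two capture groups: the character count and the repeat count.
MARKER = re.compile(r"\((\d+)x(\d+)\)")

def parse_marker(s, pos=0):
    """Return (num_chars, repeat, end_index) if a marker starts at pos, else None."""
    m = MARKER.match(s, pos)  # match only succeeds if the marker begins exactly at pos
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.end()

print(parse_marker("X(8x2)(3x3)ABCY", 1))
print(parse_marker("X(8x2)(3x3)ABCY", 0))
```

Because `m.end()` is an index into the original string, this keeps the position information that `findall` throws away.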

Let's see how it works on the actual input:

In [8]:
regex.findall(pattern, data)[:10]

['(106x9)',
 '(9x11)',
 'XRTHYQJRI',
 '(16x7)',
 'PQFHWDDUNODSQZFA',
 '(3x14)',
 'UTS',
 '(46x5)',
 '(11x2)',
 'ZPIAOZZMWEI']

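The output above looks good, but it doesn't really work for this puzzle: I need the positions of each match in the original string, and `findall` only returns the matched text. Trying another way with `finditer`, which exposes the start and end index of each match: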

In [9]:
p = regex.compile(r"\([\w]+\)|[a-zA-Z0-9]+")
for m in p.finditer(data):
    print(m.start(), m.end(), m.group())
    break

0 7 (106x9)


In [11]:
def get_nums(s):
    """returns the (num_chars, repeat) pair from a marker string like "(8x2)" """
    assert s[0] == "("
    a, b = (int(i) for i in s[1:-1].split("x"))
    return a,b

get_nums("(8x2)")

(8, 2)

After all that regex work, my solution is to walk through the string one index at a time:

In [12]:
def decompress(s):
    """takes in a string s and returns a decompressed version"""
    ans = ""
    i = 0 # to track where we are on the string
    while i < len(s):
        if s[i] != "(":  # compare by value; `is not` tests identity and is unreliable for strings
            ans += s[i]
            i += 1
        else:
            end_pos = re.search(r"\)", s[i:]).end() # find how far the closing char is
            num_chars, repeat = get_nums(s[i:i+end_pos]) # grab digits from the marker
            
            ans += s[i+end_pos:i+end_pos+num_chars] * repeat
            
            # now move ahead by the length of the marker and the characters repeated
            i += num_chars + len(s[i:i+end_pos])
    
    return ans

assert decompress("X(8x2)(3x3)ABCY") == "X(3x3)ABC(3x3)ABCY"
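Part one only asks for the length, so building the full string is optional. As a minimal sketch (the name `v1_length` is hypothetical and not part of the notebook), the same index walk can add up lengths instead, checked here against the six examples from the puzzle text:

```python
def v1_length(s):
    """Version-one decompressed length, computed without building the output."""
    length = 0
    i = 0
    while i < len(s):
        if s[i] != "(":
            length += 1
            i += 1
        else:
            close = s.index(")", i)  # index of the marker's closing parenthesis
            num_chars, repeat = (int(n) for n in s[i + 1:close].split("x"))
            # Guard against a marker that asks for more data than remains.
            available = min(num_chars, len(s) - (close + 1))
            length += available * repeat
            i = close + 1 + num_chars  # skip the marker and its data section
    return length

examples = {
    "ADVENT": 6,
    "A(1x5)BC": 7,
    "(3x3)XYZ": 9,
    "A(2x2)BCD(2x2)EFG": 11,
    "(6x1)(1x3)A": 6,
    "X(8x2)(3x3)ABCY": 18,
}
for text, expected in examples.items():
    assert v1_length(text) == expected
```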

In [13]:
len(decompress(data).strip())

99145

# --- Part Two ---

Apparently, the file actually uses version two of the format.

In version two, the only difference is that markers within decompressed data are decompressed. This, the documentation explains, provides much more substantial compression capabilities, allowing many-gigabyte files to be stored in only a few kilobytes.

For example:

- (3x3)XYZ still becomes XYZXYZXYZ, as the decompressed section contains no markers.
- X(8x2)(3x3)ABCY becomes XABCABCABCABCABCABCY, because the decompressed data from the (8x2) marker is then further decompressed, thus triggering the (3x3) marker twice for a total of six ABC sequences.
- (27x12)(20x12)(13x14)(7x10)(1x12)A decompresses into a string of A repeated 241920 times.
- (25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN becomes 445 characters long.

Unfortunately, the computer you brought probably doesn't have enough memory to actually decompress the file; you'll have to come up with another way to get its decompressed length.

**What is the decompressed length of the file using this improved format?**

---

We don't actually need to decompress the data; we only need to track how long it would become, recursing into each marker's data section:

In [14]:
def get_decompressed_length(s):
    """takes in a string s and returns the decompressed length"""
    length = 0 # length of the decompressed string
    i = 0      # index, or where we are at on the string
    
    while i < len(s):
        if s[i] != "(":  # compare by value; `is not` tests identity and is unreliable for strings
            #ans += s[i]
            i += 1
            length += 1
        else:
            end_pos = re.search(r"\)", s[i:]).end()
            num_chars, repeat = get_nums(s[i:i+end_pos])
            
            #ans += s[i+end_pos:i+end_pos+num_chars] * repeat
            length += (get_decompressed_length(s[i+end_pos:i+end_pos+num_chars]) * repeat)
            
            # now move ahead by the length of the marker and the characters repeated
            i += num_chars + len(s[i:i+end_pos])
    
    return length

assert get_decompressed_length("X(8x2)(3x3)ABCY") == len("XABCABCABCABCABCABCY")
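For comparison, here is a non-recursive alternative, which is a different technique from the one above. It gives every character a weight that starts at 1 and multiplies the weights inside a marker's data section by that marker's repeat count. The sketch assumes nested markers never reach past their parent's data section (this holds for the puzzle's examples), and the names `MARKER` and `weighted_length` are hypothetical.

```python
import re

MARKER = re.compile(r"\((\d+)x(\d+)\)")

def weighted_length(s):
    """Version-two decompressed length via per-character weights (no recursion)."""
    weights = [1] * len(s)  # how many times each character ends up in the output
    total = 0
    i = 0
    while i < len(s):
        m = MARKER.match(s, i)
        if m is None:
            total += weights[i]  # plain character: counts once per enclosing repeat
            i += 1
            continue
        num_chars, repeat = int(m.group(1)), int(m.group(2))
        # Scale everything in this marker's data section, nested markers included.
        for j in range(m.end(), min(m.end() + num_chars, len(s))):
            weights[j] *= repeat
        i = m.end()  # the marker itself contributes nothing to the length
    return total

assert weighted_length("(3x3)XYZ") == 9
assert weighted_length("X(8x2)(3x3)ABCY") == 20
assert weighted_length("(27x12)(20x12)(13x14)(7x10)(1x12)A") == 241920
assert weighted_length("(25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN") == 445
```

In the third example the single `A` sits inside all five markers, so its weight is 12 * 12 * 14 * 10 * 12 = 241920.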

In [15]:
get_decompressed_length(data)

10943094568

`10943094568` is the right answer.

# Notes:

- sometimes you don't have to actually build the thing you are asked about; tracking a summary of it (here, the length) is good enough
- code a few recursive functions to get familiar with recursion