# Streaming binary format parsing

- <https://adventofcode.com/2021/day/16>

Part 1 is primarily a stream parsing task; given an input stream of hex digits, parse the stream into packets of data.

I decided to unpack the hex value as a byte string; in Python the `bytes` type has a convenient [`fromhex()` method](https://docs.python.org/3/library/stdtypes.html#bytes.fromhex), and I made sure to pad the end of the input transmission string with an extra `0` if it didn't contain an even number of hex digits (that turned out not to be necessary for either the test or the puzzle inputs). I did this because you can then index into the `bytes` object and Python gives you an 8-bit integer, ideal for applying bitwise operations to extract the desired bits.
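As a quick illustration of why `bytes` is convenient here, this is a minimal sketch using the first example transmission from the puzzle text (the variable names are only for illustration):

```python
transmission = "D2FE28"

# Pad with a trailing 0 nibble if the hex string has an odd number of digits.
if len(transmission) % 2:
    transmission += "0"

data = bytes.fromhex(transmission)

# Indexing a bytes object yields an int in the range 0-255.
first = data[0]
print(first, format(first, "08b"))  # 210 11010010

# The packet version lives in the top 3 bits of the first byte.
version = first >> 5
print(version)  # 6
```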

Next, the class encapsulating the stream (the stream _reader_) tracks the position in the input stream of bytes, as well as an offset into the current byte so we can read data at the bit level. This requires that you use [bit shifting](https://en.wikipedia.org/wiki/Bitwise_operation#Bit_shifts) to move the target bits over to the right, as well as [bit masking](<https://en.wikipedia.org/wiki/Mask_(computing)#Masking_bits_to_0>) to turn off any other bits present in the byte. To make these operations that little bit more efficient I pre-compute those masks and shift counts for every valid combination of bit offset into the byte and number of bits to read. Of course, reading a given number of bits might well require that you look at multiple consecutive bytes, so the `StreamReader.read_bits()` method uses a loop and left-shifts to build up the resulting integer value from multiple bytes, as needed. If this were a real-world project, the reader would need to manage a transmission or read buffer, and handle the transmission or file running out of data prematurely, but for AoC we can ignore such trivial error-handling concerns.
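To make the shifting and masking concrete, here is a stand-alone sketch of the same idea. It is not the `StreamReader` implementation used in this notebook: it takes an absolute bit offset, computes the shift and mask on the fly instead of looking them up in a table, and shifts before masking. The function name and signature are assumptions made for this example.

```python
def read_bits(data: bytes, bit_offset: int, count: int) -> int:
    """Read `count` bits from `data`, starting `bit_offset` bits from the start."""
    result = 0
    while count:
        pos, bpos = divmod(bit_offset, 8)  # byte index, bit offset in that byte
        take = min(count, 8 - bpos)        # bits we can take from this byte
        shift = 8 - bpos - take            # moves the target bits to the right
        mask = (1 << take) - 1             # keeps only the lowest `take` bits
        result = (result << take) | ((data[pos] >> shift) & mask)
        bit_offset += take
        count -= take
    return result


data = bytes.fromhex("D2FE28")  # 11010010 11111110 00101000
print(read_bits(data, 0, 3))  # version: 6
print(read_bits(data, 3, 3))  # type id: 4
print(read_bits(data, 6, 5))  # first literal group, spans two bytes: 23
```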

To read each type of packet (literal or operator), the stream reader delegates to the `BasePacket` class, passing itself in. This class reads the 3 bits for the version + the 3 bits for the packet type, then _dispatches_ to the right packet subclass to handle reading the rest of the data. Given that part 1 was hand-wavy about the operator types not mattering _yet_ I anticipated that there'd be another 7 types of operator to handle later on, so I created a dispatch system based on registration via the [`object.__init_subclass__()` class hook](https://docs.python.org/3/reference/datamodel.html#object.__init_subclass__). This hook is called whenever a subclass of the class with such a hook is created, and allows you to set class-specific parameters via keyword arguments in a [`class <ClassName>(bases, ...)` statement](https://docs.python.org/3/reference/compound_stmts.html#class). If a specific packet subtype uses `type=<type_id>` in the class definition, it'll be used by the base packet class to read packet data for that type, falling back to the generic `BaseOperatorPacket` class for unknown packet types that I am sure part 2 will tell us more about. This model lets you nicely compartmentalize reading of the packet types and their data, as well as handle reading the contained sub-packets for operators.
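Stripped of the packet-specific details, the registration pattern looks roughly like this. The `Command`, `Ping`, `Pong` and `UnknownCommand` names and the `kind` keyword are hypothetical, chosen only to show the mechanism in isolation:

```python
from __future__ import annotations


class Command:
    # dispatch table mapping a kind id to the subclass that handles it
    registry: dict[int, type[Command]] = {}

    def __init_subclass__(cls, kind: int | None = None, **kwargs) -> None:
        # called once for every subclass, with the class-statement keywords
        super().__init_subclass__(**kwargs)
        if kind is not None:
            Command.registry[kind] = cls

    @classmethod
    def create(cls, kind: int) -> Command:
        # fall back to a generic subclass for ids nobody registered
        return Command.registry.get(kind, UnknownCommand)()


class UnknownCommand(Command):
    pass


class Ping(Command, kind=1):
    pass


class Pong(Command, kind=2):
    pass


print(type(Command.create(1)).__name__)   # Ping
print(type(Command.create(99)).__name__)  # UnknownCommand
```

Adding support for a new id is then just a matter of defining another subclass; the dispatching code never has to change.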

Reading a literal then is simply a matter of reading 5 bits in a loop, using the 4 right-most bits to build up the integer value (left-shifting the accumulating value by 4, then adding the new 4 bits after masking off the 5th bit), until the most-significant bit of a 5-bit group is 0.
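As a worked example, the literal in the puzzle's `D2FE28` transmission consists of the groups `10111`, `11110` and `00101`. The sketch below decodes them from a plain list rather than a stream; the `decode_literal()` helper is an assumption for this illustration and not part of the solution code:

```python
def decode_literal(groups: list[int]) -> int:
    """Combine 5-bit groups (continuation bit + 4 value bits) into an integer."""
    value = 0
    for chunk in groups:
        # make room for 4 more bits, then add the low 4 bits of the group
        value = (value << 4) | (chunk & 0xF)
        if not chunk & 0x10:  # continuation bit is 0: this was the last group
            break
    return value


groups = [0b10111, 0b11110, 0b00101]
print(decode_literal(groups))  # 0b0111_1110_0101 == 2021
```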

As is not uncommon with data transmission formats that have been around for a while, there are multiple ways of expressing how much data you need to expect to read for recursively-embedded packets. The `1` variant (11 bits with a packet count) lets us just loop _count_ times and ask the stream reader to read the next packet, but the `0` variant (15 bits counting out the number of bits the contained packets will take up) requires that all packets track their own size. As you read packets, you can then track if you have read enough data for all child packets that the operator packet covers. Tracking the size of a given packet is not hard, of course, just a little finicky as you need to account for the initial 6 bits with the version number and type id, plus whatever bits are necessary to implement the packet type.
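The size bookkeeping can be checked by hand against two of the puzzle's examples. The helper functions below are hypothetical and exist only to spell out the arithmetic:

```python
HEADER_BITS = 6  # 3-bit version + 3-bit type id


def literal_size(groups: int) -> int:
    """Size in bits of a literal packet made up of `groups` 5-bit groups."""
    return HEADER_BITS + 5 * groups


def operator_size(length_type_id: int, child_sizes: list[int]) -> int:
    """Size in bits of an operator packet, given the sizes of its children."""
    length_field = 15 if length_type_id == 0 else 11
    return HEADER_BITS + 1 + length_field + sum(child_sizes)


# 38006F45291200: length type 0 with a 15-bit length of 27, covering two
# literals of 1 and 2 groups (11 + 16 = 27 bits).
children = [literal_size(1), literal_size(2)]
print(sum(children), operator_size(0, children))  # 27 49

# EE00D40C823060: length type 1 with a count of 3 single-group literals.
print(operator_size(1, [literal_size(1)] * 3))  # 51
```

Both transmissions are 14 hex digits, or 56 bits, so the remaining 7 and 5 bits respectively are padding.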

Finally, a (cached) property takes care of exposing the version number sum; for literal packets, just return the `version` attribute, for operator packets, the version sum is that of their own `version` value plus the sum of all child `version_sum` values.


In [1]:
from __future__ import annotations

from dataclasses import dataclass
from functools import cached_property
from typing import ClassVar, Final, Iterator, TypeAlias, TypeVar

T = TypeVar("T", bound="BasePacket")
TypeId: TypeAlias = int


@dataclass(frozen=True)
class BasePacket:
    type: ClassVar[TypeId]
    # dispatch table to read specific packet types
    types: ClassVar[dict[TypeId, type[T]]] = {}

    def __init_subclass__(cls, type: TypeId | None = None, **kwargs) -> None:
        """Register a subclass for dispatch on a specific packet type_id"""
        if type is not None:
            cls.type = type
            BasePacket.types[type] = cls
        super().__init_subclass__(**kwargs)

    version: int
    size: int  # in bits

    @cached_property
    def version_sum(self) -> int:
        """Sum of the version values for this packet and any sub-packets"""
        return self.version

    @cached_property
    def expression_value(self) -> int:
        """Part 2 expression value for this packet"""
        raise NotImplementedError

    @classmethod
    def _read(cls: type[T], version: int, stream: StreamReader) -> T:
        """Read data for single packet from the stream

        The version and type have already been read, subclasses should implement
        how each packet type data is to be read from the stream, and create the
        specific packet instance.

        """
        raise NotImplementedError

    @classmethod
    def from_stream(cls, stream: StreamReader) -> BasePacket:
        """Read the next packet from the stream reader

        Dispatches to specific packet types based on the type_id read from the
        stream, falling back to BaseOperatorPacket if no specific type subclass
        is found.

        """
        version, type_id = stream.read_bits(3), stream.read_bits(3)
        return BasePacket.types.get(type_id, BaseOperatorPacket)._read(version, stream)


@dataclass(frozen=True)
class BaseOperatorPacket(BasePacket):
    children: tuple[BasePacket, ...]

    @property
    def version_sum(self) -> int:
        return self.version + sum(child.version_sum for child in self.children)

    def __len__(self) -> int:
        """Number of child packets contained"""
        return len(self.children)

    def __iter__(self) -> Iterator[int]:
        """Part 2: iterate over the expression value of each child packet"""
        yield from (child.expression_value for child in self.children)

    def __getitem__(self, i: int) -> int:
        """Part 2: get the expression value of the ith child packet"""
        return self.children[i].expression_value

    @classmethod
    def _read(cls: type[T], version: int, stream: StreamReader) -> T:
        """Read operator packet and contained child packets from the stream

        The number of subpackets read is determined from the first bit following
        the version and type_id bits. If 0, read a 15-bit length value from the
        stream, then read child packets until their total bit size is equal to
        that length value. If 1, read an 11-bit child packet count, then read
        child packet count number of packets from the stream.

        """
        length_type_id = stream.read_bits(1)
        size = 7  # version + type + flag bits
        subpackets = []
        if length_type_id == 0:
            # length is in bits
            packet_length = stream.read_bits(15)
            size += 15 + packet_length
            while packet_length:
                sub = next(stream)
                packet_length -= sub.size
                subpackets.append(sub)
            assert packet_length == 0
        else:
            # length is in packets
            packet_count = stream.read_bits(11)
            size += 11
            for _ in range(packet_count):
                sub = next(stream)
                size += sub.size
                subpackets.append(sub)
        return cls(version, size, tuple(subpackets))


@dataclass(frozen=True)
class LiteralPacket(BasePacket, type=4):
    value: int

    @cached_property
    def expression_value(self) -> int:
        """Part 2 expression value for this packet is the literal value"""
        return self.value

    @classmethod
    def _read(cls, version: int, stream: StreamReader) -> LiteralPacket:
        """Read a literal value packet

        Reads groups of 5 bits, containing a continuation bit and 4 bits
        for the literal value. Reading continues until the continuation bit
        is 0.

        """
        value = 0
        size = 6  # length of version + type bits
        while True:
            chunk = stream.read_bits(5)
            size += 5
            value = (value << 4) | (chunk & 0xF)
            if not chunk & 0x10:
                break
        return cls(version, size, value)


# pre-computed bitmasks and right-shift amount to extract a certain number of
# bits from a byte, given the bit offset (0-7) and the number of bits to extract
# (1-8).
MASKS: Final[dict[tuple[int, int], tuple[int, int]]] = {
    (offset, count): ((2**count - 1) << (8 - offset - count), 8 - offset - count)
    for offset in range(8)
    for count in range(1, 9)
    if count + offset <= 8
}


class StreamReader:
    def __init__(self, stream: bytes) -> None:
        self.stream = stream
        self.pos = 0  # byte position in stream
        self.bit_pos = 0  # offset into current byte (0-7)

    @classmethod
    def from_string(cls, s: str) -> StreamReader:
        # if the input length doesn't align to full bytes, pad the end with
        # an extra 0 nibble so bytes.fromhex() accepts it.
        if len(s) % 2:
            s += "0"
        return cls(bytes.fromhex(s))

    def read_bits(self, count: int) -> int:
        result = 0
        bpos, pos, stream = self.bit_pos, self.pos, self.stream
        while count:
            bcount = min(count, 8 - bpos)
            count -= bcount
            result <<= bcount
            mask, shift = MASKS[bpos, bcount]
            result |= (stream[pos] & mask) >> shift
            bpos += bcount
            if bpos == 8:
                bpos = 0
                pos += 1
        self.bit_pos, self.pos = bpos, pos
        return result

    def __iter__(self) -> Iterator[BasePacket]:
        return self

    def __next__(self) -> BasePacket:
        return BasePacket.from_stream(self)


tests = {
    "D2FE28": 6,
    "38006F45291200": 9,
    "EE00D40C823060": 14,
    "8A004A801A8002F478": 16,
    "620080001611562C8802118E34": 12,
    "C0015000016115A2E0802F182340": 23,
    "A0016C880162017C3686B18A3D4780": 31,
}

for test_input, expected in tests.items():
    assert next(StreamReader.from_string(test_input)).version_sum == expected

In [2]:
import aocd

transmission = aocd.get_data(day=16, year=2021)
print("Part 1:", next(StreamReader.from_string(transmission)).version_sum)

Part 1: 901


# Part 2: executing the operators

Now that we can decode packets, we can define the operators; all I had to do was provide implementations of each operator type. I did have to update part 1 to give my classes a notion of _expression values_, and I expanded the `BaseOperatorPacket` class with some additional Python magic methods to implement indexing and iteration over the expression values of the contained child packets. I could have done the same with mixin classes, but that would have been a lot more verbose.


In [3]:
import operator
from functools import reduce


class OpSum(BaseOperatorPacket, type=0):
    @cached_property
    def expression_value(self) -> int:
        return sum(self)


class OpProduct(BaseOperatorPacket, type=1):
    @cached_property
    def expression_value(self) -> int:
        return reduce(operator.mul, self)


class OpMinimum(BaseOperatorPacket, type=2):
    @cached_property
    def expression_value(self) -> int:
        return min(self)


class OpMaximum(BaseOperatorPacket, type=3):
    @cached_property
    def expression_value(self) -> int:
        return max(self)


class OpGreaterThan(BaseOperatorPacket, type=5):
    @cached_property
    def expression_value(self) -> int:
        assert len(self.children) == 2
        return int(self[0] > self[1])


class OpLessThan(BaseOperatorPacket, type=6):
    @cached_property
    def expression_value(self) -> int:
        assert len(self.children) == 2
        return int(self[0] < self[1])


class OpEqualTo(BaseOperatorPacket, type=7):
    @cached_property
    def expression_value(self) -> int:
        assert len(self.children) == 2
        return int(self[0] == self[1])


expression_tests = {
    "C200B40A82": 3,
    "04005AC33890": 54,
    "880086C3E88112": 7,
    "CE00C43D881120": 9,
    "D8005AC2A8F0": 1,
    "F600BC2D8F": 0,
    "9C005AC2F8F0": 0,
    "9C0141080250320F1802104A08": 1,
}
for test_input, expected in expression_tests.items():
    assert next(StreamReader.from_string(test_input)).expression_value == expected

In [4]:
print("Part 2:", next(StreamReader.from_string(transmission)).expression_value)

Part 2: 110434737925


# Comparison with real-world data streams

Today's exercise is quite a good model for how real-world binary formats work. They usually do not pack data into such odd bit counts, however; computers work much more efficiently with data packed into whole bytes (or power-of-2 multiples of bytes). Individual bytes can still contain multiple pieces of information, such as 1-bit flags or several smaller integer values.

## Continuation flags

The continuation bit set in the literal value format is exactly how a range of variable-width encodings work, such as [UTF-8](https://en.wikipedia.org/wiki/UTF-8). Some formats, like UTF-8, use multiple bits to handle continuation signalling (in UTF-8, the number of bits for this purpose is _variable too_, from 1 to 5). The advantage of such a format is that you can encode complex data into a much more compact form than if you used a fixed number of bytes for every possible value you encode, but the disadvantage is that you can't just index into a stream to get to a specific Nth value. For UTF-8, you can encode ASCII text (i.e. the majority of textual computer data in the western world) in just 1 byte per character, increasing to 2 bytes for most [Latin-script alphabets](https://en.wikipedia.org/wiki/Latin-script_alphabet), while 3 bytes covers the remainder of the [Unicode Basic Multilingual Plane](<https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>) (BMP) and you only need 4 bytes for Unicode data beyond the BMP. But, if you need to index into a UTF-8 byte string, you'll have to look at the first 4 bits of a lot of the bytes between the start of such a stream and the Nth codepoint you want to skip to (rule of thumb: is the left-most bit not set? Then move on to the next byte. Otherwise, count the consecutive left-most bits that are `1`; the codepoint takes up that many bytes, so skip that many bytes minus 1 to reach the end of it).
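That rule of thumb can be sketched as follows. This assumes the input is valid UTF-8 that contains at least `n` codepoints, and the `codepoint_offset()` name is made up for this example:

```python
def codepoint_offset(data: bytes, n: int) -> int:
    """Return the byte offset of the nth (0-based) codepoint in valid UTF-8."""
    pos = 0
    for _ in range(n):
        lead = data[pos]
        if lead < 0x80:             # 0xxxxxxx: 1 byte
            pos += 1
        elif lead >> 5 == 0b110:    # 110xxxxx: 2 bytes
            pos += 2
        elif lead >> 4 == 0b1110:   # 1110xxxx: 3 bytes
            pos += 3
        else:                       # 11110xxx: 4 bytes
            pos += 4
    return pos


# 'a' (1 byte), e-acute (2), euro sign (3), an emoji (4), then '!'
text = "a\u00e9\u20ac\U0001f600!"
data = text.encode("utf-8")
offset = codepoint_offset(data, 4)
print(offset, data[offset:].decode("utf-8"))  # 10 !
```

Note that the loop still has to visit the lead byte of every codepoint before the target, which is the cost you pay for the compact encoding.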

## Sub-packet size expressed in total length or a count

The operator packets record either the total length of the subpackets (in bits; real-world formats are far more likely to record a byte count), or a number of packets. For formats with a variable packet length, those two numbers have very different consequences for how you read such a stream, and both have advantages and disadvantages. Because of this, many formats will use both, depending on what kind of information is encoded!

A known total size lets you skip over to the next packet in one big step, while a packet count requires that you read each sub-packet in turn, or at least enough of it to be able to skip the rest of that packet. But, a total size requires that the _encoder_ knows what size it is going to be sending, _up front_, and not all data streams lend themselves to this. For a streaming video format, for example, the encoder might not know how well later video data will compress and so won't know how many bytes it'll end up sending. But such an encoder will pull in video data to encode in chunks, and may well know that it is going to send N frames of video, each of variable size, and tell the decoder on the other end to expect N sub-packets.
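The difference shows up as soon as a decoder wants to skip a container it isn't interested in. The byte-level format below is entirely made up for illustration (a container header followed by records that each start with a 1-byte length); it is not taken from any real protocol:

```python
def skip_by_length(buf: bytes, pos: int) -> int:
    """Container header is a 2-byte big-endian total payload length."""
    length = int.from_bytes(buf[pos:pos + 2], "big")
    return pos + 2 + length  # one step, no need to look at the records


def skip_by_count(buf: bytes, pos: int) -> int:
    """Container header is a 1-byte record count."""
    count = buf[pos]
    pos += 1
    for _ in range(count):
        pos += 1 + buf[pos]  # read each record's length byte, skip its payload
    return pos


# Two records: b"abc" and b"z", each prefixed with its own length.
records = bytes([3]) + b"abc" + bytes([1]) + b"z"
by_length = len(records).to_bytes(2, "big") + records + b"END"
by_count = bytes([2]) + records + b"END"

print(by_length[skip_by_length(by_length, 0):])  # b'END'
print(by_count[skip_by_count(by_count, 0):])     # b'END'
```

The count-prefixed variant has to touch every record header, but the encoder could start sending records before knowing how large they will all be together.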
