# Filesystem operations

* https://adventofcode.com/2022/day/7

We have to do two things:

- parse the command sequence and the results of the `ls` commands
- track directory sizes

So, I built a filesystem representation to track which directories and files are found, as well as their sizes. This structure can then be traversed (via a `.walk()` iterator) to find all the directories whose _recursive_ size is within the required limit. Note that _nested_ directories count; in the example `/a/e` counts, but so does `/a` (whose size is that of `/a/e` plus the other entries in `/a`). To avoid recalculating the size of subdirectories each time we ask for the size of a parent, I cache the result of the directory `size` property.
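To illustrate the caching idea in isolation, here is a minimal sketch using `functools.cached_property`. The `Node` class and its `computations` counter are hypothetical stand-ins made up for this illustration (they are not part of the solution); the numbers are the sizes from the puzzle example.

```python
from functools import cached_property


class Node:
    """Hypothetical stand-in for a directory: bytes of its own files plus child nodes."""

    def __init__(self, own_bytes: int, *children: "Node") -> None:
        self.own_bytes = own_bytes
        self.children = children
        self.computations = 0  # how often the size was actually computed

    @cached_property
    def size(self) -> int:
        # Runs once per node; afterwards the stored value is returned directly.
        self.computations += 1
        return self.own_bytes + sum(child.size for child in self.children)


e = Node(584)                        # /a/e
a = Node(29116 + 2557 + 62596, e)    # /a: its own files plus /a/e
d = Node(24933642)                   # /d
root = Node(14848514 + 8504156, a, d)

# Ask for sizes repeatedly, parents first and then children.
for node in (root, a, e, d, root, a):
    node.size

print(root.size, a.size, [n.computations for n in (root, a, e, d)])
```

One caveat with this approach: the cached value is not invalidated when children are added later, so sizes should only be read once the tree is complete.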

The filesystem commands are very simple; only `cd <name>` and `ls` are used, and `cd` only takes `/` (go to the root), `..` (go to the parent directory) or a single directory name (no slashes). The filesystem listings are also very simple, either `dir <name>` for directories or `<size> <name>` for files, and nothing else, so I used the `match` statement (structural pattern matching, new in Python 3.10) to match the contents of a line (split on whitespace) against the possible options.

In [1]:
from abc import ABC, abstractmethod
from collections import deque
from dataclasses import dataclass, field
from functools import cached_property
from typing import Iterable, Iterator, Self


@dataclass
class Entry(ABC):
    name: str

    @property
    @abstractmethod
    def size(self) -> int:
        ...

    @classmethod
    def from_ls_output(cls, size_or_dir: str, name: str) -> "Entry":
        return Directory(name) if size_or_dir == "dir" else File(name, int(size_or_dir))



@dataclass
class File(Entry):
    _size: int

    @property
    def size(self) -> int:
        return self._size

@dataclass
class Directory(Entry):
    entries: dict[str, Entry] = field(default_factory=dict)

    @cached_property
    def size(self) -> int:
        return sum(entry.size for entry in self.entries.values())
    
    def __iadd__(self, entry: Entry) -> Self:
        self.entries[entry.name] = entry
        return self
    
    def __contains__(self, name: str) -> bool:
        return name in self.entries
    
    def __getitem__(self, name: str) -> Entry:
        return self.entries[name]
    

@dataclass
class Filesystem:
    root: Directory

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> Self:
        root = Directory("/")
        path = [root]
        for line in lines:
            cwd = path[-1]
            match line.split():
                case ["$", "ls"]:
                    continue
                case ["$", "cd", "/"]:
                    path = [root]
                case ["$", "cd", ".."]:
                    if cwd is not root:
                        path.pop()
                case ["$", "cd", name] if name in cwd and isinstance(entry := cwd[name], Directory):
                    path.append(entry)
                case ["$", "cd", name]:
                    raise ValueError(f"Invalid path: {name}")
                case ["$", *cmd]:
                    raise ValueError(f"Unknown command: {' '.join(cmd)}")
                case [size_or_dir, name]:
                    cwd += Entry.from_ls_output(size_or_dir, name)
                case _:
                    raise ValueError(f"Unknown output: {line}")
        return cls(root)

    def walk(self) -> Iterator[tuple[Directory, list[str]]]:
        """Traverse the filesystem directories in depth-first order.

        The second element is the list of subdirectory names; you can remove directories from this list
        to prune traversal.

        """
        stack = deque([self.root])
        while stack:
            cwd = stack.pop()
            names = [name for name, entry in cwd.entries.items() if isinstance(entry, Directory)]
            yield cwd, names
            for name in reversed(names):
                stack.append(cwd.entries[name])


example = """\
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
""".splitlines()

test_fs = Filesystem.from_lines(example)
assert sum(size for dir, _ in test_fs.walk() if (size := dir.size) <= 100000) == 95437

In [2]:
import aocd

terminal_output = aocd.get_data(day=7, year=2022).splitlines()
filesystem = Filesystem.from_lines(terminal_output)
print("Part 1:", sum(size for dir, _ in filesystem.walk() if (size := dir.size) <= 100000))

Part 1: 1644735


## Part two, optimising space

We are asked to find the one directory whose deletion would free enough space; so basically the *smallest* directory size that is at least the difference between the disk space currently used and the maximum amount that may be in use.
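Worked through for the puzzle example, the arithmetic looks like this. The `sizes` mapping below is written out by hand from the example listing purely for illustration; the names are not part of the solution.

```python
AVAILABLE = 70_000_000    # total disk space
TARGET_FREE = 30_000_000  # free space the update needs

# Recursive directory sizes from the puzzle example, written out by hand.
sizes = {"/": 48_381_165, "/a": 94_853, "/a/e": 584, "/d": 24_933_642}

used = sizes["/"]
unused = AVAILABLE - used       # space that is currently free
to_free = TARGET_FREE - unused  # what still has to be freed

# Only directories at least as large as the shortfall qualify; pick the smallest.
candidates = {path: size for path, size in sizes.items() if size >= to_free}
best = min(candidates, key=candidates.get)

print(unused, to_free, best, candidates[best])
```

Only `/` and `/d` are large enough here, and `/d` is the smaller of the two.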

Because I cached the directory sizes in the tree, this is easily achieved even without further optimisation, but you can avoid traversing into subfolders once you've figured out that the parent folder is not large enough: a subfolder can never be larger than its parent.

I've refactored the filesystem `walk()` method to work like the [`os.walk()` function](https://docs.python.org/3/library/os.html#os.walk), in that you can _prune_ the traversal by removing entries from the `dirnames` list. This lets us trivially skip whole subtrees that cannot contain a candidate directory.
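The pruning mechanism relies on the generator yielding the *same* list object it later iterates over, so any in-place change the caller makes is seen when the generator resumes. Here is a minimal sketch of that pattern on a plain nested-dict tree; the `walk_tree()` function and the dict layout are assumptions made for this illustration, not the `Filesystem.walk()` implementation itself.

```python
from typing import Iterator


def walk_tree(root_name: str, tree: dict) -> Iterator[tuple[str, list[str]]]:
    """Depth-first walk over nested dicts, yielding (path, child_names)."""
    stack = [(root_name, tree)]
    while stack:
        path, node = stack.pop()
        names = list(node)
        # The caller may mutate `names` in place before the generator resumes.
        yield path, names
        # Only the names still in the list are pushed; reversed keeps listing order.
        for child in reversed(names):
            stack.append((f"{path.rstrip('/')}/{child}", node[child]))


tree = {"a": {"e": {}}, "d": {}}

visited = []
for path, names in walk_tree("/", tree):
    visited.append(path)
    if path == "/a":
        names.clear()  # prune: do not descend below /a

print(visited)
```

Rebinding the name (for example `names = []`) would not prune anything, because the generator still holds the original list; the change has to be made in place, with `clear()`, `remove()` or slice assignment.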

In [3]:
from typing import Final


AVAILABLE: Final[int] = 70000000
TARGET_FREE: Final[int] = 30000000
MAX_USED: Final[int] = AVAILABLE - TARGET_FREE

def find_space(filesystem: Filesystem) -> int:
    total = filesystem.root.size
    if total <= MAX_USED:
        return 0
    min_size = total - MAX_USED
    def dirsizes() -> Iterator[int]:
        for dir, dirnames in filesystem.walk():
            if dir.size < min_size:
                # prune search
                dirnames.clear()
                continue
            yield dir.size
    return min(dirsizes())

assert find_space(test_fs) == 24933642

In [4]:

print("Part 2:", find_space(filesystem))

Part 2: 1300850
