In [160]:
import utils

import re
from collections import Counter
from itertools import pairwise
from dataclasses import dataclass, field
from collections import defaultdict, deque
from line_profiler import profile
%load_ext line_profiler

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler


## Day 9: Disk Fragmenter

[#](https://adventofcode.com/2024/day/9) A disk map represents the layout of files and free space. The digits alternate between the number of blocks in a file and the number of blocks of free space.

1. Defragment the disk.
2. Calculate a disk checksum by summing file_id * block_position over every block (ignoring empty blocks).
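As a quick sanity check of the checksum rule, here is a minimal sketch that scores an already-rendered layout string. `layout_checksum` is a name made up for this example, and it assumes every file ID is a single digit, so it only works on the sample layouts and not on the real puzzle input.

```python
def layout_checksum(layout: str) -> int:
    """Checksum of a rendered layout string; '.' marks an empty block.

    Assumption: each character is one block and file IDs are single digits.
    """
    return sum(pos * int(ch) for pos, ch in enumerate(layout) if ch != ".")


# Fully compacted sample layout from the puzzle
print(layout_checksum("0099811188827773336446555566.............."))  # 1928
```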

This problem is interesting as it's a classic CS problem which can be solved in multiple ways. I vaguely remember linked lists and pointers from C++, but over here in Python I'm going to represent a disk as a list of blocks.

In [137]:
sample_input: str = """2333133121414131402"""

puzzle_input = utils.get_input(9, splitlines=False)

And on day 9, I finally got around to using OO programming - a disk is a list of blocks.

In [159]:
@dataclass
class Block:
    pos: int
    file_id: int  # file_id the block contains, -1 is empty

    def checksum(self):
        return self.pos * self.file_id


@dataclass
class Disk:
    blocks: list = field(default_factory=list)

    def checksum(self):
        return sum((b.pos * b.file_id for b in self.blocks if b.file_id > -1))

    def swap_blocks(self, pos1: int, pos2: int):
        self.blocks[pos1].file_id, self.blocks[pos2].file_id = (
            self.blocks[pos2].file_id,
            self.blocks[pos1].file_id,
        )

    def get_first_empty_pos(self):
        for block in self.blocks:
            if block.file_id == -1:
                return block.pos
        return False

    def get_last_full(self):
        for block in self.blocks[::-1]:
            if block.file_id != -1:
                return block.pos

    @profile
    def defrag_blocks(self, debug=False):
        if debug:
            print(self)
        empty_pos = self.get_first_empty_pos()
        right_pos = self.get_last_full()

        while right_pos > empty_pos:
            self.swap_blocks(empty_pos, right_pos)

            empty_pos = self.get_first_empty_pos()
            right_pos = self.get_last_full()
            if debug:
                print(self)

    def __str__(self):
        return "".join(
            [str(b.file_id) if b.file_id != -1 else "." for b in self.blocks]
        )


def parse_input(input_str=sample_input, debug: bool = False):
    blocks = []
    nums = [int(n) for n in input_str.strip()]

    position = 0

    # the input alternates between file lengths and free-space lengths
    for idx, length in enumerate(nums):  # idx avoids shadowing the builtin id()
        is_file = idx % 2 == 0
        file_id = idx // 2 if is_file else -1
        for _ in range(length):
            blocks.append(Block(position, file_id))
            position += 1

    return Disk(blocks)


disk = parse_input(sample_input, True)
print(disk)

00...111...2...333.44.5555.6666.777.888899


In [165]:
%%time
def solve(inp: str = sample_input, debug: bool = False):
    disk = parse_input(inp)

    disk.defrag_blocks(debug)

    return {"result": disk.checksum()}


assert solve(sample_input, True)["result"] == 1928  # sample ans check

results = solve(puzzle_input, debug=False)
print(f"Part 1: {results['result']}")

00...111...2...333.44.5555.6666.777.888899
009..111...2...333.44.5555.6666.777.88889.
0099.111...2...333.44.5555.6666.777.8888..
00998111...2...333.44.5555.6666.777.888...
009981118..2...333.44.5555.6666.777.88....
0099811188.2...333.44.5555.6666.777.8.....
009981118882...333.44.5555.6666.777.......
0099811188827..333.44.5555.6666.77........
00998111888277.333.44.5555.6666.7.........
009981118882777333.44.5555.6666...........
009981118882777333644.5555.666............
00998111888277733364465555.66.............
0099811188827773336446555566..............
Part 1: 6471961544878
CPU times: user 16.6 s, sys: 189 ms, total: 16.7 s
Wall time: 16.8 s


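This was quite slow. I would guess the main culprit is the defrag func: after every swap it rescans the block list from both ends to find the next empty and full positions, and get_last_full also copies the whole list with [::-1] on each call. Testing that theory: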

In [168]:
%lprun -f disk.defrag_blocks solve(puzzle_input, debug=False)

Timer unit: 1e-09 s

Total time: 82.4217 s
File: /var/folders/8k/lqs5dh5j5ln8f85vhjz20ctr0000gn/T/ipykernel_13454/2316221461.py
Function: defrag_blocks at line 34

Line #      Hits         Time  Per Hit   % Time  Line Contents
    34                                               @profile
    35                                               def defrag_blocks(self, debug=False):
    36         1       1000.0   1000.0      0.0          if debug:
    37                                                       print(self)
    38         1       2000.0   2000.0      0.0          empty_pos = self.get_first_empty_pos()
    39         1     911000.0 911000.0      0.0          right_pos = self.get_last_full()
    40                                           
    41     23695   11980000.0    505.6      0.0          while right_pos > empty_pos:
    42     23694   14313000.0    604.1      0.0              self.swap_blocks(empty_pos, right_pos)
    43                                           
    44  

I was right, it's the iterations looking for blocks which take up 99.5% of the time. This is an easy problem to fix, but I'm leaving it as is for future learning.
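For reference, one common way to avoid the repeated scans is a two-pointer sweep: one index walks in from the left looking for free blocks, the other walks in from the right looking for file blocks, and neither ever moves backwards. The sketch below is not the notebook's code. It works on a plain list of ints (with -1 for empty) rather than on Block objects, and `expand`, `defrag_two_pointer` and `checksum` are names made up for illustration.

```python
def expand(disk_map: str) -> list[int]:
    """Turn the dense disk map into one file ID per block (-1 = empty)."""
    blocks: list[int] = []
    for i, ch in enumerate(disk_map.strip()):
        file_id = i // 2 if i % 2 == 0 else -1
        blocks.extend([file_id] * int(ch))
    return blocks


def defrag_two_pointer(blocks: list[int]) -> list[int]:
    """Compact in place; each pointer only ever moves towards the other."""
    left, right = 0, len(blocks) - 1
    while True:
        while left < right and blocks[left] != -1:   # next free block
            left += 1
        while left < right and blocks[right] == -1:  # last file block
            right -= 1
        if left >= right:
            break
        blocks[left], blocks[right] = blocks[right], blocks[left]
    return blocks


def checksum(blocks: list[int]) -> int:
    return sum(pos * fid for pos, fid in enumerate(blocks) if fid != -1)


print(checksum(defrag_two_pointer(expand("2333133121414131402"))))  # 1928
```

Each block is visited a constant number of times, so the sweep is linear in the number of blocks instead of rescanning the list after every swap.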

## Part 2

Instead of moving individual blocks, move whole files: starting from the right (highest file ID first), move each file to the leftmost span of free space which will fit it.

My solution above is a bit shit, so I had to redo it for part 2. My unit is now a Chunk (a run of blocks) instead of a single block.

The tricky part which got me stuck for a while: I was looping over a list while inserting into it, which messed up the index I was looking at. Moral of the story: don't insert into a list while looping over it!
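A tiny illustration of how an index goes stale after an insert. The list of strings is a hypothetical stand-in for the chunks list, not the real data structure:

```python
# Hypothetical stand-in for the chunks list: strings instead of Chunk objects
chunks = ["file0", "free", "file1", "file2"]

cached_idx = chunks.index("file2")   # 3, remembered before the list changes
chunks.insert(2, "leftover-free")    # e.g. the remainder of a split free chunk

print(chunks[cached_idx])            # file1: the cached index is now stale
print(chunks.index("file2"))         # 4: looking the file up again is safe
```

Everything to the right of the insertion point shifts by one, so any index computed before the insert has to be adjusted or looked up again.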

There are a lot of ways to speed up the code below, e.g. maintaining a separate file index instead of going through the chunks list every time to find a file, and I'm sure many others.


In [174]:
@dataclass
class Chunk:
    """Chunk of Blocks, size defines how many blocks a chunk has"""

    pos: int
    size: int
    file_id: int  # id of the file stored in the chunk, -1 if empty

    def checksum(self):
        return self.pos * self.file_id


@dataclass
class Disk:
    chunks: list = field(default_factory=list)
    size: int = field(init=False)

    def __post_init__(self):
        self.size = len(self.chunks)

    def checksum(self):
        ans = 0
        position = 0
        for chunk in self.chunks:
            if chunk.file_id > -1:
                for pos in range(position, position + chunk.size):
                    ans += pos * chunk.file_id
                    position += 1
            else:
                position += chunk.size
        return ans

    def swap_chunks(self, free_pos, file_pos, debug=False):
        """swaps chunks, splits free space if needed"""
        b1, b2 = self.chunks[free_pos], self.chunks[file_pos]
        if debug:
            print("swapping :", b1, b2)
        if (b1.size >= b2.size) and b1.file_id == -1:
            if b1.size > b2.size:  # split the free chunk
                new_chunk = Chunk(b1.pos + b2.size, b1.size - b2.size, -1)
                self.chunks[free_pos].size = b2.size

                self.chunks.insert(free_pos + 1, new_chunk)
                self.size += 1
                file_pos += 1  # the insert shifted every chunk right of free_pos, including the file

            self.chunks[free_pos], self.chunks[file_pos] = (
                self.chunks[file_pos],
                self.chunks[free_pos],
            )

    def defrag(self, debug=False):
        file_ids = sorted(
            {b.file_id for b in self.chunks if b.file_id != -1}, reverse=True
        )

        moved_files = set()  # Keep track of moved files

        if debug:
            print(self)

        for file_id in file_ids:
            if file_id in moved_files:
                continue  # skip already moved files

            # find index of file - this is slow, ideally maintain an index of chunk: position to make this a lookup
            for i, chunk in enumerate(self.chunks):
                if chunk.file_id == file_id:
                    idx = i
                    break
            file = self.chunks[idx]
            # find leftmost free chunk which can fit our file
            for j in range(idx):  # only look up to the file's position in the chunks list
                chunk = self.chunks[j]
                if chunk.file_id == -1 and chunk.size >= file.size:
                    self.swap_chunks(j, idx)
                    moved_files.add(file_id)
                    if debug:
                        print(self)
                    break

    def __str__(self):
        return "".join(
            [
                str(b.file_id) * b.size if b.file_id != -1 else "." * b.size
                for b in self.chunks
            ]
        )


def parse_input(input_str=sample_input, debug: bool = False):
    chunks = []
    nums = [int(n) for n in input_str.strip()]

    position = 0

    for idx, length in enumerate(nums):  # idx avoids shadowing the builtin id()
        is_file = idx % 2 == 0
        file_id = idx // 2 if is_file else -1
        chunks.append(Chunk(position, size=length, file_id=file_id))
        position += length

    return Disk(chunks)


disk = parse_input(sample_input, False)
disk.defrag(True)
assert "00992111777.44.333....5555.6666.....8888.." == str(disk)
assert disk.checksum() == 2858

00...111...2...333.44.5555.6666.777.888899
0099.111...2...333.44.5555.6666.777.8888..
0099.1117772...333.44.5555.6666.....8888..
0099.111777244.333....5555.6666.....8888..
00992111777.44.333....5555.6666.....8888..


In [175]:
%%time
disk = parse_input(puzzle_input, False)
disk.defrag()
disk.checksum()

CPU times: user 3.97 s, sys: 37 ms, total: 4.01 s
Wall time: 4 s


6511178035564

Part 2 runs much faster than part 1, but here as well an index of files would speed things up:

In [176]:
disk = parse_input(puzzle_input, False)
%lprun -f disk.defrag disk.defrag()

Timer unit: 1e-09 s

Total time: 48.4577 s
File: /var/folders/8k/lqs5dh5j5ln8f85vhjz20ctr0000gn/T/ipykernel_13454/302338560.py
Function: defrag at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
    52                                               def defrag(self, debug=False):
    53         2     123000.0  61500.0      0.0          file_ids = sorted(
    54     20000    3697000.0    184.8      0.0              {b.file_id for b in self.chunks if b.file_id != -1}, reverse=True
    55                                                   )
    56                                           
    57         1       1000.0   1000.0      0.0          moved_files = set()  # Keep track of moved files
    58                                           
    59         1          0.0      0.0      0.0          if debug:
    60                                                       print(self)
    61                                           
    62     10001    6068000.0    606.7 

About 60% of the time is spent just finding a file's index in the chunks list!
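One possible way to turn that search into a lookup, sketched under a few assumptions: instead of a single list of chunks, keep a dict from file ID to (start, size) and a separate list of free spans. This is a different layout from the Disk class above, not a drop-in patch, and `defrag_whole_files` is a made-up name. It also never records the space a file leaves behind, on the assumption that files only ever move left, so later (lower-ID) files never look that far right. I have not timed it against the real input.

```python
def defrag_whole_files(disk_map: str) -> int:
    """Part 2 checksum, tracking block positions instead of list indices."""
    files: dict[int, tuple[int, int]] = {}  # file_id -> (start, size)
    free: list[list[int]] = []              # [start, size], ordered by start
    pos = 0
    for i, ch in enumerate(disk_map.strip()):
        size = int(ch)
        if i % 2 == 0:
            files[i // 2] = (pos, size)
        elif size:
            free.append([pos, size])
        pos += size

    for file_id in sorted(files, reverse=True):
        start, size = files[file_id]        # dict lookup, no scan for the file
        for span in free:
            if span[0] >= start:
                break                       # only consider space left of the file
            if span[1] >= size:
                files[file_id] = (span[0], size)
                span[0] += size             # shrink the span in place, no insert
                span[1] -= size
                break

    # positions start..start+size-1 sum to size * (2*start + size - 1) / 2
    return sum(
        fid * (size * (2 * start + size - 1) // 2)
        for fid, (start, size) in files.items()
    )


print(defrag_whole_files("2333133121414131402"))  # 2858
```

Because free spans shrink in place, nothing is ever inserted into a list mid-loop, which also sidesteps the stale index problem from earlier. The scan for a fitting free span is still linear, so this only removes the file lookup cost.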