Skip to content

Tachyon binary format chunking#2

Open
maurycy wants to merge 1 commit into
tachyon-runtime-socketfrom
tachyon-runtime-socket-chunking
Open

Tachyon binary format chunking#2
maurycy wants to merge 1 commit into
tachyon-runtime-socketfrom
tachyon-runtime-socket-chunking

Conversation

@maurycy
Copy link
Copy Markdown
Owner

@maurycy maurycy commented Jun 1, 2026

@pablogsal

The reason why I found python#150662 is as follows is that I'm wondering whether the approach I described in python#145464 is really the best one, and that likely Lalit is right.

I'd like you to invite to discuss this approach, since I'm still exploring. I don't think a code review makes sense yet. I merely aimed to check whether I can add chunks to the binary format, what kind of roadblocks I would hit. More like whether you agree with the direction.

My thinking goes that streaming is effectively the output target and we should really not pin ourselves with any format (eg: JSONL). Some formats are streamable. Specifically, all the --stream-types flags I described are JSONL-specific, which is not great.

The basic problem with starting with the binary format is that the dictonaries are saved at the end, but, apart from this, it's extremely efficient, it's great, and every file is effectively self-contained, so it'd be a waste to not use it for streaming.

A yet another problem with binary format is that it's not crash-proof and it's really NOT suitable for a really long-running headless sessions for precisely the same reason. Let's say that a large corporation wants to run Tachyon over a week or two, and, for some reason, it OOMs (sounds like a perfectly reasonable scenario). The output file will be corrupt.

I think the solution to both problems (ie: collector-agnostic streaming and making very long binary streaming) is chunking and/or periodical flushing:

% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --binary \
    # I believe that bytes: could be supported but I think it requires a number of complex decisions
    --flush samples:5000 \
    -- python zażółćgęsląjaźn.py

What follows:

% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    --output just-a-file-by-default.txt
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --jsonl \
    --flush interval:180s \
    --output file:just-a-file-by-default.txt
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --jsonl \
    --flush interval:180s \
    --output ws://213.25.252.50:9000
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    # or tcp-connect: and tcp-listen:
    # what about auth?
    --output tcp:213.25.252.50:9000
    -- python zażółćgęsląjaźn.py

Various --output would go under the streaming PR I'm not a huge fan of --ws and friends, but this gives us optionality.

From the file perspective, it could be just:

+------------------+  Offset 0
|     Chunk 0      |  Header, sample data, tables, footer
+------------------+  chunk0_size
|     Chunk 1      |  Header, sample data, tables, footer
+------------------+  chunk0_size + chunk1_size
|       ...        |
+------------------+

Precisely:

One CHUNK = a complete, independently-decodable mini-profile
(byte-for-byte the same shape as a whole v1 file)

+-----------------------------------------------------------------+  <- chunk_start
| HEADER  (64 B)                                                  |
|    @ 0  magic         <- written LAST  = commit marker          |
|    @ 4  version = 2                                             |
|    @12  start_time                                              |
|    @28  sample_count   (THIS chunk only)                        |
|    @36  string_table_offset                                     |
|    @44  frame_table_offset                                      |
|    @56  chunk_size     <- locator: where THIS chunk ends        |
+-----------------------------------------------------------------+
| SAMPLE DATA      delta-encoded; stack/timestamp baselines       |
|                  reset at every chunk boundary                  |
+-----------------------------------------------------------------+
| STRING TABLE     local to this chunk                            |
+-----------------------------------------------------------------+
| FRAME TABLE      local to this chunk                            |
+-----------------------------------------------------------------+
| FOOTER  (32 B)   string_count | frame_count | chunk_end         |
+-----------------------------------------------------------------+  <- chunk_start + chunk_size
(+ one fresh zstd frame per chunk when compression is on)

(Disclaimer: Claude-assisted graphics, based on the current docs.)

Unfortunately, this likely implies bumping the version to v2 because of using the file_size:

https://github.com/python/cpython/blob/540b3d0a7fa7cd842f064f79b1410cbd6868bffa/Modules/_remote_debugging/binary_io_reader.c#L115

I couldn't figure out any other option than this, and taking over reserved@56. Do I miss some hack here? Unfortunately, it's not really usable for streaming (since we cannot seek back and forth in the stream), and we get chunk start and chunk end symmetry.

I'm still researching if this is a common trick for dual-use formats:

https://github.com/apache/arrow/blob/e2b44378edf0ccceda07e8a6823890bd4ecc1952/docs/source/format/Columnar.rst#L1231-L1232

Maybe?

Would it make sense to start calculating the checksum, also? What about appending to files?

It's not visible in the CLI, and I'm using the following script to test it as a sanity check, except just tests we already have:

import collections
import tempfile
import types
import _remote_debugging


def sample(fn):
    fr = _remote_debugging.FrameInfo(
        (fn, _remote_debugging.LocationInfo((1, 1, 0, 0)), "f", None)
    )
    return [
        _remote_debugging.InterpreterInfo(
            (0, [_remote_debugging.ThreadInfo((1, 0, [fr]))])
        )
    ]


BATCHES = (("c0", 3), ("c1", 2), ("c2", 4))  # one chunk per batch

path = tempfile.mktemp(suffix=".bin")
w = _remote_debugging.BinaryWriter(path, 1000, 0, compression=0, chunked=True)
for name, count in BATCHES:
    for _ in range(count):
        w.write_sample(sample(name), 1000)
    w.flush()
w.finalize()

seen = collections.Counter()


def collect(stack_frames, timestamps_us=None):
    weight = len(timestamps_us) if timestamps_us else 1
    for interp in stack_frames:
        for thread in interp[1]:
            for frame in thread[2]:
                seen[frame[0]] += weight


with _remote_debugging.BinaryReader(path) as r:
    replayed = r.replay(types.SimpleNamespace(collect=collect))
    version = r.get_info()["version"]

expected = dict(BATCHES)
total = sum(expected.values())
assert version == 2, f"version {version} != 2 (not chunked)"
assert replayed == total, f"replayed {replayed} != {total}"
assert dict(seen) == expected, f"content {dict(seen)} != {expected}"
print(
    f"OK: {replayed} samples across {len(BATCHES)} chunks, content == {expected}"
)

(Disclaimer: Code-wise I tried to limit any changes to the existing code. I experimented with Codex 5.5 xhigh but it was very bad in this regard, making it effectively impossible to review, even while I prompted it explicitly exactly which functions to touch and how. I used it as a tool for internal code reviews, though.)

Some completely other random notes:

  • Oh, imagine -o - and converting to different formats on the fly...
  • To be honest, I think that the best pattern for something like --gecko should be recording binary, and then replaying.
  • There's something remotely related between dump command, wanted in the runtime control, and flush.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant