Tachyon binary format chunking by maurycy · Pull Request #2 · maurycy/cpython

maurycy · 2026-06-01T10:37:03Z

The reason why I found python#150662 is as follows is that I'm wondering whether the approach I described in python#145464 is really the best one, and that likely Lalit is right.

I'd like you to invite to discuss this approach, since I'm still exploring. I don't think a code review makes sense yet. I merely aimed to check whether I can add chunks to the binary format, what kind of roadblocks I would hit. More like whether you agree with the direction.

My thinking goes that streaming is effectively the output target and we should really not pin ourselves with any format (eg: JSONL). Some formats are streamable. Specifically, all the --stream-types flags I described are JSONL-specific, which is not great.

The basic problem with starting with the binary format is that the dictonaries are saved at the end, but, apart from this, it's extremely efficient, it's great, and every file is effectively self-contained, so it'd be a waste to not use it for streaming.

A yet another problem with binary format is that it's not crash-proof and it's really NOT suitable for a really long-running headless sessions for precisely the same reason. Let's say that a large corporation wants to run Tachyon over a week or two, and, for some reason, it OOMs (sounds like a perfectly reasonable scenario). The output file will be corrupt.

I think the solution to both problems (ie: collector-agnostic streaming and making very long binary streaming) is chunking and/or periodical flushing:

% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --binary \
    # I believe that bytes: could be supported but I think it requires a number of complex decisions
    --flush samples:5000 \
    -- python zażółćgęsląjaźn.py

What follows:

% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    --output just-a-file-by-default.txt
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --jsonl \
    --flush interval:180s \
    --output file:just-a-file-by-default.txt
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --jsonl \
    --flush interval:180s \
    --output ws://213.25.252.50:9000
    -- python zażółćgęsląjaźn.py
% ./python.exe -m profiling.sampling run \
    --binary \
    --flush interval:180s \
    # or tcp-connect: and tcp-listen:
    # what about auth?
    --output tcp:213.25.252.50:9000
    -- python zażółćgęsląjaźn.py

Various --output would go under the streaming PR I'm not a huge fan of --ws and friends, but this gives us optionality.

From the file perspective, it could be just:

+------------------+  Offset 0
|     Chunk 0      |  Header, sample data, tables, footer
+------------------+  chunk0_size
|     Chunk 1      |  Header, sample data, tables, footer
+------------------+  chunk0_size + chunk1_size
|       ...        |
+------------------+

Precisely:

One CHUNK = a complete, independently-decodable mini-profile
(byte-for-byte the same shape as a whole v1 file)

+-----------------------------------------------------------------+  <- chunk_start
| HEADER  (64 B)                                                  |
|    @ 0  magic         <- written LAST  = commit marker          |
|    @ 4  version = 2                                             |
|    @12  start_time                                              |
|    @28  sample_count   (THIS chunk only)                        |
|    @36  string_table_offset                                     |
|    @44  frame_table_offset                                      |
|    @56  chunk_size     <- locator: where THIS chunk ends        |
+-----------------------------------------------------------------+
| SAMPLE DATA      delta-encoded; stack/timestamp baselines       |
|                  reset at every chunk boundary                  |
+-----------------------------------------------------------------+
| STRING TABLE     local to this chunk                            |
+-----------------------------------------------------------------+
| FRAME TABLE      local to this chunk                            |
+-----------------------------------------------------------------+
| FOOTER  (32 B)   string_count | frame_count | chunk_end         |
+-----------------------------------------------------------------+  <- chunk_start + chunk_size
(+ one fresh zstd frame per chunk when compression is on)

(Disclaimer: Claude-assisted graphics, based on the current docs.)

Unfortunately, this likely implies bumping the version to v2 because of using the file_size:

https://github.com/python/cpython/blob/540b3d0a7fa7cd842f064f79b1410cbd6868bffa/Modules/_remote_debugging/binary_io_reader.c#L115

I couldn't figure out any other option than this, and taking over reserved@56. Do I miss some hack here? Unfortunately, it's not really usable for streaming (since we cannot seek back and forth in the stream), and we get chunk start and chunk end symmetry.

I'm still researching if this is a common trick for dual-use formats:

https://github.com/apache/arrow/blob/e2b44378edf0ccceda07e8a6823890bd4ecc1952/docs/source/format/Columnar.rst#L1231-L1232

Maybe?

Would it make sense to start calculating the checksum, also? What about appending to files?

It's not visible in the CLI, and I'm using the following script to test it as a sanity check, except just tests we already have:

import collections
import tempfile
import types
import _remote_debugging


def sample(fn):
    fr = _remote_debugging.FrameInfo(
        (fn, _remote_debugging.LocationInfo((1, 1, 0, 0)), "f", None)
    )
    return [
        _remote_debugging.InterpreterInfo(
            (0, [_remote_debugging.ThreadInfo((1, 0, [fr]))])
        )
    ]


BATCHES = (("c0", 3), ("c1", 2), ("c2", 4))  # one chunk per batch

path = tempfile.mktemp(suffix=".bin")
w = _remote_debugging.BinaryWriter(path, 1000, 0, compression=0, chunked=True)
for name, count in BATCHES:
    for _ in range(count):
        w.write_sample(sample(name), 1000)
    w.flush()
w.finalize()

seen = collections.Counter()


def collect(stack_frames, timestamps_us=None):
    weight = len(timestamps_us) if timestamps_us else 1
    for interp in stack_frames:
        for thread in interp[1]:
            for frame in thread[2]:
                seen[frame[0]] += weight


with _remote_debugging.BinaryReader(path) as r:
    replayed = r.replay(types.SimpleNamespace(collect=collect))
    version = r.get_info()["version"]

expected = dict(BATCHES)
total = sum(expected.values())
assert version == 2, f"version {version} != 2 (not chunked)"
assert replayed == total, f"replayed {replayed} != {total}"
assert dict(seen) == expected, f"content {dict(seen)} != {expected}"
print(
    f"OK: {replayed} samples across {len(BATCHES)} chunks, content == {expected}"
)

(Disclaimer: Code-wise I tried to limit any changes to the existing code. I experimented with Codex 5.5 xhigh but it was very bad in this regard, making it effectively impossible to review, even while I prompted it explicitly exactly which functions to touch and how. I used it as a tool for internal code reviews, though.)

Some completely other random notes:

Oh, imagine -o - and converting to different formats on the fly...
To be honest, I think that the best pattern for something like --gecko should be recording binary, and then replaying.
There's something remotely related between dump command, wanted in the runtime control, and flush.

experiments

06167da

maurycy self-assigned this Jun 1, 2026

maurycy mentioned this pull request Jun 3, 2026

gh-150662: Stop unbounded memory growth in Tachyon --gecko collector python/cpython#150845

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tachyon binary format chunking#2

Tachyon binary format chunking#2
maurycy wants to merge 1 commit into
tachyon-runtime-socketfrom
tachyon-runtime-socket-chunking

maurycy commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maurycy commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maurycy commented Jun 1, 2026 •

edited

Loading