Tachyon binary format chunking#2
Open
maurycy wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
@pablogsal
The reason why I found python#150662 is as follows is that I'm wondering whether the approach I described in python#145464 is really the best one, and that likely Lalit is right.
I'd like you to invite to discuss this approach, since I'm still exploring. I don't think a code review makes sense yet. I merely aimed to check whether I can add chunks to the binary format, what kind of roadblocks I would hit. More like whether you agree with the direction.
My thinking goes that streaming is effectively the output target and we should really not pin ourselves with any format (eg: JSONL). Some formats are streamable. Specifically, all the
--stream-typesflags I described are JSONL-specific, which is not great.The basic problem with starting with the binary format is that the dictonaries are saved at the end, but, apart from this, it's extremely efficient, it's great, and every file is effectively self-contained, so it'd be a waste to not use it for streaming.
A yet another problem with binary format is that it's not crash-proof and it's really NOT suitable for a really long-running headless sessions for precisely the same reason. Let's say that a large corporation wants to run Tachyon over a week or two, and, for some reason, it OOMs (sounds like a perfectly reasonable scenario). The output file will be corrupt.
I think the solution to both problems (ie: collector-agnostic streaming and making very long binary streaming) is chunking and/or periodical flushing:
% ./python.exe -m profiling.sampling run \ --binary \ --flush interval:180s \ -- python zażółćgęsląjaźn.py % ./python.exe -m profiling.sampling run \ --binary \ # I believe that bytes: could be supported but I think it requires a number of complex decisions --flush samples:5000 \ -- python zażółćgęsląjaźn.pyWhat follows:
% ./python.exe -m profiling.sampling run \ --binary \ --flush interval:180s \ --output just-a-file-by-default.txt -- python zażółćgęsląjaźn.py % ./python.exe -m profiling.sampling run \ --jsonl \ --flush interval:180s \ --output file:just-a-file-by-default.txt -- python zażółćgęsląjaźn.py % ./python.exe -m profiling.sampling run \ --jsonl \ --flush interval:180s \ --output ws://213.25.252.50:9000 -- python zażółćgęsląjaźn.py % ./python.exe -m profiling.sampling run \ --binary \ --flush interval:180s \ # or tcp-connect: and tcp-listen: # what about auth? --output tcp:213.25.252.50:9000 -- python zażółćgęsląjaźn.pyVarious
--outputwould go under the streaming PR I'm not a huge fan of--wsand friends, but this gives us optionality.From the file perspective, it could be just:
Precisely:
(Disclaimer: Claude-assisted graphics, based on the current docs.)
Unfortunately, this likely implies bumping the version to v2 because of using the
file_size:https://github.com/python/cpython/blob/540b3d0a7fa7cd842f064f79b1410cbd6868bffa/Modules/_remote_debugging/binary_io_reader.c#L115
I couldn't figure out any other option than this, and taking over
reserved@56. Do I miss some hack here? Unfortunately, it's not really usable for streaming (since we cannotseekback and forth in the stream), and we get chunk start and chunk end symmetry.I'm still researching if this is a common trick for dual-use formats:
https://github.com/apache/arrow/blob/e2b44378edf0ccceda07e8a6823890bd4ecc1952/docs/source/format/Columnar.rst#L1231-L1232
Maybe?
Would it make sense to start calculating the checksum, also? What about appending to files?
It's not visible in the CLI, and I'm using the following script to test it as a sanity check, except just tests we already have:
(Disclaimer: Code-wise I tried to limit any changes to the existing code. I experimented with Codex 5.5 xhigh but it was very bad in this regard, making it effectively impossible to review, even while I prompted it explicitly exactly which functions to touch and how. I used it as a tool for internal code reviews, though.)
Some completely other random notes:
-o -and converting to different formats on the fly...--geckoshould be recording binary, and then replaying.dumpcommand, wanted in the runtime control, and flush.