Standard cube format. #8
A few of the discussions in other issues have talked about a more efficient binary representation. If you continued with an xyz00010101010 style encoding, I think this would be unambiguous, and fast to read and write. This would also be quite space efficient, which would help with the other comments about the eventual ballooning size of the npy files. Not sure this is the best approach, but off the top of my head I wondered about
The size of the data bytes (padding required) would, I guess, be calculated as ceil(x*y*z / 8)? You could shave a byte by encoding x, y and z into the first 15 bits of [byte][byte] as 5 bits each, with each max dimension then being 32, which seems plenty since the biggest dimension size is n. You could forgo padding, but you'd end up having to extract all of these data from within bytes at strange offsets. In the end I don't know if that's a worthwhile approach when LZ compression would probably do?
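A minimal Python sketch of the bit-packed layout discussed above, assuming one bit per cell, three leading dimension bytes, and most-significant-bit-first ordering. The function names and the bit order are illustrative assumptions, not part of any agreed spec. The 15-bit variant stores each dimension minus one so that 5 bits cover 1 to 32.

```python
import math


def encode_cube(cells, x, y, z):
    """Pack a flat list of 0/1 cells (length x*y*z) as [x][y][z][data bytes]."""
    if len(cells) != x * y * z:
        raise ValueError("cell count does not match dimensions")
    n_bytes = math.ceil(x * y * z / 8)  # one bit per cell, padded to a whole byte
    data = bytearray(n_bytes)
    for i, cell in enumerate(cells):
        if cell:
            data[i // 8] |= 0x80 >> (i % 8)  # MSB first (an assumption)
    return bytes([x, y, z]) + bytes(data)


def decode_cube(blob):
    """Inverse of encode_cube: returns (cells, (x, y, z))."""
    x, y, z = blob[0], blob[1], blob[2]
    n = x * y * z
    data = blob[3:3 + math.ceil(n / 8)]
    cells = [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(n)]
    return cells, (x, y, z)


def pack_dims_15bit(x, y, z):
    """Pack three dimensions (1..32 each, stored as value - 1) into two bytes."""
    for d in (x, y, z):
        if not 1 <= d <= 32:
            raise ValueError("dimension out of range for 5 bits")
    value = ((x - 1) << 10) | ((y - 1) << 5) | (z - 1)
    return value.to_bytes(2, "big")


def unpack_dims_15bit(two_bytes):
    """Inverse of pack_dims_15bit."""
    value = int.from_bytes(two_bytes, "big")
    return ((value >> 10) & 31) + 1, ((value >> 5) & 31) + 1, (value & 31) + 1
```

With this layout a 2x2x1 cube costs 4 bytes with byte-sized dimensions, or 3 bytes with the 15-bit dimension header.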
I think that is pretty much optimal for the raw data structure. However, it might be useful to store some metadata on which particular version of the cubes is stored (e.g. their orientation), and how many are stored in this file, so that programs don't have to do rotations when importing because they aren't sure whether they use the same rotations as the creating program. I'm proposing a file format, and have written some Python code to convert to and from the existing file format.
[4 bytes] magic:
[byte] orientation:
more orientations can be added if better methods for finding identical rotations are found.
[byte] compression:
[bytes...] cube_count:
body:
As a note, converting the n=12 file to pcube format reduces the size from 2 GB to 180 MB.
I would like to propose a change so we can add writing blocks of cubes of the same size. This would entail:
This would mean reducing the amount of data stored per cube by 3 bytes, and perhaps increasing the file density significantly. This would also allow for far more efficient in-file deduplication because you can plan for the size you need to read much more efficiently. Edit: uh, right. We have 7 bits left over in the byte that stores the orientation... I propose we add the bitflag to that :P
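A rough sketch of how a flag could share the orientation byte, assuming the orientation occupies only the lowest bit (which is what "7 bits left over" suggests). The mask values and names here are hypothetical and not taken from the proposed spec.

```python
ORIENTATION_MASK = 0x01    # assumed: orientation lives in the lowest bit
UNIFORM_SHAPE_FLAG = 0x02  # hypothetical flag: cubes are stored in same-size blocks


def pack_flags(orientation, uniform_shape):
    """Combine the orientation bit and the block flag into one byte value."""
    if orientation not in (0, 1):
        raise ValueError("orientation must fit in one bit")
    return orientation | (UNIFORM_SHAPE_FLAG if uniform_shape else 0)


def unpack_flags(byte):
    """Split a flag byte back into (orientation, uniform_shape)."""
    return byte & ORIENTATION_MASK, bool(byte & UNIFORM_SHAPE_FLAG)
```

A reader that only knows the orientation bit can still mask it out, which keeps the change small for existing tools.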
As a note of why adding the "storing by blocks" may be worth it: performing that optimization on the N = 12 cubes file decreases the size from 175 MiB to 121 MiB without compression, and from 111 MiB to 80 MiB with compression. Constructing the file takes some time, but I think the size-reduction factor and the fact that reading the file in parallel becomes a lot easier is quite worth it.
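To illustrate the "storing by blocks" idea, here is a small sketch that groups cubes by shape and writes the three dimension bytes once per block instead of once per cube. The fixed 4-byte little-endian count and the function names are assumptions made for the example; the actual proposal may encode the count differently.

```python
import math
import struct
from collections import defaultdict


def write_blocks(cubes):
    """cubes: iterable of ((x, y, z), data_bytes). Returns the block-encoded body."""
    by_shape = defaultdict(list)
    for (x, y, z), data in cubes:
        if len(data) != math.ceil(x * y * z / 8):
            raise ValueError("data length does not match shape")
        by_shape[(x, y, z)].append(data)
    out = bytearray()
    for (x, y, z), datas in sorted(by_shape.items()):
        out += bytes([x, y, z])               # shape written once per block
        out += struct.pack("<I", len(datas))  # hypothetical 4-byte cube count
        for data in datas:
            out += data
    return bytes(out)


def read_blocks(blob):
    """Inverse of write_blocks: returns a list of ((x, y, z), data_bytes)."""
    pos, result = 0, []
    while pos < len(blob):
        x, y, z = blob[pos:pos + 3]
        (count,) = struct.unpack_from("<I", blob, pos + 3)
        pos += 7
        size = math.ceil(x * y * z / 8)  # every cube in a block has the same size
        for _ in range(count):
            result.append(((x, y, z), blob[pos:pos + size]))
            pos += size
    return result
```

Because every cube in a block has the same byte length, a reader can jump to the k-th cube of a block directly, which is what makes parallel reads and in-file deduplication easier to plan.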
Agreed, I will open a v2 spec on a PR for the converter, use a different magic to differentiate v2, and add 4 bytes of feature flags and support for in-file blocks.
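As a sketch of how a reader might tell the two versions apart, the snippet below dispatches on a 4-byte magic and reads 4 bytes of feature flags for v2 only. The magic values are placeholders invented for this example (the real ones are not given in this thread), and the little-endian flag layout is also an assumption.

```python
import struct

MAGIC_V1 = b"\x00EX1"  # placeholder, NOT the real v1 magic
MAGIC_V2 = b"\x00EX2"  # placeholder, NOT the real v2 magic


def write_header(version, feature_flags=0):
    """Return the leading bytes of a file for the given format version."""
    if version == 1:
        return MAGIC_V1
    if version == 2:
        return MAGIC_V2 + struct.pack("<I", feature_flags)
    raise ValueError("unsupported version")


def read_header(blob):
    """Return (version, feature_flags, offset of the first byte after the header)."""
    magic = blob[:4]
    if magic == MAGIC_V1:
        return 1, 0, 4  # v1 has no feature flags
    if magic == MAGIC_V2:
        (flags,) = struct.unpack_from("<I", blob, 4)
        return 2, flags, 8
    raise ValueError("unknown magic")
```

Keeping the v1 branch in `read_header` is one way a tool could stay backwards compatible without a separate code path for old files.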
Tools interacting with the files should preferably maintain backwards compatibility with the v1 format, but that's optional; it's not like this is a commonly re-run project.
Including a markdown file with the cube format at the top level of the project may also be a good idea :) That way you can't miss it, and it's easy to figure out what the current "real" spec is :D
I have three more information points that would be excellent to have in the header:
This way you can deduce if a cache file contains all unique polycubes of size Edit: this assumes that these properties are not already enforced by the format, of course. If they are, that should probably be in the spec. Adding |
One last request: it would be really nice to have other compression mechanisms.
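One way to leave room for more compression mechanisms is to let the compression byte select a codec from a table. The sketch below uses only Python standard-library codecs; the numeric IDs and function names are hypothetical and would need to be fixed by the spec.

```python
import bz2
import gzip
import lzma

# Hypothetical codec IDs: (compress, decompress) pairs keyed by the compression byte.
CODECS = {
    0: (lambda b: b, lambda b: b),        # no compression
    1: (gzip.compress, gzip.decompress),
    2: (lzma.compress, lzma.decompress),
    3: (bz2.compress, bz2.decompress),
}


def compress_body(body, codec_id):
    """Prefix the compressed body with the byte identifying the codec."""
    compress, _ = CODECS[codec_id]
    return bytes([codec_id]) + compress(body)


def decompress_body(blob):
    """Read the codec byte and decompress the rest accordingly."""
    _, decompress = CODECS[blob[0]]
    return decompress(blob[1:])
```

Adding a new mechanism then only means registering another ID, and unknown IDs fail loudly with a `KeyError` instead of producing garbage.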
Given we now have a second implementation language tentatively waiting as a pull request, it would be optimal to agree on a standard format for cubes at some point so that multiple implementations can share data.
Could any ideas for this format go here, please?