
Standard cube format. #8

Open
bertie2 opened this issue Jul 13, 2023 · 9 comments

bertie2 (Collaborator) commented Jul 13, 2023

Given that we now have a second implementation language tentatively waiting as a pull request, it would be ideal to agree on a standard format for cubes at some point so that multiple implementations can share data.
Could any ideas for this format go here, please?

mikepound (Owner) commented

A few of the discussions in other issues have talked about a more efficient binary representation. If you continued with an xyz00010101010-style encoding, I think this would be unambiguous and fast to read and write. It would also be quite space efficient, which would help with the other comments about the eventual ballooning size of the npy files. Not sure this is the best approach, but off the top of my head I wondered about:

x     y     z     data
[byte][byte][byte][bytes...]

The size of the data in bytes (padding included) would, I guess, be calculated as ceil(x·y·z / 8). You could shave a byte by encoding x, y and z into the first 15 bits of [byte][byte] at 5 bits each, with each max dimension then being 32, which seems plenty since the biggest dimension of an n-cube is n.

You could forgo padding, but you'd end up having to extract the data from within bytes at strange offsets; in the end I don't know if that's a worthwhile approach when LZ compression would probably do just as well.
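As a rough illustration (not part of any agreed format), here is a sketch of the padding arithmetic and the 15-bit dimension packing in Python, assuming each dimension is stored as its value minus one so that 5 bits cover 1..32; the function names are illustrative:

```python
import math

def padded_data_bytes(x, y, z):
    # x*y*z bits of voxel data, rounded up to whole bytes.
    return math.ceil(x * y * z / 8)

def pack_dims_15bit(x, y, z):
    # 5 bits per dimension, stored as dim-1, so each dimension spans 1..32.
    # The 16th bit of the two bytes is left spare.
    assert all(1 <= d <= 32 for d in (x, y, z))
    packed = ((x - 1) << 10) | ((y - 1) << 5) | (z - 1)
    return packed.to_bytes(2, "big")

def unpack_dims_15bit(raw):
    packed = int.from_bytes(raw, "big")
    return ((packed >> 10) & 31) + 1, ((packed >> 5) & 31) + 1, (packed & 31) + 1
```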

bertie2 (Collaborator, Author) commented Jul 13, 2023

I think that is pretty much optimal for the raw data structure.

However, it might be useful to store some metadata about which particular form of the cubes is stored (e.g. their orientation) and how many are in the file, so that importing programs don't have to redo rotations because they aren't sure whether the creating program used the same canonical rotations.

I'm proposing a file format, and have written some Python code to convert to and from the existing file format.
You can find the converter code at https://github.com/bertie2/opencubes/tree/file_format
extension: .pcube
data: binary
header:

[4 bytes] [byte]      [byte]      [bytes...] [bytes...]
magic     orientation compression cube_count body

[4 bytes] magic:
uniquely identifies this format; takes the raw hex value

0xCBECCBEC

[byte] orientation:
allows for different future canonical orientations. Current options are:

0 : UNSORTED : the cubes are in arbitrary rotations; rotate them yourself to get whatever your canonical rotations are.
1 : BITWISE_HIGHEST_VALUE : the rotation of the cube which has the highest value when compared byte by byte in the cube format (see the body format for details).

More orientations can be added if better methods for finding identical rotations are found.

[byte] compression:
indicates whether the body data is compressed:

0: uncompressed data stream
1: gzip compressed data stream

[bytes...] cube_count:
a LEB128-encoded, arbitrarily large unsigned integer representing the number of cubes in this file / stream.
https://en.wikipedia.org/wiki/LEB128
0 indicates that this is a variable-length file / stream that will simply end abruptly.
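For reference, a minimal sketch of unsigned LEB128 encoding and decoding in Python; the helper names are illustrative, not taken from the converter:

```python
def encode_uleb128(value):
    # Unsigned LEB128: 7 data bits per byte; the high bit marks "more bytes follow".
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(stream):
    # Reads one unsigned LEB128 integer from a binary file-like object.
    result = shift = 0
    while True:
        byte = stream.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result
        shift += 7
```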

body:

x     y     z     data
[byte][byte][byte][bytes...]
[byte] x:
an 8-bit unsigned integer representing the cube's x dimension
[byte] y:
an 8-bit unsigned integer representing the cube's y dimension
[byte] z:
an 8-bit unsigned integer representing the cube's z dimension
[bytes...] data:
the flattened bits of the polycube, where 1 is a filled space and 0 is empty, flattened in row-major (C-style) order, stored little-endian, and padded to the nearest byte with zeros.

As a note, converting the n=12 file to pcube format reduces the size from 2 GB to 180 MB.
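To make the layout concrete, here is a minimal reader sketch for the v1 format as specified above, reusing decode_uleb128 from the earlier sketch; this is an illustration of the spec, not the converter at the linked branch:

```python
import gzip
import io
import math

MAGIC = bytes.fromhex("CBECCBEC")

def read_pcube(path):
    # Parses the v1 header, then yields one (x, y, z, data) tuple per cube.
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            raise ValueError("not a .pcube file")
        orientation = f.read(1)[0]      # 0 = unsorted, 1 = bitwise-highest-value
        compression = f.read(1)[0]      # 0 = raw, 1 = gzip
        cube_count = decode_uleb128(f)  # 0 means "read until the stream ends"
        body = f.read()
    if compression == 1:
        body = gzip.decompress(body)
    stream = io.BytesIO(body)
    seen = 0
    while cube_count == 0 or seen < cube_count:
        dims = stream.read(3)
        if len(dims) < 3:
            break  # end of a variable-length stream
        x, y, z = dims
        data = stream.read(math.ceil(x * y * z / 8))
        yield x, y, z, data
        seen += 1
```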

datdenkikniet (Contributor) commented Jul 20, 2023

I would like to propose a change so that we can write blocks of cubes of the same size. This would entail:

  1. A bit flag in the header indicating this format (we don't have any reserved bits at the moment that we could use for this, so we need a flag byte).
  2. For every block: an x, y, z size header, an LEB128 number indicating the number of cubes of that size that follow, and then that data in the already-defined row-major C-style order (see the sketch below).

This would reduce the amount of data stored per cube by 3 bytes, and could increase the file density significantly.

It would also allow far more efficient in-file deduplication, because you can plan for the size you need to read much more efficiently.

Edit: uh, right. We have 7 bits left over in the byte that stores the orientation... I propose we add the bit flag there :P
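A sketch of the proposed block layout, assuming a hypothetical write path that groups cubes by shape and reuses encode_uleb128 from the LEB128 sketch above:

```python
from collections import defaultdict

def write_blocks(stream, cubes):
    # cubes: iterable of (x, y, z, data) records as defined in the body format.
    # Proposed layout per block: [x][y][z][LEB128 count][count * cube data],
    # paying the 3 dimension bytes once per shape instead of once per cube.
    by_shape = defaultdict(list)
    for x, y, z, data in cubes:
        by_shape[(x, y, z)].append(data)
    for (x, y, z), records in sorted(by_shape.items()):
        stream.write(bytes((x, y, z)))
        stream.write(encode_uleb128(len(records)))
        for data in records:
            stream.write(data)
```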

datdenkikniet (Contributor) commented Jul 21, 2023

As a note on why adding "storing by blocks" may be worth it: performing that optimization on the n = 12 cube file decreases the size from 175 MiB to 121 MiB without compression, and from 111 MiB to 80 MiB with compression. Constructing the file takes some time, but I think the size reduction and the fact that reading the file in parallel becomes a lot easier make it well worth it.

bertie2 (Collaborator, Author) commented Jul 22, 2023

Agreed. I will open a v2 spec as a PR for the converter, use a different magic to differentiate v2, and add 4 bytes of feature flags plus support for in-file blocks.

bertie2 (Collaborator, Author) commented Jul 22, 2023

Tools interacting with the files should preferably maintain backwards compatibility with the v1 format, but that's optional; it's not as if this is a commonly re-run project.

datdenkikniet (Contributor) commented Jul 22, 2023

Including a markdown file with the cube format at the top level of the project may also be a good idea :) That way you can't miss it, and it's easy to figure out what the current "real" spec is :D

datdenkikniet (Contributor) commented Jul 23, 2023

I have three more pieces of information that would be excellent to have in the header:

  1. uniqueness => true if all cubes in the file are unique when transformed into some canonical form (no guarantee about the specific canonical form, unless the rotation bit is also set).
  2. n => the size that all polycubes in this file have, or 0 if that is unknown or varies.
  3. (Optional, probably too specific) "at least 1" (n != 0) => the file contains at least one copy of every unique cube of size n, but may contain multiple copies of individual cubes (possibly in different rotations, if orientation == 0).

This way you can deduce whether a cache file contains all unique polycubes of size n, and the file can tell you outright if uniqueness == true && "at least 1" == true (see the sketch below).

Edit: this assumes that these properties are not already enforced by the format, of course. If they are, that should probably be in the spec. Adding n would still be nice; that way you can avoid being forced to look at the first cube to figure out the number.
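A sketch of how these properties could sit in the proposed feature-flag bytes; the bit assignments here are hypothetical and would need to be fixed in the v2 spec:

```python
from enum import IntFlag

class PCubeFlags(IntFlag):
    # Hypothetical bit assignments -- the real ones belong in the v2 spec.
    BLOCKS = 1 << 0        # cubes are grouped into same-shape blocks
    UNIQUE = 1 << 1        # every cube appears at most once (in canonical form)
    AT_LEAST_ONE = 1 << 2  # every unique cube of size n appears at least once

def holds_exactly_all_cubes(flags):
    # A cache file provably contains exactly the unique cubes of size n
    # when both properties are set.
    return (PCubeFlags.UNIQUE in flags) and (PCubeFlags.AT_LEAST_ONE in flags)
```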

datdenkikniet (Contributor) commented
One last request: it would be really nice to have other compression mechanisms. gzip is quite slow, especially if you want to compress large sizes (the file I've generated for N=16 is around 900 GB, uncompressed). Other algorithms like brotli and/or zstd would be really nice to have.
