C++ implementation #7
Conversation
holy hell this is fast, both in how quick you've made it, and in runtime, very nice. |
I'm not sure what the process should be. There are things I would like to work on:
I'm open for feedback. I just selected Draft because the code isn't commented or structured well. Just what I coded down. |
As memory seems to be a main issue, it would be optimal to only hold the cubes in memory that are currently needed. The way I see it, you don't need a cube after expanding it. |
happy to merge this after some cleanup and commenting, would you rather I go through and point stuff out line by line in the review, or just make the changes and pull-request my changes into your repo. this will not be replacing the python implementation but will be parallel to it. |
@NobodyForNothing I'm really not sure where the line is for the memory footprint. At some point the problem will get big. We can try to be efficient, but for "removing cubes after expanding" to be efficient we have to consider other data structures, so that removing single items actually decreases memory usage (e.g. a deque?). @bertie2 I think pull requests to my repo are more efficient than line-by-line discussion. |
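A minimal sketch of the "drop cubes after expanding" idea under discussion, with a placeholder `Cube` key and a made-up `expand()` function (not the repo's code): seeds are consumed from the front of a deque as they are expanded, so only the frontier and the deduplication set stay in memory.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>
#include <vector>

// Stand-ins for the real types: a polycube is represented here by a
// canonical 64-bit key just to keep the sketch short.
using Cube = std::uint64_t;

// Hypothetical expansion step: returns the canonicalised children of c.
std::vector<Cube> expand(Cube c) { return {c + 1, c + 2}; }  // placeholder

std::unordered_set<Cube> next_level(std::deque<Cube> seeds) {
    std::unordered_set<Cube> found;
    while (!seeds.empty()) {
        Cube c = seeds.front();
        seeds.pop_front();            // drop the seed as soon as it is expanded
        for (Cube child : expand(c))
            found.insert(child);      // the set deduplicates the new level
    }
    return found;                     // memory holds only frontier + results
}
```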
What's the current memory footprint e.g. at n=13? I have some servers available ;) Some of the discussions elsewhere are thinking about how one could shard work into different processes, or distribute across machines. To pull that off we'd probably need to find a way of identifying the minimum set of all cubes from a set of sets of cubes produced on different machines. There's almost certainly a known algorithm for this from big data - maybe future work ;) |
hard to say :) I only have 16G so I can't test everything:
Memory seems to grow linearly with progress... |
For what it's worth I've run it single-threaded on my MacBook Pro with 32 GB memory, and N = 13 took ~12 hours (commit cf6e413).
It's currently calculating N = 14 and consumes about 30GB of memory, though it's rapidly increasing, so I suspect it'll hit a wall here soon. |
So I updated the code to use synchronized access to a single set instead of multiple sets when more threads are used. The memory footprint should now be more or less the same and not depend on the number of threads. |
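The PR's actual synchronisation code isn't shown in this thread; as a generic illustration of the approach (with a placeholder `Cube` key type), a single set guarded by a mutex could look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>

using Cube = std::uint64_t;            // placeholder canonical key

// One set shared by all worker threads, guarded by a mutex.
class SharedCubeSet {
public:
    // Returns true if the cube was newly inserted (i.e. not seen before).
    bool insert(Cube c) {
        std::lock_guard<std::mutex> lock(mutex_);
        return set_.insert(c).second;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return set_.size();
    }
private:
    mutable std::mutex mutex_;
    std::unordered_set<Cube> set_;
};
```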
Open pull request from me at nsch0e#1. @mikepound I'm running the numbers up to n=15 myself now with 32GB of RAM and 256GB of swap; n=13 is taking less than 10 minutes. |
Thanks to #5 (comment) for finding it, this paper really gives a lot of useful info: http://kevingong.com/Polyominoes/ParallelPoly.html. For what it's worth, here are some of my conclusions and thoughts from it:

Do we really want to store all cubes of the record-breaking n, or is just calculating the number enough? If so, we could try to get clever and partition the polycubes into non-intersecting sets, to save on RAM. One such partitioning metric could be shape (rotated to some canonical form, e.g. increasing numbers, ensuring it's not mirrored), but we don't necessarily have to use shape, and we could use several metrics. However, if we can't predict in advance which metrics the cubes derived at n=16 from a given cube at n=14 will fall under, then using several metrics isn't useful; we can just use one: the modulo of some hash (any algorithm different from the one used for putting the elements into the set - see the sketch after this comment). We can change the modulation number depending on how much RAM we have, to control the number of such groups and thus the size of each set. For the example below, let's say the modulation number is k = 1000.

Also, we can pre-compute the cache only up to a certain level n (the maximum our storage capacity allows), and then use that cache as the starting point of every run, without caching the subsequent layers, so their information is lost after the algorithm finishes. Why not cache the next layers? Because their size in MB would be ridiculous (more than we can hold), while the counts of polycubes are calculated in parallel on multiple machines, so that giant data set never actually has to be stored in one place at the same time. We can compute the sum of unique polycubes without storing them.

Example: we want to calculate the number for n=17. The workflow can be as follows:
If the time is too long:
Maybe some further improvements can be made, with ideas both from the paper and from people here. |
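A minimal sketch of the hash-modulo grouping described above, with a placeholder `Cube` key and made-up mixing constants; the only requirement carried over from the comment is that this sharding hash differs from the hash used inside the dedup set:

```cpp
#include <cstddef>
#include <cstdint>

using Cube = std::uint64_t;   // placeholder canonical key, not the repo's type

// A secondary hash (a simple multiply/xor mix here) -- it must be different
// from the hash used inside the dedup set itself, as noted above.
std::uint64_t shard_hash(Cube c) {
    c ^= c >> 33;
    c *= 0xff51afd7ed558ccdULL;
    c ^= c >> 33;
    return c;
}

// k is the "modulation number" from the comment (e.g. k = 1000).
// Every cube lands in exactly one of k disjoint groups, and each group can
// be deduplicated and counted independently, then the counts summed.
std::size_t shard_of(Cube c, std::size_t k) {
    return static_cast<std::size_t>(shard_hash(c) % k);
}
```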
remove "using namespace std;" code smell. All changes are simply pasting std:: in front of all that didn't compile. One exception is with DBG defined in cubes.cpp: -rotatedCube.print(); +lowestHashCube.print(); to fix build regression.
with a cleaner solution: use standalone hasher functors and add typedefs for the unordered_set<> as *CubeSet* and *XYZSet*.
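A sketch of what that pattern could look like, with placeholder `XYZ`/`Cube` types and made-up hash functions (the repo's actual definitions are not reproduced here):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative stand-ins only -- not the repo's actual XYZ/Cube definitions.
struct XYZ { std::int8_t x, y, z; };
inline bool operator==(XYZ a, XYZ b) { return a.x == b.x && a.y == b.y && a.z == b.z; }

struct Cube { std::vector<XYZ> sparse; };            // sorted voxel list
inline bool operator==(const Cube& a, const Cube& b) { return a.sparse == b.sparse; }

// Standalone hasher functors (no std::hash specialisation needed).
struct XYZHash {
    std::size_t operator()(XYZ p) const {
        return (std::size_t(std::uint8_t(p.x)) << 16) |
               (std::size_t(std::uint8_t(p.y)) << 8) |
                std::size_t(std::uint8_t(p.z));
    }
};
struct CubeHash {
    std::size_t operator()(const Cube& c) const {
        std::size_t h = 0;
        for (XYZ p : c.sparse) h = h * 31 + XYZHash{}(p);
        return h;
    }
};

// The typedefs the commit message refers to.
typedef std::unordered_set<Cube, CubeHash> CubeSet;
typedef std::unordered_set<XYZ, XYZHash>   XYZSet;
```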
I would like to make Cube::sparse private, but there are some single-use cases for it.
When the source cube is not needed, just std::move() the object into another home. A few templates were added to get perfect forwarding going for Cube. Finally, convert from push_back()/insert() to the C++11 emplace variants.
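As a small illustration of those two changes (the `Cube`/`XYZ` types below are placeholders, not the repo's definitions): move the cube when the source is no longer needed, and construct elements in place with the emplace variants.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Placeholder types for illustration; not the repo's actual Cube/XYZ.
struct XYZ {
    XYZ(std::int8_t x_, std::int8_t y_, std::int8_t z_) : x(x_), y(y_), z(z_) {}
    std::int8_t x, y, z;
};
struct Cube { std::vector<XYZ> sparse; };

int main() {
    Cube c;
    c.sparse.emplace_back(0, 0, 0);      // construct the XYZ in place
    c.sparse.emplace_back(1, 0, 0);      // instead of push_back(XYZ(...))

    std::vector<Cube> level;
    // The source cube is not needed afterwards, so move it instead of
    // copying its whole voxel vector.
    level.push_back(std::move(c));
    return 0;
}
```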
C++ code base cleanup
…s, fix README for tests
Reformatting For Readability.
@nsch0e can I get your thumbs up to merge this into main as it stands? I'm happy with it as it is, but just want to check you don't have changes in flight. |
@bertie2 I just did my last 3 small changes and am also happy with this state. Further improvements will have to wait. I have ideas but I think we should provide this as a starting point for others. |
This is looking great! For info I have access to a 50-core server with 1/2 TB of RAM - so if we need some testing doing.. ;) |
FYI I got a nice tweet from @tjol on his implementation here: https://github.com/tjol/polycubes I'll point him in this direction. |
@mikepound If growth continues as we expect, then with 2TB of RAM n=16 should be runnable with both this and tjol's implementation as they stand right now (just make sure you also have 2TB of disk space to write out the result to); however, n=17 won't be. The next step to get any higher will be to implement direct disk streaming, which I am working on in my own branch, but I want to get this and the Rust implementation merged first as they stand, so that others have something compatible to work from. (I estimate you need 4-6 TB of space for n=17, so it pretty much has to be disk streaming.) |
Yes no rush! I also note that the enumeration has actually now been computed up to n=18, from wikipedia. Clearly my video prompted someone to update the page! Enumeration and generation are two very different things though. |
I wouldn't mind finding out how much RAM my version actually uses for N>14 and whether it's limited by disk speed (writing manageable chunks of the result to disk one by one is all well and good, but my method of deduplication – looping through the entire file once for every new block of data – is actually quite slow, who knew). |
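For illustration only, a hedged sketch of that kind of block-wise deduplication (assuming a fixed-width binary key file; not the actual implementation): each new block requires one full pass over the existing file, which is where the slowness comes from.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_set>

// Rough sketch of the dedup scheme described above: before appending a new
// block of canonical cube keys, stream the whole existing result file once
// and drop every key that is already present.
// The fixed-width 64-bit binary key format is an assumption.
using Key = std::uint64_t;

void dedup_against_file(const std::string& path, std::unordered_set<Key>& block) {
    std::ifstream in(path, std::ios::binary);
    Key k;
    while (in.read(reinterpret_cast<char*>(&k), sizeof k))
        block.erase(k);                 // already on disk, so not new
    // Whatever remains in `block` can now be appended to the file;
    // a full file pass per block is what makes this slow.
}
```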
I have queued some pull requests on @nsch0e's cpp branch: I would like to know whether they are still candidates to be pulled. The optimizations had the following impact: |
@JATothrim If you raise your changes as pull requests into this repo I will take a look. I think I would merge 10 on the spot, but would want feedback from @nsch0e before merging 11. |
I like both. :) Also, please feel free to improve the C++ implementation in this repo as you see fit. I don't think I can/want to approve everything, as there are far more knowledgeable people than me when it comes to C++. ;) |
@mikepound Not sure where to put this, so replying here: the current latest C++ implementation with the split cache files should be able to produce all polycubes up to n=16 with 2TB of RAM in about 2 days, as long as it also has ~3TB of storage available for the results. Table of estimates here:
The data for n=15+ is estimated; the data for n=10 to n=14 is measured (x indicates I couldn't properly measure the memory usage). Whether this is actually worth running is up to you. Above n=16 it is pretty much required that we move to an enumeration technique rather than trying to actually store the cubes. Currently only the Rust version can do storage-less enumeration, and https://github.com/datdenkikniet has run it up to n=16.
Again, we need to decide whether it's worth starting now or whether we should wait for code improvements; for now the breakthroughs seem to have distinctly slowed down. |
This is already an amazing improvement! :) I'd seriously consider running n=16, if only so we can say we did it! 3TB is a lot, but also not totally out of the question. The RAM is currently an issue: we have 1/2 TB on our largest server, so we may need to wait until we get a few more optimisations, though a 75% memory saving may not count as "a few optimisations". I can certainly enumerate n=17, but if we are confident the code is working, all we can do there is verify the correctness of the paper. I'd argue that enumeration and generation are two separate and entirely worthwhile goals, i.e. the fact that cubes have been enumerated up to 19 doesn't mean we shouldn't generate a file of cubes up to n=16 (or 17 if we get really lucky with optimisations). Then we stick it on a web app that randomly serves you one of the billions of shapes each time you visit - next video! |
The split cache files should be somewhat compressible with just zlib, which would reduce the disk space needed. I'm working on a "cube swapper" system that moves the cube XYZ-list representation data out of RAM and instead stores it on disk via a memory-mapped file. The data is read in only when needed. I noticed @nsch0e has done some experiments compressing the cube representation and I don't know about his plans for it. |
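The "cube swapper" itself isn't in this thread; as a rough POSIX-only illustration of the memory-mapped-file idea (file layout unspecified, error handling minimal, names made up), something like this would let the OS page cube data in from disk only when it is touched:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Rough POSIX sketch: map an existing cube-data file read-only so the OS
// pages voxel data in from disk only when it is actually accessed.
class MappedCubeFile {
public:
    explicit MappedCubeFile(const char* path) {
        fd_ = ::open(path, O_RDONLY);
        struct stat st {};
        if (fd_ >= 0 && ::fstat(fd_, &st) == 0) {
            size_ = static_cast<std::size_t>(st.st_size);
            data_ = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
        }
    }
    ~MappedCubeFile() {
        if (data_ && data_ != MAP_FAILED) ::munmap(data_, size_);
        if (fd_ >= 0) ::close(fd_);
    }
    const unsigned char* data() const {
        return static_cast<const unsigned char*>(data_);
    }
    std::size_t size() const { return size_; }
private:
    int fd_ = -1;
    std::size_t size_ = 0;
    void* data_ = nullptr;
};
```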
Unfortunately life is getting in the way... :) For what it's worth, I have pushed my WIP on compression to this branch. The idea is to encode pcubes as a string of direction changes. Since pcubes can have a tree-like structure, jumps also have to be encoded. In this branch the encoding is used in the hashset to save memory. Compression on disk could reduce size to roughly 20% (5x). Each voxel is represented by a nibble, plus jumps (mostly also a nibble, sometimes two for big jumps) when needed. I don't think I will have much progress in the next month, so don't count on me... |
So I tried to describe the compression:

PCube Compression

This compression uses the fact that polycubes have to be connected, so most of the time we only have to specify a direction for the next cube in the polycube. Only when a string of cubes ends do we use a jump instruction to begin the next branch of cubes.

Encoding uses two kinds of nibbles: dir and jmp.

When a polycube is encoded by an odd number of nibbles, a jump instruction |
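The exact nibble layout isn't shown in this thread, so the following is only a rough sketch of the general scheme: direction nibbles for each new voxel, a jump marker when a branch ends, two nibbles packed per byte. All concrete code values are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative nibble codes only -- the actual values and jump format in
// nsch0e's branch are not reproduced here.
enum Nibble : std::uint8_t {
    PX = 0, NX = 1, PY = 2, NY = 3, PZ = 4, NZ = 5,  // direction changes
    JMP = 6,   // followed by a nibble saying how far back to jump
};

// Pack a sequence of nibbles into bytes, two nibbles per byte.
std::vector<std::uint8_t> pack(const std::vector<std::uint8_t>& nibbles) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < nibbles.size(); i += 2) {
        std::uint8_t hi = nibbles[i];
        std::uint8_t lo = (i + 1 < nibbles.size()) ? nibbles[i + 1] : 0;
        out.push_back(static_cast<std::uint8_t>(hi << 4 | lo));
    }
    return out;   // roughly half a byte per voxel, as estimated above
}
```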
@nsch0e Thank you for describing the compression scheme. :)
This implementation is completely separate from the python implementation. It uses a simpler representation than RLE: just a list of the indices of all ones (like .nonzero() in numpy).
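As a hedged illustration of that representation (the grid layout, names and types here are assumptions, not the repo's actual code), converting a dense boolean grid to a coordinate list might look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct XYZ { std::uint8_t x, y, z; };

// Minimal sketch: instead of storing the full dense grid, keep only the
// coordinates of the set voxels (the equivalent of numpy's .nonzero()).
// The x-major grid layout is purely for illustration.
std::vector<XYZ> to_sparse(const std::vector<bool>& grid,
                           std::uint8_t dx, std::uint8_t dy, std::uint8_t dz) {
    std::vector<XYZ> sparse;
    for (std::uint8_t x = 0; x < dx; ++x)
        for (std::uint8_t y = 0; y < dy; ++y)
            for (std::uint8_t z = 0; z < dz; ++z)
                if (grid[(std::size_t(x) * dy + y) * dz + z])
                    sparse.push_back(XYZ{x, y, z});
    return sparse;
}
```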