
C++ implementation #7

Merged
merged 42 commits into mikepound:main on Jul 18, 2023

Conversation

@nsch0e (Contributor) commented Jul 13, 2023

This implementation is completely separate from the Python implementation. It uses a simpler representation than RLE: just a list of the indices of all ones (like .nonzero() in numpy).
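
A rough sketch of that representation (my illustration, not the PR's actual types), where a polycube is just the sorted list of its occupied voxel coordinates:

#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct XYZ {
    std::int8_t x, y, z;
    bool operator<(const XYZ& o) const {
        return std::tie(x, y, z) < std::tie(o.x, o.y, o.z);
    }
};

struct Cube {
    std::vector<XYZ> sparse;  // one entry per occupied voxel

    // Sorting gives a canonical order so equal shapes compare equal.
    void normalize() { std::sort(sparse.begin(), sparse.end()); }
};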

@bertie2 (Collaborator) commented Jul 13, 2023

Holy hell this is fast, both in how quickly you've made it and in runtime, very nice.
Unfortunately it still seems to slam into a memory wall at n=13, but much less so than the Python version.
When would you be happy for full feedback and review on this? I don't want to spam you since this PR is still marked as draft.

@nsch0e (Contributor, Author) commented Jul 13, 2023

I'm not sure how the process should be. There are things I would like to work on:

  • still redundant structures (vector<> and ordered_set<>) which both hold all cubes
    • fixed for single-threaded
  • no caching implemented
  • multithreading is working and doesn't inflate the memory footprint much
  • there is an interesting optimization in "Optimizations from guy that did n=16" #5 which could be incorporated
    • I tried this, but there was no speedup

I'm open to feedback. I only selected Draft because the code isn't commented or structured well - it's just what I coded down.

@nsch0e marked this pull request as ready for review on July 13, 2023 at 18:37
@derdilla commented

As memory seems to be the main issue, it would be optimal to hold in memory only the cubes that are currently needed. The way I see it, you don't need a cube after expanding it.

@bertie2 (Collaborator) commented Jul 13, 2023

Happy to merge this after some cleanup and commenting. Would you rather I go through and point things out line by line in the review, or just make the changes and pull-request them into your repo?

This will not replace the Python implementation; it will sit alongside it.

@nsch0e (Contributor, Author) commented Jul 13, 2023

@NobodyForNothing I'm really not sure where the line is for the memory footprint. At some point the problem will get big. We can try to be efficient, but for "removing cubes after expanding" to pay off we have to consider other data structures, so that removing single items really does reduce the size (e.g. a deque?).
Perhaps a pool for all XYZ structs would be more efficient, because at iteration N every Cube holds exactly N XYZ structs.
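
A hypothetical sketch of that pooled-XYZ idea (names made up for illustration): since every Cube at iteration N holds exactly N coordinates, all coordinates can live in one flat array and a Cube only needs an index into the pool.

#include <cstddef>
#include <cstdint>
#include <vector>

struct XYZ { std::int8_t x, y, z; };

class CubePool {
public:
    explicit CubePool(std::size_t n) : n_(n) {}

    // Append one cube's N coordinates; returns the new cube's index in the pool.
    std::size_t add(const std::vector<XYZ>& coords) {
        storage_.insert(storage_.end(), coords.begin(), coords.end());
        return storage_.size() / n_ - 1;
    }

    const XYZ* coords_of(std::size_t cube_index) const {
        return storage_.data() + cube_index * n_;
    }

private:
    std::size_t n_;             // voxels per polycube at this iteration
    std::vector<XYZ> storage_;  // cube i occupies [i*n_, (i+1)*n_)
};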

@bertie2 I think pull requests to my repo are more efficient than a line-by-line discussion.

@mikepound (Owner) commented

What's the current memory footprint e.g. at n=13? I have some servers available ;)

Some of the discussions elsewhere are about how one could shard the work into different processes, or distribute it across machines. To pull that off we'd probably need a way of identifying the minimum set of all cubes from a set of sets of cubes produced on different machines. There's almost certainly a known algorithm for this from big data - maybe future work ;)

@nsch0e (Contributor, Author) commented Jul 14, 2023

> What's the current memory footprint e.g. at n=13? I have some servers available ;)

Hard to say :) I only have 16 GB so I can't test everything:

N    Single thread    4 threads        20 threads
12   2.5 GiB          -                10.7 GiB
13   9 GiB at 10%     10 GiB at 10%    -           (the base set alone is already 2.5 GiB from N=12)

Memory seems to grow linearly with progress...

@mpitof commented Jul 14, 2023

For what it's worth I've run it single-threaded on my MacBook Pro with 32 GB memory, and N = 13 took ~12 hours (commit cf6e413).

N = 13 || generating new cubes from 18598427 base cubes.
  took 43899.52 s
  num cubes: 138462649

It's currently calculating N = 14 and consuming ~30 GB of memory, though that's rapidly increasing, so I suspect it'll hit a wall soon.

N = 14 || generating new cubes from 138462649 base cubes.
   2%   928 it/s, remaining: 145890s

@nsch0e (Contributor, Author) commented Jul 14, 2023

So I updated the code to use synchronized access to a single set instead of multiple sets when more threads are used, so the memory footprint should be more or less the same and no longer depend on the number of threads.
Since all threads use the same resource (the set), they block each other while accessing it. This will be the limiting factor for speed when using many threads.
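
A minimal sketch of that locking pattern, assuming cubes are reduced to a 64-bit canonical key (the actual code stores full coordinate lists with its own hasher, but the pattern is the same):

#include <cstdint>
#include <mutex>
#include <unordered_set>

using CubeKey = std::uint64_t;

std::unordered_set<CubeKey> found;  // single set shared by all worker threads
std::mutex found_mutex;             // memory no longer scales with the thread count

bool add_if_new(CubeKey candidate) {
    std::lock_guard<std::mutex> lock(found_mutex);  // threads block each other here
    return found.insert(candidate).second;          // true if the cube was not seen before
}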

@bertie2 (Collaborator) commented Jul 14, 2023

Open pull request from me at nsch0e#1 -
it gets us pretty close to single-threaded performance, both in memory usage and speed, while running multi-threaded.

@mikepound
n=13 is easily doable in 32 GB of RAM, and we seem to have the same growth pattern as the file sizes, ~7x each time.
Interestingly, in my testing the program works REALLY WELL with a swap file (at least on my NVMe SSD and at n=13), so if you have a server with 32-64 GB of RAM and a fast NVMe SSD, you should be able to allocate a multi-hundred-GB swap file and push n very high for only a fairly minor slowdown.
Performance craters once swap usage gets >= RAM - still continuing at 800 base cubes/thread x 16 threads at n=14, but much slower.

Running the numbers up to n=15 myself now with 32 GB of RAM and 256 GB of swap; n=13 is taking less than 10 minutes.

@nsch0e

This comment was marked as outdated.

@VladimirFokow (Contributor) commented Jul 14, 2023

Thanks to #5 (comment) for finding it - this paper really gives a lot of useful info: http://kevingong.com/Polyominoes/ParallelPoly.html.

For what it's worth, here are some of my conclusions & thoughts from it:

Do we really want to store all cubes of the record-breaking n, or is just calculating the number enough? If the latter, we could try to get clever and partition the polycubes into non-intersecting sets, to save on RAM.

One such partitioning metric could be the shape (rotated to some canonical orientation, e.g. with increasing dimensions, and ensuring it's not mirrored) - but we don't necessarily have to use shape as the metric, and we could use several metrics. However, if we can't predict in advance which bucket the cubes derived at n=16 from a given cube at n=14 will fall into, using several metrics isn't helpful; we can just use one metric: the modulo remainder of some hashing algorithm (any algorithm different from the one used for putting the elements into the set) - see the sketch after the workflow below. We can change the modulation number depending on how much RAM we have, to control the number of such groups and thus the size of each set. For the example below, let's say the modulation number is k = 1000.

Also, we can pre-compute a cache only up to a certain level n (the maximum our storage capacity allows), and then use this cache as the starting point of every run, without caching the subsequent layers, so their information is lost after the algorithm finishes. Why not cache the later layers? Because their size would be ridiculous (more than we can hold), while the counts of polycubes are calculated in parallel on multiple machines - so this giant set never actually has to be stored in one place at the same time. We can compute the total number of unique polycubes without storing them.


Example:

We want to calculate the number for n=17.
The maximum cache that we have the capacity to store is for n=14. This is our starting cache.

So the workflow can be as follows:

distribute the starting cache into several machines, e.g. each of the 5 machines receives a random 20% portion.
for each possible modulo remainder (in our case k=1000, see above, so, for each number from 0 to 999):
    on each machine:
        current_set = empty_set()
        for each subset of cubes of the local portion of the starting cache (each machine should know its appropriate subset size, so as not to overwhelm the RAM in the next step; the subset size can even be 1):
            calculate all canonical representations of polycubes with n=17 for the cubes of the current subset (this will take several steps, since the starting point is only n=14)
            filter these polycubes: apply some hashing algorithm and calculate the modulo remainder, throw away everything that doesn't match the modulo remainder of the current loop
            add the polycubes into current_set
    
    // after all sets on all machines have been calculated:
    merge all sets from all machines together into a resulting set
    count the number of elements in this resulting set
    save this integer somewhere (can also save the current modulo remainder with it - for logging)

add up all these integers
  • since the set contains all the cubes that fall under the metric (the modulo remainder in this example), and nothing else, we can safely process such sets independently of each other

  • if one machine finishes early - don't wait for others to finish - export the set to some external place, and go on to the next modulo remainder
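
A sketch of the hash-modulo filtering step from the workflow above (all names are hypothetical; the partition hash just has to differ from the hash the set itself uses, e.g. FNV-1a over a canonical byte encoding of the polycube):

#include <cstdint>
#include <string>
#include <unordered_set>

std::uint64_t partition_hash(const std::string& canonical_bytes) {
    std::uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
    for (unsigned char c : canonical_bytes) {
        h ^= c;
        h *= 1099511628211ULL;                 // FNV-1a prime
    }
    return h;
}

// Keep a candidate only if it belongs to the modulo remainder currently being processed.
void keep_if_in_partition(const std::string& canonical_bytes,
                          std::uint64_t k,                 // modulation number, e.g. 1000
                          std::uint64_t current_remainder,
                          std::unordered_set<std::string>& current_set) {
    if (partition_hash(canonical_bytes) % k == current_remainder) {
        current_set.insert(canonical_bytes);
    }
}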


If the time is too long:

  • increase the starting cache, e.g. from n=14 to n=15 (but this needs more storage). This is the "depth vs. width balance" idea from the paper
  • or choose a larger subset size (needs more RAM, because you probably use RAM for the sets in the intermediate layers)
  • or use a smaller modulation number k, e.g. 800 instead of 1000 (each set will have more elements, so more RAM is needed)
  • or use more parallelism: not only can we process the same modulo remainder on different machines and merge all the sets at the end to count the elements, we can also process different modulo remainders on different machines.

Maybe some further improvements can be made, with ideas both from the paper and from people here.
Or maybe some of this is impractical or outdated.
At least, this is the edge of my understanding right now. Thank you for reading!

remove "using namespace std;" code smell.

All changes are simply pasting std:: in front of all that
didn't compile.

One exception is with DBG defined in cubes.cpp:
-rotatedCube.print();
+lowestHashCube.print();
to fix build regression.
with cleaner solution.

Just do the standalone hasher functors
and add typedef for the unordered_set<>
as *CubeSet* and *XYZSet*
I would like to make the Cube::sparse private,
but there are some single use cases for it.
When the source cube is not need just std::move()
the object into another home.

Few templates added to get perfect forwarding going for Cube.

Finally convert from push_back()/insert() to C++11 emplace variant.
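
A loose sketch of what those hasher functors and typedefs might look like (the actual functors and hash mixing in the repo will differ):

#include <cstddef>
#include <cstdint>
#include <unordered_set>

struct XYZ { std::int8_t x, y, z; };

inline bool operator==(const XYZ& a, const XYZ& b) {
    return a.x == b.x && a.y == b.y && a.z == b.z;
}

struct HashXYZ {
    std::size_t operator()(const XYZ& p) const noexcept {
        // Pack the three small coordinates into one value.
        return (std::size_t(std::uint8_t(p.x)) << 16) |
               (std::size_t(std::uint8_t(p.y)) << 8) |
                std::size_t(std::uint8_t(p.z));
    }
};

using XYZSet = std::unordered_set<XYZ, HashXYZ>;
// A CubeSet typedef would follow the same pattern with a Cube hasher.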
@bertie2 (Collaborator) commented Jul 17, 2023

@nsch0e can I get your thumbs up to merge this into main as it stands? I'm happy with it as it is, but just want to check you don't have changes in flight.

@nsch0e (Contributor, Author) commented Jul 17, 2023

@bertie2 I just did my last 3 small changes and am also happy with this state. Further improvements will have to wait. I have ideas but I think we should provide this as a starting point for others.

@mikepound (Owner) commented

This is looking great! For info, I have access to a 50-core server with 1/2 TB RAM - so if we need some testing doing... ;)

@mikepound (Owner) commented

FYI, I got a nice tweet from @tjol about his implementation here:

https://github.com/tjol/polycubes

I'll point him in this direction.

@bertie2 (Collaborator) commented Jul 18, 2023

@mikepound if growth continues as we expect, then with 2 TB of RAM n=16 should be runnable with both this and tjol's implementation as they stand right now (just make sure you also have 2 TB of disk space to write out the result to); however, n=17 won't be.

The next step to get any higher will be to implement direct disk streaming, which I am working on in my own branch, but I want to get this and the Rust implementation merged first as they stand, so that others have something compatible to work from.

(I estimate you need 4-6 TB of space for n=17, so it pretty much has to be disk streaming.)

@mikepound (Owner) commented

Yes, no rush! I also note that the enumeration has actually now been computed up to n=18, according to Wikipedia. Clearly my video prompted someone to update the page! Enumeration and generation are two very different things, though.

@tjol commented Jul 18, 2023

I wouldn't mind finding out how much RAM my version actually uses for N>14, and whether it's limited by disk speed (writing manageable chunks of the result to disk one by one is all well and good, but my method of deduplication - looping through the entire file once for every new block of data - is actually quite slow, who knew).

@bertie2 merged commit 25f6e5c into mikepound:main on Jul 18, 2023
@JATothrim (Contributor) commented

I have queued some pull requests on @nsch0e's cpp branch:

I would like to know whether they are still candidates to be pulled.

The optimizations had the following impact:
./build/cubes -n 12 -t 2 245.05s user 1.88s system 180% cpu 2:16.76 total
vs.
./build/cubes -n 12 -t 2 297.87s user 2.04s system 180% cpu 2:46.07 total

@bertie2 (Collaborator) commented Jul 18, 2023

@JATothrim if you raise your changes as pull requests into this repo I will take a look. I think I would merge 10 on the spot, but I'd want feedback from @nsch0e before merging 11.

@nsch0e (Contributor, Author) commented Jul 18, 2023

I like both. :)

Also, please feel free to improve the C++ implementation in this repo as you see fit. I don't think I can (or want to) approve everything, as there are far more knowledgeable people than me for C++. ;)

@bertie2 (Collaborator) commented Aug 2, 2023

> This is looking great! For info, I have access to a 50-core server with 1/2 TB RAM - so if we need some testing doing... ;)

@mikepound, not sure where to put this, so replying here: the current latest C++ implementation with the split cache files should be able to produce all polycubes up to n=16 with 2 TB of RAM in about 2 days, as long as it also has ~3 TB of storage available for the results. Table of estimates:

N    Peak memory usage (GB)   Total output size (GB)   Growth factor
10   x                        0.01
11   x                        0.08                     8
12   x                        0.638                    7.975
13   6                        5                        7.836
14   27                       40.7                     8.14
15   216                      325.6
16   1728                     2604.8
17   13824                    20838.4
18   110592                   166707.2
19   884736                   1333657.6

The data for n=15+ is estimated; the data for n=10 to n=14 is measured. An x indicates I couldn't properly measure the memory usage.

Whether this is actually worth running is up to you; above n=16 it is pretty much required that we move to an enumeration technique rather than trying to actually store the cubes.

Currently only the Rust version can do storage-less enumeration, and https://github.com/datdenkikniet has run it up to n=16.
Unfortunately, using a similar extrapolation, it would take about a year to beat n=18 currently, though it would possibly be faster on your hardware:

N    Time taken (min)   Growth factor   Estimated time (days)
13   1.5
14   13                 8.67
15   110                8.46
16   970                8.82
17   7760                               5.39
18   62080                              43.1
19   496640                             344.9
20   3973120                            2759.1

Again, we need to decide whether it's worth starting now or whether we should wait for code improvements; for now the breakthroughs seem to have distinctly slowed down.

@mikepound (Owner) commented

This is already an amazing improvement! :) I'd seriously consider running n=16, if only so we can say we did it! 3 TB is a lot, but also not totally out of the question. The RAM is currently an issue - we have 1/2 TB on our largest server - so we may need to wait until we get a few more optimisations, though a 75% memory saving may not count as "a few optimisations". I can certainly enumerate n=17, but if we are confident the code is working, all we can do there is verify the correctness of the paper.

I'd argue that enumeration and generation are two separate and entirely worthwhile goals, i.e. the fact that cubes have been enumerated up to 19 doesn't mean we shouldn't generate a file of cubes up to n=16 (or 17 if we get really lucky with optimisations). Then we stick it on a web app that randomly serves you one of the billions of shapes each time you visit - next video!

@JATothrim (Contributor) commented

The split cache files should be somewhat compressible with just zlib. This would reduce the disk space needed.

I'm working on a "cube swapper" system that moves the cube XYZ-list representation data out of RAM and instead stores it on disk via a memory-mapped file. The data is read in only when needed.
A drawback is that this system is currently very slow, as periodically flushing the cubes out to disk takes a long time.
At the proof-of-concept stage it does work: the RAM used for computing the polycubes is minuscule compared to the current situation - with an N=13 run at 52/66 shapes, the process shows less than 300 MiB RSS. :-)
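
A very rough sketch of the memory-mapped approach (POSIX-only; the file name, sizing and layout here are made up for illustration). The kernel pages the data in on access and can evict it under memory pressure, so resident memory stays small:

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

struct XYZ { std::int8_t x, y, z; };

// Map a file large enough for num_xyz coordinates so cube data lives on disk, not in RAM.
XYZ* map_cube_storage(const char* path, std::size_t num_xyz) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(num_xyz * sizeof(XYZ))) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, num_xyz * sizeof(XYZ),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<XYZ*>(p);
}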

I noticed @nsch0e has done some experiments compressing the cube representation, and I don't know his plans for it.

@nsch0e (Contributor, Author) commented Aug 7, 2023

Unfortunately life is getting in the way... :)

For what it's worth, I have pushed my WIP on compression to this branch. The idea is to encode pcubes as a string of direction changes. Since pcubes can have a tree-like structure, jumps also have to be encoded.
Currently there is no spec description of the encoding, so it is perhaps a bit difficult to understand, but if @JATothrim wants to have a look, feel free to proceed with my draft.

In this branch the encoding is used in the hash set to save memory. Compression on disk could reduce the size to roughly 20% (5x). Each voxel is represented by a nibble, plus jumps where needed (mostly also a nibble, sometimes two for big jumps).

I don't think I will have much progress in the next month, so don't count on me...

@nsch0e (Contributor, Author) commented Aug 7, 2023

So I tried to describe the compression:

PCube Compression

This compression uses the fact that polycubes have to be connected, so most of the time we only have to specify a direction for the next cube in the polycube. Only when a string of cubes ends do we use a jump instruction to begin the next branch of cubes.

Encoding:

+----------------+-----------+-----------+---------------------+
| 8 bits         | 4 bits    | 4 bits    | ...2*length nibbles |
+----------------+-----------+-----------+---------------------+
| length in Byte | [dir|jmp] | [dir|jmp] | ...                 |
+----------------+-----------+-----------+---------------------+

Nibbles

dir

0b0xxx has its highest bit fixed at zero. The lower 3 bits are an index into the direction table:

table = [
  { 0,  0,  1},
  { 0,  0, -1},
  { 0,  1,  0}, 
  { 0, -1,  0},
  { 1,  0,  0},
  {-1,  0,  0}
]

jmp

0b1xxx has its highest bit fixed at one. The lower 3 bits are the reverse jump distance minus 1, so the jump 0b1000 references the second-to-last encoded cube.
jmp instructions can be concatenated to represent bigger jumps. Each following jump instruction shifts the previous jump distance left by 3 bits.
example:

encoded jump:  1001 1010
jump distance:  001  010 = 0b1010

When a polycube is encoded with an odd number of nibbles, a jump instruction 0b1000 can be appended as padding, because it doesn't add a cube.
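
A decoding sketch of the scheme, following my reading of the draft (the implicit origin cube and the exact jump-distance offset are assumptions):

#include <cstdint>
#include <vector>

struct XYZ { int x, y, z; };

static const XYZ DIRS[6] = {
    { 0,  0,  1}, { 0,  0, -1},
    { 0,  1,  0}, { 0, -1,  0},
    { 1,  0,  0}, {-1,  0,  0},
};

// nibbles: the stream following the length byte, one value 0..15 per entry.
std::vector<XYZ> decode(const std::vector<std::uint8_t>& nibbles) {
    std::vector<XYZ> cubes{{0, 0, 0}};  // assume the first cube sits at the origin
    XYZ cur = cubes.back();             // cube we currently grow from
    std::uint32_t jump = 0;             // accumulates concatenated jmp nibbles
    bool have_jump = false;
    for (std::uint8_t n : nibbles) {
        if (n & 0x8) {                  // jmp nibble 0b1xxx: extend the pending jump
            jump = (jump << 3) | (n & 0x7);
            have_jump = true;
        } else {                        // dir nibble 0b0xxx: place the next cube
            if (have_jump) {            // a pending jump selects the branch point;
                                        // 0b1000 -> second-to-last cube (offset assumed)
                cur = cubes[cubes.size() - 2 - jump];
                jump = 0;
                have_jump = false;
            }
            const XYZ& d = DIRS[n & 0x7];
            cur = {cur.x + d.x, cur.y + d.y, cur.z + d.z};
            cubes.push_back(cur);
        }
    }
    return cubes;                       // trailing 0b1000 padding adds no cube
}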

@JATothrim (Contributor) commented

@nsch0e Thank you for describing the compression scheme. :)
