C++ implementation #7
Conversation
holy hell this is fast, both in how quick you've made it, and in runtime, very nice. |
I'm not sure what the process should be. There are things I would like to work on:
I'm open for feedback. I just selected Draft because the code isn't commented or structured well. Just what I coded down. |
As memory seems to be a main issue, it would be optimal to only hold the cubes in memory that are currently needed. The way I see it, you don't need a cube after expanding it. |
happy to merge this after some cleanup and commenting, would you rather I go through and point stuff out line by line in the review, or just make the changes and pull-request my changes into your repo. this will not be replacing the python implementation but will be parallel to it. |
@NobodyForNothing I'm really not sure where the line is for the memory footprint. At some point the problem will get big. We can try to be efficient, but for "removing cubes after expanding" to be efficient we have to consider other data structures, so that removing single items actually decreases memory usage (e.g. a deque?). @bertie2 I think pull requests to my repo are more efficient than line-by-line discussion. |
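A minimal sketch of the "drop cubes after expanding" idea under discussion, with a placeholder `Cube` key and a made-up `expand()` function (not the repo's code): seeds are consumed from the front of a deque as they are expanded, so only the frontier and the deduplication set stay in memory.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>
#include <vector>

// Stand-ins for the real types: a polycube is represented here by a
// canonical 64-bit key just to keep the sketch short.
using Cube = std::uint64_t;

// Hypothetical expansion step: returns the canonicalised children of c.
std::vector<Cube> expand(Cube c) { return {c + 1, c + 2}; }  // placeholder

std::unordered_set<Cube> next_level(std::deque<Cube> seeds) {
    std::unordered_set<Cube> found;
    while (!seeds.empty()) {
        Cube c = seeds.front();
        seeds.pop_front();            // drop the seed as soon as it is expanded
        for (Cube child : expand(c))
            found.insert(child);      // the set deduplicates the new level
    }
    return found;                     // memory holds only frontier + results
}
```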
What's the current memory footprint e.g. at n=13? I have some servers available ;) Some of the discussions elsewhere are thinking about how one could shard work into different processes, or distribute across machines. To pull that off we'd probably need to find a way of identifying the minimum set of all cubes from a set of sets of cubes produced on different machines. There's almost certainly a known algorithm for this from big data - maybe future work ;) |
hard to say :) I only have 16G so I can't test everything:
Memory seems to grow linearly with progress... |
For what it's worth I've run it single-threaded on my MacBook Pro with 32 GB memory, and N = 13 took ~12 hours (commit cf6e413).
It's currently calculating N = 14 and consumes about 30GB of memory, though it's rapidly increasing, so I suspect it'll hit a wall here soon. |
So I updated the code to use synchronized access to a single set instead of multiple sets when more threads are used. The memory footprint should now be more or less the same and not depend on the number of threads. |
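The PR's actual synchronisation code isn't shown in this thread; as a generic illustration of the approach (with a placeholder `Cube` key type), a single set guarded by a mutex could look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_set>

using Cube = std::uint64_t;            // placeholder canonical key

// One set shared by all worker threads, guarded by a mutex.
class SharedCubeSet {
public:
    // Returns true if the cube was newly inserted (i.e. not seen before).
    bool insert(Cube c) {
        std::lock_guard<std::mutex> lock(mutex_);
        return set_.insert(c).second;
    }
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return set_.size();
    }
private:
    mutable std::mutex mutex_;
    std::unordered_set<Cube> set_;
};
```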
Open pull request from me at nsch0e#1. @mikepound I'm running the numbers up to n=15 myself now with 32GB of RAM and 256GB of swap; n=13 is taking less than 10 minutes. |
Thanks to #5 (comment) for finding it, this paper really gives a lot of useful info: http://kevingong.com/Polyominoes/ParallelPoly.html. For what it's worth, here are some of my conclusions and thoughts from it:

Do we really want to store all cubes of the record-breaking n, or is just calculating the number enough? If so, we could try to get clever and partition the polycubes into non-intersecting sets, to save on RAM. One such partitioning metric could be shape (rotated to some canonical form, e.g. increasing numbers, ensuring it's not mirrored), but we don't necessarily have to use shape, and we could use several metrics. However, if we can't predict in advance which metrics the cubes derived at n=16 from a given cube at n=14 will fall under, then using several metrics isn't useful; we can just use one: the modulo of some hash (any algorithm different from the one used for putting the elements into the set - see the sketch after this comment). We can change the modulation number depending on how much RAM we have, to control the number of such groups and thus the size of each set. For the example below, let's say the modulation number is k = 1000.

Also, we can pre-compute the cache only up to a certain level n (the maximum our storage capacity allows), and then use that cache as the starting point of every run, without caching the subsequent layers, so their information is lost after the algorithm finishes. Why not cache the next layers? Because their size in MB would be ridiculous (more than we can hold), while the counts of polycubes are calculated in parallel on multiple machines, so that giant data set never actually has to be stored in one place at the same time. We can compute the sum of unique polycubes without storing them.

Example: we want to calculate the number for n=17. The workflow can be as follows:
If the time is too long:
Maybe some further improvements can be made, with ideas both from the paper and from people here. |
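A minimal sketch of the hash-modulo grouping described above, with a placeholder `Cube` key and made-up mixing constants; the only requirement carried over from the comment is that this sharding hash differs from the hash used inside the dedup set:

```cpp
#include <cstddef>
#include <cstdint>

using Cube = std::uint64_t;   // placeholder canonical key, not the repo's type

// A secondary hash (a simple multiply/xor mix here) -- it must be different
// from the hash used inside the dedup set itself, as noted above.
std::uint64_t shard_hash(Cube c) {
    c ^= c >> 33;
    c *= 0xff51afd7ed558ccdULL;
    c ^= c >> 33;
    return c;
}

// k is the "modulation number" from the comment (e.g. k = 1000).
// Every cube lands in exactly one of k disjoint groups, and each group can
// be deduplicated and counted independently, then the counts summed.
std::size_t shard_of(Cube c, std::size_t k) {
    return static_cast<std::size_t>(shard_hash(c) % k);
}
```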
remove "using namespace std;" code smell. All changes are simply pasting std:: in front of all that didn't compile. One exception is with DBG defined in cubes.cpp: -rotatedCube.print(); +lowestHashCube.print(); to fix build regression.
with a cleaner solution: use standalone hasher functors and add typedefs for the unordered_set<> as *CubeSet* and *XYZSet*.
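A sketch of what that pattern could look like, with placeholder `XYZ`/`Cube` types and made-up hash functions (the repo's actual definitions are not reproduced here):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative stand-ins only -- not the repo's actual XYZ/Cube definitions.
struct XYZ { std::int8_t x, y, z; };
inline bool operator==(XYZ a, XYZ b) { return a.x == b.x && a.y == b.y && a.z == b.z; }

struct Cube { std::vector<XYZ> sparse; };            // sorted voxel list
inline bool operator==(const Cube& a, const Cube& b) { return a.sparse == b.sparse; }

// Standalone hasher functors (no std::hash specialisation needed).
struct XYZHash {
    std::size_t operator()(XYZ p) const {
        return (std::size_t(std::uint8_t(p.x)) << 16) |
               (std::size_t(std::uint8_t(p.y)) << 8) |
                std::size_t(std::uint8_t(p.z));
    }
};
struct CubeHash {
    std::size_t operator()(const Cube& c) const {
        std::size_t h = 0;
        for (XYZ p : c.sparse) h = h * 31 + XYZHash{}(p);
        return h;
    }
};

// The typedefs the commit message refers to.
typedef std::unordered_set<Cube, CubeHash> CubeSet;
typedef std::unordered_set<XYZ, XYZHash>   XYZSet;
```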
I would like to make Cube::sparse private, but there are some single-use cases for it.
When the source cube is not needed, just std::move() the object into another home. A few templates were added to get perfect forwarding going for Cube. Finally, convert from push_back()/insert() to the C++11 emplace variants.
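As a small illustration of those two changes (the `Cube`/`XYZ` types below are placeholders, not the repo's definitions): move the cube when the source is no longer needed, and construct elements in place with the emplace variants.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Placeholder types for illustration; not the repo's actual Cube/XYZ.
struct XYZ {
    XYZ(std::int8_t x_, std::int8_t y_, std::int8_t z_) : x(x_), y(y_), z(z_) {}
    std::int8_t x, y, z;
};
struct Cube { std::vector<XYZ> sparse; };

int main() {
    Cube c;
    c.sparse.emplace_back(0, 0, 0);      // construct the XYZ in place
    c.sparse.emplace_back(1, 0, 0);      // instead of push_back(XYZ(...))

    std::vector<Cube> level;
    // The source cube is not needed afterwards, so move it instead of
    // copying its whole voxel vector.
    level.push_back(std::move(c));
    return 0;
}
```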
C++ code base cleanup
…s, fix README for tests
Reformatting For Readability.
@nsch0e can I get your thumbs up to merge this into main as it stands? I'm happy with it as it is, but just want to check you don't have changes in flight. |
@bertie2 I just did my last 3 small changes and am also happy with this state. Further improvements will have to wait. I have ideas but I think we should provide this as a starting point for others. |
This is looking great! For info I have access to a 50-core server with 1/2 TB of RAM - so if we need some testing doing.. ;) |
FYI I got a nice tweet from @tjol on his implementation here: https://github.com/tjol/polycubes I'll point him in this direction. |
@mikepound If growth continues as we expect, then with 2TB of RAM n=16 should be runnable with both this and tjol's implementation as they stand right now (just make sure you also have 2TB of disk space to write out the result to); however, n=17 won't be. The next step to get any higher will be to implement direct disk streaming, which I am working on in my own branch, but I want to get this and the Rust implementation merged first as they stand, so that others have something compatible to work from. (I estimate you need 4-6 TB of space for n=17, so it pretty much has to be disk streaming.) |
Yes no rush! I also note that the enumeration has actually now been computed up to n=18, from wikipedia. Clearly my video prompted someone to update the page! Enumeration and generation are two very different things though. |
I wouldn't mind finding out how much RAM my version actually uses for N>14 and whether it's limited by disk speed (writing manageable chunks of the result to disk one by one is all well and good, but my method of deduplication – looping through the entire file once for every new block of data – is actually quite slow, who knew). |
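For illustration only, a hedged sketch of that kind of block-wise deduplication (assuming a fixed-width binary key file; not the actual implementation): each new block requires one full pass over the existing file, which is where the slowness comes from.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_set>

// Rough sketch of the dedup scheme described above: before appending a new
// block of canonical cube keys, stream the whole existing result file once
// and drop every key that is already present.
// The fixed-width 64-bit binary key format is an assumption.
using Key = std::uint64_t;

void dedup_against_file(const std::string& path, std::unordered_set<Key>& block) {
    std::ifstream in(path, std::ios::binary);
    Key k;
    while (in.read(reinterpret_cast<char*>(&k), sizeof k))
        block.erase(k);                 // already on disk, so not new
    // Whatever remains in `block` can now be appended to the file;
    // a full file pass per block is what makes this slow.
}
```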
I have queued some pull requests on @nsch0e's cpp branch: I would like to know whether they are still candidates to be pulled. The optimizations had the following impact: |
@JATothrim If you raise your changes as pull requests into this repo I will take a look. I think I would merge 10 on the spot, but would want feedback from @nsch0e before merging 11. |
I like both. :) Also, please feel free to improve the C++ implementation in this repo as you see fit. I don't think I can/want to approve everything, as there are far more knowledgeable people than me when it comes to C++. ;) |
@mikepound Not sure where to put this, so replying here: the current latest C++ implementation with the split cache files should be able to produce all polycubes up to n=16 with 2TB of RAM in about 2 days, as long as it also has ~3TB of storage available for the results. Table of estimates here:
The data for n=15+ is estimated; the data for n=10 to n=14 is measured (x indicates I couldn't properly measure the memory usage). Whether this is actually worth running is up to you. Above n=16 it is pretty much required that we move to an enumeration technique rather than trying to actually store the cubes. Currently only the Rust version can do storage-less enumeration, and https://github.com/datdenkikniet has run it up to n=16.
Again, we need to decide whether it's worth starting now or whether we should wait for code improvements; for now the breakthroughs seem to have distinctly slowed down. |
This is already an amazing improvement! :) I'd seriously consider running n=16, if only so we can say we did it! 3TB is a lot, but also not totally out of the question. The RAM is currently an issue: we have 1/2 TB on our largest server, so we may need to wait until we get a few more optimisations, though a 75% memory saving may not count as "a few optimisations". I can certainly enumerate n=17, but if we are confident the code is working, all we can do there is verify the correctness of the paper. I'd argue that enumeration and generation are two separate and entirely worthwhile goals, i.e. the fact that cubes have been enumerated up to 19 doesn't mean we shouldn't generate a file of cubes up to n=16 (or 17 if we get really lucky with optimisations). Then we stick it on a web app that randomly serves you one of the billions of shapes each time you visit - next video! |
The split cache files should be somewhat compressible with just zlib, which would reduce the disk space needed. I'm working on a "cube swapper" system that moves the cube XYZ-list representation data out of RAM and instead stores it on disk via a memory-mapped file. The data is read in only when needed. I noticed @nsch0e has done some experiments compressing the cube representation and I don't know about his plans for it. |
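The "cube swapper" itself isn't in this thread; as a rough POSIX-only illustration of the memory-mapped-file idea (file layout unspecified, error handling minimal, names made up), something like this would let the OS page cube data in from disk only when it is touched:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Rough POSIX sketch: map an existing cube-data file read-only so the OS
// pages voxel data in from disk only when it is actually accessed.
class MappedCubeFile {
public:
    explicit MappedCubeFile(const char* path) {
        fd_ = ::open(path, O_RDONLY);
        struct stat st {};
        if (fd_ >= 0 && ::fstat(fd_, &st) == 0) {
            size_ = static_cast<std::size_t>(st.st_size);
            data_ = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
        }
    }
    ~MappedCubeFile() {
        if (data_ && data_ != MAP_FAILED) ::munmap(data_, size_);
        if (fd_ >= 0) ::close(fd_);
    }
    const unsigned char* data() const {
        return static_cast<const unsigned char*>(data_);
    }
    std::size_t size() const { return size_; }
private:
    int fd_ = -1;
    std::size_t size_ = 0;
    void* data_ = nullptr;
};
```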
Unfortunately life is getting in the way... :) For what it's worth, I have pushed my WIP on compression to this branch. The idea is to encode pcubes as a string of direction changes. Since pcubes can have a tree-like structure, jumps also have to be encoded. In this branch the encoding is used in the hashset to save memory. Compression on disk could reduce size to roughly 20% (5x). Each voxel is represented by a nibble, plus jumps (mostly also a nibble, sometimes two for big jumps) when needed. I don't think I will have much progress in the next month, so don't count on me... |
So I tried to describe the compression:

PCube Compression

This compression uses the fact that polycubes have to be connected, so most of the time we only have to specify a direction for the next cube in the polycube. Only when a string of cubes ends do we use a jump instruction to begin the next branch of cubes.

Encoding uses two kinds of nibbles: dir and jmp.

When a polycube is encoded by an odd number of nibbles, a jump instruction |
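The exact nibble layout isn't shown in this thread, so the following is only a rough sketch of the general scheme: direction nibbles for each new voxel, a jump marker when a branch ends, two nibbles packed per byte. All concrete code values are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative nibble codes only -- the actual values and jump format in
// nsch0e's branch are not reproduced here.
enum Nibble : std::uint8_t {
    PX = 0, NX = 1, PY = 2, NY = 3, PZ = 4, NZ = 5,  // direction changes
    JMP = 6,   // followed by a nibble saying how far back to jump
};

// Pack a sequence of nibbles into bytes, two nibbles per byte.
std::vector<std::uint8_t> pack(const std::vector<std::uint8_t>& nibbles) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < nibbles.size(); i += 2) {
        std::uint8_t hi = nibbles[i];
        std::uint8_t lo = (i + 1 < nibbles.size()) ? nibbles[i + 1] : 0;
        out.push_back(static_cast<std::uint8_t>(hi << 4 | lo));
    }
    return out;   // roughly half a byte per voxel, as estimated above
}
```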
@nsch0e Thank you for describing the compression scheme. :)
This implementation is completely separate from the python implementation. It uses a simpler representation than RLE: just a list of the indices of all ones (like .nonzero() in numpy).
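As a hedged illustration of that representation (the grid layout, names and types here are assumptions, not the repo's actual code), converting a dense boolean grid to a coordinate list might look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct XYZ { std::uint8_t x, y, z; };

// Minimal sketch: instead of storing the full dense grid, keep only the
// coordinates of the set voxels (the equivalent of numpy's .nonzero()).
// The x-major grid layout is purely for illustration.
std::vector<XYZ> to_sparse(const std::vector<bool>& grid,
                           std::uint8_t dx, std::uint8_t dy, std::uint8_t dz) {
    std::vector<XYZ> sparse;
    for (std::uint8_t x = 0; x < dx; ++x)
        for (std::uint8_t y = 0; y < dy; ++y)
            for (std::uint8_t z = 0; z < dz; ++z)
                if (grid[(std::size_t(x) * dy + y) * dz + z])
                    sparse.push_back(XYZ{x, y, z});
    return sparse;
}
```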