No hash table with less performance penalty #49
I cloned and did a test run.
Your version is so fast that I wondered: is it just counting the polycubes? If so, this is amazingly good.
The default run mode is to total the polycubes in the "thread_pool_push_output" function. The worker threads call this after processing each polycube of size N-1. The generated cubes themselves are ignored unless writing to an output cache file. For the cache file, the cubes are written and recovered through the "compression_compress" and "compression_decompress" functions. The data format was an attempt to reduce the file size from something like 30 bytes per polycube for the raw point list to 9 bytes per polycube (both at N=15). I described the format in the readme somewhat, though I didn't really explain the recovery process:
It seems I introduced a bug in reading the cache files at some point, and I probably didn't notice because reading cache files doesn't seem to improve the performance much in practice. The cause was simple but took a bit to track down: I was feeding the wrong input length to the decompression function. As a result, the version you cloned segfaults whenever you try to actually use any cache file from above N=9. I think that issue is solved with the changes I just pushed.
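For what it's worth, this bug class is easy to hit with length-prefixed records: the decoder has to be handed the stored compressed length, not the capacity of the read buffer. A minimal sketch with a hypothetical record format and a stand-in decoder (none of these names are from the project):

```c
/* Illustration of the cache-read bug class: with <u32 length><payload>
 * records, the decompressor must receive the stored length `len`, not the
 * read-buffer capacity `cap`. Hypothetical format, not the project's. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in decoder: just copies the input; a real one would expand it. */
static int decompress_stub(const uint8_t *in, size_t in_len, uint8_t *out)
{
    memcpy(out, in, in_len);
    return (int)in_len;
}

/* Reads one <u32 length><payload> record and decodes it. */
static int read_record(FILE *f, uint8_t *buf, size_t cap, uint8_t *out)
{
    uint32_t len;
    if (fread(&len, sizeof len, 1, f) != 1 || len > cap)
        return -1;
    if (fread(buf, 1, len, f) != len)
        return -1;
    /* The bug described above is passing the wrong length here (e.g. `cap`). */
    return decompress_stub(buf, len, out);
}
```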
@snowmanam2 Could this be scaled horizontally by just serving the base pcubes from a job server? The worker machines could keep the split output locally and, after computation, turn into "serving mode" for the next run. I did notice the segfault. May I suggest that you write an export tool that exports the data in ".pcube" format? (described in #17 (comment))
For the write performance, I think there might be a few different factors at play. I experimented with increasing the write buffers (WRITE_COUNT) in "thread_pool.c", and it seems to give a slight improvement, but I'm not sure how it would translate to a more powerful processor. Further experimentation with the thread count might also yield some improvement. I also considered buffering the results in the worker threads themselves, though I'm not sure if that would help anything.

I can see how this could be split by partitioning the original data only once by, say, some equal number of polycubes. I opted not to do this at present so I'm not left with 1 or 2 threads processing the last bits of data when the others are all done. A server approach might have similar issues if the work isn't divided evenly enough. At the other extreme, if you make the partition size too small, the overhead could eat into the performance. I also see that such systems were discussed in some of the other issues on this project.

I added the ability to read/write basic .pcube files today by just using the .pcube extension in the -i or -o flags. My original format does have a slight size advantage, but we might as well make everything compatible. I still haven't implemented the compression because I haven't really researched how to work with zlib yet. The code was generally kind of thrown together, too, so I might want to refactor a bit in the future and add some more error checks.
I pulled today and looked at the code.
So if you do implement zlib for the .pcube format, it will likely beat the BitFace format in space usage, but it will be slower at high compression ratios. About the I/O:
About the thread pool / work scheduling: if the difference in worker completion times is a problem, you could consider fetching the seed pcubes via atomic pointer/index increments from a shared chunk of seed pcubes. This would be the "fairest" way to distribute the seed pcubes to worker threads. Synchronization with atomics and locks is a non-trivial task, however...
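A minimal sketch of that atomic fetch idea, assuming C11 atomics; the names (`seed_chunk`, `claim_next_seed`) are illustrative, not from the project:

```c
/* "Fairest" seed distribution: every worker claims one seed polycube at a
 * time by atomically incrementing a shared index, so fast workers simply
 * claim more seeds and nobody is stuck with a predetermined partition. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    const void **seeds;   /* shared chunk of seed pcubes */
    size_t count;         /* number of seeds in the chunk */
    atomic_size_t next;   /* next unclaimed index */
} seed_chunk;

/* Returns the next unclaimed seed, or NULL once the chunk is exhausted. */
static const void *claim_next_seed(seed_chunk *c)
{
    size_t i = atomic_fetch_add_explicit(&c->next, 1, memory_order_relaxed);
    return (i < c->count) ? c->seeds[i] : NULL;
}
```

Workers just loop on `claim_next_seed` until it returns NULL; the only contention is the single fetch-add per seed.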
The progress updates were there for normal "bottom-up" generation, but I disabled them for input cache files because I didn't know the input file length. I found ways around that now:
I still need to add progress updates for the conversion option (equal N but different formats), though that's just because it bypasses the thread pool altogether. Compression is now allowed for writing pcube files using the |
I tested using zlib deflate settings
The program no longer seems to be I/O bound, but the scalability caps at ~800%. Looking at
And indeed, doing unlock(output_lock) before unlock(write_lock) seems to cause repeated data races. To solve the scaling issue, I suggest you implement an output queue / circular buffer:
The above allows workers to continue pushing slabs of pcubes while one slab is simultaneously being deflated.
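A minimal sketch of such a circular buffer, shown without thread synchronization to keep it short; a real version would wrap push/pop in a mutex with "not full"/"not empty" condition variables. All names here are illustrative:

```c
/* Fixed-capacity circular buffer of "slabs" (batches of packed pcubes).
 * Workers push filled slabs; a single writer thread pops them and
 * deflates/writes while the workers keep generating. */
#include <stdbool.h>
#include <stddef.h>

#define SLAB_QUEUE_CAP 8            /* must be a power of two */

typedef struct {
    void  *slabs[SLAB_QUEUE_CAP];
    size_t head;                    /* next slot to pop  */
    size_t tail;                    /* next slot to push */
} slab_queue;

static bool slab_queue_push(slab_queue *q, void *slab)
{
    if (q->tail - q->head == SLAB_QUEUE_CAP)
        return false;               /* full: worker must wait */
    q->slabs[q->tail++ & (SLAB_QUEUE_CAP - 1)] = slab;
    return true;
}

static void *slab_queue_pop(slab_queue *q)
{
    if (q->head == q->tail)
        return NULL;                /* empty: writer must wait */
    return q->slabs[q->head++ & (SLAB_QUEUE_CAP - 1)];
}
```

The head/tail counters increase monotonically and are masked on access, so the full/empty tests stay correct even when the counters wrap.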
For the data race / locking stuff, I actually tracked down the fairly cryptic ThreadSanitizer warnings I got to overflowing the input buffer in the new zlib input stream implementation. There was also unprotected access to the progress bar "total_input_index" variable, though ThreadSanitizer made that really easy to find. Changes were pushed last night for that specifically, though I know I should probably rework the locking logic regardless.

Some explanation: the thread pool actually has 2 buffers - an "output" buffer and a "write" buffer. When the "output" buffer fills, the pointers get swapped and the unlucky worker goes to work on writing the data from the "write" buffer. While that worker is writing, the "output" buffer is free to accept data from the other workers. Because the buffers are equal in size, the hope is that the fresh "output" buffer won't fill by the time the "write" buffer has been written. The output lock is released once the output buffer / variables are free to modify, but the write lock is kept to stop other threads from messing with the write buffer. Moving the output lock release below the write lock release effectively locks all threads after generation during the entire write process (a quick test at N=12 showed ~40% slower; some N=13 tests were only ~10% slower).

I'm not sure if moving the buffering to the workers will really help much. There might be a way to do parallel compression: pigz does just that by keeping the last 32 KB of the preceding block of uncompressed data per thread to generate context. If running big enough blocks, maybe the performance would be worth it? Then again, I can only imagine the complexity required to actually pull this off.

As a test of trying to parallelize the pack + write step, I made a very rough testing branch (test-pwrite) to see how much I can decouple the threads.
I moved all of the pack + write work into the workers, finishing in pwrite calls, and the only shared state for output is the total count and the write offset. It seems to work pretty well, though output compression no longer works as a temporary side effect. Even if this isn't the way to go, it's probably still best to keep the packing / serialization work in the workers.
This thread is getting rather long...
Starting with the "hashtable-less" algorithm described by presseyt in #11, it seems it could be faster if we change the approach to checking whether we have the correct polycube for output:
1. Start with a seed polycube `p`.
2. Extend by candidate cube `a` to yield polycube `q` (`q` = `p` + `a`), canonizing and keeping track of the added cube `a`.
3. Find the index of `a` in each `q`.
4. For each `q`, check if there is any possible `r` at a higher index of `q` that can be removed while remaining a polycube. Otherwise output `q`.

The advantage of this method is that we don't need to recanonize the resulting polycubes after removing the values of `r`. This doesn't come completely for free, as we have to check for duplicates within the set of polycubes generated by the seed polycube `p` and do some connection checks. This is because symmetric seed polycubes will create at least 2 of the same generated polycube, though we only need to check within this set because the index of the added point is always the same in the output for the same seed polycube.

I made my own C implementation based on this idea in this repository. Profiling at low values of N seems to indicate the connection checks for step 4 take about 15% of the total processing time. Testing shows it seems faster (N=15 in 1 hour 38 minutes on an old FX-8350 processor) than the current Rust hashless or point-list versions on my machine, though I'm not sure how it compares to the solution described in #27. Admittedly, I'm rather limited in testing larger values of N because of how long I'd have to tie up my machine with its limited processing capability.
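The connection check in step 4 boils down to asking whether `q` stays face-connected after dropping a candidate cube `r`. A minimal flood-fill sketch of that test (illustrative code, not the repository's implementation; `cube`, `connected_without`, and the size bound are made up here):

```c
/* A candidate r is removable only if the remaining cubes of q stay
 * face-connected. Flood fill from any remaining cube and check that
 * everything else is reachable. */
#include <stdbool.h>

#define MAXC 64                          /* assumed upper bound on N */

typedef struct { int x, y, z; } cube;

static bool adjacent(cube a, cube b)
{
    int dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return (dx * dx + dy * dy + dz * dz) == 1;   /* exactly one face apart */
}

/* True if q (n cubes) stays connected after dropping the cube at index skip. */
static bool connected_without(const cube *q, int n, int skip)
{
    bool seen[MAXC] = { false };
    int stack[MAXC], top = 0, reached = 0;

    if (n <= 2)
        return true;                     /* 0 or 1 cube left is connected */
    int start = (skip == 0) ? 1 : 0;     /* any remaining cube */
    seen[start] = true;
    stack[top++] = start;
    while (top > 0) {
        int c = stack[--top];
        reached++;
        for (int i = 0; i < n; i++)
            if (i != skip && !seen[i] && adjacent(q[c], q[i])) {
                seen[i] = true;
                stack[top++] = i;
            }
    }
    return reached == n - 1;             /* all remaining cubes reached */
}
```

For a straight tromino, removing either end leaves a connected pair, while removing the middle disconnects it, which matches the intuition for why only higher-index removable cubes disqualify `q`.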