
No hash table with less performance penalty #49

Open
snowmanam2 opened this issue Aug 30, 2023 · 9 comments
@snowmanam2

Starting with the "hashtable-less" algorithm described by presseyt in #11, it seems this could be made faster by changing how we check whether a generated polycube is the correct one to output:

  1. Start with seed polycube p. Extend by candidate cube a to yield polycube q (q = p + a).
  2. Repeat step 1 for all possible additions of a.
  3. Filter the results of step 2 down to unique polycubes, saving the highest index of a found in each q.
  4. For each q, check whether any cube r at a higher index (than a) in q can be removed while leaving a valid polycube. If no such r exists, output q.

The advantage of this method is we don't need to recanonize the resulting polycubes after removing the values of r. This doesn't come completely for free, as we have to check for duplicates within the set of polycubes generated by the seed polycube p and do some connection checks. This is because symmetric seed polycubes will create at least 2 of the same generated polycube, though we check only within this set because the index of the added point is always the same in the output for the same seed polycube.
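The removability check in step 4 reduces to a connectivity test: after tentatively removing the cube at index r, the rest of q must still be face-connected, which a flood fill can verify. A minimal sketch of that test, assuming a simple point-list representation (`Point`, `connected_without`, and all other names here are my own illustration, not the actual implementation):

```c
typedef struct { int x, y, z; } Point;

/* Two unit cubes share a face iff their centers are at distance 1. */
static int adjacent(Point a, Point b) {
    int dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return (dx*dx + dy*dy + dz*dz) == 1;
}

/* After removing the cube at index `skip`, are the remaining n-1 cubes
 * of `p` still face-connected?  Flood fill from the first kept cube. */
int connected_without(const Point *p, int n, int skip) {
    int seen[64] = {0};              /* assumes n <= 64 */
    int stack[64], top = 0, count = 0;
    int start = (skip == 0) ? 1 : 0;
    if (n - 1 <= 0) return 1;
    seen[start] = 1;
    stack[top++] = start;
    while (top > 0) {
        int i = stack[--top];
        count++;
        for (int j = 0; j < n; j++) {
            if (j == skip || seen[j]) continue;
            if (adjacent(p[i], p[j])) { seen[j] = 1; stack[top++] = j; }
        }
    }
    return count == n - 1;           /* reached every kept cube? */
}
```

For an L-shaped tromino, removing the corner cube disconnects the other two, so the check fails; removing either end cube succeeds.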

I made my own C implementation based on this idea in this repository. Profiling on low numbers of N seems to indicate the connection checks for step 4 take about 15% of the total processing time. Testing shows it seems faster (N=15 in 1 hour 38 minutes on an old FX-8350 processor) than the current rust hashless or point-list versions on my machine, though I'm not sure how it compares to the solution described in #27. Admittedly, I'm rather limited in testing larger values of N because of how long I'll have to tie up my machine with its limited processing capability.

@JATothrim
Contributor

JATothrim commented Aug 30, 2023

I cloned it and did a test run of ./polycube_generator 13 -o cubes13.dat; it completed in 44 seconds.
For comparison:

  • The Rust cargo run -r enumerate 13 was still going after 4 minutes, and it had loaded the N=12 cached file.
  • The C++ cubes -n 13 -w -s -u loaded the cached N=12 files and completed after 5 minutes 52 seconds.

Your version is so fast that I wondered: is it just counting the polycubes?
Does the output file contain data for all polycubes of N=13?
(Is it possible to extract a particular pcube and its coordinates after the run is completed?)

If the above is true, this is amazingly good.
The code base is also readable, at least to me, and has comments. I applaud.

@snowmanam2
Author

The default run mode is to total the polycubes in the "thread_pool_push_output" function. The worker threads call this after processing each polycube of size N-1. The generated cubes themselves are ignored unless writing to an output cache file.

For the cache file, the cubes are written and recovered through the "compression_compress" and "compression_decompress" functions. The data format was an attempt to reduce the file size from something like 30 bytes per polycube for the raw point list to 9 bytes per polycube (both at N=15). I described the format in the readme somewhat, though I didn't really explain the recovery process:

  1. Add point (1,1,1) as the first point.
  2. Look at the first 6 bits of the polycube. For each bit set (X,Y,Z,-X,-Y,-Z), add a point at the designated face offset and remember the face that generated it.
  3. For the remaining N-2 groups of 5 bits, each set bit will generate a corresponding point in sequence as in step 2, but skip the bit check entirely for the face where this point was discovered.
  4. Normalize all points to have (1,1,1) as the minimum point.
  5. Sort the points in increasing order if desired.
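Under some assumptions about the packing (face order +X, +Y, +Z, -X, -Y, -Z; bits read LSB-first within each byte; the actual format may well differ), recovery steps 1-3 might look roughly like this. `decode`, `FACE`, and the rest are hypothetical names, not the real "compression_decompress":

```c
#include <stdint.h>

typedef struct { int x, y, z; } Pt;

/* Assumed face order: +X, +Y, +Z, -X, -Y, -Z. */
static const int FACE[6][3] = {
    {1,0,0},{0,1,0},{0,0,1},{-1,0,0},{0,-1,0},{0,0,-1}
};

/* Assumed LSB-first bit packing within each byte. */
static int get_bit(const uint8_t *data, int i) {
    return (data[i >> 3] >> (i & 7)) & 1;
}

/* Recover up to n points from the packed stream (steps 1-3). */
int decode(const uint8_t *data, int n, Pt *out) {
    int from[64];            /* face each point was generated through */
    int bit = 0, count = 1;
    out[0] = (Pt){1, 1, 1};  /* step 1: seed point */
    from[0] = -1;
    for (int head = 0; head < count && count < n; head++) {
        /* step 3: skip the face pointing back at the generating cube */
        int back = (from[head] < 0) ? -1 : (from[head] + 3) % 6;
        for (int f = 0; f < 6; f++) {
            if (f == back) continue;
            if (get_bit(data, bit++) && count < n) {
                out[count].x = out[head].x + FACE[f][0];
                out[count].y = out[head].y + FACE[f][1];
                out[count].z = out[head].z + FACE[f][2];
                from[count++] = f;
            }
        }
    }
    return count;
}
```

Under these bit-order assumptions, a straight tricube along X packs into two bytes: 6 bits for the seed (+X set), then 5 bits for the second cube (+X set, -X skipped), then zeros.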

It seems I introduced a bug in reading the cache files at some point, and I probably didn't notice because reading cache files doesn't seem to improve the performance much in practice. The cause was simple but took a bit to track down - I was feeding the wrong input length to the decompression function. As a result, the version you cloned segfaults whenever you try to actually use any cache file from above N=9. I think that issue is solved with the changes I just pushed.

@JATothrim
Contributor

@snowmanam2
The most interesting thing is the fact that this seems to be an embarrassingly parallel solution.
I have an 8-core machine with 16 threads, and this scaled to ~1200% CPU usage for N=14, which is really good.
N=14 completed in 328 seconds.
It would go faster, but writing the file is becoming a bottleneck.

Could this be scaled horizontally by just serving the base pcubes from a job server? The worker machines could keep their split output locally and, after computation, switch into a "serving mode" for the next run.

I did notice the segfault with -i <cachefile> earlier 😉
Today's pull still segfaults on some input cache files.

May I suggest that you write an export tool that exports the data in the ".pcube" format? (described in #17 (comment))
A bunch of other sub-projects expect the ".pcube" format.
It does support gzip compression, but it is otherwise quite "fluffy" in space usage. If your format compresses better, then a "reader" program that just pipes out the data in this format would suffice.
I made issue #48 where the file formats used (like yours) can be documented.

@snowmanam2
Author

For the write performance, I think there might be a few different factors at play. I experimented by increasing the write buffers (WRITE_COUNT) in "thread_pool.c" and it seems to have some slight improvement, but I'm not sure how it would translate on a more powerful processor. Further experimentation with the thread count might yield some improvement in results. I also considered buffering the results in the worker threads themselves, though I'm not sure if that will help anything.

I can see how this could be split by partitioning the original data only once by say, some equal number of polycubes. I opted not to do this at present so I'm not left with 1 or 2 threads processing the last bits of data when the others are all done. A server approach might have similar issues if the work wasn't divided evenly enough. On the other extreme end, if you made the partition size too small, the overhead could eat into the performance. I also see that such systems were otherwise discussed in some of the other issues on this project.

I added the ability to read/write basic .pcube files today by just using the .pcube extension in the -i or -o flags. My original format does have a slight size advantage, but I might as well make everything compatible. I still haven't implemented the compression because I haven't really researched how to work with zlib yet. The code was generally kind of thrown together, too, so I might want to refactor a bit in the future and add some more error checks.

@JATothrim
Contributor

I pulled today and looked at the code.
I noticed the progress bar is still gone? I hoped it would make a comeback. 😉

polycube_generator 14 -i cubes13.dat -o cubes14.pcube completed in 256 seconds. Quite a significant improvement.
polycube_generator 14 -i cubes13.dat -o cubes14.dat, however, completed in 356 seconds.
Note: I compiled with -flto (LTO mode).
The BitFace compression looks to have a significant performance penalty.

  • cubes14.dat ==> 9.4 GB
  • cubes14.pcube ==> 12.4 GB
  • cubes14.pcube.gz ==> 4.5 GB (pigz -k7 cubes14.pcube)

So if you do implement zlib for the .pcube format, it will likely beat the BitFace format in space usage, but it will be slower at high compression ratios.

About the I/O:
The fwrite()/fread() APIs cap the I/O depth at 1, while NVMe devices can sustain hundreds.
The workaround for an I/O depth of one is more buffering.
However, at some point the buffering won't help anymore: the cost of memcpy'ing the data around becomes greater than just doing the write() syscall. (Efficient I/O tries to avoid memcpy as much as possible.)

pwrite() and pread() don't have the single-thread limitation and can be called in parallel on the same file handle.
(See the man page of pread.)
These are, however, low-level POSIX APIs, and some buffering is still required to avoid syscall overhead. You could also just split the output into multiple files and write to them in parallel. This would also let the compression run in parallel?
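As a rough illustration of the pwrite() point (not code from either project; `Job`, `writer`, and `demo` are invented names): each thread writes at its own precomputed offset on a shared descriptor, so no file-position lock is needed.

```c
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

/* Each worker gets a precomputed byte offset; pwrite() never touches
 * the shared file position, so the calls can run concurrently. */
typedef struct { int fd; off_t off; const char *buf; size_t len; } Job;

static void *writer(void *arg) {
    Job *j = (Job *)arg;
    /* Sketch: a real implementation would check the return value. */
    pwrite(j->fd, j->buf, j->len, j->off);
    return NULL;
}

/* Two threads write disjoint ranges of the same file in parallel,
 * then we read the result back.  Returns 0 on success. */
int demo(const char *path) {
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) return -1;
    Job a = { fd, 0, "AAAA", 4 }, b = { fd, 4, "BBBB", 4 };
    pthread_t ta, tb;
    pthread_create(&ta, NULL, writer, &a);
    pthread_create(&tb, NULL, writer, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    char out[9] = {0};
    pread(fd, out, 8, 0);
    close(fd);
    return strcmp(out, "AAAABBBB") == 0 ? 0 : 1;
}
```

The offsets work here because BitFace records are fixed width per polycube, so a worker can compute where its batch lands before writing.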

About the thread pool/work scheduling:
I would not worry about the threads getting "de-synced" at the end. This will always happen, because the seed pcubes are different and thus take a varying amount of time to process. Because of that, the worker threads must be able to "steal" or dequeue work independently of the others. If one worker is stuck longer for any reason, the others must be able to pick up the slack.

If the worker completion-time difference is a problem, you could consider fetching the seed pcubes via atomic pointer/index increments from a shared chunk of seed pcubes. This would be the "fairest" way to distribute the seed pcubes to the worker threads. Synchronization with atomics and locks is a non-trivial task, however...
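The atomic-increment idea could be sketched like this with C11 atomics (all names invented): each worker claims the next seed index with atomic_fetch_add, so distribution is automatic and lock-free, and no worker sits idle while seeds remain.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_SEEDS 1000          /* stand-in for the shared chunk of seeds */

static atomic_int  next_seed;   /* next unclaimed seed index */
static atomic_long total;       /* stands in for accumulated results */

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        int i = atomic_fetch_add(&next_seed, 1);  /* claim one seed */
        if (i >= NUM_SEEDS) break;                /* no seeds left */
        atomic_fetch_add(&total, (long)i);        /* "process" seed i */
    }
    return NULL;
}

/* Launch nthreads workers (nthreads <= 16) and return the combined result. */
long run_demo(int nthreads) {
    pthread_t t[16];
    atomic_store(&next_seed, 0);
    atomic_store(&total, 0);
    for (int i = 0; i < nthreads; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++) pthread_join(t[i], NULL);
    return atomic_load(&total);
}
```

Every seed is claimed exactly once regardless of thread count, so the sum 0+1+...+999 = 499500 comes out the same for any nthreads.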

@snowmanam2
Author

The progress updates were there for normal "bottom-up" generation, but I disabled them for input cache files because I didn't know the input file length. I found ways around that now:

  • By writing a fixed-width version of the LEB128 count in pcube files. Kind of hackish, but it seems to work, and other implementations seem to read it without issue. Note that old cache files still don't have this data and won't show progress updates.
  • By using stat.h on BitFace files, which works because the data is fixed width per polycube.
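A guess at what the fixed-width LEB128 trick looks like (function names are mine, not from the repository): always emit a fixed number of bytes, setting the continuation bit on every byte but the last. A standard LEB128 reader decodes this unchanged, since the padding bytes are zero payloads with the continuation bit set, which would explain why other implementations read it without issue.

```c
#include <stdint.h>

/* Write v as exactly `width` LEB128 bytes (v must fit in 7*width bits). */
void leb128_write_fixed(uint64_t v, unsigned char *out, int width) {
    for (int i = 0; i < width; i++) {
        out[i] = (v >> (7 * i)) & 0x7F;      /* next 7-bit group */
        if (i < width - 1) out[i] |= 0x80;   /* continuation bit */
    }
}

/* Plain LEB128 reader: accumulate 7-bit groups until a byte
 * without the continuation bit. */
uint64_t leb128_read(const unsigned char *in) {
    uint64_t v = 0;
    int shift = 0;
    for (;;) {
        v |= (uint64_t)(*in & 0x7F) << shift;
        if (!(*in & 0x80)) return v;
        in++;
        shift += 7;
    }
}
```

The count can then be patched in place once generation finishes, since its width no longer depends on its value.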

I still need to add progress updates for the conversion option (equal N but different formats), though that's just because it bypasses the thread pool altogether.

Compression is now allowed for writing pcube files using the -oz option, and reading is automatic. There is a definite performance penalty, but now everything should hopefully comply with the spec (more testing needed to verify). The zlib stuff was more involved to implement than I hoped, so there's now another abstraction layer that I'll probably regret adding - especially if I try to start using pwrite.

@JATothrim
Contributor

I tested using zlib deflate settings level=2 and memLevel=9.
This increased the N=14 output to 5.2 GB:

Starting file reader in PCube mode with compression.
Starting file writer in PCube mode with compression.
Using thread pool with 16 threads to generate n=14 from n=13
Found 138462649 polycubes in input file
1039496297 polycubes found of length 14                      
546 seconds elapsed

The program seems to no longer be I/O bound, but the scalability caps at ~800%.
I think the pwrite() idea doesn't mix well with zlib compression, since it is near impossible to splice two zlib streams together.

Looking at thread_pool_push_output(), it looks to have a locking-order hazard:

lock(&pool->output_lock);
lock(&pool->write_lock);
unlock(&pool->output_lock);
unlock(&pool->write_lock);

And indeed, doing unlock(output_lock) before unlock(write_lock) seems to cause repeated data races.
I regard unlocking multiple mutexes in a different order than you acquired them as a 100% sure way to eventually break the code.
Moving unlock(output_lock) after if(...) { unlock(write_lock) } silences the -fsanitize=thread errors, but there are still a few that don't repeat.

To solve the scaling issue, I suggest you implement an output queue/circular buffer:

  • Worker threads calloc() and fill an OUTPUT_CACHE slab of pcubes, then push this slab into a queue.
    The slab is private to the worker until it is pushed!
  • The slab queue/circular buffer is locked only while pushing/popping a slab of pcubes.
  • A single fixed-role thread is allowed to pop from that slab queue/buffer; it also runs the deflate compression.
    (It also frees the slab memory.)

The above allows workers to continue pushing slabs of pcubes while one slab is simultaneously being deflated.
Workers don't have to wait for the deflate + I/O to complete. (At least until the slab queue is full.)
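One possible shape for that slab queue (a sketch under my own naming, not code from either repository): the mutex is held only for the push/pop itself, and the consumer blocks on a condition variable until a slab arrives or the producers signal completion.

```c
#include <pthread.h>
#include <stddef.h>

#define QCAP 8   /* slots in the circular buffer */

typedef struct { unsigned char *data; size_t len; } Slab;

typedef struct {
    Slab buf[QCAP];
    int head, tail, count, done;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} SlabQueue;

void sq_init(SlabQueue *q) {
    q->head = q->tail = q->count = q->done = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Worker side: blocks only while the queue is full. */
void sq_push(SlabQueue *q, Slab s) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP) pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = s;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Compressor side: returns 0 once the queue is drained and finished. */
int sq_pop(SlabQueue *q, Slab *out) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->done) pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count == 0) { pthread_mutex_unlock(&q->lock); return 0; }
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return 1;
}

/* Called after the last worker finishes, so the consumer can exit. */
void sq_finish(SlabQueue *q) {
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}
```

Note the lock/unlock pairing is strictly nested here, which sidesteps the ordering hazard discussed above.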

@snowmanam2
Author

For the data race / locking stuff, I actually tracked down the fairly cryptic ThreadSanitizer warnings I got to overflowing the input buffer in the new zlib input stream implementation. There was also unprotected access to the progress bar "total_input_index" variable, though ThreadSanitizer made that really easy to find. Changes were pushed last night for that specifically, though I know I should probably rework the locking logic regardless.

Some explanation: the thread pool itself actually has 2 buffers, an "output" buffer and a "write" buffer. When the "output" buffer fills, the pointers get swapped and the unlucky worker goes to work on writing the data from the "write" buffer. While that worker is writing, the "output" buffer is free to accept data from the other workers. Because the buffers are equal size, the hope is that the fresh "output" buffer won't fill by the time the "write" buffer has been written. The output lock is released once the output buffer / variables are free to modify, but the write lock is kept to stop other threads from messing with the write buffer.

By moving the output lock release below the write lock release, the effect is actually to block all threads after generation during the entire write process (a quick test for N=12 showed ~40% slower; some N=13 tests were only ~10% slower). I'm not sure if moving the buffering to the workers will really help much.

There might be a way to do parallel compression. Pigz does just that by keeping the preceding 32 KB of uncompressed data per thread to provide context. If running big enough blocks, maybe the performance would be worth it? Then again, I can only imagine the complexity required to actually pull this off.

As a test of trying to parallelize the pack + write step, I made a very rough testing branch (test-pwrite) to see how much I can decouple the threads. I moved all of the pack + write work into the workers finishing in pwrite calls, and the only shared state for output is the total count and the write offset. It seems to work pretty well, though output compression no longer works as a temporary side effect. Even if this isn't the way to go, it's still probably best to keep the packing / serialization work in the workers.

@JATothrim
Contributor

JATothrim commented Sep 9, 2023

This thread is getting rather long..
I don't see a reason to continue this discussion here, since the code doesn't actually live here.
I made snowmanam2/SnowmanPolycubeGenerator#1 (comment) to continue this discussion in @snowmanam2's repository.
