Parallelize marker replacement #4

Closed
mxmlnkn opened this issue Nov 14, 2022 · 1 comment
Labels: performance (Something is slower than it could be)

Comments

mxmlnkn (Owner) commented Nov 14, 2022

The decoding works in two steps:

  1. Decode with a bogus backreference buffer initialized to 16-bit indexes.
  2. Replace those 16-bit indexes (markers) with the actual backreference contents.
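
To make step 2 concrete, here is a minimal sketch of what marker replacement could look like, assuming a hypothetical encoding where values below 256 are already-resolved literal bytes and larger values index into the previously decoded window (the actual pragzip marker format differs in detail):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical marker encoding: values < MARKER_BASE are literal bytes;
// values >= MARKER_BASE index into the preceding 32 KiB window.
constexpr uint16_t MARKER_BASE = 256;

void replaceMarkers( std::vector<uint16_t>& buffer,
                     const std::vector<uint8_t>& previousWindow )
{
    for ( auto& symbol : buffer ) {
        if ( symbol >= MARKER_BASE ) {
            // Resolve the marker to the actual backreference content.
            symbol = previousWindow[symbol - MARKER_BASE];
        }
    }
}
```

The loop touches every decoded symbol exactly once, so its throughput is essentially bound by memory bandwidth, which matches the multi-GB/s numbers below.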

Currently, the second step is done on the orchestrator thread, which might limit performance. In benchmarks, marker replacement runs at 12 GB/s, and compacting the buffers from the 16-bit storage type, which after replacement only contains 8-bit values, runs at 4 GB/s.
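
The compaction step mentioned above could look roughly like this sketch: once all markers are resolved, every 16-bit slot holds a value below 256, so the buffer can be narrowed to half its size (function and variable names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// After marker replacement, each uint16_t holds a plain byte value,
// so the buffer can be compacted to half the memory footprint.
std::vector<uint8_t> compactToBytes( const std::vector<uint16_t>& wide )
{
    std::vector<uint8_t> narrow( wide.size() );
    std::transform( wide.begin(), wide.end(), narrow.begin(),
                    [] ( uint16_t value ) { return static_cast<uint8_t>( value ); } );
    return narrow;
}
```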

  • This is quite fast already, so parallelizing might effectively yield (only) a factor-2 speedup. Furthermore, at this point, NUMA behavior might have to be considered for the ThreadPool.
  • Another problem is load balancing. Introducing yet another thread pool would oversaturate the processor, and shrinking the decoding thread pool to compensate would underutilize it. Therefore, it might be nice to reuse the existing thread pool for marker replacement. But then it would need some kind of priority system, because marker replacement should always have higher priority. We would also have to ensure that at least one thread can always decode, or else decoding would still slow down. Maybe the orchestrator thread can keep acting as the main marker replacer while also distributing further work into the thread pool. And if, even with higher priority, no worker has begun the marker replacement by the time the orchestrator thread has finished its own work, it should be possible to steal that work packet back from the thread pool and let the orchestrator thread do it; see the sketch after this list. This would also require some kind of work package ID for querying completion and for taking work back from the thread pool.
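
As a rough illustration of the steal-back idea, the sketch below wraps each marker-replacement job in a packet that exactly one party, either a pool worker or the orchestrator, can claim via an atomic flag. The `threadPool.submit` call and `Priority::HIGH` value are assumed for illustration and do not reflect pragzip's actual API:

```cpp
#include <atomic>
#include <functional>
#include <memory>

// A work packet that is run by whoever claims it first: either a
// thread-pool worker or the orchestrator thread stealing it back.
struct StealableTask
{
    std::function<void()> work;
    std::atomic<bool>     claimed{ false };
    std::atomic<bool>     finished{ false };

    // Returns true for exactly one caller, which then executes the task.
    bool tryRun()
    {
        bool expected = false;
        if ( !claimed.compare_exchange_strong( expected, true ) ) {
            return false;  // Already claimed by someone else.
        }
        work();
        finished.store( true );
        return true;
    }
};

// Orchestrator-side usage (threadPool.submit and Priority::HIGH are
// assumed interfaces, not pragzip's real ones):
//
//     auto task = std::make_shared<StealableTask>();
//     task->work = [=] { /* replace markers for one chunk */ };
//     threadPool.submit( [task] { task->tryRun(); }, Priority::HIGH );
//     /* ... orchestrator does its own marker replacement ... */
//     task->tryRun();  // Steal the packet back if no worker started it.
//     while ( !task->finished.load() ) { /* help with other packets */ }
```

The shared_ptr keeps the packet alive no matter which side runs it, and the `finished` flag stands in for the work-package-ID completion query mentioned above.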

All in all, this is slowly becoming an academic/high-performance-computing concern rather than one of general ratarmount/pragzip usage, but it would still be nice to have.

mxmlnkn added the performance label Nov 14, 2022
mxmlnkn (Owner) commented Jan 16, 2023

Implemented with mxmlnkn/indexed_bzip2@6cb4ab6

mxmlnkn closed this as completed Jan 16, 2023