Speed-up zimdump -D (fs dumping feature) #69
@lidel Thank you for this quality ticket, I'm supportive. Will do my best to get this done in January.
Thanks @kelson42!! Curious how things are evolving now that we're in the new year - is this still on your agenda this month?
@momack2 I would like to, but we are a bit short on C++ resources currently. It has been postponed to February for the moment. If you can recommend someone, please tell us!
I don't know of any C++ devs with bandwidth, but @jnthnvctr might be able to suggest other routes to get this work increased attention. We'd really love to update our distributed wikipedia mirror with snapshots more recent than 2017... ;)
@momack2 It is just a "small" delay, and we are already working to find someone. Maybe you can retweet https://twitter.com/KiwixOffline/status/1214826834417860609
I didn't know about I agree with this ticket, |
Heads up for that.
Structure? |
@dignifiedquire - just FYI your past work is getting some |
Just made some updates which fix some missed files, though I still have to investigate what exactly the difference in output between the two tools is. I also made it a bit faster (on my machine).
zimdump feels slower than it could be. Below are some notes from my tests and ideas on how to improve its performance.
Single thread? Lack of buffer in front of disk writes?
I have an SSD, but my disk I/O remains pretty slow (iotop shows disk writes at <400 K/s!).
The tool seems to be limited by the CPU: a single core is used and is constantly at 100%, while the remaining 7 cores stay unused. It looks like it is single-threaded and perhaps flushing after each write to disk?
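To make the single-core observation concrete, here is a rough sketch of how entries could be spread across worker threads. This is not zimdump's code: the Entry struct and dump_parallel function are hypothetical names invented for illustration, the entries are plain in-memory strings rather than ZIM clusters, and whether the ZIM reading library can safely be used from several threads is not addressed here. In a real tool the CPU-heavy step (decompression) would sit inside the worker loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <thread>
#include <vector>

// Hypothetical in-memory stand-in for one archive entry (not a libzim type).
struct Entry {
    std::string relative_path;  // e.g. "A/index.html"
    std::string content;        // already-decoded bytes
};

// Writes every entry below out_dir using worker_count threads.
// Returns the number of files written without a stream error.
std::size_t dump_parallel(const std::vector<Entry>& entries,
                          const std::filesystem::path& out_dir,
                          unsigned worker_count) {
    assert(worker_count > 0);
    std::atomic<std::size_t> next{0};     // index of the next unclaimed entry
    std::atomic<std::size_t> written{0};  // successful writes across threads

    auto worker = [&]() {
        // Each thread claims the next unprocessed index until none remain.
        for (std::size_t i = next.fetch_add(1); i < entries.size();
             i = next.fetch_add(1)) {
            const std::filesystem::path target = out_dir / entries[i].relative_path;
            std::error_code ec;  // ignored: a failure shows up when opening the file
            std::filesystem::create_directories(target.parent_path(), ec);

            std::ofstream out(target, std::ios::binary | std::ios::trunc);
            out.write(entries[i].content.data(),
                      static_cast<std::streamsize>(entries[i].content.size()));
            out.close();  // flushes once per file, not once per write
            if (out) written.fetch_add(1);
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < worker_count; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return written.load();
}
```

Calling dump_parallel(entries, "out", std::thread::hardware_concurrency()) would use one worker per reported core; note that hardware_concurrency() may return 0, so a caller would need a fallback value.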
Benchmarks
Unpacking wikipedia_en_top_mini_2019-09.zim (250M) took nearly 30 minutes:
This is super slow compared to the Rust-based multicore extract_zim from dignifiedquire/zim. It produces some errors and skips some files (the tool is not maintained anymore), but is able to extract most of it in under 10 seconds(!):
Things to try
Applying some/all optimizations from dignifiedquire/zim should make zimdump much, much faster:
(in Rust this is provided by BufWriter)
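For the buffering idea, a rough C++ counterpart to what BufWriter does in Rust might look like the sketch below: many small writes go into a user-space buffer and are handed to the OS only when the buffer fills or the file is closed. The function name write_buffered and the 1 MiB default are assumptions made for illustration, not anything taken from zimdump. The effect of pubsetbuf is implementation-defined, so the buffer size should be read as a hint. The part that matters most is that there is no flush inside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Writes all chunks to `path` through one large user-space buffer.
// Returns true when the stream reported no error up to and including close().
bool write_buffered(const std::string& path,
                    const std::vector<std::string>& chunks,
                    std::size_t buffer_size = 1 << 20) {
    assert(buffer_size > 0);
    // Declared before the stream so it outlives the stream that uses it.
    std::vector<char> buffer(buffer_size);

    std::ofstream out;
    // Install the buffer before open(); on common standard libraries this is
    // the point where a custom buffer is honoured.
    out.rdbuf()->pubsetbuf(buffer.data(),
                           static_cast<std::streamsize>(buffer.size()));
    out.open(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;

    for (const std::string& chunk : chunks) {
        // Deliberately no flush() or std::endl here: flushing after every
        // write would force one system call per chunk.
        out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    }

    out.close();  // single flush of whatever is still buffered
    return !out.fail();
}
```

Whether this helps in practice depends on where the time actually goes. If the CPU is busy decompressing, buffering alone would not be expected to close the gap.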