Speed-up zimdump -D (fs dumping feature) #69

Open
lidel opened this issue Nov 8, 2019 · 9 comments
@lidel lidel commented Nov 8, 2019

Issue extracted from ipfs/distributed-wikipedia-mirror#66

zimdump feels slower than it could be.
Below are some notes from my tests and ideas on how to improve its performance.

Single thread? Lack of buffer in front of disk writes?

I have an SSD, but my disk I/O remains quite slow (iotop shows disk writes at <400 K/s!).
The tool seems to be CPU-bound: a single core is used and sits constantly at 100%, while the remaining 7 cores stay idle. It looks like it is single-threaded and perhaps flushes to disk after each write?
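To illustrate the suspected flush-per-write anti-pattern, here is a minimal C++ sketch (not zimdump code; the file name and sizes are arbitrary): small writes go through a large user-space buffer so the OS sees a few big writes, whereas an explicit `flush()` inside the loop would force a syscall per chunk.

```cpp
#include <cstdio>
#include <fstream>
#include <vector>

// Write `nchunks` chunks of `chunk_size` bytes through a 1 MiB user-space
// buffer; returns total bytes written. Uncommenting the flush() inside the
// loop reproduces the suspected anti-pattern: a syscall per chunk.
std::size_t write_buffered(const char* path, std::size_t nchunks = 1000,
                           std::size_t chunk_size = 4096) {
    std::vector<char> buf(1 << 20);                  // 1 MiB buffer
    std::vector<char> chunk(chunk_size, 'x');        // dummy payload
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf.data(), buf.size());  // must precede open()
    out.open(path, std::ios::binary);
    for (std::size_t i = 0; i < nchunks; ++i) {
        out.write(chunk.data(), chunk.size());
        // out.flush();  // <-- per-write flush: one syscall per chunk
    }
    out.close();         // single flush at the end
    std::remove(path);   // clean up the scratch file
    return nchunks * chunk_size;
}
```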

Benchmarks

Unpacking wikipedia_en_top_mini_2019-09.zim (250M) took nearly 30 minutes:

$ zimdump -V
1.0.5
$ time zimdump -D out-zimdump wikipedia_en_top_mini_2019-09.zim
1573.26s user 47.65s system 99% cpu 27:14.82 total

$ du -sh out-zimdump
758M    out-zimdump

This is super slow compared to the Rust-based, multicore extract_zim from dignifiedquire/zim. That tool produces some errors and skips some files (it is no longer maintained), but it is able to extract most of the archive in under 10 seconds(!):

$ time extract_zim --skip-link wikipedia_en_top_mini_2019-09.zim --out out-extract_zim
Extracting file: wikipedia_en_top_mini_2019-09.zim to out-extract_zim

  Creating map
  Extracting entries: 243
  Spawning 243 threads
...
5.91s user 1.87s system 635% cpu 1.223 total

$ du -sh out-extract_zim
726M    out-extract_zim

Things to try

Applying some or all of the optimizations from dignifiedquire/zim should make zimdump much, much faster. That tool:

  • builds a map of the file first, identifying individual clusters
  • creates a pool of 16 workers that process clusters in parallel
  • all writes to disk are buffered in memory and periodically flushed
    (in Rust this is provided by BufWriter)
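The worker-pool idea above can be sketched in C++ (the `Cluster` type and counts are hypothetical stand-ins, not the libzim API): workers claim cluster indices from a shared atomic counter, so each cluster is decompressed exactly once by exactly one thread.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stand-in for one compressed cluster holding several article blobs.
using Cluster = std::vector<std::string>;

// Process every cluster with `nworkers` threads; returns blobs handled.
std::size_t dump_clusters(const std::vector<Cluster>& clusters,
                          unsigned nworkers) {
    std::atomic<std::size_t> next{0};   // next unclaimed cluster index
    std::atomic<std::size_t> blobs{0};  // total blobs "written"
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < nworkers; ++w) {
        pool.emplace_back([&] {
            // Each worker claims clusters until the counter runs out.
            for (std::size_t i = next++; i < clusters.size(); i = next++) {
                // Here the real tool would decompress cluster i once and
                // write each blob through a per-thread buffered writer.
                blobs += clusters[i].size();
            }
        });
    }
    for (auto& t : pool) t.join();
    return blobs;
}
```

A per-thread buffered writer (as in the previous sketch) keeps the workers from contending on a shared output stream.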
@lidel lidel mentioned this issue Nov 8, 2019
@kelson42 kelson42 added the enhancement label Nov 9, 2019
@kelson42 kelson42 changed the title zimdump performance Speed-up zimdump -D (fs dumping feature) Nov 9, 2019
@kelson42 kelson42 added the IPFS label Nov 9, 2019
@kelson42 kelson42 commented Nov 9, 2019

@lidel Thank you for this quality ticket, I'm supportive. I will do my best to get this done in January.

@kelson42 kelson42 pinned this issue Nov 9, 2019
@momack2 momack2 commented Jan 7, 2020

Thanks @kelson42!! Curious how things are evolving now that we're in the new year - is this still on your agenda this month?

@kelson42 kelson42 commented Jan 7, 2020

@momack2 I would like to, but we are a bit short on C++ resources at the moment. It has been postponed to February for now. If you can recommend someone, please tell us!

@momack2 momack2 commented Jan 7, 2020

I don't know of any C++ devs with bandwidth, but @jnthnvctr might be able to suggest other routes to get this work more attention. We'd really love to update our distributed wikipedia mirror with snapshots more recent than 2017... ;)

@kelson42 kelson42 commented Jan 8, 2020

@momack2 It is just a "small" delay, and we are already working to find someone. Maybe you can retweet https://twitter.com/KiwixOffline/status/1214826834417860609

@mgautierfr mgautierfr commented Jan 8, 2020

I didn't know about the extract_zim tool. It's nice to see some Rust around ZIM (even if it is no longer maintained).

I agree with this ticket: zim_tools is a small set of tools, and we can improve it a lot.
Looping over articles in cluster-index order instead of URL order is already done in the zimrecreate tool; it should not be too difficult to reuse that.
At a minimum we would then benefit from the libzim cache system and avoid decompressing the same cluster several times.
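The cluster-index ordering can be sketched like this (the `Entry` struct is a hypothetical stand-in, not a libzim type): sorting entries by cluster before dumping means consecutive entries hit the same decompressed cluster, so a cluster cache always hits.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal stand-in for a dump-list entry: which cluster holds the
// article's data, and which blob inside that cluster it is.
struct Entry {
    std::size_t cluster;  // index of the compressed cluster
    std::size_t blob;     // blob index within that cluster
};

// Reorder entries so all blobs of a cluster are dumped back to back,
// letting each cluster be decompressed exactly once.
void sort_by_cluster(std::vector<Entry>& entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) {
                  return a.cluster != b.cluster ? a.cluster < b.cluster
                                                : a.blob < b.blob;
              });
}
```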

@orionseye orionseye commented Jan 14, 2020

Heads-up on this: extract_zim is a beautiful, lightning-fast tool.
It took no longer than 1 minute to install and run, and then extracted wikivoyage.zim (800 MB) in less than... guess?

10 sec

Structure?
It avoids zimdump's URI-encoded file names; instead it 'reads' the URI part and creates a proper directory structure. For example, directory '-' contains:
/j
/s
favicon
style.css

@momack2 momack2 commented Jan 16, 2020

@dignifiedquire - just FYI your past work is getting some ❤️

@dignifiedquire dignifiedquire commented Jan 19, 2020

I just made some updates which fix some of the missed files, though I still have to investigate what exactly the difference between the two tools' output is. I also made it a bit faster (on my machine).

$ time ./target/release/extract_zim --skip-link ~/Downloads/wikipedia_en_top_mini_2019-09.zim --out ./out
Extracting file: /Users/dignifiedquire/Downloads/wikipedia_en_top_mini_2019-09.zim to ./out

  Creating map
  Extracting entries: 243
  Spawning 243 tasks across 16 threads
  Extraction done in 3268ms
  Main page is index
./target/release/extract_zim --skip-link  --out ./out  6.37s user 16.20s system 684% cpu 3.296 total
$ command du -sh out
737M	out