
Smarter download approach #50

Closed
kelson42 opened this Issue Apr 9, 2017 · 9 comments

@kelson42 (Contributor) commented Apr 9, 2017

To download each book, gutenberg2zim tries a list of URL patterns where the book file(s) might be. We have no way to know, for sure, where the files are.

The reason behind this is that, over time, the Gutenberg project has used many different file-layout conventions.

Therefore, to download a file, the script has a hardcoded list of potential URLs and simply goes through each of them until one succeeds.

Most of the time, the script makes more than 4-5 tries before finding the right one. Each try costs 0.5-1 second.

That's why it would be nice to make a smart guess about which URL pattern in the list is most likely to work, and I propose a method: each time a URL pattern has proven successful for a book, move it to the top of the list, and always try the patterns from top to bottom (see the sketch below).

By doing so, we could have a much faster download process.
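
A minimal sketch of this move-to-front heuristic, in Python; the pattern list and function names here are illustrative, not gutenberg2zim's actual code:

    import requests

    # Hypothetical pattern list; the real one in gutenberg2zim is longer.
    URL_PATTERNS = [
        "http://gutenberg.readingroo.ms/cache/epub/{id}/pg{id}.txt",
        "http://gutenberg.readingroo.ms/files/{id}/{id}.txt",
        "http://gutenberg.readingroo.ms/files/{id}/{id}-0.txt",
    ]

    def download_book(book_id):
        """Try each pattern in order; promote the successful one to the front."""
        for index, pattern in enumerate(URL_PATTERNS):
            url = pattern.format(id=book_id)
            response = requests.get(url)
            if response.status_code == 200:
                # Move-to-front: the next book will most likely follow
                # the same convention and succeed on the first try.
                URL_PATTERNS.insert(0, URL_PATTERNS.pop(index))
                return response.content
        return None  # no pattern worked for this book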

kelson42 added the enhancement label on Apr 9, 2017

@kelson42 (Contributor, Author) commented Apr 13, 2017

The problem is pretty acute: in 4 days, I have downloaded only 10,000 books and have needed to restart the process many times.

@kelson42 (Contributor, Author) commented Apr 13, 2017

For now, only readingroo.ms is used, but the requests should also make use of the other mirrors listed at http://gutenberg.readingroo.ms/MIRRORS.ALL.utf8

@kelson42 (Contributor, Author) commented Apr 14, 2017

@rgaudin yes, if the list is clearly written in the source code, it might be easier - even for someone like me - to update it.

@rgaudin (Collaborator) commented Apr 15, 2017

Ultimately, there are only two complete mirrors. Since those are not meant to be scraped completely, they have usage restrictions. The second-largest language (fr), with a few thousand books, is usually crawled directly without trouble.

English, with tens of thousands of books, hits the usage limits. I believe the only viable solution for English (most of the whole project anyway) is to create a local mirror from rsync://readingroo.ms/gutenberg and use it.
We would need to add an option to copy/link from that location directly instead of making HTTP requests (see the sketch below).
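
A minimal sketch of such an option, in Python; the function and parameter names are illustrative, not existing gutenberg2zim code:

    import os
    import shutil

    def fetch_from_mirror(relative_path, destination, mirror_dir):
        """Hard-link (or copy) a file from a local rsync mirror
        instead of downloading it over HTTP."""
        source = os.path.join(mirror_dir, relative_path)
        if not os.path.exists(source):
            return False  # caller falls back to the HTTP download path
        try:
            os.link(source, destination)      # cheap hard link on the same filesystem
        except OSError:
            shutil.copy(source, destination)  # fall back to a plain copy
        return True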

What do you think?

@kelson42 (Contributor, Author) commented Apr 15, 2017

@rgaudin Do we have an idea of the total size?

@rgaudin (Collaborator) commented Apr 17, 2017

Hum, it is apparently 764 GB at the moment, so not really a good alternative.
Clever use of --exclude could drastically reduce that size, but I believe the effort should be put into our crawler's logic.

The easiest next step from here would be to pause and continue the download process when receiving an HTTP 429 status code (sketched below).
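
A minimal sketch of that pause-and-continue behaviour, assuming the Python requests library; the function name and delays are illustrative:

    import time
    import requests

    def get_with_backoff(url, max_retries=5, base_delay=30):
        """GET a URL, pausing and retrying when the server answers 429."""
        for attempt in range(max_retries):
            response = requests.get(url)
            if response.status_code != 429:
                return response
            # Honour the server's Retry-After header when present,
            # otherwise back off exponentially.
            delay = int(response.headers.get("Retry-After",
                                             base_delay * 2 ** attempt))
            time.sleep(delay)
        raise RuntimeError("still rate-limited after %d retries: %s"
                           % (max_retries, url))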

@kelson42, do you agree?

@kelson42 (Contributor, Author) commented Apr 17, 2017

@rgaudin yes

@dattaz (Member) commented Jul 2, 2017

Maybe we can use a mix of rsync and HTTP, by guessing the URL locally:

  1. We get all file URLs using rsync: rsync -a --list-only rsync.mirrorservice.org::gutenberg.org
  2. Then we check, against the list (from 1.), which URL is the right one (see the sketch after the listing below).
  3. We fetch it from the server over HTTP.
$ time rsync -a --list-only rsync.mirrorservice.org::gutenberg.org | awk '{print $5}' > log_gutenberg

real	12m30,415s
user	0m21,080s
sys	0m11,652s

$ wc -l log_gutenberg
2208308 log_gutenberg
$ head -n 500 log_gutenberg
[...]
1/0/0/1006/old/2ddcc10.txt
1/0/0/1006/old/2ddcc10.zip
1/0/0/1007
1/0/0/1007/1007.txt
1/0/0/1007/1007.zip
1/0/0/1007/old
1/0/0/1007/old/3ddcc10.txt
1/0/0/1007/old/3ddcc10.zip
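
A sketch of step 2, assuming the digit-per-directory layout visible above (all digits of the ID except the last, then the ID itself) and an illustrative mirror base URL; the helper names are made up:

    def load_listing(path="log_gutenberg"):
        """Load the rsync file listing produced in step 1."""
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    def candidate_urls(book_id, listing, mirror="http://gutenberg.readingroo.ms"):
        """Return HTTP URLs for a book's files, resolved from the local listing."""
        # e.g. book 1007 lives under 1/0/0/1007/ per the output above
        prefix = "/".join(str(book_id)[:-1]) or "0"
        folder = "%s/%s" % (prefix, book_id)
        return [
            "%s/%s" % (mirror, entry)
            for entry in listing
            if entry.startswith(folder + "/") and "/old/" not in entry
        ]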

dattaz closed this in f9cd116 on Jun 22, 2018
