Split downloads across mirrors #27

Open
newsch opened this issue Aug 22, 2023 · 2 comments

Comments

newsch (Collaborator) commented Aug 22, 2023

As discussed in #22, Wikipedia has a limit of 2 concurrent connections and seems to rate-limit each to about 4 MB/s. There are at least two mirrors of the Enterprise dumps.
For the fastest speeds, ideally we could split downloads between Wikipedia and the mirrors, or even download different parts of the same file concurrently, as aria2c does.
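
For illustration, the segmented downloading that aria2c does amounts to issuing parallel HTTP Range requests for slices of one file. A minimal sketch, assuming the server supports Range requests and reports Content-Length (aiohttp is an illustrative dependency, not a project decision):

```python
import asyncio

import aiohttp  # illustrative dependency, not a project decision


async def fetch_range(session: aiohttp.ClientSession, url: str,
                      start: int, end: int, buf: bytearray) -> None:
    # Ask the server for one inclusive byte range of the file.
    async with session.get(url, headers={"Range": f"bytes={start}-{end}"}) as resp:
        resp.raise_for_status()
        buf[start:end + 1] = await resp.read()


async def segmented_download(url: str, dest: str, parts: int = 4) -> None:
    async with aiohttp.ClientSession() as session:
        # Learn the total size, then fetch `parts` slices concurrently.
        async with session.head(url) as resp:
            size = int(resp.headers["Content-Length"])
        buf = bytearray(size)  # whole file in memory; fine for a sketch
        step = -(-size // parts)  # ceiling division
        await asyncio.gather(*(
            fetch_range(session, url, s, min(s + step, size) - 1, buf)
            for s in range(0, size, step)
        ))
        with open(dest, "wb") as f:
            f.write(buf)
```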

Unfortunately, none of the parallel downloaders I've seen allows setting per-host connection limits (e.g. 2 for dumps.wikimedia.org, 4 for the rest).

So besides writing our own downloader (a rough sketch of which follows this list), to respect the Wikimedia limits we could:

  • Keep the 2-thread limit and divide the files across the available hosts
  • Increase the 2-thread limit and use dumps.wikimedia.org for only two files
  • Increase the 2-thread limit and don't use dumps.wikimedia.org for any files
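
For reference, a per-host cap is not hard to enforce in a custom downloader: one asyncio.Semaphore per host does it. A minimal sketch, where the limit values (2 for dumps.wikimedia.org, 4 elsewhere) follow the example above and aiohttp is again an illustrative assumption:

```python
import asyncio
from urllib.parse import urlparse

import aiohttp  # illustrative dependency, not a project decision

# Per-host connection caps, per the example above: 2 for
# dumps.wikimedia.org, 4 for everything else.
HOST_LIMITS = {"dumps.wikimedia.org": 2}
DEFAULT_LIMIT = 4

_semaphores: dict[str, asyncio.Semaphore] = {}


def _semaphore_for(url: str) -> asyncio.Semaphore:
    """Return the semaphore that caps concurrent connections to this URL's host."""
    host = urlparse(url).hostname or ""
    limit = HOST_LIMITS.get(host, DEFAULT_LIMIT)
    return _semaphores.setdefault(host, asyncio.Semaphore(limit))


async def download(session: aiohttp.ClientSession, url: str, dest: str) -> None:
    # Hold the host's semaphore for the whole transfer, so at most
    # N connections are ever open to a given host.
    async with _semaphore_for(url):
        async with session.get(url) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as f:
                async for chunk in resp.content.iter_chunked(1 << 20):
                    f.write(chunk)


async def download_all(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(
            download(session, u, u.rsplit("/", 1)[-1]) for u in urls
        ))
```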
biodranik (Member) commented

What is the simplest solution?

newsch (Collaborator, Author) commented Sep 1, 2023

The simplest is to use only a single host.
Beyond that, I think the second option would provide the best throughput increase while staying relatively straightforward; a sketch of the file-to-host assignment follows.
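
For concreteness, a minimal sketch of that assignment. The two-file cap for dumps.wikimedia.org comes from the connection limit discussed above; the helper name and the round-robin spread over mirrors are just illustrative choices:

```python
WIKIMEDIA = "https://dumps.wikimedia.org"


def assign_hosts(files: list[str], mirrors: list[str]) -> dict[str, str]:
    """Send at most two files to dumps.wikimedia.org (its connection
    limit) and spread the rest round-robin across the mirrors."""
    assignment = {}
    for i, name in enumerate(files):
        assignment[name] = WIKIMEDIA if i < 2 else mirrors[(i - 2) % len(mirrors)]
    return assignment
```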
