.download() scalability ? #51

Closed
hugolpz opened this issue Dec 27, 2021 · 6 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@hugolpz
Collaborator

hugolpz commented Dec 27, 2021

Hi there,
I'm using WikiapiJS to write a wikiapi-egg (script) that downloads all Commons files from a set of target categories. My 3 largest target categories currently hold about 50k audio files each, at about 1.5 KB per file. Do you know:

  • Q1: Was WikiapiJS designed with this kind of scale in mind?
  • Q2: What are the API limitations for such a mass download?
    • Listing items: the MediaWiki API has a Categorymembers limit: cmlimit=500 for regular users, cmlimit=5000 with the apihighlimits user right.
    • Downloads: I don't see any limits on the downloads themselves.
  • Q3: Do you successfully handle categories with more than 500 files? (API limit)
  • Q4: Do you skip already downloaded files efficiently? (quickly)
  • Q5: Do you compare local and remote file dates, so that a file is re-downloaded from Commons when a new version is available?
  • Q6: What should I avoid so as not to get blocked?
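
On Q3, the 500-item cap applies per request, not per category: the MediaWiki API returns a `continue` object (with `cmcontinue`) that is sent back on the next request until it disappears. Here is a minimal sketch of that loop, not WikiapiJS's own code. `listCategoryFiles` and the injected `fetchJson` helper are hypothetical names, used so the loop can be exercised without network access.

```javascript
// Sketch: list every file of a category by following `cmcontinue`.
// `fetchJson(url)` is a hypothetical helper that must resolve to the
// parsed JSON response of the MediaWiki API.
async function listCategoryFiles(categoryTitle, fetchJson, limit = 500) {
  const titles = [];
  let cont = {}; // continuation parameters, empty for the first request
  do {
    const params = new URLSearchParams({
      action: 'query',
      list: 'categorymembers',
      cmtitle: categoryTitle,
      cmtype: 'file',
      cmlimit: String(limit),
      format: 'json',
      ...cont,
    });
    const data = await fetchJson(
      'https://commons.wikimedia.org/w/api.php?' + params
    );
    for (const member of data.query.categorymembers) {
      titles.push(member.title);
    }
    // The API omits `continue` on the last page.
    cont = data.continue || null;
  } while (cont);
  return titles;
}
```

With a real `fetchJson` (for example one built on `fetch`), a 50k-file category would need about 100 requests at cmlimit=500.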

Scale up

The goal is to give the public direct and convenient dumps of LinguaLibre's audio assets on a per-language basis. We want to create periodic (weekly?) dumps on our Lili server.

We want to keep a local dump synchronized with Wikimedia Commons. We are talking about 700,000 files so far. Going by the benchmark timings below, the initial synchronization would take about 21 days, which is acceptable.
But an update a week later would still take about 15 days, even though only 1~2% of the files (7,000-15,000) would be new and need downloading.

Do you see any possible optimizations?
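
One optimization I can think of for the update case: read the local directory once into a set, then download only the titles that are absent from it. The sketch below is an illustration under assumptions, not how WikiapiJS works internally. `findMissing` and `titleToFileName` are hypothetical names, and the title-to-file-name mapping (strip `File:`, spaces to underscores) is a guess that has to match however the files were actually saved.

```javascript
// Assumption: local files are named after the page title, without the
// "File:" prefix and with spaces replaced by underscores.
function titleToFileName(title) {
  return title.replace(/^File:/, '').replace(/ /g, '_');
}

// Return the remote titles that have no matching local file.
// Building the Set once makes each lookup O(1) instead of one
// file-system check per category member.
function findMissing(remoteTitles, localFileNames) {
  const local = new Set(localFileNames);
  return remoteTitles.filter((title) => !local.has(titleToFileName(title)));
}
```

The local names could come from a single `fs.readdirSync(directory)` call, so a weekly update would cost one directory read plus the category listing, and downloads only for the new files.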

WikiapiJS download worked on tiny categories (12 files). See the code in #48.
I'm currently reluctant to test further for fear of being banned.


.download() benchmark (1)

Ok, I decided to test anyway on a category with n=369.

  • Initial attempt :
    • categorymembers=369
    • downloads=369
    • running time: 16 min (960 s), about 2.6 s/file
  • Removed 14 files from local directory
  • Update attempt:
    • categorymembers=369
    • downloads=14
    • running time: 9 min (540 s), about 38.6 s per downloaded file
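
For reference, the per-file rates and the extrapolation to the full set can be reproduced from the figures above. This is naive linear scaling, which assumes the update run's cost is dominated by walking all 369 category members rather than by the 14 downloads; `perItem` and `extrapolateDays` are names made up for this sketch.

```javascript
const SECONDS_PER_DAY = 86400;

// Average cost of one item, in seconds.
function perItem(totalSeconds, count) {
  return totalSeconds / count;
}

// Linear extrapolation of a per-item cost to `itemCount` items, in days.
function extrapolateDays(secondsPerItem, itemCount) {
  return (secondsPerItem * itemCount) / SECONDS_PER_DAY;
}

const initialRate = perItem(960, 369); // seconds per downloaded file
const updateRate = perItem(540, 369); // seconds per category member walked

console.log(initialRate.toFixed(2)); // 2.60
console.log(perItem(540, 14).toFixed(1)); // 38.6
console.log(extrapolateDays(initialRate, 700000).toFixed(1)); // 21.1
console.log(extrapolateDays(updateRate, 700000).toFixed(1)); // 11.9
```

Either way, a weekly update that takes well over a week cannot keep up, so the time spent on files that are already present is the part worth optimizing.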
@hugolpz hugolpz added the question Further information is requested label Dec 27, 2021
@kanasimi kanasimi added the enhancement New feature or request label Dec 27, 2021
@kanasimi
Owner

kanasimi commented Dec 27, 2021

  1. I did not have this use case in mind, but I left room for it.
  2. As you know, I have the apihighlimits permission on Wikimedia Commons, so I often use it... but it should work for users without the apihighlimits permission too. There should be no limit on downloading large categories.
  3. I have processed categories with 100K+ files.
  4. Yes, the library skips files that already exist.
  5. No, I have not coded this yet.
  6. I have never heard of Wikimedia Commons blocking people for downloading files, so...
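
For point 5, one possible approach (a sketch only, nothing like this exists in the library according to the answer above) is to compare the local file's modification time with the upload timestamp that the MediaWiki API reports via `prop=imageinfo&iiprop=timestamp`. `isStale` is a hypothetical name; the local time is assumed to come from `fs.statSync(path).mtimeMs`, and `null` stands for a file that does not exist locally.

```javascript
// Decide whether a file must be (re-)downloaded.
// localMtimeMs: local modification time in milliseconds, or null if the
//               file is not present locally (assumption of this sketch).
// remoteTimestamp: ISO 8601 upload time of the latest remote version,
//                  e.g. "2021-12-27T10:00:00Z".
function isStale(localMtimeMs, remoteTimestamp) {
  if (localMtimeMs === null) return true; // never downloaded
  return Date.parse(remoteTimestamp) > localMtimeMs; // newer version uploaded
}
```

This only works if the local modification time reflects when the file was downloaded, so anything that rewrites the files afterwards would defeat the check.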

@kanasimi
Owner

What categories do you want to synchronize?

@hugolpz
Collaborator Author

hugolpz commented Dec 28, 2021

The aim is to provide convenient dumps for each category in Category:Lingua_Libre_pronunciation.
The largest ones hold 60k (ben) and 250k (fra) files. Together, the 130 categories contain 700,000 files.

What matters for WikiapiJS .download() is scalability: the ability to handle such large categories with resilience and speed, both for the initial download and for the later periodic updates, ideally weekly.
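
To illustrate what I mean by resilience: over hundreds of thousands of requests some will fail transiently, so each download should be retried a few times with a growing delay, and the run should continue and report what failed instead of aborting. A rough sketch, assuming a hypothetical `downloadOne(title)` function that returns a promise; `withRetry` and `downloadAll` are made-up names, and downloads stay sequential on purpose to remain gentle on the servers.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` up to `attempts` times, doubling the delay after each failure.
async function withRetry(task, { attempts = 4, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No need to wait after the final failed attempt.
      if (attempt < attempts - 1) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Download titles one by one; return the titles that still failed
// after all retries so they can be retried in a later run.
async function downloadAll(titles, downloadOne, options) {
  const failed = [];
  for (const title of titles) {
    try {
      await withRetry(() => downloadOne(title), options);
    } catch (error) {
      failed.push(title);
    }
  }
  return failed;
}
```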

@kanasimi
Owner

Well, it seems I need to do some work...

@hugolpz
Collaborator Author

hugolpz commented Jan 3, 2022

Nice !

Repository owner deleted a comment from kanasimi Jan 13, 2022
Repository owner deleted a comment from kanasimi Jan 13, 2022
Repository owner deleted a comment from kanasimi Jan 13, 2022
Repository owner deleted a comment from kanasimi Jan 14, 2022
Repository owner deleted a comment from kanasimi Jan 14, 2022