`.download()` scalability ? #51

hugolpz · 2021-12-27T15:21:32Z

Hi there,
I'm using WikiapiJS to write a wikiapi-egg (script) that downloads all Commons files from target categories. My 3 largest target categories currently hold about 50k audio files each, each file being about 1.5 KB. Do you know:

Q1: Does WikiapiJS have this kind of scale in mind?
Q2: What are the API limitations for such a mass download?
- Listing items: the MediaWiki API caps Categorymembers results per request. cmlimit=500 for regular users, cmlimit=5000 with the apihighlimits user right.
- Downloads: I don't see limits on the downloads themselves.
Q3: Do you handle categories with more than 500 files successfully? (API limit)
Q4: Do you skip already-downloaded files efficiently (quickly)?
Q5: Do you compare local and remote file creation dates, so as to re-download from Commons when a new version is available?
Q6: What should I avoid so as not to be blocked?
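
For reference on Q2/Q3, a minimal sketch of how a listing longer than the per-request cap is usually walked with the MediaWiki `cmcontinue` token. This is not WikiapiJS code: `listCategoryFiles` and the injected `fetchJson` helper (URL in, parsed JSON out) are hypothetical names used only for illustration.

```javascript
// Sketch: list every file in a Commons category by following `cmcontinue`.
// `fetchJson` is a hypothetical injected helper (url -> Promise of parsed JSON),
// so the paging logic can be exercised without touching the network.
const API = "https://commons.wikimedia.org/w/api.php";

async function listCategoryFiles(category, fetchJson, limit = 500) {
  const titles = [];
  let cmcontinue; // undefined on the first request
  do {
    const params = new URLSearchParams({
      action: "query",
      list: "categorymembers",
      cmtitle: category,
      cmtype: "file",
      cmlimit: String(limit), // 500 for regular users, 5000 with apihighlimits
      format: "json",
    });
    if (cmcontinue) params.set("cmcontinue", cmcontinue);

    const data = await fetchJson(`${API}?${params}`);
    for (const member of data.query.categorymembers) titles.push(member.title);

    // The API returns a `continue` object only while more results remain.
    cmcontinue = data.continue && data.continue.cmcontinue;
  } while (cmcontinue);
  return titles;
}
```

With Node 18+, `fetchJson` could be something like `(url) => fetch(url, { headers: { "User-Agent": "..." } }).then((r) => r.json())`. A 50k-file category then needs about 100 listing requests at cmlimit=500.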

Scale up

The goal is to provide the public with direct and convenient dumps of LinguaLibre's audio assets on a per-language basis. We want to create periodic (weekly?) dumps on our Lili server.

We want to keep a local dump synchronized with Wikimedia Commons. We are talking about 700,000 files so far. Extrapolating from the test durations in this post, the initial synchronization would take about 21 days, which is OK.
But a later update, one week later, would require about 15 days, while only 1~2% of the files (7,000-15,000) would be new and require a download.

Do you see any possible optimizations?
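
To make the kind of optimization I have in mind concrete, here is a rough sketch of the "what needs downloading" decision, assuming the remote upload timestamps can be fetched in bulk (for example via `prop=imageinfo&iiprop=timestamp`) and the local modification times are read from disk. `planSync`, `remote` and `local` are hypothetical names, not part of WikiapiJS.

```javascript
// Sketch: decide which files need to be (re)downloaded.
// `remote`: array of { title, timestamp }, where timestamp is the ISO 8601
//           upload time of the latest file version on Commons.
// `local`:  Map from title to the local file's modification time in ms.
function planSync(remote, local) {
  const toDownload = [];
  for (const { title, timestamp } of remote) {
    const localMs = local.get(title);
    const remoteMs = Date.parse(timestamp);
    // Download when the file is missing locally or older than the remote version.
    if (localMs === undefined || localMs < remoteMs) toDownload.push(title);
  }
  return toDownload;
}
```

With this, a weekly update would only issue download requests for the 1~2% of files that are new or changed, and the cost of the remaining 98% would reduce to the listing and timestamp queries.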

WikiapiJS download worked on tiny categories (12 files). See the code in #48.
I'm currently reluctant to test further for fear of being banned.

`.download()` benchmark (1)

Ok, I decided to test anyway on a category with n=369.

Initial attempt:
- categorymembers=369
- downloads=369
- running time: 16 min or 960 s --> 2.6 s/file

Then I removed 14 files from the local directory.

Update attempt:
- categorymembers=369
- downloads=14
- running time: 9 min or 540 s --> 38.6 s/file
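
The arithmetic behind these figures, as a small sketch. It assumes run time scales linearly with the number of files listed in a category, which is only an approximation; `secondsPerFile` and `extrapolateDays` are hypothetical helper names.

```javascript
// Sketch: per-file cost and linear extrapolation to the full collection.
const SECONDS_PER_DAY = 86400;

function secondsPerFile(totalSeconds, files) {
  return totalSeconds / files;
}

// Assumes run time grows linearly with the number of files listed.
function extrapolateDays(benchSeconds, benchListedFiles, totalFiles) {
  return (secondsPerFile(benchSeconds, benchListedFiles) * totalFiles) / SECONDS_PER_DAY;
}

const initialRate = secondsPerFile(960, 369); // about 2.6 s per downloaded file
const updateRate = secondsPerFile(540, 14);   // about 38.6 s per downloaded file
const initialDays = extrapolateDays(960, 369, 700000); // about 21 days
const updateDays = extrapolateDays(540, 369, 700000);  // about 12 days
```

The update run still costs about 1.5 s per listed file even though only 14 files are downloaded, so at 700,000 files an update stays in the range of one to two weeks.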


kanasimi · 2021-12-27T22:03:15Z

I did not have this in mind, but I left room for it.
As you know, I have the apihighlimits permission on Wikimedia Commons, so I often use it, but it should also work for users without the apihighlimits permission. There should be no limit on downloading large categories.
I have processed categories with 100K+ files.
Yes, the library skips files that already exist.
No, I have not coded this yet.
I have never heard of Wikimedia Commons blocking people for downloading files, so...

kanasimi · 2021-12-28T08:52:32Z

What categories do you want to synchronize?

hugolpz · 2021-12-28T09:00:59Z

The aim is to provide convenient dumps for each category in Category:Lingua_Libre_pronunciation.
The largest ones are 60k (ben) and 250k (fra) files strong. The 130 categories together contain 700,000 files.

The point of this question about WikiapiJS .download() is scalability: the ability to handle such large categories with resilience and speed, both for the initial download and for later periodic updates, ideally weekly.

kanasimi · 2021-12-28T09:55:48Z

Well, it seems I need to do some work...

hugolpz · 2022-01-03T21:37:34Z

Nice !

hugolpz · 2022-01-14T12:31:25Z

This scale up question is handled in two related issues:

hugolpz added the "question" label Dec 27, 2021

hugolpz assigned kanasimi Dec 27, 2021

kanasimi added the "enhancement" label Dec 27, 2021

hugolpz mentioned this issue Jan 5, 2022

"Too many values supplied for parameter \"pageids\". The limit is 50." #52

Closed

hugolpz mentioned this issue Jan 13, 2022

.download() : reduce calls to api.php, directly hit on https://upload.wikimedia.org #53

Closed

Repository owner deleted a comment from kanasimi Jan 13, 2022

kanasimi closed this as completed in kanasimi/CeJS@1c9dc03 Jan 13, 2022

Repository owner deleted a comment from kanasimi Jan 14, 2022

hugolpz mentioned this issue Jan 14, 2022

.download() : compare local and remote files by timestamps before downloading #55

Closed
