
Download script #22
Merged
@newsch merged 28 commits into main from download on Sep 26, 2023

Conversation

@newsch (Collaborator) commented Jul 18, 2023

Closes #12

Remaining work:

  • Add handling for error conditions
  • Document config and error codes

download.sh

if [ -z "$URLS" ]
then
log "No dumps available"
Member

"Latest dumps are already downloaded"?

Collaborator Author

If URLS is empty, then none of the specified languages could be found for the latest dump.

If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.

Member

Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).

Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?

Collaborator Author

They shouldn't need to be.

The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.

If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.

But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.

@newsch added this to the v0.2 milestone on Aug 9, 2023
@newsch marked this pull request as ready for review on August 16, 2023
@newsch requested a review from @biodranik on August 16, 2023
@biodranik (Member) left a comment

Need to test it on a server )

LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
Member

nit: Can an array be used here without a warning?

@newsch (Collaborator Author) commented Aug 17, 2023

To convert it to an array with the same semantics, another warning would need to be suppressed:

# shellcheck disable=SC2206 # Intentionally split on whitespace.
LANGUAGES=( $LANGUAGES )
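
For the jq branch specifically, a warning-free alternative would be mapfile, which reads the newline-separated output straight into an array (editor's sketch, not the PR code; assumes the script runs under bash 4+ and keeps the log helper used above):

# Read one language code per line into an array; nothing to disable.
mapfile -t LANGUAGES < <(jq -r '.sections_to_remove | keys | .[]' article_processing_config.json)
log "Selected languages: ${LANGUAGES[*]}"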

@newsch (Collaborator Author) commented Aug 17, 2023

I've looked a little into parallel downloads with programs in the Debian repos:

GNU parallel or GNU xargs works, but you lose the progress bar from wget and have no indication of how the downloads are doing:

for url in $URLS; do echo "$url"; done | xargs -L 1 -P 3 wget --no-verbose --continue --directory-prefix "$DOWNLOAD_DIR"

aria2c returned protocol errors:

aria2c -x2 -s2 -c -d ~/Downloads/aria-test \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz

08/17 10:35:22 [NOTICE] Downloading 1 item(s)

08/17 10:35:23 [ERROR] CUID#9 - Download aborted. URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
Exception: [AbstractCommand.cc:351] errorCode=8 URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
  -> [HttpResponse.cc:81] errorCode=8 Invalid range header. Request: 130154496-14855176191/29631479446, Response: 130154496-14855176191/25080519092

axel only seems to parallelize a single download

wget2 works great out of the box:

wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS

download.sh Outdated
LATEST_LINK="$DUMP_DIR/latest"
ln -sf "$LATEST_DUMP" "$LATEST_LINK"

# TODO: Remove old dumps?
Collaborator Author

Do you want the script to handle this?

If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise, couldn't the script delete the last dump while wikiparser is still using it?

Member

  1. Aren't files that were open before their deletion on Linux still accessible?
  2. Dumps are produced regularly, right? We can set a specific schedule.
  3. Script may have an option to automatically delete older dumps.

Collaborator Author

  1. Aren't files that were open before their deletion on Linux still accessible?

You're right, as long as run.sh is started before download.sh deletes them, it will be able to access the files (see the sketch below).

  2. Dumps are produced regularly, right? We can set a specific schedule.

Yes, they're started on the 1st and the 20th of each month, and it looks like they finish within 3 days.

  3. Script may have an option to automatically delete older dumps.

👍
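
A quick illustration of point 1 (editor's sketch, not from the PR): a process that already has a dump open keeps its access after the file is unlinked.

printf 'dump contents\n' > dump.tmp
exec 3< dump.tmp    # open the file on fd 3
rm dump.tmp         # unlink the name; the data stays until fd 3 is closed
cat <&3             # still prints "dump contents"
exec 3<&-           # close the descriptor; only now is the space reclaimed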

Collaborator Author

I've added a new option:

-D      Delete all old dump subdirectories if the latest is downloaded
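
A minimal sketch of what that cleanup could do (an assumption about the behavior, not the actual code from the PR; DUMP_DIR and the 'latest' symlink as described above, GNU readlink assumed):

# Delete every dump subdirectory except the one 'latest' points to.
latest_target=$(readlink -f "$DUMP_DIR/latest")
for dir in "$DUMP_DIR"/*/; do
    dir=${dir%/}
    [ -d "$dir" ] || continue
    [ "$(readlink -f "$dir")" = "$latest_target" ] && continue
    rm -rf "$dir"
done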

@biodranik (Member)

wget2 works great out of the box:

The default behavior can be like this: use wget2 if it's available, and fall back to a single-threaded download while mentioning a speedup with wget2.
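
A minimal sketch of that fallback (editor's assumption, not the PR code; URLS, DOWNLOAD_DIR, and log as used in the snippets above):

# shellcheck disable=SC2086 # URLS is intentionally split.
if command -v wget2 > /dev/null; then
    wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS
else
    log "wget2 not found; falling back to single-threaded wget (installing wget2 speeds this up)"
    for url in $URLS; do
        wget --continue --directory-prefix "$DOWNLOAD_DIR" "$url"
    done
fi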

Another important question is whether it's ok to overload the wiki servers with parallel downloads. Can you please ask them to confirm? Maybe they have a single-threaded policy?

@newsch (Collaborator Author) commented Aug 18, 2023

Looks like 2 parallel downloads is the max:

If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2. This will help to ensure that everyone can access the files with reasonable download times. Clients that try to evade these limits may be blocked.

There are at least two mirrors that host some of the latest enterprise dumps.

Without -T, ln interprets an existing LATEST_LINK as a directory to
place the link in, instead of a link to replace.
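
For reference, the fixed invocation would look like this (editor's sketch based on the commit note above, not the exact diff):

LATEST_LINK="$DUMP_DIR/latest"
# -T treats LATEST_LINK as the symlink to replace, never as a directory to place the link in.
ln -sf -T "$LATEST_DUMP" "$LATEST_LINK"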

@biodranik (Member)

Good, let's track how fast the mirrors are updated. We may hardcode the mirror URLs or put them in the readme, and use whichever is better/faster.

@newsch (Collaborator Author) commented Aug 21, 2023

Both of the mirrors have the 2023-08-20 dumps up already.

@biodranik (Member) left a comment

Thanks!

  1. wget2 doesn't resume interrupted downloads.
  2. Don't forget to squash all commits before the merge )

download.sh

Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
Member

Will the wikiparser generator properly find/load newer versions from the latest dir without specifying explicit file names?

Collaborator Author

For the run.sh script, you'll provide a glob of the latest directory:

./run.sh descriptions/ planet.osm.pbf $DUMP_DIR/latest/*

It doesn't have any special handling for the $DUMP_DIR layout.

@newsch (Collaborator Author) commented Aug 21, 2023

wget2 doesn't resume interrupted downloads

What kind of interruption? It should be able to handle network drops and temporary errors.

# While the dump websites are not part of the API, it's still polite to identify yourself.
# See https://meta.wikimedia.org/wiki/User-Agent_policy
subcommand=$1
name="OrganicMapsWikiparserDownloaderBot"
Member

Can spaces be added here?

Collaborator Author

I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.
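
For context, the policy asks for a descriptive agent name plus contact information; it could be passed along like this (editor's sketch, not the PR code; the contact details are placeholders, and wget2 is assumed to accept wget's --user-agent flag):

name="OrganicMapsWikiparserDownloaderBot"
# Per https://meta.wikimedia.org/wiki/User-Agent_policy: tool name/version plus a contact.
user_agent="$name/1.0 (https://github.com/organicmaps/wikiparser; contact@example.com)"
# shellcheck disable=SC2086 # URLS is intentionally split.
wget2 --user-agent "$user_agent" --continue --directory-prefix "$DOWNLOAD_DIR" $URLS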

@biodranik (Member)

It would be great to test all these PRs on the server with real data.

@newsch merged commit 481ace4 into main on Sep 26, 2023
2 checks passed
@newsch deleted the download branch on September 26, 2023
Development

Successfully merging this pull request may close these issues.

Automate dump downloading
2 participants