
Download script #22
Merged
@newsch merged 28 commits into main from download on Sep 26, 2023

Conversation

@newsch (Collaborator) commented Jul 18, 2023

Closes #12

Remaining work:

  • Add handling for error conditions
  • Document config and error codes

download.sh

if [ -z "$URLS" ]
then
log "No dumps available"
Member

"Latest dumps are already downloaded"?

Collaborator Author

If URLS is empty, then none of the specified languages could be found for the latest dump.

If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.

Member

Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).

Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?

Collaborator Author

They shouldn't need to be.

The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.

If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.

But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.

@newsch added this to the v0.2 milestone on Aug 9, 2023
@newsch marked this pull request as ready for review on August 16, 2023
@newsch requested a review from @biodranik on August 16, 2023
@biodranik (Member) left a comment

Need to test it on a server )

LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
Member

nit: Can an array be used here without a warning?

@newsch (Collaborator Author) commented Aug 17, 2023

To convert it to an array with the same semantics, another warning would need to be suppressed:

# shellcheck disable=SC2206 # Intentionally split on whitespace.
LANGUAGES=( $LANGUAGES )
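
For the jq branch specifically, a warning-free alternative would be mapfile, which reads the newline-separated output straight into an array (editor's sketch, not the PR code; assumes the script runs under bash 4+ and keeps the log helper used above):

# Read one language code per line into an array; nothing to disable.
mapfile -t LANGUAGES < <(jq -r '.sections_to_remove | keys | .[]' article_processing_config.json)
log "Selected languages: ${LANGUAGES[*]}"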

@newsch (Collaborator Author) commented Aug 17, 2023

I've looked a little into parallel downloads with programs in the Debian repos:

GNU parallel or GNU xargs works, but you lose the progress bar from wget and have no indication of how the downloads are doing:

for url in $URLS; do echo "$url"; done | xargs -L 1 -P 3 wget --no-verbose --continue --directory-prefix "$DOWNLOAD_DIR"

aria2c returned protocol errors:

aria2c -x2 -s2 -c -d ~/Downloads/aria-test \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/dewiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz \
    https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz

08/17 10:35:22 [NOTICE] Downloading 1 item(s)

08/17 10:35:23 [ERROR] CUID#9 - Download aborted. URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
Exception: [AbstractCommand.cc:351] errorCode=8 URI=https://dumps.wikimedia.org/other/enterprise_html/runs/20230801/eswiki-NS0-20230801-ENTERPRISE-HTML.json.tar.gz
  -> [HttpResponse.cc:81] errorCode=8 Invalid range header. Request: 130154496-14855176191/29631479446, Response: 130154496-14855176191/25080519092

axel only seems to parallelize a single download

wget2 works great out of the box:

wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS

download.sh Outdated
LATEST_LINK="$DUMP_DIR/latest"
ln -sf "$LATEST_DUMP" "$LATEST_LINK"

# TODO: Remove old dumps?
Collaborator Author

Do you want the script to handle this?

If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise, couldn't the script delete the last dump while wikiparser is still using it?

Member

  1. Aren't files that were open before their deletion on Linux still accessible?
  2. Dumps are produced regularly, right? We can set a specific schedule.
  3. Script may have an option to automatically delete older dumps.

Collaborator Author

  1. Aren't files that were open before their deletion on Linux still accessible?

You're right, as long as run.sh is started before download.sh deletes them, it will be able to access the files (see the sketch below).

  2. Dumps are produced regularly, right? We can set a specific schedule.

Yes, they're started on the 1st and the 20th of each month, and it looks like they finish within 3 days.

  3. Script may have an option to automatically delete older dumps.

👍
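
A quick illustration of point 1 (editor's sketch, not from the PR): a process that already has a dump open keeps its access after the file is unlinked.

printf 'dump contents\n' > dump.tmp
exec 3< dump.tmp    # open the file on fd 3
rm dump.tmp         # unlink the name; the data stays until fd 3 is closed
cat <&3             # still prints "dump contents"
exec 3<&-           # close the descriptor; only now is the space reclaimed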

Collaborator Author

I've added a new option:

-D      Delete all old dump subdirectories if the latest is downloaded
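
A minimal sketch of what that cleanup could do (an assumption about the behavior, not the actual code from the PR; DUMP_DIR and the 'latest' symlink as described above, GNU readlink assumed):

# Delete every dump subdirectory except the one 'latest' points to.
latest_target=$(readlink -f "$DUMP_DIR/latest")
for dir in "$DUMP_DIR"/*/; do
    dir=${dir%/}
    [ -d "$dir" ] || continue
    [ "$(readlink -f "$dir")" = "$latest_target" ] && continue
    rm -rf "$dir"
done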

@biodranik (Member)

wget2 works great out of the box:

The default behavior can be like this: use wget2 if it's available, and fall back to a single-threaded download while mentioning a speedup with wget2.
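
A minimal sketch of that fallback (editor's assumption, not the PR code; URLS, DOWNLOAD_DIR, and log as used in the snippets above):

# shellcheck disable=SC2086 # URLS is intentionally split.
if command -v wget2 > /dev/null; then
    wget2 --progress=bar --continue --directory-prefix "$DOWNLOAD_DIR" $URLS
else
    log "wget2 not found; falling back to single-threaded wget (installing wget2 speeds this up)"
    for url in $URLS; do
        wget --continue --directory-prefix "$DOWNLOAD_DIR" "$url"
    done
fi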

Another important question is whether it's ok to overload the wiki servers with parallel downloads. Can you please ask them to confirm? Maybe they have a single-threaded policy?

@newsch (Collaborator Author) commented Aug 18, 2023

Looks like 2 parallel downloads is the max:

If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2. This will help to ensure that everyone can access the files with reasonable download times. Clients that try to evade these limits may be blocked.

There are at least two mirrors that host some of the latest enterprise dumps.

Without -T, ln interprets an existing LATEST_LINK as a directory to
place the link in, instead of a link to replace.
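
For reference, the fixed invocation would look like this (editor's sketch based on the commit note above, not the exact diff):

LATEST_LINK="$DUMP_DIR/latest"
# -T treats LATEST_LINK as the symlink to replace, never as a directory to place the link in.
ln -sf -T "$LATEST_DUMP" "$LATEST_LINK"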

@biodranik (Member)

Good, let's track how fast the mirrors are updated. We may hardcode the mirror URLs or put them in the readme, and use whichever is better/faster.

@newsch (Collaborator Author) commented Aug 21, 2023

Both of the mirrors have the 2023-08-20 dumps up already.

@biodranik (Member) left a comment

Thanks!

  1. wget2 doesn't resume interrupted downloads.
  2. Don't forget to squash all commits before the merge )

download.sh

Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
Member

Will the wikiparser generator properly find/load newer versions from the latest dir without specifying explicit file names?

Collaborator Author

For the run.sh script, you'll provide a glob of the latest directory:

./run.sh descriptions/ planet.osm.pbf $DUMP_DIR/latest/*

It doesn't have any special handling for the $DUMP_DIR layout.

@newsch (Collaborator Author) commented Aug 21, 2023

wget2 doesn't resume interrupted downloads

What kind of interruption? It should be able to handle network drops and temporary errors.

# While the dump websites are not part of the API, it's still polite to identify yourself.
# See https://meta.wikimedia.org/wiki/User-Agent_policy
subcommand=$1
name="OrganicMapsWikiparserDownloaderBot"
Member

Can spaces be added here?

Collaborator Author

I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.
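
For context, the policy asks for a descriptive agent name plus contact information; it could be passed along like this (editor's sketch, not the PR code; the contact details are placeholders, and wget2 is assumed to accept wget's --user-agent flag):

name="OrganicMapsWikiparserDownloaderBot"
# Per https://meta.wikimedia.org/wiki/User-Agent_policy: tool name/version plus a contact.
user_agent="$name/1.0 (https://github.com/organicmaps/wikiparser; contact@example.com)"
# shellcheck disable=SC2086 # URLS is intentionally split.
wget2 --user-agent "$user_agent" --continue --directory-prefix "$DOWNLOAD_DIR" $URLS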

@biodranik (Member)

It would be great to test all these PRs on the server with real data.

@newsch merged commit 481ace4 into main on Sep 26, 2023
2 checks passed
@newsch deleted the download branch on September 26, 2023
Development

Successfully merging this pull request may close these issues.

Automate dump downloading
2 participants