Download script #22

Merged
merged 28 commits into from Sep 26, 2023
Changes from 2 commits
33 changes: 25 additions & 8 deletions download.sh
@@ -11,9 +11,8 @@ Arguments:
Options:
-h Print this help screen
-D Delete all old dump subdirectories if the latest is downloaded
- -c <NUM> Number of concurrent downloads to allow. Requires MIRROR to be
- set (Wikimedia servers ask for no more than 2). Requires wget2.
- Defaults to 2.
+ -c <NUM> Number of concurrent downloads to allow. Ignored if wget2 is not
+ present or MIRROR is not set. Defaults to 2.

Environment Variables:
LANGUAGES A whitespace-separated list of wikipedia language codes to
@@ -38,6 +37,17 @@ Exit codes:
set -euo pipefail
# set -x

+ build_user_agent() {
+ # While the dump websites are not part of the API, it's still polite to identify yourself.
+ # See https://meta.wikimedia.org/wiki/User-Agent_policy
+ subcommand=$1
+ name="OrganicMapsWikiparserDownloaderBot"

Member:
Can spaces be added here?

Collaborator Author:
I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.

+ version="1.0"
+ url="https://github.com/organicmaps/wikiparser"
+ email="hello@organicmaps.app"
+ echo -n "$name/$version ($url; $email) $subcommand"
+ }
+
# Parse options.
DELETE_OLD_DUMPS=false
CONCURRENT_DOWNLOADS=
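
For reference, the function added in this hunk can be run on its own to see the exact header value discussed in the review thread. The sketch below is a standalone copy; the `local` declarations and the quoting of `$1` are additions for illustration and are not part of the PR.

```shell
#!/usr/bin/env bash
# Standalone copy of build_user_agent from this hunk, for illustration only.
# `local` and the quoted "$1" are additions; the PR uses plain globals.
build_user_agent() {
    local subcommand="$1"
    local name="OrganicMapsWikiparserDownloaderBot"
    local version="1.0"
    local url="https://github.com/organicmaps/wikiparser"
    local email="hello@organicmaps.app"
    # Format: <name>/<version> (<url>; <email>) <tool>
    echo -n "$name/$version ($url; $email) $subcommand"
}

# Print the value that would be passed to `wget2 --user-agent`.
build_user_agent wget2
echo
# prints: OrganicMapsWikiparserDownloaderBot/1.0 (https://github.com/organicmaps/wikiparser; hello@organicmaps.app) wget2
```

The name contains no spaces, so the product token before the `/` stays a single word, which matches the CamelCase convention mentioned in the reply above.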
@@ -80,6 +90,8 @@ if [ -n "$CONCURRENT_DOWNLOADS" ]; then
exit 1
fi
if [ -z "${MIRROR:-}" ]; then
+ # NOTE: Wikipedia requests no more than 2 concurrent downloads.
+ # See https://dumps.wikimedia.org/ for more info.
echo "WARN: MIRROR is not set; ignoring -c" >&2
CONCURRENT_DOWNLOADS=
fi
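
The combined effect of the updated `-c` help text and this check can be summarized in a small sketch. `resolve_concurrency` is a hypothetical helper that does not exist in the PR; it only restates, under the assumption that plain `wget` counts as one download at a time, which thread count ends up being used.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): restates how -c, MIRROR, and wget2 interact.
#   $1 = value given to -c (may be empty)
#   $2 = value of MIRROR (may be empty)
#   $3 = "yes" if wget2 is available, anything else otherwise
resolve_concurrency() {
    local requested="$1"
    local mirror="$2"
    local have_wget2="$3"
    if [ "$have_wget2" != "yes" ]; then
        echo 1                    # fallback to wget: sequential downloads
    elif [ -z "$mirror" ]; then
        echo 2                    # no mirror: -c is ignored, default of 2 applies
    else
        echo "${requested:-2}"    # mirror set: honor -c, else default to 2
    fi
}

resolve_concurrency 8 "" yes                        # prints 2
resolve_concurrency 8 "https://mirror.example" yes  # prints 8
```

The mirror URL above is a placeholder, not a real mirror.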
@@ -93,6 +105,7 @@ SCRIPT_PATH=$(pwd)
# Only load library after changing to script directory.
source lib.sh


if [ -n "${MIRROR:-}" ]; then
log "Using mirror '$MIRROR'"
BASE_URL=$MIRROR
@@ -140,15 +153,19 @@ fi

log "Downloading available dumps"
if type wget2 > /dev/null; then
- # NOTE: Wikipedia requests no more than 2 concurrent downloads.
- # See https://dumps.wikimedia.org/ for more info.
-
# shellcheck disable=SC2086 # URLS should be expanded on spaces.
- wget2 --max-threads "${CONCURRENT_DOWNLOADS:-2}" --verbose --progress=bar --directory-prefix "$DOWNLOAD_DIR" --continue $URLS
+ wget2 --verbose --progress=bar --continue \
+ --user-agent "$(build_user_agent wget2)" \
+ --max-threads "${CONCURRENT_DOWNLOADS:-2}" \
+ --directory-prefix "$DOWNLOAD_DIR" \
+ $URLS
else
log "WARN: wget2 is not available, falling back to sequential downloads"
# shellcheck disable=SC2086 # URLS should be expanded on spaces.
- wget --directory-prefix "$DOWNLOAD_DIR" --continue $URLS
+ wget --continue \
+ --user-agent "$(build_user_agent wget)" \
+ --directory-prefix "$DOWNLOAD_DIR" \
+ $URLS
fi

if [ $MISSING_DUMPS -gt 0 ]; then