Download script #22
Changes from 19 commits
```bash
#! /usr/bin/env bash
USAGE="Usage: ./download.sh <DUMP_DIR>

Download the latest Wikipedia Enterprise HTML dumps.

Arguments:
    <DUMP_DIR>  An existing directory to store dumps in. Dumps will be grouped
                into subdirectories by date, and a link 'latest' will point to
                the latest complete dump subdirectory, if it exists.

Environment Variables:
    LANGUAGES   A whitespace-separated list of wikipedia language codes to
                download dumps of.
                Defaults to the languages in 'article_processing_config.json'.
                See <https://meta.wikimedia.org/wiki/List_of_Wikipedias>.

Exit codes:
    0   The latest dumps are already present or were downloaded successfully.
    1   Argument error.
    16  Some of the languages were not available to download. The latest dump
        may be in progress, or some of the specified languages may not exist.
    _   Subprocess error.
"

set -euo pipefail
# set -x

if [ -z "${1:-}" ]; then
    echo -n "$USAGE" >&2
    exit 1
fi

# The parent directory to store groups of dumps in.
DUMP_DIR=$(readlink -f "$1")
shift

if [ ! -d "$DUMP_DIR" ]; then
    echo "DUMP_DIR '$DUMP_DIR' does not exist" >&2
    exit 1
fi

# Ensure we're running in the directory of this script.
SCRIPT_PATH=$(dirname "$0")
cd "$SCRIPT_PATH"
SCRIPT_PATH=$(pwd)

# Only load library after changing to script directory.
source lib.sh

if [ -z "${LANGUAGES:-}" ]; then
    # Load languages from config.
    LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
```
> nit: Can array be used here without a warning?

> To convert it to an array with the same semantics it would need to suppress another warning:
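The array alternative raised in the thread above could be sketched like this (a hypothetical rewrite, not the merged code; `printf` stands in for the real `jq -r ... article_processing_config.json` call, which likewise emits one language code per line):

```shell
# Hypothetical array version of the language selection. mapfile (bash 4+)
# reads one entry per line, so no SC2086 word-splitting suppression is needed.
mapfile -t languages < <(printf '%s\n' de en es)
# Quoted array expansions are safe to pass around without shellcheck warnings.
echo "Selected languages: ${languages[*]}"
```

The trade-off mentioned in the reply is that the conversion itself may need a different shellcheck suppression, so the string-plus-SC2086 form was kept.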
```bash
log "Fetching run index"
# The date of the latest dump, YYYYMMDD.
LATEST_DUMP=$(wget 'https://dumps.wikimedia.org/other/enterprise_html/runs/' --no-verbose -O - \
    | grep -Po '(?<=href=")[^"]*' | grep -P '\d{8}' | sort -r | head -n1)
LATEST_DUMP="${LATEST_DUMP%/}"

log "Checking latest dump $LATEST_DUMP"

URLS=
MISSING_DUMPS=0
for lang in $LANGUAGES; do
    url="https://dumps.wikimedia.org/other/enterprise_html/runs/${LATEST_DUMP}/${lang}wiki-NS0-${LATEST_DUMP}-ENTERPRISE-HTML.json.tar.gz"
    if ! wget --no-verbose --method=HEAD "$url"; then
        MISSING_DUMPS=$(( MISSING_DUMPS + 1 ))
        log "Dump for '$lang' does not exist at '$url'"
        continue
    fi
    URLS="$URLS $url"
done

if [ -z "$URLS" ]; then
    log "No dumps available"
```
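The index-scraping pipeline above can be exercised on a canned directory listing (the HTML snippet and dates below are stand-ins for what `wget` would fetch from the runs index):

```shell
# Stand-in for the HTML listing that wget fetches in the real script.
html='<a href="20230501/">20230501/</a><a href="20230520/">20230520/</a>'
# Same extraction as the script: pull href values, keep 8-digit dates,
# sort descending, take the newest, and strip the trailing slash.
latest=$(grep -Po '(?<=href=")[^"]*' <<< "$html" | grep -P '\d{8}' | sort -r | head -n1)
latest="${latest%/}"
echo "$latest"
```

Note that `grep -P` requires GNU grep with PCRE support, which the script already assumes.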
> "Latest dumps are already downloaded"?

> If `URLS` is empty, then none of the specified languages could be found for the latest dump. If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.

> Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files). Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?

> They shouldn't need to be. The temporary files are regenerated each time. If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful. But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.
```bash
    exit 16
fi

# The subdir to store the latest dump in.
DOWNLOAD_DIR="$DUMP_DIR/$LATEST_DUMP"
if [ ! -e "$DOWNLOAD_DIR" ]; then
    mkdir "$DOWNLOAD_DIR"
fi

log "Downloading available dumps"
# shellcheck disable=SC2086 # URLS should be expanded on spaces.
wget --directory-prefix "$DOWNLOAD_DIR" --continue $URLS

if [ $MISSING_DUMPS -gt 0 ]; then
    log "$MISSING_DUMPS dumps not available yet"
    exit 16
fi

log "Linking 'latest' to '$LATEST_DUMP'"
LATEST_LINK="$DUMP_DIR/latest"
ln -sf "$LATEST_DUMP" "$LATEST_LINK"

# TODO: Remove old dumps?
```
> Do you want the script to handle this? If it will be running on a cron job, then it might be good to keep 2 copies around.

> You're right, as long as

> Yes, they're started on the 1st and the 20th of each month, and finished within 3 days it looks like.

> 👍

> I've added a new option:
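Keeping only the two newest dumps, as discussed above, could look roughly like the sketch below. This is an illustration of the idea, not the option added in the PR; the temp directory and date names stand in for a real `DUMP_DIR`:

```shell
# Hypothetical pruning step: keep the 2 newest date-named dump directories.
DUMP_DIR=$(mktemp -d)
mkdir "$DUMP_DIR"/20230501 "$DUMP_DIR"/20230520 "$DUMP_DIR"/20230601
# YYYYMMDD names sort lexically in date order; drop everything past the
# newest 2 entries.
ls -d "$DUMP_DIR"/[0-9]*/ | sort -r | tail -n +3 | xargs rm -rf
```

On a cron schedule this would run after a successful download, so the previous dump stays available while the new one is being processed.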
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will wikiparser generator properly find/load newer versions from the latest dir without specifying explicit file names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the
run.sh
script, you'll provide a glob of the latest directory:It doesn't have any special handling for the
$DUMP_DIR
layout.
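The `latest` symlink plus a glob resolves to concrete dump files without the consumer knowing the dump date. The demonstration below builds a miniature stand-in for the layout `download.sh` produces (the directory and filename are examples, and the `run.sh` invocation is left out since its arguments aren't shown in this thread):

```shell
# Build a miniature DUMP_DIR layout like the one download.sh produces.
DUMP_DIR=$(mktemp -d)
mkdir "$DUMP_DIR/20230601"
touch "$DUMP_DIR/20230601/enwiki-NS0-20230601-ENTERPRISE-HTML.json.tar.gz"
ln -sf 20230601 "$DUMP_DIR/latest"
# A consumer can glob through the symlink; it resolves into the dated subdir.
dumps=("$DUMP_DIR"/latest/*-ENTERPRISE-HTML.json.tar.gz)
echo "${dumps[@]}"
```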