Download script #22

Merged · 28 commits · Sep 26, 2023
26 changes: 26 additions & 0 deletions README.md
@@ -10,6 +10,32 @@ OpenStreetMap commonly stores these as [`wikipedia*=`](https://wiki.openstreetma
[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.

## Downloading Dumps

[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).

For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.

They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
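
For example, you can stream a single record out of an archive without unpacking the whole thing (the filename and the `name` field here are illustrative; see the schema linked above for the available fields):
```
tar -xzOf enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz | head -n 1 | jq '.name'
```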

The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
It maintains a directory with the following layout:
```
<DUMP_DIR>/
├── latest -> 20230701/
├── 20230701/
│ ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│ ...
├── 20230620/
│ ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│ ...
...
```
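
For example, a first run might look like the following (the directory name and language list are illustrative; `LANGUAGES` is optional and defaults to the languages in [`article_processing_config.json`](article_processing_config.json)):
```
mkdir -p dumps
LANGUAGES="en de es" ./download.sh dumps/
```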

## Usage

To use with the map generator, see the [`run.sh` script](run.sh) and its own help documentation.
101 changes: 101 additions & 0 deletions download.sh
@@ -0,0 +1,101 @@
#! /usr/bin/env bash
USAGE="Usage: ./download.sh <DUMP_DIR>

Download the latest Wikipedia Enterprise HTML dumps.

Arguments:
<DUMP_DIR> An existing directory to store dumps in. Dumps will be grouped
into subdirectories by date, and a link 'latest' will point to
Member

Will the wikiparser generator properly find/load newer versions from the `latest` dir without specifying explicit file names?

Collaborator Author

For the run.sh script, you'll provide a glob of the latest directory:

./run.sh descriptions/ planet.osm.pbf $DUMP_DIR/latest/*

It doesn't have any special handling for the $DUMP_DIR layout.

the latest complete dump subdirectory, if it exists.

Environment Variables:
LANGUAGES A whitespace-separated list of wikipedia language codes to
download dumps of.
Defaults to the languages in 'article_processing_config.json'.
See <https://meta.wikimedia.org/wiki/List_of_Wikipedias>.

Exit codes:
0 The latest dumps are already present or were downloaded successfully.
1 Argument error.
16 Some of the languages were not available to download. The latest dump may
be in progress, or some of the specified languages may not exist.
_ Subprocess error.
"

set -euo pipefail
# set -x

if [ -z "${1:-}" ]; then
echo -n "$USAGE" >&2
exit 1
fi

# The parent directory to store groups of dumps in.
DUMP_DIR=$(readlink -f "$1")
shift

if [ ! -d "$DUMP_DIR" ]; then
echo "DUMP_DIR '$DUMP_DIR' does not exist" >&2
exit 1
fi

# Ensure we're running in the directory of this script.
SCRIPT_PATH=$(dirname "$0")
cd "$SCRIPT_PATH"
SCRIPT_PATH=$(pwd)

# Only load library after changing to script directory.
source lib.sh

if [ -z "${LANGUAGES:-}" ]; then
# Load languages from config.
LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
fi
# shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
log "Selected languages:" $LANGUAGES
Member

nit: Can an array be used here without a warning?

Collaborator Author (@newsch, Aug 17, 2023)

To convert it to an array with the same semantics it would need to suppress another warning:

# shellcheck disable=SC2206 # Intentionally split on whitespace.
LANGUAGES=( $LANGUAGES )


log "Fetching run index"
# The date of the latest dump, YYYYMMDD.
LATEST_DUMP=$(wget 'https://dumps.wikimedia.org/other/enterprise_html/runs/' --no-verbose -O - \
| grep -Po '(?<=href=")[^"]*' | grep -P '\d{8}' | sort -r | head -n1)
LATEST_DUMP="${LATEST_DUMP%/}"

log "Checking latest dump $LATEST_DUMP"

URLS=
MISSING_DUMPS=0
for lang in $LANGUAGES; do
url="https://dumps.wikimedia.org/other/enterprise_html/runs/${LATEST_DUMP}/${lang}wiki-NS0-${LATEST_DUMP}-ENTERPRISE-HTML.json.tar.gz"
if ! wget --no-verbose --method=HEAD "$url"; then
MISSING_DUMPS=$(( MISSING_DUMPS + 1 ))
log "Dump for '$lang' does not exist at '$url'"
continue
fi
URLS="$URLS $url"
done

if [ -z "$URLS" ]; then
log "No dumps available"
Member

"Latest dumps are already downloaded"?

Collaborator Author

If URLS is empty, then none of the specified languages could be found for the latest dump.

If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.

Member

Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).

Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?

Collaborator Author

They shouldn't need to be.

The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.

If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.

But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.

exit 16
fi

# The subdir to store the latest dump in.
DOWNLOAD_DIR="$DUMP_DIR/$LATEST_DUMP"
if [ ! -e "$DOWNLOAD_DIR" ]; then
mkdir "$DOWNLOAD_DIR"
fi

log "Downloading available dumps"
# shellcheck disable=SC2086 # URLS should be expanded on spaces.
wget --directory-prefix "$DOWNLOAD_DIR" --continue $URLS

if [ $MISSING_DUMPS -gt 0 ]; then
log "$MISSING_DUMPS dumps not available yet"
exit 16
fi

log "Linking 'latest' to '$LATEST_DUMP'"
LATEST_LINK="$DUMP_DIR/latest"
ln -sf "$LATEST_DUMP" "$LATEST_LINK"

# TODO: Remove old dumps?
Collaborator Author

Do you want the script to handle this?

If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise the script could delete the last dump while wikiparser is still using it?

Member

  1. Aren't files that were open before their deletion on Linux still accessible?
  2. Dumps are produced regularly, right? We can set a specific schedule.
  3. Script may have an option to automatically delete older dumps.

Collaborator Author

  1. Aren't files that were open before their deletion on Linux still accessible?

You're right, as long as run.sh is started before download.sh deletes them, it will be able to access the files.

  2. Dumps are produced regularly, right? We can set a specific schedule.

Yes, they're started on the 1st and the 20th of each month, and it looks like they're finished within 3 days.

  3. Script may have an option to automatically delete older dumps.

👍
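
Given that schedule, a cron entry for this script might look something like the sketch below (the timing and paths are illustrative, not part of this PR):
```
# Run a few days after each dump starts on the 1st and 20th, once it should be complete.
0 3 4,23 * * /path/to/wikiparser/download.sh /data/wikipedia-dumps
```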

Collaborator Author

I've added a new option:

-D      Delete all old dump subdirectories if the latest is downloaded
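
For reference, a minimal sketch of what such pruning could look like, assuming the `<DUMP_DIR>` layout from the README (this is an illustration, not the exact code added in that commit):
```
# Remove dated dump subdirectories other than the one 'latest' points to.
LATEST=$(readlink "$DUMP_DIR/latest")
for dir in "$DUMP_DIR"/*/; do
    name=$(basename "$dir")
    if [ "$name" != "$LATEST" ] && [ "$name" != "latest" ]; then
        rm -r "${dir:?}"
    fi
done
```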