Download script #22
Conversation
if [ -z "$URLS" ] | ||
then | ||
log "No dumps available" |
"Latest dumps are already downloaded"?
If URLS is empty, then none of the specified languages could be found for the latest dump.
If a newer dump isn't available, it will still check the sizes of the last downloaded dump, and exit with 0.
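For illustration only, a message covering both situations might look like this (a sketch; only URLS and log come from the snippet above):

    if [ -z "$URLS" ]
    then
        # Either no newer dump exists, or none of the selected languages are in it.
        log "Nothing new to download for the selected languages"
    fi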
Good! The goal is to make a cron script that will update files automatically when they are published (and delete old files).
Another question: should previously generated HTML and other temporary files be deleted before relaunching the wikiparser? Does it make sense to cover it in the run script?
They shouldn't need to be.
The temporary files are regenerated each time.
The generated HTML will be overwritten if it is referenced in the new planet file.
If an article isn't extracted from the dump due to #24 or something else, then having the old copy still available might be useful.
But if the HTML simplification is changed, and older articles are no longer referenced in OSM, then they will remain on disk unchanged.
Need to test it on a server )
    LANGUAGES=$(jq -r '(.sections_to_remove | keys | .[])' article_processing_config.json)
    fi
    # shellcheck disable=SC2086 # LANGUAGES is intentionally expanded.
    log "Selected languages:" $LANGUAGES
nit: Can array be used here without a warning?
To convert it to an array with the same semantics, it would need to suppress another warning:

    # shellcheck disable=SC2206 # Intentionally split on whitespace.
    LANGUAGES=( $LANGUAGES )
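With the array form, the later use would then expand it explicitly, e.g. (a sketch):

    log "Selected languages:" "${LANGUAGES[@]}"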
I've looked a little into parallel downloads with programs in the Debian repos: GNU parallel or GNU xargs work, but you lose the progress bar from wget and get no indication of how the downloads are going.
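For reference, the kind of invocation that works (a sketch; the URLS variable and the wget flags here are assumptions, not the script's final code):

    # Fetch up to two URLs at a time; wget's progress bar is lost, so use
    # --no-verbose and rely on its one-line-per-file log output instead.
    echo "$URLS" | xargs --max-args=1 --max-procs=2 wget --no-verbose --directory-prefix="$DUMP_DIR"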
download.sh (outdated)
    LATEST_LINK="$DUMP_DIR/latest"
    ln -sf "$LATEST_DUMP" "$LATEST_LINK"

    # TODO: Remove old dumps?
Do you want the script to handle this?
If it will be running on a cron job, then it might be good to keep 2 copies around.
Otherwise the script could end up deleting the last dump while wikiparser is still using it.
- Aren't files that were open before their deletion on Linux still accessible?
- Dumps are produced regularly, right? We can set a specific schedule.
- Script may have an option to automatically delete older dumps.
- Aren't files that were open before their deletion on Linux still accessible?
You're right: as long as run.sh is started before download.sh deletes the files, it will still be able to access them.
- Dumps are produced regularly, right? We can set a specific schedule.
Yes, they're started on the 1st and the 20th of each month, and it looks like they finish within 3 days (see the example cron entry below).
- Script may have an option to automatically delete older dumps.
👍
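Based on that, a crontab entry could look something like this (a sketch; the paths and times are placeholders):

    # Dumps start on the 1st and 20th and take roughly 3 days, so check on the
    # 4th and 23rd; the script exits cleanly if there is nothing new to fetch.
    0 3 4,23 * * /path/to/wikiparser/download.sh /path/to/dumps >> /var/log/wikiparser-download.log 2>&1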
I've added a new option:
    -D    Delete all old dump subdirectories if the latest is downloaded
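Roughly the idea, for anyone following along (a sketch, not the exact implementation; DELETE_OLD is an illustrative flag variable):

    if [ -n "$DELETE_OLD" ]; then
        latest=$(readlink -f "$DUMP_DIR/latest")
        for dir in "$DUMP_DIR"/*/; do
            # Keep whatever 'latest' points to; remove every other dump subdirectory.
            if [ "$(readlink -f "$dir")" != "$latest" ]; then
                log "Removing old dump $dir"
                rm -r "$dir"
            fi
        done
    fi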
The default behavior can be like this: use wget2 if it's available, and fall back to a single-threaded download while mentioning the speedup wget2 would give. Another important question is whether it's ok to overload the wiki servers with parallel downloads. Can you please ask them to confirm? Maybe they have a single-threaded policy?
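Something along these lines for the fallback (a sketch; the exact flags are illustrative):

    if command -v wget2 > /dev/null; then
        DOWNLOADER=wget2
    else
        DOWNLOADER=wget
        log "wget2 not found; falling back to single-threaded wget (installing wget2 speeds up downloads)"
    fi
    # shellcheck disable=SC2086 # URLS is intentionally split into separate arguments.
    "$DOWNLOADER" --continue --directory-prefix="$DUMP_DIR" $URLS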
Looks like 2 parallel downloads is the max:
There are at least two mirrors that host some of the latest enterprise dumps:
Without -T, ln interprets an existing LATEST_LINK as a directory to place the link in, instead of a link to replace. Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
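So the symlink line presumably becomes:

    # -T: never treat LATEST_LINK as a directory to place the link inside.
    ln -sfT "$LATEST_DUMP" "$LATEST_LINK"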
Good, let's track how fast the mirrors are updated. We could hardcode the mirror URLs or put them in the readme, and use whichever is better/faster.
Both of the mirrors have the 2023-08-20 dumps up already.
Thanks!
- wget2 doesn't resume interrupted downloads.
- Don't forget to squash all commits before the merge )
    Arguments:
      <DUMP_DIR>  An existing directory to store dumps in. Dumps will be grouped
                  into subdirectories by date, and a link 'latest' will point to
Will the wikiparser generator properly find/load newer versions from the latest directory without specifying explicit file names?
For the run.sh script, you'll provide a glob of the latest directory:

    ./run.sh descriptions/ planet.osm.pbf $DUMP_DIR/latest/*

It doesn't have any special handling for the $DUMP_DIR layout.
What kind of interruption? It should be able to handle network drops and temporary errors.
    # While the dump websites are not part of the API, it's still polite to identify yourself.
    # See https://meta.wikimedia.org/wiki/User-Agent_policy
    subcommand=$1
    name="OrganicMapsWikiparserDownloaderBot"
Can spaces be added here?
I haven't seen an example with spaces in the name. All of the browser user agents use CamelCase instead of spaces.
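For reference, the kind of string the policy describes and how it would be passed to wget (a sketch; the version, contact URL, and URL variable are placeholders, not values from the script):

    name="OrganicMapsWikiparserDownloaderBot"
    # The policy asks for a descriptive client name plus contact information.
    user_agent="$name/1.0 (https://example.org/wikiparser; contact@example.org)"
    wget --user-agent="$user_agent" "$URL"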
It would be great to test all these PRs on the server with real data.
Closes #12
Remaining work: