Fix weekly full index creation #304
Original issue seems to be a failing download of the latest baseline dump from persephone. Manually downloaded it for now. For a permanent solution, we need to fix the automated download. It might make sense to get it over HTTP in general (as I did manually above).
Indexing worked, and JP confirmed to use http://index.hbz-nrw.de/alephxml/export/ Next: set up baseline downloads via HTTP in the server setup, starting from the crontab.
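A crontab entry for this could look like the following sketch. The day and time here are assumptions; the script and log file names follow the `copyNewestFullDump.sh` entry mentioned later in this thread.

```shell
# Hypothetical crontab entry on gaia (day/time are assumptions):
# fetch the newest baseline every Saturday morning, logging all output.
0 6 * * 6 cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1
```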
Adding the script changes below, as the affected script is not under version control. Replaced the old content:

```shell
DIR=/files/open_data/open/DE-605/mabxml
oldFile=$(ls $DIR/DE-605-aleph-base*2*.tar.gz)
oldUpdateFiles=$(ls $DIR/DE-605-aleph-update-marcxchange-*.tar.gz)
ssh admin@persephone 'cd /data/alephxml/export/baseline/ ; a=$(ls -cR | grep tar.gz | head -n 1); a=$(find . -name $a) ; scp $a hduser@gaia:/files/open_data/open/DE-605/mabxml'
#mv $oldFile /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/DE-605-aleph-newestBackupOfMonth-$(date +%m).tar.gz
#for i in $oldUpdateFiles; do rm $i; done
```

With new content:

```shell
#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# URL of the actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
wget --no-verbose "$BASELINE_URL"
# See also https://github.com/hbz/lobid/issues/304
```

And changed the crontab entry to redirect output to a log file:

```shell
ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1 ; [...]'
```

Tested the trigger from the crontab.
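The fragile part of the script above is extracting the newest date from the directory listing with `grep '20' | ... | tail -n 1 | rev | cut -c 2- | rev`, which relies on the listing being sorted ascending. Below is a sketch of a more defensive variant that picks out the ten-digit date stamps and sorts them numerically; the sample listing is an assumption about the server's HTML format, not a captured response.

```shell
# Hypothetical excerpt of the HTML directory listing at $BASELINE_ROOT/
# (the exact markup is an assumption).
LISTING='<a href="2016041619/">2016041619/</a>
<a href="2016042319/">2016042319/</a>
<a href="2016041019/">2016041019/</a>'

# Pull out all ten-digit date stamps and sort them numerically, so the
# result does not depend on the order of the listing.
BASELINE_DATE="$(printf '%s\n' "$LISTING" | grep -o '[0-9]\{10\}' | sort -n | tail -n 1)"
echo "$BASELINE_DATE"   # prints 2016042319
```

With `curl "$BASELINE_ROOT/"` substituted for the sample listing, this would drop the `rev | cut | rev` dance for stripping the trailing slash as well, since only the digits are matched.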
Reopening: weekly updates don't pick up the latest baseline. Times at http://index.hbz-nrw.de/alephxml/export/baseline/ look good; the crontab runs the following script:

```shell
#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'
echo "Copy newest baseline, date: $(date)"
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# File name, e.g. DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
# URL of the actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/$BASELINE_FILE"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
if [ -f "$BASELINE_FILE" ]; then
    echo "File already exists, exit 1"
    exit 1
fi
wget --no-verbose "$BASELINE_URL"
# See also https://github.com/hbz/lobid/issues/304
```
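The `-f` check in the script is what makes re-runs idempotent: if the baseline for the newest date was already fetched, the script exits non-zero instead of downloading again. A minimal sketch of that guard in isolation, using a temp directory and a fixed file name as stand-ins for the real paths:

```shell
# Stand-ins for the real download directory and baseline file name.
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-2016042319.tar.gz"

# First run: the file is not there yet, so we would download it.
if [ -f "$BASELINE_FILE" ]; then FIRST="skip"; else FIRST="download"; fi
touch "$BASELINE_FILE"   # stand-in for the wget download

# Second run: the file exists, so the guard fires and we would exit 1.
if [ -f "$BASELINE_FILE" ]; then SECOND="skip"; else SECOND="download"; fi

echo "$FIRST $SECOND"    # prints: download skip
```

Exiting with status 1 also means a stale re-run shows up as a failure in the cron log rather than silently re-downloading.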
Logging output confirms that the timing should be correct. Not switching to the new index, as it would be missing updates. To save space, we should delete it.
Note: the http://lobid.org/download/dumps/DE-605/mabxml/ directory is also messed up. These files are built by the script, but the crucial commands were commented out even in the original file, see #304 (comment). Commented them in so that the old files will be moved. (For diffs, I made copies of the files, suffixing a timestamp.)
I also don't understand the cause of the problem, so I added a debug parameter to the script to get more information ("bash -x ..."). Also, the nfs server [...]
Same issue, took [...]
Still not clear. Did the following, though: [...]
We have to wait till next Saturday. Since the resources are updated in the productive, older index from 2016-05-07, this is no problem, as long as there were no mapping changes in the data transformations since then (which would only apply to the updates, not the base). Is this so, @fsteeg? Otherwise I would make sure that all updates are indexed into the newest index from 2016-05-20 and switch to that index.
No, there were no transformation changes that are not productive yet, so +1 for keeping the old index. For next week's run, maybe we should try a bigger time change, like 9 hours (Saturday afternoon)? |
A bigger time change is an option, agreed. But it implies moving the daily updates accordingly later on Saturday AND having an extra crontab entry for the daily update on Saturday. First of all, I would want to know what's actually going on there, so let's just wait and see what the logs tell us next time before we move the time of getting and feeding the base data.
The cause of the phenomenon was the rsyncing of the file to the webserver at 6:01, which preserved the timestamp on the file and thus caused the confusion.
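That explanation can be reproduced without rsync: any timestamp-preserving copy keeps the source file's mtime, so the mtime of the webserver copy says nothing about when the copy itself happened. A minimal sketch, using `cp -p` as a stand-in for `rsync -a`:

```shell
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
# "Baseline" file with an old modification time (6:01 on 2016-05-07).
touch -t 201605070601 baseline.tar.gz
# Timestamp-preserving copy, as rsync -a (or scp -p) would do.
cp -p baseline.tar.gz webserver-copy.tar.gz
# The copy is not newer than the original: the mtime was carried over,
# so checking mtimes cannot tell you when the transfer ran.
if [ webserver-copy.tar.gz -nt baseline.tar.gz ]; then
  RESULT="copy looks newer"
else
  RESULT="timestamps preserved"
fi
echo "$RESULT"   # prints: timestamps preserved
```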
Weekly full index creation failed due to changes in server infrastructure we depend on. Affects API 1.x and data 2.0.