Fix weekly full index creation #304

fsteeg · 2016-04-25T08:45:00Z

Weekly full index creation failed due to changes in server infrastructure we depend on. Affects API 1.x and data 2.0.

fsteeg · 2016-04-25T12:02:34Z

The original issue seems to be a failing download of the latest baseline dump from persephone in gaia:/opt/hadoop/cron/copyNewestFullDump.sh, which is called from the hduser@weywot1 crontab (can't connect via SSH, maybe a missing key or account on the new persephone system).

Manually downloaded to gaia:/files/open_data/open/DE-605/mabxml with wget http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz, manually set alias and started full indexing (as in hduser@weywot1 crontab).

For a permanent solution, we need to fix the automated download. It might make sense to get it over HTTP in general (as I did manually above). The wget above took about 2 minutes for 7.5 GB, so no issue there. Contacted JP to make sure getting it from http://index.hbz-nrw.de makes sense.

fsteeg · 2016-04-26T09:23:09Z

Indexing worked, and JP confirmed that we should use http://index.hbz-nrw.de/alephxml/export/

Next: set up baseline downloads via HTTP in server setup starting from crontab for hduser@weywot1

fsteeg · 2016-04-26T11:46:18Z

Adding the script changes below, as the affected script is not under version control.

Replaced the old content of gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

DIR=/files/open_data/open/DE-605/mabxml
oldFile=$(ls $DIR/DE-605-aleph-base*2*.tar.gz)
oldUpdateFiles=$(ls  $DIR/DE-605-aleph-update-marcxchange-*.tar.gz)
ssh admin@persephone 'cd /data/alephxml/export/baseline/ ; a=$(ls -cR | grep tar.gz | head -n 1); a=$(find . -name $a) ; scp $a hduser@gaia:/files/open_data/open/DE-605/mabxml'
#mv $oldFile /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/DE-605-aleph-newestBackupOfMonth-$(date +%m).tar.gz
#for i in $oldUpdateFiles; do rm $i; done

With new content:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl $BASELINE_ROOT/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
wget --no-verbose $BASELINE_URL

# See also https://github.com/hbz/lobid/issues/304

And changed crontab entry to redirect output to a log file:

ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1 ; [...]'

Tested trigger from crontab for hduser@weywot1, closing.
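
The pipeline in the script above picks the last listing line that contains '20', so it depends on the order and layout of the directory listing. A minimal sketch of a stricter variant follows. It assumes the listing links each dump directory as href="YYYYMMDDHH/" (ten digits, as in "2016042319"); the function name latest_baseline_date is hypothetical and not part of the deployed script.

```shell
# Print the newest baseline directory name found in an HTML directory
# listing read from stdin. Assumes directory names are exactly ten digits
# (YYYYMMDDHH), linked as href="2016042319/" in the listing.
latest_baseline_date() {
    grep -oE 'href="[0-9]{10}/?"' \
        | grep -oE '[0-9]{10}' \
        | sort \
        | tail -n 1
}

# Possible use inside the script (BASELINE_ROOT as defined there):
# BASELINE_DATE="$(curl --silent --fail "$BASELINE_ROOT/" | latest_baseline_date)"
```

Sorting explicitly means the result no longer depends on the order in which the web server lists the directories.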

fsteeg · 2016-05-09T07:28:20Z

Reopening: the weekly updates don't pick up the latest baseline. The times at http://index.hbz-nrw.de/alephxml/export/baseline/ look good, and since the crontab for hduser@weywot1 runs at 5:20, it should see the latest baseline. Manual execution of the script yields the correct baseline. Added debug output of the actual date to the script, see current content below. Keeping the previous baseline index, including all updates, as the productive index.

hduser@gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

echo "Copy newest baseline, date: $(date)"
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl $BASELINE_ROOT/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# File name, e.g. DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/$BASELINE_FILE"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
if [ -f $BASELINE_FILE ]; then
    echo "File already exists, exit 1"
    exit 1
fi
wget --no-verbose $BASELINE_URL

# See also https://github.com/hbz/lobid/issues/304

fsteeg · 2016-05-17T07:16:37Z

Logging output confirms that the timing should be correct: Sa 14. Mai 05:20:01 CEST 2016, but got DE-605-aleph-baseline-marcxchange-2016050614.tar.gz, even though according to http://index.hbz-nrw.de/alephxml/export/baseline/2016051314/, the latest dump was written on 13-May-2016 23:04.

Not switching to new index, as it would be missing updates. To save space, we should delete it.

dr0i · 2016-05-19T13:50:16Z

Note: http://lobid.org/download/dumps/DE-605/mabxml/ is also messed up. These files are built by the script, but the crucial commands were commented out even in the original file, see #304 (comment). I re-enabled them so that the old files will be moved. (For diffs, I made copies of the files, suffixing a timestamp.)
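
For illustration, the two commented-out cleanup steps from the original script (move the previous baseline to a per-month backup, delete the old update files) could be sketched as a function like the one below. The function name rotate_old_dumps and its two directory arguments are hypothetical; the file patterns and the backup file name are taken from the original script. Like the original, it would have to run before the new baseline is downloaded, otherwise the fresh file would be moved as well.

```shell
# Move the previous baseline to a per-month backup and delete old update
# files. Must be called BEFORE downloading the new baseline.
# Arguments: $1 = dump directory, $2 = backup directory (both hypothetical).
rotate_old_dumps() {
    local dir="$1" backup_dir="$2" month f
    month="$(date +%m)"
    for f in "$dir"/DE-605-aleph-base*2*.tar.gz; do
        [ -e "$f" ] || continue   # pattern matched nothing
        # If several baselines exist, the last one in glob order wins.
        mv -- "$f" "$backup_dir/DE-605-aleph-newestBackupOfMonth-$month.tar.gz"
    done
    for f in "$dir"/DE-605-aleph-update-marcxchange-*.tar.gz; do
        [ -e "$f" ] || continue
        rm -- "$f"
    done
}
```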

dr0i · 2016-05-19T14:39:13Z

I don't understand the cause of the problem either. I therefore added a debug parameter to the script call to get more information ("bash -x ..."). Also, the NFS server demeter is now unmounted (not sure if this has something to do with it).
@fsteeg again, please check on Monday whether this is working. If necessary, I will analyze further on Tuesday.
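
As a small illustration of the debug parameter: bash -x prints each command to stderr before running it, so the trace only reaches the log if stderr is redirected as well, as the crontab entry does with 2>&1. The helper name run_traced and its arguments are hypothetical.

```shell
# Run a script with command tracing and capture both stdout and the
# trace (which bash writes to stderr) in a single log file.
# Arguments: $1 = script to run, $2 = log file (names are hypothetical).
run_traced() {
    local script="$1" log="$2"
    bash -x "$script" > "$log" 2>&1
}
```

With the default trace prefix, each executed command shows up in the log as a line starting with "+ ", followed by the command's own output.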

fsteeg · 2016-05-23T07:08:18Z

Same issue, took DE-605-aleph-baseline-marcxchange-2016051314.tar.gz, no additional output in log.

dr0i · 2016-05-24T09:35:38Z

Still not clear. Did the following though:

Re "no additional output": I forgot to add the parameter to the cron call on Saturday (only added it to the test call).
Delayed the execution time of the script by 20 minutes, to 5:40.
Fixed the cleaning of old dumps, see http://lobid.org/download/dumps/DE-605/mabxml/

We have to wait until next Saturday. Since the resources are updated in the older productive index from 2016-05-07, this is no problem, as long as there was no mapping change in the data transformations since then (which would only apply to the updates, not the base). Is this so, @fsteeg? Otherwise I would make sure that all updates are indexed into the newest index from 2016-05-20 and switch to this index.

fsteeg · 2016-05-24T09:45:30Z

No, there were no transformation changes that are not productive yet, so +1 for keeping the old index.

For next week's run, maybe we should try a bigger time change, like 9 hours (Saturday afternoon)?

dr0i · 2016-05-24T09:59:56Z

A bigger time change is an option, agreed. But it implies running the daily updates correspondingly later on Saturday AND adding an extra crontab entry for the daily update on Saturday. First of all, I want to know what's going on there, so I would just wait for what the logs tell us next time before we move the time for getting and feeding the base data.

dr0i · 2016-06-07T11:52:30Z

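The cause of the phenomenon was that the file is rsynced to the webserver at 6:01 with its timestamp preserved. The listing therefore showed the original write time (e.g. 13-May-2016 23:04) although the file only became available on the webserver after the cron job had already run, hence the confusion.
Modified the cron to start at 6:15.
That worked well.
Closing.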
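
Moving the cron job fixes the timing. As a complementary safeguard, the script could refuse to continue when the newest baseline it finds is not newer than the one processed last time. The sketch below assumes a state file that holds the date of the last processed baseline (e.g. "2016050614"); this file and the function name is_newer_than_last are hypothetical and not part of the actual setup.

```shell
# Succeed (exit status 0) if the candidate baseline date is newer than
# the date stored in the state file, or if no date has been stored yet.
# Arguments: $1 = candidate date (YYYYMMDDHH), $2 = state file (hypothetical).
is_newer_than_last() {
    local candidate="$1" state_file="$2" last=""
    if [ -f "$state_file" ]; then
        last="$(cat "$state_file")"
    fi
    [ -z "$last" ] || [ "$candidate" -gt "$last" ]
}

# Possible use in the download script (paths are assumptions):
# is_newer_than_last "$BASELINE_DATE" "$HOME/cron/lastBaseline.txt" \
#     || { echo "Stale baseline $BASELINE_DATE, exit 1"; exit 1; }
# ... download ..., then record the date after a successful run:
# printf '%s\n' "$BASELINE_DATE" > "$HOME/cron/lastBaseline.txt"
```

Unlike a check for an existing file of the same name, this keeps working after old dumps have been moved out of the download directory.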

fsteeg self-assigned this Apr 25, 2016

fsteeg added the working label Apr 25, 2016

This was referenced Apr 25, 2016

Include names of corporate bodies without GND ID #302

Closed

Adjust context and morph to use new namespace for RDA properties hbz/lobid-resources#60

Closed

fsteeg mentioned this issue Apr 26, 2016

Resources missing #305

Closed

fsteeg closed this as completed Apr 26, 2016

fsteeg removed the working label Apr 26, 2016

fsteeg reopened this May 9, 2016

fsteeg removed their assignment May 17, 2016

fsteeg added the ready label May 17, 2016

dr0i self-assigned this May 17, 2016

fsteeg added the bug label May 18, 2016

dr0i assigned fsteeg and unassigned dr0i May 19, 2016

dr0i added review and removed ready labels May 19, 2016

fsteeg assigned dr0i and unassigned fsteeg May 23, 2016

dr0i closed this as completed Jun 7, 2016

dr0i removed the review label Jun 7, 2016
