Fix weekly full index creation #304

Closed
fsteeg opened this issue Apr 25, 2016 · 12 comments

@fsteeg
Member

fsteeg commented Apr 25, 2016

Weekly full index creation failed due to changes in server infrastructure we depend on. Affects API 1.x and data 2.0.

@fsteeg
Member Author

fsteeg commented Apr 25, 2016

The root problem seems to be that gaia:/opt/hadoop/cron/copyNewestFullDump.sh, which is called from the hduser@weywot1 crontab, fails to download the latest baseline dump from persephone (can't connect via SSH, maybe a missing key or account on the new persephone system).

Manually downloaded to gaia:/files/open_data/open/DE-605/mabxml with wget http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz, manually set alias and started full indexing (as in hduser@weywot1 crontab).

For a permanent solution, we need to fix the automated download. It might make sense to get it over HTTP in general (as I did manually above). The wget above took about 2 minutes for 7.5 GB, so no issue there. Contacted JP to make sure getting it from http://index.hbz-nrw.de makes sense.

@fsteeg
Member Author

fsteeg commented Apr 26, 2016

Indexing worked, and JP confirmed that we should use http://index.hbz-nrw.de/alephxml/export/

Next: set up baseline downloads via HTTP in server setup starting from crontab for hduser@weywot1

@fsteeg
Member Author

fsteeg commented Apr 26, 2016

Adding script changes below as affected script is not under version control.

Replaced the old content of gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

DIR=/files/open_data/open/DE-605/mabxml
oldFile=$(ls $DIR/DE-605-aleph-base*2*.tar.gz)
oldUpdateFiles=$(ls  $DIR/DE-605-aleph-update-marcxchange-*.tar.gz)
ssh admin@persephone 'cd /data/alephxml/export/baseline/ ; a=$(ls -cR | grep tar.gz | head -n 1); a=$(find . -name $a) ; scp $a hduser@gaia:/files/open_data/open/DE-605/mabxml'
#mv $oldFile /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/DE-605-aleph-newestBackupOfMonth-$(date +%m).tar.gz
#for i in $oldUpdateFiles; do rm $i; done

With new content:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
wget --no-verbose "$BASELINE_URL"

# See also https://github.com/hbz/lobid/issues/304
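
The date extraction in the script above relies on the order of the directory listing and on `grep '20'` matching only the dump directories. A minimal sketch of a stricter variant follows. It assumes the listing contains links of the form `href="2016042319/"` (which is what the `cut -d '"' -f2` step implies) and sorts the stamps explicitly. The function name `latest_baseline_date` is hypothetical and not part of the deployed script.

```shell
# Sketch only: latest_baseline_date is a hypothetical helper, not the deployed code.
# Reads the HTML directory listing on stdin and prints the newest 10-digit stamp.
# Assumption: dump directories appear as links like href="2016042319/".
latest_baseline_date() {
  grep -oE 'href="[0-9]{10}/"' |  # keep only links to 10-digit directories
    grep -oE '[0-9]{10}' |        # strip the href="..." wrapper
    sort -n |                     # order numerically, independent of listing order
    tail -n 1                     # newest stamp is the largest number
}
```

In the script it could be used as `BASELINE_DATE="$(curl --silent --fail "$BASELINE_ROOT/" | latest_baseline_date)"`. With `set -euo pipefail` active, an empty or unexpected listing makes the pipeline fail and stops the script instead of building a URL from an empty date.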

And changed crontab entry to redirect output to a log file:

ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1 ; [...]'

Tested trigger from crontab for hduser@weywot1, closing.

@fsteeg fsteeg closed this as completed Apr 26, 2016
@fsteeg fsteeg removed the working label Apr 26, 2016
@fsteeg fsteeg reopened this May 9, 2016
@fsteeg
Member Author

fsteeg commented May 9, 2016

Reopening: weekly updates don't pick up latest baseline. Times at http://index.hbz-nrw.de/alephxml/export/baseline/ look good, crontab for hduser@weywot1 timed at 5:20, it should see the latest baseline. Manual execution of script yields correct baseline. Added debug output of actual date in the script, see current content below. Keeping previous baseline index including all updates as productive index.

hduser@gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

echo "Copy newest baseline, date: $(date)"
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# File name, e.g. DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/$BASELINE_FILE"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
if [ -f "$BASELINE_FILE" ]; then
    echo "File already exists, exit 1"
    exit 1
fi
wget --no-verbose "$BASELINE_URL"

# See also https://github.com/hbz/lobid/issues/304

@fsteeg
Member Author

fsteeg commented May 17, 2016

Logging output confirms that the timing should be correct: Sa 14. Mai 05:20:01 CEST 2016, but got DE-605-aleph-baseline-marcxchange-2016050614.tar.gz, even though according to http://index.hbz-nrw.de/alephxml/export/baseline/2016051314/, the latest dump was written on 13-May-2016 23:04.

Running curl http://index.hbz-nrw.de/alephxml/export/baseline/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev (see script above) now gives the correct result, 2016051314. Perhaps the file timestamp is misleading, and we should schedule the cron job for a later time, @dr0i?

Not switching to new index, as it would be missing updates. To save space, we should delete it.
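
The mismatch described above (a run on 14 May fetching the 2016050614 baseline) could be caught before indexing with a simple staleness guard. The following is a sketch under assumptions, not something that was deployed: the function name `baseline_not_older_than` is hypothetical, and only the first eight digits of the stamp are interpreted as a date (YYYYMMDD), since the meaning of the last two digits is not stated in this thread.

```shell
# Sketch only: baseline_not_older_than is a hypothetical helper.
# Usage: baseline_not_older_than STAMP MIN_DAY
#   STAMP   baseline stamp as used in the file names, e.g. 2016051314
#   MIN_DAY oldest acceptable day as YYYYMMDD, e.g. 20160508
# Returns 0 if the baseline day is MIN_DAY or newer, non-zero otherwise.
baseline_not_older_than() {
  local stamp="$1" min_day="$2" stamp_day
  # Only the leading YYYYMMDD part is compared (assumption, see above).
  stamp_day="$(printf '%s' "$stamp" | cut -c 1-8)"
  [ "$stamp_day" -ge "$min_day" ]
}
```

With GNU `date`, the weekly script could call it as `baseline_not_older_than "$BASELINE_DATE" "$(date -d '6 days ago' +%Y%m%d)" || { echo "Stale baseline"; exit 1; }`. The six-day window is an illustrative choice for a weekly dump, not a value taken from the setup discussed here.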

@fsteeg fsteeg removed their assignment May 17, 2016
@fsteeg fsteeg added the ready label May 17, 2016
@dr0i dr0i self-assigned this May 17, 2016
@fsteeg fsteeg added the bug label May 18, 2016
@dr0i
Member

dr0i commented May 19, 2016

Note: http://lobid.org/download/dumps/DE-605/mabxml/ is also messed up. These files are built by the script, but the crucial commands were commented out even in the original file, see #304 (comment). I uncommented them so that the old files get moved. (For diffs, I made copies of the files, suffixing a timestamp.)
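
The cleanup that the two commented-out lines of the old script describe (move the old baseline to a monthly backup, delete old update files) can be sketched as follows. This is an illustration under assumptions, not the commands that were actually enabled: the function name `rotate_old_dumps` and its parameters are hypothetical, and only the backup file name pattern is taken from the old script.

```shell
# Sketch only: rotate_old_dumps is a hypothetical helper.
# Usage: rotate_old_dumps DUMP_DIR BACKUP_DIR KEEP_FILE
#   DUMP_DIR   directory holding the baseline and update tarballs
#   BACKUP_DIR directory receiving the monthly backup
#   KEEP_FILE  name of the freshly downloaded baseline, which must stay in place
rotate_old_dumps() {
  local dir="$1" backup_dir="$2" keep="$3" month f
  month="$(date +%m)"
  for f in "$dir"/DE-605-aleph-baseline-*.tar.gz; do
    if [ ! -e "$f" ]; then continue; fi                      # glob matched nothing
    if [ "$(basename "$f")" = "$keep" ]; then continue; fi   # keep the new baseline
    # Same target name as in the old script: one backup per month, overwritten.
    mv -- "$f" "$backup_dir/DE-605-aleph-newestBackupOfMonth-$month.tar.gz"
  done
  for f in "$dir"/DE-605-aleph-update-marcxchange-*.tar.gz; do
    if [ ! -e "$f" ]; then continue; fi
    rm -- "$f"                                               # old updates are dropped
  done
}
```

If several old baselines are present, they all map to the same backup name, so only the last one moved survives. That mirrors the naming in the old script and may or may not be what is wanted.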

@dr0i
Member

dr0i commented May 19, 2016

I also don't understand the cause of the problem, so I added a debug parameter to the script call to get more information ("bash -x ..."). Also, the NFS server demeter is now unmounted (not sure if this has anything to do with it).
@fsteeg, please check again on Monday whether this is working. If necessary, I will analyze further on Tuesday.

@dr0i dr0i assigned fsteeg and unassigned dr0i May 19, 2016
@dr0i dr0i added review and removed ready labels May 19, 2016
@fsteeg
Member Author

fsteeg commented May 23, 2016

Same issue, took DE-605-aleph-baseline-marcxchange-2016051314.tar.gz, no additional output in log.

@fsteeg fsteeg assigned dr0i and unassigned fsteeg May 23, 2016
@dr0i
Member

dr0i commented May 24, 2016

Still not clear. Did the following though:

  • Re "no additional output": I forgot to add the parameter to the cron call on Saturday (only added it to the test call)
  • moved the script's start time 20 minutes later, to 5:40
  • fixed the cleaning of old dumps, see http://lobid.org/download/dumps/DE-605/mabxml/
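
One way to avoid a debug parameter that exists in the test call but not in the cron call is to keep the switch inside the script itself. A minimal sketch, assuming a hypothetical environment variable `COPY_DUMP_DEBUG` and a hypothetical helper `enable_trace_if_requested`, neither of which is part of the script discussed here:

```shell
# Sketch only: COPY_DUMP_DEBUG and enable_trace_if_requested are hypothetical.
# Tracing defaults to on while the problem is being investigated;
# set COPY_DUMP_DEBUG=0 to switch it off again.
enable_trace_if_requested() {
  if [ "${COPY_DUMP_DEBUG:-1}" = "1" ]; then
    PS4='+ trace: '  # prefix that makes trace lines easy to grep in the log
    set -x           # same effect as starting the script with "bash -x"
  fi
}
```

Calling `enable_trace_if_requested` near the top of the script gives every invocation, whether from cron or from a manual test, the same trace output on stderr, which the crontab entry already redirects to the log file.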

We have to wait until next Saturday. Since the resources keep being updated in the older productive index from 2016-05-07, this is no problem, as long as there has been no mapping change in the data transformations since then (which would only have been applied to the updates, not the base). Is this so, @fsteeg? Otherwise I would make sure that all updates are indexed into the newest index from 2016-05-20 and switch to that index.

@fsteeg
Member Author

fsteeg commented May 24, 2016

No, there were no transformation changes that are not productive yet, so +1 for keeping the old index.

For next week's run, maybe we should try a bigger time change, like 9 hours (Saturday afternoon)?

@dr0i
Member

dr0i commented May 24, 2016

A bigger time change is an option, agreed. But it implies running the daily updates correspondingly later on Saturday AND having an extra crontab entry for the Saturday daily update. Above all, I want to know what's going on there, so I would just wait for what the logs tell us next time before we move the time for getting and feeding the base data.

@dr0i
Member

dr0i commented Jun 7, 2016

The cause of the phenomenon was that the file is rsynced to the web server at 6:01 with its timestamp preserved, so the time shown in the listing was misleading.
Modified the cron to start at 6:15.
That worked well.
Closing.
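
The fix above moves the cron start past the 6:01 rsync. A hedged alternative, which was not used here, would be to make the script independent of the exact sync time by polling until a baseline newer than the previous one shows up. The sketch below assumes a hypothetical helper `wait_for_newer_baseline`; the command that prints the latest stamp is passed in by the caller.

```shell
# Sketch only: wait_for_newer_baseline is a hypothetical helper.
# Usage: wait_for_newer_baseline PREVIOUS_STAMP MAX_TRIES PAUSE_SECONDS COMMAND...
#   COMMAND must print the latest available stamp, e.g. 2016051314.
# Prints the new stamp and returns 0 once COMMAND reports a stamp greater than
# PREVIOUS_STAMP; returns 1 if that never happens within MAX_TRIES attempts.
wait_for_newer_baseline() {
  local previous="$1" max_tries="$2" pause="$3" try=1 latest
  shift 3
  while [ "$try" -le "$max_tries" ]; do
    latest="$("$@")" || latest=""           # a failing lookup counts as "not yet"
    if [ -n "$latest" ] && [ "$latest" -gt "$previous" ]; then
      printf '%s\n' "$latest"
      return 0
    fi
    try=$((try + 1))
    if [ "$try" -le "$max_tries" ]; then
      sleep "$pause"                        # wait before asking again
    fi
  done
  return 1
}
```

The trade-off is that the job may run for a long time when no new dump arrives, so `MAX_TRIES` and `PAUSE_SECONDS` would need to fit the schedule of the daily updates that follow.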

@dr0i dr0i closed this as completed Jun 7, 2016
@dr0i dr0i removed the review label Jun 7, 2016