Rose Ana database corruption on Lustre filesystem #1897

ScottWales · 2016-05-20T05:36:16Z

We have been seeing intermittent rose ana failures in our nightly tests, with a message of:

[FAIL] database disk image is malformed

After some investigation, it turns out that our HPC system supports the file locking that SQLite uses only within a single node (the filesystem is mounted with Lustre's localflock option). This means that if multiple rose ana tasks start at the same time on different nodes, SQLite on each node will not see the locks held on the others, and the concurrent writes corrupt the file.

It is still possible to obtain a lock on our system by attempting to open a file in exclusive mode (open('foo.lock','x')). Can a config option be added to use a file-based lock for database writes or, as the database is non-essential, can this error be caught and ignored?
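For illustration, the exclusive-open approach could be wrapped roughly as follows. This is only a sketch, not rose ana code: the name exclusive_file_lock and the timeout and poll_interval parameters are hypothetical, and it assumes that exclusive file creation is atomic on the filesystem in question, as reported above.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def exclusive_file_lock(lock_path, timeout=60.0, poll_interval=0.1):
    """Hold a lock by exclusively creating lock_path; remove it on exit.

    Hypothetical helper: relies on O_CREAT | O_EXCL being atomic on the
    target filesystem, which is an assumption to verify per site.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Equivalent to open(lock_path, 'x'): fails if the file exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire " + lock_path)
            time.sleep(poll_interval)
    try:
        # Record the owner's PID to help diagnose stale lock files.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

Database writes would then go inside a "with exclusive_file_lock(path):" block. One caveat of this design is that a task killed while holding the lock leaves the lock file behind, so some form of stale-lock cleanup would be needed in practice.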

cc @MartinDix

MartinDix · 2016-05-20T05:39:59Z

As a workaround we'll use a cylc queue so that only one rose-ana task can run at a time.

matthewrmshin · 2016-10-31T15:46:39Z

@arjclark to talk to @stevewardle.

arjclark · 2016-11-01T09:51:25Z

@stevewardle to check with partners (in particular @ScottWales) whether this problem still exists with the new rose ana introduced in #1996, once the new release is out.

arjclark · 2016-12-05T09:22:30Z

@stevewardle - any update on this?

MartinDix · 2017-01-25T03:17:55Z

Still a problem at NCI. rose ana now uses

        lock = open(lockfile, "w")
        fcntl.flock(lock, fcntl.LOCK_EX)

This doesn't work across nodes at our site. Testing this on two different login nodes shows that they can both obtain the lock simultaneously. Same behavior on /short, /home and /g/data filesystems.
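A check along these lines can be reproduced with a small non-blocking probe. The sketch below is an assumption about how such a test might be written, not the script actually used; try_flock is a hypothetical name.

```python
import fcntl


def try_flock(lockfile):
    """Try to take an exclusive flock on lockfile without blocking.

    Returns the open file object holding the lock, or None if another
    holder already has it. Hypothetical helper for testing only.
    """
    handle = open(lockfile, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

Calling try_flock on the same path from two nodes, while keeping the first handle open, should return None on the second node if locks are shared across nodes. If both calls return a file object, the lock is only being enforced locally.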

matthewrmshin added this to the soon milestone May 20, 2016

matthewrmshin assigned scwhitehouse and stevewardle and unassigned scwhitehouse May 20, 2016

sadielbartholomew added built-in apps bug? labels Aug 23, 2019

matthewrmshin modified the milestones: soon, beyond-next-feature Sep 19, 2019
