Rose Ana database corruption on Lustre filesystem #1897

Open
ScottWales opened this issue May 20, 2016 · 5 comments

@ScottWales
Contributor

We have been seeing intermittent rose ana failures in our nightly tests, with a message of:

[FAIL] database disk image is malformed

After some investigation, it turns out our HPC only supports the file locking that SQLite uses within a single node (it uses Lustre's localflock option). This means that if multiple rose ana tasks start at the same time on different nodes, the SQLite instances will not see each other's locks, and concurrent writes corrupt the file.

It is still possible to obtain a lock on our system by attempting to open a file in exclusive mode (open('foo.lock','x')). Can a config option be added to use a file-based lock for database writes, or, as the database is non-essential, can this error be caught and ignored?
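
For illustration, here is a minimal sketch of what a file-based lock built on the exclusive-mode open could look like. The helper name `exclusive_file_lock`, its `timeout` and `poll_interval` parameters, and the retry loop are assumptions made for this example; none of it is taken from rose ana.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def exclusive_file_lock(lock_path, timeout=60.0, poll_interval=0.5):
    """Hold a lock by exclusively creating lock_path (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails with FileExistsError if the file already
            # exists, the same behaviour as open(lock_path, 'x').
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # Another task holds the lock: wait and retry until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire %s" % lock_path)
            time.sleep(poll_interval)
    try:
        os.close(fd)
        yield
    finally:
        # Release the lock by deleting the lock file.
        os.remove(lock_path)
```

Database writes would then be wrapped in `with exclusive_file_lock('foo.lock'):`. One caveat of this approach is that a task killed while holding the lock leaves a stale lock file behind, so a real implementation would probably need some cleanup or expiry policy.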

cc @MartinDix

@MartinDix
Contributor

As a workaround we'll use a cylc queue so that only one rose-ana task can run at a time.

@matthewrmshin added this to the soon milestone May 20, 2016
@matthewrmshin
Member

@arjclark to talk to @stevewardle.

@arjclark
Contributor

arjclark commented Nov 1, 2016

@stevewardle to check with partners (in particular @ScottWales) whether this problem still exists with the new rose ana introduced in #1996, once the new release is out.

@arjclark
Contributor

arjclark commented Dec 5, 2016

@stevewardle - any update on this?

@MartinDix
Contributor

Still a problem at NCI. rose ana now uses:

        lock = open(lockfile, "w")
        fcntl.flock(lock, fcntl.LOCK_EX)

This doesn't work across nodes at our site. Testing this on two different login nodes shows that they can both obtain the lock simultaneously. Same behavior on /short, /home and /g/data filesystems.
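
A check along these lines can be reproduced with a small probe. This is only a sketch: the function name `try_flock` and the non-blocking `LOCK_NB` flag are choices made for this example, not what rose ana does.

```python
import fcntl


def try_flock(lockfile):
    """Return an open, locked file object, or None if the lock is held elsewhere."""
    lock = open(lockfile, "w")
    try:
        # LOCK_NB makes flock raise instead of blocking when the lock is taken.
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock.close()
        return None
    # Keep the returned file object open for as long as the lock is needed;
    # closing it releases the lock.
    return lock
```

Calling `try_flock` on the same path from an interactive session on each of two nodes, and keeping the first result open, shows whether the lock is honoured across nodes: if the second call also returns a file object rather than `None`, both nodes hold the "exclusive" lock at once.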
