-
Notifications
You must be signed in to change notification settings - Fork 89
Change filelock behaviour #1906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was quick.
Can you test this on Windows?
createDBFolder :: Monad m => HasFS m h -> m () | ||
createDBFolder hasFS = do | ||
createDirectoryIfMissing hasFS True root | ||
void $ hOpen hasFS pFile (WriteMode AllowExisting) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quoting what you quoted in #1903 (comment):
[..], on Windows 'System.IO.openFile' for a lock file will fail when the lock is held
So this is wrong as it tries to open the lock file even if it's locked.
Question: does tryLockFile
work when the file doesn't exist yet? If so, then that would be easier. Ah, see https://github.com/takano-akio/filelock/blob/9681ff960d2695cd67fe749edab0e2af06c6f829/System/FileLock/Internal/LockFileEx.hsc#L40
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's the content of the file that is locked, not the file as a whole. That's why the failure in the logs ocured on reading the file, not opening the file.
Note that the file locking functions in base
only work on open files.
root = mkFsPath [] | ||
pFile = fsPathFromList ["dblock"] | ||
|
||
-- | We try to lock the db multiple times, before we give up and throw an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in #1903 (comment), I wouldn't retry this until we know we need it.
UPDATE: I was mistaken, retries are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's wait until we're sure: #1903 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use a timeout on a blocking lock acquire, we don't need retries.
void $ hOpen hasFS pFile (WriteMode AllowExisting) | ||
where | ||
root = mkFsPath [] | ||
pFile = fsPathFromList ["dblock"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will go in light of my comment above it, but otherwise should have used dbLockFile
here.
ouroboros-consensus/src/Ouroboros/Consensus/Storage/ImmutableDB/Impl.hs
Outdated
Show resolved
Hide resolved
threadDelay $ (n + 1) * 100000 | ||
lockAttempt $ n + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go for retries, let's use exponential backoff, not linear. Also we should document the behaviour: first wait x millisec, then ..., upto ..., etc. And we should indeed trace something, but this requires using a Tracer
, and it's a bit annoying to have to add an extra tracer (cardano-node
will have to be aware of it) just to trace one message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the base
package implementation of file locking on both posix, Linux and Windows will support the blocking lock acquite being interrupted by a timeout (async exception).
This would allow for a simple scheme:
res <- timeout 2000000 $ hLock lockfile ExclusiveLock
http://hackage.haskell.org/package/base-4.12.0.0/docs/GHC-IO-Handle-Lock.html
And if we're only blocking for up to one or two seconds, we don't need any new tracing I think.
#1903 (comment)
IMHO, there's nothing wrong with using the same file for the magic id and the file lock. But if you do, certainly you must open the lock file, lock it and hold it open for the duration. Only when it's opened and locked can you try to read the content. But it's also ok to separate these into different files. |
What I recently learned (while working on |
2db434d
to
76c93c1
Compare
import Ouroboros.Consensus.Util.IOLike | ||
|
||
-- We use an empty file as a lock of the db. Some systems may delete the empty | ||
-- file when all its handles are closed. This is not an issue, since the file is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some systems may delete the empty file when all its handles are closed.
Does Windows do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I haven't seen it on windows or linux, but it's possible to happen.
mlock <- timeout (Time.secondsToDiffTime 2) $ wait a | ||
case mlock of | ||
Nothing -> throwM $ DbLocked lockFilePath | ||
Just lock -> finally action (unlockFile lock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some mask
ing missing. The simplest approach is to write a acquireLock
and a releaseLock
function and use those in bracket
, then bracket
will take care of the masking for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrBliss what do you think should be masked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if an async exception is thrown after obtaining the lock and before finally
? That's the thing with async exceptions, they can happen at any point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should mask while waiting on a timeout. Let me check it..
-- tryL (withLock touchLock) >>= | ||
-- (@?= (Left (DbLocked (dbPath </> T.unpack dbLockFile)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is commented out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as I explain above the test This test succeeds, but may leave a forgotten lock to the file
and this lock is only cleaned when all tests finish and the process dies. So probably we shouldn't include it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as I explain above the test
This test succeeds, but may leave a forgotten lock to the file
and this lock is only cleaned when all tests finish and the process dies. So probably we shouldn't include it?
But now we are not even testing whether the locking actually works.
this lock is only cleaned when all tests finish and the process dies
That's fine, or what's wrong with that?
ouroboros-consensus/src/Ouroboros/Consensus/Storage/ImmutableDB/Impl.hs
Outdated
Show resolved
Hide resolved
I tested flock from base, but I couldn't make it work properly. It just wouldn't lock the files effectively and I din't manage to get down to why exactly this happened. So I used the |
169d469
to
5732e3d
Compare
withLockDB hasFS dbPath action = do | ||
createDirectoryIfMissing hasFS True root | ||
-- We want to avoid blocking the main thread at an uninterruptible ffi, to | ||
-- avoid unresponsiveness to timeouts and ^C. So we use async and let a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- avoid unresponsiveness to timeouts and ^C. So we use async and let a new | |
-- avoid unresponsiveness to timeouts and ^C. So we use 'async' and let a new |
-- avoid unresponsiveness to timeouts and ^C. So we use async and let a new | ||
-- thread do the actual ffi call. | ||
-- | ||
-- We shouldn't be tempted to use `withAsync`, which is usually mentioned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- We shouldn't be tempted to use `withAsync`, which is usually mentioned | |
-- We shouldn't be tempted to use 'withAsync', which is usually mentioned |
-- We shouldn't be tempted to use `withAsync`, which is usually mentioned | ||
-- as a better alternative, or try to synchronously cancel the forked | ||
-- thread during cleanup, since this would block the main thread and negate | ||
-- the whole point of using async. We try our best to clean resources, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- the whole point of using async. We try our best to clean resources, but | |
-- the whole point of using 'async'. We try our best to clean up resources, but |
Nothing -> throwM $ DbLocked lockFilePath | ||
Just lock -> unlockFile lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern match should be done in the acquisition of the lock so that we throw that exception in case of a timeout, instead of passing Nothing
to the action
, which ignores it. This also means that cleanup
will be just unlockFile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mrBliss I'm not sure I understand. This pattern matches if we got a timeout or async returned succesfully. So it can't be performed by the async thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When waiting on the async
that calls lockFile
times out, Nothing
is returned. This is Nothing
is then passed to (\_ -> action)
which ignores it. So even when we don't have the lock do we execute the action. After the action
is done or when action
throws an exception, we go to cleanup
. cleanup
then throws a DbLocked
exception in case it got Nothing
, which is too late, as we already executed the action
.
What we should do is (untype-checked):
createDirectoryIfMissing hasFS True root
bracket acquireLock unLockFile (const action)
where
-- We want to avoid blocking ...
acquireLock :: IO FileLock
acquireLock = do
lockFileAsync <- async (lockFile lockFilePath Exclusive)
mbLock <- timeout (Time.secondsToDiffTime 2) $ wait lockFileAsync
case mbLock of
-- We timed out while waiting on the lock, the db is still locked
Nothing -> throwM $ DbLocked lockFilePath
Just lock -> return lock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh right, fixed
|
||
-- | We use an empty file as a lock of the db so that the database cannot be | ||
-- opened by more than one process. We wait up to 2 seconds to take the lock, | ||
-- before timeout. Some systems may delete the empty file when all its handles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- before timeout. Some systems may delete the empty file when all its handles | |
-- before timing out and throwing a 'DbLocked' exception. Some systems may delete the empty file when all its handles |
Yeah, I agree, that's a good reason. |
9626a35
to
0a7bee1
Compare
(I have temporarily added the commits from #1851 to this branch to do a final check on Windows. I won't merge these commits, because building/testing everything on Windows is too slow because of the lack of caching.) |
dfdcda6
to
0a7bee1
Compare
bors merge |
Build failed |
bors merge |
Build failed |
0a7bee1
to
a87a1ac
Compare
bors merge |
Build failed |
6d06fbf
to
7ed5339
Compare
ouroboros-consensus/src/Ouroboros/Consensus/Util/FileLocking.hs
Outdated
Show resolved
Hide resolved
ouroboros-consensus/ouroboros-consensus-test-infra/src/Test/Util/FileLocking.hs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, @kderme good work on the FFI analysis!
I also really like how the new property looks now.
7ed5339
to
09699ac
Compare
071f557
to
eab641c
Compare
eab641c
to
99849fd
Compare
bors merge |
#1903