Contributor

@kderme kderme commented Apr 6, 2020

@mrBliss mrBliss added the consensus issues related to ouroboros-consensus label Apr 6, 2020
Contributor

@mrBliss mrBliss left a comment

That was quick.

Can you test this on Windows?

createDBFolder :: Monad m => HasFS m h -> m ()
createDBFolder hasFS = do
    createDirectoryIfMissing hasFS True root
    void $ hOpen hasFS pFile (WriteMode AllowExisting)
Contributor

Quoting what you quoted in #1903 (comment):

[..], on Windows 'System.IO.openFile' for a lock file will fail when the lock is held

So this is wrong as it tries to open the lock file even if it's locked.

Question: does tryLockFile work when the file doesn't exist yet? If so, then that would be easier. Ah, see https://github.com/takano-akio/filelock/blob/9681ff960d2695cd67fe749edab0e2af06c6f829/System/FileLock/Internal/LockFileEx.hsc#L40

Contributor

@dcoutts dcoutts Apr 6, 2020

No, it's the content of the file that is locked, not the file as a whole. That's why the failure in the logs occurred on reading the file, not opening the file.

Note that the file locking functions in base only work on open files.

root = mkFsPath []
pFile = fsPathFromList ["dblock"]

-- | We try to lock the db multiple times, before we give up and throw an
Contributor

@mrBliss mrBliss Apr 6, 2020

As I said in #1903 (comment), I wouldn't retry this until we know we need it.

UPDATE: I was mistaken, retries are needed.

Contributor

Let's wait until we're sure: #1903 (comment)

Contributor

If we use a timeout on a blocking lock acquire, we don't need retries.

    void $ hOpen hasFS pFile (WriteMode AllowExisting)
  where
    root = mkFsPath []
    pFile = fsPathFromList ["dblock"]
Contributor

This will go away in light of my comment above, but otherwise this should have used dbLockFile.

Comment on lines 50 to 51
threadDelay $ (n + 1) * 100000
lockAttempt $ n + 1
Contributor

If we go for retries, let's use exponential backoff, not linear. We should also document the behaviour: first wait x milliseconds, then ..., up to ..., etc. And we should indeed trace something, but this requires using a Tracer, and it's a bit annoying to have to add an extra tracer (cardano-node will have to be aware of it) just to trace one message.
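
For illustration only, the exponential backoff suggested above could look roughly like the sketch below. The names backoffDelays and retryWithBackoff, the base delay and the cap are assumptions made for this example; none of them come from the PR.

```haskell
import Control.Concurrent (threadDelay)

-- | Delays in microseconds (hypothetical helper): start at 'base', double
-- after every attempt, and never exceed 'cap'.
backoffDelays :: Int -> Int -> [Int]
backoffDelays base cap = iterate (\d -> min cap (d * 2)) (min cap base)

-- | Run 'attempt' until it yields a result or 'maxAttempts' attempts have
-- been made, sleeping for the next delay between two attempts.
retryWithBackoff :: Int -> [Int] -> IO (Maybe a) -> IO (Maybe a)
retryWithBackoff maxAttempts delays attempt = go maxAttempts delays
  where
    go n ds
      | n <= 0    = return Nothing
      | otherwise = do
          r <- attempt
          case (r, ds) of
            (Just x, _) -> return (Just x)
            (Nothing, d : rest)
              | n > 1 -> threadDelay d >> go (n - 1) rest
            _ -> return Nothing
```

With a base of 100000 and a cap of 1600000, the waits would be 0.1 s, 0.2 s, 0.4 s, 0.8 s and then 1.6 s for every further attempt, which is also the kind of schedule the documentation could spell out.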

Contributor

It looks like the base package's implementation of file locking on POSIX, Linux and Windows supports the blocking lock acquire being interrupted by a timeout (async exception).

This would allow for a simple scheme:

res <- timeout 2000000 $ hLock lockfile ExclusiveLock

http://hackage.haskell.org/package/base-4.12.0.0/docs/GHC-IO-Handle-Lock.html

And if we're only blocking for up to one or two seconds, we don't need any new tracing I think.
#1903 (comment)
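
A slightly fuller sketch of that scheme, using only base, might look as follows. The names acquireWithTimeout and withFileLock and the lock file path in demo are made up for this example, and whether the blocking hLock call is really interrupted by the timeout depends on the platform and base version, so treat this as an assumption to verify rather than a guarantee.

```haskell
import Control.Exception (bracket, onException)
import GHC.IO.Handle.Lock (LockMode (ExclusiveLock), hLock)
import System.IO (Handle, IOMode (WriteMode), hClose, openFile)
import System.Timeout (timeout)

-- | Open the lock file and try to take an exclusive lock, giving up after
-- 'micros' microseconds. The returned handle must stay open for as long as
-- the lock should be held; closing it releases the lock.
acquireWithTimeout :: Int -> FilePath -> IO (Maybe Handle)
acquireWithTimeout micros path = do
  h <- openFile path WriteMode
  res <- timeout micros (hLock h ExclusiveLock) `onException` hClose h
  case res of
    Just () -> return (Just h)
    Nothing -> hClose h >> return Nothing

-- | Run 'action' while holding the lock; 'Nothing' means we timed out.
withFileLock :: Int -> FilePath -> IO a -> IO (Maybe a)
withFileLock micros path action =
  bracket (acquireWithTimeout micros path) (mapM_ hClose) run
  where
    run Nothing  = return Nothing
    run (Just _) = Just <$> action

-- | Uncontended case: the lock is free, so the action runs.
demo :: IO (Maybe Int)
demo = withFileLock 2000000 "dblock-sketch.tmp" (return 42)
```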

Contributor

dcoutts commented Apr 6, 2020

IMHO, there's nothing wrong with using the same file for the magic id and the file lock. But if you do, certainly you must open the lock file, lock it and hold it open for the duration. Only when it's opened and locked can you try to read the content.

But it's also ok to separate these into different files.
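
As a rough sketch of the "open, lock, then read" ordering described above, assuming the locking functions from base and a marker stored on the first line of the file (withLockedMarker and the file name are hypothetical):

```haskell
import Control.Exception (bracket)
import GHC.IO.Handle.Lock (LockMode (ExclusiveLock), hLock)
import System.IO (IOMode (ReadWriteMode), hClose, hGetLine, hIsEOF, openFile)

-- | Open the file, take the lock, and only then read the marker. The handle
-- stays open (and the lock held) for the whole duration of 'action'.
withLockedMarker :: FilePath -> (String -> IO a) -> IO a
withLockedMarker path action =
  bracket (openFile path ReadWriteMode) hClose $ \h -> do
    hLock h ExclusiveLock
    eof <- hIsEOF h
    marker <- if eof then return "" else hGetLine h
    action marker

-- | Write a marker, then read it back while holding the lock.
demo :: IO String
demo = do
  writeFile "marker-sketch.tmp" "magic-42\n"
  withLockedMarker "marker-sketch.tmp" return
```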

Contributor

coot commented Apr 12, 2020

What I recently learned (while working on ghc-tags-plugin, which also uses the flock locking mechanism available in base) is that one also needs to fsync (or fdatasync) before closing the file handle. In a highly concurrent setting, I've seen another thread (system thread / process) fail to read content that was written but not yet fsync-ed. I am not sure if this is related in any way to ChainDB, though.

@kderme kderme force-pushed the kderme/lockfile branch 3 times, most recently from 2db434d to 76c93c1 Compare April 16, 2020 07:46
import Ouroboros.Consensus.Util.IOLike

-- We use an empty file as a lock of the db. Some systems may delete the empty
-- file when all its handles are closed. This is not an issue, since the file is
Contributor

Some systems may delete the empty file when all its handles are closed.

Does Windows do this?

Contributor Author

No, I haven't seen it on Windows or Linux, but it could happen.

mlock <- timeout (Time.secondsToDiffTime 2) $ wait a
case mlock of
  Nothing   -> throwM $ DbLocked lockFilePath
  Just lock -> finally action (unlockFile lock)
Contributor

There's some masking missing. The simplest approach is to write an acquireLock and a releaseLock function and use those in bracket; bracket will then take care of the masking for us.

Contributor Author

@mrBliss what do you think should be masked?

Contributor

What if an async exception is thrown after obtaining the lock and before finally? That's the thing with async exceptions, they can happen at any point.

Contributor Author

I'm not sure if we should mask while waiting on a timeout. Let me check.
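
To make the masking point in this thread concrete, here is a small sketch with a mock lock instead of a real file lock. MockLock, withMockLock and demo are invented for this example; the real code would use the lock type of whichever locking library is chosen. With bracket there is no window between a successful acquisition and the installation of the release handler in which an async exception could leak the lock, although a blocking acquisition such as takeMVar can still be interrupted while it waits.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, putMVar, takeMVar, tryTakeMVar)
import Control.Exception (IOException, bracket, throwIO, try)

-- | Stand-in for a file lock (assumption): a full 'MVar' means "unlocked".
newtype MockLock = MockLock (MVar ())

acquireLock :: MockLock -> IO ()
acquireLock (MockLock v) = takeMVar v

releaseLock :: MockLock -> IO ()
releaseLock (MockLock v) = putMVar v ()

-- | 'bracket' runs the acquisition with async exceptions masked and makes
-- sure the release runs whether 'action' returns or throws.
withMockLock :: MockLock -> IO a -> IO a
withMockLock l action =
  bracket (acquireLock l) (\_ -> releaseLock l) (\_ -> action)

-- | Run a failing action under the lock and report whether the lock was
-- released afterwards.
demo :: IO Bool
demo = do
  v <- newMVar ()
  _ <- try (withMockLock (MockLock v) (throwIO (userError "boom")))
         :: IO (Either IOException ())
  free <- tryTakeMVar v
  return (free == Just ())
```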

Comment on lines 154 to 155
-- tryL (withLock touchLock) >>=
-- (@?= (Left (DbLocked (dbPath </> T.unpack dbLockFile))))
Contributor

This test is commented out

Contributor Author

Yes, as I explain above the test: this test succeeds, but may leave a forgotten lock on the file, and this lock is only cleaned when all tests finish and the process dies. So probably we shouldn't include it?

Contributor

Yes, as I explain above the test: this test succeeds, but may leave a forgotten lock on the file, and this lock is only cleaned when all tests finish and the process dies. So probably we shouldn't include it?

But now we are not even testing whether the locking actually works.

this lock is only cleaned when all tests finish and the process dies

That's fine, or what's wrong with that?

Contributor Author

kderme commented Apr 22, 2020

I tested flock from base, but I couldn't make it work properly. It just wouldn't lock the files effectively and I didn't manage to get to the bottom of why this happened. So I used the filelock package. Also I preferred to use a different file from the DBMarker, since filelock suggests against accessing the locked file for other purposes.

@kderme kderme force-pushed the kderme/lockfile branch 2 times, most recently from 169d469 to 5732e3d Compare April 22, 2020 14:30
withLockDB hasFS dbPath action = do
    createDirectoryIfMissing hasFS True root
    -- We want to avoid blocking the main thread at an uninterruptible ffi, to
    -- avoid unresponsiveness to timeouts and ^C. So we use async and let a new
Contributor

Suggested change
- -- avoid unresponsiveness to timeouts and ^C. So we use async and let a new
+ -- avoid unresponsiveness to timeouts and ^C. So we use 'async' and let a new

-- avoid unresponsiveness to timeouts and ^C. So we use async and let a new
-- thread do the actual ffi call.
--
-- We shouldn't be tempted to use `withAsync`, which is usually mentioned
Contributor

Suggested change
- -- We shouldn't be tempted to use `withAsync`, which is usually mentioned
+ -- We shouldn't be tempted to use 'withAsync', which is usually mentioned

-- We shouldn't be tempted to use `withAsync`, which is usually mentioned
-- as a better alternative, or try to synchronously cancel the forked
-- thread during cleanup, since this would block the main thread and negate
-- the whole point of using async. We try our best to clean resources, but
Contributor

Suggested change
- -- the whole point of using async. We try our best to clean resources, but
+ -- the whole point of using 'async'. We try our best to clean up resources, but

Comment on lines 44 to 45
Nothing -> throwM $ DbLocked lockFilePath
Just lock -> unlockFile lock
Contributor

This pattern match should be done in the acquisition of the lock so that we throw that exception in case of a timeout, instead of passing Nothing to the action, which ignores it. This also means that cleanup will be just unlockFile.

Contributor Author

@mrBliss I'm not sure I understand. This pattern matches on whether we got a timeout or the async returned successfully, so it can't be performed by the async thread.

Contributor

When waiting on the async that calls lockFile times out, Nothing is returned. This Nothing is then passed to (\_ -> action), which ignores it. So we execute the action even when we don't have the lock. After the action is done, or when the action throws an exception, we go to cleanup. cleanup then throws a DbLocked exception in case it got Nothing, which is too late, as we have already executed the action.

What we should do is (not type-checked):

    createDirectoryIfMissing hasFS True root
    bracket acquireLock unlockFile (const action)
  where
    -- We want to avoid blocking ...
    acquireLock :: IO FileLock
    acquireLock = do
      lockFileAsync <- async (lockFile lockFilePath Exclusive)
      mbLock <- timeout (Time.secondsToDiffTime 2) $ wait lockFileAsync
      case mbLock of
        -- We timed out while waiting on the lock, the db is still locked
        Nothing   -> throwM $ DbLocked lockFilePath
        Just lock -> return lock

Contributor Author

oh right, fixed
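
For readers who want to try the shape of that fix without the async and filelock packages, here is a base-only sketch. It swaps async/wait for forkIO plus an MVar and takes the acquire and release actions as parameters, so acquireOrThrow, withLock and the local DbLocked type are stand-ins for this example, not the code that was merged. Two simplifications are assumed: a lock obtained by the helper thread after the timeout is not released, and an exception in the helper thread simply shows up as a timeout.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)
import Control.Exception (Exception, bracket, throwIO, try)
import System.Timeout (timeout)

-- | Local stand-in for the exception discussed in the thread.
newtype DbLocked = DbLocked FilePath deriving (Show, Eq)
instance Exception DbLocked

-- | Run a possibly blocking 'acquire' in a helper thread and wait for its
-- result for at most 'micros' microseconds. On timeout, throw 'DbLocked'
-- here, during acquisition, so the protected action never runs unlocked.
acquireOrThrow :: Int -> FilePath -> IO lock -> IO lock
acquireOrThrow micros path acquire = do
  result <- newEmptyMVar
  _ <- forkIO (acquire >>= putMVar result)
  mbLock <- timeout micros (readMVar result)
  case mbLock of
    Nothing   -> throwIO (DbLocked path)
    Just lock -> return lock

withLock :: Int -> FilePath -> IO lock -> (lock -> IO ()) -> IO a -> IO a
withLock micros path acquire release action =
  bracket (acquireOrThrow micros path acquire) release (const action)

demo :: IO (Either DbLocked String, Either DbLocked String)
demo = do
  -- Acquisition returns immediately: the action runs.
  ok <- try (withLock 200000 "db/lock" (return ()) return (return "ran"))
  -- Acquisition blocks far longer than the timeout: 'DbLocked' is thrown.
  busy <- try (withLock 50000 "db/lock" (threadDelay 5000000) return (return "ran"))
  return (ok, busy)
```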


-- | We use an empty file as a lock of the db so that the database cannot be
-- opened by more than one process. We wait up to 2 seconds to take the lock,
-- before timeout. Some systems may delete the empty file when all its handles
Contributor

@mrBliss mrBliss Apr 22, 2020

Suggested change
- -- before timeout. Some systems may delete the empty file when all its handles
+ -- before timing out and throwing a 'DbLocked' exception. Some systems may delete the empty file when all its handles

Contributor

mrBliss commented Apr 22, 2020

Also I preferred to use a different file from the DBMarker, since filelock suggests against accessing the locked file for other purposes.

Yeah, I agree, that's a good reason.

Contributor

mrBliss commented Apr 23, 2020

(I have temporarily added the commits from #1851 to this branch to do a final check on Windows. I won't merge these commits, because building/testing everything on Windows is too slow because of the lack of caching.)

@mrBliss mrBliss linked an issue Apr 23, 2020 that may be closed by this pull request
Contributor

mrBliss commented Apr 23, 2020

bors merge

iohk-bors bot added a commit that referenced this pull request Apr 23, 2020
1906: Change filelock behaviour r=mrBliss a=kderme

#1903


Co-authored-by: kderme <k.dermenz@gmail.com>
Co-authored-by: Thomas Winant <thomas@well-typed.com>
Contributor

iohk-bors bot commented Apr 23, 2020

Build failed

Contributor

mrBliss commented Apr 23, 2020

bors merge

iohk-bors bot added a commit that referenced this pull request Apr 23, 2020
1906: Change filelock behaviour r=mrBliss a=kderme

#1903


Co-authored-by: kderme <k.dermenz@gmail.com>
Co-authored-by: Thomas Winant <thomas@well-typed.com>
Contributor

iohk-bors bot commented Apr 23, 2020

Build failed

Contributor

mrBliss commented Apr 24, 2020

bors merge

iohk-bors bot added a commit that referenced this pull request Apr 24, 2020
1906: Change filelock behaviour r=mrBliss a=kderme

#1903


Co-authored-by: kderme <k.dermenz@gmail.com>
Co-authored-by: Thomas Winant <thomas@well-typed.com>
Contributor

iohk-bors bot commented Apr 24, 2020

Build failed

Contributor

@edsko edsko left a comment

Nice, @kderme good work on the FFI analysis!

I also really like how the new property looks now.

@mrBliss mrBliss force-pushed the kderme/lockfile branch 2 times, most recently from 071f557 to eab641c Compare April 24, 2020 11:11
Contributor

mrBliss commented Apr 24, 2020

bors merge

@mrBliss mrBliss linked an issue Apr 24, 2020 that may be closed by this pull request
Contributor

iohk-bors bot commented Apr 24, 2020

@iohk-bors iohk-bors bot merged commit 521286c into master Apr 24, 2020
@iohk-bors iohk-bors bot deleted the kderme/lockfile branch April 24, 2020 11:23

Development

Successfully merging this pull request may close these issues.

Test file locking with lazy release
Rapid cardano-node restart causes node to crash due to file locking

5 participants