
rclone mount: race condition with multiple RWFileHandle's #2034

Closed
B4dM4n opened this issue Jan 30, 2018 · 5 comments
Comments

@B4dM4n
Collaborator

B4dM4n commented Jan 30, 2018

Following the forum thread Extracting to ACD mount possible? I'm now creating this issue.

I found a race condition involving multiple open *vfs.RWFileHandles for one file.

This is a log containing a failed upload: rclone_extract_ptr_empty.log. The state after this log is the following:

  • Test/JD/rand4/rand.bin size=0 mtime=2018-01-29 21:25:18.826 +0100 (mtime of the source)
  • Test/JD/rand4/rand.bin size=0 mtime=2018-01-30 14:46:04.??? +0100 (mtime of upload time)

I assume the problem is *vfs.RWFileHandle.close() not being synchronized between multiple handles of the same file.
The log file above shows two close() calls executing in parallel.

The first RWFileHandle belongs to a new empty file; the second one is the handle being written to. During the close() of the first handle, the empty cache file gets Moved to the backend. After the move is completed, the cache file gets deleted.
The second close() then tries to move the cache file, which is already gone.

This could also be related to #1181.

I think close() should check whether there is another RWFileHandle open for its file. Hopefully this fixes the race condition.

What is your rclone version (eg output from rclone -V)

rclone v1.39-096-gcf6d522d

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Linux 4.14.13-1-ARCH #1 SMP PREEMPT Wed Jan 10 11:14:50 UTC 2018 x86_64 GNU/Linux

Which cloud storage system are you using? (eg Google Drive)

Google Drive

The command you were trying to run (eg rclone copy /tmp remote:tmp)

rclone mount -vv --dir-cache-time 24h --vfs-cache-mode writes TestDrive:Test/JD /mnt/JD/extracted

@ncw
Member

ncw commented Jan 31, 2018

I have to say I've been kind of ignoring the multiple file opens on the same file problem and hoping it will never happen :-)

I think close() should check whether there is another RWFileHandle open for its file. Hopefully this fixes the race condition.

OK, so assume we put a mutex in File for the use of Handles.

We should probably use it in Open and close()

If close() decided it didn't need to transfer the file then it should exit as it currently does.

However if it did need to transfer the file and there were other writers then it should set a pending transfer flag in the File object and exit for the next close() to check.

How does something like that sound?

@B4dM4n
Collaborator Author

B4dM4n commented Feb 1, 2018

I have to say I've been kind of ignoring the multiple file opens on the same file problem and hoping it will never happen :-)

You've never heard of Murphy's law, have you? 😆

We should probably use it in Open and close()

I'm not sure whether open needs to be protected by the same mutex. Only file.addWriter and file.delWriter should need to be synchronized.

This is a list of functions I think need to be synchronized:

  • newRWFileHandle: synchronized via cache.open + addWriter via file.muWriter
  • *RWFileHandle.openPending: synchronized via file.muOpen
  • *RWFileHandle.close(upload phase): synchronized via file.muClose + delWriter via file.muWriter
  • *RWFileHandle.close(remove phase): synchronized via cache.close

And now the part how this works in my head 😄

openPending needs to be synchronized with other openPending calls on the same file, so only one copy gets downloaded. An open with O_TRUNC could cancel a running copy and clear the cache file.

RWFileHandle.close is mostly run with file.muClose held. file.delWriter is called to determine whether this handle was the last writer; when it was, an operations.CopyFile is done to upload the cache file. Then file.muClose is released and cache.close is called, which removes the cache file when no reader is present.

It's pretty late here, so I can't guarantee this locking scheme is complete. I will look at it again tomorrow.

@ncw
Member

ncw commented Feb 1, 2018

openPending needs to be synchronized to other openPending of the same file, so only one copy gets downloaded.

I think openPending should probably be synchronized to close also, so file.muOpen and file.muClose above should be the same mutex, otherwise won't we have the problem where the file is being both downloaded and uploaded at the same time?

An open with O_TRUNC could cancel a running copy and clear the cache file.

Do you see there being a separate download phase which is running without the file.muOpen?

(Aside: one thing I'd like to do at some point is run the download in the background in openPending to allow the file to be accessed before it is completely downloaded, and just block on operations which require bits of the file which aren't downloaded yet. That is a whole heap of extra complication I suggest we don't attempt yet!)

RWFileHandle.close is mostly run with file.muClose held. file.delWriter is called to determine whether this handle was the last writer and when this is true do a operations.CopyFile to upload the cache file.

Though note not all the close() operations do transfer the file.

It's pretty late here, so I can't guarantee this locking scheme is complete. I will look at it again tomorrow.

I think it is an excellent start :-)

I think for the next evolution we should try to keep the changes as minimal as possible even if it is less than perfectly efficient - that has been my strategy in the VFS layer, correctness over efficiency. Not always achieved though!

@B4dM4n
Collaborator Author

B4dM4n commented Feb 1, 2018

I think openPending should probably be synchronized to close also, so file.muOpen and file.muClose above should be the same mutex, otherwise won't we have the problem where the file is being both downloaded and uploaded at the same time?

A download is only done once per cache file (this is how it should work) and close is doing an openPending. This way only one can happen at the same time.

I want to make sure open and close use separate locks to allow the cache file to be opened while being uploaded.

An open with O_TRUNC could cancel a running copy and clear the cache file.

Do you see there being a separate download phase which is running without the file.muOpen?

This was just an idea which could speed up the open process in case the file gets opened with O_TRUNC while another handle is downloading it.

Though note not all the close() operations do transfer the file.

Right, only when the last writer is closing and the cache file has been modified.

The modified flag could be passed via the delWriter call like this:

func delWriter(h Handle, modified bool) (lastWriterAndModified bool)

I think for the next evolution we should try to keep the changes as minimal as possible even if it is less than perfectly efficient - that has been my strategy in the VFS layer, correctness over efficiency. Not always achieved though!

You are right, the vfs layer is pretty complicated. As it mostly works great, we should be careful not to break any existing use case.

I will start working today on the changes we have outlined so far.

@ncw
Member

ncw commented Feb 1, 2018

A download is only done once per cache file (this is how it should work) and close is doing an openPending. This way only one can happen at the same time.

OK

I want to make sure open and close use separate locks to allow the cache file to be opened while being uploaded.

This was just an idea which could speed up the open process in case the file gets opened with O_TRUNC while another handle is downloading it.

Oh I see... You'd have to block if it wasn't opened with truncate though. I guess if it's opened read-only you are OK, but otherwise you have to block until the upload has finished?

Getting complicated :-)

The modified flag could be passed via the delWriter call like this:

Looks perfect.

You are right, the vfs layer is pretty complicated. As it is mostly working great we should be careful to not break any existing use case.

There are some exhaustive tests for the RW handles; however, there aren't any tests for concurrent opens, so it might be worth writing some of those first, seeing them break, then fixing them!

I will start working today on the changes we have outlined so far.

Excellent :-)
