
Async file operations #24509

Closed
PVince81 opened this issue May 9, 2016 · 17 comments

Comments

@PVince81
Contributor

PVince81 commented May 9, 2016

Some file operations, especially deleting from external storage, can take a long time because the files need to be downloaded into the trash bin in the background.

Also, there were already talks about an async PUT: #12097

So I'm opening this ticket to discuss the possibility of having asynchronous file operations.
The good part is that we already have file locking, so it might be possible to leverage it to avoid concurrency issues.

Also, we need to make sure we stay compatible with Webdav, so async operations would have to be custom Webdav extensions.

@DeepDiver1975

@PVince81
Contributor Author

Another reason: when someone is syncing a folder and there are big files to be downloaded, another user will not be able to delete or rename that folder due to locking, which causes bad UX. See #21574

Maybe some kind of queue of file operations could help.

CC @dragotin

@PVince81
Contributor Author

I see more and more reports of people having discrepancies due to PHP timeouts.

Unfortunately we can't roll back a FS change from a killed PHP process. However, if we had some kind of journal or operations queue, it should be possible to either redo or roll back the last operation. This all fits well with the "async file operation" concept.
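
A minimal sketch of what such a journal could look like, assuming a hypothetical oc_file_operations table and plain PDO (the real thing would go through the DB abstraction); a repair job run from cron would then redo or roll back anything still left in "pending":

```php
<?php
// Sketch only: journal a long-running file operation so that a cron/repair
// job can redo or roll back entries left behind by a killed PHP process.
// The oc_file_operations table and its columns are made up for illustration.

function journalStart(PDO $db, string $type, string $source, string $target): int {
    $stmt = $db->prepare(
        'INSERT INTO oc_file_operations (type, source, target, status, updated_at)
         VALUES (:type, :source, :target, :status, :now)'
    );
    $stmt->execute([
        'type'   => $type,      // e.g. "move", "delete"
        'source' => $source,
        'target' => $target,
        'status' => 'pending',
        'now'    => time(),
    ]);
    return (int)$db->lastInsertId();
}

function journalFinish(PDO $db, int $id): void {
    $db->prepare('UPDATE oc_file_operations SET status = :s WHERE id = :id')
       ->execute(['s' => 'done', 'id' => $id]);
}

// Cron/repair job: anything still "pending" after a grace period belongs to a
// request that timed out or was killed, and can be redone or rolled back.
function journalFindStale(PDO $db, int $graceSeconds = 3600): array {
    $stmt = $db->prepare(
        'SELECT * FROM oc_file_operations WHERE status = :s AND updated_at < :cutoff'
    );
    $stmt->execute(['s' => 'pending', 'cutoff' => time() - $graceSeconds]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Writing the journal row before touching the storage is what would make this work: a killed process leaves a "pending" row behind instead of a silently half-finished operation.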

CC @butonic

@PVince81
Contributor Author

If one day we ever go the "Webdav sync" route, that one will need a table containing all changes.
Coincidentally, if we had async file operations we would need such a table too. Maybe the table would get pre-populated with "pending" changes, which could then be reverted in the event of a PHP timeout.

@DeepDiver1975 @butonic

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

Some further ideas:

  1. The most extreme: get rid of long-running operations. This means no more recursive MOVE or DELETE; clients would be expected to do a MOVE or DELETE for every single file. This puts the burden on the client devs and also on the network. Not ideal, but it could work.

  2. Similar to 1) but find a way to execute such operations in batches. Likely too complicated, as every client would need to implement it.

  3. Set the size of the parent folder to "-1" when doing long-running operations. This way, if the PHP process is killed, the background scanner will find these entries and rescan them in the background (see the sketch below).
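
A rough sketch of idea 3, using direct SQL against the filecache only for illustration (real code would go through the Cache API) and assuming, if I remember correctly, that a negative size is already what the scanner treats as "unknown, rescan me":

```php
<?php
// Sketch of idea 3: mark the parent folder as "unscanned" before starting a
// long-running operation, so that a killed process leaves the tree in a state
// the background scanner will pick up and repair. Direct SQL is used only for
// illustration; real code would go through the filecache/Cache API.

function markParentForRescan(PDO $db, int $storageId, string $parentPath): void {
    $stmt = $db->prepare(
        'UPDATE oc_filecache SET size = -1
         WHERE storage = :storage AND path_hash = :hash'
    );
    $stmt->execute([
        'storage' => $storageId,
        'hash'    => md5($parentPath), // the filecache keys entries by md5(path)
    ]);
}

// markParentForRescan($db, $storageId, 'files/test');
// ... perform the long-running MOVE/DELETE ...
// The scanner recalculates the folder size afterwards either way.
```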

@jvillafanez

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

or well, get rid of the filecache...

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

  1. get rid of the "path" column, because that's what actually needs adjusting and takes a long time: https://github.com/owncloud/core/blob/v9.1.4/lib/private/Files/Cache/Cache.php#L521

I tried renaming "test" to "test2" with a lot of children inside. In theory it's only about renaming "test" to "test2" without touching any children; even the file ids stay the same.
Still, the code iterates over all these entries to adjust the "path" and "path_hash" columns.
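
For reference, a hedged sketch of what a single-statement version of that child update could look like on MySQL/MariaDB; needing to recompute path_hash per row and to stay portable across DB backends is presumably why the real Cache code iterates in PHP instead:

```php
<?php
// Sketch: move "files/test" to "files/test2" in the filecache with one UPDATE,
// letting the database rewrite the path prefix and recompute path_hash.
// MySQL/MariaDB syntax (MD5, CONCAT, SUBSTRING); other backends differ.

function moveCacheSubtree(PDO $db, int $storageId, string $oldPath, string $newPath): void {
    // Assumes $oldPath contains no LIKE wildcards; parameters are not repeated
    // because native prepared statements would reject reused names.
    $cut = strlen($oldPath) + 1; // 1-based offset just past the old prefix
    $stmt = $db->prepare(
        'UPDATE oc_filecache
         SET path = CONCAT(:new1, SUBSTRING(path, :cut1)),
             path_hash = MD5(CONCAT(:new2, SUBSTRING(path, :cut2)))
         WHERE storage = :storage
           AND (path = :old OR path LIKE :oldPrefix)'
    );
    $stmt->execute([
        'new1' => $newPath, 'new2' => $newPath,
        'cut1' => $cut,     'cut2' => $cut,
        'storage'   => $storageId,
        'old'       => $oldPath,
        'oldPrefix' => $oldPath . '/%',
    ]);
}
```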

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

Maybe we do need closure tables to get rid of the "path" column: #4209.

While closure tables might not increase regular read speed, if they can help solve the timeout issues on long-running MOVE or DELETE then they might be worth it. Data loss 🔔
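
To make the idea concrete, a minimal closure-table sketch with made-up table and column names; the point is that no child row stores the parent's path, so a rename touches exactly one row:

```php
<?php
// Hypothetical closure-table layout (all names made up for illustration):
//   oc_filecache_tree(ancestor INT, descendant INT, depth INT)
// Each filecache row then only stores its own name; full paths are derived
// from the tree table, so no child row ever contains the parent's path.

// All descendants of a folder (what a recursive DELETE would need to touch):
function getDescendants(PDO $db, int $folderId): array {
    $stmt = $db->prepare(
        'SELECT descendant FROM oc_filecache_tree
         WHERE ancestor = :id AND depth > 0'
    );
    $stmt->execute(['id' => $folderId]);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Renaming "test" to "test2" becomes a single-row update on the entry itself;
// no children need touching because none of them stores "test" in a path.
function renameEntry(PDO $db, int $fileId, string $newName): void {
    $db->prepare('UPDATE oc_filecache SET name = :name WHERE fileid = :id')
       ->execute(['name' => $newName, 'id' => $fileId]);
}
```

A MOVE to a different parent would still have to rewrite the subtree's ancestor links in the tree table, but a plain rename like "test" → "test2" becomes a single-row change.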

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

A good read, but probably not useful as it will likely not work on shared hosters: http://symcbean.blogspot.de/2010/02/php-and-long-running-processes.html

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

If we do make a request async (like DELETE or MOVE), we could use this approach: http://restcookbook.com/Resources/asynchroneous-operations/

But I'm not sure how standard Webdav clients would react... Or we'd need to optimistically tell them that we succeeded even though we only queued the request.
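
A rough sketch of that cookbook pattern in plain PHP, not tied to Sabre's plugin API; the endpoint paths and the queue helpers are made up:

```php
<?php
// Sketch of the 202 Accepted pattern for a queued DELETE. The poll URL and
// the enqueueOperation()/loadOperation() helpers are hypothetical; a real
// implementation would hook into the Sabre request cycle instead.

// --- answering the DELETE ---
$operationId = enqueueOperation('delete', $_SERVER['REQUEST_URI']); // hypothetical helper

http_response_code(202); // Accepted: we only queued the work
header('Location: /index.php/apps/files/api/operations/' . $operationId); // made-up poll URL
header('Content-Type: application/json');
echo json_encode(['status' => 'queued', 'id' => $operationId]);

// --- answering a poll on the returned URL ---
// $op = loadOperation($operationId);                  // hypothetical helper
// if ($op['status'] !== 'done') {
//     http_response_code(200);
//     header('Retry-After: 2');
//     echo json_encode(['status' => $op['status']]);  // "queued" / "running"
// } else {
//     http_response_code(303);                        // See Other: point at the final result
//     header('Location: ' . $op['resultUrl']);
// }
```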

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

Oh oh, looks like 202 might be acceptable, see https://msdn.microsoft.com/en-us/library/aa142865(v=exchg.65).aspx which says that it could be used for DELETE.

@PVince81
Contributor Author

PVince81 commented Apr 3, 2017

I hacked Sabre locally for a quick test:

  • cadaver accepts 202 for DELETE and MOVE and considers it a success
  • dolphin's Webdav client, however, says "An unexpected error (202) occurred..."

@butonic
Member

butonic commented Sep 25, 2017

Large file uploads also require this. The assembly step can take a long time: not only do the file chunks need to be assembled, but the antivirus scan or any other postprocessing will also kick in. IMO we should report the upload as completed and mark the file as 'in postprocessing', probably even exposing this in the web interface. A PROPFIND would still be able to get the metadata, but actually accessing the file should cause a 403 Forbidden together with a Retry-After header?

Marking a file as 'in postprocessing' may lead to a new lifetime column, e.g. to also mark files as deleted. Hm, what do we have: receiving chunks, assembling the file, antivirus scan, content extraction (for workflow), indexing (for search), thumbnail generation, deleted. Those can roughly be separated into where the file is stored and what is done with the content. In that light, for federated shares a status like 'cached locally' would make sense. But I don't know if it makes sense to fit all these into a single column. It does make sense to have a common pipeline for files that applications can then hook into ... hm, need to think on this further.
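
A sketch of how the 'in postprocessing' part could look, with hypothetical state names; PROPFIND keeps serving the metadata, while a download during postprocessing gets the 403 plus Retry-After suggested above:

```php
<?php
// Sketch only: hypothetical lifecycle states for an uploaded file and the
// download-time check described above. State names and storage are made up.

class FileLifecycle {
    const RECEIVING_CHUNKS = 'receiving-chunks';
    const ASSEMBLING       = 'assembling';
    const POSTPROCESSING   = 'postprocessing'; // antivirus, indexing, thumbnails, ...
    const AVAILABLE        = 'available';
    const DELETED          = 'deleted';
}

function handleDownload(string $state): void {
    if ($state !== FileLifecycle::AVAILABLE) {
        // PROPFIND still sees the metadata, but the content is not ready yet.
        http_response_code(403);
        header('Retry-After: 5');
        echo 'File is still being processed, please retry later';
        return;
    }
    // ... stream the file content as usual ...
}
```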


@jvillafanez
Member

I think the key point here is to have two different processes:

  1. The user makes a request and starts process A
  2. Process A spawns process B to perform the upload / download / whatever
  3. Process A returns something to the user while process B is still running
  4. The user can check the status of process B at some point

If I remember correctly, there is a trick we can use to spawn process B in an async way without CLI access, although I don't remember whether there are caveats to take into account.
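
If the trick meant here is the usual PHP-FPM one, it looks roughly like this: fastcgi_finish_request() flushes the response to the user while the same process keeps running (it is not available on every SAPI, which might be the caveat):

```php
<?php
// Sketch of spawning "process B" without CLI access under PHP-FPM: send the
// response first, then keep working in the very same process.

ignore_user_abort(true);   // keep running even if the client disconnects
set_time_limit(0);         // long-running work (FPM's own timeouts still apply)

http_response_code(202);
header('Content-Type: application/json');
echo json_encode(['status' => 'queued']);

if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request(); // the client gets its response at this point
}

// The user is gone now; continue with the slow part.
// performLongRunningOperation(); // hypothetical helper
```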

Taking into account what we have, we'll need to expose at least one additional endpoint for each async operation we want, plus at least one more endpoint to check the operation status. For example, for uploads we'd have the sync upload (we can use whatever we're doing right now and the same endpoint) and the async upload, which will trigger the sync one at some point. We'll need additional columns / tables in the DB to track the status of the sync operations, so we can poll for changes and check when the sync operation is finished.

Although this doesn't seem too intrusive, we'll need to take into account that the sync operation needs to report its status somehow so that users can check it periodically.
In addition, we'll need to:

  • Check how we can integrate (if possible) these new async endpoints in Webdav, and define what these endpoints should return.
  • Find a place to store the operation status. To be decided whether it will be a column in the filecache table, a new table for all the async operations, or some other place.

Note that these endpoints don't need to rely on Webdav, so worst case we can use these async operations ourselves even though third-party software would still use the sync ones through Webdav.
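
For the status-check endpoint, something as small as this sketch would do, reusing the hypothetical oc_file_operations table from the journal sketch above:

```php
<?php
// Sketch of the status-check endpoint: the client polls with an operation id
// and gets the current state back. Reuses the hypothetical oc_file_operations
// table from the journal sketch above.

function getOperationStatus(PDO $db, int $operationId): array {
    $stmt = $db->prepare(
        'SELECT id, type, status, updated_at
         FROM oc_file_operations WHERE id = :id'
    );
    $stmt->execute(['id' => $operationId]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row ?: ['id' => $operationId, 'status' => 'unknown'];
}

// Example endpoint body:
// header('Content-Type: application/json');
// echo json_encode(getOperationStatus($db, (int)($_GET['id'] ?? 0)));
```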

@guruz
Contributor

guruz commented Oct 2, 2017

@butonic FYI, we had asynchronous PUT implemented in a private fork of the client and server some time ago. It worked by returning a "poll URL" (as a header) from the PUT of the latest chunk. The client would (after having uploaded all chunks) check that poll URL every few seconds to see if the file had been uploaded to the backend.

Contact @ogoffart or me if you want more info and/or sources.

@DeepDiver1975
Member

All these ideas are pointless as long as we have no active job execution mechanism in place.

@PVince81
Contributor Author

PVince81 commented Oct 9, 2017

There is also an additional challenge: while blocking access to a pending file is one thing, what happens with external storage?

It seems that we would need to first upload the pending file to some local, invisible temporary space where it is assembled/virus-scanned, etc., and then upload it to the final storage. But that would cause delays.

Or upload it as a temporary part file like we already do; part files are invisible to the clients.
We would then only need to have the part file stay around at the end of the upload and have background processes run on it. For this we need to track uploads and part files.
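
A sketch of that part-file variant with hypothetical table and helper names: keep the .part file after the last chunk, record it, and let a background job finalize it once postprocessing succeeds:

```php
<?php
// Sketch: keep the uploaded ".part" file after the final chunk, record it,
// and let a background job move it into place once postprocessing succeeded.
// Table and helper names are hypothetical.

function registerPendingUpload(PDO $db, string $partPath, string $targetPath): void {
    $db->prepare(
        'INSERT INTO oc_pending_uploads (part_path, target_path, status, updated_at)
         VALUES (:part, :target, :status, :now)'
    )->execute([
        'part'   => $partPath,    // e.g. "files/foo.txt.ocTransferId123.part"
        'target' => $targetPath,  // e.g. "files/foo.txt"
        'status' => 'postprocessing',
        'now'    => time(),
    ]);
}

// Background job, once the antivirus scan etc. have passed:
function finalizeUpload(PDO $db, array $upload): void {
    // rename() works for local storage; external storages would need to go
    // through the Storage API instead.
    rename($upload['part_path'], $upload['target_path']);
    $db->prepare('UPDATE oc_pending_uploads SET status = :s WHERE id = :id')
       ->execute(['s' => 'done', 'id' => $upload['id']]);
}
```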

@PVince81
Contributor Author

We could also use "part folders" for some operations: #13756
