Data loss on rename of a 49 GB folder #13391

Closed
MorrisJobke opened this issue Jan 15, 2015 · 71 comments · Fixed by #17017

Comments

@MorrisJobke
Contributor

I accidentally renamed a folder on my production instance:

  • the folder contained nearly 20k files with a total size of 48.7 GB
  • I renamed the folder in the web UI
  • shortly (2-5 seconds) after I noticed this, I shut down the client to avoid bigger trouble (this folder was set up as a synced folder inside the client)
  • the spinner spun forever

Notes

  • I have a database dump here
  • I have all apache logs here
  • I have the owncloud.log here
  • I have a filesystem snapshot

Anyone who wants to help me dig through the debris is welcome.

Access log:

The rename:

127.0.0.1 - - [15/Jan/2015:13:32:39 +0100] "GET /index.php/apps/files/ajax/rename.php?dir=%2F&newname=Bildersd&file=Bilder HTTP/1.1" 503 1216

The access log filtered for the folder Bilder:

...
127.0.0.1 - - [15/Jan/2015:13:24:36 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:25:04 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:25:34 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:26:06 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:26:39 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 16662
127.0.0.1 - - [15/Jan/2015:13:27:36 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:28:06 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:28:36 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:29:06 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:29:36 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:30:05 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:30:34 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:31:04 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:31:36 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 411
127.0.0.1 - - [15/Jan/2015:13:32:09 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 16662
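For anyone digging through similar logs, the filtering above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the Apache log format shown in the excerpt, and the names `LOG_LINE` and `summarize` are made up for this example.

```python
import re
from collections import Counter

# Log format as it appears in the excerpt above:
# host ident user [timestamp] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})(?: (?P<size>\d+))?'
)

def summarize(lines, needle):
    """Count (method, status) pairs for requests whose path contains `needle`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("path"):
            counts[(match.group("method"), int(match.group("status")))] += 1
    return counts

# Two lines copied from the log excerpts in this report.
sample = [
    '127.0.0.1 - - [15/Jan/2015:13:32:09 +0100] "PROPFIND /remote.php/webdav/Bilder HTTP/1.1" 207 16662',
    '127.0.0.1 - - [15/Jan/2015:13:32:39 +0100] "GET /index.php/apps/files/ajax/rename.php'
    '?dir=%2F&newname=Bildersd&file=Bilder HTTP/1.1" 503 1216',
]
result = summarize(sample, "Bilder")
```

Grouping by status code makes an outlier such as the single 503 on the rename request stand out among the routine 207 responses.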

Nothing special in php-fpm.log or apache error log.

The folder was successfully renamed (in the database and on the filesystem), but all database entries are gone (the files are still there on the filesystem). Only a forced rescan with the occ command-line tool was able to get them back into the database. Browsing the folder in the web UI didn't trigger an update of the file cache.

! A user without admin rights cannot get the data back from the server. It is simply not shown.

I will try to investigate further and try to reproduce.

cc @karlitschek @DeepDiver1975 FYI: this could become a showstopper soon. I opened this ticket to document my process.

@MorrisJobke MorrisJobke self-assigned this Jan 15, 2015
@MorrisJobke
Contributor Author

To clarify: the rename happened through the web UI

@PVince81
Contributor

> 127.0.0.1 - - [15/Jan/2015:13:32:39 +0100] "GET /index.php/apps/files/ajax/rename.php?dir=%2F&newname=Bildersd&file=Bilder HTTP/1.1" 503

503 ? That's "service unavailable" and shouldn't even trigger an actual rename.

@PVince81
Contributor

You said you didn't have any other storages than home:: and the root, so this excludes the case of an unavailable external storage.

@MorrisJobke
Contributor Author

@PVince81 Yes. There is no external storage.

Regarding the 503: it's the only rename request, and the folder definitely gets renamed. Could that be caused by the timeout?

@PVince81
Contributor

Not sure. Do you think php-fpm would decide to send 503 by itself when a timeout occurs ?

If the server was not available / maintenance mode, our Sabre plugin should kick in very early and prevent any file operations.

@PVince81
Contributor

You could check the owncloud.log from around the time it happened (mind the timezone/utc differences)

@MorrisJobke
Contributor Author

I noticed that a rename causes a lot of database queries (600 for a folder with 507 files in it), because it needs to update the path of every element. I guess this caused the timeout, and PHP-FPM kills the process once the timeout is hit.

https://blackfire.io/profiles/a33715d9-0191-4f9a-ad4a-2f3166d71584/graph
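To illustrate why the number of queries grows with the folder size, here is a minimal sketch using Python's `sqlite3`. The `filecache` table is a heavily simplified, hypothetical stand-in (the real table has more columns and indexes), and neither function is ownCloud's actual code. It contrasts one `UPDATE` per row with a single prefix-rewriting `UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified file cache: one row per file or folder, full path stored.
conn.execute("CREATE TABLE filecache (fileid INTEGER PRIMARY KEY, path TEXT UNIQUE)")
rows = [("Bilder",)] + [(f"Bilder/img{i}.jpg",) for i in range(507)]
conn.executemany("INSERT INTO filecache (path) VALUES (?)", rows)

# Matches the folder itself and everything below it, but not siblings like "Bilder2".
MATCH = "path = :old OR substr(path, 1, length(:old) + 1) = :old || '/'"

def rename_row_by_row(conn, old, new):
    """One UPDATE per affected row; returns the number of statements issued."""
    statements = 1  # the SELECT below
    affected = conn.execute(
        f"SELECT fileid, path FROM filecache WHERE {MATCH}", {"old": old}
    ).fetchall()
    for fileid, path in affected:
        conn.execute(
            "UPDATE filecache SET path = ? WHERE fileid = ?",
            (new + path[len(old):], fileid),
        )
        statements += 1
    return statements

def rename_bulk(conn, old, new):
    """A single UPDATE that rewrites the path prefix for all affected rows."""
    conn.execute(
        f"UPDATE filecache SET path = :new || substr(path, length(:old) + 1) WHERE {MATCH}",
        {"old": old, "new": new},
    )
    return 1

# `with conn:` commits on success and rolls back if an exception is raised.
with conn:
    per_row = rename_row_by_row(conn, "Bilder", "Bildersd")
with conn:
    bulk = rename_bulk(conn, "Bildersd", "Bilder")
```

In this toy setup the row-by-row variant issues one statement per child plus two, while the bulk variant issues one statement regardless of folder size. Whether a single statement is fast enough on a production database depends on indexes and the database engine, which this sketch does not model.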

@MorrisJobke
Contributor Author

@icewind1991 What is the reason to store the full path? Isn't knowing the parent enough to generate the full path?

@MorrisJobke
Contributor Author

@DeepDiver1975 @karlitschek I would rate this a bit higher. I talked to @icewind1991 and he would like to come up with a partial improvement, but reducing the load (especially the SQL queries) the way it was done for the delete operation isn't possible for 8.0 (#13394).

On the one hand, this would require bigger changes; on the other hand, the current behavior will cause critical problems (and even data loss) when renaming folders with many children.

Is this rated a showstopper or not?

@icewind1991
Contributor

> @icewind1991 What is the reason to store the full path? Isn't knowing the parent enough to generate the full path?

Yes, but most queries we do are by path

@MorrisJobke
Contributor Author

@icewind1991 I guess it scales better if you simply traverse the file tree. And you can cache this too. The current approach doesn't scale in any direction. :(
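The trade-off being discussed can be sketched as follows. This is a toy in-memory model with hypothetical names (`Node`, `full_path`), not ownCloud's data model: each entry stores only its own name and its parent's id, and the full path is derived by walking up the tree.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    name: str
    parent: Optional[int]  # fileid of the parent, None for the root

def full_path(nodes: Dict[int, Node], fileid: int) -> str:
    """Derive the full path by following parent pointers up to the root."""
    parts = []
    current: Optional[int] = fileid
    while current is not None:
        node = nodes[current]
        parts.append(node.name)
        current = node.parent
    return "/".join(reversed(parts))

nodes = {
    1: Node("files", None),
    2: Node("Bilder", 1),
    3: Node("2014", 2),
    4: Node("img001.jpg", 3),
}

before = full_path(nodes, 4)
nodes[2].name = "Bildersd"  # the rename touches exactly one record
after = full_path(nodes, 4)
```

With this layout a rename is a single write no matter how many children the folder has. The cost moves to reads: resolving a path to an entry needs one lookup per path segment, which matters if most queries are by path.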

@PVince81
Contributor

I still don't understand why renaming a simple folder in place could run into a timeout (not even moving it to another location)
Was that folder shared with many people ?

@MorrisJobke
Contributor Author

@PVince81 No. Have a look at the path column. It contains the full path, and this needs to be updated for every child element. With many children, this can require a huge amount of processing.

@PVince81
Contributor

Ah right... the DB update :-/

@PVince81
Contributor

PVince81 commented Feb 6, 2015

A ticket should be either technical debt or a bug.
I'd rather call this a bug. It might be caused by legacy code, but it is still a bug.

@PVince81
Contributor

PVince81 commented Feb 6, 2015

@MorrisJobke have you been able to find any more clues ?

@MorrisJobke
Contributor Author

@PVince81 It's simply the massive amount of DB updates: the executing process gets killed before it can finish this task. Nothing we can change for now :(

@christianrj

This is really a showstopper bug (as you can see in #10711). Our manager is considering whether to stop using ownCloud because of all these rename and sync problems that never get fixed. I hope that you can fix all of these problems, because for us right now, oC can't be used in production. Thanks!

@PVince81 PVince81 added this to the 8.1-next milestone Feb 6, 2015
@PVince81
Contributor

PVince81 commented Feb 6, 2015

@DeepDiver1975 I've set this to 8.1, this should definitely be looked into.

It might take some time to debug because this bug is difficult to reproduce consistently.

I suspect that the part that handles renames will need to be rewritten to use a different approach, either by using part folders #13756 or updating the cache for each file one by one, as proposed here #13775 instead of doing a bulk update at the end.

@MorrisJobke
Contributor Author

> It might take some time to debug because this bug is difficult to reproduce consistently.

? You can't reproduce this? But the rename takes ages for you too, doesn't it?

@PVince81
Contributor

PVince81 commented Feb 6, 2015

The few times I tried I couldn't reproduce the issue. At least in my case there was no data loss / deletion from the sync client.

@PVince81
Contributor

PVince81 commented Feb 6, 2015

Either the rename operation needs to take longer than one sync cycle, which means the sync client would try and access an inconsistent DB state. Or the rename must run into a PHP timeout where the PHP process gets killed (php-fpm case)

Maybe case 1 can be simulated by adding a few sleep() operations in the code to slow down renaming.
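A toy model of case 1, to make "inconsistent state" concrete. This is not how the ownCloud rename is implemented: the dictionary stands in for the file cache, and two `threading.Event` objects replace the `sleep()` calls so the interleaving is deterministic. All names are made up for the sketch.

```python
import threading

cache = {"Bilder/a.jpg": 1, "Bilder/b.jpg": 2}  # path -> fileid, toy file cache
entries_removed = threading.Event()
reader_done = threading.Event()

def slow_rename(old, new):
    # Step 1: take the old entries out of the cache.
    moved = {p: cache.pop(p) for p in list(cache) if p.startswith(old + "/")}
    entries_removed.set()
    # Step 2: the slow part of the rename; here we simply wait for the reader.
    reader_done.wait(timeout=5)
    # Step 3: write the entries back under the new prefix.
    for path, fileid in moved.items():
        cache[new + path[len(old):]] = fileid

def sync_cycle(seen):
    # A sync cycle that happens to run between step 1 and step 3.
    entries_removed.wait(timeout=5)
    seen.extend(sorted(cache))
    reader_done.set()

seen = []
renamer = threading.Thread(target=slow_rename, args=("Bilder", "Bildersd"))
reader = threading.Thread(target=sync_cycle, args=(seen,))
renamer.start()
reader.start()
renamer.join()
reader.join()
```

The reader observes an empty listing even though the rename eventually completes and the cache ends up correct. A client acting on that intermediate view could conclude the files were deleted.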

@PVince81
Contributor

@icewind1991 it didn't work, still happening.

How about the hasUpdated approach you suggested ?

@PVince81
Contributor

Work in progress here #16963, searching for alternative approaches to lock the cache/scanner

@PVince81
Contributor

These two PRs together #17017 and #16963 make the problem disappear (with locking enabled)

@MorrisJobke
Contributor Author

@PVince81 @icewind1991 Thanks for this! You all rock :)

@beejee

beejee commented Jun 22, 2015

Hi,

can it be that #15702 is related to this?
I can test any changes/fixes you require me to execute to get you any feedback on this.

Thanks!

@PVince81
Contributor

Not necessarily. This ticket is about files randomly disappearing; it does not happen consistently.
The ticket you linked is about SWIFT.

@PVince81
Contributor

If you have a test instance where you can test 8.1, you could enable file locking, see https://doc.owncloud.org/server/8.1/admin_manual/configuration_files/files_locking_experimental.html

@beejee

beejee commented Jun 22, 2015

Ah indeed, I responded too fast and only noticed the difference afterwards. I will build a test setup with 8.1 to experiment with the file locking and post some feedback on the other thread.

Thanks.

@iGadget

iGadget commented Aug 3, 2015

So will the file locking solution also work when you rename a large folder and undo that rename within a few seconds? How would that work out?

@PVince81
Contributor

PVince81 commented Aug 3, 2015

If you undo the rename while the operation is still in progress you will get a message like "folder is currently busy" and will need to try again later. If done through the sync client, the sync client will automatically retry later.
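The behavior described here (fail fast with "busy", retry later) can be sketched with a non-blocking lock. This is a toy illustration only: `FolderLocks`, `FolderBusy`, and `rename_with_retry` are hypothetical names, not ownCloud's locking API.

```python
import threading

class FolderBusy(Exception):
    """Raised when another operation currently holds the lock for a path."""

class FolderLocks:
    """Toy per-path exclusive locks kept in memory."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held = set()

    def acquire(self, path):
        with self._guard:
            if path in self._held:
                raise FolderBusy(f"{path} is currently busy")
            self._held.add(path)

    def release(self, path):
        with self._guard:
            self._held.discard(path)

def rename_with_retry(locks, path, do_rename, attempts=3, on_busy=lambda: None):
    """Try the rename; if the folder is busy, call on_busy() and try again."""
    for _ in range(attempts):
        try:
            locks.acquire(path)
        except FolderBusy:
            on_busy()  # a real client would wait here before retrying
            continue
        try:
            do_rename()
            return True
        finally:
            locks.release(path)
    return False

locks = FolderLocks()
log = []

locks.acquire("Bilder")  # the first rename is still in progress
first_try = rename_with_retry(locks, "Bilder", lambda: log.append("undo"), attempts=1)

locks.release("Bilder")  # the first rename has finished
second_try = rename_with_retry(locks, "Bilder", lambda: log.append("undo"), attempts=1)
```

The undo is rejected while the first rename holds the lock and succeeds once it is released, so the two operations never run on the same folder at the same time.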

@PVince81
Contributor

PVince81 commented Aug 3, 2015

On another note, @icewind1991 had a POC fix that should accelerate renaming of database entries: #13956

@simopal6

Excuse me for intruding, but it is not clear to me if the problem still happens or not.

@alantygel

It just happened on my server.

We are still using ownCloud 8.0. Will upgrading to 9 solve the problem?

@PVince81
Contributor

PVince81 commented Dec 9, 2016

@alantygel yes, because OC 9 has a locking mechanism to avoid this kind of race condition

@lock

lock bot commented Aug 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 3, 2019