Deleting datafiles from draft versions: Eliminate the possibility of a partial failure. #5535
Discussed in sprint planning today. @landreev, @scolapasta, and other interested parties will discuss and estimate.
Part of the task is to figure out what approach to take here.
In the past we have discussed taking an extra step of never completely deleting files from inside the application at all. Instead, when a datafile is deleted, we move the physical file aside - renaming it with a ".DELETED" suffix, for example. The assumption is then that it's up to the admins to delete all these files outside the app, once it's been verified that they are no longer referenced (with the additional option for the admin to keep these "deleted" files indefinitely, as an extra backup).
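The rename-aside idea can be sketched in a few lines. This is only an illustration, not the actual Dataverse storage code; the class and method names (`SoftDelete`, `markDeleted`) are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SoftDelete {
    // Rename the physical file aside instead of deleting it; an admin can
    // purge (or keep) the *.DELETED files later, outside the application.
    static Path markDeleted(Path file) throws IOException {
        Path aside = file.resolveSibling(file.getFileName() + ".DELETED");
        return Files.move(file, aside, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("soft-delete-demo");
        Path file = Files.createFile(dir.resolve("datafile.dat"));
        Path aside = markDeleted(file);
        System.out.println(Files.exists(file) + " " + aside.getFileName());
        // prints: false datafile.dat.DELETED
    }
}
```

A rename within the same directory is cheap and (on most local filesystems) atomic, which is what makes this safer than an outright delete.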
FWIW - just saw #4573 when looking for something else - seems like that's another example of this (though maybe the problem that triggered the db and storage getting out of sync was resolved there). Also FWIW: if there weren't nested transactions involved, it seems like doing the delete in a finally clause might help - only do the file delete if the db actions succeed without issue.
The entries from the actionlogrecord:
stored physical files that may not be associated with any DvObjects; (ref #5535)
database delete transaction. (#5535)
…iated with the DataFile deleted via SWORD. (#5535)
…on of files, when datafiles are deleted via an UpdateDatasetVersionCommand (as opposed to deleting them one by one via individual DeleteDataFileCommands). (#5535)
Making a PR; but please don't drag this into the "official" code review column just yet. We still need to finalize/decide whether this is a sensible approach.
@scolapasta We also agreed that we'd first produce an implementation that proves the concept, writing the code for finalizing the deletes and inserting it around the corresponding commands, wherever they are called. Then we would investigate potentially expanding the command execution engine, to be able to specify such functionality in a programmatic way - as "before" and "after" fragments that could be supplied with individual commands.

(*) The DeleteDatasetCommand is currently implemented to redirect to DeleteDatasetVersion internally; but the command is still used throughout the application.
To clarify/expand: it's not just that this extra handling had to be added in too many places across the application. In some cases, functionality (like permission checks) that could previously be hidden/contained inside the commands had to be exposed and duplicated outside them. A good example is the changes in Datasets.java. The whole point of our command framework was that you could just fire a command from an API wrapper without checking permissions and such; as you can see, I had to duplicate a lot of that in the API code, in order to select the correct physical storage locations that need to be destroyed after the command succeeds.
Any possibility the storage classes could be made transactional? Would that help in keeping the issue localized? |
@qqmyers I believe it could be done, yes. In a super-simplified form, we could replace the deleting of the file with renaming it ".DELETED". If you need to roll the transaction back, you simply rename the file again...
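A minimal sketch of that "transactional" rename, again with hypothetical names (`ReversibleDelete`, `commit`, `rollback`) rather than real Dataverse APIs: the destructive step only happens at commit time, and a rollback restores the original file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ReversibleDelete {
    private final Path original;
    private final Path aside;

    // "Delete" by renaming aside; the change stays reversible until commit().
    ReversibleDelete(Path file) throws IOException {
        this.original = file;
        this.aside = file.resolveSibling(file.getFileName() + ".DELETED");
        Files.move(file, aside, StandardCopyOption.ATOMIC_MOVE);
    }

    // The database transaction succeeded: finalize by really deleting.
    void commit() throws IOException {
        Files.delete(aside);
    }

    // The database transaction rolled back: restore the original file.
    void rollback() throws IOException {
        Files.move(aside, original, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("reversible-delete-demo");
        Path file = Files.createFile(dir.resolve("datafile.dat"));
        ReversibleDelete delete = new ReversibleDelete(file);
        delete.rollback(); // simulate a failed database transaction
        System.out.println(Files.exists(file)); // prints: true
    }
}
```

Wiring the `commit()`/`rollback()` calls into the container-managed transaction lifecycle is the part that would need real design work, as discussed below.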
@landreev - I was thinking that there's a way that classes called by a command could be included in the transaction started for that command, and that the calling of the commit or rollback (doing things like you say) would then be handled by the overall transaction logic. Not my forte, though. Making commands be required to be part of a transaction, but not necessarily start a new one, so you could nest them, would also be a nice addition. (In that case a deletePhysicalFileCommand would work as I was thinking above - just not having it as a command, so it can't be called directly on its own... so maybe I'm just suggesting a variant of what was already discussed.)
In retrospect, we closed #4573 prematurely. We had eliminated some known/reproducible scenarios described there. But as we know now, there were more potential ways to leave a file behind in this "half-deleted" state. To be precise, we knew it was still possible, just by virtue of a permanent action being performed inside a command that makes reversible changes in the database. But we were hoping it would take some rare combination of events for it to actually happen. What started this investigation - a specific lost file - did appear to be an exotic enough scenario (see the actionlogrecord fragment above). It must have involved a user trying to delete a file on an application node that was freezing, then giving up and making some other changes on the other node; then, some time later, the first node coming back to life and attempting to complete the delete/update... which failed with an OptimisticLockException, but erased the file in the process. There are, however, much simpler ways to reproduce this condition: (@kcondon - this could be useful for QA)
What fun! I can see two possible sub-cases here - if the first transaction has completed, I think the second could check the db to see if the modified date changed relative to the objects it has (or if a draft now exists that wasn't there before), before trying to delete files. If the first transaction is ongoing (e.g. lots of files in the dataset), the edit lock from #5551 might help. (FWIW: It's not clear that this lock is needed for its original purpose (see discussion there), but it does record the fact that a transaction is in progress, so others should be able to fail early, i.e. before affecting files...)
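The first sub-case - refusing to touch physical files if the dataset changed underneath us - could look something like this. Purely a sketch; `StalenessGuard` and `destroyFilesIfUnchanged` are made-up names, and a real check would also look for a newly created draft, as suggested above:

```java
import java.time.Instant;

public class StalenessGuard {
    // Before destroying physical files, re-read the dataset's modification
    // time from the database and refuse to proceed if it changed since we
    // loaded our copy of the objects.
    static void destroyFilesIfUnchanged(Instant modTimeWeLoaded,
                                        Instant modTimeNowInDb,
                                        Runnable destroyFiles) {
        if (!modTimeNowInDb.equals(modTimeWeLoaded)) {
            throw new IllegalStateException(
                "dataset modified concurrently; refusing to delete physical files");
        }
        destroyFiles.run();
    }

    public static void main(String[] args) {
        Instant loaded = Instant.parse("2019-02-01T10:00:00Z");
        // Same timestamp: safe to destroy the files.
        destroyFilesIfUnchanged(loaded, loaded, () -> System.out.println("destroyed"));
        // Timestamp moved: the delete is refused before any file is touched.
        try {
            destroyFilesIfUnchanged(loaded, loaded.plusSeconds(60), () -> {});
        } catch (IllegalStateException e) {
            System.out.println("refused");
        }
    }
}
```

As noted in the next comment, a timestamp check alone probably isn't airtight against the ongoing-transaction case; that's where locks come in.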
We will continue investigating and improving this issue (of simultaneous/competing updates). Locks are probably the ultimate safe way to solve this. I'm not sure it could be solved reliably by checking the modification time stamps alone (it would likely not cover the "exotic", but real-life scenario described above). But all that said, an optimistic lock exception is not by itself the end of the world. It's actually a good thing that you can't make a modification to an object that another process/user/etc. is already modifying in parallel. But some foolproof system of locks would make it a less confusing user experience. This PR is just to fix a particularly nasty problem - where a user is left with what looks like a file in their dataset, with the physical file no longer there. Not being able to delete a file because you, or somebody else, are modifying it in parallel is OK - as long as you are left with that un-deleted datafile intact: showing on the page, and still there on the filesystem/S3/etc.
Tested scenario above as well as simple delete scenarios in various dataset states. Ready for merge, pending some unit test review by Leonid. |
@kcondon I have found and fixed one problem: in the S3 driver, cached auxiliary files (for ex., thumbnails for image files) were left behind after a delete. |
(Will post more details below)
As a quick summary, when we delete an unpublished datafile, we delete both the database object and the physical file.
This is done inside the DeleteDataFileCommand, which in turn is called from inside UpdateDatasetVersionCommand.
This leaves many possibilities for things to go wrong half-way into the process:
For example, DeleteDataFileCommand succeeds, then the top-level UpdateDatasetVersionCommand fails. This rolls back the transaction, restoring the datafile in the database; but the physical file is already gone.
This can be extremely confusing/dangerous.
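The fix described in this thread boils down to reordering the steps: do the reversible database work first, and destroy the physical files only after it has succeeded. A simplified sketch of that ordering, with hypothetical names (`DeferredPhysicalDelete`, `deleteDataFiles`) standing in for the real command plumbing:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.BooleanSupplier;

public class DeferredPhysicalDelete {
    // Run the reversible database transaction first; destroy the physical
    // files only after it succeeds, so a rollback never leaves a restored
    // DataFile row pointing at a missing file.
    static boolean deleteDataFiles(BooleanSupplier dbTransaction,
                                   List<Path> storageLocations) throws IOException {
        if (!dbTransaction.getAsBoolean()) {
            return false; // DB rolled back; nothing physical was touched
        }
        for (Path location : storageLocations) {
            Files.deleteIfExists(location); // the irreversible step happens last
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("deferred-delete-demo");
        Path file = Files.createFile(dir.resolve("datafile.dat"));
        // Simulate the dataset update failing: the physical file survives.
        deleteDataFiles(() -> false, List.of(file));
        System.out.println(Files.exists(file)); // prints: true
        // Simulate success: now, and only now, the physical file is removed.
        deleteDataFiles(() -> true, List.of(file));
        System.out.println(Files.exists(file)); // prints: false
    }
}
```

The storage locations have to be collected before the command runs, since the database rows describing them may be gone once it succeeds - which is the duplication in the API code mentioned earlier in the thread.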