
Deleting datafiles from draft versions: Eliminate the possibility of a partial failure. #5535

Closed
landreev opened this issue Feb 13, 2019 · 16 comments

@landreev
Contributor

(Will post more details below)
As a quick summary, when we delete an unpublished datafile, we delete both the database object and the physical file.
This is done inside the DeleteDataFileCommand, which in turn is called from inside the UpdateDatasetVersionCommand.
This leaves many possibilities for things to go wrong halfway into the process:
For example, DeleteDataFileCommand succeeds, then the top-level UpdateDatasetVersionCommand fails. This reverses the transaction, restoring the datafile in the database; but the physical file is already gone.

This can be extremely confusing/dangerous.
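To illustrate the hazard, here is a minimal, hypothetical simulation (not Dataverse code; the class and names are invented for the sketch). A rollback restores the in-memory "database", but the "filesystem" delete cannot be undone:

```java
import java.util.*;

// Illustrative sketch only: simulates why deleting the physical file inside
// a reversible transaction is dangerous. All names are hypothetical.
public class PartialFailureDemo {
    static Set<String> database = new HashSet<>(Set.of("datafile-42"));
    static Set<String> filesystem = new HashSet<>(Set.of("datafile-42"));

    // Simulated UpdateDatasetVersionCommand wrapping a DeleteDataFileCommand.
    static void updateDatasetVersion(boolean failAfterDelete) {
        Set<String> dbSnapshot = new HashSet<>(database); // "transaction" begins
        try {
            database.remove("datafile-42");    // reversible DB change
            filesystem.remove("datafile-42");  // IRREVERSIBLE physical delete
            if (failAfterDelete) {
                throw new IllegalStateException("simulated OptimisticLockException");
            }
        } catch (RuntimeException e) {
            database = dbSnapshot; // rollback restores the DB row only
        }
    }

    public static void main(String[] args) {
        updateDatasetVersion(true);
        // The DB still lists the file, but the bytes are gone:
        System.out.println(database.contains("datafile-42"));   // true
        System.out.println(filesystem.contains("datafile-42")); // false
    }
}
```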

@djbrooke
Contributor

Discussed in sprint planning today. @landreev @scolapasta and other interested parties will discuss and estimate.

@landreev
Contributor Author

Part of the task is to figure out what approach to take here.
We are in agreement that permanent physical removal of files should be taken out of the command transactions that can be reversed. There are a couple of ways to achieve this.
One would be to simply declare that the delete datafile command DOES NOT handle the removal of the physical file at all: it must be done outside of the command, and is thus the responsibility of the developer to take care of, once it has been confirmed that the delete command has succeeded. This can be achieved fairly easily.
The other possibility that was discussed was to expand the command execution framework programmatically, adding some "onSuccess" mechanism for calling things after all the reversible tasks have been completed. This is obviously more work, with the advantage that it could be reused for other purposes. (Note specifically that it would need to be able to handle execution of nested commands, such as described above; in that case the safe onSuccess point is the completion of the UpdateDatasetVersionCommand, not the inner DeleteDataFileCommand.)
I would go with the former, for the sake of fixing something potentially dangerous ASAP, in a way that does not require extensive D&D. I don't think there's anything hacky about this approach either. We have in the past similarly taken out the code for physically saving files from inside the Create Version command, for similar reasons.
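The first option can be sketched roughly like this (a hypothetical simulation, not the actual Dataverse classes): collect the storage location up front, run the reversible command, and only touch physical storage after the commit is confirmed.

```java
import java.util.*;

// Sketch of "delete outside the command": the command only removes the
// database object; the caller deletes the physical file after confirming
// success. All names are hypothetical stand-ins.
public class DeleteOutsideCommandDemo {
    static Set<String> database = new HashSet<>(Set.of("s3://bucket/datafile-42"));
    static Set<String> storage  = new HashSet<>(Set.of("s3://bucket/datafile-42"));

    // Returns true if the (simulated) transactional part committed.
    static boolean deleteDataFileCommand(String storageLocation, boolean fail) {
        if (fail) {
            return false;                 // transaction rolled back, DB untouched
        }
        database.remove(storageLocation); // DB object gone
        return true;
    }

    static void deleteFile(String storageLocation, boolean fail) {
        // Collect the storage location BEFORE running the command...
        boolean committed = deleteDataFileCommand(storageLocation, fail);
        // ...and only touch physical storage once the commit is confirmed.
        if (committed) {
            storage.remove(storageLocation);
        }
    }

    public static void main(String[] args) {
        deleteFile("s3://bucket/datafile-42", true);  // failed attempt
        System.out.println(storage.size());           // 1: file untouched
        deleteFile("s3://bucket/datafile-42", false); // successful delete
        System.out.println(storage.size());           // 0
    }
}
```

The worst case here is a leftover physical file after a crash between the commit and the physical delete, which is the safer direction to fail in.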

@landreev
Contributor Author

In the past we have discussed taking an extra step of never completely deleting files at all from within the application. Instead, when a datafile is deleted, we would move the physical file aside - rename it with a ".DELETED" suffix, for example. The assumption is then that it's up to the admins to delete all these files outside the app, once it's been verified that they are no longer referenced (with the additional possibility for the admin to choose to keep these "deleted" files indefinitely, as an extra backup option).
This could easily be implemented as a configurable option.
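A minimal sketch of that soft-delete option, using standard `java.nio.file` calls (the class, flag name, and file names are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;

// Sketch of the "never really delete" option: move the physical file aside
// with a .DELETED suffix and let admins purge it later. Hypothetical names.
public class SoftDeleteDemo {
    // The proposed configurable option: keep "deleted" files as a backup.
    static boolean keepDeletedFiles = true;

    static void deletePhysicalFile(Path file) {
        try {
            if (keepDeletedFiles) {
                // e.g. "datafile-42" -> "datafile-42.DELETED"
                Files.move(file, file.resolveSibling(file.getFileName() + ".DELETED"));
            } else {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            Path dir = Files.createTempDirectory("softdelete-demo");
            Path f = Files.createFile(dir.resolve("datafile-42"));
            deletePhysicalFile(f);
            System.out.println(Files.exists(f));                                  // false
            System.out.println(Files.exists(dir.resolve("datafile-42.DELETED"))); // true
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```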

@qqmyers
Member

qqmyers commented Feb 14, 2019

FWIW - I just saw #4573 when looking for something else - it seems like that's another example of this (though maybe the problem that caused the db and storage to get out of sync was resolved there).

Also FWIW: If there weren't nested transactions involved, it seems like doing the delete in a finally clause might help - only do the file delete if the db actions succeed without issue.

@landreev
Contributor Author

The entries from the actionlogrecord (note the interesting start and end times):

```
0a91078f-9224-4382-a3d4-4c88d2507046    OK      edu.harvard.iq.dataverse.engine.command.impl.DeleteDataFileCommand      Command 2018-12-29 03:47:53.876 :[XXX]        2018-12-28 22:59:17.8   @YYY

fbd2c18b-e23e-4c9c-862d-b649bda869ee    InternalError   edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand        Command 2018-12-29 03:47:53.886 :[XXX]  (Exception [EclipseLink-5006] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.OptimisticLockException
Exception Description: The object [[DatasetVersion id:ZZZ]] cannot be updated because it has changed or been deleted since it was last read. Class> edu.harvard.iq.dataverse.DatasetVersion Primary Key> NNN)    2018-12-28 22:59:17.434 @YYY
```

landreev added a commit that referenced this issue Feb 26, 2019:
stored physical files that may not be associated with any DvObjects; (ref #5535)
landreev added further commits that referenced this issue Feb 27 - Mar 1, 2019, including:
…on of files, when datafiles are deleted via an UpdateDatasetVersionCommand (as opposed to deleting them one by one via individual DeleteDataFileCommands). (#5535)
@landreev
Contributor Author

landreev commented Mar 1, 2019

Making a PR; but please don't drag this into the "official" code review column just yet. We still need to finalize/decide whether this is a sensible approach.

@landreev landreev mentioned this issue Mar 1, 2019
5 tasks
@landreev
Contributor Author

landreev commented Mar 1, 2019

@scolapasta
As we previously agreed, the only way to ensure we are not deleting any physical files before we finalize deleting the corresponding objects from the database is to move the physical deletion completely out of the body of the command.
This required modifying our StorageIO framework, which previously only allowed access to the physical file object by going through the corresponding DvObject. It had to be extended to be able to access and delete a storage location directly - since the whole point is to delete it once the DvObject has been confirmed to no longer exist.

We also agreed that we'd first produce an implementation that proves the concept, writing that code for finalizing the deletes and inserting it around the corresponding commands, wherever they are called. Then we would investigate potentially expanding the command execution engine, to be able to specify such functionality in a programmatic way; as some "before" and "after" fragments that could be supplied with individual commands.
I have produced this proof-of-concept implementation; but I'm wondering now if we should at least consider going directly to that "stage 2" plan. The amount of extra code that had to be added is manageable, but it had to be inserted in quite a few places throughout the system. The DeleteDataFileCommand itself is only called directly once in the entire application (from EditDataFilesPage). Elsewhere, files are deleted by means of calling the UpdateDatasetVersion, DeleteDatasetVersion, DeleteDataset (*) and DestroyDataset commands - which in turn eventually call the delete file command(s). I had to add this special handling in 10 different places total (I hope I got them all), which is kind of a lot... It's definitely not a sustainable long-term solution to just say that it is the responsibility of the developer to remember that some commands need this kind of special treatment. So I'm wondering if we should go directly to the cleaner one, even in the short term.
On the plus side, the worst thing that can happen now is that some stray, un-deleted physical files may end up left over on storage volumes. Which is WAY BETTER than the other way around.

(*) The DeleteDatasetCommand is currently implemented to redirect to DeleteDatasetVersion internally; but the command is still used throughout the application.

@landreev
Contributor Author

landreev commented Mar 1, 2019

To clarify/expand: it's not just that this extra handling had to be added in too many places across the application. In some cases, functionality (like permission checks) that could previously be hidden/contained inside the commands had to be exposed and duplicated outside them. A good example is the changes in Datasets.java. The whole point of our command framework was that you could just fire a command from an API wrapper without checking permissions and such; as you can see, I had to duplicate a lot of that in the API code, in order to be able to select the correct physical storage locations that need to be destroyed after the command succeeds.

@qqmyers
Member

qqmyers commented Mar 1, 2019

Any possibility the storage classes could be made transactional? Would that help in keeping the issue localized?

@landreev
Contributor Author

landreev commented Mar 1, 2019

@qqmyers I believe it could be done, yes. In a super-simplified form, we could replace the deletion of the file with renaming it ".DELETED". If you need to roll the transaction back, you simply rename the file again...
But then we're back to the problem of having to finalize (or roll back) that "file transaction" outside of the body of the command. So I don't feel it is any simpler than just having to delete outside the command(?)
I still think @scolapasta's original idea - to modify the command engine, adding support for programmatically defining methods that need to be executed before and/or after it executes a command, or a nested tree of commands, in a single transaction - is a good long-term solution.
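For the record, here is one way that engine-level idea could look - a deliberately simplified, hypothetical engine (not the Dataverse command engine) in which onSuccess callbacks only fire after the outermost command of a nested tree completes:

```java
import java.util.*;

// Hypothetical sketch of an "onSuccess" command engine: callbacks for
// irreversible work (e.g. physical deletes) run only after the OUTERMOST
// command commits, never for an inner command of a nested tree.
public class OnSuccessEngineDemo {
    interface Command {
        void execute(Engine e);
        default void onSuccess() {} // irreversible follow-up work goes here
    }

    static class Engine {
        private final List<Command> pending = new ArrayList<>();
        private int depth = 0; // nesting level of the current command tree

        void submit(Command c) {
            depth++;
            try {
                c.execute(this);          // may submit nested commands
                pending.add(c);
                if (depth == 1) {         // outermost command committed
                    for (Command p : pending) p.onSuccess();
                    pending.clear();
                }
            } catch (RuntimeException ex) {
                pending.clear();          // simulated rollback: no callbacks fire
                throw ex;
            } finally {
                depth--;
            }
        }
    }

    static List<String> log = new ArrayList<>();

    public static void main(String[] args) {
        Engine engine = new Engine();
        Command deleteFile = new Command() {
            public void execute(Engine e) { log.add("db delete"); }
            public void onSuccess()       { log.add("physical delete"); }
        };
        Command update = new Command() {
            public void execute(Engine e) { e.submit(deleteFile); log.add("version update"); }
        };
        engine.submit(update);
        System.out.println(log); // physical delete happens last, after the update
    }
}
```

Note how the safe onSuccess point is the completion of the outer update, matching the nested-command caveat discussed above.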

@qqmyers
Member

qqmyers commented Mar 1, 2019

@landreev - I was thinking that there might be a way for classes called by a command to be included in the transaction started for that command, so that the commit or rollback (doing things like you say) would then be handled by the overall transaction logic. Not my forte, though.

Requiring commands to be part of a transaction, but not necessarily starting a new one, so that you could nest them, would also be a nice addition. (In that case a deletePhysicalFileCommand would work as I was thinking above - just not as a command that can be called directly on its own... so maybe I'm just suggesting a variant of what was already discussed.)

@landreev
Contributor Author

landreev commented Mar 7, 2019

In retrospect, we closed #4573 prematurely. We had eliminated some known/reproducible scenarios described there; but, as we now know, there were more potential ways to leave a file behind in this "half-deleted" state.

To be precise, we knew it was still possible, just by virtue of a permanent action being performed inside a command that makes reversible changes in the database. But we were hoping it would take some rare combination of events for it to actually happen.

What started this investigation - a specific file lost - did appear to be an exotic enough scenario (see the actionlogrecord fragment above). It must have involved a user trying to delete a file on an application node that was freezing, then giving up and making some other changes on the other node; then, some time later, the first node coming back to life and attempting to complete the delete/update... which failed with an OptimisticLockException, but erased the file in the process.

There are however much simpler ways to reproduce this condition: (@kcondon - this could be useful for QA)

  • create a draft; upload at least one file;
  • open the draft in 2 browser windows;
  • in 1 window, select a file, go to "edit files" -> "metadata", select the file, go to "edit" -> "delete";
  • when you get "the file will be deleted once you hit save", don't click save just yet!
  • in the 2nd window, select the same file, go to edit metadata, change the file name, save;
  • go back to window 1, complete the delete by clicking "save".
  • the delete will fail with an OptimisticLockException (because of the change made in the other browser); but the file on the filesystem will be deleted. Now the dataset shows a file that cannot be downloaded/is not on the filesystem.

@qqmyers
Member

qqmyers commented Mar 7, 2019

What fun! I can see two possible sub-cases here - if the first transaction has completed, I think the second could check the db to see if the modified date changed relative to the objects it has (or if a draft now exists that wasn't there before), before trying to delete files. If the first transaction is ongoing (e.g. lots of files in the dataset), the edit lock from #5551 might help. (FWIW: It's not clear that this lock is needed for its original purpose (see discussion there), but it does record the fact that a transaction is in progress so others should be able to fail early, i.e. before affecting files...).

@landreev
Contributor Author

landreev commented Mar 7, 2019

We will continue investigating and improving this issue (of simultaneous/competing updates). Locks are probably the ultimate safe way to solve it. I'm not sure it could be solved reliably by checking the modification timestamps alone (that would likely not cover the "exotic", but real-life, scenario described above).

But all that said, an optimistic lock exception is not by itself the end of the world. It's actually a good thing that you can't modify an object that another process/user/etc. is already modifying in parallel. But some foolproof system of locks would make for a less confusing user experience.

This PR is just to fix a particularly nasty problem - where a user is left with what looks like a file in their dataset, with the physical file no longer there. Not being able to delete a file because you, or somebody else, are modifying it in parallel is OK - as long as you are left with that un-deleted datafile intact: showing on the page, and still there on the filesystem/S3/etc.
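For anyone unfamiliar with the mechanism: JPA implements optimistic locking with a version column (via @Version), and an update only commits if the version is unchanged since the entity was read. A tiny, hypothetical simulation of the two-window scenario above:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of optimistic locking (JPA's @Version works on the
// same principle): an update succeeds only if the stored version still
// matches the version that was read.
public class OptimisticLockDemo {
    static AtomicLong storedVersion = new AtomicLong(1); // version in the DB row

    // Returns true if the update committed; false models OptimisticLockException.
    static boolean update(long versionWhenRead) {
        return storedVersion.compareAndSet(versionWhenRead, versionWhenRead + 1);
    }

    public static void main(String[] args) {
        long window1 = storedVersion.get(); // both browser windows read version 1
        long window2 = storedVersion.get();
        System.out.println(update(window2)); // true: rename in window 2 commits
        System.out.println(update(window1)); // false: window 1's stale delete fails
    }
}
```

The bug in this issue was that the physical delete had already happened by the time the second update failed; the fix keeps that failure harmless.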

@kcondon
Contributor

kcondon commented Mar 11, 2019

Tested scenario above as well as simple delete scenarios in various dataset states. Ready for merge, pending some unit test review by Leonid.

@kcondon kcondon removed their assignment Mar 11, 2019
@kcondon kcondon self-assigned this Mar 11, 2019
@landreev
Contributor Author

@kcondon I have found and fixed one problem: in the S3 driver, cached auxiliary files (for ex., thumbnails for image files) were left behind after a delete.
Moving it back into QA. Should be ready for real now.
