
Deleting datafiles from draft versions: Eliminate the possibility of a partial failure. #5535

Closed
landreev opened this issue Feb 13, 2019 · 16 comments

@landreev
Contributor

(Will post more details below)
As a quick summary, when we delete an unpublished datafile, we delete both the database object and the physical file.
This is done inside the DeleteDataFileCommand, which in turn is called from inside the UpdateDatasetVersionCommand.
This leaves many possibilities for things to go wrong halfway into the process:
For example, DeleteDataFileCommand succeeds, then the top-level UpdateDatasetVersionCommand fails. This reverses the transaction, restoring the datafile in the database; but the physical file is already gone.

This can be extremely confusing/dangerous.
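To illustrate the hazard, here is a minimal, hypothetical simulation (not Dataverse code; the class and names are invented for the sketch). A rollback restores the in-memory "database", but the "filesystem" delete cannot be undone:

```java
import java.util.*;

// Illustrative sketch only: simulates why deleting the physical file inside
// a reversible transaction is dangerous. All names are hypothetical.
public class PartialFailureDemo {
    static Set<String> database = new HashSet<>(Set.of("datafile-42"));
    static Set<String> filesystem = new HashSet<>(Set.of("datafile-42"));

    // Simulated UpdateDatasetVersionCommand wrapping a DeleteDataFileCommand.
    static void updateDatasetVersion(boolean failAfterDelete) {
        Set<String> dbSnapshot = new HashSet<>(database); // "transaction" begins
        try {
            database.remove("datafile-42");    // reversible DB change
            filesystem.remove("datafile-42");  // IRREVERSIBLE physical delete
            if (failAfterDelete) {
                throw new IllegalStateException("simulated OptimisticLockException");
            }
        } catch (RuntimeException e) {
            database = dbSnapshot; // rollback restores the DB row only
        }
    }

    public static void main(String[] args) {
        updateDatasetVersion(true);
        // The DB still lists the file, but the bytes are gone:
        System.out.println(database.contains("datafile-42"));   // true
        System.out.println(filesystem.contains("datafile-42")); // false
    }
}
```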

@djbrooke
Contributor

Discussed in sprint planning today. @landreev @scolapasta and other interested parties will discuss and estimate.

@landreev
Contributor Author

Part of the task is to figure out what approach to take here.
We are in agreement that permanent physical removal of files should be taken out of the command transactions that can be reversed. There are a couple of ways to achieve this.
One would be to simply declare that the delete datafile command DOES NOT handle the removal of the physical file at all: it must be done outside of the command, and is thus the responsibility of the developer to take care of, once it has been confirmed that the delete command has succeeded. This can be achieved fairly easily.
The other possibility that was discussed was to expand the command execution framework programmatically, adding some "onSuccess" mechanism for calling things after all the reversible tasks have been completed. This is obviously more work, with the advantage that it could be reused for other purposes. (Note specifically that it would need to be able to handle execution of nested commands, such as described above; in that case the safe onSuccess point is the completion of the UpdateDatasetVersionCommand, not the inner DeleteDataFileCommand.)
I would go with the former, for the sake of fixing something potentially dangerous ASAP, in a way that does not require extensive D&D. I don't think there's anything hacky about this approach either. We have in the past similarly taken out the code for physically saving files from inside the Create Version command, for similar reasons.
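The first option can be sketched roughly like this (a hypothetical simulation, not the actual Dataverse classes): collect the storage location up front, run the reversible command, and only touch physical storage after the commit is confirmed.

```java
import java.util.*;

// Sketch of "delete outside the command": the command only removes the
// database object; the caller deletes the physical file after confirming
// success. All names are hypothetical stand-ins.
public class DeleteOutsideCommandDemo {
    static Set<String> database = new HashSet<>(Set.of("s3://bucket/datafile-42"));
    static Set<String> storage  = new HashSet<>(Set.of("s3://bucket/datafile-42"));

    // Returns true if the (simulated) transactional part committed.
    static boolean deleteDataFileCommand(String storageLocation, boolean fail) {
        if (fail) {
            return false;                 // transaction rolled back, DB untouched
        }
        database.remove(storageLocation); // DB object gone
        return true;
    }

    static void deleteFile(String storageLocation, boolean fail) {
        // Collect the storage location BEFORE running the command...
        boolean committed = deleteDataFileCommand(storageLocation, fail);
        // ...and only touch physical storage once the commit is confirmed.
        if (committed) {
            storage.remove(storageLocation);
        }
    }

    public static void main(String[] args) {
        deleteFile("s3://bucket/datafile-42", true);  // failed attempt
        System.out.println(storage.size());           // 1: file untouched
        deleteFile("s3://bucket/datafile-42", false); // successful delete
        System.out.println(storage.size());           // 0
    }
}
```

The worst case here is a leftover physical file after a crash between the commit and the physical delete, which is the safer direction to fail in.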

@landreev
Contributor Author

In the past we have discussed taking an extra step of never completely deleting files at all from within the application. Instead, when a datafile is deleted, we would move the physical file aside - rename it with a ".DELETED" suffix, for example. The assumption is then that it's up to the admins to delete all these files outside the app, once it's been verified that they are no longer referenced (with the additional possibility for the admin to choose to keep these "deleted" files indefinitely, as an extra backup option).
This could easily be implemented as a configurable option.
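A minimal sketch of that soft-delete option, using standard `java.nio.file` calls (the class, flag name, and file names are hypothetical):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;

// Sketch of the "never really delete" option: move the physical file aside
// with a .DELETED suffix and let admins purge it later. Hypothetical names.
public class SoftDeleteDemo {
    // The proposed configurable option: keep "deleted" files as a backup.
    static boolean keepDeletedFiles = true;

    static void deletePhysicalFile(Path file) {
        try {
            if (keepDeletedFiles) {
                // e.g. "datafile-42" -> "datafile-42.DELETED"
                Files.move(file, file.resolveSibling(file.getFileName() + ".DELETED"));
            } else {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try {
            Path dir = Files.createTempDirectory("softdelete-demo");
            Path f = Files.createFile(dir.resolve("datafile-42"));
            deletePhysicalFile(f);
            System.out.println(Files.exists(f));                                  // false
            System.out.println(Files.exists(dir.resolve("datafile-42.DELETED"))); // true
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```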

@qqmyers
Member

qqmyers commented Feb 14, 2019

FWIW - I just saw #4573 when looking for something else - it seems like that's another example of this (though maybe the problem that caused the db and storage to get out of sync was resolved there).

Also FWIW: If there weren't nested transactions involved, it seems like doing the delete in a finally clause might help - only do the file delete if the db actions succeed without issue.

@landreev
Contributor Author

The entries from the actionlogrecord (note the interesting start and end times):

```
0a91078f-9224-4382-a3d4-4c88d2507046    OK      edu.harvard.iq.dataverse.engine.command.impl.DeleteDataFileCommand      Command 2018-12-29 03:47:53.876 :[XXX]        2018-12-28 22:59:17.8   @YYY

fbd2c18b-e23e-4c9c-862d-b649bda869ee    InternalError   edu.harvard.iq.dataverse.engine.command.impl.UpdateDatasetVersionCommand        Command 2018-12-29 03:47:53.886 :[XXX]  (Exception [EclipseLink-5006] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.OptimisticLockException
Exception Description: The object [[DatasetVersion id:ZZZ]] cannot be updated because it has changed or been deleted since it was last read. Class> edu.harvard.iq.dataverse.DatasetVersion Primary Key> NNN)    2018-12-28 22:59:17.434 @YYY
```

landreev added a commit that referenced this issue Feb 26, 2019:
stored physical files that may not be associated with any DvObjects; (ref #5535)
landreev added further commits that referenced this issue Feb 27 - Mar 1, 2019, including:
…on of files, when datafiles are deleted via an UpdateDatasetVersionCommand (as opposed to deleting them one by one via individual DeleteDataFileCommands). (#5535)
@landreev
Contributor Author

landreev commented Mar 1, 2019

Making a PR; but please don't drag this into the "official" code review column just yet. We still need to finalize/decide whether this is a sensible approach.

@landreev landreev mentioned this issue Mar 1, 2019
5 tasks
@landreev
Contributor Author

landreev commented Mar 1, 2019

@scolapasta
As we previously agreed, the only way to ensure we are not deleting any physical files before we finalize deleting the corresponding objects from the database is to move the physical deletion completely out of the body of the command.
This required modifying our StorageIO framework, which previously only allowed access to the physical file object by going through the corresponding DvObject. It had to be extended to be able to access and delete a storage location directly - since the whole point is to delete it once the DvObject has been confirmed to no longer exist.

We also agreed that we'd first produce an implementation that proves the concept, writing that code for finalizing the deletes and inserting it around the corresponding commands, wherever they are called. Then we would investigate potentially expanding the command execution engine, to be able to specify such functionality in a programmatic way; as some "before" and "after" fragments that could be supplied with individual commands.
I have produced this proof-of-concept implementation; but I'm wondering now if we should at least consider going directly to that "stage 2" plan. The amount of extra code that had to be added is manageable, but it had to be inserted in quite a few places throughout the system. The DeleteDataFileCommand itself is only called directly once in the entire application (from EditDataFilesPage). Elsewhere, files are deleted by means of calling the UpdateDatasetVersion, DeleteDatasetVersion, DeleteDataset (*) and DestroyDataset commands - which in turn eventually call the delete file command(s). I had to add this special handling in 10 different places total (I hope I got them all), which is kind of a lot... It's definitely not a sustainable long-term solution to just say that it is the responsibility of the developer to remember that some commands need this kind of special treatment. So I'm wondering if we should go directly to the cleaner one, even in the short term.
On the plus side, the worst thing that can happen now is that some stray, un-deleted physical files may end up left over on storage volumes. Which is WAY BETTER than the other way around.

(*) The DeleteDatasetCommand is currently implemented to redirect to DeleteDatasetVersion internally; but the command is still used throughout the application.

@landreev
Contributor Author

landreev commented Mar 1, 2019

To clarify/expand: it's not just that this extra handling had to be added in too many places across the application. In some cases, functionality (like permission checks) that could previously be hidden/contained inside the commands had to be exposed and duplicated outside them. A good example is the changes in Datasets.java. The whole point of our command framework was that you could just fire a command from an API wrapper without checking permissions and such; as you can see, I had to duplicate a lot of that in the API code, in order to be able to select the correct physical storage locations that need to be destroyed after the command succeeds.

@qqmyers
Member

qqmyers commented Mar 1, 2019

Any possibility the storage classes could be made transactional? Would that help in keeping the issue localized?

@landreev
Contributor Author

landreev commented Mar 1, 2019

@qqmyers I believe it could be done, yes. In a super-simplified form, we could replace the deletion of the file with renaming it ".DELETED". If you need to roll the transaction back, you simply rename the file again...
But then we're back to the problem of having to finalize (or roll back) that "file transaction" outside of the body of the command. So I don't feel it is any simpler than just having to delete outside the command(?)
I still think @scolapasta's original idea - to modify the command engine, adding support for programmatically defining methods that need to be executed before and/or after it executes a command, or a nested tree of commands, in a single transaction - is a good long-term solution.
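For the record, here is one way that engine-level idea could look - a deliberately simplified, hypothetical engine (not the Dataverse command engine) in which onSuccess callbacks only fire after the outermost command of a nested tree completes:

```java
import java.util.*;

// Hypothetical sketch of an "onSuccess" command engine: callbacks for
// irreversible work (e.g. physical deletes) run only after the OUTERMOST
// command commits, never for an inner command of a nested tree.
public class OnSuccessEngineDemo {
    interface Command {
        void execute(Engine e);
        default void onSuccess() {} // irreversible follow-up work goes here
    }

    static class Engine {
        private final List<Command> pending = new ArrayList<>();
        private int depth = 0; // nesting level of the current command tree

        void submit(Command c) {
            depth++;
            try {
                c.execute(this);          // may submit nested commands
                pending.add(c);
                if (depth == 1) {         // outermost command committed
                    for (Command p : pending) p.onSuccess();
                    pending.clear();
                }
            } catch (RuntimeException ex) {
                pending.clear();          // simulated rollback: no callbacks fire
                throw ex;
            } finally {
                depth--;
            }
        }
    }

    static List<String> log = new ArrayList<>();

    public static void main(String[] args) {
        Engine engine = new Engine();
        Command deleteFile = new Command() {
            public void execute(Engine e) { log.add("db delete"); }
            public void onSuccess()       { log.add("physical delete"); }
        };
        Command update = new Command() {
            public void execute(Engine e) { e.submit(deleteFile); log.add("version update"); }
        };
        engine.submit(update);
        System.out.println(log); // physical delete happens last, after the update
    }
}
```

Note how the safe onSuccess point is the completion of the outer update, matching the nested-command caveat discussed above.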

@qqmyers
Member

qqmyers commented Mar 1, 2019

@landreev - I was thinking that there might be a way for classes called by a command to be included in the transaction started for that command, so that the commit or rollback (doing things like you say) would then be handled by the overall transaction logic. Not my forte, though.

Requiring commands to be part of a transaction, but not necessarily starting a new one, so that you could nest them, would also be a nice addition. (In that case a deletePhysicalFileCommand would work as I was thinking above - just not as a command that can be called directly on its own... so maybe I'm just suggesting a variant of what was already discussed.)

@landreev
Contributor Author

landreev commented Mar 7, 2019

In retrospect, we closed #4573 prematurely. We had eliminated some known/reproducible scenarios described there; but, as we now know, there were more potential ways to leave a file behind in this "half-deleted" state.

To be precise, we knew it was still possible, just by virtue of a permanent action being performed inside a command that makes reversible changes in the database. But we were hoping it would take some rare combination of events for it to actually happen.

What started this investigation - a specific file lost - did appear to be an exotic enough scenario (see the actionlogrecord fragment above). It must have involved a user trying to delete a file on an application node that was freezing, then giving up and making some other changes on the other node; then, some time later, the first node coming back to life and attempting to complete the delete/update... which failed with an OptimisticLockException, but erased the file in the process.

There are however much simpler ways to reproduce this condition: (@kcondon - this could be useful for QA)

  • create a draft; upload at least one file;
  • open the draft in 2 browser windows;
  • in 1 window, select a file, go to "edit files" -> "metadata", select the file, go to "edit" -> "delete";
  • when you get "the file will be deleted once you hit save", don't click save just yet!
  • in the 2nd window, select the same file, go to edit metadata, change the file name, save;
  • go back to window 1, complete the delete by clicking "save".
  • the delete will fail with an OptimisticLockException (because of the change made in the other browser); but the file on the filesystem will be deleted. Now the dataset shows a file that cannot be downloaded/is not on the filesystem.

@qqmyers
Member

qqmyers commented Mar 7, 2019

What fun! I can see two possible sub-cases here - if the first transaction has completed, I think the second could check the db to see if the modified date changed relative to the objects it has (or if a draft now exists that wasn't there before), before trying to delete files. If the first transaction is ongoing (e.g. lots of files in the dataset), the edit lock from #5551 might help. (FWIW: It's not clear that this lock is needed for its original purpose (see discussion there), but it does record the fact that a transaction is in progress so others should be able to fail early, i.e. before affecting files...).

@landreev
Contributor Author

landreev commented Mar 7, 2019

We will continue investigating and improving this issue (of simultaneous/competing updates). Locks are probably the ultimate safe way to solve it. I'm not sure it could be solved reliably by checking the modification timestamps alone (that would likely not cover the "exotic", but real-life, scenario described above).

But all that said, an optimistic lock exception is not by itself the end of the world. It's actually a good thing that you can't modify an object that another process/user/etc. is already modifying in parallel. But some foolproof system of locks would make for a less confusing user experience.

This PR is just to fix a particularly nasty problem - where a user is left with what looks like a file in their dataset, with the physical file no longer there. Not being able to delete a file because you, or somebody else, are modifying it in parallel is OK - as long as you are left with that un-deleted datafile intact: showing on the page, and still there on the filesystem/S3/etc.
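For anyone unfamiliar with the mechanism: JPA implements optimistic locking with a version column (via @Version), and an update only commits if the version is unchanged since the entity was read. A tiny, hypothetical simulation of the two-window scenario above:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of optimistic locking (JPA's @Version works on the
// same principle): an update succeeds only if the stored version still
// matches the version that was read.
public class OptimisticLockDemo {
    static AtomicLong storedVersion = new AtomicLong(1); // version in the DB row

    // Returns true if the update committed; false models OptimisticLockException.
    static boolean update(long versionWhenRead) {
        return storedVersion.compareAndSet(versionWhenRead, versionWhenRead + 1);
    }

    public static void main(String[] args) {
        long window1 = storedVersion.get(); // both browser windows read version 1
        long window2 = storedVersion.get();
        System.out.println(update(window2)); // true: rename in window 2 commits
        System.out.println(update(window1)); // false: window 1's stale delete fails
    }
}
```

The bug in this issue was that the physical delete had already happened by the time the second update failed; the fix keeps that failure harmless.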

@kcondon
Contributor

kcondon commented Mar 11, 2019

Tested scenario above as well as simple delete scenarios in various dataset states. Ready for merge, pending some unit test review by Leonid.

@kcondon kcondon removed their assignment Mar 11, 2019
@kcondon kcondon self-assigned this Mar 11, 2019
@landreev
Contributor Author

@kcondon I have found and fixed one problem: in the S3 driver, cached auxiliary files (for ex., thumbnails for image files) were left behind after a delete.
Moving it back into QA. Should be ready for real now.
