
Dataset files cleanup #9132

Merged: 15 commits from ErykKul:dataset_files_cleanup into IQSS:develop on Jan 27, 2023
Conversation

@ErykKul (Collaborator) commented Nov 4, 2022

What this PR does / why we need it:
In order to conserve storage space, you might want to clean up files in a dataset's storage location. Most importantly, with direct upload or certain user actions, there can be files that failed to upload fully and remain in storage without being linked to the dataset. With the new API call, you can remove all the files that are not linked to the dataset.

Which issue(s) this PR closes:

Closes #9130

Special notes for your reviewer:
This implementation removes all the files that are not in the file list of a specific dataset. Maybe the thumbnails and export cache files should remain in the storage? It could be useful to be able to remove these files, e.g., in datasets that are not accessed very often. You could also have thumbnails of files from previous versions of the dataset, possibly no longer present in the newest version, etc.

For now, I have kept it simple: the code removes all unlinked files. If required, I could add functionality so that files whose names start with the name of a linked file are left untouched.

Most importantly, I have only tested this code with local file storage. This code could be dangerous if used in production without prior testing...

@coveralls commented Nov 4, 2022

Coverage Status

Coverage: 19.967% (-0.03%) from 20.0% when pulling 635f32e on ErykKul:dataset_files_cleanup into ecc23c0 on IQSS:develop.

@qqmyers (Member) commented Nov 4, 2022

@ErykKul - haven't yet looked at the code, but thanks for this! I was just considering starting in on it. I would definitely like to see this expanded to preserve all the files that could be there, at least as an option (maybe the default) - thumbs, cached exports, aux files, external dataset thumbs, the provenance file, original files (for ingested tabular files). For S3, something that would also remove partial multipart uploads would be useful but could/should probably be a separate PR. A list-only version that just identifies extra files without deleting them would be useful as well.

I'm happy to contribute, test on S3, etc. to help you move this forward - just let me know when/where you want help. As a first step I'll try to look through your current code soon.

@donsizemore (Contributor) commented:

Yes, thank you for this. Is it possible to have the endpoint default to a listonly or dryrun mode, then actually fire when passed an execute or force flag? Just so admins can see what would be deleted first...

@ErykKul (Collaborator, Author) commented Nov 8, 2022

I have added the dryrun query parameter. I have also updated the filter to include the export files:

// A physical file is treated as a removal candidate when it is not a cached export
// ("export_..."), or when no name in `files` (the dataset's linked files) is a prefix of its name.
Predicate<String> filter = f -> {
    return !f.startsWith("export_") || files.stream().noneMatch(x -> f.startsWith(x));
};
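
For illustration only (not part of the PR): a minimal Java 11+ sketch of how the cleanup call with the dryrun parameter might be invoked, assuming the endpoint is GET /api/datasets/{id}/cleanStorage and that it authenticates via the X-Dataverse-key header; the server URL, dataset id, and API token below are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CleanStorageDryRun {
    public static void main(String[] args) throws Exception {
        String serverUrl = "https://dataverse.example.edu";       // placeholder
        long datasetId = 42;                                       // placeholder
        String apiToken = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx";  // placeholder

        // Start with dryrun=true so the call only reports what would be deleted.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/api/datasets/" + datasetId
                        + "/cleanStorage?dryrun=true"))            // assumed endpoint path
                .header("X-Dataverse-key", apiToken)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}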

@DieuwertjeBloemen DieuwertjeBloemen added this to Medium priority in KU Leuven RDR Jan 4, 2023
@mreekie commented Jan 4, 2023

It's about cleaning up files in S3 [edit by leonid: not just S3 - any file storage], files like those from direct upload. It's cleaning up cruft. You need to be sure that you are only deleting the cruft and not a file that is actually still required, so this one needs some care in review.
It's possible that the reviewer may find something significant to change.
There are no tests defined for this, so the work should include adding tests by the original developer. (This will likely block progress at first.)
Size 33

@mreekie mreekie added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Jan 4, 2023
@mreekie commented Jan 4, 2023

This may be related to work Oliver did on #8983

@landreev landreev self-requested a review January 12, 2023 16:05
@landreev landreev self-assigned this Jan 12, 2023
@landreev (Contributor) commented:

Thank you for this PR.
For some reason I have always assumed it was easier to perform this cleanup outside of the application, with scripts that can directly access both the storage and the database. (This is how I've been doing it on Harvard prod. servers.) I did realize at some point that we never even mention this issue in the guides (even saying "hey, it's the responsibility of the admin to clean up their storage and delete leftover junk from incomplete uploads and/or deletes, etc. etc." would be better than nothing). I was thinking of making a documentation PR for this, but your solution is of course much better.
I'm looking at the code now/making sure no "good" files are being overlooked. (We may want to make the warnings contain even stronger language when we release this, about the feature being experimental and potentially dangerous, how nobody should even think of using it in production without ensuring they have full backups of everything, etc. etc.)

@landreev (Contributor) commented Jan 20, 2023

... I have also updated the filter to include the export files:

...
    return !f.startsWith("export_") || ...;

This is the biggest potential problem I'm seeing so far. I don't think it is safe at all, to assume that the only useful dataset-level auxiliary files are the cached metadata export files. The export files are at least mostly safe to erase (they will be regenerated automatically when needed). But the dataset_logo* files (custom dataset thumbnails) that it would end up deleting with the filter above are not easily recoverable. The only other real-life examples of non-export dataset-level aux. files I can think of right away were some GEO/WorldMap-specific files that are no longer in use. But we should definitely assume that it is possible, for such files to exist. I.e., if we hard-code the valid filename prefixes that are in use now, future developers will need to know/remember to add to the list if they ever need to add aux. files that start with something else (and that would be asking for trouble).

Not sure what the best/cleanest way to deal with this is. It's possible that the best we can do is to only delete the files with the filenames that match ^[0-9a-f]{11}-[0-9a-f]{12}\.?.*. That should catch all the unlinked datafile-level main files and their derivatives under our current standard file naming scheme, I think. This will NOT catch some ancient files for a (very few) legacy installations - such as Harvard and Odum and a couple of others - that have grandfathered files that don't look like this. But I don't think this is a big issue - if any of us old-timer Dataverses still have some junk files from a decade+ back wasting storage space - that's probably on us.
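
As an illustration only (not from the PR): a minimal Java sketch that runs the proposed pattern against made-up physical file names, to show what it would and would not catch.

import java.util.List;
import java.util.regex.Pattern;

public class StorageNamePatternCheck {
    public static void main(String[] args) {
        // The pattern proposed above; matching names are treated as datafile-level files.
        Pattern p = Pattern.compile("^[0-9a-f]{11}-[0-9a-f]{12}\\.?.*");
        List<String> samples = List.of(
                "185eab8c3b5-f1d2e3a4b5c6",          // made-up main datafile name: matches
                "185eab8c3b5-f1d2e3a4b5c6.orig",     // made-up derivative/aux name: matches
                "export_dataverse_json.cached",      // dataset-level cached export: no match
                "dataset_logo_original"              // custom dataset thumbnail: no match
        );
        for (String name : samples) {
            System.out.printf("%-35s %b%n", name, p.matcher(name).matches());
        }
    }
}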

Anyone? - Does the regex above really match all possible (reasonably modern) datafile-level physical files? (Does this work with RemoteOverlay?)

OK, we'll figure this out. (Sorry to be difficult :)

@qqmyers (Member) commented Jan 20, 2023

I agree that just deleting datafiles not in the database ( ^[0-9a-f]{11}-[0-9a-f]{12}\..* ) would be a safer option. I didn't check that regex just now, but the new code that ensures you can only send in valid storage identifiers should be using a regex, so we can compare with that.

There are definitely more files than just export_* (which, as you say, will be regenerated) at the dataset level that need to remain - the json provenance file is one of them; that is user input and can't be recreated. Right now, creating a whitelist would be more trouble than a blacklist approach. If/when the dataset-level aux files are organized like the file-level ones (all with the same naming pattern), a whitelist might be easier.

Not sure what that means for this PR. Having the ability to clean up failed uploads is probably 90% of the problem and arguably safer, so it would be nice if that were an (or the) option. If there's interest in trying to assemble a whitelist of all additional files that can exist, I'd be happy to help track down the sources, but I think it would be OK to make that future work as well.

One other thing to be aware of - it is probably an infrequent case but files in a dataset may not all be in the same store. One can change the store used by the dataset to save new files after some files have been uploaded. I think the auxiliary files for the dataset are always in the store for the dataset but the datafiles and their auxfiles could be spread out.

@landreev (Contributor) commented Jan 20, 2023

I agree that just deleting datafiles not in the database ( ^[0-9a-f]{11}-[0-9a-f]{12}\.?.* ) would be a safer option.

I think we should just go ahead and adopt the regex approach. In practice, I can't think of anything else that could be wasting any appreciable amount of space, other than leftover datafiles, from failed uploads or deletes.

I would NOT be super comfortable with the idea of a white list; can't think of a truly sure way to keep it up-to-date.

If/when the dataset-level aux files are organized like the file level ones (all with the same naming pattern)

Yeah, we messed up this part - the storage names for the dataset-level files. It's also because of this that StorageIO can listAuxObjects for a file, but not for a dataset. We save a file-level aux file with the tag "xyz" as "${basename}.xyz", but on the dataset level, we save it as just "xyz". Either some reserved dataset-level basename, or simply ".xyz", would be a much better scheme. (Obviously, this is way outside the scope of this PR.)
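
A tiny illustration of that asymmetry (the datafile basename is made up, and the dataset-level name is hypothetical apart from its "export_" prefix):

public class AuxNamingIllustration {
    public static void main(String[] args) {
        String datafileBasename = "185eab8c3b5-f1d2e3a4b5c6"; // made-up basename

        // File-level aux file with tag "xyz": shares the datafile's basename,
        // which is why StorageIO can listAuxObjects for a file.
        System.out.println(datafileBasename + ".xyz");

        // Dataset-level aux file: stored under its tag alone, with no shared
        // basename, so there is nothing comparable to list by.
        System.out.println("export_json.cached"); // hypothetical name
    }
}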

@landreev (Contributor) commented:

(sorry, that regex I first posted was specifically for aux. files - i.e., for filenames with extensions; it needed one extra character to match the main files too; I corrected it in my comments above)

@landreev (Contributor) commented:

Thank you @ErykKul. I'm looking at the latest changes now.

@landreev (Contributor) commented:

Sorry for the delay with reviewing the PR; I keep getting distracted by other things, but will try to wrap it up asap.
I just tested it with filesystem-stored files; happy to see that my regex appears to be working (with the extra backslash to escape the literal . - yes). I had only tried it with egrep and wasn't 100% sure it was going to work with the Java matcher.
I do want to test it on s3 some more, so I'll go do that.
(I may suggest that we tighten that regex a bit further, let me think about that).

@landreev (Contributor) commented:

The first part of our generated physical file name - [0-9a-f]{11} - that's the timestamp of when the file was created, the number of milliseconds since 01-01-1970, hex-encoded.
If my quick math is correct, the regex will remain valid for 500+ more years before the number becomes 12 hex digits long... so, I think we are good?
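
A quick way to double-check that estimate (illustration only, not from the PR):

import java.time.Instant;

public class TimestampWidthCheck {
    public static void main(String[] args) {
        // Smallest millisecond value whose hex representation needs 12 digits: 16^11.
        long firstTwelveDigitMillis = 0x100000000000L; // 17,592,186,044,416 ms
        System.out.println(Long.toHexString(System.currentTimeMillis()).length()); // 11 today
        System.out.println(Instant.ofEpochMilli(firstTwelveDigitMillis));          // a date in the year 2527
        // So the {11} part of the regex stays correct for roughly 500 more years.
    }
}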


Review thread on the documentation changes. The quoted guide text:

You might also want to remove cached export files or some temporary files, thumbnails, etc.

All the files stored in the Dataset storage location that are not in the file list of that Dataset can be removed, as shown in the example below.

@landreev (Contributor) commented Jan 24, 2023

Is it necessary to clarify here that only files that look like leftover datafiles will be deleted, and that Dataverse will skip any files it can't positively identify as such? So, if they happen to have anything else useful stored in the same directory (not that we ever recommend doing that), it will be spared?
I'm not sure whether it is in fact necessary to explain this, so I'm leaving that up to you.

Contributor:

But it would be great to add a reminder here that this is an experimental feature, like in the release note, and a suggestion to make sure their backups are up-to-date before attempting this on prod. servers. Also, could you please document the "dryrun" parameter and tell them, sternly, to start with that. Thanks!

@ErykKul (Collaborator, Author):

Thanks. I will rework the documentation to reflect the current solution and add the "experimental feature" warning there. The dry run parameter is a very good idea for testing before using it, and it should be explained (I had not done that yet after adding the feature...). I will add an extra warning for Swift users, as it is not yet tested on Swift...

@ErykKul (Collaborator, Author):

@landreev
Thank you for your review! I have made the remaining changes and updated the documentation. Can you have a look at the latest commits? Thanks.

@landreev (Contributor) commented:

I tested it under S3 a bit, it's all looking good - thank you!
I don't think it's necessary to mess with that regex any further.
So, the LAST thing I'm worried about: swift. Do you happen to have a way to test it? Because I don't think we do at this point.

@ErykKul (Collaborator, Author) commented Jan 25, 2023

Swift is a problem; I cannot test it either. The current solution with your regex is much safer: we do not attempt to remove files that do not match the pattern, and if the removal fails, the file remains in the system. I think it is OK to release it with an additional warning that it is not tested on Swift. A better solution would be to find someone who uses Swift and can test it...

@pdurbin (Member) commented Jan 25, 2023

Borealis (née Scholars Portal) is using Swift, I believe. They wrote "Ontario Library Research Cloud (SWIFT-S3 setup), locally managed with 5 nodes across Canada" under "indicate where Dataverse deployed" in the 2022 GDCC survey. They also ticked the box for using Swift:

[Screenshot: 2022 GDCC survey response with the Swift option checked]

I'm not sure who would know for sure or who might be able to test this PR so I'll just spam some folks from Borealis (sorry! 😅 ): @amberleahey @lubitchv @JayanthyChengan @meghangoodchild

@landreev (Contributor) left a review comment:

I am going to go ahead and approve the PR. Thank you for all the work on this!
As for swift - thank you @pdurbin for spamming some potential testers of this functionality.

I personally don't think the swift part of it should hold up the PR, or be considered a blocker. So, if testing under swift becomes even remotely problematic, I'm just going to suggest that the cleanup implementation in the swift driver be commented out and replaced with a "not yet supported" exception, plus a message to any potential swift user suggesting that they review the code and complete the task of adding the functionality.

IQSS/dataverse project automation moved this from Review 🔎 to Ready for QA ⏩ Jan 25, 2023
@kcondon kcondon self-assigned this Jan 26, 2023
@mreekie commented Jan 26, 2023

sprint kickoff

  • include in QATail for Jan 11, 2023

@mreekie mreekie added Size: SprintTail Issues that have made it into Ready for QA or further right and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels Jan 26, 2023
@kcondon kcondon merged commit 6bbbf38 into IQSS:develop Jan 27, 2023
IQSS/dataverse project automation moved this from QA ✅ to Done 🚀 Jan 27, 2023
@pdurbin pdurbin added this to the 5.13 milestone Jan 27, 2023
@DieuwertjeBloemen DieuwertjeBloemen moved this from Medium priority to Closed in KU Leuven RDR Mar 7, 2023