
Validate the physical files when publishing a dataset #6558

Closed
landreev opened this issue Jan 22, 2020 · 8 comments · Fixed by #6790

@landreev
Contributor

This is a proposal/an idea to consider.

Looking back at the issues we've had with users' physical files over the years, most were caused by something that went wrong while the files were still in the Draft/unpublished state.

This is because we only attempt/allow deleting the physical file before the datafile gets published. And that's when something can go wrong - most commonly, an attempted delete or ingest would remove or corrupt the physical file while preserving the datafile entry in the database. Then the user publishes the version without realizing the physical file is missing.

This makes me think we should consider going through the files as they are being published and validating/confirming that they are still present, that the checksums add up, etc. - somewhat similar to how we attempt to register global IDs for datafiles on publish.

Doing so would prevent most, if not all, of the situations where a physical file ends up missing from a published dataset.
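
A minimal sketch of the kind of check I have in mind (plain Java, not the actual Dataverse storage code; the algorithm and recorded checksum would come from the datafile's database record):

```java
// Recompute the checksum of the stored bytes and compare it to the value
// recorded in the database for the datafile. Names here are illustrative.
import java.io.InputStream;
import java.security.MessageDigest;

public class PhysicalFileCheckSketch {

    /** Returns true if the stream's recomputed digest matches the recorded checksum. */
    static boolean checksumMatches(InputStream storedBytes, String algorithm,
                                   String recordedChecksum) throws Exception {
        MessageDigest digest = MessageDigest.getInstance(algorithm); // e.g. "MD5" or "SHA-1"
        byte[] buffer = new byte[8192];
        int read;
        while ((read = storedBytes.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString().equalsIgnoreCase(recordedChecksum);
    }
}
```

A missing file would be caught even earlier, when the storage layer fails to open the stream at all.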

@scolapasta
Contributor

@landreev agreed, this is a good idea, and it can reuse the publishing-in-progress locks that file PIDs use. (It could also conceivably use the publish workflows that @michbarsinai added.) "File exists" and "checksum matches" seem like the right checks.

@djbrooke
Contributor

djbrooke commented Feb 18, 2020

I moved this to Ready. Since this is a case that would need intervention from the installation's support contact, the red alert message to contact support should be presented.

@djbrooke
Contributor

djbrooke commented Feb 26, 2020

  • Validation could take place before we initiate file PID registration (note this would introduce a new lock for those installations not using file PIDs) - see the sketch after this list
  • Should we update the messaging on the "Locks in progress..." notice?
  • The person should receive a message to contact the installation support if this check fails
  • In future cases such as TRSAs, we may only be able to validate the files that Dataverse manages
  • We should get an idea of the performance hit here (we are OK with some hit for the sake of data integrity)
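
A rough sketch of the ordering these points describe (illustrative Java only; the lock handling, PID registration, and publish steps are stand-ins, not the actual Dataverse code):

```java
// Run the validation pass under a publish-in-progress style lock, before file PID
// registration, and fail with a "contact support" message if any file is bad.
import java.util.List;
import java.util.function.Predicate;

public class PublishFlowSketch {

    // Illustrative stand-in for a datafile record (database id + recorded checksum).
    record StoredFile(long id, String recordedChecksum) {}

    static void finalizePublication(List<StoredFile> files,
                                    Predicate<StoredFile> physicalFileIsValid,
                                    Runnable registerFilePids,
                                    Runnable publish) {
        // A "publish in progress" lock would be added here; this is a new lock for
        // installations that do not register file PIDs.
        for (StoredFile f : files) {
            // Validate before any PID registration, so a broken file stops the publish.
            if (!physicalFileIsValid.test(f)) {
                throw new IllegalStateException("Validation failed for datafile " + f.id()
                        + "; please contact the installation's support contact.");
            }
        }
        registerFilePids.run(); // existing file PID registration step, unchanged
        publish.run();          // only then finalize the publication
        // The lock would be removed here.
    }
}
```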

@landreev
Contributor Author

landreev commented Apr 1, 2020

We ran into a problem creating a RestAssured failure test - specifically, how to create an invalid datafile whose checksum doesn't match its content. Since RestAssured tests run remotely, there's no direct access to the filesystem to mess with the file (delete it, overwrite it, etc.) after the datafile has been created via the API.
I've proposed to address this in #6783, since it would be easy to achieve when S3 is used for storage.
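
For illustration, with direct S3 access the test setup could overwrite the stored object in place, so that its content no longer matches the checksum recorded in the database (AWS SDK for Java v1 sketch; the bucket name and object key below are placeholders, not real values):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CorruptStoredFileForTest {
    public static void main(String[] args) {
        // Uses the default credentials/region configuration of the test environment.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // Overwrite the stored object with different bytes; the datafile's database
        // record still holds the original checksum, so validation should now fail.
        s3.putObject("my-dataverse-bucket",                 // placeholder bucket
                     "path/derived/from/storageidentifier", // placeholder key
                     "this content no longer matches the recorded checksum");
    }
}
```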

landreev added a commit that referenced this issue Apr 7, 2020
…ve failed validation, that need to be fixed or removed before the dataset can be published. (#6558)
@landreev
Contributor Author

landreev commented Apr 9, 2020

@kcondon - About the scenario you brought up yesterday - when somebody is publishing a version with nothing but a single typo correction in the metadata (and no new files added) - it really does feel wrong to go through validating every file in the dataset again.
Should I go ahead and change it to validating new (unpublished) files only?

@scolapasta
Contributor

I haven't looked at the details, but (related to what you suggest) could we only do these checks if/when files change? So even when uploading just one new file, maybe we still check them all, but we don't have to do any checks if only metadata changes. Or, similarly, do the check for any major version. (This could still mean only metadata changes, since the user can select that, but a) it may make the logic simpler, and b) it might be a good check to run in those circumstances anyway.)
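
In other words, something along the lines of this decision logic (illustrative only; the two flags would come from comparing the draft version to the last published one and from the user's version choice):

```java
public class ValidationPolicySketch {

    // Validate everything if the file set changed in this draft, or if a major
    // version is being published (which may still be a metadata-only change).
    static boolean shouldValidateAllFiles(boolean fileSetChanged, boolean isMajorVersion) {
        return fileSetChanged || isMajorVersion;
    }

    public static void main(String[] args) {
        System.out.println(shouldValidateAllFiles(false, false)); // metadata-only minor version: skip
        System.out.println(shouldValidateAllFiles(true, false));  // files added/replaced: validate all
        System.out.println(shouldValidateAllFiles(false, true));  // major version: validate all
    }
}
```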

@kcondon
Contributor

kcondon commented Apr 9, 2020

@landreev Well, your validation scenario is to confirm that the set of files was not modified in some way by an errant write operation, such as an unsuccessful delete. I was focusing more on validation of new files, so I did not consider as deeply what you were trying to detect. I suppose that, however unlikely, any write operation might impact files, so reverifying the whole set makes sense, contingent on performance.

I just wonder whether there might ultimately be a UX impact, as there is now, when updating a dataset with a large number of files but where the current changes are minimal. Currently, all file DOI metadata entries are rewritten at DataCite regardless of what has changed, and that can take a while with lots of files.

@landreev
Contributor Author

landreev commented Apr 9, 2020

... to confirm that the set of files was not modified in some way due to an errant write operation, such as an unsuccessful delete. ...

We never try to delete the physical file associated with the datafile, once it's published. At the same time, it's not entirely accurate to say that we NEVER touch the physical file once the datafile is published. We have the re-ingest and un-ingest APIs now, for example. Let me think about it some more. Maybe something along the lines of @scolapasta's suggestions - major versions only? - would be a good balance.
