6558 validate files on publish #6790

Merged
merged 30 commits into from Apr 14, 2020

Changes from all commits (30 commits)
f1f5285
the framework for the physical file validations (#655)
landreev Mar 27, 2020
f149e12
Physical file validation framework, refined; (#6558)
landreev Mar 31, 2020
ad0960c
one added TODO in the finalize command (#6558)
landreev Mar 31, 2020
9280665
renamed DatasetLock.Reason.pidRegister DatasetLock.Reason.finalizePub…
landreev Apr 1, 2020
58ac83b
documentation for validate-files-on-publish. (#6558)
landreev Apr 1, 2020
790a5e5
More changes/refinements, dedicated "validation failed" lock, etc. (#…
landreev Apr 2, 2020
fdfb767
refresh the changed error/lock message after a file validation failur…
landreev Apr 2, 2020
a3df9c5
cleaned up lock refresh and messaging. (#6558)
landreev Apr 2, 2020
d16bed5
Merge branch 'develop' into 6558-validate-files-on-publish
landreev Apr 2, 2020
7dd277d
a flyway script for purging any locks of type pidRegister.
landreev Apr 2, 2020
3d21e9d
another lock refresh tweak. (#6558)
landreev Apr 2, 2020
811ab09
A release note for #6558.
landreev Apr 2, 2020
2a6411a
final (?) info messaging mechanism for the dataset page. (#6558)
landreev Apr 2, 2020
bc6e37f
rewrote/rearranged the troubleshooting guide for invalid files. (#6558)
landreev Apr 3, 2020
31f8df0
extra lock checking in the publish command. (#6558)
landreev Apr 3, 2020
b102daf
a typo/dropped word in the release notes. (#6558)
landreev Apr 3, 2020
74badcb
typo
djbrooke Apr 3, 2020
d126a72
Merge branch 'develop' into 6558-validate-files-on-publish
landreev Apr 7, 2020
45cfa0f
Merge branch '6558-validate-files-on-publish' of https://github.com/I…
landreev Apr 7, 2020
c3fbad2
Added /admin API call that an admin can use to find the files that ha…
landreev Apr 7, 2020
9ecb82c
Extra documentation entries for the validate files across dataset adm…
landreev Apr 7, 2020
608228b
Another documentation entry, the "dataset management" section of the …
landreev Apr 7, 2020
c0340c8
removed the incorrect/confusing "API" reference
landreev Apr 9, 2020
93160c9
cosmetic/style change
landreev Apr 9, 2020
840f389
"strongly recommened"
landreev Apr 9, 2020
093ed6d
rearranged the order of operations inside FinalizeDatasetPublicationC…
landreev Apr 13, 2020
b055cad
one other rearrangment - changes where the dataset is merged, after i…
landreev Apr 14, 2020
402a624
Cleaned up messaging; removed some commented out code. #6558
landreev Apr 14, 2020
bc389b3
added an entry for the file number limit for async. handling. #6558
landreev Apr 14, 2020
a2079d2
as discussed, skipping file validation for minor version releases (#6…
landreev Apr 14, 2020
10 changes: 10 additions & 0 deletions doc/release-notes/6558-validate-files-on-publish.md
@@ -0,0 +1,10 @@
### Datafiles validation when publishing datasets

When a user requests to publish a dataset, Dataverse will now attempt to validate the physical files in the dataset by recalculating the checksums and verifying them against the values in the database. The goal is to prevent corrupted files from ending up in published datasets. Most of the instances of actual damage to physical files that we've seen in the past happened while the datafiles were still in the Draft state (physical files become essentially read-only once published), so this is the logical place to catch any such issues.

If any files in the dataset fail the validation, the dataset does not get published, and the user is notified that they need to contact their Dataverse support in order to address the issue before another attempt to publish can be made. See the "Troubleshooting" section of the Admin Guide for how to fix such problems.

For datasets with large numbers of files, this validation will be performed asynchronously, using the same mechanism as the registration of the file-level global ids. The cutoff number of files is configured by the same database setting (`:PIDAsynchRegFileCount`; default: 10). Similarly to file PID registration, this validation can be disabled on your system with the setting `:FileValidationOnPublishEnabled`. (A Dataverse admin may choose to disable it if, for example, they are already running an external auditing system to monitor the integrity of the files in their Dataverse and would prefer the publishing process to take less time.) See the Config section of the Installation Guide for more info.
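For example, to turn the validation off (this uses the standard database settings API on a default installation at localhost:8080; see the Config section of the Installation Guide for details):

    curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled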

Please note that we are not aware of any bugs in the current versions of Dataverse that would result in damage to users' files. But you may have some legacy files in your archive that were affected by some issue in the past, or perhaps by something outside Dataverse, so we are adding this feature out of an abundance of caution. An example of a problem we experienced in early versions of Dataverse was a scenario where a user attempted to delete a Draft file from an unpublished version and the database transaction failed for whatever reason, but only after the physical file had already been deleted from the filesystem - leaving a datafile entry in the dataset with the corresponding physical file missing. (The fix for this case, since the user wanted to delete the file in the first place, is simply to confirm that and purge the datafile entity from the database.)

15 changes: 13 additions & 2 deletions doc/sphinx-guides/source/admin/troubleshooting.rst
@@ -16,9 +16,20 @@ See the :doc:`/api/intro` section of the API Guide for a high level overview of
A Dataset Is Locked And Cannot Be Edited or Published
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are several types of dataset locks. Locks can be managed using the locks API, or by accessing them directly in the database. Internally, locks are maintained in the ``DatasetLock`` database table, with the field ``dataset_id`` linking them to specific datasets, and the column ``reason`` specifying the type of lock.
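For example, to inspect the locks on a given dataset (a sketch, not exact instructions: the numeric dataset id is a placeholder, and the ``psql`` line assumes your PostgreSQL database is named ``dvndb``)::

    curl -H "X-Dataverse-key:$SUPERUSER_TOKEN" http://localhost:8080/api/datasets/2347/locks
    psql dvndb -c 'SELECT id, reason, dataset_id FROM datasetlock;'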

It's normal for the ingest process described in the :doc:`/user/tabulardataingest/ingestprocess` section of the User Guide to take some time, but if hours or days have passed and the dataset is still locked, you might want to inspect the locks and consider deleting some or all of them. It is recommended to restart the application server if you are deleting an ingest lock, to make sure the ingest job is no longer running in the background. Ingest locks are identified by the label ``Ingest`` in the ``reason`` column of the ``DatasetLock`` table in the database.

A dataset is locked with a lock of type ``finalizePublication`` while the persistent identifiers for the datafiles in the dataset are registered or updated, and/or while the physical files are being validated by recalculating the checksums and verifying them against the values stored in the database, before the publication process can be completed. (Note that either of the two tasks can be disabled via database options - see :doc:`config`.) If a dataset has been in this state for a long period of time - hours or longer - it is reasonably safe to assume that it is stuck (for example, the process may have been interrupted by an application server restart or a system crash), so you may want to remove the lock (to be safe, restart the application server as well, to ensure that the job is no longer running in the background) and advise the user to try publishing again. See :doc:`dataverses-datasets` for more information on publishing.

If any files in the dataset fail the validation above, the dataset will be left locked with a ``DatasetLock.Reason=FileValidationFailed``. The user will be notified that they need to contact their Dataverse support in order to address the issue before another attempt to publish can be made. The admin will have to address and fix the underlying problems (by either restoring the missing or corrupted files, or by purging the affected files from the dataset) before deleting the lock (see the example below) and advising the user to try to publish again. The goal of the validation framework is to catch these types of conditions while the dataset is still in the Draft state.
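Once the underlying problem has been resolved, the lock can be removed via the locks API (a sketch - the dataset id and the token are placeholders; check the locks API documentation for the exact endpoint available on your version)::

    curl -H "X-Dataverse-key:$SUPERUSER_TOKEN" -X DELETE "http://localhost:8080/api/datasets/2347/locks?type=FileValidationFailed"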

During an attempt to publish a dataset, the validation will stop after encountering the first file that fails it. It is strongly recommended that the admin review and verify *all* the files in the dataset, so that every compromised file can be fixed before the lock is removed. We recommend using the ``/api/validate/dataset/files/{id}`` API for this. It will go through all the files in the specified dataset and report which ones have failed validation. See :ref:`Physical Files Validation in a Dataset <dataset-files-validation-api>` in the :doc:`/api/native-api` section of the API Guide.

The following are two real-life examples of problems that have resulted in corrupted datafiles during normal operation of Dataverse:

1. Botched file deletes - while a datafile is in Draft, attempting to delete it from the dataset involves deleting both the ``DataFile`` database table entry and the physical file. (Deleting a datafile from a *published* version merely removes it from the future versions - but keeps the file in the dataset.) The problem we've observed in early versions of Dataverse was a *partially successful* delete, where the database transaction would fail (for whatever reason), but only after the physical file had already been deleted from the filesystem - resulting in a datafile entry remaining in the dataset, but with the corresponding physical file missing. We believe we have addressed the issue that was making this condition possible, so it shouldn't happen again - but there may be a datafile in this state in your database. Assuming the user's intent was in fact to delete the file, the easiest solution is simply to confirm it and purge the datafile entity from the database. Otherwise, the file needs to be restored from backups or obtained from the user and copied back into storage.
2. Another issue we've observed: a failed tabular data ingest that leaves the datafile un-ingested, but with the physical file already replaced by the generated tab-delimited version of the data. This datafile will fail the validation because the checksum in the database matches the file in the original format (Stata, SPSS, etc.) as uploaded by the user. Luckily, this is easily reversible, since the uploaded original should be saved in your storage with the ``.orig`` extension. Simply swapping the ``.orig`` copy with the main file associated with the datafile will fix it, as sketched below. Similarly, we believe this condition should not happen again in Dataverse versions 4.20+, but you may have some legacy cases on your server.
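A hypothetical sketch of that swap (the storage directory and file names below are made-up placeholders - adjust them for your installation's actual storage layout)::

    # in the dataset's storage directory:
    cd /usr/local/dvn/files/10.5072/FK2/XXXXX
    # set aside the generated tab-delimited copy that incorrectly replaced the main file:
    mv 17a0bd12-aaa 17a0bd12-aaa.tab.bad
    # restore the uploaded original as the main physical file:
    mv 17a0bd12-aaa.orig 17a0bd12-aaa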

Someone Created Spam Datasets and I Need to Delete Them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21 changes: 19 additions & 2 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -2887,6 +2887,25 @@ Recalculate the checksum value of a datafile, by supplying the file's dat
Validate an existing check sum value against one newly calculated from the saved file::

curl -H X-Dataverse-key:$API_TOKEN -X POST $SERVER_URL/api/admin/validateDataFileHashValue/{fileId}

.. _dataset-files-validation-api:

Physical Files Validation in a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following validates all the physical files in the specified dataset, by recalculating the checksums and comparing them against the values saved in the database::

$SERVER_URL/api/admin/validate/dataset/files/{datasetId}

It will report the specific files that have failed the validation. For example::

curl http://localhost:8080/api/admin/validate/dataset/files/:persistentId/?persistentId=doi:10.5072/FK2/XXXXX
{"dataFiles": [
{"datafileId":2658,"storageIdentifier":"file://123-aaa","status":"valid"},
{"datafileId":2659,"storageIdentifier":"file://123-bbb","status":"invalid","errorMessage":"Checksum mismatch for datafile id 2669"},
{"datafileId":2659,"storageIdentifier":"file://123-ccc","status":"valid"}
]
}

This API call is only available to superusers.

@@ -2928,8 +2947,6 @@ Note that if you are attempting to validate a very large number of datasets in y

asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600



Workflows
~~~~~~~~~

26 changes: 26 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -162,6 +162,7 @@ Here are the configuration options for DOIs:
- :ref:`:IdentifierGenerationStyle <:IdentifierGenerationStyle>` (optional)
- :ref:`:DataFilePIDFormat <:DataFilePIDFormat>` (optional)
- :ref:`:FilePIDsEnabled <:FilePIDsEnabled>` (optional, defaults to true)
- :ref:`:PIDAsynchRegFileCount <:PIDAsynchRegFileCount>` (optional, defaults to 10)

Configuring Dataverse for Handles
+++++++++++++++++++++++++++++++++
@@ -1366,6 +1367,17 @@ If you don't want to register file-based PIDs for your installation, set:

Note: File-level PID registration was added in 4.9 and is required until version 4.9.3.

Note: The dataset will be locked, and the registration will be performed asynchronously, when there are more than N files in the dataset, where N is configured by the database setting ``:PIDAsynchRegFileCount`` (default: 10).

.. _:PIDAsynchRegFileCount:

:PIDAsynchRegFileCount
++++++++++++++++++++++

Configures the number of files in a dataset above which the registration of persistent identifiers (see the section above) and/or the file validation on publish will be performed asynchronously. The setting is optional; the default value is 10.

``curl -X PUT -d '100' http://localhost:8080/api/admin/settings/:PIDAsynchRegFileCount``

.. _:IndependentHandleService:

:IndependentHandleService
@@ -1376,6 +1388,20 @@ By default this setting is absent and Dataverse assumes it to be false.

``curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:IndependentHandleService``

.. _:FileValidationOnPublishEnabled:

:FileValidationOnPublishEnabled
+++++++++++++++++++++++++++++++

Toggles the validation of the physical files in a dataset when it is published, by recalculating the checksums and comparing them against the values stored in the ``DataFile`` table. By default this setting is absent and Dataverse assumes it to be true.

If you don't want the datafiles to be validated on publish, set:

``curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled``

Note: The dataset will be locked, and the validation will be performed asynchronously, similarly to how we handle assigning persistent identifiers to datafiles, when there are more than N files in the dataset, where N is configured by the database setting ``:PIDAsynchRegFileCount`` (default: 10).


:ApplicationTermsOfUse
++++++++++++++++++++++

2 changes: 2 additions & 0 deletions doc/sphinx-guides/source/user/dataset-management.rst
@@ -465,6 +465,8 @@ Publish Dataset

When you publish a dataset (available to an Admin, Curator, or any custom role which has this level of permission assigned), you make it available to the public so that other users can browse or search for it. Once your dataset is ready to go public, go to your dataset page and click on the "Publish" button on the right hand side of the page. A pop-up will appear to confirm that you are ready to actually Publish, since once a dataset is made public it can no longer be unpublished.

Before Dataverse finalizes the publication of the dataset, it will attempt to validate all the physical files in it, to make sure they are present and intact. In the unlikely event that any files fail the validation, you will see an error message informing you that the problem must be fixed by the local Dataverse Admin before the dataset can be published.

Whenever you edit your dataset, you are able to publish a new version of the dataset. The publish dataset button will reappear whenever you edit the metadata of the dataset or add a file.

Note: Prior to publishing your dataset the Data Citation will indicate that this is a draft but the "DRAFT VERSION" text
13 changes: 9 additions & 4 deletions src/main/java/edu/harvard/iq/dataverse/DatasetLock.java
@@ -60,7 +60,7 @@
public class DatasetLock implements Serializable {

public enum Reason {
/** Data being ingested */
Ingest,

/** Waits for a {@link Workflow} to end */
@@ -72,11 +72,16 @@ public enum Reason {
/** DCM (rsync) upload in progress */
DcmUpload,

/** Tasks handled by FinalizeDatasetPublicationCommand:
    Registering PIDs for DS and DFs and/or file validation */
finalizePublication,

/* Another edit is in progress */
EditInProgress,

/* Some files in the dataset failed validation */
FileValidationFailed

}

private static final long serialVersionUID = 1L;