6558 validate files on publish #6790

Merged
merged 30 commits into from Apr 14, 2020

Changes from all commits (30 commits)
f1f5285
the framework for the physical file validations (#655)
landreev Mar 27, 2020
f149e12
Physical file validation framework, refined; (#6558)
landreev Mar 31, 2020
ad0960c
one added TODO in the finalize command (#6558)
landreev Mar 31, 2020
9280665
renamed DatasetLock.Reason.pidRegister DatasetLock.Reason.finalizePub…
landreev Apr 1, 2020
58ac83b
documentation for validate-files-on-publish. (#6558)
landreev Apr 1, 2020
790a5e5
More changes/refinements, dedicated "validation failed" lock, etc. (#…
landreev Apr 2, 2020
fdfb767
refresh the changed error/lock message after a file validation failur…
landreev Apr 2, 2020
a3df9c5
cleaned up lock refresh and messaging. (#6558)
landreev Apr 2, 2020
d16bed5
Merge branch 'develop' into 6558-validate-files-on-publish
landreev Apr 2, 2020
7dd277d
a flyway script for purging any locks of type pidRegister.
landreev Apr 2, 2020
3d21e9d
another lock refresh tweak. (#6558)
landreev Apr 2, 2020
811ab09
A release note for #6558.
landreev Apr 2, 2020
2a6411a
final (?) info messaging mechanism for the dataset page. (#6558)
landreev Apr 2, 2020
bc6e37f
rewrote/rearranged the troubleshooting guide for invalid files. (#6558)
landreev Apr 3, 2020
31f8df0
extra lock checking in the publish command. (#6558)
landreev Apr 3, 2020
b102daf
a typo/dropped word in the release notes. (#6558)
landreev Apr 3, 2020
74badcb
typo
djbrooke Apr 3, 2020
d126a72
Merge branch 'develop' into 6558-validate-files-on-publish
landreev Apr 7, 2020
45cfa0f
Merge branch '6558-validate-files-on-publish' of https://github.com/I…
landreev Apr 7, 2020
c3fbad2
Added /admin API call that an admin can use to find the files that ha…
landreev Apr 7, 2020
9ecb82c
Extra documentation entries for the validate files across dataset adm…
landreev Apr 7, 2020
608228b
Another documentation entry, the "dataset management" section of the …
landreev Apr 7, 2020
c0340c8
removed the incorrect/confusing "API" reference
landreev Apr 9, 2020
93160c9
cosmetic/style change
landreev Apr 9, 2020
840f389
"strongly recommened"
landreev Apr 9, 2020
093ed6d
rearranged the order of operations inside FinalizeDatasetPublicationC…
landreev Apr 13, 2020
b055cad
one other rearrangment - changes where the dataset is merged, after i…
landreev Apr 14, 2020
402a624
Cleaned up messaging; removed some commented out code. #6558
landreev Apr 14, 2020
bc389b3
added an entry for the file number limit for async. handling. #6558
landreev Apr 14, 2020
a2079d2
as discussed, skipping file validation for minor version releases (#6…
landreev Apr 14, 2020
10 changes: 10 additions & 0 deletions doc/release-notes/6558-validate-files-on-publish.md
@@ -0,0 +1,10 @@
### Datafiles validation when publishing datasets

When a user requests to publish a dataset, Dataverse will now attempt to validate the physical files in the dataset by recalculating the checksums and verifying them against the values in the database. The goal is to prevent corrupted files from ending up in published datasets. Most of the instances of actual damage to physical files that we've seen in the past happened while the datafiles were still in the Draft state (physical files become essentially read-only once published), so this is the logical place to catch any such issues.

If any files in the dataset fail the validation, the dataset does not get published, and the user is notified that they need to contact their Dataverse support in order to address the issue before another attempt to publish can be made. See the "Troubleshooting" section of the Admin Guide for how to fix such problems.

For datasets with large numbers of files, this validation will be performed asynchronously, using the same mechanism as the registration of the file-level global ids. The cutoff number of files is configured by the same database setting (`:PIDAsynchRegFileCount`; default: 10). Similarly to file PID registration, this validation can be disabled on your system with the setting `:FileValidationOnPublishEnabled`. (A Dataverse admin may choose to disable it if, for example, they are already running an external auditing system to monitor the integrity of the files in their Dataverse and would prefer the publishing process to take less time.) See the Config section of the Installation Guide for more info.
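For example, to turn the validation off (this uses the standard database settings API on a default installation at localhost:8080; see the Config section of the Installation Guide for details):

    curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled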

Please note that we are not aware of any bugs in the current versions of Dataverse that would result in damage to users' files. But you may have some legacy files in your archive that were affected by some issue in the past, or perhaps by something outside Dataverse, so we are adding this feature out of an abundance of caution. An example of a problem we experienced in early versions of Dataverse was a scenario where a user attempted to delete a Draft file from an unpublished version and the database transaction failed for whatever reason, but only after the physical file had already been deleted from the filesystem - leaving a datafile entry in the dataset with the corresponding physical file missing. (The fix for this case, since the user wanted to delete the file in the first place, is simply to confirm that and purge the datafile entity from the database.)

15 changes: 13 additions & 2 deletions doc/sphinx-guides/source/admin/troubleshooting.rst
@@ -16,9 +16,20 @@ See the :doc:`/api/intro` section of the API Guide for a high level overview of
A Dataset Is Locked And Cannot Be Edited or Published
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are several types of dataset locks. Locks can be managed using the locks API, or by accessing them directly in the database. Internally, locks are maintained in the ``DatasetLock`` database table, with the field ``dataset_id`` linking them to specific datasets, and the column ``reason`` specifying the type of lock.
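For example, to inspect the locks on a given dataset (a sketch, not exact instructions: the numeric dataset id is a placeholder, and the ``psql`` line assumes your PostgreSQL database is named ``dvndb``)::

    curl -H "X-Dataverse-key:$SUPERUSER_TOKEN" http://localhost:8080/api/datasets/2347/locks
    psql dvndb -c 'SELECT id, reason, dataset_id FROM datasetlock;'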

It's normal for the ingest process described in the :doc:`/user/tabulardataingest/ingestprocess` section of the User Guide to take some time, but if hours or days have passed and the dataset is still locked, you might want to inspect the locks and consider deleting some or all of them. It is recommended to restart the application server if you are deleting an ingest lock, to make sure the ingest job is no longer running in the background. Ingest locks are identified by the label ``Ingest`` in the ``reason`` column of the ``DatasetLock`` table in the database.

A dataset is locked with a lock of type ``finalizePublication`` while the persistent identifiers for the datafiles in the dataset are registered or updated, and/or while the physical files are being validated by recalculating the checksums and verifying them against the values stored in the database, before the publication process can be completed. (Note that either of the two tasks can be disabled via database options - see :doc:`config`.) If a dataset has been in this state for a long period of time - hours or longer - it is reasonably safe to assume that it is stuck (for example, the process may have been interrupted by an application server restart or a system crash), so you may want to remove the lock (to be safe, restart the application server as well, to ensure that the job is no longer running in the background) and advise the user to try publishing again. See :doc:`dataverses-datasets` for more information on publishing.

If any files in the dataset fail the validation above, the dataset will be left locked with a ``DatasetLock.Reason=FileValidationFailed``. The user will be notified that they need to contact their Dataverse support in order to address the issue before another attempt to publish can be made. The admin will have to address and fix the underlying problems (by either restoring the missing or corrupted files, or by purging the affected files from the dataset) before deleting the lock (see the example below) and advising the user to try to publish again. The goal of the validation framework is to catch these types of conditions while the dataset is still in the Draft state.
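Once the underlying problem has been resolved, the lock can be removed via the locks API (a sketch - the dataset id and the token are placeholders; check the locks API documentation for the exact endpoint available on your version)::

    curl -H "X-Dataverse-key:$SUPERUSER_TOKEN" -X DELETE "http://localhost:8080/api/datasets/2347/locks?type=FileValidationFailed"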

During an attempt to publish a dataset, the validation will stop after encountering the first file that fails it. It is strongly recommended that the admin review and verify *all* the files in the dataset, so that every compromised file can be fixed before the lock is removed. We recommend using the ``/api/validate/dataset/files/{id}`` API for this. It will go through all the files in the specified dataset and report which ones have failed validation. See :ref:`Physical Files Validation in a Dataset <dataset-files-validation-api>` in the :doc:`/api/native-api` section of the API Guide.

The following are two real-life examples of problems that have resulted in corrupted datafiles during normal operation of Dataverse:

1. Botched file deletes - while a datafile is in Draft, attempting to delete it from the dataset involves deleting both the ``DataFile`` database table entry and the physical file. (Deleting a datafile from a *published* version merely removes it from the future versions - but keeps the file in the dataset.) The problem we've observed in early versions of Dataverse was a *partially successful* delete, where the database transaction would fail (for whatever reason), but only after the physical file had already been deleted from the filesystem - resulting in a datafile entry remaining in the dataset, but with the corresponding physical file missing. We believe we have addressed the issue that was making this condition possible, so it shouldn't happen again - but there may be a datafile in this state in your database. Assuming the user's intent was in fact to delete the file, the easiest solution is simply to confirm it and purge the datafile entity from the database. Otherwise, the file needs to be restored from backups or obtained from the user and copied back into storage.
2. Another issue we've observed: a failed tabular data ingest that leaves the datafile un-ingested, but with the physical file already replaced by the generated tab-delimited version of the data. This datafile will fail the validation because the checksum in the database matches the file in the original format (Stata, SPSS, etc.) as uploaded by the user. Luckily, this is easily reversible, since the uploaded original should be saved in your storage with the ``.orig`` extension. Simply swapping the ``.orig`` copy with the main file associated with the datafile will fix it, as sketched below. Similarly, we believe this condition should not happen again in Dataverse versions 4.20+, but you may have some legacy cases on your server.
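A hypothetical sketch of that swap (the storage directory and file names below are made-up placeholders - adjust them for your installation's actual storage layout)::

    # in the dataset's storage directory:
    cd /usr/local/dvn/files/10.5072/FK2/XXXXX
    # set aside the generated tab-delimited copy that incorrectly replaced the main file:
    mv 17a0bd12-aaa 17a0bd12-aaa.tab.bad
    # restore the uploaded original as the main physical file:
    mv 17a0bd12-aaa.orig 17a0bd12-aaa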

Someone Created Spam Datasets and I Need to Delete Them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21 changes: 19 additions & 2 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -2887,6 +2887,25 @@ Recalculate the checksum value of a datafile, by supplying the file's dat
Validate an existing check sum value against one newly calculated from the saved file::

curl -H X-Dataverse-key:$API_TOKEN -X POST $SERVER_URL/api/admin/validateDataFileHashValue/{fileId}

.. _dataset-files-validation-api:

Physical Files Validation in a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following validates all the physical files in the specified dataset, by recalculating the checksums and comparing them against the values saved in the database::

$SERVER_URL/api/admin/validate/dataset/files/{datasetId}

It will report the specific files that have failed the validation. For example::

curl http://localhost:8080/api/admin/validate/dataset/files/:persistentId/?persistentId=doi:10.5072/FK2/XXXXX
{"dataFiles": [
{"datafileId":2658,"storageIdentifier":"file://123-aaa","status":"valid"},
{"datafileId":2659,"storageIdentifier":"file://123-bbb","status":"invalid","errorMessage":"Checksum mismatch for datafile id 2669"},
{"datafileId":2659,"storageIdentifier":"file://123-ccc","status":"valid"}
]
}

This API call is only available to superusers.

@@ -2928,8 +2947,6 @@ Note that if you are attempting to validate a very large number of datasets in y

asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600



Workflows
~~~~~~~~~

26 changes: 26 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -162,6 +162,7 @@ Here are the configuration options for DOIs:
- :ref:`:IdentifierGenerationStyle <:IdentifierGenerationStyle>` (optional)
- :ref:`:DataFilePIDFormat <:DataFilePIDFormat>` (optional)
- :ref:`:FilePIDsEnabled <:FilePIDsEnabled>` (optional, defaults to true)
- :ref:`:PIDAsynchRegFileCount <:PIDAsynchRegFileCount>` (optional, defaults to 10)

Configuring Dataverse for Handles
+++++++++++++++++++++++++++++++++
@@ -1366,6 +1367,17 @@ If you don't want to register file-based PIDs for your installation, set:

Note: File-level PID registration was added in 4.9 and is required until version 4.9.3.

Note: The dataset will be locked, and the registration will be performed asynchronously, when there are more than N files in the dataset, where N is configured by the database setting ``:PIDAsynchRegFileCount`` (default: 10).

.. _:PIDAsynchRegFileCount:

:PIDAsynchRegFileCount
++++++++++++++++++++++

Configures the number of files in a dataset above which the registration of persistent identifiers (see the section above) and/or the file validation on publish will be performed asynchronously. The setting is optional; the default value is 10.

``curl -X PUT -d '100' http://localhost:8080/api/admin/settings/:PIDAsynchRegFileCount``

.. _:IndependentHandleService:

:IndependentHandleService
@@ -1376,6 +1388,20 @@ By default this setting is absent and Dataverse assumes it to be false.

``curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:IndependentHandleService``

.. _:FileValidationOnPublishEnabled:

:FileValidationOnPublishEnabled
+++++++++++++++++++++++++++++++

Toggles the validation of the physical files in a dataset when it is published, by recalculating the checksums and comparing them against the values stored in the ``DataFile`` table. By default this setting is absent and Dataverse assumes it to be true.

If you don't want the datafiles to be validated on publish, set:

``curl -X PUT -d 'false' http://localhost:8080/api/admin/settings/:FileValidationOnPublishEnabled``

Note: The dataset will be locked, and the validation will be performed asynchronously, similarly to how we handle assigning persistent identifiers to datafiles, when there are more than N files in the dataset, where N is configured by the database setting ``:PIDAsynchRegFileCount`` (default: 10).


:ApplicationTermsOfUse
++++++++++++++++++++++

2 changes: 2 additions & 0 deletions doc/sphinx-guides/source/user/dataset-management.rst
@@ -465,6 +465,8 @@ Publish Dataset

When you publish a dataset (available to an Admin, Curator, or any custom role which has this level of permission assigned), you make it available to the public so that other users can browse or search for it. Once your dataset is ready to go public, go to your dataset page and click on the "Publish" button on the right hand side of the page. A pop-up will appear to confirm that you are ready to actually Publish, since once a dataset is made public it can no longer be unpublished.

Before Dataverse finalizes the publication of the dataset, it will attempt to validate all the physical files in it, to make sure they are present and intact. In the unlikely event that any files fail the validation, you will see an error message informing you that the problem must be fixed by the local Dataverse Admin before the dataset can be published.

Whenever you edit your dataset, you are able to publish a new version of the dataset. The publish dataset button will reappear whenever you edit the metadata of the dataset or add a file.

Note: Prior to publishing your dataset the Data Citation will indicate that this is a draft but the "DRAFT VERSION" text
13 changes: 9 additions & 4 deletions src/main/java/edu/harvard/iq/dataverse/DatasetLock.java
@@ -60,7 +60,7 @@
public class DatasetLock implements Serializable {

public enum Reason {
/** Data being ingested */
Ingest,

/** Waits for a {@link Workflow} to end */
@@ -72,11 +72,16 @@ public enum Reason {
/** DCM (rsync) upload in progress */
DcmUpload,

/** Tasks handled by FinalizeDatasetPublicationCommand:
    Registering PIDs for DS and DFs and/or file validation */
finalizePublication,

/* Another edit is in progress */
EditInProgress,

/* Some files in the dataset failed validation */
FileValidationFailed

}

private static final long serialVersionUID = 1L;