File Import Batch job in support of rsync #3353

Closed
djbrooke opened this issue Sep 13, 2016 · 18 comments

@djbrooke
Contributor

djbrooke commented Sep 13, 2016

Version support will come in a later iteration.

@pdurbin
Member

pdurbin commented Sep 13, 2016

During the 2016-09-08 SBGrid Sprint Planning meeting ( https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit ) this issue was given an effort level of "5". The code has already been written and can be found at https://github.com/bmckinney/dataverse-canonical/tree/dcm-beta/src/main/java/edu/harvard/iq/dataverse/batch

As @djbrooke mentioned in the description, this issue does NOT cover support for file versioning when importing files. For now it's more about review of the code mentioned above, so I'm assigning this issue to myself and @scolapasta to at least get the code executing on our laptops so we can play around with it and get a deeper understanding of how it works.

@pdurbin
Member

pdurbin commented Sep 15, 2016

@bmckinney @djbrooke and I discussed this issue. I indicated that I plan to check out this branch and play around with the import code, at least executing the unit tests.

@bmckinney will be scheduling a meeting with @landreev and ideally myself and @scolapasta to spend 10 or so minutes discussing the code, leaving plenty of time to discuss the implications.

@pdurbin
Member

pdurbin commented Nov 10, 2016

@pdurbin
Member

pdurbin commented Nov 14, 2016

I'm working my way through https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests on my laptop running f2fde2e and it says to review the dataset landing page at the end. Here's how it looks:

screen shot 2016-11-14 at 3 09 27 pm

These are the surprises from my perspective:

  • no thumbnails of the images I added
  • "Unknown" instead of "JPG Image"
  • Related to "Unknown", the browser doesn't know what do to with the files (compare the imported "carolina1.jpg" to uploaded "trees.png" below), probably due to no MIME type?
  • no "cards" for files ("Files (0)" at the parent dataverse)

Dataverse doesn't know MIME type:

screen shot 2016-11-14 at 3 29 19 pm

Dataverse knows MIME type:

screen shot 2016-11-14 at 3 27 53 pm

Here's the parent dataverse:

screen shot 2016-11-14 at 3 24 04 pm

Here are the files I ran the import on:

murphy:B0KSZN pdurbin$ find . -type f -print0 | xargs -0 shasum | grep -v files.sha > files.sha 
murphy:B0KSZN pdurbin$ cat files.sha
aefc67259c6b67f7e29736ae9271b521e380d875  ./carolina/carolina1.jpg
9edb407806a8f95cff1702523b78b3768fe89f70  ./carolina/carolina2.jpg
18acf6065ee5aa4694af4ce8605ea385248f3bd4  ./superb-blue-fairy/superb-blue-fairy1.jpg
45683fec7e0f869e017b257fdeaedc8eddaecd6a  ./superb-blue-fairy/superb-blue-fairy2.jpg
murphy:B0KSZN pdurbin$ 

@pdurbin
Member

pdurbin commented Nov 15, 2016

I've been testing the code by following the instructions at https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests and I highly recommend that anyone testing or reviewing the code read that document as well as https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation

Anyway, I thought I'd go ahead and explain concretely what I've been doing, which will only make sense if you've read the "manual tests" wiki page above.

  • Create dataset (ends in "HAQMQE" for the example below) based on metadata at https://data.sbgrid.org/dataset/137/ (to get you in the mindset of the types of files we expect)
  • mkdir HAQMQE
  • cd HAQMQE
  • rsync -av rsync://data.sbgrid.org/10.15785/SBGRID/137/ .
  • curl -X GET --header 'Accept: application/json' --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?mode=MERGE'
  • Ideally at this point I'd play around with "You can poll the job status with this URL: http://localhost:8080/api/batch/job/208" but I've only used that small dataset and it's pretty quick!
  • Check that a notification was sent and refresh the dataset page to see that the files are there.
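For reference, here are the same steps as a single shell session (the dataset identifier, API token, and job id 208 are just the values from my run):

mkdir HAQMQE && cd HAQMQE
rsync -av rsync://data.sbgrid.org/10.15785/SBGRID/137/ .
# if the rsync'd data doesn't already include a manifest, generate one as in my earlier comment
find . -type f -print0 | xargs -0 shasum | grep -v files.sha > files.sha
# kick off the import
curl -X GET --header 'Accept: application/json' --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?mode=MERGE'
# poll the job status URL returned by the call above
curl 'http://localhost:8080/api/batch/job/208'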

It's very important to understand that files that are "imported" via this code do not follow the normal Dataverse rules when it comes to file handling. Specifically,

There might be other rules that are not followed that I'm not aware of, but I think these are the big ones. This shouldn't be a problem for the datasets in question hosted by SBGrid, where the data is not tabular data, Excel docs, images, PDFs, etc. For another perspective on this, please see the "General Constraints" and "Big Data Constraints" at https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation

Bugs:

  • "User is not authorized" when attempting to run the "import" API on a dataset that has been created in the root.

Now on to code review. I spent around a day looking at the code as of 02a3c30.

Answers to TODOs:

  • A superuser can use the undocumented "destroy" API endpoint to delete a released dataset via API.

Code style:

  • Rather than System.out.println, please use java.util.logging so we know which class the output is coming from.
  • Can we move MERGE, UPDATE, and REPLACE into a single enum so we can more easily find instances of these strings in the code?
  • We might want to move BatchResource.java and FileRecordJobResource from edu.harvard.iq.dataverse.batch.api to edu.harvard.iq.dataverse.api.

Questions:

  • Is the org.ow2.asm dependency actually used? Later, Bill pointed out that it's used by beanio: http://repo1.maven.org/maven2/org/beanio/beanio/2.1.0/beanio-2.1.0.pom
  • Any comments on why BeanIO was added? Bill said, "used by checksum reader to parse manifest file".
  • Should persistentUserData in ChecksumReader and StepExecutionEntity be renamed? A comment suggests that its purpose is to "report missing checksums or datafile". Bill called it "a term of art for jsr 352" and sure enough, it's in the spec.
  • Are we purposefully not documenting the new API endpoints (listed below) in the API Guide? Bill said we'll export to rst from https://github.com/sbgrid/sbgrid-dataverse/tree/feature/swagger-ui

There seem to be 5 new API endpoints:

  • /api/import/datasets/files/{doi1}/{doi2}/{doi3}
  • /api/batch/jobs
  • /api/batch/jobs/{jobName}
  • /api/batch/jobs/{jobName}/{jobId}
  • /api/batch/job/{jobId}
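For anyone poking at these, here's roughly how I'd expect them to be called (the job name and id are placeholders; parameters and auth aren't documented yet, per the question above):

curl 'http://localhost:8080/api/batch/jobs'                   # list batch jobs
curl 'http://localhost:8080/api/batch/jobs/<jobName>'         # executions of a given job
curl 'http://localhost:8080/api/batch/jobs/<jobName>/<jobId>' # a specific execution of that job
curl 'http://localhost:8080/api/batch/job/208'                # status of a single job execution (208 from my run above)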

Here's how a dataset looks on my laptop:

bigdata

@pdurbin
Member

pdurbin commented Nov 16, 2016

Great code review meeting with @bmckinney @scolapasta @djbrooke @mheppler @sekmiller @landreev @kcondon @pameyer and myself. Notes and TODOs are in this "2016-11-16 Code Review/Discussion - SBGrid" doc: https://docs.google.com/a/harvard.edu/document/d/19ug5XOzIUusAu6vqZRqfaEI0MWV5J6kqY2EtJW9R9I4/edit?usp=sharing

@djbrooke djbrooke assigned scolapasta and unassigned pdurbin and bmckinney Nov 28, 2016
@djbrooke
Contributor Author

Talked in standup today; @scolapasta is going to take a final look and then move it to QA.

@bmckinney bmckinney mentioned this issue Nov 29, 2016
@scolapasta scolapasta removed their assignment Nov 30, 2016
@djbrooke
Contributor Author

We discussed this in the sprint planning today. The tasks that were called out were:

  • Leonid to continue implementing the changes mentioned in the previous comment.
  • Danny to gather the necessary people to review the DCM questions coming out of the initial DCM testing, determine which changes are still valid, which need to be made here, and which will be handled in follow-on issues.

@djbrooke
Contributor Author

Hey @landreev - just making sure that this small change (from Kevin's comment above) is included in the #3497 PR:

-Link to dataset in batch upload email notification is not a complete URL and not resolvable:
(http://dataset.xhtml/?persistentId=doi:10.5072/FK2/I7DJGZ)

Thanks!

@landreev
Contributor

landreev commented Feb 6, 2017

(This is going into QA very shortly; I pulled it back into dev as I'm addressing the last remaining tasks: syncing the branch up to the develop branch, muting the failing RestAssured tests, and fixing the broken link, per Kevin's QA feedback.)

@landreev
Contributor

landreev commented Feb 7, 2017

I'm pulling this into QA.
@kcondon, I believe the only change to the testing process, compared to how it's described in Bill's instructions, is the new requirement that the files dropped into the dataset directory are in a folder. (Note that this folder name must be included in the relative path to the file specified in the manifest, files.sha; for example, batchuploadfolder/foo.txt instead of just foo.txt.)
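To illustrate (the file and folder names are just examples, and I'm assuming the manifest still sits in the dataset directory, above the new folder):

<dataset directory>/files.sha                    <- manifest
<dataset directory>/batchuploadfolder/foo.txt    <- files now go inside a folder
# the corresponding manifest entry includes the folder name:
<checksum>  batchuploadfolder/foo.txt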
I'll take another look at his doc in the morning and see if there's anything else that needs to be documented.

Notifications should be working, both on the user page and via email.

@landreev landreev removed their assignment Feb 7, 2017
@kcondon
Contributor

kcondon commented Feb 7, 2017

OK, a record of the changes was added to the pull request; I'm posting them here:

OK, we agreed to make a number of changes during the last code review, 5 total, addressed in my latest commit:

  • Reverted to the logic of trusting the external file copier to create the top-level folder for the files in the destination dataset directory. No extra folder is being created now. However, the final folder gets renamed to a standard Dataverse-generated name, as used for "normal" files.
  • Added an optional size parameter ("totalSize") to the batch job API call. If supplied, the long integer will be used as the size of the package Datafile. If not supplied, the import job will calculate the sum of the sizes of all the files in the folder.
  • Instead of calculating a checksum on the new line-separated list of individual checksums, the checksum of the manifest file itself is used as the final datafile checksum.
  • The functionality for the "individual file mode" (importing individual files as separate Datafiles) is preserved in the FileRecordWriter, but disabled at the API level; i.e., not available to the API user for now.
  • The MIME type for a "file package" datafile is changed to a classier "application/vnd.edu.harvard.iq.dataverse.file-package".
  • A boolean method isFilePackage() was added to DataFile.java.

Please review and see if anything's missing.

Also, Phil just found something else that I missed - Bill created some RestAssured tests that are failing, now that the import process has been changed. Let's discuss how to proceed; should we move it to QA and treat the RA tests as a separate issue? Or fix it now?

Also, the branch needs to be synced up to develop before it can be QAed.

@kcondon
Contributor

kcondon commented Feb 8, 2017

Issues I'm seeing:

  • The manifest file should be processed in the same subdirectory as the files, so relative paths work for the job and for end users.
  • The package size calculation is incorrect: it says 60MB but is really 120MB.
  • The batch log file shows errors about not finding files due to a bad path, but the import still works.
  • When paths are fixed and the job is run with "good" data, the log file still reports some errors even though it works:

INFO: Files found = 103
SEVERE: Could not move the file folder to the final destination (/usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/V09WG3//15a1f5f1843-a9fcbbef1f0f)
SEVERE: File package import failed.
SEVERE: Could not move the file folder to the final destination (/usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/V09WG3//15a1f5f184c-30f00a668c27)
SEVERE: File package import failed.

pdurbin added a commit that referenced this issue Feb 8, 2017
@landreev
Contributor

landreev commented Feb 8, 2017

Committed the code change. It fixes the issue with multiple passes (the cause of the size mismatch and of the extra "failed" messages above). Also, per our last conversation with Pete:

  • new mandatory "uploadFolder" parameter for the API call. If you dropped the uploaded files into the folder <DATASET DIRECTORY>/myupload, the parameter to add to the API call is uploadFolder=myupload
  • the manifest should go into the upload folder (NOT into the dataset directory above it);
  • the pathnames in the manifest will be specified as relative paths inside the upload folder. So if you uploaded the file <DATASET DIRECTORY>/myupload/foo.txt, it should be specified as simply "foo.txt" in the manifest.
  • the manifest will be preserved in the final package file folder.
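Putting that together, a sketch of the expected layout and the call (the folder name, dataset identifier, and exact query string, including the optional totalSize from the earlier list of changes, are examples/assumptions):

<dataset directory>/myupload/foo.txt      <- uploaded file
<dataset directory>/myupload/files.sha    <- manifest, inside the upload folder; entry reads "<checksum>  foo.txt"
curl --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?uploadFolder=myupload&totalSize=123456789'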
