File Import Batch job in support of rsync #3353

Closed
djbrooke opened this issue Sep 13, 2016 · 18 comments

@djbrooke
Contributor

djbrooke commented Sep 13, 2016

Version support will come in a later iteration.

@pdurbin
Member

pdurbin commented Sep 13, 2016

During the 2016-09-08 SBGrid Sprint Planning meeting ( https://docs.google.com/document/d/1wWSdKUOGA1L7UqFsgF3aOs8_9uyjnVpsPAxk7FObOOI/edit ) this issue was given an effort level of "5". The code has already been written and can be found at https://github.com/bmckinney/dataverse-canonical/tree/dcm-beta/src/main/java/edu/harvard/iq/dataverse/batch

As @djbrooke mentioned in the description, this issue does NOT cover support for file versioning when importing files. For now it's more about review of the code mentioned above, so I'm assigning this issue to myself and @scolapasta to at least get the code executing on our laptops so we can play around with it and get a deeper understanding of how it works.

@pdurbin
Member

pdurbin commented Sep 15, 2016

@bmckinney @djbrooke and I discussed this issue. I indicated that I plan to check out this branch and play around with the import code, at least executing the unit tests.

@bmckinney will be scheduling a meeting with @landreev and ideally myself and @scolapasta to spend 10 or so minutes discussing the code, leaving plenty of time to discuss the implications.

@pdurbin
Member

pdurbin commented Nov 10, 2016

@pdurbin
Member

pdurbin commented Nov 14, 2016

I'm working my way through https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests on my laptop running f2fde2e and it says to review the dataset landing page at the end. Here's how it looks:

screen shot 2016-11-14 at 3 09 27 pm

These are the surprises from my perspective:

  • no thumbnails of the images I added
  • "Unknown" instead of "JPG Image"
  • Related to "Unknown", the browser doesn't know what do to with the files (compare the imported "carolina1.jpg" to uploaded "trees.png" below), probably due to no MIME type?
  • no "cards" for files ("Files (0)" at the parent dataverse)

Dataverse doesn't know MIME type:

screen shot 2016-11-14 at 3 29 19 pm

Dataverse knows MIME type:

screen shot 2016-11-14 at 3 27 53 pm

Here's the parent dataverse:

screen shot 2016-11-14 at 3 24 04 pm

Here are the files I ran the import on:

murphy:B0KSZN pdurbin$ find . -type f -print0 | xargs -0 shasum | grep -v files.sha > files.sha 
murphy:B0KSZN pdurbin$ cat files.sha
aefc67259c6b67f7e29736ae9271b521e380d875  ./carolina/carolina1.jpg
9edb407806a8f95cff1702523b78b3768fe89f70  ./carolina/carolina2.jpg
18acf6065ee5aa4694af4ce8605ea385248f3bd4  ./superb-blue-fairy/superb-blue-fairy1.jpg
45683fec7e0f869e017b257fdeaedc8eddaecd6a  ./superb-blue-fairy/superb-blue-fairy2.jpg
murphy:B0KSZN pdurbin$ 

@pdurbin
Member

pdurbin commented Nov 15, 2016

I've been testing the code by following the instructions at https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests and I highly recommend that anyone testing or reviewing the code read that document as well as https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation

Anyway, I thought I'd go ahead and explain concretely what I've been doing, which will only make sense if you've read the "manual tests" wiki page above.

  • Create dataset (ends in "HAQMQE" for the example below) based on metadata at https://data.sbgrid.org/dataset/137/ (to get you in the mindset of the types of files we expect)
  • mkdir HAQMQE
  • cd HAQMQE
  • rsync -av rsync://data.sbgrid.org/10.15785/SBGRID/137/ .
  • curl -X GET --header 'Accept: application/json' --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?mode=MERGE'
  • Ideally at this point I'd play around with "You can poll the job status with this URL: http://localhost:8080/api/batch/job/208" but I've only used that small dataset and it's pretty quick!
  • Check that a notification was sent and refresh the dataset page to see that the files are there.
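For reference, here are the same steps as a single shell session (the dataset identifier, API token, and job id 208 are just the values from my run):

mkdir HAQMQE && cd HAQMQE
rsync -av rsync://data.sbgrid.org/10.15785/SBGRID/137/ .
# if the rsync'd data doesn't already include a manifest, generate one as in my earlier comment
find . -type f -print0 | xargs -0 shasum | grep -v files.sha > files.sha
# kick off the import
curl -X GET --header 'Accept: application/json' --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?mode=MERGE'
# poll the job status URL returned by the call above
curl 'http://localhost:8080/api/batch/job/208'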

It's very important to understand that files that are "imported" via this code do not follow the normal Dataverse rules when it comes to file handling. Specifically,

There might be other rules that are not followed that I'm not aware of, but I think these are the big ones. This shouldn't be a problem for the datasets in question hosted by SBGrid, where the data is not tabular data, Excel docs, images, PDFs, etc. For another perspective on this, please see the "General Constraints" and "Big Data Constraints" at https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation

Bugs:

  • "User is not authorized" when attempting to run the "import" API on a dataset that has been created in the root.

Now on to code review. I spent around a day looking at the code as of 02a3c30.

Answers to TODOs:

  • A superuser can use the undocumented "destroy" API endpoint to delete a released dataset via API.

Code style:

  • Rather than System.out.println, please use java.util.logging so we know which class the output is coming from.
  • Can we move MERGE, UPDATE, and REPLACE into a single enum so we can more easily find instances of these strings in the code?
  • We might want to move BatchResource.java and FileRecordJobResource from edu.harvard.iq.dataverse.batch.api to edu.harvard.iq.dataverse.api.

Questions:

  • Is the org.ow2.asm dependency actually used? Later, Bill pointed out that it's used by beanio: http://repo1.maven.org/maven2/org/beanio/beanio/2.1.0/beanio-2.1.0.pom
  • Any comments on why BeanIO was added? Bill said, "used by checksum reader to parse manifest file".
  • Should persistentUserData in ChecksumReader and StepExecutionEntity be renamed? A comment suggests that its purpose is to "report missing checksums or datafile". Bill called it "a term of art for jsr 352" and sure enough, it's in the spec.
  • Are we purposefully not documenting the new API endpoints (listed below) in the API Guide? Bill said we'll export to rst from https://github.com/sbgrid/sbgrid-dataverse/tree/feature/swagger-ui

There seem to be 5 new API endpoints:

  • /api/import/datasets/files/{doi1}/{doi2}/{doi3}
  • /api/batch/jobs
  • /api/batch/jobs/{jobName}
  • /api/batch/jobs/{jobName}/{jobId}
  • /api/batch/job/{jobId}
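For anyone poking at these, here's roughly how I'd expect them to be called (the job name and id are placeholders; parameters and auth aren't documented yet, per the question above):

curl 'http://localhost:8080/api/batch/jobs'                   # list batch jobs
curl 'http://localhost:8080/api/batch/jobs/<jobName>'         # executions of a given job
curl 'http://localhost:8080/api/batch/jobs/<jobName>/<jobId>' # a specific execution of that job
curl 'http://localhost:8080/api/batch/job/208'                # status of a single job execution (208 from my run above)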

Here's how a dataset looks on my laptop:

bigdata

@pdurbin
Member

pdurbin commented Nov 16, 2016

Great code review meeting with @bmckinney @scolapasta @djbrooke @mheppler @sekmiller @landreev @kcondon @pameyer and myself. Notes and TODOs are in this "2016-11-16 Code Review/Discussion - SBGrid" doc: https://docs.google.com/a/harvard.edu/document/d/19ug5XOzIUusAu6vqZRqfaEI0MWV5J6kqY2EtJW9R9I4/edit?usp=sharing

@djbrooke djbrooke assigned scolapasta and unassigned pdurbin and bmckinney Nov 28, 2016
@djbrooke
Contributor Author

Talked in standup today; @scolapasta is going to take a final look and then move it to QA.

@bmckinney bmckinney mentioned this issue Nov 29, 2016
@scolapasta scolapasta removed their assignment Nov 30, 2016
@djbrooke
Contributor Author

We discussed this in the sprint planning today. The tasks that were called out were:

  • Leonid to continue implementing the changes mentioned in the previous comment.
  • Danny to gather the necessary people to review the DCM questions coming out of the initial DCM testing, determine which changes are still valid, which need to be made here, and which will be handled in follow-on issues.

@djbrooke
Contributor Author

Hey @landreev - just making sure that this small change (from Kevin's comment above) is included in the #3497 PR:

-Link to dataset in batch upload email notification is not a complete URL and not resolvable:
(http://dataset.xhtml/?persistentId=doi:10.5072/FK2/I7DJGZ)

Thanks!

@landreev
Contributor

landreev commented Feb 6, 2017

(This is going into QA very shortly; I pulled it back into dev as I'm addressing the last remaining tasks: syncing the branch up to the develop branch, muting the failing RestAssured tests, and fixing the broken link, per Kevin's QA feedback.)

@landreev
Contributor

landreev commented Feb 7, 2017

I'm pulling this into QA.
@kcondon, I believe the only change to the testing process, compared to how it's described in Bill's instructions, is the new requirement that the files dropped into the dataset directory are in a folder. (Note that this folder name must be included in the relative path to the file specified in the manifest, files.sha; for example, batchuploadfolder/foo.txt instead of just foo.txt.)
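To illustrate (the file and folder names are just examples, and I'm assuming the manifest still sits in the dataset directory, above the new folder):

<dataset directory>/files.sha                    <- manifest
<dataset directory>/batchuploadfolder/foo.txt    <- files now go inside a folder
# the corresponding manifest entry includes the folder name:
<checksum>  batchuploadfolder/foo.txt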
I'll take another look at his doc in the morning and see if there's anything else that needs to be documented.

Notifications should be working, both on the user page and via email.

@landreev landreev removed their assignment Feb 7, 2017
@kcondon
Contributor

kcondon commented Feb 7, 2017

OK, a record of the changes was added to the pull request; I'm posting them here:

OK, we agreed to make a number of changes during the last code review, 5 total, addressed in my latest commit:

  • Reverted to the logic of trusting the external file copier to create the top-level folder for the files in the destination dataset directory. No extra folder is being created now. However, the final folder gets renamed to a standard Dataverse-generated name, as used for "normal" files.
  • Added an optional size parameter ("totalSize") to the batch job API call. If supplied, the long integer will be used as the size of the package Datafile. If not supplied, the import job will calculate the sum of the sizes of all the files in the folder.
  • Instead of calculating a checksum on the new line-separated list of individual checksums, the checksum of the manifest file itself is used as the final datafile checksum.
  • The functionality for the "individual file mode" (importing individual files as separate Datafiles) is preserved in the FileRecordWriter, but disabled at the API level; i.e., not available to the API user for now.
  • The MIME type for a "file package" datafile is changed to a classier "application/vnd.edu.harvard.iq.dataverse.file-package".
  • A boolean method isFilePackage() was added to DataFile.java.

Please review and see if anything's missing.

Also, Phil just found something else that I missed - Bill created some RestAssured tests that are failing, now that the import process has been changed. Let's discuss how to proceed; should we move it to QA and treat the RA tests as a separate issue? Or fix it now?

Also, the branch needs to be synced up to develop before it can be QAed.

@kcondon
Contributor

kcondon commented Feb 8, 2017

Issues I'm seeing:

  • The manifest file should be processed in the same subdirectory as the files, so relative paths work for the job and for end users.
  • The package size calculation is incorrect: it says 60MB but is really 120MB.
  • The batch log file shows errors about not finding files due to a bad path, but the import still works.
  • When paths are fixed and the job is run with "good" data, the log file still reports some errors even though it works:

INFO: Files found = 103
SEVERE: Could not move the file folder to the final destination (/usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/V09WG3//15a1f5f1843-a9fcbbef1f0f)
SEVERE: File package import failed.
SEVERE: Could not move the file folder to the final destination (/usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/V09WG3//15a1f5f184c-30f00a668c27)
SEVERE: File package import failed.

pdurbin added a commit that referenced this issue Feb 8, 2017
@landreev
Contributor

landreev commented Feb 8, 2017

Committed the code change. It fixes the issue with multiple passes (the cause of the size mismatch and of the extra "failed" messages above). Also, per our last conversation with Pete:

  • new mandatory "uploadFolder" parameter for the API call. If you dropped the uploaded files into the folder <DATASET DIRECTORY>/myupload, the parameter to add to the API call is uploadFolder=myupload
  • the manifest should go into the upload folder (NOT into the dataset directory above it);
  • the pathnames in the manifest will be specified as relative paths inside the upload folder. So if you uploaded the file <DATASET DIRECTORY>/myupload/foo.txt, it should be specified as simply "foo.txt" in the manifest.
  • the manifest will be preserved in the final package file folder.
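Putting that together, a sketch of the expected layout and the call (the folder name, dataset identifier, and exact query string, including the optional totalSize from the earlier list of changes, are examples/assumptions):

<dataset directory>/myupload/foo.txt      <- uploaded file
<dataset directory>/myupload/files.sha    <- manifest, inside the upload folder; entry reads "<checksum>  foo.txt"
curl --header "X-Dataverse-key: $API_TOKEN" 'http://localhost:8080/api/import/datasets/files/10.5072/FK2/HAQMQE?uploadFolder=myupload&totalSize=123456789'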
