DCM S3 #4703

Closed

pameyer opened this issue May 22, 2018 · 31 comments

Comments

@pameyer
Contributor

pameyer commented May 22, 2018

The DCM (Data Capture Module, for big data upload) integration with Dataverse currently assumes POSIX storage (and ssh+rsync data transfer); however, using object stores (AWS S3, OpenStack Swift, etc.) is becoming more common.

Open technical design questions:

  • S3 (and Swift?) are sometimes considered transfer protocols in addition to storage protocols. Should an S3 DCM support these as data transfer protocols, or only as storage?
  • The DCM design assumes that having client-side checksums calculated without direct user intervention is essential to "big data depositions", and that these checksums should be propagated from deposition to publication (and data file replication, etc.). How do other disciplines view this trade-off (implementation complexity vs data integrity)?

The lowest-complexity way of implementing this would be a second DCM implementation (or a configuration option for the current DCM), changing only how data files are transferred from the temporary upload location to Dataverse-accessible storage (aka - an internal copy from temporary POSIX storage to the Dataverse S3/Swift dataset buckets) and keeping the existing ssh+rsync transfer and client-side checksums. A moderately higher-complexity option would involve changes to the existing approach for client-side checksums and data transfers to support non-Unix operating systems.
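
For concreteness, a minimal sketch of that low-complexity option in Python: everything up to the last line is what the DCM already does, and only the destination of the final copy changes. The manifest name, bucket, and key prefix are illustrative assumptions, not anything the DCM specifies.

```python
# A minimal sketch of the low-complexity option: keep the existing ssh+rsync
# transfer and client-side checksum manifest, and change only the final step
# so that verified files land in S3 instead of POSIX storage.
import hashlib
import os

import boto3


def verify_and_copy(hold_dir, bucket, prefix, manifest="files.sha"):
    """Re-verify the client-side checksums, then copy each file to S3."""
    s3 = boto3.client("s3")
    with open(os.path.join(hold_dir, manifest)) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, rel_path = line.split(maxsplit=1)
            rel_path = rel_path.strip()
            local_path = os.path.join(hold_dir, rel_path)
            # Recompute the checksum so the value propagated to publication
            # matches what the client calculated at deposit time.
            sha = hashlib.sha1()
            with open(local_path, "rb") as data:
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    sha.update(chunk)
            if sha.hexdigest() != expected:
                raise ValueError(f"checksum mismatch for {rel_path}")
            s3.upload_file(local_path, bucket, f"{prefix}/{rel_path}")


# e.g. verify_and_copy("/hold/ABC123", "dataverse-files", "10.5072/FK2/ABC123")
```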

@pameyer
Contributor Author

pameyer commented May 23, 2018

Notes from discussion:

  • support S3/swift as transfer protocols too? no
  • support swift storage too? no, separate issue
  • go with the low-complexity approach (aka - change the scanner to use S3 commands for /upload -> /hold); see the sketch after this list
  • @landreev pointed out that there could be issues with the batch import (from the DCM success API); the current mapping of /hold to $GLASSFISH_DIR/glassfish/domains/domain1/files may need review / possible revisiting.
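
A minimal sketch of the agreed scanner change, assuming the AWS CLI is available on the DCM host (it is installed in the docker-dcm container later in this thread); the bucket and prefix names are illustrative:

```python
# Sketch of the scanner change agreed above: after validation, push the
# upload directory to the S3 bucket with the AWS CLI rather than moving it
# to a POSIX /hold directory.
import subprocess


def promote_upload_to_s3(upload_dir, bucket, prefix):
    """Recursively copy a validated upload directory into the DCM S3 bucket."""
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", upload_dir, f"s3://{bucket}/{prefix}/"],
        check=True,  # fail loudly so the deposit is not reported as successful
    )


# e.g. promote_upload_to_s3("/deposit/ABC123", "dcm-s3", "ABC123")
```

`aws s3 sync` would work equally well here and has the advantage of skipping already-transferred files on retry.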

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jun 18, 2018

My understanding of the work needed for the story:

--- see new post below ---

@pameyer
Contributor Author

pameyer commented Jun 18, 2018

One open question: should updating the existing DCM Jenkins tests to include S3 be added to the list?

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 12, 2018

After discussing with @pameyer, my understanding is that this work does not include functionality to allow downloading of files uploaded via DCM S3. The end result of this work will be that the file-bundle is uploaded to a structured S3 bucket that Dataverse is aware of. This will lay the foundation for the download functionality later enabled by RSAL.
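
Purely as illustration, here is what listing a "structured" bucket like that could look like; the bucket name and the dataset-identifier key scheme below are assumptions, not the shipped layout:

```python
# Hypothetical view of a dataset's prefix in the DCM bucket.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="dcm-s3", Prefix="10.5072/FK2/ABC123/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
# hypothetical output:
#   10.5072/FK2/ABC123/files.sha     512
#   10.5072/FK2/ABC123/data/run1.mtz 104857600
```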

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 12, 2018

Here is some important info about Dataverse's batch file system jobs: https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation / https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests . This issue is also relevant: #3353

Posting these here for future developers as well as myself. The functionality around this needs to at least be understood when implementing an S3 version of the import.

Talking with @landreev, we were unsure whether this S3 functionality should live as a new batch job or whether it could somehow be rolled into our existing storageIO code. If possible, the latter seems simpler from a code maintenance perspective.

@matthew-a-dunlap
Contributor

Another facet: should we consider creating an actual package (e.g. tar/zip) to store on S3? AWS has per-file charges for getting/moving files around in buckets, and since we aren't providing the ability to download individual files in a DCM dataset anyway, maybe we should actually zip things up?
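
A sketch of that packaging idea, assuming tarfile plus boto3 and illustrative bucket/key names: one archive object per bundle trades individual-file access (which DCM datasets don't offer anyway) for fewer per-request charges.

```python
# Store one tar archive per file-bundle instead of many small objects.
import tarfile

import boto3


def upload_as_package(bundle_dir, bucket, key):
    """Tar the whole bundle and upload it as a single S3 object."""
    archive = bundle_dir.rstrip("/") + ".tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(bundle_dir, arcname=".")  # keep paths relative inside the archive
    # upload_file handles multipart transfers for large archives automatically
    boto3.client("s3").upload_file(archive, bucket, key)


# e.g. upload_as_package("/hold/ABC123", "dcm-s3", "ABC123/package.tar")
```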

matthew-a-dunlap pushed a commit that referenced this issue Jul 17, 2018
Import code still needed so Dataverse knows of the new files
matthew-a-dunlap pushed a commit that referenced this issue Jul 19, 2018
Cleanup needed but flow works
matthew-a-dunlap pushed a commit that referenced this issue Jul 19, 2018
@pameyer
Contributor Author

pameyer commented Aug 14, 2018

Ran through some initial testing with DCM 0.2 (POSIX) and docker-dcm; used dataverseAdmin to create datasets with UI and API. Locking still seems to be behaving. Engineered checksum failures result in 2x notifications to dataverseAdmin; checksum success messages result in 3x notifications to dataverseAdmin (some of this may be due to either configuration or user error).

@pameyer
Contributor Author

pameyer commented Aug 20, 2018

With more realistic test accounts (aka - not dataverseAdmin), the duplicate checksum messages still occur.

@pameyer
Contributor Author

pameyer commented Aug 20, 2018

Ran into a non-intuitive failure mode during testing: it was possible to "half-create" a dataset, where a given dataset would show up when browsing (because it was present in Solr), but clicking on the dataset gave a 404 (because it was not present in Postgres). Lighttpd logs showed requests from Dataverse for sr.py but not ur.py.

Initial investigation suggested a possible relation to Dataverse role assignments. More in-depth debugging showed that the root cause was an incorrectly configured DCM (specifically, adapting an Ansible role designed to deploy from a git pull to the new RPM installation had produced an incorrect CGI configuration, resulting in the DCM returning the source code of ur.py instead of the specified {"status":"OK"}). Dataverse was checking the HTTP status code (which was 200), but not the contents of the result.

Short version - Dataverse should probably be more rigorous about what it expects to receive from integration systems, but the "happy path" is not impacted.
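
A sketch of that more rigorous handling: treat the DCM callout as successful only when the HTTP status is 200 *and* the body parses as the expected JSON. The URL and payload are illustrative.

```python
# Accept the DCM response only if both the status code and the body check out,
# so a broken CGI setup that echoes ur.py's source is caught early.
import requests


def call_dcm(url, payload):
    resp = requests.post(url, json=payload, timeout=30)
    resp.raise_for_status()  # rejects non-2xx responses
    try:
        body = resp.json()
    except ValueError:
        raise RuntimeError(f"DCM returned a non-JSON body: {resp.text[:200]!r}")
    if body.get("status") != "OK":
        raise RuntimeError(f"DCM reported a problem: {body}")
    return body
```

The misconfigured-CGI case described above would fail at `resp.json()` instead of being silently accepted.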

@pdurbin
Member

pdurbin commented Aug 21, 2018

@pameyer and I just chatted about my "bad news" test in DatasetsIT, where I construct some JSON with badNews.add("status", "validation failed"). I showed that notifications seemed to be working OK: the superuser got a single notification, and the author got a single notification. Pete is going to dig in and see if he can help me replicate the bug.
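
The real test lives in DatasetsIT (Java); the same idea sketched in Python is to report a checksum-validation failure to Dataverse the way the DCM would, then check that each user gets exactly one notification. The endpoint path, dataset id, and API token below are assumptions for illustration.

```python
# Simulate the DCM's "bad news" callback to Dataverse.
import requests

bad_news = {"status": "validation failed"}
resp = requests.post(
    "http://localhost:8080/api/datasets/42/dataCaptureModule/checksumValidation",
    json=bad_news,
    headers={"X-Dataverse-key": "SUPERUSER-API-TOKEN"},
)
print(resp.status_code, resp.text)
```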

@pameyer
Contributor Author

pameyer commented Aug 21, 2018

On 4703-dcm-s3-2-2f91725, the only duplicate notifications (checksum success or checksum failure) I'm seeing are to administrative users (not the normal test users). So it's either fixed, or it was user/configuration error to begin with.

@pdurbin
Member

pdurbin commented Aug 21, 2018

@pameyer ok. I take it there's nothing for me to fix, then. Thanks for testing! I'm taking myself off this issue.

@pdurbin pdurbin removed their assignment Aug 21, 2018
@kcondon
Contributor

kcondon commented Aug 22, 2018

Found 3 issues:

  1. The AWS CLI should be installed as part of DCM setup.
  2. Why 2 buckets? The config instructions indicate a dataverse bucket and a dcm bucket, but files only appear in the dataverse bucket.
  3. The failure notification has a different style than the others; verified with Mike that the icon should be the dataset icon, not the dataverse icon, and that the dataset is usually displayed as a linked dataset title rather than an unlinked DOI.
     (screenshot: Screen Shot 2018-08-22 at 6.09.35 PM.png)

@pameyer
Contributor Author

pameyer commented Aug 23, 2018

As of 8d702eb, the AWS CLI is installed into the DCM container for docker-dcm.

@kcondon kcondon removed their assignment Aug 24, 2018
pdurbin added a commit that referenced this issue Aug 24, 2018
Previously, the icon for a dataverse was used.
@pdurbin
Member

pdurbin commented Aug 24, 2018

For the checksum fail notification, I fixed the icon in 23ba120 and changed the DOI to the dataset title with a link in c5e6298. @djbrooke looked over my shoulder.

@pdurbin
Member

pdurbin commented Aug 24, 2018

I can tell from the code that the second bucket (dataverse.files.dcm-s3-bucket-name) is used and is not cruft, but I don't feel confident in my ability to document it properly. I added a FIXME in 5649d05 and a little text that should be improved upon by someone who wrote the code or someone who is testing that the S3 support for the DCM works.

@pdurbin
Member

pdurbin commented Aug 24, 2018

@kcondon and I have been talking for over an hour about this issue and drew this:

(diagram: dcms3)

I just made the additional change we agreed to in d359f73, where I tried to explain a little more in the dev guide about what the new "dcm-s3" bucket is for. I also linked back to this issue because it turns out there's useful stuff in here, but the GitHub web interface hid it from me because there are so many comments. I added a FIXME to document all of this stuff properly in the guides some day. It's extremely complex. Diagrams would be helpful as well.
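
Since the guides are still thin here, one plausible reading of the two-bucket flow in the diagram (an assumption on my part, not documented behavior): the DCM drops verified bundles into the "dcm-s3" bucket, and Dataverse's import then copies them server-side into the main files bucket it serves datasets from. Bucket and key names below are illustrative.

```python
# Hypothetical server-side copy from the DCM staging bucket into the
# main Dataverse files bucket during import.
import boto3

s3 = boto3.client("s3")
s3.copy_object(
    Bucket="dataverse-files",  # main files bucket (illustrative name)
    Key="10.5072/FK2/ABC123/package.tar",
    CopySource={"Bucket": "dcm-s3", "Key": "ABC123/package.tar"},
)
```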

Moving to QA, mostly to test the checksum notification improvements I made today and mentioned above.
