DCM S3 #4703

Closed

pameyer opened this issue May 22, 2018 · 31 comments

Comments

@pameyer
Contributor

pameyer commented May 22, 2018

The DCM (Data Capture Module, for big data upload) integration with Dataverse currently assumes POSIX storage (and ssh+rsync data transfer); however, using object stores (AWS S3, OpenStack Swift, etc.) is becoming more common.

Open technical design questions:

  • S3 (and Swift?) are sometimes considered transfer protocols in addition to storage protocols. Should an S3 DCM support these as data transfer protocols, or only as storage?
  • The DCM design assumes that having client-side checksums calculated without direct user intervention is essential to "big data depositions", and that these checksums should be propagated from deposition to publication (and data file replication, etc.). How do other disciplines view this trade-off (implementation complexity vs data integrity)?

The lowest-complexity way of implementing this would be a second DCM implementation (or a configuration option for the current DCM), changing only how data files are transferred from the temporary upload location to Dataverse-accessible storage (aka - an internal copy from temporary POSIX storage to the Dataverse S3/Swift dataset buckets) and keeping the existing ssh+rsync transfer and client-side checksums. A moderately higher-complexity option would involve changes to the existing approach for client-side checksums and data transfers to support non-Unix operating systems.
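
For concreteness, a minimal sketch of that low-complexity option in Python: everything up to the last line is what the DCM already does, and only the destination of the final copy changes. The manifest name, bucket, and key prefix are illustrative assumptions, not anything the DCM specifies.

```python
# A minimal sketch of the low-complexity option: keep the existing ssh+rsync
# transfer and client-side checksum manifest, and change only the final step
# so that verified files land in S3 instead of POSIX storage.
import hashlib
import os

import boto3


def verify_and_copy(hold_dir, bucket, prefix, manifest="files.sha"):
    """Re-verify the client-side checksums, then copy each file to S3."""
    s3 = boto3.client("s3")
    with open(os.path.join(hold_dir, manifest)) as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, rel_path = line.split(maxsplit=1)
            rel_path = rel_path.strip()
            local_path = os.path.join(hold_dir, rel_path)
            # Recompute the checksum so the value propagated to publication
            # matches what the client calculated at deposit time.
            sha = hashlib.sha1()
            with open(local_path, "rb") as data:
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    sha.update(chunk)
            if sha.hexdigest() != expected:
                raise ValueError(f"checksum mismatch for {rel_path}")
            s3.upload_file(local_path, bucket, f"{prefix}/{rel_path}")


# e.g. verify_and_copy("/hold/ABC123", "dataverse-files", "10.5072/FK2/ABC123")
```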

@pameyer
Contributor Author

pameyer commented May 23, 2018

Notes from discussion:

  • support S3/swift as transfer protocols too? no
  • support swift storage too? no, separate issue
  • go with the low-complexity approach (aka - change the scanner to use S3 commands for /upload -> /hold); see the sketch after this list
  • @landreev pointed out that there could be issues with the batch import (from the DCM success API); the current mapping of /hold to $GLASSFISH_DIR/glassfish/domains/domain1/files may need review / possible revisiting.
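
A minimal sketch of the agreed scanner change, assuming the AWS CLI is available on the DCM host (it is installed in the docker-dcm container later in this thread); the bucket and prefix names are illustrative:

```python
# Sketch of the scanner change agreed above: after validation, push the
# upload directory to the S3 bucket with the AWS CLI rather than moving it
# to a POSIX /hold directory.
import subprocess


def promote_upload_to_s3(upload_dir, bucket, prefix):
    """Recursively copy a validated upload directory into the DCM S3 bucket."""
    subprocess.run(
        ["aws", "s3", "cp", "--recursive", upload_dir, f"s3://{bucket}/{prefix}/"],
        check=True,  # fail loudly so the deposit is not reported as successful
    )


# e.g. promote_upload_to_s3("/deposit/ABC123", "dcm-s3", "ABC123")
```

`aws s3 sync` would work equally well here and has the advantage of skipping already-transferred files on retry.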

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jun 18, 2018

My understanding of the work needed for the story:

--- see new post below ---

@pameyer
Contributor Author

pameyer commented Jun 18, 2018

One open question: should updating the existing DCM Jenkins tests to include S3 be added to the list?

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 12, 2018

After discussing with @pameyer, my understanding is that this work does not include functionality to allow downloading of files uploaded via DCM S3. The end result of this work will be that the file-bundle is uploaded to a structured S3 bucket that Dataverse is aware of. This will lay the foundation for the download functionality later enabled by RSAL.
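
Purely as illustration, here is what listing a "structured" bucket like that could look like; the bucket name and the dataset-identifier key scheme below are assumptions, not the shipped layout:

```python
# Hypothetical view of a dataset's prefix in the DCM bucket.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="dcm-s3", Prefix="10.5072/FK2/ABC123/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
# hypothetical output:
#   10.5072/FK2/ABC123/files.sha     512
#   10.5072/FK2/ABC123/data/run1.mtz 104857600
```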

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Jul 12, 2018

Here is some important info about Dataverse's batch file system jobs: https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation / https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests . This issue is also relevant: #3353

Posting these here for future developers as well as myself. The functionality around this needs to at least be understood when implementing an S3 version of the import.

Talking with @landreev, we were unsure whether this S3 functionality should live as a new batch job or whether it could somehow be rolled into our existing storageIO code. If possible, the latter seems simpler from a code maintenance perspective.

@matthew-a-dunlap
Contributor

Another facet: should we consider creating an actual package (e.g. tar/zip) to store on S3? AWS has per-file charges for getting/moving files around in buckets, and since we aren't providing the ability to download individual files in a DCM dataset anyway, maybe we should actually zip things up?
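
A sketch of that packaging idea, assuming tarfile plus boto3 and illustrative bucket/key names: one archive object per bundle trades individual-file access (which DCM datasets don't offer anyway) for fewer per-request charges.

```python
# Store one tar archive per file-bundle instead of many small objects.
import tarfile

import boto3


def upload_as_package(bundle_dir, bucket, key):
    """Tar the whole bundle and upload it as a single S3 object."""
    archive = bundle_dir.rstrip("/") + ".tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(bundle_dir, arcname=".")  # keep paths relative inside the archive
    # upload_file handles multipart transfers for large archives automatically
    boto3.client("s3").upload_file(archive, bucket, key)


# e.g. upload_as_package("/hold/ABC123", "dcm-s3", "ABC123/package.tar")
```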

matthew-a-dunlap pushed a commit that referenced this issue Jul 17, 2018
Import code still needed so Dataverse knows of the new files
matthew-a-dunlap pushed a commit that referenced this issue Jul 19, 2018
Cleanup needed but flow works
matthew-a-dunlap pushed a commit that referenced this issue Jul 19, 2018
@pameyer
Contributor Author

pameyer commented Aug 14, 2018

Ran through some initial testing with DCM 0.2 (POSIX) and docker-dcm; used dataverseAdmin to create datasets with UI and API. Locking still seems to be behaving. Engineered checksum failures result in 2x notifications to dataverseAdmin; checksum success messages result in 3x notifications to dataverseAdmin (some of this may be due to either configuration or user error).

@pameyer
Contributor Author

pameyer commented Aug 20, 2018

With more realistic test accounts (aka - not dataverseAdmin), the duplicate checksum messages still occur.

@pameyer
Contributor Author

pameyer commented Aug 20, 2018

Ran into a non-intuitive failure mode during testing: it was possible to "half-create" a dataset, where a given dataset would show up when browsing (because it was present in Solr), but clicking on the dataset gave a 404 (because it was not present in Postgres). Lighttpd logs showed requests from Dataverse for sr.py but not ur.py.

Initial investigation suggested a possible relation to Dataverse role assignments. More in-depth debugging showed that the root cause was an incorrectly configured DCM (specifically, adapting an Ansible role designed to deploy from a git pull to the new RPM installation had produced an incorrect CGI configuration, resulting in the DCM returning the source code of ur.py instead of the specified {"status":"OK"}). Dataverse was checking the HTTP status code (which was 200), but not the contents of the result.

Short version - Dataverse should probably be more rigorous about what it expects to receive from integration systems, but the "happy path" is not impacted.
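
A sketch of that more rigorous handling: treat the DCM callout as successful only when the HTTP status is 200 *and* the body parses as the expected JSON. The URL and payload are illustrative.

```python
# Accept the DCM response only if both the status code and the body check out,
# so a broken CGI setup that echoes ur.py's source is caught early.
import requests


def call_dcm(url, payload):
    resp = requests.post(url, json=payload, timeout=30)
    resp.raise_for_status()  # rejects non-2xx responses
    try:
        body = resp.json()
    except ValueError:
        raise RuntimeError(f"DCM returned a non-JSON body: {resp.text[:200]!r}")
    if body.get("status") != "OK":
        raise RuntimeError(f"DCM reported a problem: {body}")
    return body
```

The misconfigured-CGI case described above would fail at `resp.json()` instead of being silently accepted.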

@pdurbin
Member

pdurbin commented Aug 21, 2018

@pameyer and I just chatted about my "bad news" test in DatasetsIT, where I construct some JSON with badNews.add("status", "validation failed"). I showed that notifications seemed to be working OK: the superuser got a single notification, and the author got a single notification. Pete is going to dig in and see if he can help me replicate the bug.
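
The real test lives in DatasetsIT (Java); the same idea sketched in Python is to report a checksum-validation failure to Dataverse the way the DCM would, then check that each user gets exactly one notification. The endpoint path, dataset id, and API token below are assumptions for illustration.

```python
# Simulate the DCM's "bad news" callback to Dataverse.
import requests

bad_news = {"status": "validation failed"}
resp = requests.post(
    "http://localhost:8080/api/datasets/42/dataCaptureModule/checksumValidation",
    json=bad_news,
    headers={"X-Dataverse-key": "SUPERUSER-API-TOKEN"},
)
print(resp.status_code, resp.text)
```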

@pameyer
Contributor Author

pameyer commented Aug 21, 2018

On 4703-dcm-s3-2-2f91725, the only duplicate notifications (checksum success or checksum failure) I'm seeing are to administrative users (not the normal test users). So it's either fixed, or it was user/configuration error to begin with.

@pdurbin
Member

pdurbin commented Aug 21, 2018

@pameyer ok. I take it there's nothing for me to fix, then. Thanks for testing! I'm taking myself off this issue.

@pdurbin pdurbin removed their assignment Aug 21, 2018
@kcondon
Contributor

kcondon commented Aug 22, 2018

Found 3 issues:

  1. The AWS CLI should be installed as part of DCM setup.
  2. Why 2 buckets? The config instructions indicate a dataverse bucket and a dcm bucket, but files only appear in the dataverse bucket.
  3. The failure notification has a different style than the others; verified with Mike that the icon should be the dataset icon, not the dataverse icon, and that the dataset is usually displayed as a linked dataset title rather than an unlinked DOI.
     (screenshot: Screen Shot 2018-08-22 at 6.09.35 PM.png)

@pameyer
Contributor Author

pameyer commented Aug 23, 2018

As of 8d702eb, the AWS CLI is installed into the DCM container for docker-dcm.

@kcondon kcondon removed their assignment Aug 24, 2018
pdurbin added a commit that referenced this issue Aug 24, 2018
Previously, the icon for a dataverse was used.
@pdurbin
Member

pdurbin commented Aug 24, 2018

For the checksum fail notification, I fixed the icon in 23ba120 and changed the DOI to the dataset title with a link in c5e6298. @djbrooke looked over my shoulder.

@pdurbin
Member

pdurbin commented Aug 24, 2018

I can tell from the code that the second bucket (dataverse.files.dcm-s3-bucket-name) is used and is not cruft, but I don't feel confident in my ability to document it properly. I added a FIXME in 5649d05 and a little text that should be improved upon by someone who wrote the code or someone who is testing that the S3 support for the DCM works.

@pdurbin
Member

pdurbin commented Aug 24, 2018

@kcondon and I have been talking for over an hour about this issue and drew this:

(diagram: dcms3)

I just made the additional change we agreed to in d359f73, where I tried to explain a little more in the dev guide about what the new "dcm-s3" bucket is for. I also linked back to this issue because it turns out there's useful stuff in here, but the GitHub web interface hid it from me because there are so many comments. I added a FIXME to document all of this stuff properly in the guides some day. It's extremely complex. Diagrams would be helpful as well.
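
Since the guides are still thin here, one plausible reading of the two-bucket flow in the diagram (an assumption on my part, not documented behavior): the DCM drops verified bundles into the "dcm-s3" bucket, and Dataverse's import then copies them server-side into the main files bucket it serves datasets from. Bucket and key names below are illustrative.

```python
# Hypothetical server-side copy from the DCM staging bucket into the
# main Dataverse files bucket during import.
import boto3

s3 = boto3.client("s3")
s3.copy_object(
    Bucket="dataverse-files",  # main files bucket (illustrative name)
    Key="10.5072/FK2/ABC123/package.tar",
    CopySource={"Bucket": "dcm-s3", "Key": "ABC123/package.tar"},
)
```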

Moving to QA, mostly to test the checksum notification improvements I made today and mentioned above.
