DCM S3 #4703
Notes from discussion:
My understanding of the work needed for the story: --- see new post below ---
One open question: should updating the existing DCM Jenkins tests to include S3 be added to the list?
After discussing with @pameyer, my understanding is that this work does not include functionality for downloading files uploaded via DCM S3. The end result of this work will be that the file bundle is uploaded to a structured S3 bucket that Dataverse is aware of. This lays the foundation for the download functionality later enabled in rsal.
Here is some important info about Dataverse's batch file system jobs: https://github.com/sbgrid/sbgrid-dataverse/wiki/INFO:-Batch-Job-Implementation and https://github.com/sbgrid/sbgrid-dataverse/wiki/HOWTO:-Batch-Job-Testing#3-manual-tests . Issue #3353 is also relevant. Posting these here for future developers as well as myself. The functionality around this needs to at least be understood when implementing an S3 version of the import. Talking with @landreev, we were unsure whether this S3 functionality should live as a new batch job or whether it could somehow be rolled into our existing storageIO code. If possible, the latter seems simpler from a code-maintenance perspective.
Another facet: should we consider creating an actual package (e.g. tar/zip) to store on S3? AWS has per-file charges for getting/moving files around in buckets, and if we aren't providing the ability to download individual files in the DCM dataset anyway, maybe we should consider actually zipping things up?
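To make the per-file-charge tradeoff concrete, here is a minimal sketch of the packaging idea, assuming a plain upload directory on the DCM host; the function name, paths, and bucket layout in the comment are illustrative, not actual DCM code or conventions:

```python
import tarfile
from pathlib import Path

def bundle_upload(upload_dir: str, archive_path: str) -> Path:
    """Pack everything in upload_dir into one gzipped tarball.

    Storing one archive instead of N loose files means one S3
    PUT/GET instead of N, which is the cost concern raised above.
    """
    archive = Path(archive_path)
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths relative so the bundle extracts cleanly
        tar.add(upload_dir, arcname=Path(upload_dir).name)
    return archive

# The resulting archive could then be moved to S3 in a single call, e.g.:
#   aws s3 cp bundle.tar.gz s3://<dcm-bucket>/<dataset-id>/
# (hypothetical bucket layout, not the actual DCM convention)
```

The downside, of course, is that any later per-file download support (e.g. via rsal) would have to unpack or range-read the archive.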
Import code still needed so Dataverse knows of the new files
Cleanup needed but flow works
Ran through some initial testing with DCM 0.2 (POSIX) and docker-dcm; used dataverseAdmin to create datasets with the UI and the API. Locking still seems to be behaving. Engineered checksum failures result in 2x notifications to dataverseAdmin; checksum success messages result in 3x notifications to dataverseAdmin (some of this may be due to either configuration or user error).
With more realistic test accounts (aka not dataverseAdmin), the duplicate messages for checksum still occur.
Ran into a non-intuitive failure mode during testing: it was possible to "half-create" a dataset, where a given dataset would show up when browsing (because it was present in Solr), but clicking on the dataset gave a 404 (because it was not present in Postgres). Lighttpd logs showed requests from Dataverse for … Initial investigation suggested a possible relation with Dataverse role assignments. More in-depth debugging showed that the root cause was an incorrectly configured DCM (specifically, adapting an ansible role designed to deploy from …). Short version: Dataverse should probably be more rigorous in what it expects to receive from integration systems, but the "happy path" is not impacted.
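A quick way to surface the "half-created" state described above is to diff the dataset identifiers the Solr index returns against those the database knows about. The helper below is hypothetical (in practice the two ID lists would come from Solr's search API and Dataverse's native API); it just shows the set comparison:

```python
def find_half_created(solr_ids, db_ids):
    """Return dataset IDs present in the Solr search index but missing
    from the database. Such datasets render in browse results yet 404
    on click -- the failure mode described above.
    """
    return sorted(set(solr_ids) - set(db_ids))

# Example (hypothetical persistent IDs):
# find_half_created(["doi:10.5072/FK2/AAA", "doi:10.5072/FK2/BBB"],
#                   ["doi:10.5072/FK2/AAA"])
# -> ["doi:10.5072/FK2/BBB"]
```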
@pameyer and I just chatted about my "bad news" test in DatasetsIT where I construct some JSON with …
@pameyer ok. I take it there's nothing for me to fix, then. Thanks for testing! I'm taking myself off this issue. |
As of 8d702eb, the AWS CLI is installed into the DCM container for docker-dcm.
Previously, the icon for a dataverse was used.
I can tell from the code that the second bucket (…)
@kcondon and I have been talking for over an hour about this issue and drew this: The additional change we agreed to I just made in d359f73, where I tried to explain a little more in the dev guide what the new "dcm-s3" bucket is for. I also linked back to this issue because it turns out there's useful stuff in here, but the GitHub web interface hid it from me because there are so many comments. I added a FIXME to document all of this stuff properly in the guides some day. It's extremely complex. Diagrams would be helpful as well. Moving to QA, mostly to test the checksum notification improvements I made today and mentioned above.
The DCM (data capture module, for big data upload) currently integrates with Dataverse assuming POSIX storage (and ssh+rsync data transfer); however, using object stores (AWS S3, OpenStack Swift, etc.) is becoming more common.
Open technical design questions:
The lowest-complexity way of implementing this would be a second DCM implementation (or a configuration option for the current DCM), changing only how data files are transferred from the temporary upload location to the dataverse-accessible storage (aka an internal copy from temporary POSIX storage to DV S3/Swift dataset buckets) and keeping the existing ssh+rsync and client-side checksums. Moderately higher complexity would involve changes to the existing approach for client-side checksums and data transfers to support non-Unix operating systems.
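The "internal copy" step in the lowest-complexity option above could look roughly like this sketch: verify the client-side checksum of a staged file, then move it into dataset storage. Function names and the manifest shape are assumptions, not actual DCM code; a real S3 variant would replace the local copy with an object-store put (e.g. boto3's `upload_file`), but the verify-then-copy shape is the part being illustrated:

```python
import hashlib
import shutil
from pathlib import Path

def copy_with_checksum(src: str, dest_dir: str, expected_md5: str) -> Path:
    """Copy one staged file into dataset storage, first verifying the
    client-supplied checksum (as the existing ssh+rsync flow relies on).
    """
    digest = hashlib.md5(Path(src).read_bytes()).hexdigest()
    if digest != expected_md5:
        # Surfacing this is what triggers the checksum-failure
        # notifications exercised in the testing notes above.
        raise ValueError(f"checksum mismatch for {src}: got {digest}")
    dest = Path(dest_dir) / Path(src).name
    # For S3/Swift, this copy would instead be an object-store upload.
    shutil.copy2(src, dest)
    return dest
```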