5401 dcmdoc #5620

Merged: 2 commits, Mar 12, 2019

Changes from all commits
85 changes: 18 additions & 67 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -16,7 +16,7 @@ Data Capture Module (DCM) is an experimental component that allows users to uplo
Install a DCM
~~~~~~~~~~~~~

Installation instructions can be found at https://github.com/sbgrid/data-capture-module/blob/master/doc/installation.md. Note that shared storage (posix or AWS S3) between Dataverse and your DCM is required. You cannot use a DCM with Swift at this point in time.
Member: As part of this effort, should we move "big data support" from the dev guide to some other guide?

Contributor Author: That seems like it's out of scope for this one.

Member: Ok, thanks.


.. FIXME: Explain what ``dataverse.files.dcm-s3-bucket-name`` is for and what it has to do with ``dataverse.files.s3-bucket-name``.

@@ -57,31 +57,8 @@ Here's the syntax for sending the JSON.
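
The lines showing that call are elided above; as a hedged sketch, the success/failure JSON can be POSTed to Dataverse along these lines (the ``checksumValidation`` endpoint and query parameters are assumptions based on the rest of this guide, and ``$API_TOKEN`` and the DOI are placeholders)::

  # report checksum validation results for a dataset to Dataverse
  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  curl -H "Content-type: application/json" -X POST \
    -d @checksumValidationSuccess.json \
    "http://localhost:8080/api/datasets/:persistentId/dataCaptureModule/checksumValidation?persistentId=doi:10.5072/FK2/EXAMPLE&key=$API_TOKEN"
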
Steps to set up a DCM mock for Development
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See instructions at https://github.com/sbgrid/data-capture-module/blob/master/doc/mock.md. In short, you will install Flask and run the mock as follows.

Download and run the mock. You will be cloning the https://github.com/sbgrid/data-capture-module repo.

- ``git clone git://github.com/sbgrid/data-capture-module.git``
- ``cd data-capture-module/api``
- ``./dev_mock.sh``

If you see an error about not having Flask installed, install it as explained below.

On Mac, you can install Flask with:

- ``mkvirtualenv mockdcm``
- ``pip install -r requirements-mock.txt``

On Ubuntu/Debian, you can install Flask with:

- ``sudo apt install python-pip`` (will install python as well)
- ``pip install flask``

Once you have Flask installed, try running the dev mock script again:

- ``./dev_mock.sh``

This should spin up the DCM mock on port 5000.
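
As a quick smoke test that the mock is listening (the mock's real routes are defined in the repo's ``api`` directory, so the root path here is just an assumption; any HTTP response means the server is up)::

  # any response, even a 404, shows the mock is listening on port 5000
  curl -i http://localhost:5000/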

Add Dataverse settings to use mock (same as using DCM, noted above):

@@ -96,6 +73,8 @@ At this point you should be able to download a placeholder rsync script. Dataver
- ``echo "hello" > file1.txt``
- ``shasum file1.txt > files.sha``
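Putting those pieces together, a hedged sketch of staging a fake upload; ``DEPOSIT_DIR`` stands in for wherever your setup expects upload folders, and ``X1METO`` is the example identifier used below::

  # stage a fake rsync upload: one file plus its checksum manifest
  DEPOSIT_DIR=/path/to/deposit    # assumption: adjust to your storage location
  mkdir -p "$DEPOSIT_DIR/X1METO"
  cd "$DEPOSIT_DIR/X1METO"
  echo "hello" > file1.txt
  shasum file1.txt > files.sha
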
Now the files are in place and you need to send JSON to Dataverse with a success or failure message as described above. Make a copy of ``doc/sphinx-guides/source/_static/installation/files/root/big-data-support/checksumValidationSuccess.json`` and put the identifier in place (such as "X1METO" under "uploadFolder"). Then use curl as described above to send the JSON.
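
For reference, a sketch of what the edited copy might contain (the field names come from the surrounding text; the exact contents of the shipped file may differ)::

  # write a success message for upload folder X1METO
  cat > checksumValidationSuccess.json <<'EOF'
  {
    "status": "validation passed",
    "uploadFolder": "X1METO",
    "totalSize": 9
  }
  EOF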

Troubleshooting
@@ -113,36 +92,17 @@ If you need a fully operating DCM client for development purposes, these steps w
Docker Image Set-up
^^^^^^^^^^^^^^^^^^^

See https://github.com/IQSS/dataverse/blob/develop/conf/docker-dcm/readme.txt

- Install docker if you do not have it
- Follow these steps (extracted from ``docker-aio/readme.md`` & ``docker-dcm/readme.txt``) :

- ``cd conf/docker-aio`` and run ``./0prep_deps.sh`` to create Glassfish and Solr tarballs in conf/docker-aio/dv/deps.
- Run ``./1prep.sh``
- Build the docker image: ``docker build -t dv0 -f c7.dockerfile .``
- ``cd ../docker-dcm`` and run ``./0prep.sh``
- Build dcm/dv0dcm images with docker-compose: ``docker-compose -f docker-compose.yml build``
- Start containers: ``docker-compose -f docker-compose.yml up -d``
- Wait for the container to show "healthy" (via ``docker ps``), then wait another 5 minutes (even though it shows healthy, Glassfish is still standing itself up). Then run the Dataverse app installation: ``docker exec -it dvsrv /opt/dv/install.bash``
- Configure Dataverse application to use DCM (run from outside the container): ``docker exec -it dvsrv /opt/dv/configure_dcm.sh``
- The Dataverse installation is accessible at ``http://localhost:8084``.
- You may need to change the DoiProvider inside dvsrv (EZID does not work):

- ``curl -X DELETE -d EZID "localhost:8080/api/admin/settings/:DoiProvider"``
- ``curl -X PUT -d DataCite "localhost:8080/api/admin/settings/:DoiProvider"``
- Also change ``doi.baseUrlString``, ``doi.username``, and ``doi.password``, as sketched below
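
A hedged sketch of that change, assuming the Glassfish location visible in the logs below and the ``doi.*`` JVM option names used in this era (verify both against your installation guide; the DataCite values are placeholders)::

  # switch DOI settings inside the dvsrv container
  # note the escaped colon, which asadmin requires
  docker exec -it dvsrv /opt/glassfish4/bin/asadmin create-jvm-options "-Ddoi.baseUrlString=https\://mds.test.datacite.org/mds"
  docker exec -it dvsrv /opt/glassfish4/bin/asadmin create-jvm-options "-Ddoi.username=YOUR_USERNAME"
  docker exec -it dvsrv /opt/glassfish4/bin/asadmin create-jvm-options "-Ddoi.password=YOUR_PASSWORD"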

Optional steps for setting up the S3 Docker DCM Variant
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Before starting: the default bucket for the DCM to hold files in S3 is named ``test-dcm``. It is hard-coded in ``post_upload_s3.bash`` (line 30); change it to a different bucket if needed.
- Add AWS bucket info to dcmsrv

- You need a credentials file in ~/.aws

- ``mkdir ~/.aws``
- ``yum install nano`` (or use a different editor below)
- ``nano ~/.aws/credentials`` and add these contents with your keys:

- ``[default]``
- ``aws_access_key_id =``
- ``aws_secret_access_key =``
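
If you prefer a non-interactive equivalent of the ``nano`` step, a small sketch (the keys are placeholders)::

  # create the AWS credentials file in one shot
  mkdir -p ~/.aws
  cat > ~/.aws/credentials <<'EOF'
  [default]
  aws_access_key_id = YOUR_ACCESS_KEY_ID
  aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
  EOF
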
@@ -156,16 +116,12 @@
- ``./asadmin create-jvm-options "\-Ddataverse.files.storage-driver-id=s3"``
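
Relatedly, the bucket-name JVM options can be set the same way; a sketch only, since (per the FIXME earlier in the file) the exact semantics of these two options are not yet documented (``test-dcm`` is the DCM's default bucket)::

  # point Dataverse and the DCM at the S3 bucket
  ./asadmin create-jvm-options "\-Ddataverse.files.s3-bucket-name=test-dcm"
  ./asadmin create-jvm-options "\-Ddataverse.files.dcm-s3-bucket-name=test-dcm"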

- Add AWS bucket info to Dataverse

- ``mkdir ~/.aws``
- ``yum install nano`` (or use a different editor below)
- ``nano ~/.aws/credentials`` and add these contents with your keys:

- ``[default]``
- ``aws_access_key_id =``
- ``aws_secret_access_key =``

- Also: ``nano ~/.aws/config`` to create a region file, and add these contents:

- ``[default]``
- ``region = us-east-1``
@@ -188,38 +144,33 @@ Using the DCM Docker Containers
To use these commands, you will need to connect to a shell prompt inside the various containers (e.g. ``docker exec -it dvsrv /bin/bash``). An end-to-end sketch of the workflow follows the list below.

- Create a dataset and download rsync upload script
- Upload script to dcm_client (if needed, you can probably do all the actions for create/download inside dcm_client)

- ``docker cp ~/Downloads/upload-FK2_NN49YM.bash dcm_client:/tmp``
- connect to client container: ``docker exec -it dcm_client bash``
- create dataset: ``cd /mnt ; ./create.bash`` ; this will echo the database ID to stdout
- download transfer script: ``./get_transfer.bash $database_id_from_create_script``
- execute the transfer script: ``bash ./upload-${database_id_from_create_script}.bash``, and follow the instructions from the script.

- Create a folder of files to upload (files can be empty)
- Run script

- e.g. ``bash ./upload-FK2_NN49YM.bash``
- e.g. ``bash ./upload-3.bash`` (``3`` being the database id from earlier commands in this example).

- Manually run the post-upload script on dcmsrv

- for posix implementation: ``docker exec -it dcmsrv /opt/dcm/scn/post_upload.bash``
- for S3 implementation: ``docker exec -it dcmsrv /opt/dcm/scn/post_upload_s3.bash``
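
Putting the workflow together, a hedged sketch run from the host (the database id ``3`` mirrors the example above; use whatever ``create.bash`` actually echoes)::

  # end-to-end: create dataset, fetch transfer script, upload, post-process
  docker exec -it dcm_client bash -c "cd /mnt && ./create.bash"         # note the echoed database id
  docker exec -it dcm_client bash -c "cd /mnt && ./get_transfer.bash 3"
  docker exec -it dcm_client bash -c "cd /mnt && bash ./upload-3.bash"  # follow the script's prompts
  docker exec -it dcmsrv /opt/dcm/scn/post_upload.bash                  # or post_upload_s3.bash for S3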

Additional DCM docker development tips
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- You can completely blow away all the docker images and containers with these commands (including non-DCM ones!)

- ``docker stop dvsrv``
- ``docker stop dcm_client``
- ``docker stop dcmsrv``
- ``docker rm $(docker ps -a -q)``
- ``docker rmi $(docker images -q)``
- ``docker-compose -f docker-compose.yml down -v``

- There are a few logs to tail

- dvsrv : ``tail -n 2000 -f /opt/glassfish4/glassfish/domains/domain1/logs/server.log``
- dcmsrv : ``tail -n 2000 -f /var/log/lighttpd/breakage.log``
- dcmsrv : ``tail -n 2000 -f /var/log/lighttpd/access.log``

- Note that by default a docker container will stop running if the process it is following exits; for example, Flask in dcmsrv. You can get around this by having the followed script never exit (e.g. ``sleep infinity``; see the sketch after this list): https://stackoverflow.com/questions/31870222/how-can-i-keep-container-running-on-kubernetes
- You may have to restart the Glassfish domain occasionally to deal with memory filling up. If deployment is getting really slow, it's a good time to restart.
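
A minimal sketch of that keep-alive trick as a container entrypoint (``start_service.sh`` is a hypothetical launcher, not a real DCM script)::

  #!/bin/bash
  # start the real service in the background, then block forever so the
  # container stays up even if the service exits
  /opt/dcm/start_service.sh &   # hypothetical; replace with your service
  sleep infinity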

Repository Storage Abstraction Layer (RSAL)