4703 dcm s3 2 #4946

Merged · 17 commits · Aug 24, 2018
1 change: 1 addition & 0 deletions conf/docker-aio/install.bash
@@ -3,6 +3,7 @@ sudo -u postgres createuser --superuser dvnapp
#./entrypoint.bash &
unzip dvinstall.zip
cd dvinstall/
echo "beginning installer"
./install -admin_email=dvAdmin@mailinator.com -y -f > install.out 2> install.err

echo "installer complete"
5 changes: 3 additions & 2 deletions conf/docker-dcm/configure_dcm.sh
@@ -4,6 +4,7 @@ echo "dcm configs on dv side to be done"

# in homage to dataverse traditions, reset to insecure "burrito" admin API key
sudo -u postgres psql -c "update apitoken set tokenstring='burrito' where id=1;" dvndb
sudo -u postgres psql -c "update authenticateduser set superuser='t' where id=1;" dvndb

# dataverse configs for DCM
curl -X PUT -d "SHA-1" "http://localhost:8080/api/admin/settings/:FileFixityChecksumAlgorithm"
Expand All @@ -17,8 +18,8 @@ curl -X PUT "http://localhost:8080/api/admin/settings/:DownloadMethods" -d "rsal
curl -X POST -H "X-Dataverse-key: burrito" "http://localhost:8080/api/dataverses/root/actions/:publish"

# symlink `hold` volume
mkdir -p /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/
ln -s /hold /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2
mkdir -p /usr/local/glassfish4/glassfish/domains/domain1/files/
ln -s /hold /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072

# need to set siteUrl
cd /usr/local/glassfish4
1 change: 1 addition & 0 deletions conf/docker-dcm/dcmsrv.dockerfile
@@ -7,6 +7,7 @@ COPY bashrc /root/.bashrc
COPY test_install.sh /root/
RUN yum localinstall -y /tmp/${RPMFILE}
RUN pip install -r /opt/dcm/requirements.txt
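# install the AWS CLI for the DCM's S3 transfer workflow (pinned version; presumably used by its upload/validation scripts)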
RUN pip install awscli==1.15.75
RUN /root/test_install.sh
COPY rq-init-d /etc/init.d/rq
RUN useradd glassfish
10 changes: 5 additions & 5 deletions conf/docker-dcm/readme.txt
@@ -2,17 +2,17 @@ This docker-compose setup is intended for use in development, small scale evalua

Setup:

- build docker-aio image with name dv0 as described in ../docker-aio` (don't run setupIT.bash)
- work in the `conf/docker-dcm` directory for below commmands
- build docker-aio image with name dv0 as described in ../docker-aio` (don't start up the docker image or run setupIT.bash)
- work in the `conf/docker-dcm` directory for below commands
- download/prepare dependencies: `./0prep.sh`
- build dcm/dv0dcm images with docker-compose: `docker-compose -f docker-compose.yml build`
- start containers: `docker-compose -f docker-compose.yml up -d`
- wait for container to show "healthy" (aka - `docker ps`), then run dataverse app installation: `docker exec -it dvsrv /opt/dv/install.bash`
- wait for container to show "healthy" (aka - `docker ps`), then wait another 4-5 minutes (even though it shows healthy, glassfish is still standing itself up), then run dataverse app installation: `docker exec -it dvsrv /opt/dv/install.bash`
- configure dataverse application to use DCM: `docker exec -it dvsrv /opt/dv/configure_dcm.sh`

Operation:
The dataverse installation is accessable at `http://localhost:8084`.
The `dcm_client` container is intended to be used for executing transfer transfer scripts, and `conf/docker-dcm` is available at `/mnt` inside the container; this container can be accessed with `docker exec -it dcm_client bash`.
The dataverse installation is accessible at `http://localhost:8084`.
The `dcm_client` container is intended to be used for executing transfer scripts, and `conf/docker-dcm` is available at `/mnt` inside the container; this container can be accessed with `docker exec -it dcm_client bash`.
The DCM cron job is NOT configured here; for development purposes the DCM checks can be run manually with `docker exec -it dcmsrv /opt/dcm/scn/post_upload.bash`.
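As a quick reference, the bring-up sequence described above boils down to the following sketch (container names `dvsrv`, `dcm_client`, and `dcmsrv` as used in this compose setup):

  cd conf/docker-dcm
  ./0prep.sh
  docker-compose -f docker-compose.yml build
  docker-compose -f docker-compose.yml up -d
  docker ps                                      # wait for "healthy", then give glassfish a few extra minutes
  docker exec -it dvsrv /opt/dv/install.bash
  docker exec -it dvsrv /opt/dv/configure_dcm.sh
  docker exec -it dcm_client bash                # shell for running transfer scripts from /mnt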


8 changes: 6 additions & 2 deletions doc/sphinx-guides/source/developers/big-data-support.rst
@@ -16,7 +16,11 @@ Data Capture Module (DCM) is an experimental component that allows users to uplo
Install a DCM
~~~~~~~~~~~~~

Installation instructions can be found at https://github.com/sbgrid/data-capture-module . Note that a shared filesystem between Dataverse and your DCM is required. You cannot use a DCM with non-filesystem storage options such as Swift.
Installation instructions can be found at https://github.com/sbgrid/data-capture-module . Note that a shared filesystem (posix or AWS S3) between Dataverse and your DCM is required. You cannot use a DCM with Swift at this point in time.

Please note that S3 support for DCM is highly experimental. Files can be uploaded to S3 but they cannot be downloaded until https://github.com/IQSS/dataverse/issues/4949 is worked on. If you want to play around with S3 support for DCM, you must configure a JVM option called ``dataverse.files.dcm-s3-bucket-name`` which is a holding area for uploaded files that have not yet passed checksum validation. Search for that JVM option at https://github.com/IQSS/dataverse/issues/4703 for commands on setting that JVM option and related setup. Note that because that GitHub issue has so many comments you will need to click "Load more" where it says "hidden items". FIXME: Document all of this properly.

. FIXME: Explain what ``dataverse.files.dcm-s3-bucket-name`` is for and what it has to do with ``dataverse.files.s3-bucket-name``.
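A minimal sketch of setting that JVM option with Glassfish's ``asadmin`` (the bucket name below is a placeholder; see the GitHub issue above for the exact commands used in this setup):

- ``./asadmin create-jvm-options "-Ddataverse.files.dcm-s3-bucket-name=your-dcm-holding-bucket"``
- ``./asadmin restart-domain``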

Once you have installed a DCM, you will need to configure two database settings on the Dataverse side. These settings are documented in the :doc:`/installation/config` section of the Installation Guide:

@@ -87,7 +91,7 @@ Add Dataverse settings to use mock (same as using DCM, noted above):
- ``curl http://localhost:8080/api/admin/settings/:DataCaptureModuleUrl -X PUT -d "http://localhost:5000"``
- ``curl http://localhost:8080/api/admin/settings/:UploadMethods -X PUT -d "dcm/rsync+ssh"``

At this point you should be able to download a placeholder rsync script. Dataverse is then waiting for new from the DCM about if checksum validation has succeeded or not. First, you have to put files in place, which is usually the job of the DCM. You should substitute "X1METO" for the "identifier" of the dataset you create. You must also use the proper path for where you store files in your dev environment.
At this point you should be able to download a placeholder rsync script. Dataverse is then waiting for news from the DCM about if checksum validation has succeeded or not. First, you have to put files in place, which is usually the job of the DCM. You should substitute "X1METO" for the "identifier" of the dataset you create. You must also use the proper path for where you store files in your dev environment.

- ``mkdir /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/X1METO``
- ``mkdir /usr/local/glassfish4/glassfish/domains/domain1/files/10.5072/FK2/X1METO/X1METO``
2 changes: 1 addition & 1 deletion src/main/java/Bundle.properties
@@ -188,7 +188,7 @@ notification.access.granted.fileDownloader.additionalDataset={0} You now have ac
notification.access.revoked.dataverse=You have been removed from a role in {0}.
notification.access.revoked.dataset=You have been removed from a role in {0}.
notification.access.revoked.datafile=You have been removed from a role in {0}.
notification.checksumfail=One or more files in your upload failed checksum validation for dataset {0}. Please re-run the upload script. If the problem persists, please contact support.
notification.checksumfail=One or more files in your upload failed checksum validation for dataset <a href="/dataset.xhtml?persistentId={0}" title="{1}">{1}</a>. Please re-run the upload script. If the problem persists, please contact support.
notification.mail.import.filesystem=Dataset {2} ({0}/dataset.xhtml?persistentId={1}) has been successfully uploaded and verified.
notification.import.filesystem=Dataset <a href="/dataset.xhtml?persistentId={0}" title="{1}">{1}</a> has been successfully uploaded and verified.
notification.import.checksum=<a href="/dataset.xhtml?persistentId={0}" title="{1}">{1}</a>, dataset had file checksums added via a batch job.
1 change: 1 addition & 0 deletions src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -1516,6 +1516,7 @@ private String init(boolean initFull) {
setHasRsyncScript(true);
setRsyncScript(scriptRequestResponse.getScript());
rsyncScriptFilename = "upload-"+ workingVersion.getDataset().getIdentifier() + ".bash";
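// the dataset identifier can contain a slash (e.g. a DOI shoulder like "FK2/ABC123"); replace it so the script filename is safe to download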
rsyncScriptFilename = rsyncScriptFilename.replace("/", "_");
}
else{
setHasRsyncScript(false);
216 changes: 216 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/S3PackageImporter.java
@@ -0,0 +1,216 @@
/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package edu.harvard.iq.dataverse;

import com.amazonaws.AmazonClientException;
import com.amazonaws.SdkClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CopyObjectRequest;
import com.amazonaws.services.s3.model.DeleteObjectRequest;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import edu.harvard.iq.dataverse.api.AbstractApiBean;
import edu.harvard.iq.dataverse.batch.jobs.importer.filesystem.FileRecordWriter;
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
import edu.harvard.iq.dataverse.engine.command.exception.IllegalCommandException;
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
import edu.harvard.iq.dataverse.util.FileUtil;
import static edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder.jsonObjectBuilder;
import java.io.IOException;
import java.io.InputStream;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.inject.Named;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;

/**
* This class is for importing files added to S3 outside of Dataverse.
* Specifically, it is intended to be used alongside the DCM (Data Capture Module).
* Most of this code has been ported from FileRecordWriter, pruning out
* the incomplete sections for importing individual files instead of folder-packages.
* @author matthew
*/

@Named
@Stateless
public class S3PackageImporter extends AbstractApiBean implements java.io.Serializable{

private static final Logger logger = Logger.getLogger(S3PackageImporter.class.getName());

private AmazonS3 s3 = null;

@EJB
DataFileServiceBean dataFileServiceBean;

@EJB
EjbDataverseEngine commandEngine;

public void copyFromS3(Dataset dataset, String s3ImportPath) throws IOException {
try {
s3 = AmazonS3ClientBuilder.standard().defaultClient();
} catch (Exception e) {
throw new AmazonClientException(
"Cannot instantiate an S3 client; check your AWS credentials and region",
e);
}

JsonObjectBuilder bld = jsonObjectBuilder();

String fileMode = FileRecordWriter.FILE_MODE_PACKAGE_FILE;

String dcmBucketName = System.getProperty("dataverse.files.dcm-s3-bucket-name");
String dcmDatasetKey = s3ImportPath;
String dvBucketName = System.getProperty("dataverse.files.s3-bucket-name");

String dvDatasetKey = getS3DatasetKey(dataset);

logger.log(Level.INFO, "S3 Import related attributes. dcmBucketName: {0} | dcmDatasetKey: {1} | dvBucketName: {2} | dvDatasetKey: {3} |",
new Object[]{dcmBucketName, dcmDatasetKey, dvBucketName, dvDatasetKey});

if (dataset.getVersions().size() != 1) {
String error = "Error in S3 package import for dataset with ID: " + dataset.getId() + " - Dataset has more than one version.";
logger.info(error);
throw new IllegalStateException(error);
}

if (dataset.getLatestVersion().getVersionState() != DatasetVersion.VersionState.DRAFT) {
String error = "Error in S3 package import for dataset with ID: " + dataset.getId() + " - Dataset isn't in DRAFT mode.";
logger.info(error);
throw new IllegalStateException(error);
}

ListObjectsRequest req = new ListObjectsRequest().withBucketName(dcmBucketName).withPrefix(dcmDatasetKey);
ObjectListing storedDcmDatasetFilesList;
try {
storedDcmDatasetFilesList = s3.listObjects(req);
} catch (SdkClientException sce) {
logger.info("Caught an SdkClientException in S3PackageImporter: " + sce.getMessage());
throw new IOException("S3 import: failed to get a listing for " + dcmDatasetKey);
}
List<S3ObjectSummary> storedDcmDatasetFilesSummary = storedDcmDatasetFilesList.getObjectSummaries();
try {
while (storedDcmDatasetFilesList.isTruncated()) {
logger.fine("S3 import: going to next page of list");
storedDcmDatasetFilesList = s3.listNextBatchOfObjects(storedDcmDatasetFilesList);
if (storedDcmDatasetFilesList != null) {
storedDcmDatasetFilesSummary.addAll(storedDcmDatasetFilesList.getObjectSummaries());
}
}
} catch (AmazonClientException ace) {
logger.info("Caught an AmazonClientException in S3PackageImporter: " + ace.getMessage());
throw new IOException("S3 import: failed to page through the DCM bucket listing.");
}
for (S3ObjectSummary item : storedDcmDatasetFilesSummary) {

logger.log(Level.INFO, "S3 Import file copy for {0}", new Object[]{item});
String dcmFileKey = item.getKey();

String copyFileName = dcmFileKey.substring(dcmFileKey.lastIndexOf('/') + 1);

logger.log(Level.INFO, "S3 file copy related attributes. dcmBucketName: {0} | dcmFileKey: {1} | dvBucketName: {2} | copyFilePath: {3} |",
new Object[]{dcmBucketName, dcmFileKey, dvBucketName, dvDatasetKey+"/"+copyFileName});

s3.copyObject(new CopyObjectRequest(dcmBucketName, dcmFileKey, dvBucketName, dvDatasetKey+"/"+copyFileName));

try {
s3.deleteObject(new DeleteObjectRequest(dcmBucketName, dcmFileKey));
} catch (AmazonClientException ase) {
logger.warning("Caught an AmazonClientException deleting s3 object from dcm bucket: " + ase.getMessage());
throw new IOException("Failed to delete object " + item.getKey());
}
}

}

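/**
* Creates the single package DataFile for the imported folder. Assumes copyFromS3
* has already run (it initializes the S3 client); the package-level checksum is
* computed over the files.sha manifest the DCM placed under the dataset's key.
*/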
public DataFile createPackageDataFile(Dataset dataset, String folderName, long totalSize) {
DataFile packageFile = new DataFile(DataFileServiceBean.MIME_TYPE_PACKAGE_FILE);
packageFile.setChecksumType(DataFile.ChecksumType.SHA1);

FileUtil.generateStorageIdentifier(packageFile);


String dvBucketName = System.getProperty("dataverse.files.s3-bucket-name");
String dvDatasetKey = getS3DatasetKey(dataset);
S3Object s3object = s3.getObject(new GetObjectRequest(dvBucketName, dvDatasetKey + "/files.sha"));

InputStream in = s3object.getObjectContent();
String checksumVal = FileUtil.CalculateChecksum(in, packageFile.getChecksumType());

packageFile.setChecksumValue(checksumVal);

packageFile.setFilesize(totalSize);
packageFile.setModificationTime(new Timestamp(new Date().getTime()));
packageFile.setCreateDate(new Timestamp(new Date().getTime()));
packageFile.setPermissionModificationTime(new Timestamp(new Date().getTime()));
packageFile.setOwner(dataset);
dataset.getFiles().add(packageFile);

packageFile.setIngestDone();

// set metadata and add to latest version
FileMetadata fmd = new FileMetadata();
fmd.setLabel(folderName.substring(folderName.lastIndexOf('/') + 1));

fmd.setDataFile(packageFile);
packageFile.getFileMetadatas().add(fmd);
if (dataset.getLatestVersion().getFileMetadatas() == null) dataset.getLatestVersion().setFileMetadatas(new ArrayList<>());

dataset.getLatestVersion().getFileMetadatas().add(fmd);
fmd.setDatasetVersion(dataset.getLatestVersion());

GlobalIdServiceBean idServiceBean = GlobalIdServiceBean.getBean(packageFile.getProtocol(), commandEngine.getContext());
if (packageFile.getIdentifier() == null || packageFile.getIdentifier().isEmpty()) {
String packageIdentifier = dataFileServiceBean.generateDataFileIdentifier(packageFile, idServiceBean);
packageFile.setIdentifier(packageIdentifier);
}

String nonNullDefaultIfKeyNotFound = "";
String protocol = commandEngine.getContext().settings().getValueForKey(SettingsServiceBean.Key.Protocol, nonNullDefaultIfKeyNotFound);
String authority = commandEngine.getContext().settings().getValueForKey(SettingsServiceBean.Key.Authority, nonNullDefaultIfKeyNotFound);

if (packageFile.getProtocol() == null) {
packageFile.setProtocol(protocol);
}
if (packageFile.getAuthority() == null) {
packageFile.setAuthority(authority);
}

if (!packageFile.isIdentifierRegistered()) {
String doiRetString = "";
idServiceBean = GlobalIdServiceBean.getBean(commandEngine.getContext());
try {
doiRetString = idServiceBean.createIdentifier(packageFile);
} catch (Throwable e) {
// don't abort the import on a registration error; success is re-checked below via doiRetString
logger.log(Level.WARNING, "Exception while creating identifier for package file: {0}", e.getMessage());
}

// Check return value to make sure registration succeeded
if (!idServiceBean.registerWhenPublished() && doiRetString.contains(packageFile.getIdentifier())) {
packageFile.setIdentifierRegistered(true);
packageFile.setGlobalIdCreateTime(new Date());
}
}

return packageFile;
}

public String getS3DatasetKey(Dataset dataset) {
return dataset.getAuthority() + "/" + dataset.getIdentifier();
}
}