Amazon Simple Storage Service (S3) resource plugin for iRODS

iRODS S3 Resource Plugin
------------------------

To build the S3 Resource Plugin, you will need to have:

 - the iRODS Development Tools (irods-dev and irods-runtime) installed for your platform
     http://irods.org/download

 - libxml2-dev / libxml2-devel

 - libcurl4-gnutls-dev / curl-devel

 - libs3 installed for your platform
     https://github.com/irods/libs3


To use this resource plugin:

  irods@hostname $ iadmin mkresc compResc compound
  irods@hostname $ iadmin mkresc cacheResc unixfilesystem <hostname>:</full/path/to/Vault>
  irods@hostname $ iadmin mkresc archiveResc s3 <hostname>:/<s3BucketName>/irods/Vault "S3_DEFAULT_HOSTNAME=s3.amazonaws.com;S3_AUTH_FILE=</full/path/to/AWS.keypair>;S3_RETRY_COUNT=<num reconn tries>;S3_WAIT_TIME_SEC=<wait between retries>;S3_PROTO=<HTTP|HTTPS>"
  irods@hostname $ iadmin addchildtoresc compResc cacheResc cache
  irods@hostname $ iadmin addchildtoresc compResc archiveResc archive
  irods@hostname $ iput -R compResc foo.txt
  irods@hostname $ ireg -R archiveResc /<s3BucketName>/full/path/in/bucket /full/logical/path/to/dataObject

The AWS/S3 keypair file should contain two values, one per line (Access Key ID first, then Secret Access Key):

  AKDJFH4KJHFCIOBJ5SLK
  rlgjolivb7293r928vu98n498ur92jfgsdkjfh8e
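
A minimal sketch of creating such a file from the shell. The file location is a hypothetical choice, the credentials are the placeholder values shown above, and restricting the file to its owner is a general precaution rather than something this plugin is known to require:

```shell
# Hypothetical location; point S3_AUTH_FILE at whatever path you choose.
KEYFILE="${TMPDIR:-/tmp}/AWS.keypair"

# Placeholder credentials from the example above, not real keys.
ACCESS_KEY_ID="AKDJFH4KJHFCIOBJ5SLK"
SECRET_ACCESS_KEY="rlgjolivb7293r928vu98n498ur92jfgsdkjfh8e"

# Write the two values, one per line, with owner-only permissions
# (the umask is applied in a subshell so it does not leak out).
( umask 077; printf '%s\n%s\n' "$ACCESS_KEY_ID" "$SECRET_ACCESS_KEY" > "$KEYFILE" )

# Show the line count as a quick sanity check.
wc -l < "$KEYFILE"
```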

You may specify more than one host:port pair as the S3_DEFAULT_HOSTNAME by separating them with commas (,):
ex:  S3_DEFAULT_HOSTNAME=192.168.122.128:443,192.168.122.129:443,192.168.122.130:443
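
If the endpoint list is maintained elsewhere, the comma-separated value can be assembled in the shell. This is only a sketch: the endpoints are the ones from the example above, and the key file path is a hypothetical placeholder.

```shell
# Endpoints taken from the example above.
S3_HOSTS=""
for endpoint in 192.168.122.128:443 192.168.122.129:443 192.168.122.130:443; do
    # Add a comma separator only when the list is already non-empty.
    S3_HOSTS="${S3_HOSTS:+${S3_HOSTS},}${endpoint}"
done

# /etc/irods/AWS.keypair is a hypothetical path, not a plugin default.
CONTEXT="S3_DEFAULT_HOSTNAME=${S3_HOSTS};S3_AUTH_FILE=/etc/irods/AWS.keypair"
echo "$CONTEXT"
```

The resulting string is what would be passed as the last argument of the iadmin mkresc line.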

To control multipart uploads, add the resource variables "S3_MPU_CHUNK" and "S3_MPU_THREADS" to the creation line.
* S3_MPU_CHUNK is the size of each part to be uploaded in parallel (in MB, default is 5MB).  Objects smaller than this will be uploaded with standard PUTs.
* S3_MPU_THREADS is the number of parts to upload in parallel (only under Linux, default is 10).  On non-Linux OSes, this parameter is ignored and multipart uploads are performed sequentially.
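
As a rough illustration of how the two settings interact, the sketch below estimates the number of parts for a given object size. It assumes the object is split into consecutive parts of S3_MPU_CHUNK megabytes with a smaller final part, and that the values are given as plain integers; the chunk size, thread count and object size are made-up numbers.

```shell
CHUNK_MB=64     # hypothetical S3_MPU_CHUNK value, in MB
THREADS=8       # hypothetical S3_MPU_THREADS value
FILE_MB=1000    # hypothetical object size, in MB

if [ "$FILE_MB" -lt "$CHUNK_MB" ]; then
    # Smaller than one chunk: uploaded with a standard PUT.
    PARTS=1
else
    # Ceiling division: the last part may be smaller than CHUNK_MB.
    PARTS=$(( (FILE_MB + CHUNK_MB - 1) / CHUNK_MB ))
fi
echo "parts: ${PARTS}"

# Fragment to append to the context string on the mkresc line.
MPU_CONTEXT="S3_MPU_CHUNK=${CHUNK_MB};S3_MPU_THREADS=${THREADS}"
echo "$MPU_CONTEXT"
```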

To control whether the names of the files within the object storage service (S3, or similar) are kept in sync with the logical names in the iRODS Catalog, use the ARCHIVE_NAMING_POLICY parameter.
The default value of 'consistent' will keep the names consistent.  Setting "ARCHIVE_NAMING_POLICY=decoupled" will not keep the names of the objects in sync.
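
For example, a context string that opts out of name synchronization might be built as follows. The hostname, bucket name and key file path are hypothetical, and the command is printed rather than executed so the sketch can be checked without a running zone; remove the echo to apply it.

```shell
HOST="irods.example.org"              # hypothetical iRODS server hostname
BUCKET="my-bucket"                    # hypothetical S3 bucket name
KEYFILE="/etc/irods/AWS.keypair"      # hypothetical key file path

BASE="S3_DEFAULT_HOSTNAME=s3.amazonaws.com;S3_AUTH_FILE=${KEYFILE}"
# Omit ARCHIVE_NAMING_POLICY entirely to keep the default, 'consistent'.
CONTEXT="${BASE};ARCHIVE_NAMING_POLICY=decoupled"

echo "iadmin mkresc archiveResc s3 ${HOST}:/${BUCKET}/irods/Vault \"${CONTEXT}\""
```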

To ensure end-to-end data integrity, MD5 checksums can be calculated and used for S3 uploads.  Note that this requires twice the disk I/O (because the file must first be read to calculate the MD5 before the S3 upload can start) and a corresponding increase in CPU usage.
S3_ENABLE_MD5=[0|1]  (default is 0, off)

S3 server-side encryption can be enabled using the parameter S3_SERVER_ENCRYPT=[0|1] (default is 0, off).  This is not the same as HTTPS: it means that the data will be stored on disk encrypted.  To encrypt during the network transport to S3, please use S3_PROTO=HTTPS (the default).
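
The integrity and encryption options can be combined in a single context string. The sketch below turns on MD5 checksums, server-side encryption and HTTPS transport together; the hostname, bucket name and key file path are hypothetical, and the command is only printed, not run.

```shell
HOST="irods.example.org"              # hypothetical iRODS server hostname
BUCKET="my-bucket"                    # hypothetical S3 bucket name
KEYFILE="/etc/irods/AWS.keypair"      # hypothetical key file path

CONTEXT="S3_DEFAULT_HOSTNAME=s3.amazonaws.com;S3_AUTH_FILE=${KEYFILE}"
CONTEXT="${CONTEXT};S3_ENABLE_MD5=1"       # checksum each upload (extra disk and CPU cost)
CONTEXT="${CONTEXT};S3_SERVER_ENCRYPT=1"   # encrypt data at rest on the S3 side
CONTEXT="${CONTEXT};S3_PROTO=HTTPS"        # encrypt data in transit

echo "iadmin mkresc archiveResc s3 ${HOST}:/${BUCKET}/irods/Vault \"${CONTEXT}\""
```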

Using this plugin with Google Cloud services
--------------------------------------------

This plugin has been certified to work with Google Cloud Storage. This works because Google has implemented the S3 protocol.  There are several differences:

* Google does not seem to support multipart uploads, so this feature must be disabled by adding S3_ENABLE_MPU=0 to the context string.
* The default hostname in the context string should be set to storage.googleapis.com
* The signature version should be set to s3v4
* The values in the key file must be generated according to the instructions at https://cloud.google.com/storage/docs/migrating#keys
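
Putting the first two points together, a Google Cloud Storage context string might look like the sketch below. The hostname, bucket name and key file path are hypothetical. The signature version setting is left out because its parameter name is not given in this README, so it still has to be added separately. The command is printed rather than executed.

```shell
HOST="irods.example.org"              # hypothetical iRODS server hostname
BUCKET="my-gcs-bucket"                # hypothetical bucket name
KEYFILE="/etc/irods/GCS.keypair"      # hypothetical path to the generated keys

# Google endpoint, with multipart uploads disabled.
CONTEXT="S3_DEFAULT_HOSTNAME=storage.googleapis.com;S3_AUTH_FILE=${KEYFILE};S3_ENABLE_MPU=0"

echo "iadmin mkresc archiveResc s3 ${HOST}:/${BUCKET}/irods/Vault \"${CONTEXT}\""
```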