Develop S3 storage driver #3921

Closed
landreev opened this issue Jun 19, 2017 · 13 comments

@landreev
Contributor

This came up during the meeting with the LTS, when the plans for moving Dataverse production to the cloud were discussed. S3 will have to be supported as the main production storage mechanism, since local disk space will no longer be available.

The existing swift driver can be used as a basic model.

#3919, just opened, is not a direct dependency, but for the purposes of this production move both issues will need to be addressed at the same time. We will no longer be able to store dataset-level files, such as cached exports, locally, for the same reason as above: local file system space will no longer be available.

@djbrooke djbrooke added ready and removed ready labels Jun 20, 2017
@djbrooke djbrooke added this to the 4.8 - Large Data Upload Integration milestone Jul 20, 2017
matthew-a-dunlap pushed a commit that referenced this issue Jul 27, 2017
AWS import now points to a version of the SDK that uses its own uniquely named dependencies so as not to cause conflicts.

As we go forward we should look into updating Glassfish and using the normal SDK.
matthew-a-dunlap pushed a commit that referenced this issue Jul 27, 2017
Allows you to upload a file and have it show up in an aws s3 bucket. Code is far far from complete
@matthew-a-dunlap
Contributor

Here are some notes on the state of the S3 code as I leave for vacation:

The code is definitely far from complete, but I was able to port over some sample code from the S3 example and get it working in S3AccessIO.java. You should be able to upload a data file and have it show up in the S3 bucket (the naming is not correct yet, though). Right now the code deletes the bucket before every upload, which obviously needs to change.

Note in the pom that the import is for aws-java-sdk-bom. This version takes all the AWS SDK dependencies and renames them so there are no package conflicts. We were having issues with Glassfish's version of Jackson conflicting with the minimum AWS requirements. There may be a version of this dependency that only pulls in the S3 code; I didn't get to check.

I'm not certain what approach we should take with regard to folder structure. AWS has a limit of 100 buckets per account, so we probably only want one bucket for the full Dataverse application (you can apply for an increase, though). There are no true folders in AWS, but if you name things with a folder structure they show up as folders.
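
The "name things with a folder structure" idea can be sketched as follows. This is a minimal illustration, not the code in S3AccessIO.java: the class `S3KeySketch`, the method `objectKey`, and the dataset identifiers are hypothetical names chosen for the example. It assumes one bucket per installation, with each dataset's path used as a key prefix.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class S3KeySketch {

    // S3 keys are flat strings; a "/" inside a key is only displayed as a folder.
    static String objectKey(String datasetPath, String fileName) {
        if (datasetPath == null || fileName == null || fileName.isEmpty()) {
            throw new IllegalArgumentException("datasetPath and fileName are required");
        }
        // Strip leading and trailing slashes so the key never starts with "/"
        // and never contains "//" at the join point.
        String prefix = datasetPath.replaceAll("^/+", "").replaceAll("/+$", "");
        if (prefix.isEmpty()) {
            throw new IllegalArgumentException("datasetPath must not be empty");
        }
        return prefix + "/" + fileName;
    }

    public static void main(String[] args) {
        // Hypothetical dataset identifiers; the same file name is used in both.
        Set<String> keys = new LinkedHashSet<>();
        keys.add(objectKey("10.5072/FK2/ABC123", "data.tab"));
        keys.add(objectKey("10.5072/FK2/XYZ789", "data.tab"));
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

Because the prefix is part of the key, two files with the same name in different datasets end up as two distinct objects in the same bucket.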

@pdurbin
Member

pdurbin commented Jul 28, 2017

@ferrys and I just discussed the code as of 39375c8. We found ~/.aws/credentials, which is what @matthew-a-dunlap must have been talking about when he told me that AWS has a place it expects to find credentials. We might consider moving this to /usr/local/glassfish4/glassfish/domains/domain1/config/s3.properties to match swift.properties as documented at http://guides.dataverse.org/en/4.7.1/installation/config.html#file-storage-local-filesystem-vs-swift . She seems to agree that 1 bucket per installation of Dataverse makes sense in an S3 world but that we should continue to use 1 container per dataset in the Swift code. Folder structure is out of scope for this issue, but we did discuss #2249 a bit and would like to use @leeper's folder hierarchy example from https://osf.io/xfj5h/ via https://projects.iq.harvard.edu/dcm2017/agenda as a use case in the future. Here's how it looks:

[Screenshot: folder hierarchy example]

@ferrys is pretty sure that for Swift we should continue to store files in a single container rather than using Swift's folder structure features, which are complicated. I'm just concerned about someday having duplicate file names that are in different directories. We don't know a lot about how S3 works, but she said Swift will also show files in folders if you name them with slashes.

@ferrys ferrys self-assigned this Jul 28, 2017
@ferrys
Contributor

ferrys commented Jul 28, 2017

I looked into both Swift and S3 in terms of folder structure, and it seems you can store multiple files with the same name in the same bucket/container if they are in different folders. So we shouldn't have a problem there.

@pdurbin I think we could definitely research altering the implementation of Swift as well, but I don't know much about how the Swift API deals with folders within containers, so I would say it is also out of scope.

@ferrys
Contributor

ferrys commented Jul 31, 2017

To configure the credentials for AWS, you need your AWS Access Key ID and your AWS Secret Access Key. Once you have them both, run `pip install awscli` (or, if you have the Anaconda version of Python, `pip install -i https://pypi.anaconda.org/pypi/simple awscli`), then run `aws configure` and enter your credentials.
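
For reference, `aws configure` stores the two keys in `~/.aws/credentials` in an INI-style layout under a `[default]` profile. The sketch below shows that layout and one way the values could be read back. It is only an illustration: the class `CredentialsFileSketch`, its `readProfile` helper, and the key values are hypothetical placeholders. The AWS SDK for Java reads this file itself through its default credential provider chain, so application code should not normally need to parse it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CredentialsFileSketch {

    // Returns the key/value pairs found under [profile] in INI-style text.
    static Map<String, String> readProfile(String iniText, String profile) {
        Map<String, String> values = new LinkedHashMap<>();
        boolean inProfile = false;
        for (String rawLine : iniText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                // A section header switches the current profile.
                inProfile = line.substring(1, line.length() - 1).trim().equals(profile);
                continue;
            }
            int eq = line.indexOf('=');
            if (inProfile && eq > 0) {
                values.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Placeholder values; real keys must never be committed or logged.
        String sample = String.join("\n",
                "[default]",
                "aws_access_key_id = EXAMPLE_KEY_ID",
                "aws_secret_access_key = EXAMPLE_SECRET");
        System.out.println(readProfile(sample, "default").keySet());
    }
}
```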

ferrys added a commit that referenced this issue Jul 31, 2017
oscardssmith added a commit that referenced this issue Jul 31, 2017
ferrys added a commit that referenced this issue Jul 31, 2017
@kcondon
Contributor

kcondon commented Aug 11, 2017

- The dataset directory is still created locally when S3 is configured, though it is empty. Probably a holdover from putting export files in both places.
- Some extra logging statements might need trimming.

Otherwise ready to go.

@kcondon
Contributor

kcondon commented Aug 15, 2017

Issues:
- Cannot delete an S3 file when configured for local file storage; deletion works when configured for S3.
- Downloading many files as a zip fails because 1000 file IDs are put on the URL.

ferrys added a commit that referenced this issue Aug 16, 2017
rbhatta99 added a commit that referenced this issue Aug 18, 2017
rbhatta99 added a commit that referenced this issue Aug 18, 2017
rbhatta99 added a commit that referenced this issue Aug 18, 2017
rbhatta99 added a commit that referenced this issue Aug 22, 2017
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 7, 2018
…ter.

When AWS S3 storage support was introduced back in 2017 in IQSS#3921, the team experienced
problems with the bundled Jackson library of Glassfish (2.3 instead of 2.6 minimum).
By switching to the complete bundle, the bundled Jackson library was used and problems
 were avoided.

This led to a bigger WAR than necessary (~20 MB) and made a workaround necessary to
remove some AWS-specific `javamail.providers` to avoid email problems via WAR file manipulation.

This commit:
* removes the WAR file hacking
* makes use of the S3 SDK part only, reducing the WAR size
* enables proper <dependencyManagement> for the sake of avoiding dependency convergence problems.

People unaware of direct and transitive dependencies and how to manage them are kindly requested to
have a look at the Maven docs and tutorials:
* https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
* https://www.davidjhay.com/maven-dependency-management
* https://maven.apache.org/enforcer/maven-enforcer-plugin/index.html
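
One way the "S3 SDK part only" approach from the commit message above could look in a `pom.xml` is sketched below. It is an assumption about the general shape, not a copy of the actual change: it imports AWS's `aws-java-sdk-bom` in `<dependencyManagement>` so that all AWS artifacts share one version, then declares only `aws-java-sdk-s3`. The `aws.version` property is a placeholder that would have to be defined in the project's `<properties>`.

```xml
<dependencyManagement>
  <dependencies>
    <!-- Bill of materials: pins the versions of all AWS SDK modules. -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk-bom</artifactId>
      <version>${aws.version}</version> <!-- placeholder property -->
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- Only the S3 module; its version comes from the BOM above. -->
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
  </dependency>
</dependencies>
```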
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 14, 2018
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 14, 2018
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 15, 2018
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 21, 2018
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Nov 21, 2018