Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 URL Endpoint #4690

Closed
sparrowjack63 opened this issue May 17, 2018 · 43 comments
Closed

S3 URL Endpoint #4690

sparrowjack63 opened this issue May 17, 2018 · 43 comments
Assignees
Labels
Type: Feature a feature request

Comments

@sparrowjack63
Copy link

Hi all,

Dataverse seems to use awscli to put data on S3 storage, but unfortunately, there is no way to put a specific S3 endpoint URL in the config file of awscli.
The only and normal way to use an other S3 endpoint URL with awscl is to put it as a parameter on every awscli command like this : awscl s3 --endpoint-url=https://mys3compatiblestorage.org ...

Could you please add the opportunity to add the parameter "--endpoint-url=URL" to every awscli command used by dataverse ?

It will be very usefull for every users who have S3 compatible storage (like ceph or scality) and want to use it with dataverse.

Thanks!

@pdurbin
Copy link
Member

pdurbin commented May 17, 2018

@sparrowjack63 thanks for this suggestion. @mbamouni was asking related questions at https://groups.google.com/d/msg/dataverse-community/py0UMJV9lDg/A7Lk0YtsCQAJ and this gives me a better idea of a possible fix. Out of curiosity, would you be interested in making a pull request? 😄

@sparrowjack63
Copy link
Author

Thanks,
Unfortunately, I'm not a developper, it would take me too much time to learn how to make this change :(

@sparrowjack63
Copy link
Author

In addition, I looked at the google group, and know @mbamouni ;), it's the same issue, so we could test it together with michel.

@pdurbin
Copy link
Member

pdurbin commented May 18, 2018

@sparrowjack63 ah, good to know that you and @mbamouni work together.

No worries about the pull request. It looks like the only time "awscli" appears in the git repo is in the docs (rather than the code) at https://github.com/IQSS/dataverse/blob/v4.8.6/doc/sphinx-guides/source/installation/config.rst#set-up-access-configuration-via-command-line-tools . But above that section it is described how to manually set up: https://github.com/IQSS/dataverse/blob/v4.8.6/doc/sphinx-guides/source/installation/config.rst#set-up-access-configuration-via-command-line-tools

I'm confused, though. When setting up manually it seems like one would need the ability to add "endpoint-url" to the "~/.aws/config" file. @sparrowjack63 can you please read through aws/aws-cli#1270 and see if there's anything in there that explains what to do?

@sparrowjack63
Copy link
Author

sparrowjack63 commented May 18, 2018

I have read aws/aws-cli#1270 but unfortunately, there is still no possibility to add s3 endpoint in awscli config file, it's still a pending feature request (for a long time...)

@pdurbin
Copy link
Member

pdurbin commented May 18, 2018

@sparrowjack63
Copy link
Author

sparrowjack63 commented May 18, 2018

As a last resort we will try swift, but it would be regrettable to close the door to all s3 compatible storage for dataverse just because of a lack on this endpoint parameter.
To help us to check the possibility of integrating the use of s3 endpoint and making a pull request, could you please indicate us in the code where the aws connection is created ?

Thanks,

@pdurbin
Copy link
Member

pdurbin commented May 18, 2018

@sparrowjack63 sure, I believe all the S3 code is here: https://github.com/IQSS/dataverse/blob/v4.8.6/src/main/java/edu/harvard/iq/dataverse/dataaccess/S3AccessIO.java

If that doesn't help, please let me know what else would be helpful. Please keep asking questions. Thanks!

@ariesyous
Copy link

I'd also appreciate the support for alternative endpoints. I'm trying to specify a config file with endpoints at 3 different URL's (ec2, elb and rds have their own unique endpoitns).

No way to do it short of passing the --endpoint-url flag for now.

@pdurbin
Copy link
Member

pdurbin commented Jun 3, 2018

@aries-strato thanks for commenting. I'm new to AWS but since you mentioned RDS I thought I'd say that it's already in use by Harvard Dataverse. #4709 is about an upgrade to Postgres 9.6 in general but for anyone on RDS, they'll need to upgrade as well. Come to think of it, Harvard Dataverse is using ELB as well. And the instances run on EC2. Can you perhaps give an example of how the config would ideally look from your perspective?

@ariesyous
Copy link

Simply adding the field in the .config file would be enough, along with specifying a --no-verify url option as well

[default]
region = cloud1
ec2_endpoint = https://cloud.net/api/ec2
elb_endpoint = https://cloud.net/api/elb
rds_endpoint = https://cloud.net/api/rds
verify_ssl = false

@pdurbin
Copy link
Member

pdurbin commented Jul 13, 2018

@aries-strato thanks. Any news from your end? I'm wondering what the status of this issue is. Are you @sparrowjack63 interested in making a pull request?

@poikilotherm
Copy link
Contributor

poikilotherm commented Sep 17, 2018

Hey @pdurbin,

this really is a show stopper for our internal test deployment... Our centrally offered Swift Object Store is not GA yet and building a SAIO instance just for testing seems like overkill. (And using NFS for shared storage is the root of all evil... Let's try to avoid that in the age of 2018, right?)

Minio is a very lightweight alternative and should be sufficient for testing purposes (maybe even beyond that?) or some Ceph RADOS S3 gateway...

I just found https://github.com/minio/cookbook/blob/master/docs/aws-sdk-for-java-with-minio.md - it doesn't seem to be that hard to get the custom URL support into the Dataverse S3 storage driver class.

Wondering if you would like to get a PR for this? I would suggest using a separate storage driver class to make it more distinguishable and less error prone for existing installations... (even so it introduces some code duplication.)

(After having our testbed up and running, we can get to Payara testing... 😉 )

Cheers,
Oliver

@pdurbin
Copy link
Member

pdurbin commented Sep 17, 2018

@poikilotherm yes! We're interested in a pull request. I just invited you to join https://github.com/orgs/IQSS/teams/dataverse-readonly so I can assign this issue to you and dragged it to the new "Community Dev" column on our kanban board: https://waffle.io/IQSS/dataverse

We're happy to answer questions here, on https://groups.google.com/forum/#!forum/dataverse-dev or http://chat.dataverse.org (#dataverse on freenode) or during community calls ( https://dataverse.org/community-calls ). Please just get in touch if there's any walk through you need of the code or docs. Thanks!

I'd feel remiss if I didn't say that we used NFS in production for about a decade and it worked fine. Only within the last year did we (Harvard Dataverse) switch to S3. 😄

@poikilotherm
Copy link
Contributor

poikilotherm commented Sep 17, 2018

Thx! Joined 😄

What approach would you guys prefer? Separate driver with custom URL prefix or an extension of the existing driver? Or should I just come up with something?

@pdurbin
Copy link
Member

pdurbin commented Sep 17, 2018

@poikilotherm thanks. I'll mention this issue at standup in about an hour but I guess I'm thinking a separate driver would be a good place to start. So rather than S3AccessIO.java it would be called MinioAccessIO.java maybe? I hope I'm understanding what you're building. I think it's a Minio driver.

@matthew-a-dunlap
Copy link
Contributor

matthew-a-dunlap commented Sep 17, 2018

I am fairly sure supporting custom endpoints would be adding one line to the S3AccessIO.java where the client builder is initialized, to take in a custom url. That and a new config variable.

@matthew-a-dunlap
Copy link
Contributor

@poikilotherm Feel free to ping me here if you need more info!

@poikilotherm
Copy link
Contributor

@matthew-a-dunlap sure, will do so. Maybe you can give #5068 a look, too? This might be a dependency for this...

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Sep 21, 2018
…s using a custom EndpointConfiguration. Will default to previous standard configuration.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Sep 24, 2018
…s using a custom EndpointConfiguration. Will default to previous standard configuration.
@poikilotherm
Copy link
Contributor

Hey @matthew-a-dunlap and @pdurbin,

I think I got this ready for review.

Beware of the monsters in the shadows:

  1. I could only test this with Minio, not with AWS, as I do not have an AWS account.
  2. As I wrote the unit test in JUnit5, this is blocked by Upgrade JUnit #2940. That's why the branch is based on that feature branch, too.
  3. As there is no such things as Arquillian in place right now, there is NO integration test. IMHO this is not really possible right now, as it would need a changed docker container with Minio or something else inside. Arquillian to the rescue... 😉
  4. As I am great JetBrains IDEA fan, I smuggled two commits into this feature branch that adds some stuff to gitignore and fixed the Maven model. Hope you guys don't mind. If you do, I can split and open another PR.

@poikilotherm
Copy link
Contributor

@sparrowjack63 and @mbamouni, could you test my feature branch with your S3 services and report back? Thank you!!!

@poikilotherm
Copy link
Contributor

Hi @djbrooke, this is true - it's not really blocked, only connected.
Sry for the noise, will update the description.

@djbrooke
Copy link
Contributor

djbrooke commented Oct 2, 2018

Thanks @poikilotherm - I'll move it over to QA.

@djbrooke djbrooke removed their assignment Oct 2, 2018
@kcondon kcondon self-assigned this Oct 2, 2018
@poikilotherm
Copy link
Contributor

Hey @kcondon,
is there any chance this could be merged anytime soon?
Thx!

@kcondon
Copy link
Contributor

kcondon commented Oct 9, 2018

@poikilotherm Sorry, I have been a bit backlogged. I'm working on this now.

poikilotherm pushed a commit to poikilotherm/dataverse that referenced this issue Oct 9, 2018
Made some edits to structure and content to match the rest of the page.
poikilotherm pushed a commit to poikilotherm/dataverse that referenced this issue Oct 9, 2018
Changed first paragraph, fixed list formatting.
@kcondon
Copy link
Contributor

kcondon commented Oct 9, 2018

@poikilotherm Hi, I've regression tested this against s3 and that part is working. I've tried setting up against the public minio url but was not able to get it working. I know it must be something simple I'm missing but what are the minimum set of options that a minio config needs? Also, should any existing s3 config options be disabled? Thanks.

pdurbin added a commit to poikilotherm/dataverse that referenced this issue Oct 9, 2018
@kcondon
Copy link
Contributor

kcondon commented Oct 9, 2018

@poikilotherm Got it working -typo on id in guide and I did not create the bucket first. Phil updated the doc for those small changes.

@kcondon kcondon closed this as completed in 13a9980 Oct 9, 2018
kcondon added a commit that referenced this issue Oct 9, 2018
@kcondon kcondon removed the Status: QA label Oct 9, 2018
@poikilotherm
Copy link
Contributor

Thanks for fixing the docs @pdurbin and thanks for merging @kcondon!

Glad this is in upstream now 🎉

@pdurbin
Copy link
Member

pdurbin commented Oct 10, 2018

@poikilotherm sure. @kcondon and I were talking about improvements we could make to the docs in the future but we didn't want to hold up merging just for docs. Thank you for your contribution!

@poikilotherm
Copy link
Contributor

@pdurbin I just saw this issue is not assigned to a milestone... Should this be added to a 4.9.5 or 4.10 release?

@pdurbin
Copy link
Member

pdurbin commented Oct 10, 2018

@poikilotherm if I knew for sure we were planning to release 4.9.5 I would add it there but I don't what the next release will be or when. A number of 4.x releases are mentioned at https://dataverse.org/goals-roadmap-and-releases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Feature a feature request
Projects
None yet
Development

No branches or pull requests

8 participants